最新最全自然语言处理之情感分析常用数据集-程序员宅基地

技术标签: 计算机视觉  自然语言处理  

本文介绍情感分析领域最常见的一些数据集。

目录

  • TOC
    {:toc}

SemEval

SemEval-2014 Task 4: Aspect Based Sentiment Analysis

任务的介绍主要参考了SemEval-2014 Task 4: Aspect Based Sentiment Analysis,官方网站为SemEval-2014 Task 4

这是SemEval-2014语义评测任务的第4个任务,它又包含4个子任务。

子任务1:Aspect term extraction

给定针对某个entity(比如餐馆)的一些句子,识别其中的aspect term。比如句子"The food was nothing much, but I loved the staff",我们需要识别"food"和"staff"这两个aspect term。一个句子里可能会出现多个(或者零个)aspect term。另外aspect term可能包含多个词,比如"The hard disk is very noisy",这里的aspect term是"hard disk"。

子任务2:Aspect term的极性分类

给定一个句子和这个句子里的所有aspect term,判定每一个term的情感极性。可能的极性包括正面(positive)、负面(negative)、中性(neutral)和冲突(conflict)。比如:

“I loved their fajitas” → {fajitas: positive}
“I hated their fajitas, but their salads were great” → {fajitas: negative, salads: positive}
“The fajitas are their first plate” → {fajitas: neutral}
“The fajitas were great to taste, but not to see” → {fajitas: conflict}

冲突的意思是在这个句子里既有正面的评价也有负面的评价,比如上面的第四个句子。

子任务3:Aspect类别(category)识别

因为很多不同的aspect term都可以归为一类,比如fajitas和salads都是餐馆的菜品,我们希望把它们都归类到food。这个任务定义了几个类别,比如餐馆(restaurant)的数据集上定义里food, service, price, ambience, anecdotes/miscellaneous等5个类别。这个任务为:给定一个句子,识别出其中的类别(注意一个句子可能包含多个类别)。比如:

“The restaurant was too expensive” → { price}
“The restaurant was expensive, but the menu was great” → { price, food}

有的读者可能回想,如果能识别aspect term,然后再判断aspect term是哪个category。这可能有一个问题,对于隐式的aspect,可能只有形容词而没有名词,比如第一个句子没有price这样的aspect term,我们需要根据形容词expensive来推测类别为price。

子任务4:Aspect类别的情感分类

给定一个句子以及句子里的一个或者多个aspect类别,输出每个类别的情感分类。和前面的term分类一样,这里的分类也是正面(positive)、负面(negative)、中性(neutral)和冲突(conflict)。比如:

“The restaurant was too expensive” → {price: negative}
“The restaurant was expensive, but the menu was great” → {price: negative, food: positive}

对于上面的第一个例子,输入是句子和negative与food两个类别,输出是这两个类别的极性。

示例数据

全部数据可以在这里下载,它包括餐馆和笔记本电脑两个数据集,其中餐馆数据集包含上面的4个子任务的标注,而笔记本电脑的数据只有前两个任务的标注数据(没有类别的标注)。

下面是餐馆的一个示例数据:

<sentence id="813">
          <text>All the appetizers and salads were fabulous, the steak was mouth watering and the pasta was delicious!!!</text>
          <aspectTerms>
                    <aspectTerm term="appetizers" polarity="positive" from="8" to="18"/>
                    <aspectTerm term="salads" polarity="positive" from="23" to="29"/>
                    <aspectTerm term="steak" polarity="positive" from="49" to="54"/>
                    <aspectTerm term="pasta" polarity="positive" from="82" to="87"/>
          </aspectTerms>
          <aspectCategories>
                    <aspectCategory category="food" polarity="positive"/>
          </aspectCategories>
</sentence>

下面是笔记本电脑的示例:

<sentence id="353">
          <text>From the build quality to the performance, everything about it has been sub-par from what I would have expected from Apple.</text>
          <aspectTerms>
                    <aspectTerm term="build quality" polarity="negative" from="9" to="22"/>
                    <aspectTerm term="performance" polarity="negative" from="30" to="41"/>
          </aspectTerms>
</sentence>

SemEval-2015 Task 12: Aspect Based Sentiment Analysis

任务的介绍主要参考了SemEval-2015 Task 12: Aspect Based Sentiment Analysis,官方网站为SemEval-2015 Task 12

这是SemEval-2014任务的任务假定评论的都是给定实体(餐馆或者笔记本电脑)的某个属性,但是我们也可能点评这个实体的部件,比如笔记本电脑的鼠标。前面介绍过,aspect更加通用的表示方法是一棵树。不过这里的任务还是简化里一些,认为这棵树最多两层,树根是餐馆或者笔记本电脑,我们可以点评电脑的属性(比如价格),也可以点评部件鼠标的属性(比如鼠标的灵敏度)。此外,有一些点评aspect的句子并不见得会出现对应的名词,比如下面的文字:

They sent it back with a huge crack in it and it still didn't work; and that was the fourth time I’ve sent it to them to get fixed

它点评的实体是餐馆的服务(service),属性是服务的质量(quality),但是文字中没有任何service或者quality相关的文字。这和前面的expensive的句子类似的。因此2015年的任务预定义里所有的Entity和属性,然后让我们识别文本中出现里哪些实体和属性的组合,也就是E#A。比如上面的句子,输出就是service#quality。另外这个任务的输入不是一个一个的句子,而是整段评论,这样我们可以利用上下文信息。当然标注和识别的粒度还是句子,只不过我们的算法可以(但大部分算法都没有)利用上下文的信息。

任务1:In-domain任务

给定一个完整的评论,我们需要完成如下3个子任务。

Aspect类别识别

识别评论里所有的实体(E)和属性(A)对。E和A都是预定义集合中的某一个值,比如餐馆数据集,E包含laptop, keyboard, operating system, restaurant, food, drinks等实体和performance, design, price, quality等属性。

更具体的,对于笔记本电脑数据集来说,E共用22个实体类别(比如LAPTOP, DISPLAY, CPU, MOTHERBOARD, HARD DISC, MEMORY, BATTERY等)和9个属性标签(比如GENERAL, PRICE, QUALITY, OPERATION_PERFORMANCE等)。完整的实体列表和属性标签列表可以参考这里,下面是一些示例:

(1)  It fires up in the morning in less than 30 seconds and I have never had any issues with it freezing. → {LAPTOP#OPERATION_PERFORMANCE}
(2)  Sometimes you will be moving your finger and the pointer will not even move.  → {MOUSE#OPERATION_PERFORMANCE}
(3)  The backlit keys are wonderful when you are working in the dark.  → {KEYBOARD#DESIGN_FEATURES}
(4)  I dislike the quality and the placement of the speakers. {MULTIMEDIA DEVICES#QUALITY}, {MULTIMEDIA DEVICES#DESIGN_FEATURES}
(5)  The applications are also very easy to find and maneuver.  → {SOFTWARE#USABILITY}
(6)  I took it to the shop and they said it would cost too much to repair it.  → {SUPPORT#PRICE}
(7)  It is extremely portable and easily connects to WIFI at the library and elsewhere. → {LAPTOP#PORTABILITY}, {LAPTOP#CONNECTIVITY}

比如第一个句子是说笔记本的操作响应很快,而第二个是说鼠标的操作很不灵敏。

对于餐馆数据集来说,E有6个实体类别(RESTAURANT, FOOD, DRINKS, SERVICE, AMBIENCE, LOCATION)和5个属性标签(GENERAL, PRICES, QUALITY, STYLE_OPTIONS, MISCELLANEOUS),详细信息可以参考这里,下面是一些示例:

(1) Great for a romantic evening, but over-priced. → {AMBIENCE#GENERAL}, {RESTAURANT#PRICES}
(2) The fajitas were delicious, but expensive. → {FOOD#QUALITY}, {FOOD# PRICES}
(3)The exotic food is beautifully presented and is a delight in delicious combinations. → {FOOD#STYLE_OPTIONS}, {FOOD#QUALITY}
(4) The atmosphere isn't the greatest , but I suppose that's how they keep the prices down. → {AMBIENCE#GENERAL}, {RESTAURANT# PRICES}
(5) The staff is incredibly helpful and attentive. → {SERVICE# GENERAL}
Opinion Target Expression(OTE)识别

这个任务只有餐馆数据集上有标注数据。OTE任务的输入是所有的E#A对,需要识别E#A对里实体E对应的字符串。当隐式的表达实体时用特殊的"NULL"表示,比如代词"它"这样的代词,有的文本甚至根本找不到和E相关的字符串。下面是一些例子:

(1) Great for a romantic evening, but over-priced. → {AMBIENCE#GENERAL, “NULL”}, {RESTAURANT# PRICES, “NULL”}
(2) The fajitas were delicious, but expensive. → {FOOD#QUALITY, “fajitas”}, {FOOD# PRICES, “fajitas”}
(3) The exotic food is beautifully presented and is a delight in delicious combinations. → {FOOD#STYLE_OPTIONS, “exotic food”}, {FOOD# QUALITY, “exotic food”}
(4) The atmosphere isn't the greatest , but I suppose that's how they keep the prices down. → {AMBIENCE#GENERAL, “atmosphere”}, {RESTAURANT# PRICES, “NULL”}
(5) The staff is incredibly helpful and attentive. → {SERVICE# GENERAL, “staff”}

比如在第4个句子里,they指代的是餐馆,但是它不是OTE。

情感分类

给定一个句子(有上下文)和所有的E#A对,判断其情感分类,可能的分类为正面、负面和中性。这和2014年的任务有所不同。下面是一些示例:

(1) The applications are also very easy to find and maneuver. → {SOFTWARE#USABILITY,  positive}
(2) The fajitas were great to taste, but not to see”→ {FOOD#QUALITY, “fajitas”, positive},  {FOOD#STYLE_OPTIONS, “fajitas”, negative }
(3) We were planning to get dessert, but the waitress basically through the bill at us before we had a chance to order.  → {SERVICE# GENERAL, “waitress”, negative}
(4) It does run a little warm but that is a negligible concern.  → {LAPTOP#QUALITY neutral}
(5) The fajitas are nothing out of the ordinary” → {FOOD#GENERAL, “fajitas”,  neutral}
(6) I bought this laptop yesterday. → {}
(7) The fajitas are their first plate  → {}
任务2:Out-of-domain任务

增加里一个酒店的测试数据集(没有训练数据),然后考察模型则不同领域的泛化能力。它的输入是E#A对和句子,要求我们输出这个E#A对的情感极性。

下面是一个完整的评论的标注数据:

Review id:"1004293"

 Judging from previous posts this used to be a good place, but not any longer.
 {target:"NULL" category:"RESTAURANT#GENERAL" polarity:"negative" from:"-" to="-"}

 We, there were four of us, arrived at noon - the place was empty - and the staff acted 
 like we were imposing on them and they were very rude.
 {target:"staff" category:"SERVICE#GENERAL" polarity:"negative" from:"75" to:"80"}

 They never brought us complimentary noodles, ignored repeated requests for sugar, 
 and threw our dishes on the table.
 {target:"NULL" category:"SERVICE#GENERAL" polarity:"negative" from:"-" to:"-"}

 The food was lousy - too sweet or too salty and the portions tiny.
 {target:"food" category="FOOD#QUALITY" polarity="negative" from:"4" to:"8"}
 {target:"portions" category:"FOOD#STYLE_OPTIONS" polarity:"negative" from:"52" to:"60"}

 After all that, they complained to me about the small tip.
 {target:"NULL" category:"SERVICE#GENERAL" polarity:"negative" from:"-" to:"-"}
			
 Avoid this place!
 {target:"place" category:"RESTAURANT#GENERAL" polarity:"negative" from:"11" to:"16"}

其中from和to表示OTE字符串在句子开始和结束的下标。

SemEval-2016 Task 5: Aspect Based Sentiment Analysis

任务的介绍主要参考了SemEval-2016 Task 5: Aspect Based Sentiment Analysis,官方网站为SemEval-2016 Task 5

2016年的任务延续里2015年的任务,为它增加了新的测试数据(15年的训练数据和测试数据都变成16年的训练数据),此外它还首次加入了英语之外的多种语言,包括中文。它包括如下几个子任务:

句子级别的ABSA(Aspect-Based Sentiment Analysis)

给定某个实体(笔记本电脑、餐馆或者酒店)的一篇评论的一个句子,需要确定所有观点三元组的如下内容(slot):

Aspect Category Detection

这个任务是确定文本里所有出现的E#A对,其中E来自预定义的实体类列表,A来自预定义的属性标签列表。

Opinion Target Expression (OTE)

和上年的任务一样,需要确定每个E#A对里实体对应的字符串的开始和结束下标,如果找不到则输出"NULL"。只有餐馆的数据有这个子任务。

情感极性

判断每一个E#A对的情感分类,类别包括正面、负面和中性。

上面的任务在人工标注时每次处理一个句子,但是会参考它的前后上下文的其它句子。下面是一个笔记本的评论的标注示例:

S1:The So called laptop Runs to Slow and I hate it! →
{LAPTOP#OPERATION_PERFORMANCE, negative}, {LAPTOP#GENERAL, negative}
S2:Do not buy it! → {LAPTOP#GENERAL, negative}
S3:It is the worst laptop ever. → {LAPTOP#GENERAL, negative}

下面是餐馆的数据:

S1:I was very disappointed with this restaurant. →
{RESTAURANT#GENERAL, “restaurant”, negative, from="34" to="44"}
S2:I’ve asked a cart attendant for a lotus leaf wrapped rice and she replied back rice and just walked away. →{SERVICE#GENERAL, “cart attendant”, negative, from="12" to="26"}
S3:I had to ask her three times before she finally came back with the dish I’ve requested. →
{SERVICE#GENERAL, “NULL”, negative}
S4:Food was okay, nothing great. →
{FOOD#QUALITY, “Food”, neutral, from="0" to="4"}
S5:Chow fun was dry; pork shu mai was more than usually greasy and had to share a table with loud and rude family. →
{FOOD#QUALITY, “Chow fun”, negative, from="0" to="8"},
{FOOD#QUALITY, “pork shu mai”, negative, from="18" to="30"},
{AMBIENCE#GENERAL, “NULL”, negative}
S6:I/we will never go back to this place again. →
{RESTAURANT#GENERAL, “place”, negative, from="32" to="37"}
文本级别的ABSA

上面的句子级别的问题是模型不能参考上下文(人工标注是参考了的),因此还有一个文本级别的ABSA任务。它的任务和前面是一样的,只不过输入是整个评论文本,下面是一些示例。

下面是整个评论文本:

Review id:LPT1 (Laptop)
"The So called laptop Runs to Slow and I hate it! Do not buy it! It is the worst laptop ever."

期望的输出(标注)为:

{LAPTOP#OPERATION_PERFORMANCE, negative}
{LAPTOP#GENERAL, negative}

但是它并不能简单的把文本分成句子,然后把所有句子的结果合并起来,因为一个段落里可能有多个句子都在说同一个E#A对,如果是这样的话需要判断最主要的情感倾向,比如下面的例子:

Review id:RST1 (Restaurant)
"I was very disappointed with this restaurant. I’ve asked a cart attendant for a lotus leaf wrapped rice and she replied back rice and just walked away. I had to ask her three times before she finally came back with the dish I’ve requested. Food was okay, nothing great. Chow fun was dry; pork shu mai was more than usually greasy and had to share a table with loud and rude family. I/we will never go back to this place again."

它的输出为:

{RESTAURANT#GENERAL, negative}
{SERVICE#GENERAL, negative}
{FOOD#QUALITY, negative}
{AMBIENCE#GENERAL, negative}

它就是前面句子级别的同一段文本,关于FOOD#QUALITY有一个中性两个负面的,因此总的情感倾向是负面的。如果多个句子的情感倾向是冲突的,比如一个正面一个负面,则需要识别为冲突(conflict)。比如下面的例子:

Review id: RST2 (Restaurant)
“This little place has a cute interior decor and affordable city prices. The pad seew chicken was delicious, however the pad thai was far too oily. I would just ask for no oil next time.”

它的输出为:

{AMBIENCE#GENERAL, positive}
{RESTAURANT#PRICES, positive}
{FOOD#QUALITY, conflict}
{RESTAURANT#GENERAL, positive}

FOOD#QUALITY既有正面的又有负面的,因此标注为冲突。

Out-of-domain ABSA

这个任务的测试数据的领域没有训练数据,它考察的是模型在不同领域的泛化能力。

IMDB

电影的评论数据,二分类任务,包括25,000个训练数据和25,000个测试数据。可以在这里下载。

版权声明:本文为博主原创文章,遵循 CC 4.0 BY-SA 版权协议,转载请附上原文出处链接和本声明。
本文链接:https://blog.csdn.net/mrjkzhangma/article/details/102678506

智能推荐

java.lang.NoSuchMethodError: org.springframework.core.annotation.AnnotationUtils.isInJavaLangAnnotat_caused by: java.lang.nosuchmethoderror: org.spring-程序员宅基地

文章浏览阅读1.4w次。最近在做spring框架练习时,遇到启动程序报错,摘取部分片段如下:Caused by: java.lang.NoSuchMethodError: org.springframework.core.annotation.AnnotationUtils.isInJavaLangAnnotationPackage(Ljava/lang/annotation/Annotation;)Z at o..._caused by: java.lang.nosuchmethoderror: org.springframework.core.annotation.

大数据开发需要学习哪些编程语言_大数据开发语言-程序员宅基地

文章浏览阅读6.5k次。大数据开发需要学什么编程语言?随着大数据的持续升温,越来越多的人投身于大数据的浪潮之中,不少完全没基础的小伙伴,难免会有这样的疑问,从事大数据需要学习什么编程语言呢?其实这个问题没有固定的答案,像Python、R、Java和Scala都是很好的选择,大家可以根据自身的实际情况进行选择1、Python一般的数据科学家都会选择Python作为大数据语言的首选。一直以来,Python流行于学术界,在自然语言处理(NLP)等领域尤其如此。所以,当有一个需要NLP处理的项目时,就会面临数量多得让人眼花缭乱的选_大数据开发语言

Python读取文件内容-程序员宅基地

文章浏览阅读3.5w次,点赞23次,收藏125次。Python 读取文件内容_python读取文件内容

港科夜闻|香港科大举办网上招生说明会,详细解读招生政策。-程序员宅基地

文章浏览阅读189次。关注并星标每周阅读港科夜闻建立新视野 开启新思维/近日要闻一览/▼1、香港科大举办网上招生说明会,详细解读招生政策。招生说明会邀请了香港科大本科招生办主任和学生大使担任讲者,为有兴趣入读...

【Linux】bash: /home/stefan/jdk1.7.0_55/bin/java: /lib/ld-linux.so.2: bad ELF interpreter_-bash: /home/rootplace/jdk8/bin/java: /lib/ld-linu-程序员宅基地

文章浏览阅读215次。今天在linux系统中配置jdk的时候,配置完成之后提示bash: /home/stefan/jdk1.7.0_55/bin/java: /lib/ld-linux.so.2: bad ELF interpreter: No such file or directory后来感觉可能只因为之前卸载yum导致的问题,因为我在另外一个虚拟机中也是这么配置的于是搜了一下问题,执行如下命令即可yu..._-bash: /home/rootplace/jdk8/bin/java: /lib/ld-linux.so.2: bad elf interpreter: 没有那个文件或目录

高级SQL优化(一) -程序员宅基地

文章浏览阅读319次。目录:Oracle数据完整性和锁机制 索引及优化之表分析 表分析、约束及表间关系 Oracle体系结构1Oracle体系结构2 海量数据库及分区1 海量数据库及分区2 海量数据库及分区3 海量数据库及分区4 高级SQL优化(一) PPT和源码下载: http://sishuok.com/forum/posts/list/6..._高级sql优化

随便推点

Editor_Selection,MenuItem学习02_editor.selection-程序员宅基地

文章浏览阅读305次。//也就选中的物体public class Editor_Selection{/* 1.选中的物体(可操作)2.使用快捷键操作3.选中右击添加组体,要先判断是否有此组件4.MenuItem最二个参数的用法 **/ [MenuItem("EditTools/selection1")] static void SelectionObj() { i_editor.selection

更新版:软件测试菜鸟入门_软件测试在哪更新-程序员宅基地

文章浏览阅读2.5k次,点赞16次,收藏92次。改版前链接:https://weltest.blog.csdn.net/article/details/79847217前言​ 随着技术的发展,各种应用程序、各种App应运而生!在早期,这些应用程序只是通过开发人员、产品以及部分用户使用之后,给出相应的修改意见,感觉都OK后就进行上线,在网上或一些app下载平台上就可以直接使用,没有进行过规范的软件测试!这些软件或多或少会存在一些bug,这些bug有可能是功能上、兼容性、性能等各方面的问题!为了改善软件质量不高的问题,软件测试这门行业才开始受到重视!软件_软件测试在哪更新

SpringBoot保存数据库中文乱码以及输出json乱码_springboot保存中文到数据库乱码-程序员宅基地

文章浏览阅读811次。首先第一个,保存数据库乱码,凡是保存到中文字符的地方都出现乱码,一堆???看不懂的文字,这时候需要检查三个地方,第一检查链接数据库驱动配置是否配置了utf-8url: jdbc:mysql://localhost:3306/sell?characterEncoding=utf-8&useSSL=false第二检查数据库属性,当时创建数据库时是否是utf-8的编码第三若..._springboot保存中文到数据库乱码

Objective-C 类别使用_objective-c 类别shiyong-程序员宅基地

文章浏览阅读387次。Objective-C 类别使用_objective-c 类别shiyong

js 写入图片Exif信息piexifjs-程序员宅基地

文章浏览阅读3.2k次。SamplesInsert Exif into jpegfunction handleFileSelect(evt) { var file = evt.target.files[0]; var zeroth = {}; var exif = {}; var gps = {}; zeroth[piexif.ImageIFD.Ma_piexifjs

设计模式——命令模式实现撤销_软件设计模式命令模式撤销实验的实现-程序员宅基地

文章浏览阅读1k次。总结一下工作中用到的这个设计模式,,看了下大话设计模式里好像也有这个,,以后看到了在完善吧,,现在这个是项目要求实现一个撤销功能,就跟我说用命令模式写就行~听简单的~QAQ然后看了很多关于命令模式的博客,感觉我写的应该差不多吧,,应该没有理解错,直接放代码public class 命令模式 : MonoBehaviour{ CommandManager comdMag = new CommandManager(); public int TestValue = 0; // Start is_软件设计模式命令模式撤销实验的实现

推荐文章

热门文章

相关标签