Yaojie Zhang, Bing Xu, and Tiejun Zhao
Abstract—This paper presents a method for aspect-based sentiment classification tasks, named convolutional multi-head self-attention memory network (CMA-MemNet). It is an improved model based on memory networks, and makes it possible to extract richer and more complex semantic information from sequences and aspects. To address the memory network's inability to capture context-related information beyond the word level, we propose utilizing convolution to capture n-gram grammatical information. We use multi-head self-attention to compensate for the memory network's neglect of the semantic information of the sequence itself. Meanwhile, unlike most recurrent neural network (RNN) models such as long short-term memory (LSTM) and gated recurrent unit (GRU) networks, we retain the parallelism of the network. We experiment on the open datasets of SemEval-2014 Task 4 and SemEval-2016 Task 6. Compared with several popular baseline methods, our model performs excellently.
Aspect-based sentiment analysis (ABSA) [1]–[3] is a fine-grained sentiment analysis task which aims to analyze the sentiment polarity (positive, negative or neutral) expressed towards different aspects of the same text. In many cases, we need to focus not only on the overall sentiment in product reviews, as in ordinary sentiment analysis (SA) tasks, but also on more detailed and in-depth sentiment expressions. The sentiments expressed towards different aspects in a sentence may differ. For example, in the sentence "Good performance, but too little battery power.", there is a positive attitude towards "performance", but a negative attitude towards "battery". This task is important and challenging, and many shared tasks have been conducted in recent years, such as SemEval-2014 Task 4 [3], SemEval-2015 Task 12 [4], and SemEval-2016 Task 5 [5]. ABSA tasks are generally divided into aspect extraction (AE) subtasks [6] and aspect sentiment classification (ASC) subtasks [7]. With the development of a series of related studies, the task definition of ABSA has become more complete. It is divided into three parts [8]: opinion target extraction (OTE), aspect category detection, and sentiment polarity (SP). This paper mainly studies the SP task; that is, given a sentence with some aspects, how to analyze the sentiment polarity of the aspects in the sentence. SP/ASC can be divided into two types: aspect-category sentiment analysis (ACSA) and aspect-term sentiment analysis (ATSA) [9]. The main difference is that ACSA groups the many kinds of targets to be analyzed into several categories and identifies the sentiment polarity of each aspect category in the sentence, whereas the goal of ATSA is to directly identify the sentiment polarity of the analyzed targets, whose categories are not fixed in advance. This paper studies both tasks.
Early research used traditional methods based on rules [10] or statistics [11]. A support vector machine (SVM) with external resources [12] is one of the most successful of these methods, but its performance depends heavily on the construction of handcrafted features. Target-dependent (TD)-LSTM (long short-term memory) and target-connection (TC)-LSTM [13] take the prediction target as the central word and build two LSTMs, one running left to right and the other right to left. Considering that using only an LSTM results in information loss when processing long sequences, attention-based LSTM with aspect embedding (ATAE-LSTM) [7] uses an aspect-related attention mechanism. However, these LSTM-based methods find it difficult to handle sentences whose important features are dispersed. For example, in the sentence "Everything except memory is terrible.", "except" and "terrible" have a positive effect on the word "memory". Reference [14] first applied memory networks to ABSA and achieved good results. The memory network has strong aspect-sequence modeling ability, but it loses context-related information beyond the word level and lacks the modeling of complex semantic expressions. Although multilayer attention can alleviate this defect, it only focuses on the semantic relationship between the aspect and the sequence, and ignores the semantic relationships among the words of the sequence itself. Many subsequent memory-based improvements have been proposed for ABSA tasks [15]–[18], and they have all achieved good results, but most lose network parallelism.
To solve the aforementioned problems, we propose using convolution to integrate the textual features of words and multi-word spans, and using the multi-head self-attention of the transformer [19] encoder instead of a recurrent neural network (RNN) to extract semantic information from the sequence. The output of the encoder is then used as memory. Convolutional multi-head self-attention was first proposed in the hierarchical convolutional attention network (HCAN) [20], a hierarchical feature extraction method for document-level text classification. Finally, we classify the aspect's sentiment polarity with the help of an aspect-oriented memory network. In this way, the model considers the long-term dependence between aspect words and the sequence through aspect attention, context-related information beyond the word level through convolution, and the semantic relations within the sequence itself through self-attention. This is an improved model based on memory networks, and makes it possible to extract more complex and richer semantic information from sequences and aspects. The whole model retains the parallelism of network computing. Each component is differentiable and can be trained end-to-end with gradient descent. We evaluate our approach on four typical datasets: three built from SemEval-2014's laptop and restaurant review data [3], and one from SemEval-2016's tweets data [21]. We apply the datasets to the ACSA and ATSA tasks respectively. The experimental results show that our model performs well on different types of data for both kinds of tasks.
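To make the memory construction concrete, the following is a minimal PyTorch sketch of a convolutional multi-head self-attention encoder in the spirit described above; the module and parameter names are our own, and the exact kernel sizes, head count and normalization used in the paper may differ.

```python
import torch
import torch.nn as nn

class ConvSelfAttnMemory(nn.Module):
    """Sketch: build the memory from word embeddings via convolution
    (n-gram context) followed by multi-head self-attention (word-to-word
    relations). Both operations are fully parallel over the sequence."""
    def __init__(self, emb_dim=300, n_filters=300, window=3, n_heads=6):
        super().__init__()
        # padding = window // 2 keeps the sequence length for odd windows
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=window,
                              padding=window // 2)
        self.self_attn = nn.MultiheadAttention(n_filters, n_heads,
                                               batch_first=True)

    def forward(self, emb):                       # emb: (batch, seq_len, emb_dim)
        h = torch.relu(self.conv(emb.transpose(1, 2)))   # (batch, filters, seq_len)
        h = h.transpose(1, 2)                             # (batch, seq_len, filters)
        memory, _ = self.self_attn(h, h, h)               # self-attention over n-grams
        return memory                                     # used as external memory
```

The resulting memory has one slot per token, but each slot already encodes n-gram context and sequence-level relations, which is what the aspect attention later queries.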
The rest of this paper is organized as follows. Section II introduces our method in detail. Section III presents our experimental results and analysis on open datasets. Section IV summarizes our work and outlines future directions.
In this section, we introduce our method for the ACSA and ATSA tasks. The ACSA task is defined as follows: given a sentence and an aspect category, the model predicts the sentiment polarity (positive, negative or neutral) of the sentence towards the aspect category. The ATSA task is defined as follows: given a sentence and an aspect term (usually one or more words) that appears in the sentence, the model predicts the sentiment polarity of the sentence towards the aspect. The overall structure of the model is shown in Fig. 1.
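For clarity, the two task settings can be illustrated with toy instances; the sentences and field names below are our own and not drawn from the datasets.

```python
# ACSA: the aspect is a predefined category that may not appear in the text.
acsa_instance = {
    "sentence": "The fish was fresh but the waiter ignored us all night.",
    "aspect_category": "service",        # e.g., one of misc/food/service/price/ambience
    "polarity": "negative",
}

# ATSA: the aspect is a term that occurs verbatim in the sentence.
atsa_instance = {
    "sentence": "Good performance, but too little battery power.",
    "aspect_term": "battery power",
    "polarity": "negative",
}
```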
Our experiments use four open datasets, two for aspect-category sentiment analysis (ACSA) tasks and two for aspect-term sentiment analysis (ATSA) tasks. Table I shows the statistics of the datasets, where Res-ACSA, Res-ATSA and Lap-ATSA are customer reviews of restaurants and laptops provided by SemEval-2014 Task 4 (http://alt.qcri.org/semeval2014/task4/) [3], and Tweet-ACSA consists of tweets provided by SemEval-2016 Task 6 (http://alt.qcri.org/semeval2016/task6/) [21].
The Res-ACSA dataset contains customer evaluations of five aspect categories, namely "misc", "food", "service", "price" and "ambience". Res-ATSA uses the same review domain as Res-ACSA, but each sentence contains the customer's evaluation of specific terms. Lap-ATSA contains customers' evaluations of specific terms of laptops. Following some existing work [9], the "conflict" labels are removed from the three SemEval-2014 datasets. Tweet-ACSA contains users' sentiment expressions on five topics: "feminist movement", "hillary clinton", "climate change is a real concern", "legalization of abortion" and "atheism". We divide the sentiment of the four datasets into three categories: "positive", "negative" and "neutral".
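The label handling described above can be summarized by a small preprocessing step; the function and field names here are illustrative rather than taken from our actual code.

```python
POLARITIES = ("positive", "negative", "neutral")

def prepare_labels(examples):
    """Keep only the three polarity classes (dropping 'conflict' where it
    occurs) and map each polarity string to an integer class index."""
    kept = [ex for ex in examples if ex["polarity"] in POLARITIES]
    for ex in kept:
        ex["label"] = POLARITIES.index(ex["polarity"])
    return kept
```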
TABLE I Statistics of the Datasets
In our experiments, we use 300-dimension word embedding vectors pre-trained by GloVe (http://nlp.stanford.edu/projects/glove/) [22], which are trained on web data with a vocabulary size of 1.9 M. Word embedding vectors are not fine-tuned during training. Position embedding vectors are randomly initialized. The number of convolution filters is 300. We set the learning rate to 7 × 10^-5 and the L2 regularization coefficient to 1 × 10^-5. The dropout rate is 0.2. We discuss the window size and the number of hops in detail later. In order to learn semantic information from easy to difficult and to reduce zero padding, we sort the training data by sentence length and let the network learn short sentences before long ones. The batch size is 20 instances and the maximum number of epochs is 40. We randomly sample 20% of the training data as a dev set, save the model parameters that perform best on the dev set, and then evaluate on the test set.
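The hyperparameters above can be gathered into a single configuration; the optimizer shown is an assumption, since no concrete optimizer is specified here.

```python
import torch

CONFIG = {
    "emb_dim": 300,        # GloVe vectors, frozen during training
    "n_filters": 300,
    "lr": 7e-5,
    "l2": 1e-5,            # applied as weight decay
    "dropout": 0.2,
    "batch_size": 20,
    "max_epochs": 40,
    "dev_ratio": 0.2,      # random split taken from the training data
}

def build_optimizer(model, cfg=CONFIG):
    # Adam is an assumed choice; any gradient-descent optimizer fits the setup.
    return torch.optim.Adam(model.parameters(), lr=cfg["lr"],
                            weight_decay=cfg["l2"])

def sort_by_length(train_examples):
    # Curriculum-style ordering: short sentences first, which also reduces padding.
    return sorted(train_examples, key=lambda ex: len(ex["sentence"].split()))
```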
In the experiments, we compare our proposed model with the following models:
1) Feature+SVM: A feature-based SVM shows good performance on aspect sentiment classification. The system uses n-gram, parse and lexicon features [12].
2) LSTM: A standard LSTM [23] encodes a sentence from the first word to the last, and the average of all hidden states is taken as the final representation. For different aspects in the same sentence, the model gives the same sentiment polarity.
3) TD-LSTM: It uses two LSTMs that run from the left and right ends of the sentence towards the target words, respectively [13]. It then takes the hidden states of the LSTMs at the last time step to represent the features for prediction.
4) ATAE-LSTM: An aspect sentiment classification method using attention-based LSTM [7]. The model concatenates the aspect embedding with the embedding of each word, feeds the result to an LSTM, and then passes the hidden states through an attention layer.
5) IAN: The interactive attention network (IAN) [24] runs two LSTMs, one over the aspect embeddings and one over the word embeddings, and uses the average-pooled result of each as the query vector for the other's attention.
6) MemNet: It applies attention multiple times over the word embeddings and feeds the output of the last attention layer to a softmax for prediction [14].
7) GCAE: The gated convolutional network [9] is an efficient CNN-based model. It applies two convolutions with different activation functions to the embeddings and combines their outputs in gated Tanh-ReLU units, as sketched below.
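A minimal sketch of the gated unit used by GCAE follows; the dimensions and names are ours, and the original model may stack such units with several filter widths.

```python
import torch
import torch.nn as nn

class GatedTanhReLUUnit(nn.Module):
    """Sketch of a gated Tanh-ReLU unit in the spirit of GCAE: one convolution
    produces candidate sentiment features (tanh), the other produces an
    aspect-conditioned gate (ReLU); their element-wise product is max-pooled."""
    def __init__(self, emb_dim=300, n_filters=100, window=3, aspect_dim=300):
        super().__init__()
        self.conv_s = nn.Conv1d(emb_dim, n_filters, window)   # sentiment path
        self.conv_g = nn.Conv1d(emb_dim, n_filters, window)   # gate path
        self.aspect_proj = nn.Linear(aspect_dim, n_filters)

    def forward(self, emb, aspect_vec):      # emb: (B, T, E), aspect_vec: (B, A)
        x = emb.transpose(1, 2)              # (B, E, T) for Conv1d
        s = torch.tanh(self.conv_s(x))
        g = torch.relu(self.conv_g(x) + self.aspect_proj(aspect_vec).unsqueeze(-1))
        return (s * g).max(dim=-1).values    # max-over-time pooling -> (B, n_filters)
```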
For comparability with prior models, we evaluate accuracy [9], [14], [24] and the macro-averaged F-score. CMA-MemNet achieves the best performance compared with the baselines on all four datasets. Conv-MemNet uses only convolution, while MA-MemNet uses only multi-head self-attention on the embeddings. The results of the ATSA task are shown in Table II, and those of the ACSA task in Table III.
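The two evaluation measures can be computed directly from the predictions, for example with scikit-learn:

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    """Accuracy and macro-averaged F-score over the three polarity classes."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }
```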
TABLE II Experimental Results for ATSA. Models Marked with "1" Are Provided by [18], "2" by [15], "3" by [9], and "4" by [25]
TABLE III Experimental Results for ACSA (Without TD-LSTM and IAN). The Meaning of the Markup Is the Same as in Table II
As can be seen from Tables II and III, SVM provides a relatively strong machine learning baseline with outstanding performance on ABSA tasks. However, its performance depends strongly on feature engineering and an effective vocabulary, and it falls behind the neural networks when sufficient features are not available. LSTM networks have advantages over most networks in sequence modeling and do not need manually extracted features to generate effective representations. Among the LSTM-based methods, the standard LSTM is the worst, mainly because it ignores aspect information. ATAE-LSTM pays close attention to aspect-related sentiment expressions in the sequence and achieves a significant improvement, especially on the Res-ATSA dataset, where the accuracy improves by 2.95%. IAN is the best LSTM-based method for ATSA tasks, mainly because it exploits the strong sequence modeling ability of LSTM and combines the information of the aspect influencing the sequence and the sequence influencing the aspect. It is 5.64% more accurate than LSTM on the Lap-ATSA dataset.
MemNet is an excellent network for ACSA tasks. It beats all baselines on the Res-ACSA and Tweet-ACSA datasets, and its accuracy on the Res-ATSA dataset is only 0.44% lower than that of IAN. Compared with MemNet, Conv-MemNet additionally collects context information and MA-MemNet collects the semantic relevance of the sequence itself, and both are improved. This proves that this kind of semantic information is effective in improving performance. We can conclude that MemNet has a strong aspect-sequence modeling capability but lacks context information and sequence information, which limits its performance. CMA-MemNet combines this information well while retaining the original information.
As shown in Table IV, we take the Lap-ATSA dataset as an example to illustrate the effect of the convolution window size and the number of memory network hops on the performance of the model. The window size affects the length of the context from which semantic information is extracted by the network. The number of hops is the number of aspect attention layers, which affects the level of abstraction of the semantic information captured by the network. Experimental results show that the impact of the window size and the number of hops on network performance is not monotonic, and the values that yield optimal performance differ across datasets.
TABLE IV Effect of the Convolution Window Size and the Number of Hops on Network Accuracy for Lap-ATSA
We find that the accuracy is highest on the Lap-ATSA dataset when the window size is 3 and the number of hops is 2. As for why more than one hop is needed, [14] explains that multiple hops are necessary to extract deeper semantic information. In the experiments of MemNet, the model works best when the number of hops is 7. Our memory is not built directly on word embeddings; the model has already extracted deep semantic information through convolutional multi-head self-attention, so fewer hops are needed. When the window size is 1, the convolution is equivalent to attending only to word-level information. When the window size is too large, the network is easily affected by noise from unrelated information in the same window.
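To make the meaning of a "hop" concrete, the sketch below shows one way to implement repeated aspect attention over the memory; the scoring function and query update are simplified placeholders, not the paper's exact equations.

```python
import torch
import torch.nn as nn

class MultiHopAspectAttention(nn.Module):
    """Sketch: the aspect query attends over the memory once per hop and is
    updated from the attended context before the next hop."""
    def __init__(self, dim=300, hops=2, n_classes=3):
        super().__init__()
        self.hops = hops
        self.score = nn.Linear(2 * dim, 1)       # scores [memory_slot ; query]
        self.update = nn.Linear(dim, dim)        # query transform between hops
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, memory, aspect_vec):       # memory: (B, T, D), aspect_vec: (B, D)
        query = aspect_vec
        for _ in range(self.hops):
            expanded = query.unsqueeze(1).expand_as(memory)          # (B, T, D)
            scores = self.score(torch.cat([memory, expanded], -1)).squeeze(-1)
            weights = torch.softmax(scores, dim=-1)                   # attention weights
            context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)
            query = context + self.update(query)                      # refined query
        return self.classifier(query)             # polarity logits
```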
TABLE V Cases in the Lap-ATSA Dataset. The Bold Text Is the Aspect and the Subscript Is the Label
Fig. 4. Comparison of attention on word-level memory and CMA memory. The attention score computed by (9) is used for color-coding; a deeper color means a higher attention score.
We use the same method to find the best values of the model on the other datasets. On the Res-ATSA dataset, the window size is 3 and the number of hops is 5. On the Res-ACSA dataset, the window size is 2 and the number of hops is 2. On the Tweet-ACSA dataset, the window size is 2 and the number of hops is 2.
In this section, we analyze some cases from the Lap-ATSA dataset, as shown in Table V and Fig. 4, to illustrate the effectiveness of the mechanism.
There are three types of examples that most methods find difficult to identify. The first is implicit sentiment expression. In Case 1, the reviewer mentions using the "gestures" unconsciously to convey that they like them. There are no obvious sentiment words, yet our system recognizes such examples correctly. Implicit sentiment is another important research direction in SA. The second is the complex expression of important information. The aspect-sequence attention in MemNet can capture information useful for the aspect, but often not accurately enough, and it does not recognize all aspects correctly, as in Case 2. As shown in Fig. 4(a), "beats windows easily" in the description of "speed" shows a negative polarity towards "windows", but it is hard for word-level mechanisms to capture information such as "A beats B". Convolution can combine related and important features, and self-attention attends to the semantics of the sequence itself, so the network can better understand the relationships between important words. The third is contextual expressions such as negation, comparison and condition. As with the comparative expression in Case 3, this is a difficult problem for word-level mechanisms. As shown in Fig. 4(b), if a model lacks the sequence semantics, it may only see "price" and "higher" in the sentence when analyzing "PC", and is likely to give a negative judgment to both "PC" and "Mac". Convolution and self-attention can better understand this kind of contextual information and enable the model to focus on the word "compared".
In this paper, we propose a highly parallel convolutional multi-head self-attention based memory network. Compared with an embedding-based memory network, CMA-MemNet can better capture the complex semantic information of the context and gives more attention to the semantic relations between the words in the sequence itself. We show the performance of the model on four datasets for the ATSA and ACSA tasks and demonstrate its effectiveness. In the future, we would like to consider more types of memory modules for representing semantic information, and to comprehensively analyze aspects according to the scores output by the different memory modules.