YAN Hong, CHEN Xingshu, WANG Wenxian, WANG Haizhou, YIN Mingyong
CLC number: TP391.1
Document code: A
Abstract: In existing French Named Entity Recognition (NER) research, machine learning models mostly use the character-morphological features of words, while multilingual generic named entity models use the semantic features represented by word embeddings; neither takes semantic, character-morphological and grammatical features into account comprehensively. To address this shortcoming, a deep neural network based model, CGC-fr, was designed to recognize French named entities. Firstly, word embeddings, character embeddings and grammar feature vectors were extracted from the text. Then, character features were extracted from the character embedding sequence of each word by a Convolutional Neural Network (CNN). Finally, a Bidirectional Gated Recurrent Unit network (BiGRU) and a Conditional Random Field (CRF) were used to label named entities in French text according to the word embeddings, character features and grammar feature vectors. In the experiments, the F1 value of the CGC-fr model reaches 82.16% on the test set, which is 5.67, 1.79 and 1.06 percentage points higher than that of the NERC-fr, LSTM-CRF (Long Short-Term Memory network with CRF) and Char attention models respectively. The experimental results show that the CGC-fr model fusing the three kinds of features is more advantageous than the other models.
英文關(guān)鍵詞Key words: Named Entity Recognition (NER); French; neural network; Natural Language Processing (NLP); sequence labeling
0 Introduction
Named Entity Recognition (NER) is the process of identifying the names or symbols of specific types of things in text [1]. It extracts the more meaningful person names, organization names, place names and so on, so that subsequent natural language processing tasks can use the named entities to obtain further information. With globalization, information exchange between countries has become increasingly frequent. Compared with Chinese-language material, foreign-language information has a greater influence on how other countries view China, and multilingual public-opinion analysis has emerged in response. Among non-English languages, French is relatively influential, and French text is therefore one of the important targets of multilingual public-opinion analysis. As a fundamental task of French text analysis, French NER is of non-negligible importance.
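NER of this kind is commonly cast as a sequence labeling task over a BIO tagging scheme, where each token receives a tag marking whether it begins, continues, or lies outside an entity. The following minimal sketch shows how entities are read back out of a BIO-tagged sentence; the example tokens, tags and entity types are invented for illustration.

```python
# A sentence labelled with the BIO scheme used in sequence-labelling NER.
# Tokens and tags here are an illustrative example, not data from the paper.
tokens = ["Emmanuel", "Macron", "arrive", "à", "Paris"]
tags   = ["B-PER",    "I-PER",  "O",      "O", "B-LOC"]

def bio_to_entities(tokens, tags):
    """Collect (type, text) entities from a BIO-tagged sentence."""
    entities, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):              # a new entity begins
            if current:
                entities.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)            # entity continues
        else:                                 # outside any entity
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, " ".join(words)) for etype, words in entities]

ents = bio_to_entities(tokens, tags)
```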
Research dedicated specifically to French NER is scarce. Early work was mainly based on rules and dictionaries [2]; later, manually selected features were typically fed into machine learning models to identify the named entities present in text [3-7]. Azpeitia et al. [3] proposed the NERC-fr model, which uses a maximum entropy method to recognize French named entities with features such as word suffixes, character windows, neighbouring words, word prefixes, word length, and whether the initial letter is capitalized. The method achieved good results, but most of its features describe the morphological structure of words rather than their semantics, and this lack of semantic features may have limited the model's recognition accuracy.
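Handcrafted surface features of this kind can be sketched as a simple per-token feature extractor; the feature names, affix length and boundary markers below are illustrative assumptions, not NERC-fr's exact feature set.

```python
def surface_features(tokens, i, affix_len=3):
    """Surface-form features of the kind used by feature-based NER models:
    prefix/suffix, word length, capitalisation, and neighbouring words.
    Feature names and affix_len are illustrative, not NERC-fr's actual ones."""
    w = tokens[i]
    return {
        "prefix": w[:affix_len],
        "suffix": w[-affix_len:],
        "length": len(w),
        "is_capitalized": w[:1].isupper(),
        # hypothetical sentence-boundary markers for the context window
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }

feats = surface_features(["Emmanuel", "Macron", "visite", "Paris"], 1)
```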
In recent years, deep neural networks have achieved good results in natural language processing. Hammerton [8] applied the Long Short-Term Memory (LSTM) network to English NER. Rei et al. [9] proposed the multilingual Char attention model, which uses an attention mechanism to fuse word embeddings and character embeddings and feeds them as features into a Bidirectional Long Short-Term Memory (BiLSTM) network, obtaining named entities through sequence labeling. Lample et al. [10] proposed the LSTM-CRF model, a BiLSTM followed by a Conditional Random Field (CRF); it is likewise multilingual and uses word and character embeddings as features to recognize English named entities. When applied to French, however, LSTM-CRF falls well short of its performance on English, possibly because it uses no grammatical features of the language; after all, French grammar is considerably more complex than English grammar.
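The CRF layer used by LSTM-CRF (and by the model described below) is decoded with the Viterbi algorithm: the network supplies per-token emission scores for each tag, the CRF adds tag-to-tag transition scores, and decoding finds the highest-scoring tag sequence. A minimal NumPy sketch follows; the three-tag toy scores in the usage example are invented for illustration.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Highest-scoring tag sequence given per-token emission scores (T x K)
    and tag-transition scores (K x K), as in the CRF layer of a neural tagger.
    Scores are unnormalised log-potentials."""
    T, K = emissions.shape
    score = emissions[0].copy()           # best score ending in each tag at t=0
    back = np.zeros((T, K), dtype=int)    # backpointers
    for t in range(1, T):
        # total[j, k] = score of ending in tag j at t-1, moving to tag k at t
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    # follow backpointers from the best final tag
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: tags 0=O, 1=B-PER, 2=I-PER; O -> I-PER is heavily penalised.
emissions = np.array([[0.0, 2.0, 1.0],
                      [0.0, 0.0, 0.5],
                      [2.0, 0.0, 0.0]])
transitions = np.array([[0.0, 0.0, -10.0],
                        [0.0, 0.0,   1.0],
                        [0.0, 0.0,   1.0]])
path = viterbi_decode(emissions, transitions)
```

The transition matrix is what lets the CRF enforce tag-sequence constraints (e.g. I-PER must follow B-PER or I-PER) that independent per-token classification cannot.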
To take semantic, character-morphological and grammatical features into account during extraction and thus recognize French named entities more accurately, this paper designs the CGC-fr model. The model uses word embeddings to represent the semantic features of the words in the text, uses a Convolutional Neural Network (CNN) to extract the character-morphological features contained in character embeddings, and combines these with pre-extracted French grammatical features; the concatenated features are fed into a composite network coupling a bidirectional Gated Recurrent Unit (GRU) network with a CRF to recognize French named entities. CGC-fr makes full use of these features: experiments quantify the contribution of each feature, and comparison with other models shows that CGC-fr, which fuses the three kinds of features, is more advantageous. In addition, this paper contributes a French dataset of 1005 articles and 29016 entities, enlarging the pool of French NER datasets so that subsequent research is less constrained by data availability.
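The character-feature step described above, a CNN slid over a word's character embedding sequence and max-pooled into a fixed-length vector, can be sketched as follows. The filter width, dimensions, padding scheme and random toy inputs are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def char_cnn_feature(char_embeds, filters, width=3):
    """Word-level character feature via a 1-D CNN over character embeddings:
    slide `filters` (n_filters x width x d) over the sequence (L x d) and
    max-pool each filter's responses over positions (a sketch; sizes are
    illustrative, not the paper's settings)."""
    L, d = char_embeds.shape
    # zero-pad so even very short words yield at least one window
    pad = np.zeros((width - 1, d))
    x = np.vstack([pad, char_embeds, pad])
    n_filters = filters.shape[0]
    responses = np.empty((x.shape[0] - width + 1, n_filters))
    for t in range(responses.shape[0]):
        window = x[t:t + width]                                # width x d
        responses[t] = np.tensordot(filters, window,
                                    axes=([1, 2], [0, 1]))     # one per filter
    return responses.max(axis=0)      # max over positions -> (n_filters,)

chars = rng.normal(size=(6, 8))       # e.g. a 6-character word, 8-dim embeddings
filters = rng.normal(size=(10, 3, 8)) # 10 filters of width 3
feat = char_cnn_feature(chars, filters)
```

In the full model, a vector like `feat` would be concatenated with the word embedding and the grammar feature vector before entering the BiGRU-CRF network.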
4 結(jié)語
本文設(shè)計了用于法語命名實體識別的深度神經(jīng)網(wǎng)絡(luò)CGCfr模型,并構(gòu)建了一個法語命名實體識別數(shù)據(jù)集。CGCfr模型將法語文本中單詞的詞嵌入作為語義特征,從單詞對應(yīng)的字符嵌入序列提取單詞的形態(tài)結(jié)構(gòu)特征,結(jié)合語法特征完成對命名實體的識別。這增加了傳統(tǒng)統(tǒng)計機器學(xué)習(xí)方法中特征的多樣性,豐富了特征的內(nèi)涵, 也避免了多語言通用方法對法語語法的忽視。實驗對比模型中各個特征的貢獻度,驗證了它們的有效性;還將CGCfr模型與最大熵模型NERCfr、多語言通用模型Char attention和LSTMCRF對比。實驗結(jié)果表明,CGCfr模型相對三者的F1值都有提高,驗證了融合三種特征的本文模型在法語命名實體識別上的有效性,進一步提高了法語命名實體的識別率。
The model nevertheless has shortcomings. The recognition rate for organization names in French text lags considerably behind that of the other two entity types; that is, the model does not handle entity types whose surface forms vary widely very well. Moreover, compared with the relatively high NER accuracy achieved for English, French NER still has considerable room for improvement.
References
[1] NADEAU D, SEKINE S. A survey of named entity recognition and classification[J]. Lingvisticae Investigationes, 2007, 30(1): 3-26.
[2] WOLINSKI F, VICHOT F, DILLET B. Automatic processing of proper names in texts[C]// Proceedings of the 7th Conference on European Chapter of the Association for Computational Linguistics. San Francisco, CA: Morgan Kaufmann Publishers, 1995: 23-30.
[3] AZPEITIA A, CUADROS M, GAINES S, et al. NERC-fr: supervised named entity recognition for French[C]// TSD 2014: Proceedings of the 2014 International Conference on Text, Speech and Dialogue. Berlin: Springer, 2014: 158-165.
[4] POIBEAU T. The multilingual named entity recognition framework[C]// Proceedings of the 10th Conference on European Chapter of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2003: 155-158.
[5] PETASIS G, VICHOT F, WOLINSKI F, et al. Using machine learning to maintain rule-based named-entity recognition and classification systems[C]// Proceedings of the 39th Annual Meeting on Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2001: 426-433.
[6] WU D, NGAI G, CARPUAT M. A stacked, voted, stacked model for named entity recognition[C]// Proceedings of the 7th Conference on Natural Language Learning at HLT. Stroudsburg, PA: Association for Computational Linguistics, 2003: 200-203.
[7] NOTHMAN J, RINGLAND N, RADFORD W, et al. Learning multilingual named entity recognition from Wikipedia[J]. Artificial Intelligence, 2013, 194:151-175.
[8] HAMMERTON J. Named entity recognition with long short-term memory[C]// Proceedings of the 7th Conference on Natural Language Learning at HLT. Stroudsburg, PA: Association for Computational Linguistics, 2003: 172-175.
[9] REI M, CRICHTON G, PYYSALO S. Attending to characters in neural sequence labeling models[J/OL]. arXiv Preprint, 2016: arXiv:1611.04361 [2016-11-14]. https://arxiv.org/abs/1611.04361.
[10] LAMPLE G, BALLESTEROS M, SUBRAMANIAN S, et al. Neural architectures for named entity recognition[C]// Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2016: 260-270.
[11] LE Q, MIKOLOV T. Distributed representations of sentences and documents[C]// Proceedings of the 31st International Conference on Machine Learning. New York: JMLR.org, 2014: 1188-1196.
[12] PENNINGTON J, SOCHER R, MANNING C. Glove: global vectors for word representation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2014: 1532-1543.
[13] SANTOS C D, ZADROZNY B. Learning character-level representations for part-of-speech tagging[C]// Proceedings of the 31st International Conference on Machine Learning. New York: JMLR.org, 2014: 1818-1826.
[14] CHO K, van MERRIENBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2014: 1724-1734.
[15] SANG E F, VEENSTRA J. Representing text chunks[C]// Proceedings of the 9th Conference on European Chapter of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 1999: 173-179.