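Construction of visualization domain-specific knowledge graph of crop diseases and pests based on deep learning
Wu Saisai1, Zhou Ailian1※, Xie Nengfu1, Liang Xiaohe1, Wang Huijuan1, Li Xiaoyu1, Chen Guipeng2
(1. Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100086, China; 2. Institute of Agricultural Economics and Information, Jiangxi Academy of Agricultural Sciences, Nanchang 330200, China)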
To address the intertwined entity relations, weak aggregation of multi-source heterogeneous data, and difficulty of knowledge sharing in the crop disease and pest domain, this study exploits the ability of knowledge graphs to describe complex relations between entities in a structured form and proposes a deep-learning-based method for constructing a crop disease and pest knowledge graph. Building on a domain ontology, the method jointly extracts entities and relations with a new annotation scheme suited to the domain corpus. Entity and relation extraction is cast as a sequence labeling problem in which entities and relations are annotated in a single pass, which improves annotation efficiency. To handle overlapping relations, triples are modeled directly rather than modeling entities and relations separately, and triples are recovered by tag matching and mapping. Experiments used an end-to-end model combining Bidirectional Encoder Representations from Transformers (BERT), a Bi-directional Long-Short Term Memory network (BiLSTM), and a Conditional Random Field (CRF). The model outperformed a pipeline method based on conventional annotation and the classic joint-learning models Convolutional Neural Networks (CNN)+BiLSTM+CRF and BiLSTM+CRF, reaching an F1 score of 91.34%. Finally, the extracted knowledge was stored in the Neo4j graph database, which exposes the internal structure of the knowledge graph and supports knowledge visualization and reasoning. The resulting knowledge graph can serve as a high-quality knowledge base for downstream applications such as intelligent question answering, recommendation, and intelligent search for crop diseases and pests.
crops; diseases and pests; models; knowledge graph; deep learning; joint extraction of entities and relations
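In 2012, Google introduced the concept of the Knowledge Graph (KG), offering a new approach to knowledge management. A knowledge graph is essentially a structured semantic knowledge base that describes concepts, entities, and their relations in the objective world, usually as triples of the form (entity, relation, entity) or (entity, attribute, attribute value). Knowledge graphs can structure heterogeneous domain knowledge and are well suited to describing interactions between entities. By making domain knowledge explicit and linked, they alleviate the scattered, complex, and siloed nature of domain data, and they have been widely applied in medicine, biology, finance, and other fields[1]. By coverage, knowledge graphs are divided into open-domain knowledge graphs[2-5] and vertical-domain knowledge graphs[6-8]. Open-domain graphs emphasize breadth, whereas vertical-domain graphs emphasize depth; because the latter lack annotated training corpora and depend heavily on experts, they are generally smaller and more costly to build.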
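Diseases and pests have long been a major factor affecting crop production in China. With the development of information technology, the Internet has become the main source of knowledge on disease and pest control. However, openly available knowledge in this domain is mostly stored in traditional databases, which suffer from weak aggregation, low utilization, and difficulty of knowledge sharing. Given how well knowledge graphs manage domain knowledge, some agricultural knowledge graphs have already been built, but in-depth studies of crop disease and pest knowledge graphs remain few. East China Normal University built a knowledge graph for smart agriculture and an application system from fragmented agricultural big data (https://github.com/qq547276542/Agriculture_KnowledgeGraph); Xia Yingchun[9] first generated an ontology layer from a classification standard for crop disease and pest data, then extended it with an entity layer to form a preliminary knowledge graph, and visualized it; Wu Qian[10] used ontology and related techniques to build an agricultural knowledge graph covering crop varieties, crop diseases and pests, and pesticide and fertilizer data; Wang Dandan[11] built a rice knowledge graph. These knowledge graphs still have considerable room for improvement in scale, intelligence, and systematization. Effectively extracting semi-structured and unstructured data, extracting overlapping relations from text, and reducing reliance on hand-crafted features remain challenging.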
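Knowledge graph construction combines knowledge representation, knowledge extraction, and knowledge storage. Knowledge representation is a data structure for describing knowledge in a form a computer can process. Early representations were weakly expressive and inflexible, so ontologies have become the most common means of knowledge representation, sharing, and reuse. Knowledge extraction is the core step of knowledge graph construction and comprises Named Entity Recognition (NER) and Relation Extraction (RE). Depending on the order in which NER and RE are carried out, entity and relation extraction methods are divided into pipeline methods and joint learning methods. Pipeline methods[12-14] treat NER and RE as 2 independent subtasks: entities are first recognized in the text, and the semantic relation between each entity pair is then classified. Although flexible and easy to model, splitting the 2 tasks leads to error propagation, information loss, and redundant entities. Joint learning has therefore become mainstream in recent years and, by modeling target, falls into 2 families: parameter sharing and sequence labeling. Parameter-sharing methods model entities and relations separately and learn them jointly through a shared encoding layer so that the 2 subtasks interact[15-16], but they still cannot eliminate redundant entity information. Some researchers[17-21] have therefore recast joint entity and relation extraction as a sequence labeling problem, which mitigates entity redundancy and overlapping relations to some extent. Liu et al.[22] analyzed, in light of the characteristics of crop disease and pest data, the key techniques and methods used in recent years to build disease and pest knowledge graphs, and concluded that ontology learning, machine learning, and deep learning are the key techniques for automatic knowledge extraction and current research hotspots in this field. Knowledge graphs are mainly stored in 2 ways: storage based on the Resource Description Framework (RDF) and storage based on graph databases. RDF is designed primarily for easy publication and sharing of data, whereas graph databases use the property graph as their basic representation, which maps more naturally onto real application scenarios and supports efficient graph queries and search. Graph-database storage has therefore become mainstream in recent years, and Neo4j, an open-source graph database system, is currently the main choice for storing knowledge graphs.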
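How to accurately extract useful knowledge such as pathogens, damaged plant parts, and control agents from massive and complex crop disease and pest data is the key problem in building a crop disease and pest knowledge graph. With the development of information technology, deep learning has gradually permeated every stage of knowledge graph construction[23]. To improve the efficiency and accuracy of knowledge extraction and reduce the cost of construction, this study builds on a domain ontology and uses a novel corpus annotation scheme to jointly extract entities and relations: entities and relations are annotated in a single pass, triples are modeled directly, and triples are recovered by tag matching and mapping. An end-to-end model combining Bidirectional Encoder Representations from Transformers (BERT), a Bi-directional Long-Short Term Memory network (BiLSTM), and a Conditional Random Field (CRF) is used for training and prediction. Finally, the extracted triples are stored in the Neo4j graph database for visualization and knowledge reasoning. The knowledge graph can provide a high-quality knowledge base for downstream applications such as intelligent question answering, recommendation, and intelligent search for crop diseases and pests, and can be applied to agricultural production tasks such as crop variety selection, disease and pest control, and fertilization and irrigation.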
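Knowledge graphs are built either bottom-up or top-down. The bottom-up approach is data-driven and better suited to open-domain knowledge graphs. Vertical domains, because of their industry-specific expertise, complex and changing business requirements, and demand for high-quality data, mostly adopt the top-down approach[24]: the ontology and data schema are defined first, and entities and their relations are then populated into the knowledge graph. This study adopts the top-down approach. The construction workflow, shown in Fig. 1, comprises data acquisition, ontology construction, knowledge extraction, and knowledge storage.
Fig. 1 Construction workflow of the crop disease and pest knowledge graph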
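The main data source of this study is the crop disease and pest knowledge website of the Chinese Crop Germplasm Information Network (http://www.cgris.net/disease/default.html). Data were crawled with the Scrapy framework in Python and preprocessed with rules and manual review to obtain noise-free plain-text corpora. Because the XPath structure of the site is irregular, a single XPath-based parsing routine could not crawl page content directly. Instead, each disease or pest record was treated as a basic unit and crawled with a multi-level page crawler, yielding 1 619 records that cover diseases and pests of 12 crop categories: rice, wheat and related cereals, legumes, maize, minor grains, tuber crops, cotton and hemp, oil crops, sugar and tobacco, tea and mulberry, medicinal plants, and stored grain. The crawled data still contained irrelevant content such as page navigation, advertisements, and duplicate values, as well as missing data, so regular expressions combined with manual review were used to remove redundant values and fill in missing ones. The preprocessed text retains the semi-structured form of the original web pages and mainly contains the disease or pest name together with attributes such as symptoms, pathogen, transmission route and disease conditions, and control methods.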
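The rule-based cleaning step described above can be illustrated with a minimal Python sketch. It is not the authors' preprocessing code: the function name `clean_record`, the two regular expressions, and the `noise_markers` strings are hypothetical stand-ins, since the paper does not list the actual rules or the site's navigation text.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")          # leftover HTML tags
SPACE_RE = re.compile(r"[ \t\u3000]+")   # runs of ASCII or full-width spaces

def clean_record(raw_lines, noise_markers=("Home", "Back to top")):
    """Strip tags, drop navigation-like lines and repeated values.

    noise_markers is a hypothetical list; the real site's boilerplate
    strings are not given in the paper.
    """
    seen, cleaned = set(), []
    for line in raw_lines:
        text = SPACE_RE.sub(" ", TAG_RE.sub("", line)).strip()
        if not text or any(marker in text for marker in noise_markers):
            continue                      # empty line or page boilerplate
        if text in seen:
            continue                      # duplicate value
        seen.add(text)
        cleaned.append(text)
    return cleaned
```

Missing values cannot be recovered by rules alone, which is presumably why the paper pairs this kind of filtering with manual review.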
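An ontology is an explicit specification of a conceptualization[25]. A crop disease and pest ontology describes and organizes crop disease and pest knowledge in a form a computer can interpret; building this upper layer makes it possible to organize and manage the data layer effectively. This study uses the open-source ontology editor Protégé[26], which allows top-level concepts, relations between entities, and entity attributes to be defined without a complex ontology language, and allows constraints to be set on the domain and range of relations and attributes. The ontology is limited to 4 levels (Fig. 2) and contains 6 parent concepts: disease and pest, crop, pathogen, geography, taxonomy, and pesticide. To describe more precisely the relations between disease and pest entities and other entity types, a set of relations and a set of attributes were predefined according to the data characteristics, practical requirements, and guidance from domain experts. The relation set includes {damaged crop, damaged part, distribution area, ...}, and the attribute set includes {symptoms, damage characteristics, control methods, ...}. Domains and ranges were set for the relations and attributes to delimit the scope of knowledge extraction. A domain and range constrain the values a relation or attribute can take; for example, the subject of the relation "damaged crop" must be a disease or pest entity, and its object must be a crop entity.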
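The effect of domain and range constraints can be sketched in a few lines of Python. The schema below is an illustrative assumption: only the "damaged crop" constraint (disease or pest as subject, crop as object) is stated in the text, and the other two entries, the type names, and the function name are hypothetical.

```python
# Relation -> (domain, range). Only "damaged_crop" is taken from the text;
# the other entries are illustrative assumptions.
RELATION_SCHEMA = {
    "damaged_crop": ("PestOrDisease", "Crop"),
    "distribution_area": ("PestOrDisease", "Geography"),
    "control_pesticide": ("PestOrDisease", "Pesticide"),
}

def is_valid_triple(head_type, relation, tail_type, schema=RELATION_SCHEMA):
    """True when the relation is predefined and both entity types match it."""
    if relation not in schema:
        return False                      # outside the predefined relation set
    domain, range_ = schema[relation]
    return head_type == domain and tail_type == range_
```

A check of this kind rejects triples such as (crop, damaged crop, crop) before they reach the graph, which is one way the ontology can bound what is extracted.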
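Fig. 2 Ontology model for the crop disease and pest knowledge graph
When data were crawled from the crop disease and pest knowledge website of the Chinese Crop Germplasm Information Network, their semi-structured information, such as titles, paragraph levels, and subheadings, was captured as well. In practice, these semi-structured features can be used to write rules that extract instances of the form (name: crop disease or pest; attribute 1: value 1; attribute 2: value 2; ...; attribute n: value n). The text is first parsed into structured .json format, in which each disease or pest entity is an object and each attribute and its value form a key-value pair. Then, using the py2neo module in Python to pass Cypher statements directly, the 1 619 instances are stored in the Neo4j graph database (Fig. 3). Each instance is a node that holds the entity name and attributes such as symptoms, pathogen, and control methods with their values, for example {name: rice leaf scald; symptoms: also called leaf blight ...; pathogen: (Hashioka et Yokogi) W. Gams ...; ...; control methods: (1) use disease-free seed ...}.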
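As a rough illustration of the JSON-to-Cypher step, the sketch below turns one parsed instance into a parameterized Cypher statement. It is an assumption-laden example rather than the paper's code: the node label `PestOrDisease`, the key `name`, and the function name are hypothetical, and the statement is only built here, not executed against a database.

```python
import json

def build_node_statement(record, label="PestOrDisease"):
    """Return a parameterized Cypher statement and its parameters.

    The label and the "name" key are illustrative assumptions; the paper
    does not give its node schema.
    """
    query = f"MERGE (n:{label} {{name: $name}}) SET n += $props"
    props = {key: value for key, value in record.items() if key != "name"}
    return query, {"name": record["name"], "props": props}

# One instance as it might look after parsing to JSON (values abbreviated).
record = json.loads(
    '{"name": "rice leaf scald", "symptoms": "also called leaf blight", '
    '"control_methods": "use disease-free seed"}'
)
query, params = build_node_statement(record)
```

With py2neo, a pair like this would typically be handed to `Graph.run(query, params)`; passing values as parameters rather than formatting them into the string avoids quoting problems in long attribute texts.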
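In semi-structured extraction, a whole passage is stored as a single attribute value, but the text of an attribute value still contains much unmined information. For example, the symptoms value of rice leaf scald also contains entity relations such as aliases, distribution areas, and damaged parts, and extracting these is knowledge extraction from unstructured data. Extracting triples from unstructured text is challenging. Compared with general corpora, the crop disease and pest corpus in this study has 3 distinctive features: 1) Each record is about a single disease or pest entity, so within one record the head entity is fixed and only the tail entities and their relations to the head need to be extracted. 2) Entity density is high: the disease or pest entity forms relation pairs with many entities in the text, and head and tail entities are often far apart. Dense entities in a sentence might seem to help a named entity recognition model fit, but the same entity participates in several relation pairs of different types. With limited annotation, a model that cannot represent sentence-level semantics tends to underfit these interleaved relations, and relations between 2 distant entities are harder to extract[27]. 3) Relations between entities are complex. Recommended and prohibited pesticides often appear in the same text; their names are highly similar, but the relation types they belong to are entirely different or even mutually exclusive, which makes relation extraction harder.
Fig. 3 Example of stored semi-structured knowledge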
Based on these corpus characteristics, and following the idea in [17-21] of casting joint entity and relation extraction as sequence labeling, this study uses the annotation scheme "Main_Entity+Relation+Begin-Inside-End-Single-Other" (ME+R+BIESO) to extract entities and relations jointly. Entities and relations are annotated in a single pass, triples are modeled directly instead of modeling entities and relations separately, and triples are obtained directly by tag matching and mapping, which improves annotation efficiency and handles overlapping relations. To capture richer sentence-level semantics and to alleviate interleaved relations and long distances between entities, the BERT pre-trained language model is introduced, and an end-to-end BERT-BiLSTM+CRF model is used for training and prediction, which extracts word-level features and also learns sentence-level semantic features.
1.4.1 The ME+R+BIESO annotation method
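In a corpus where each record describes a single main entity (Main_Entity, ME), extracting entities and relations reduces to extracting the entities $\{X_1, X_2, \dots, X_i, \dots, X_n\}$ related to ME and the relations $\{R_1, R_2, \dots, R_i, \dots, R_n\}$ between those entities and ME, where $X_i$ is the $i$-th entity related to ME and $R_i$ is the type of relation between $X_i$ and ME. To reduce entity redundancy, only relations in the predefined relation set of the ontology are extracted.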
The ME+R+BIESO scheme annotates the main entity and its relations to other entities in a single pass. The main entity is first tagged ME. When an entity $X_i$ in the text has relation $R_i$ with ME, the tag of $X_i$ is set to $R_i$, and Begin-Inside-End-Single-Other (BIESO) markers encode the position of each character within ME and $X_i$ (Table 1). Whenever the tag ME and a complete BIE, BE, or S tag set for the same relation $R_i$ are matched within a record, the entities ME and $X_i$ corresponding to those tags are extracted and, through tag mapping and parsing, form the triple (ME, $R_i$, $X_i$).
Table 1 Meaning of the tags in the ME+R+BIESO annotation method
Note: $X_i$ is the $i$-th entity that has a relation with the main entity.
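Take the record describing rice leaf scald as an example (Fig. 4). Rice leaf scald is first tagged ME. Because leaf blight is an alias of rice leaf scald, leaf blight is tagged Other Name (ON); because the leaf blade is a part damaged by rice leaf scald, leaf blade is tagged Damage Position (DP). When the main entity ME and a BIE tag set for the relation ON are matched (the alias is three characters long in the Chinese text), the triple (rice leaf scald, alias, leaf blight) is generated; when ME and a BE tag set for DP are matched (a two-character term), the triple (rice leaf scald, damaged part, leaf blade) is generated. When the next ME tag is matched, all triples for the previous main entity have been extracted.
Note: ME is the main entity, ON is the alias relation, and DP is the damaged-part relation.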
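The tag matching and mapping described above can be sketched as a small decoder. This is a minimal illustration under assumptions, not the authors' implementation: tags are assumed to be written as position-label strings such as `B-ME`, `E-ON`, or `S-DP` with `O` for other characters, and the function name is hypothetical.

```python
def decode_triples(chars, tags):
    """Turn characters and ME+R+BIESO tags into (ME, relation, X) triples."""
    triples, main_entity, span, span_label = [], None, [], None

    def close(label, text):
        nonlocal main_entity
        if label == "ME":
            main_entity = text            # a new main entity starts a new record
        elif main_entity is not None:
            triples.append((main_entity, label, text))

    for ch, tag in zip(chars, tags):
        if tag == "O":
            span, span_label = [], None
            continue
        pos, label = tag.split("-", 1)
        if pos == "S":                    # single-character entity
            close(label, ch)
            span, span_label = [], None
        elif pos == "B":                  # open a new span
            span, span_label = [ch], label
        elif pos in ("I", "E") and span and label == span_label:
            span.append(ch)
            if pos == "E":                # complete BIE or BE set
                close(label, "".join(span))
                span, span_label = [], None
        else:
            span, span_label = [], None   # incomplete span: discard
    return triples

# Character-level example built from the rice leaf scald sentence.
chars = list("水稻云形病又称叶枯病为害叶片")
tags = ["B-ME", "I-ME", "I-ME", "I-ME", "E-ME", "O", "O",
        "B-ON", "I-ON", "E-ON", "O", "O", "B-DP", "E-DP"]
triples = decode_triples(chars, tags)
```

Because every non-ME span is paired with the current main entity, several tail entities in one record simply yield several triples with the same head, which is how one-to-many overlapping relations fall out of the scheme.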
The ME+R+BIESO method attends only to the relation type $R_i$ between the main entity and each other entity, not to the entity type of each entity, and it annotates and extracts only over the predefined relation set, which reduces redundancy from unrelated entity pairs and error propagation. Overlapping relations between ME and several $X_i$ are also handled, since tag matching and mapping yields one triple per related entity. In addition, conventional annotation for pipeline extraction first annotates and recognizes entities and then annotates and classifies the relation of each related entity pair, whereas ME+R+BIESO annotates entities and relations together and saves at least half of the annotation cost. The method has a limitation: it covers only one-to-many overlapping relations; many-to-many overlapping relations are left for future work.
1.4.2 The BERT-BiLSTM+CRF model
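On top of the ME+R+BIESO scheme, an end-to-end BiLSTM+CRF model with BERT character embeddings is used to train on and predict the tags. The overall architecture is shown in Fig. 5 and has 3 parts. The annotated corpus first passes through the BERT pre-trained language model, which produces context-dependent character vectors. The character vectors are then fed to the BiLSTM module for bidirectional encoding, which outputs a score for each tag. Finally, the CRF module decodes the BiLSTM output; it learns tag transition probabilities and constraints during training and produces the final predicted tag sequence.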
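In Natural Language Processing (NLP) tasks, a language model must convert text into vectors that a computer can process. Traditional models such as Word2Vec[28] and GloVe[29] are single-layer neural networks and cannot represent polysemy well. Devlin et al.[30] therefore proposed the BERT pre-trained language model, which here converts the raw input into vectors that are then fed to the BiLSTM layer to learn contextual features. BERT was the first unsupervised, deeply bidirectional model for pre-training in NLP. It is pre-trained on 2 tasks, masked language modeling and next sentence prediction, so the vectors it produces carry word-level contextual features and also capture sentence-level features[31].
Note: E1, E2, ..., EN are the BERT input embeddings; each token in the sequence is the sum of a token embedding, a segment embedding, and a position embedding. T1, T2, ..., TN are the BERT outputs, that is, sequence vectors with rich semantic features obtained after feature extraction by the bidirectional Transformer. B-ON is the first character of the entity tagged ON; I-ON is an inside character of the entity tagged ON; E-ON is the last character of the entity tagged ON.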
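BiLSTM[32] takes the vectors generated by BERT as input and captures contextual features to obtain fuller semantic information. The Long-Short Term Memory network (LSTM)[33] is a variant of the Recurrent Neural Network (RNN)[34]. It adds memory cells and gating to the RNN so that historical context is selectively forgotten, updated, and passed on, which lets it learn long-range semantic dependencies while reducing network depth and effectively alleviating vanishing and exploding gradients. A BiLSTM combines a forward LSTM and a backward LSTM, turning the original ordered input sequence into 2 inputs, one forward and one reversed, so that the network receives both forward and backward information, captures longer-range bidirectional dependencies better, and performs better in Chinese sequence labeling.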
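Although BiLSTM captures contextual information well, it sometimes ignores dependencies between the output tags. For example, the tag B-ON may be followed by I-ON or E-ON, but a sequence in which it is followed by B-DP, I-DP, or O is illegal. A CRF[35] learns tag transition probabilities during training and thereby adds constraints on the predicted tags that prevent illegal sequences. Using a CRF as the output layer of the BiLSTM therefore yields the best triple tagging results.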
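The tag constraints mentioned above can be written out explicitly. The sketch below is only an illustration of which tag bigrams are legal under a BIESO scheme with relation labels; a trained CRF does not apply hard rules like these but learns transition scores that make illegal bigrams unlikely. The function name and tag string format are assumptions.

```python
def is_legal_transition(prev_tag, next_tag):
    """Check one tag bigram under the BIESO scheme with relation labels."""
    def split(tag):
        return ("O", None) if tag == "O" else tuple(tag.split("-", 1))

    prev_pos, prev_label = split(prev_tag)
    next_pos, next_label = split(next_tag)
    if prev_pos in ("B", "I"):
        # Inside an open span: it must continue or end with the same label.
        return next_pos in ("I", "E") and next_label == prev_label
    # After O, E, or S no span is open: only O or a new span may follow.
    return next_pos in ("O", "B", "S")
```

For instance, `is_legal_transition("B-ON", "I-ON")` holds, while `is_legal_transition("B-ON", "B-DP")` does not, matching the examples in the text.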
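1.4.3 Evaluation metrics and experimental environment
To evaluate model performance accurately, this study uses the 3 basic metrics of entity and relation extraction: precision ($P$, %), recall ($R$, %), and F1 score ($F_1$, %). They are computed as in Eqs. (1)-(3):
$$P = \frac{TP}{TP + FP} \times 100\% \tag{1}$$
$$R = \frac{TP}{TP + FN} \times 100\% \tag{2}$$
$$F_1 = \frac{2PR}{P + R} \tag{3}$$
where TP is the number of correctly predicted positive samples, FP is the number of incorrectly predicted positive samples, and FN is the number of incorrectly predicted negative samples.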
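The three metrics can be computed directly from the counts defined above. The following Python sketch uses the standard definitions (precision as TP/(TP+FP), recall as TP/(TP+FN), F1 as their harmonic mean); the function names are hypothetical, and the zero-denominator handling is an added convention rather than something stated in the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Return precision, recall, and F1 as fractions in [0, 1]."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def f1_from_percentages(precision_pct, recall_pct):
    """Harmonic mean of precision and recall given in percent."""
    return round(2 * precision_pct * recall_pct / (precision_pct + recall_pct), 2)

# Consistency check against the pipeline baseline reported in the paper:
# precision 93.41% and recall 29.10% give an F1 score of 44.38%.
pipeline_f1 = f1_from_percentages(93.41, 29.10)   # → 44.38
```

The check shows why a high precision cannot compensate for a very low recall: the harmonic mean is dominated by the smaller of the two values.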
The experimental hardware and software were: CPU, Intel(R) Xeon(R) Bronze 3106 @ 1.70 GHz; GPU, NVIDIA GeForce RTX 2080 Ti (11 GB); memory, 32 GB; Python 3.7; TensorFlow 2.2.0.
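The study uses 1 619 crop disease and pest records (Table 2). With a resampling strategy based on cross-validation, the data were split 7:3 into a training set and a test set and fed to the BERT-BiLSTM+CRF model. To verify the advantages of the ME+R+BIESO annotation method and the BERT-BiLSTM+CRF model, a pipeline method and other classic joint-learning models were used as baselines. The results of all models are given in Table 3.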
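A 7:3 split of this kind can be sketched as a seeded shuffle followed by a cut. This is an assumption-based illustration only: the seed, the rounding rule, and the function name are hypothetical, the resulting set sizes are not claimed to match Table 2, and the sketch does not reproduce the cross-validation resampling the paper mentions.

```python
import random

def split_dataset(records, train_ratio=0.7, seed=42):
    """Shuffle records reproducibly and split them into train and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded for repeatability
    cut = int(len(shuffled) * train_ratio)  # rounds down
    return shuffled[:cut], shuffled[cut:]

train_set, test_set = split_dataset(range(1619))
```

Splitting by record rather than by sentence keeps all sentences about one disease or pest on the same side of the split, which avoids leaking a main entity from training into testing.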
Table 2 Allocation of the experimental dataset for the BERT-BiLSTM+CRF model
During training, the batch size was set according to GPU memory, and the maximum sequence length according to the average sentence length. Convergence of the loss was judged from the training log, and the dropout rate and learning rate were tuned until the training loss converged stably. The number of Long-Short Term Memory (LSTM) units was set to extend the output capacity of the model. After repeated tuning, the best combination of core parameters was: batch size 64, maximum sequence length 256, dropout rate 0.4, learning rate 0.01, and 200 LSTM units.
Table 3 Performance comparison of entity and relation extraction models
To verify the advantages of the ME+R+BIESO annotation method and the BERT-BiLSTM+CRF model in entity and relation extraction, the BERT+BERT model was chosen from the pipeline methods, and the BiLSTM+CRF and CNN+BiLSTM+CRF models from the joint learning methods, for comparison. The pipeline method uses conventional entity and relation annotation: entities are tagged with the BIO scheme, and related entity pairs are then labeled with a relation class. A relation classification model is first built with BERT; the predicted relations and the disease and pest text are then used to build a BERT entity extraction model. The entity extraction model thus predicts a tag for each token, and entity pairs are extracted from those tags. The joint-learning methods use the ME+R+BIESO annotation proposed here and were run with the end-to-end BiLSTM+CRF, CNN+BiLSTM+CRF, and BERT-BiLSTM+CRF models. The results show that although the pipeline method has high precision (93.41%), its overall performance is unbalanced: recall is very low (29.10%), so the F1 score is only 44.38%. Analysis of its final predictions shows that relations between entity pairs that are close together in the text are generally predicted correctly, whereas relations between distant pairs are mostly missed, indicating that the pipeline method is severely limited for long-distance relation prediction. Among the joint extraction models, BERT-BiLSTM+CRF clearly outperforms BiLSTM+CRF and CNN+BiLSTM+CRF: relative to these two models, its precision is higher by 7.19-7.88 percentage points, its recall by 9.74-10.51 percentage points, and its F1 score by 8.68-9.35 percentage points, reaching an F1 score of 91.34%. CNN+BiLSTM+CRF adds a CNN layer to BiLSTM+CRF, but this does not improve performance; the F1 score instead drops by 0.67 percentage points. Adding the BERT pre-trained language model to BiLSTM+CRF, by contrast, raises the F1 score by 8.68 percentage points, showing that BERT strengthens the model's semantic representation of text and captures more of the interleaved entity relations in crop disease and pest text, which improves entity and relation extraction.
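Table 4 shows the BERT-BiLSTM+CRF predictions for the main entity and for each relation type. Performance is fairly balanced overall, with F1 scores around 90%, but the "damaged part" relation is clearly below average; its recall in particular is only 58.15%, an important factor pulling down overall performance. Analysis of the corresponding corpus text and final predictions shows that the same plant part is described inconsistently in the text: "leaf blade", "mesophyll", "leaf surface", "leaf underside", "leaf sheath", "young leaf", "tender leaf", and "leaf" all refer to the leaf. This produces many false negatives during prediction, which depresses recall and thus overall performance.
Table 4 Predictions of the BERT-BiLSTM+CRF model for the main entity and the relation types between the main entity and other entities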
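Entity and relation extraction in this study is carried out over the relation set predefined in the ontology. Predefining relations sets the boundary for unstructured knowledge extraction and reduces fruitless extraction of redundant information. Combined with the ME+R+BIESO annotation method and the BERT-BiLSTM+CRF model, it substantially improves the efficiency and accuracy of entity and relation extraction and safeguards the quality of the knowledge graph.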
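Knowledge graphs are currently stored either as RDF triples or in graph databases. RDF triples are generally stored in relational databases; queries are flexible and efficient, but a large amount of redundant information is stored and regular maintenance is needed. A graph database stores the entities and concepts of a knowledge graph as vertices and entity attributes and relations as edges, in graph form, which reflects the internal structure of the knowledge graph intuitively, facilitates graph queries and knowledge reasoning, and scales well. Neo4j is an open-source graph database system whose native graph storage greatly improves retrieval performance, and it is currently the main choice for knowledge graph storage. The crop disease and pest knowledge graph in this study is therefore stored in Neo4j.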
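Because the data volume in this study is not especially large, the LOAD CSV feature of Cypher, the query language built into Neo4j, is used. The entity nodes and relations obtained by parsing are first saved as separate .csv files and placed in the Neo4j import folder, and nodes and relations are then imported with Cypher LOAD CSV statements. Storing the entities and their relations in Neo4j in this way forms the crop disease and pest knowledge graph, which contains 1 619 disease and pest instances and 28 894 triples. Part of the graph is visualized in Fig. 6, where pink nodes are disease and pest entities, blue nodes are entities related to them, and edges are the relation types between the two. The interlinked nodes provide a good basis for inferring hidden relations. For example, if the edge between "rice leaf scald" and "leaf blight" is "alias", and the edge between "rice leaf scald" and "50% thiophanate-methyl wettable powder" is "control pesticide", it can be inferred that "leaf blight" and "50% thiophanate-methyl wettable powder" are also linked by the "control pesticide" relation.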
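The alias-based inference in the example above can be sketched over plain triples in Python. This is an illustrative assumption about how such a rule could be applied, not the paper's reasoning procedure: the relation identifiers `alias` and `control_pesticide`, the choice of which relations an alias inherits, and the function name are all hypothetical.

```python
def infer_alias_relations(triples, alias="alias", inherit=("control_pesticide",)):
    """Propagate selected relations from an entity to each of its aliases."""
    aliases = {}
    for head, relation, tail in triples:
        if relation == alias:
            aliases.setdefault(head, set()).add(tail)

    known = set(triples)
    inferred = []
    for head, relation, tail in triples:
        if relation in inherit:
            for other_name in sorted(aliases.get(head, ())):
                candidate = (other_name, relation, tail)
                if candidate not in known:   # skip facts already in the graph
                    known.add(candidate)
                    inferred.append(candidate)
    return inferred

graph = [
    ("rice leaf scald", "alias", "leaf blight"),
    ("rice leaf scald", "control_pesticide", "50% thiophanate-methyl wettable powder"),
]
new_facts = infer_alias_relations(graph)
```

Restricting inheritance to an explicit list of relations is a deliberate choice in this sketch, since not every relation of an entity necessarily makes sense when copied to its alias node.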
Fig. 6 Visualization of the crop disease and pest knowledge graph
1) This study proposes a deep-learning-based method for constructing a crop disease and pest knowledge graph. Based on the corpus characteristics of the domain and a domain ontology, the method extracts semi-structured and unstructured knowledge semi-automatically and stores the knowledge graph in the Neo4j graph database, enabling visualization of entity interactions and knowledge reasoning. The method can serve as a reference for applications such as agricultural intelligent question answering, the agricultural Internet of Things, and agricultural big data analysis.
2) A corpus annotation scheme adapted to the domain data is used to jointly extract entities and relations from unstructured knowledge. Entities and relations are annotated in a single pass and triples are obtained directly by tag matching and mapping, which improves annotation efficiency and solves the extraction of one-to-many overlapping relations.
3) An end-to-end model combining Bidirectional Encoder Representations from Transformers (BERT), a Bi-directional Long-Short Term Memory network (BiLSTM), and a Conditional Random Field (CRF) was trained and evaluated on the dataset, and the experiments gave an F1 score of 91.34%.
Although the knowledge graph built in this study has reached an initial scale, there is room for improvement, and future work will explore the construction approach, many-to-many overlapping relation extraction, and automatic updating. Construction could combine the top-down and bottom-up approaches, pairing a custom ontology model with data-driven extraction, so that a clear hierarchy of concepts is defined while knowledge is also extracted automatically from public datasets, securing both quality and scale. More extensible and portable annotation methods and models for entities and relations should be studied to handle many-to-many overlapping relations in the corpus. As online data are updated rapidly, the knowledge graph must be updated and supplemented in time, and techniques such as knowledge fusion and knowledge reasoning can be used to update and upgrade it automatically.
[1]Xu Zenglin, Sheng Yongpan, He Lirong, et al. Review on knowledge graph techniques[J]. Journal of University of Electronic Science and Technology of China, 2016, 45(4): 589-606. (in Chinese with English abstract)
[2]Auer S, Bizer C, Kobilarov G, et al. Dbpedia: A Nucleus for a Web of Open Data[M]. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007.
[3]Bollacker K, Evans C, Paritosh P, et al. Freebase: A collaboratively created graph database for structuring human knowledge[C]//Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. New York, United States, 2008.
[4]Vrandečić D. Wikidata: A new platform for collaborative data collection[C]//Proceedings of the 21st International Conference on World Wide Web. New York, United States, 2012.
[5]Niu Xing, Sun Xinruo, Wang Haofen, et al. Zhishi.me - weaving Chinese linking open data[C]//International Semantic Web Conference, Berlin, Heidelberg, Germany, 2011.
[6]Swartz A. Musicbrainz: A semantic web service[J]. IEEE Intelligent Systems, 2002, 17(1): 76-77.
[7]Dodds K. Popular geopolitics and audience dispositions: James Bond and the Internet Movie Database (IMDb)[J]. Transactions of the Institute of British Geographers, 2006, 31(2): 116-130.
[8]Ruan Tong, Sun Chenglin, Wang Haofen, et al. Construction of traditional Chinese medicine knowledge graph and its application[J]. Journal of Medical Informatics, 2016, 37(4): 8-13. (in Chinese with English abstract)
[9]Xia Yingchun. Agriculture Knowledge Service System Based on Knowledge Graph[D]. Hefei: Anhui Agricultural University, 2018. (in Chinese with English abstract)
[10]Wu Qian. Design and Implementation of Agricultural Intelligent Q&A System Based on Knowledge Graph[D]. Xiamen: Xiamen University, 2019. (in Chinese with English abstract)
[11]Wang Dandan. Research and Application of Construction Method of Rice Knowledge Graph in Ningxia[D]. Ningxia: Northern University for Nationalities, 2020. (in Chinese with English abstract)
[12]Socher R, Huval B, Manning C D, et al. Semantic compositionality through recursive matrix-vector spaces[C]// Joint Conference on Empirical Methods in Natural Language Processing & Computational Natural Language Learning, Jeju Island, Korea, 2012.
[13]Marrero M, Urbano J, Sánchez-Cuadrado S, et al. Named entity recognition: Fallacies, challenges and opportunities[J]. Computer Standards & Interfaces, 2013, 35(5): 482-489.
[14]Kumar S. A survey of deep learning methods for relation extraction[J/OL]. Computer Science, 2017, [2017-05-10], https://arxiv.org/pdf/1705.03645.pdf.
[15]Miwa M, Bansal M. End-to-end relation extraction using LSTMs on sequences and tree structures[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 2016.
[16]Katiyar A, Cardie C. Going out on a limb: Joint extraction of entity mentions and relations without dependency trees[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, 2017.
[17]Zheng Suncong, Wang Feng, Bao Hongyun, et al. Joint extraction of entities and relations based on a novel tagging scheme[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, 2017.
[18]Zeng Xiaorong, Zeng Daojian, He Shizhu, et al. Extracting relational facts by an end-to-end neural model with copy mechanism[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 2018.
[19]Dai Dai, Xiao Xinyan, Lyu Yajuan, et al. Joint extraction of entities and overlapping relations using position-attentive sequence labeling[C]//Thirty-third AAAI Conference on Artificial Intelligence, Honolulu, United States, 2019, 33: 6300-6308.
[20]Luo Xukun, Liu Weijie, Ma Meng, et al. A bidirectional tree tagging scheme for jointly extracting overlapping entities and relations[J/OL]. Computation and Language, 2020, [2020-09-07], https://arxiv.org/pdf/2008.13339.pdf.
[21]Ao Dema, Yang Yunfei, Sui Zhifang, et al. Preliminary study on the construction of Chinese medical knowledge graph[J]. Journal of Chinese Information Processing, 2019, 33(10): 1-9. (in Chinese with English abstract)
[22]Liu Xiaoxue, Bai Xuesong, Wang Longhe, et al. Review and trend analysis of knowledge graphs for crop pest and diseases[J]. IEEE Access, 2019, 7(14): 62251-62264.
[23]Zhang Shanwen, Wang Zhen, Wang Zuliang. Prediction of wheat stripe rust disease by combining knowledge graph and bidirectional long short-term memory network[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2020, 36(12): 172-178. (in Chinese with English abstract)
[24]Li Sizhen. The Research and Implementation of Ontology-based Enterprise Knowledge Graph Construction[D]. Beijing: Beijing University of Posts and Telecommunications, 2019. (in Chinese with English abstract)
[25]Gruber T R. A translation approach to portable ontology specifications[J]. Knowledge Acquisition, 1993, 5(2): 199-220.
[26]Noy N F, Crubézy M, Fergerson R W, et al. Protégé-2000: An open-source ontology-development and knowledge-acquisition environment[C]//AMIA Annual Symposium proceeding, California, United States, 2003.
[27]Ning Shangming, Teng Fei, Li Tianrui. Multi-channel self-attention mechanism for relation extraction in clinical records[J]. Chinese Journal of Computers, 2020, 43(5): 916-929. (in Chinese with English abstract)
[28]Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space[C]//1st International Conference on Learning Representations, Arizona, United States, 2013.
[29]Pennington J, Socher R, Manning C. Glove: Global vectors for word representation[C]//Association for Computational Linguistics, Doha, Qatar, 2014.
[30]Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]// Association for Computational Linguistics, Minneapolis, United States, 2018.
[31]Zhang Qiuying, Fu Luoyi, Wang Xinbing. Scholar homepage information extraction based on BERT-BiLSTM-CRF[J]. Application Research of Computers, 2020, 37(Supp. 1): 47-49. (in Chinese with English abstract)
[32]Graves A, Fernández S, Schmidhuber J. Bidirectional LSTM networks for improved phoneme classification and recognition[C]//International Conference on Artificial Neural Networks, Warsaw, Poland, 2005.
[33]Sundermeyer M, Schluter R, Ney H, et al. LSTM neural networks for language modeling[C]// Conference of the international speech communication association, Portland, Oregon, United States, 2012.
[34]Mikolov T, Karafiát M, Burget L, et al. Recurrent neural network based language model[C]//Interspeech, Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, 2015.
[35]Lafferty John, Mccallum A, Pereira F C N, et al. Conditional random fields: Probabilistic models for segmenting and labeling sequence data[C]//International Conference on Machine Learning (ICML), Massachusetts, United States, 2001.
Construction of visualization domain-specific knowledge graph of crop diseases and pests based on deep learning
Wu Saisai1, Zhou Ailian1※, Xie Nengfu1, Liang Xiaohe1, Wang Huijuan1, Li Xiaoyu1, Chen Guipeng2
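(1. Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100086, China; 2. Institute of Agricultural Economics and Information, Jiangxi Academy of Agricultural Sciences, Nanchang 330200, China)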
A knowledge graph describes the concepts, entities, and relationships of the objective world in a structured form. It is well suited to organizing, managing, and understanding massive amounts of information, can structure heterogeneous domain knowledge, and is widely used in fields such as medicine, biology, and finance. In the field of crop diseases and insect pests, a single entity often forms relation pairs with many other entities, data are multi-source and heterogeneous, aggregation is poor, utilization is low, and knowledge sharing is difficult. Combining Natural Language Processing (NLP) and text mining technologies, this study addressed data acquisition, ontology construction, knowledge extraction, and knowledge storage, and investigated the construction of a knowledge graph of crop diseases and insect pests based on deep learning. Firstly, the Scrapy crawler framework in Python was used to crawl data from web pages on crop diseases and insect pests, and the data were cleaned and supplemented during preprocessing. Secondly, according to the characteristics of the domain corpus, the Protégé ontology construction tool was used to build the crop diseases and insect pests ontology semi-automatically, predefine the sets of properties and relations, and set the corresponding domains and ranges. Then, based on the ontology, rules were used to extract semi-structured knowledge, and deep learning was used to extract unstructured knowledge. For unstructured knowledge extraction, a text annotation mode adapted to the domain corpus, “Main_Entity+Relation+BIESO” (ME+R+BIESO), was also proposed.
Based on a predefined set of relations, entities and relations were annotated simultaneously, so that each tag carried both entity and relation information, and triples were modeled directly instead of modeling entities and relations separately. The corresponding triples were obtained directly by parsing the tags, which saved at least half of the labeling cost, realized the joint extraction of entities and relations, and solved the problem of overlapping relation extraction. The Bidirectional Encoder Representations from Transformers (BERT)-Bi-directional Long-Short Term Memory (BiLSTM)+Conditional Random Field (CRF) end-to-end model was then evaluated on the crop diseases and insect pests dataset. First, the BERT pre-trained language model encoded the characters and extracted text features, and the generated vectors were used as the input of the BiLSTM layer; the BiLSTM layer then integrated contextual information through bidirectional encoding to predict label sequences effectively; finally, the CRF module decoded the output of the BiLSTM layer, learned the label transition probabilities and constraints during training, and produced the label of each character. The experimental results showed that the precision was 94.06%, the recall was 89.02%, and the F1 value reached 91.34%, much better than the pipeline method and classic joint extraction models such as BiLSTM+CRF and Convolutional Neural Networks (CNN)+BiLSTM+CRF. Joint extraction of entities and relations based on this annotation mode not only improved the efficiency and accuracy of annotation but also solved the problem of overlapping relations in the corpus. Finally, the extracted knowledge was stored in a graph database to realize the visual display of the knowledge graph and deep knowledge mining and reasoning.
Combining deep learning technology to realize the semi-automatic construction of the knowledge graph is of great significance for the detection of crop diseases and insect pests, forecasting and early warning, and the establishment of prevention models in intelligent production systems. The knowledge graph can provide a high-quality knowledge base for crop diseases and insect pests question answering systems, recommendation systems, search engines, and other applications, and can be effectively applied to crop variety selection, pest prevention and control, and fertilization and irrigation.
crops; diseases and pests; models; knowledge graph; deep learning; joint extraction of entity and relation
Wu Saisai, Zhou Ailian, Xie Nengfu, et al. Construction of visualization domain-specific knowledge graph of crop diseases and pests based on deep learning[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2020, 36(24): 177-185. (in Chinese with English abstract) doi:10.11975/j.issn.1002-6819.2020.24.021 http://www.tcsae.org
2020-10-20
2020-11-27
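National Natural Science Foundation of China, General Program (31671588); National Social Science Fund of China, Youth Program (20CTQ019); Jiangxi Modern Agricultural Research Collaborative Innovation Project (JXXTCX201801-03); Innovation Project of the Agricultural Information Institute, Chinese Academy of Agricultural Sciences (CAAS-ASTIP-2016-AII)
Wu Saisai, research interests: agricultural knowledge graphs and intelligent question answering. Email: 82101185233@caas.cn
Zhou Ailian, Associate Research Fellow, research interests: agricultural information management. Email: zhouailian@caas.cn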
10.11975/j.issn.1002-6819.2020.24.021
TP391
A
1002-6819(2020)-24-0177-09