Pig continuous cough sound recognition based on continuous speech recognition technology
Li Xuan1,2, Zhao Jian1,2, Gao Yun1,2, Liu Wanghong2,3, Lei Minggang2,3, Tan Hequn1,2
(1. College of Engineering, Huazhong Agricultural University, Wuhan 430070, China; 2. Cooperative Innovation Center for Sustainable Pig Production, Wuhan 430070, China; 3. College of Animal Science and Technology / College of Veterinary Medicine, Huazhong Agricultural University, Wuhan 430070, China)
Existing pig cough recognition methods based on isolated-word recognition can only distinguish a limited set of sound classes and cannot reflect the continuous coughing of actually diseased pigs. To address this, a method for recognizing pig continuous cough sounds in the pig-house environment was proposed, in which a pig sound acoustic model was built with a bidirectional long short-term memory network combined with connectionist temporal classification (BLSTM-CTC), so as to provide early warning and judgment of pig respiratory diseases. The time-domain characteristics, namely duration and energy, of individual cough samples from Landrace pigs weighing about 75 kg were studied, and a threshold range of 0.24-0.74 s for sample duration and more than 40.15 V²·s for energy was established. Within this threshold range, a single-parameter double-threshold endpoint detection algorithm was applied to 30 h of pig-house sound that had been processed by a multi-taper-spectrum psychoacoustic speech enhancement algorithm, yielding 222 experimental utterances. Sounds in the pig-house environment were divided into pig coughs and non-pig-coughs, which served as the acoustic modeling units for labeling the corpus. Twenty-six-dimensional Mel frequency cepstral coefficients (MFCC) were extracted as the feature parameters of the utterances. The BLSTM network learned the variation patterns of continuous pig sounds, and CTC was used to realize an end-to-end recognition system for continuous pig sounds. In 5-fold cross-validation experiments, the average pig cough recognition rate reached 92.40%, the false recognition rate was 3.55%, and the total recognition rate reached 93.77%. An application test on 1 h of data outside the dataset gave a pig cough recognition rate of 94.23%, a false recognition rate of 9.09%, and a total recognition rate of 93.24%, indicating that the BLSTM-CTC pig cough recognition model based on continuous speech recognition technology is stable and reliable. This study can provide a reference for the recognition of pig continuous coughs and for disease judgment in healthy pig farming.
signal processing; acoustic signal; recognition; pig industry; continuous cough; bidirectional long short-term memory-connectionist temporal classification; acoustic model
At present, pork accounts for the largest share of demand among all animal meats [1]. However, with the scaling-up of the pig industry, respiratory diseases seriously threaten pork quality and yield, and monitoring pig coughs makes it possible to detect respiratory diseases in time [2-4]. The current practice in pig farms is manual on-site monitoring of coughs, which is labor-intensive and cannot guarantee a satisfactory recognition rate. This study therefore investigates the automatic recognition of pig coughs in the pig-house environment based on speech recognition technology, in order to promote the development of healthy pig farming [5].
Regarding the time-domain characteristics of pig coughs, Mitchell et al. [3] studied the coughs of Belgian Landrace × Duroc crossbred pigs and found that the cough durations of sick and healthy pigs were 0.3 and 0.21 s, respectively; Sara et al. [4] studied Landrace × Large White crossbred pigs and found cough durations of 0.67 and 0.43 s for sick and healthy pigs, respectively. Cough duration is thus related to both the health status and the breed of the pig. In addition, Cordeiro et al. [1] exposed pigs to different cold and hot environments and found that the sounds produced by stressed pigs lasted longer than 1.02 s; this threshold was used as the decision criterion of a decision tree algorithm to judge whether a pig was under stress.
Regarding pig cough recognition, Exadaktylos et al. [6] used a fuzzy C-means clustering algorithm and achieved a total recognition rate of 85%. Also based on fuzzy C-means clustering, Hirtum et al. [7] obtained a recognition rate of 92% with an error rate of 21%, and Xu et al. [8] obtained a recognition rate of 83.4%. Guarino et al. [9] used dynamic time warping (DTW) to recognize pig coughs with a recognition rate of 85.5%; Liu et al. [10] used a hidden Markov model (HMM) and reached 80.0%; Li et al. [11] realized pig cough recognition with deep belief nets (DBN), achieving a cough recognition rate of 95.80%, a false recognition rate of 6.83%, and a total recognition rate of 94.29%.
All of the above work was based on isolated-word recognition of pig coughs and considered only a limited set of non-cough sound classes, so the resulting models cannot make a judgment on pig-house sounds they have not learned, which limits their practicality. Moreover, a diseased pig usually coughs several times in succession each time it coughs [12-13], so recognizing continuous cough sounds better reflects the disease status of the pig.
So far, no research on the recognition of continuous pig sounds has been reported at home or abroad, but a growing number of researchers have applied continuous speech recognition technology to the sound recognition of other animals by constructing acoustic models. The acoustic model is an essential component of a continuous speech recognition system, and choosing suitable acoustic modeling units makes it convenient to describe the physical variation patterns of the speech signal. Milone et al. [14] built an acoustic model of cattle ingestive sounds and realized the recognition of continuous cattle ingestive sounds. Similarly, Reby et al. [15] recognized continuous deer calls, Milone et al. [16] recognized continuous sheep ingestive sounds, and Trifa et al. [17] recognized continuous antbird calls.
This paper therefore studies the recognition of pig continuous cough sounds. A bidirectional long short-term memory (BLSTM) network [18-20] is used to learn features of continuous pig sounds, and connectionist temporal classification (CTC) [21] is further used to directly model the alignment between the input continuous sound sequence and its labels, realizing an end-to-end [22-23] recognition system for pig continuous coughs, in order to provide a methodological reference for recognizing continuous coughs and judging diseases in healthy pig farming.
Pig sounds were collected at the university-affiliated breeding pig farm of Huazhong Agricultural University with a Meibo M66 voice recorder (sampling frequency 48 kHz). Collection took place from March to April 2016, a period of marked temperature changes when pig diseases are frequent. The subjects were ten Landrace pigs with a body weight of about 75 kg, housed in two adjacent pens of five pigs each. Veterinary diagnosis showed that five of the ten pigs were infected with respiratory disease and coughed noticeably. The recorder was fixed on the pig-house wall between the two pens at a height of 1.5 m above the ground, and pig-house sounds were recorded continuously 24 h a day. From the recordings, 30 h of signals covering the periods when pig coughing was frequent were retained for the experiments.
The pig-house environment is acoustically complex, and excessive noise adversely affects both the subsequent endpoint detection and the recognition of pig sounds. A multi-taper-spectrum psychoacoustic speech enhancement algorithm [11] was therefore chosen to denoise the continuous pig sounds. Fig.1 compares the waveform of an 8.50 s continuous pig cough sound before and after enhancement; as shown in Fig.1b, the noise in the continuous signal is clearly reduced, and listening tests showed almost no audible distortion of the pig sound samples.
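The enhancement algorithm itself follows reference [11]. As a rough stand-in for readers without that implementation, the sketch below applies plain STFT-based spectral subtraction; this is not the paper's multi-taper psychoacoustic method, and the assumption that the first 0.5 s of the recording is noise-only is purely illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x, fs, noise_dur=0.5, nperseg=1024):
    """Rough denoising sketch: subtract an average noise magnitude spectrum
    estimated from the first `noise_dur` seconds (assumed to be noise only)."""
    f, t, X = stft(x, fs=fs, nperseg=nperseg)
    noise_frames = X[:, t < noise_dur]                # assumed noise-only segment
    noise_mag = np.abs(noise_frames).mean(axis=1, keepdims=True)
    mag = np.maximum(np.abs(X) - noise_mag, 0.0)      # subtract the noise magnitude
    X_clean = mag * np.exp(1j * np.angle(X))          # keep the noisy phase
    _, x_clean = istft(X_clean, fs=fs, nperseg=nperseg)
    return x_clean
```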
The continuous sound collected on the farm contains many kinds of sounds. Pig sounds mainly include coughing, sneezing, eating, screaming, grunting and ear shaking, while environmental noise mainly includes dog barking, metal clanging, exhaust-fan noise and other sounds. These sounds differ markedly from pig coughs in time-domain characteristics such as duration and energy. Inspired by previous studies [3-4] of the duration and energy of pig sounds, the duration and energy of individual pig cough samples in this experiment were investigated.
Fig.1 Waveforms of a continuous pig cough sound before and after speech enhancement
For a pig cough sample y(n) that has been divided into frames, the duration dur is computed as

dur = ((M − 1)·inc + wlen) / Fs    (1)

where M is the total number of frames of the cough sample after framing; wlen is the frame length, set to 25 ms according to the short-time stationarity of the sound signal; inc is the frame shift, set to 40% of the frame length; and Fs is the sampling frequency, Hz.

Let y_m(n) denote the m-th frame of the cough sample y(n) after framing; the energy E of the cough sample y(n) is then computed as

E = Σ_{m=1}^{M} Σ_{n=1}^{wlen} y_m²(n)    (2)

where n is the sample index within a frame.
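To make Eqs. (1) and (2) concrete, the sketch below computes the duration and energy of one detected sample under the framing parameters stated above (25 ms frames, 40% frame shift); the framing helper and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def frame_signal(y, fs, frame_ms=25, shift_ratio=0.4):
    """Split a 1-D signal into overlapping frames (25 ms frames, 40% frame shift)."""
    wlen = int(frame_ms / 1000 * fs)           # frame length in samples
    inc = int(wlen * shift_ratio)              # frame shift in samples
    n_frames = 1 + max(0, (len(y) - wlen) // inc)
    return np.stack([y[m * inc: m * inc + wlen] for m in range(n_frames)]), wlen, inc

def duration_and_energy(y, fs):
    frames, wlen, inc = frame_signal(y, fs)
    M = frames.shape[0]
    dur = ((M - 1) * inc + wlen) / fs          # Eq. (1): duration in seconds
    energy = float(np.sum(frames ** 2))        # Eq. (2): sum of squared samples
    # Note: the paper reports energy in V^2·s; a division by fs would convert
    # this sample-domain sum to that unit (an assumption about the convention used).
    return dur, energy

# A sample is kept only if 0.24 s <= dur <= 0.74 s and its energy exceeds the
# lower threshold of 40.15 from Table 1.
```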
Using the Direct Splitter speech signal processing software, 597 pig cough samples were randomly excerpted from the recorded sounds, and the duration and energy of each sample were computed according to Eqs. (1) and (2); the maximum and minimum values were then obtained, as shown in Table 1.

Table 1 Time-domain characteristics of individual pig cough samples
Characteristic        Minimum    Maximum
Duration/s            0.24       0.74
Energy/(V²·s)         40.15      822.87

As Table 1 shows, the cough duration of the Landrace pigs in this experiment ranged from 0.24 to 0.74 s, similar to previous findings [3-4]. Both the intensity of a cough and the distance between the pig and the recorder affect the energy of a cough sample: a higher-energy sample indicates a more violent cough from a pig closer to the recorder, so only the lower limit of the energy threshold is considered.
Endpoint detection of the pig sound signal means locating the start and end points of every sound sample in the continuous signal and defining the signal between them as a valid signal. The single-parameter double-threshold endpoint detection algorithm based on short-time energy in reference [11] was used to detect valid signals in the 30 h of recorded pig-house sound, and every detected sound sample was checked against the threshold range formed by the duration limits and the lower energy limit in Table 1; samples outside this range were discarded. Finally, 222 experimental utterances were obtained, the longest being 9.14 s and the shortest 3.91 s. The 222 utterances contained 1 145 sound samples in total, of which 751 were pig coughs and 394 were non-pig-coughs.
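The detection step can be sketched as follows: a simplified short-time-energy double-threshold detector whose output would then be filtered by the Table 1 thresholds. The two energy thresholds (multiples of the median frame energy) are illustrative choices, not the exact settings of reference [11].

```python
import numpy as np

def detect_segments(y, fs, frame_ms=25, shift_ratio=0.4, high=5.0, low=2.0):
    """Single-parameter (short-time energy) double-threshold endpoint detection.
    `high`/`low` are multiples of the median frame energy (illustrative values)."""
    wlen = int(frame_ms / 1000 * fs)
    inc = int(wlen * shift_ratio)
    n = 1 + max(0, (len(y) - wlen) // inc)
    e = np.array([np.sum(y[m * inc:m * inc + wlen] ** 2) for m in range(n)])
    th_high, th_low = high * np.median(e), low * np.median(e)
    segments, start = [], None
    for m in range(n):
        if start is None and e[m] > th_high:           # candidate start above the high threshold
            start = m
            while start > 0 and e[start - 1] > th_low: # extend backwards down to the low threshold
                start -= 1
        elif start is not None and e[m] < th_low:      # segment ends below the low threshold
            segments.append((start * inc, m * inc + wlen))
            start = None
    return segments   # (start_sample, end_sample) pairs, to be filtered by the Table 1 thresholds
```

Each returned segment would then be kept only if its duration lies in 0.24-0.74 s and its energy exceeds 40.15 V²·s.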
With the help of a veterinarian, the 222 utterances were manually labeled to obtain the corresponding label sequences, with the acoustic modeling units pig cough and non-pig-cough denoted by the symbols 'k' and 'n', respectively.
Mel frequency cepstral coefficient (MFCC) [24-25] analysis is based on the hearing mechanism of the human ear: the linear spectrum is mapped onto the nonlinear Mel spectrum, and the spectral characteristics of the sound are analyzed according to the results of human auditory experiments. After framing and windowing the continuous pig sound utterance, its spectral energy is computed with the fast Fourier transform and passed through a Mel filter bank; taking the logarithm of the filter outputs gives the Mel filter energies, whose discrete cosine transform yields the 13-dimensional MFCC reflecting the static characteristics of the pig sound. Finally, the first-order difference coefficients, which reflect the dynamic characteristics, are appended to obtain the 26-dimensional MFCC. The extraction procedure is shown in Fig.2.
Fig.2 Extraction procedure of MFCC feature parameters
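The feature extraction can be reproduced, for instance, with librosa; the 25 ms window and 10 ms hop below follow the framing settings stated earlier, while the Mel filter-bank size is left at the library default and is an assumption.

```python
import numpy as np
import librosa

def extract_mfcc26(y, sr=48000):
    """13 static MFCCs plus their first-order deltas -> 26-dimensional features per frame."""
    n_fft = int(0.025 * sr)                       # 25 ms frame
    hop = int(0.010 * sr)                         # 40% of the frame length = 10 ms shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    delta = librosa.feature.delta(mfcc)           # first-order difference coefficients
    return np.vstack([mfcc, delta]).T             # shape: (frames, 26)
```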
Unlike feedforward neural networks [26-27], in which there are no connections among hidden-layer neurons, the recurrent neural network (RNN) allows self-feedback paths among hidden neurons. The hidden-layer input of an RNN includes not only the pig sound features from the input layer but also the hidden-layer output at the previous time step; this structure lets the model memorize earlier information and use it when computing the current output. Although the RNN is in theory well suited to modeling sequence problems such as speech, it suffers from exploding and vanishing gradients as the sequence grows longer [20,28]. The LSTM is a special RNN that introduces memory cells and a gating mechanism to learn historical information and control the rate at which information accumulates, which alleviates these problems to some extent. The LSTM module unit is shown in Fig.3.
As Fig.3 shows, an LSTM unit consists of four parts: the memory cell, the input gate, the output gate and the forget gate. In an LSTM network the memory cells are connected to each other, and the three nonlinear gating units regulate the information entering and leaving the memory cell (dashed connections in Fig.3). The input gate controls which information is written into the memory cell: it reads the previous memory-cell output h_{t−1} and the current input x_t and outputs a value i_t between 0 and 1 representing the percentage of information to admit, where 0 means discarding everything and 1 means admitting everything. i_t is computed as

i_t = σ(W_i·[h_{t−1}, x_t] + b_i)    (3)

where σ is the sigmoid function, W_i is the input-gate weight, and b_i is the input-gate bias.
Note: i_t is the percentage of the input information of the input gate; f_t is the percentage of the forgotten information of the forget gate; o_t is the state value of the output gate; the black solid circle indicates the multiplication operation.
Similarly, the forget gate controls which information of the previous memory-cell state c_{t−1} should be forgotten; f_t denotes the percentage of information to forget and is computed as

f_t = σ(W_f·[h_{t−1}, x_t] + b_f)    (4)

where W_f is the forget-gate weight and b_f is the forget-gate bias.
The memory-cell state c_t at the current time step is then

c_t = f_t c_{t−1} + i_t tanh(W_c·[h_{t−1}, x_t] + b_c)    (5)

where tanh is the hyperbolic tangent function, W_c is the memory-cell weight, and b_c is the memory-cell bias.
The output-gate value o_t controls how much information the memory cell outputs at this time step:

o_t = σ(W_o·[h_{t−1}, x_t] + b_o)    (6)

h_t = o_t tanh(c_t)    (7)

where W_o is the output-gate weight, b_o is the output-gate bias, and h_t is the memory-cell output at this time step.
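A compact numerical sketch of Eqs. (3)-(7) is given below; the weight shapes and random initialization are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Eqs. (3)-(7).
    W maps the concatenation [h_{t-1}, x_t] to each gate; b holds the corresponding biases."""
    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W['i'] @ z + b['i'])                          # Eq. (3): input gate
    f_t = sigmoid(W['f'] @ z + b['f'])                          # Eq. (4): forget gate
    c_t = f_t * c_prev + i_t * np.tanh(W['c'] @ z + b['c'])     # Eq. (5): memory-cell state
    o_t = sigmoid(W['o'] @ z + b['o'])                          # Eq. (6): output gate
    h_t = o_t * np.tanh(c_t)                                    # Eq. (7): memory-cell output
    return h_t, c_t

# Illustrative shapes: 26-dim MFCC input, 300 hidden units (as in the experiments below)
rng = np.random.default_rng(0)
n_in, n_hid = 26, 300
W = {k: rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for k in 'ifco'}
b = {k: np.zeros(n_hid) for k in 'ifco'}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
```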
The conventional LSTM is unrolled in only one direction and can therefore use only historical information, whereas recognizing pig continuous coughs means recognizing the whole sound sequence: the features of the current frame are related not only to the preceding frames but also to the subsequent ones. Two independent LSTMs are therefore used to process the forward and backward [29-30] continuous pig sound sequences respectively (Fig.4), and their outputs are combined and passed to the next layer of the network, fully exploiting contextual temporal information for acoustic modeling of continuous pig sounds.
In a continuous speech recognition system, the CTC (connectionist temporal classification) layer exploits the strong sequence-learning ability of the BLSTM to model the mapping from input speech features to output labels directly [21,31], without requiring a frame-level alignment between the feature sequence and the label sequence, thus enabling end-to-end training of the acoustic model.
Note: x_{t−1}, x_t and x_{t+1} are the input-layer inputs at times t−1, t and t+1; c_{t−1}, c_t and c_{t+1} are the states of the hidden-layer memory cells at times t−1, t and t+1; h_{t−1}, h_t and h_{t+1} are the memory-cell outputs at times t−1, t and t+1; the superscripts → and ← denote the forward and backward passes, respectively.
The output of the BLSTM model is the input to the CTC layer, and the number of output neurons equals the number of possible labels, i.e. the number of acoustic modeling units, plus an extra blank label used to model silence in the output. In this system the number of labels K is 3, namely 'k', 'n' and '_', where '_' is the blank label representing the silence model. The BLSTM output can then describe the label probability distribution of the input continuous speech. Given a continuous input utterance x of length T, the probability that the BLSTM outputs label index k (k ∈ {1,2,3}) at time t is

p(l_t = k | x) = exp(y_t^k) / Σ_{k'=1}^{K} exp(y_t^{k'})    (8)

where y_t^k is the value output by the BLSTM network for label k at time t, i.e. the output of the corresponding output-layer neuron; l_t is the label index output by the BLSTM model at time t; and K is the number of labels.
Let the CTC output sequence be π; π is then a sequence of length T composed of T labels, and multiplying the probabilities of the T time steps gives the probability of π:

p(π | x) = Π_{t=1}^{T} p(l_t = π_t | x)    (9)

In fact, each true label sequence l corresponds to several CTC output sequences π. Define the mapping l = B(π) from π to l, which converts π into l by removing repeated labels and blank labels [23]. For example, for a continuous pig sound signal with T = 8 whose true label sequence is (n, k, n, k), the corresponding CTC output sequence may be (n, _, k, k, k, n, k, k) or (n, n, _, k, n, k, _, _), among others. Hence

p(l | x) = Σ_{π ∈ B^{−1}(l)} p(π | x)    (10)

This expression can be computed and differentiated by dynamic programming with the forward-backward algorithm [18]. If l* is the label sequence of the continuous input utterance, the goal of CTC training is to maximize the probability that the BLSTM network outputs l*, i.e. to minimize the negative logarithm of that probability, so the loss function is defined as

L = −ln p(l* | x)    (11)
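To illustrate the mapping B and Eqs. (9)-(11), the brute-force sketch below enumerates all paths of a tiny example (feasible only for very short T); real training uses the forward-backward algorithm or a library CTC loss instead. The per-frame probability matrix here is random and purely illustrative.

```python
import itertools
import numpy as np

LABELS = ['k', 'n', '_']        # pig cough, non-pig-cough, blank (silence)

def collapse(path):
    """Mapping B: merge repeated labels, then drop blanks."""
    merged = [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]
    return tuple(p for p in merged if p != '_')

def prob_of_labelling(probs, target):
    """Brute-force p(l | x): sum of p(pi | x) over all paths pi with B(pi) = l (Eqs. (9)-(10))."""
    T = probs.shape[0]
    total = 0.0
    for path in itertools.product(range(len(LABELS)), repeat=T):
        if collapse(tuple(LABELS[i] for i in path)) == target:
            total += np.prod(probs[np.arange(T), path])    # Eq. (9): product over time steps
    return total

# Toy example: T = 5 frames, per-frame label probabilities (each row sums to 1, as in Eq. (8))
rng = np.random.default_rng(1)
probs = rng.dirichlet(np.ones(3), size=5)
p = prob_of_labelling(probs, target=('n', 'k'))
loss = -np.log(p)               # Eq. (11): CTC loss for this labelling
```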
Fig.5 shows the pig continuous cough recognition system based on the BLSTM-CTC acoustic model. The feature parameters of the continuous pig sound are first fed into the BLSTM, whose strong acoustic modeling capability is used to learn and process the input speech features; the network then outputs the label probability distribution corresponding to the features of the continuous utterance. This distribution is the input of the CTC layer, which, together with the label sequence of the original utterance, computes the model loss, and the whole acoustic model is trained accordingly.
Fig.5 Block diagram of the pig continuous cough sound recognition system
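A minimal PyTorch sketch of this architecture is given below, assuming 26-dimensional MFCC input, 300 hidden units per direction and 3 output labels as stated in the experiments; the single fully connected output layer, the Adam optimizer and the random batch are assumptions about details the paper does not spell out.

```python
import torch
import torch.nn as nn

class BLSTMCTC(nn.Module):
    """BLSTM acoustic model with a CTC output layer (3 labels: blank '_', 'k', 'n')."""
    def __init__(self, n_feat=26, n_hidden=300, n_labels=3):
        super().__init__()
        self.blstm = nn.LSTM(n_feat, n_hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * n_hidden, n_labels)     # forward + backward outputs combined

    def forward(self, x):                               # x: (batch, time, 26)
        h, _ = self.blstm(x)
        return self.fc(h).log_softmax(dim=-1)           # per-frame label log-probabilities

model = BLSTMCTC()
ctc_loss = nn.CTCLoss(blank=0)                          # index 0 reserved for the blank label '_'
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)   # learning rate from the experiments

# One illustrative training step on a random batch
x = torch.randn(4, 200, 26)                             # 4 utterances, 200 frames each
targets = torch.randint(1, 3, (4, 6))                   # label indices 1 ('k') and 2 ('n')
log_probs = model(x).permute(1, 0, 2)                   # CTCLoss expects (time, batch, labels)
loss = ctc_loss(log_probs, targets,
                input_lengths=torch.full((4,), 200, dtype=torch.long),
                target_lengths=torch.full((4,), 6, dtype=torch.long))
loss.backward()
optimizer.step()
```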
A trained pig continuous cough recognition system can be applied to the recognition of continuous pig sound utterances. During testing the system outputs a T × K probability matrix describing the label probability distribution of the input frames at all time steps, and decoding with the beam search algorithm [32] yields the output sequence of maximum probability, which is the recognition result.
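As a simplified stand-in for the beam search decoder of reference [32], the greedy best-path decoding below picks the most probable label per frame and then applies the mapping B; it is not equivalent to beam search but shows the decoding idea.

```python
import numpy as np

LABELS = ['_', 'k', 'n']           # blank, pig cough, non-pig-cough

def best_path_decode(prob_matrix):
    """Greedy decoding of a T x K probability matrix: argmax per frame,
    merge repeats, drop blanks (the mapping B)."""
    best = np.argmax(prob_matrix, axis=1)
    out, prev = [], None
    for idx in best:
        if idx != prev and LABELS[idx] != '_':
            out.append(LABELS[idx])
        prev = idx
    return out

# e.g. a frame-wise argmax sequence '_ k k _ n n k' decodes to ['k', 'n', 'k']
```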
The performance of the BLSTM-CTC pig continuous cough recognition model was evaluated with 5-fold cross-validation: the 222 experimental utterances were divided into five mutually exclusive subsets of approximately equal size, and in each round the union of four subsets was used as the training set and the remaining subset as the test set, giving five training/test splits and therefore five rounds of training and testing.
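The data split can be reproduced, for instance, with scikit-learn's KFold; the shuffling seed is an assumption.

```python
from sklearn.model_selection import KFold

utterances = list(range(222))                      # indices of the 222 labeled utterances
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(utterances), start=1):
    # train a BLSTM-CTC model on train_idx and evaluate it on test_idx
    print(f"fold {fold}: {len(train_idx)} training / {len(test_idx)} test utterances")
```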
In continuous speech recognition systems whose acoustic modeling units are recognition primitives, the word error rate (WER) [33] is generally used as the evaluation metric. The recognition result is compared with the label sequence of the test utterance, and the sum of the numbers of substitution errors (S), insertion errors (I) and deletion errors (D) is divided by the total number of samples N in the test corpus:

WER = (S + I + D) / N × 100%    (12)

The three error types are illustrated by the following example: for a label sequence (n, , k, k, n, n) and a recognition result (n, k, n, _), comparison shows that the first pig cough in the recognition result is an insertion error, the second non-pig-cough is a substitution error, and the last non-pig-cough in the label sequence, which was not recognized, is a deletion error. Since this study focuses on recognizing pig continuous coughs, only the substitution errors of non-pig-coughs are considered during recognition, while their insertion and deletion errors are ignored. A modified WER is therefore used to evaluate the recognition system, with the pig cough recognition rate R_k, the false recognition rate R_e and the total recognition rate R_total computed as

R_k = (N_k − S_k − I_k − D_k) / N_k × 100%    (13)

R_e = S_n / N_n × 100%    (14)

R_total = (N_k + N_n − S_k − I_k − D_k − S_n) / (N_k + N_n) × 100%    (15)

where S_k, I_k, D_k, N_k, S_n and N_n denote, respectively, the number of pig coughs recognized as non-pig-coughs, the number of inserted pig coughs, the number of deleted pig coughs, the number of pig coughs in the test set, the number of non-pig-coughs recognized as pig coughs, and the number of non-pig-coughs in the test set.
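The sketch below implements Eqs. (13)-(15) and checks them against the error counts reported in the application test further below (1 cough substitution, 1 insertion, 1 deletion, 2 non-cough substitutions, with 52 coughs and 22 non-coughs in the test set).

```python
def recognition_rates(S_k, I_k, D_k, N_k, S_n, N_n):
    """Pig cough recognition rate, false recognition rate, total recognition rate (Eqs. (13)-(15))."""
    r_cough = (N_k - S_k - I_k - D_k) / N_k * 100
    r_false = S_n / N_n * 100
    r_total = (N_k + N_n - S_k - I_k - D_k - S_n) / (N_k + N_n) * 100
    return r_cough, r_false, r_total

# Application-test counts from the section below: yields approximately 94.23%, 9.09%, 93.24%
print(recognition_rates(S_k=1, I_k=1, D_k=1, N_k=52, S_n=2, N_n=22))
```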
After repeated comparative trials, the numbers of hidden-layer neurons in the forward and backward passes of the BLSTM and of neurons in the fully connected layer were all set to 300, the learning rate was set to 0.001, and the maximum number of training iterations was 200. The results of the 5-fold cross-validation are shown in Table 2.
Table 2 Results of 5-fold cross-validation of pig continuous cough recognition
The five groups of results in Table 2 show that the pig cough recognition rate and the total recognition rate of every fold reached 90.00% or above, and the false recognition rate was kept within 8.00%. The averages over the five folds were a pig cough recognition rate of 92.40%, a false recognition rate of 3.55% and a total recognition rate of 93.77%, indicating that the pig continuous cough recognition system based on the BLSTM-CTC acoustic model is stable and effective.
To test the algorithm in application, another 1 h of pig-house recordings outside the dataset was taken as the test material. After speech enhancement, the threshold-based endpoint detection algorithm yielded a test set of 14 utterances, the longest 8.51 s and the shortest 3.56 s, containing 74 sound samples in total (52 pig coughs and 22 non-pig-coughs). The 14 test utterances were then manually labeled at the sentence level and their feature parameters were extracted, and the model obtained from the second group in Table 2 was used for the application test. In the test, the pig coughs incurred one substitution error, one insertion error and one deletion error, and the non-pig-coughs incurred two substitution errors, giving a pig cough recognition rate of 94.23%, a false recognition rate of 9.09% and a total recognition rate of 93.24%. The application test shows that the pig continuous cough recognition model based on continuous speech recognition technology also achieves satisfactory recognition of samples outside the training and test datasets, and that the model is stable and reliable.
This paper proposes a method for recognizing pig continuous cough sounds in the pig-house environment. Compared with isolated-word recognition technology, the method can recognize more kinds of pig-house sounds, better reflects the disease status of pigs, and its corpus processing, feature extraction and recognition procedures are simpler.
1) An acoustic model of pig sounds was proposed and built with a bidirectional long short-term memory network, which has strong sequence-processing ability, together with a connectionist temporal classification layer. With pig cough and non-pig-cough as the acoustic modeling units, the continuous corpus was labeled and an end-to-end pig continuous cough recognition system was realized.
2) In the 5-fold cross-validation experiments, with the numbers of hidden-layer neurons in the forward and backward passes of the BLSTM and of the fully connected layer all set to 300 and the learning rate set to 0.001, the average pig cough recognition rate was 92.40%, the false recognition rate was 3.55% and the total recognition rate was 93.77%. In the application test on 1 h of data outside the dataset, the pig cough recognition rate was 94.23%, the false recognition rate was 9.09% and the total recognition rate was 93.24%, indicating that the BLSTM-CTC pig continuous cough recognition model based on continuous speech recognition technology is stable and reliable.
[1] Cordeiro A, N??s I, Leit?o F, et al. Use of vocalisation to identify sex, age, and distress in pig production[J]. Biosystems Engineering, 2018, 173:57-63.
[2] Silva M, Ferrari S, Costa A, et al. Cough localization for the detection of respiratory diseases in pig houses[J]. Computers and Electronics in Agriculture, 2008, 64(2): 286-292.
[3] Mitchell S, Vasileios E, Sara F, et al. The influence of respiratory disease on the energy envelope dynamics of pig cough sounds[J]. Computers and Electronics in Agriculture, 2009, 69(1): 80-85.
[4] Sara F, Mitchell S, Marcella G, et al. Cough sound analysis to identify respiratory infection in pigs[J]. Computers and Electronics in Agriculture, 2009, 64(2): 318-325.
[5] 何東健,劉冬,趙凱旋. 精準(zhǔn)畜牧業(yè)中動(dòng)物信息智能感知與行為檢測(cè)研究進(jìn)展[J]. 農(nóng)業(yè)機(jī)械學(xué)報(bào),2016,47(5):231-244. He Dongjian, Liu Dong, Zhao Kaixuan. Review of perceiving animal information and behavior in precision livestock farming[J]. Transactions of the Chinese Society for Agricultural Machinery, 2016, 47(5): 231-244. (in Chinese with English abstract)
[6] Exadaktylos V, Silva M, Aerts J M, et al. Real-time recognition of sick pig cough sounds[J]. Computers and Electronics in Agriculture, 2008, 63(2): 207-214.
[7] Hirtum A V, Berckmans D. Fuzzy approach for improved recognition of citric acid induced piglet coughing from continuous registration[J]. Journal of Sound and Vibration, 2003, 266(3): 677-686.
[8] 徐亞妮,沈明霞,閆麗,等. 待產(chǎn)梅山母豬咳嗽聲識(shí)別算法的研究[J]. 南京農(nóng)業(yè)大學(xué)學(xué)報(bào),2016,39(4):681-687. Xu Yani, Shen Mingxia, Yan Li, et al. Research of predelivery meishan sow cough recognition algorithm[J]. Journal of Nanjing Agricultural University, 2016, 39(4): 681-687. (in Chinese with English abstract)
[9] Guarino M, Jans P, Costa A, et al. Field test of algorithm for automatic cough detection in pig house[J]. Computers and Electronics in Agriculture, 2008, 62(1): 22-28.
[10] 劉振宇,赫曉燕,桑靜,等. 基于隱馬爾可夫模型的豬咳嗽聲音識(shí)別的研究[C]//中國(guó)畜牧獸醫(yī)學(xué)會(huì)信息技術(shù)分會(huì)第十屆學(xué)術(shù)研討會(huì)論文集,2015:99-104.
[11] 黎煊,趙建,高云,等. 基于深度信念網(wǎng)絡(luò)的豬咳嗽聲識(shí)別[J]. 農(nóng)業(yè)機(jī)械學(xué)報(bào),2018,49(3):179-186. Li Xuan, Zhao Jian, Gao Yun, et al. Recognitional of pig cough sound based on deep belief nets[J]. Transactions of the Chinese Society for Agricultural Machinery, 2018, 49(3): 179-186. (in Chinese with English abstract)
[12] 陳升科. 從中獸醫(yī)學(xué)角度分析豬咳嗽氣喘及治療方案[J]. 中國(guó)動(dòng)物保健,2015,17(3):22-23.
[13] 陳潤(rùn)生. 豬咳嗽疾病的鑒別診斷[J]. 現(xiàn)代農(nóng)業(yè)科技,2016(14):269-270.
[14] Milone D H, Galli J R, Cangianoc C A, et al. Automatic recognition of ingestive sounds of cattle based on hidden markov models[J]. Computers and Electronics in Agriculture, 2012, 87(3): 51-55.
[15] Reby D, Andreobrecht R, Galinier A, et al. Cepstral coefficients and hidden markov models reveal idiosyncratic voice characteristics in red deer (cervus elaphus) stags[J]. Journal of the Acoustical Society of America, 2006, 120(6): 4080-4089.
[16] Milone D H, Rufiner H L, Galli J R, et al. Computational method for segmentation and classification of ingestive sounds in sheep[J]. Computers and Electronics in Agriculture, 2009, 65(2): 228-237.
[17] Trifa V M, Kirschel A N, Taylor C E, et al. Automated species recognition of antbirds in a mexican rainforest using hidden markov models[J]. Journal of the Acoustical Society of America, 2008, 123(4): 2424-2431.
[18] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[19] 陳英義,程倩倩,方曉敏,等. 主成分分析和長(zhǎng)短時(shí)記憶神經(jīng)網(wǎng)絡(luò)預(yù)測(cè)水產(chǎn)養(yǎng)殖水體溶解氧[J]. 農(nóng)業(yè)工程學(xué)報(bào),2018,34(17):183-191. Chen Yingyi, Cheng Qianqian, Fang Xiaomin, et al. Principal component analysis and long short-term memory neural network for predicting dissolved oxygen in water for aquaculture[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE),2018, 34(17): 183-191. (in Chinese with English abstract)
[20] Bengio Y, Frasconi P, Simard P. The problem of learning long-term dependencies in recurrent networks[C]// IEEE International Conference on Neural Networks. IEEE, 1993: 1183-1188.
[21] 王智超,張鵬遠(yuǎn),潘接林,等. 連接時(shí)序分類準(zhǔn)則聲學(xué)建模方法優(yōu)化[J]. 聲學(xué)學(xué)報(bào),2018,43(6): 984-990.
Wang Zhichao, Zhang Pengyuan, Pan Jielin, et al. Optimization of acoustic modeling method with connectionist temporal classification criterion[J]. Acta Acustica, 2018,43(6): 984-990. (in Chinese with English abstract)
[22] Bahdanau D, Chorowski J, Serdyuk D, et al. End-to-end attention-based large vocabulary speech recognition[C]// IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2016: 4945-4949.
[23] Graves A, Jaitly N. Towards end-to-end speech recognition with recurrent neural networks[C]// International Conference on Machine Learning, 2014: 1764-1772.
[24] Chia A O, Hariharan M, Yaacob S, et al. Classification of speech dysfluencies with mfcc and lpcc features[J]. Expert Systems with Applications, 2012, 39(2): 2157-2165.
[25] 李志忠,騰光輝. 基于改進(jìn)MFCC的家禽發(fā)聲特征提取方法[J]. 農(nóng)業(yè)工程學(xué)報(bào),2008,24(11):202-205.
Li Zhizhong, Teng Guanghui. Feature extraction for poultry vocalization recognition based on improved MFCC[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2008, 24(11): 202-205. (in Chinese with English abstract)
[26] Hinton G E. Learning multiple layers of representation[J]. Trends in Cognitive Sciences, 2007, 11(10): 428-434.
[27] Lecun Y, Bengio Y, Hinton G E. Deep learning[J]. Nature, 2015, 521(7553): 436-444.
[28] 趙明,杜回芳,董翠翠,等. 基于word2vec和LSTM的飲食健康文本分類研究[J]. 農(nóng)業(yè)機(jī)械學(xué)報(bào),2017,48(10):202-208. Zhao Ming, Du Huifang, Dong Cuicui, et al. Diet health text classification based on word2vec and LSTM[J]. Transactions of the Chinese Society for Agricultural Machinery, 2017, 48(10): 202-208. (in Chinese with English abstract)
[29] Schuster M, Paliwal K K. Bidirectional recurrent neural networks[J]. IEEE Transactions on Signal Processing, 1997, 45(11): 2673-2681.
[30] Chen K, Huo Q. Training deep bidirectional LSTM acoustic model for LVCSR by a context-sensitive-chunk BPTT approach[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(7): 1185-1193.
[31] Woellmer M, Eyben F, Schuller B, et al. Spoken term detection with connectionist temporal classification: A novel hybrid CTC-DBN decoder[C]// International Conference on Acoustics Speech and Signal Processing (ICASSP). IEEE, 2010: 5274-5277.
[32] Graves A, Gomez F. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks[C]// International Conference on Machine Learning. ACM, 2006: 369-376.
[33] Abu-Khzam F N, Fernau H, Langston M A, et al. A fixed-parameter algorithm for string-to-string correction[C]// Sixteenth Symposium on Computing: the Australasian Theory(CATS 2010). Australian Computer Society, 2010: 31-37.
Pig continuous cough sound recognition based on continuous speech recognition technology
Li Xuan1,2, Zhao Jian1,2, Gao Yun1,2, Liu Wanghong2,3, Lei Minggang2,3, Tan Hequn1,2
(1. College of Engineering, Huazhong Agricultural University, Wuhan 430070, China; 2. Cooperative Innovation Center for Sustainable Pig Production, Wuhan 430070, China; 3. College of Animal Science and Technology / College of Veterinary Medicine, Huazhong Agricultural University, Wuhan 430070, China)
Cough is one of the most frequent symptoms in the early stage of pig respiratory diseases, so it is possible to monitor and diagnose the diseases of pigs by detecting their coughs. The existing methods for pig cough recognition are based on isolated word recognition technology, which cannot recognize samples that have not been trained or learned; another drawback is that these methods target isolated coughs, while the coughs of sick pigs are usually continuous. This paper intends to realize the recognition of pig continuous cough sound based on continuous speech recognition technology. Ten Landrace pigs, with a body weight of about 75 kg, were used as sound collection objects, and pig sounds were collected in pig farms during late winter and early spring when the respiratory diseases of pigs were prevalent. The sound collection devices worked continuously all day. By selecting the frequent coughing phases in the collected signal, a total of 30 h of pig farm sound signals was obtained as the experimental corpus. Firstly, the sound signals were denoised by a speech enhancement algorithm based on a psychoacoustic model. Then the time-domain characteristics, including duration and energy of individual coughs, were studied, and it was found that the duration of pig coughs ranged from 0.24 to 0.74 s and the energy ranged from 40.15 to 822.87 V²·s. So the threshold of the sound samples was set with the duration limits and the lower energy value of individual coughs. Based on this threshold range, the speech endpoint detection algorithm based on short-time energy was used to detect the 30 h of pig farm sound signals which had been preprocessed by the speech enhancement algorithm, and 222 experimental sentences were obtained. The longest was 9.14 s and the shortest was 3.91 s. The 222 sentences contained a total of 1 145 sound samples, including 751 pig coughs and 394 non-pig coughs. Sounds in the pig farm environment, including the cough, sneeze, eating, scream, hum and ear-shaking sounds of pigs and the sounds of dogs, metal clanging and some other background noise, were divided into pig cough and non-pig cough, which were chosen as the acoustic modeling units. The labels of the experimental sentences were obtained with the help of experts. Then the 13-dimensional Mel frequency cepstral coefficients (MFCC) reflecting the static characteristics of pig sound were extracted, and the first-order differential coefficients reflecting the dynamic characteristics of pig sound were added to obtain the 26-dimensional MFCC, which were used as the characteristic parameters of the experimental sentences. Finally, the bidirectional long short-term memory-connectionist temporal classification (BLSTM-CTC) model was selected to recognize the pig continuous sounds; specifically, the BLSTM network had excellent feature learning ability for continuous pig sounds, and the CTC could directly model the alignment of the input continuous pig sound sequence and its labels. Through the 5-fold cross-validation experiment and analysis, the numbers of hidden layer neurons in the BLSTM forward propagation process, the backward propagation process, and the fully connected layer were all set to 300, and the learning rate was set to 0.001. The average recognition rate, error recognition rate and total recognition rate of the results of the 5 groups were 92.40%, 3.55% and 93.77%, respectively.
Furthermore, the algorithm application test was carried out with another 1 h of data, and the recognition rate reached 94.23%, the error recognition rate was 9.09%, and the total recognition rate was 93.24%. This indicates that the pig cough sound recognition model based on continuous speech recognition technology is stable and reliable. This paper provides a reference for the recognition and disease judgment of pig continuous cough sounds during the healthy breeding of pigs.
signal processing; acoustic signal; recognition; pig industry; continuous cough; bidirectional long short-term memory-connectionist temporal classification; acoustic model
Received date: 2018-11-09
Revised date: 2019-01-13
Foundation items: National Key Research and Development Program of China (2018YFD0500700); Independent Scientific and Technological Innovation Fund of Huazhong Agricultural University; Da Bei Nong Young Scholar Promotion Program of Huazhong Agricultural University (2017DBN005); China Agriculture Research System (CARS-36); National Undergraduate Innovation and Entrepreneurship Training Program (201810504074)
Biography: Li Xuan, associate professor, Ph.D., engaged in research on intelligent perception of pig information and behavior recognition. Email: lx@mail.hzau.edu.cn
doi: 10.11975/j.issn.1002-6819.2019.06.021
CLC number: TN912.34
Document code: A
Article ID: 1002-6819(2019)-06-0174-07
黎 煊,趙 建,高 云,劉望宏,雷明剛,譚鶴群. 基于連續(xù)語音識(shí)別技術(shù)的豬連續(xù)咳嗽聲識(shí)別[J]. 農(nóng)業(yè)工程學(xué)報(bào),2019,35(6):174-180. doi:10.11975/j.issn.1002-6819.2019.06.021 http://www.tcsae.org
Li Xuan, Zhao Jian, Gao Yun, Liu Wanghong, Lei Minggang, Tan Hequn. Pig continuous cough sound recognition based on continuous speech recognition technology[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2019, 35(6): 174-180. (in Chinese with English abstract) doi:10.11975/j.issn.1002-6819.2019.06.021 http://www.tcsae.org