CHENG Yusheng, QIAN Kun, WANG Yibin, ZHAO Dawei
Abstract: The existing Improved Multi-label Lazy Learning Approach (IMLLA) considers only the label correlation information of neighbors and ignores the influence of similarity when exploiting neighbor labels, which may reduce the robustness of the algorithm. To address this problem, the firefly method is introduced to combine similarity information with label information, and a Multi-label Lazy Learning Approach based on the FireFly method (FFMLLA) is proposed. First, the Minkowski distance is used to measure the similarity between samples and thereby find the nearest neighbors; then, the label count vector is improved by combining the neighbor labels with the firefly method; finally, Singular Value Decomposition (SVD) and the kernel Extreme Learning Machine (ELM) are used for linear classification. The proposed algorithm considers both label information and similarity information, thereby improving robustness. Experimental results show that the proposed algorithm has certain advantages over other multi-label learning algorithms, and statistical hypothesis testing and stability analysis are used to further demonstrate its rationality and effectiveness.
Key words: multi-label learning; firefly method; label correlation; Improved Multi-label Lazy Learning Approach (IMLLA); Extreme Learning Machine (ELM)
CLC number: TP181
Document code: A
0 Introduction
Multi-label learning [1] is a widely applied learning paradigm and one of the important research hotspots in machine learning. In traditional single-label learning, each object is associated with only a single label; however, objects in the real world are often polysemous. For example, an article may belong to several topics at once, such as the military, physical education, and sports [2].
Multi-label learning is one of the frameworks for handling real-world objects with rich semantics, and its research results have been widely applied in many fields, including text classification [3], genetic engineering [4], image recognition [5-6], Web data mining [7], and automatic video annotation [8]. Many scholars have proposed learning algorithms for multi-label classification, such as the BR (Binary Relevance) and LP (Label Powerset) algorithms [9], which solve the multi-label problem by increasing the number of classifiers or the kinds of labels, but this affects classifier efficiency to some extent. The classic ML-KNN (Multi-Label K Nearest Neighbors) algorithm [10] uses Maximum A Posteriori (MAP) estimation to solve the multi-label prediction problem; although it improves classifier performance, it also increases computational complexity.
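As a concrete illustration of the problem-transformation strategy mentioned above, the following is a minimal Binary Relevance sketch in pure Python: the multi-label task is decomposed into Q independent binary problems, one per label. The 1-nearest-neighbour base learner and the toy data are hypothetical choices for illustration, not the base classifiers used in [9].

```python
# Minimal Binary Relevance (BR) sketch: one independent binary
# problem per label, each solved by a simple 1-NN base learner.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def one_nn_predict(X_train, y_train, x):
    """Predict the binary label of the closest training sample."""
    i = min(range(len(X_train)), key=lambda j: euclidean(X_train[j], x))
    return y_train[i]

def br_predict(X_train, Y_train, x):
    """Binary Relevance: query each per-label binary problem independently."""
    Q = len(Y_train[0])
    return [one_nn_predict(X_train, [row[q] for row in Y_train], x)
            for q in range(Q)]

X_train = [[0.1, 0.9], [0.9, 0.1], [0.5, 0.5]]
Y_train = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]   # toy label matrix, Q = 3
print(br_predict(X_train, Y_train, [0.2, 0.8]))  # → [1, 0, 1]
```

Note the design cost BR pays: because each of the Q problems is solved in isolation, any correlation between labels is discarded, which is exactly the limitation the correlation-aware algorithms discussed next try to address.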
In addition, the labels attached to real-world samples are not mutually independent but correlated; however, most current multi-label learning algorithms do not fully consider label correlation. Making full use of the correlation information between labels is therefore of great significance for building multi-label classification algorithms with strong generalization performance [11].
Targeting label correlation, many scholars have proposed algorithms with good results. For example, the Rank-SVM (Ranking Support Vector Machine) algorithm [12] adopts a maximum-margin strategy to adapt to multi-label learning and builds an SVM (Support Vector Machine) multi-label classifier with a BR-like strategy, but its time consumption is large. Because the Extreme Learning Machine (ELM) [13] trains quickly, the MLRKELM (Multi-label algorithm of Regression Kernel Extreme Learning Machine) algorithm [14] uses a kernel ELM in regression mode, shortening the running time, and the MLASRKELM (MLRKELM with Association Rules) algorithm [14] introduces association rules on top of MLRKELM to retain the information between labels. Also addressing label correlation, Zhang Minling [15] proposed a new multi-label lazy learning algorithm, IMLLA (Improved Multi-label Lazy Learning Approach), on the basis of ML-KNN. IMLLA uses the label information of the nearest neighbors to build a label count vector for classification; in constructing this vector it treats the labels of all neighbors as equally important. However, the greater the similarity between a neighbor and the sample, the more important that neighbor's labels should be; because IMLLA does not consider neighbor similarity information, its generalization ability is reduced.
Building on the above research, this paper addresses the sample distribution problem by introducing the firefly method [16-17] into IMLLA. As a metaheuristic constructed by imitating the luminous behavior of fireflies in nature, the firefly method is simple to operate, easy to parallelize, and robust. The firefly method is therefore used to fuse the label information of the neighbors with their similarity information so as to improve the robustness of the algorithm, yielding a Multi-label Lazy Learning Approach based on the FireFly method (FFMLLA). The firefly method computes the attractiveness between a sample and each of its neighbors according to their similarity: the greater the attractiveness, the more important that neighbor's labels. The attractiveness is then combined with the label information as a weight to reconstruct the label count vector of IMLLA. Since the Extreme Learning Machine proposed by Huang et al. [13] has the advantages of fast training and strong generalization, ELM is introduced to solve for the weights when a linear classifier is used for classification; in addition, Singular Value Decomposition (SVD) is also used to solve for the weights. To verify the effectiveness of the proposed algorithm, FFMLLA is compared experimentally with standard IMLLA and other classic multi-label algorithms on several public datasets. The experimental results show that the proposed algorithm has certain advantages over the comparison algorithms.
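The attractiveness computation described above can be sketched with the standard firefly attractiveness function β(r) = β0·exp(−γr²) [16] applied to a Minkowski distance. The parameter values β0 and γ below are hypothetical defaults, and this sketch is not necessarily the exact weighting formula used by FFMLLA:

```python
import math

def minkowski(a, b, p=2):
    """Minkowski distance between two feature vectors (p = 2 gives Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def attractiveness(x, neighbor, beta0=1.0, gamma=1.0, p=2):
    """Standard firefly attractiveness beta0 * exp(-gamma * r^2):
    closer (more similar) neighbors receive larger weights."""
    r = minkowski(x, neighbor, p)
    return beta0 * math.exp(-gamma * r ** 2)

x = [0.2, 0.8]
near, far = [0.25, 0.75], [0.9, 0.1]
print(attractiveness(x, near) > attractiveness(x, far))  # → True
```

Because the attractiveness decays monotonically with distance, using it as a per-neighbor weight realizes exactly the intuition stated above: a more similar neighbor contributes more to the label count vector.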
1 Theoretical background
1.1 Multi-label learning
Multi-label learning is a learning framework proposed for the polysemous objects that are ubiquitous in real life. Under this framework, a sample consists of multiple features and multiple labels, and the goal of learning is to assign as many correct labels as possible to unseen instances. In single-instance multi-label learning, let X = {x1, x2, …, xn}^T ∈ R^{n×d} denote n samples, each described by d features, and let Y = {1, 2, …, Q} denote the set of possible concepts. T = {(x1, Y1), (x2, Y2), …, (xm, Ym)} (xi ∈ X, Yi ⊆ Y) denotes the training set. The goal of multi-label learning is to obtain a mapping f: X → {-1, 1}^Q and use it to predict the labels of samples whose features are known but whose labels are unknown.
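The correspondence between a label set Y_i ⊆ {1, …, Q} and the output space {-1, 1}^Q of the mapping f can be made concrete with a small sketch (the data values are toy examples, not from the paper's experiments):

```python
# A label subset Y_i of {1, ..., Q} is equivalently a vector in
# {-1, +1}^Q, the output space of the mapping f.

def to_sign_vector(label_set, Q):
    """Convert a label subset into its {-1, +1}^Q representation."""
    return [1 if q in label_set else -1 for q in range(1, Q + 1)]

Q = 4
T = [([0.1, 0.9], {1, 3}),   # (features x_i, label set Y_i)
     ([0.8, 0.2], {2})]
print([to_sign_vector(Y_i, Q) for _, Y_i in T])
# → [[1, -1, 1, -1], [-1, 1, -1, -1]]
```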
1.2 IMLLA
ML-KNN is a classic multi-label classification algorithm. It first obtains the label information of the nearest-neighbor samples and then infers the label set of an unseen instance by maximum a posteriori estimation, but it does not fully examine the correlation between labels. To address this problem, Zhang Minling proposed a new multi-label lazy learning algorithm, IMLLA. IMLLA first finds the k nearest neighbors of a test sample in the training set together with their label information, then generates a count vector for each label from the label information of the k neighbors, and finally submits the vector to a trained classifier for label prediction.
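As a minimal sketch of the lazy-learning pipeline just described (not the exact IMLLA formulation of [15]), the following finds the k nearest neighbours of a query and builds a per-label count vector from their labels; all data values are hypothetical:

```python
# Sketch: k-NN search followed by construction of a per-label
# count vector over the retrieved neighbours.

def euclidean(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

def label_count_vector(X_train, Y_train, x, k, Q):
    """For each label q, count how many of the k nearest neighbours carry q."""
    order = sorted(range(len(X_train)), key=lambda i: euclidean(X_train[i], x))
    neighbors = order[:k]
    return [sum(1 for i in neighbors if Y_train[i][q] == 1) for q in range(Q)]

X_train = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
Y_train = [[1, 0], [1, 1], [0, 1], [0, 1]]   # Q = 2 toy labels
print(label_count_vector(X_train, Y_train, [0.15, 0.85], k=2, Q=2))  # → [2, 1]
```

In this plain form every neighbour contributes equally to the counts; FFMLLA's modification, described in the introduction, is to weight each neighbour's contribution by its firefly attractiveness instead.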
4 Conclusion
This paper studies the problem that k-nearest-neighbor-based multi-label correlation algorithms do not consider the sample distribution, and applies the idea of the firefly method to fuse similarity information with neighbor label information. The firefly method, which originates from simulating the natural phenomenon of fireflies gathering at night, is computationally efficient and robust, and fuses similarity information with label information well. After reconstructing the label count vector, the proposed algorithm uses singular value decomposition and the kernel extreme learning machine, respectively, to solve for the weights before performing linear classification. The experimental results show that the proposed FFMLLA algorithm achieves good performance and stability.
Although combining similarity information with neighbor label information improves the classification accuracy of the model to some extent, there is still a gap from the expected effect. How to extract information from the neighbor space that is more effective than similarity information to assist the classifier is therefore the focus of future research.
References
[1] GIBAJA E, VENTURA S. A tutorial on multi-label learning[J]. ACM Computing Surveys, 2015, 47(3): 1-38.
[2] HE Z F, YANG M, LIU H D. Joint learning of multi-label classification and label correlations[J]. Journal of Software, 2014, 25(9): 1967-1981. (in Chinese)
[3] LIU J, CHANG W, WU Y, et al. Deep learning for extreme multi-label text classification[C]// Proceedings of the 40th International ACM SIGIR Conference on Research & Development in Information Retrieval. New York: ACM, 2017: 115-124.
[4] KORDMAHALLEH M M, HOMAIFAR A, DUKKA B K C. Hierarchical multi-label gene function prediction using adaptive mutation in crowding niching[C]// Proceedings of the 13th IEEE International Conference on BioInformatics and BioEngineering. Piscataway, NJ: IEEE, 2013: 1-6.
[5] ZHU X, LI X, ZHANG S. Block-row sparse multiview multi-label learning for image classification[J]. IEEE Transactions on Cybernetics, 2016, 46(2): 450.
[6] WANG Z, CHEN T, LI G, et al. Multi-label image recognition by recurrently discovering attentional regions[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Washington, DC: IEEE Computer Society, 2017: 464-472.
[7] OZONAT K M, YOUNG D E. Towards a universal marketplace over the Web: statistical multi-label classification of service provider forms with simulated annealing[C]// Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2009: 1295-1304.
[8] HOU S, ZHOU S, CHEN L, et al. Multi-label learning with label relevance in advertising video[J]. Neurocomputing, 2016, 171(C): 932-948.
[9] BOUTELL M R, LUO J, SHEN X, et al. Learning multi-label scene classification[J]. Pattern Recognition, 2004, 37(9): 1757-1771.
[10] ZHANG M, ZHOU Z. ML-KNN: a lazy learning approach to multi-label learning[J]. Pattern Recognition, 2007, 40(7): 2038-2048.
[11] LEE J, KIM H, KIM N R, et al. An approach for multi-label classification by directed acyclic graph with label correlation maximization[J]. Information Sciences, 2016, 351(C): 101-114.
[12] ELISSEEFF A E, WESTON J. A kernel method for multi-labelled classification[C]// Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. Cambridge, MA: MIT Press, 2002: 681-687.
[13] HUANG G, ZHU Q, SIEW C K. Extreme learning machine: theory and applications[J]. Neurocomputing, 2006, 70(1/2/3): 489-501.
[14] WANG Y B, CHENG Y S, HE Y, et al. Multi-label learning algorithm of regression kernel extreme learning machine[J]. Pattern Recognition and Artificial Intelligence, 2018, 31(5): 419-430. (in Chinese)
[15] ZHANG M L. An improved multi-label lazy learning approach[J]. Journal of Computer Research and Development, 2012, 49(11): 2271-2282. (in Chinese)
[16] YANG X, HE X. Firefly algorithm: recent advances and applications[J]. International Journal of Swarm Intelligence, 2013, 1(1): 36-50.
[17] HIDALGO-PANIAGUA A, VEGA-RODRIGUEZ M A, FERRUZ J, et al. Solving the multi-objective path planning problem in mobile robotics with a firefly-based approach[J]. Soft Computing, 2017, 21(4): 1-16.
[18] LEI Y, ZHAO D, CAI H B. Prediction of length-of-day using extreme learning machine[J]. Geodesy and Geodynamics, 2015, 6(2): 151-159.
[19] WANG Z, XIN J, TIAN S, et al. Distributed and weighted extreme learning machine for imbalanced big data learning[J]. Tsinghua Science and Technology, 2017, 22(2): 160-173.
[20] LUO F F, GUO W Z, YU Y L, et al. A multi-label classification algorithm based on kernel extreme learning machine[J]. Neurocomputing, 2017, 260: 313-320.
[21] YANG M J, MA C, WANG Y, et al. Algorithm named FCMM to improve K-means clustering algorithm[J/OL]. Application Research of Computers, 2019, 36(7) [2018-04-12]. http://www.arocmag.com/article/02201907006.html. (in Chinese)
[22] WANG H, WANG W, ZHOU X, et al. Firefly algorithm with neighborhood attraction[J]. Information Sciences, 2017, 382/383: 374-387.
[23] CHENG M Y, NI Z W, ZHU X H. Overview on glowworm swarm optimization or firefly algorithm[J]. Computer Science, 2015, 42(4): 19-24. (in Chinese)
[24] ZHANG M L, ZHOU Z H. A review on multi-label learning algorithms[J]. IEEE Transactions on Knowledge & Data Engineering, 2014, 26(8): 1819-1837.
[25] DEMSAR J. Statistical comparisons of classifiers over multiple data sets[J]. Journal of Machine Learning Research, 2006, 7(1): 1-30.
[26] ZHANG M, WU L. LIFT: multi-label learning with label-specific features[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(1): 107-120.
[27] LIN Y, LI Y, WANG C, et al. Attribute reduction for multi-label learning with fuzzy rough set[J]. Knowledge-Based Systems, 2018, 152: 51-56.