王梅,宋曉暉,劉勇,許傳海
神經(jīng)正切核K?Means聚類
王梅1,2,宋曉暉1,劉勇3,4*,許傳海1
(1.東北石油大學(xué) 計(jì)算機(jī)與信息技術(shù)學(xué)院,黑龍江 大慶 163318; 2.黑龍江省石油大數(shù)據(jù)與智能分析重點(diǎn)實(shí)驗(yàn)室(東北石油大學(xué)),黑龍江 大慶 163318; 3.中國(guó)人民大學(xué) 高瓴人工智能學(xué)院,北京 100872; 4.大數(shù)據(jù)管理與分析方法研究北京市重點(diǎn)實(shí)驗(yàn)室(中國(guó)人民大學(xué)),北京 100872)(?通信作者電子郵箱liuyonggsai@ruc.edu.cn)
針對(duì)K?Means聚類算法利用均值更新聚類中心,導(dǎo)致聚類結(jié)果受樣本分布影響的問(wèn)題,提出了神經(jīng)正切核K?Means聚類算法(NTKKM)。首先通過(guò)神經(jīng)正切核(NTK)將輸入空間的數(shù)據(jù)映射到高維特征空間,然后在高維特征空間中進(jìn)行K?Means聚類,并采用兼顧簇間與簇內(nèi)距離的方法更新聚類中心,最后得到聚類結(jié)果。在car和breast?tissue數(shù)據(jù)集上,對(duì)NTKKM聚類算法的準(zhǔn)確率、調(diào)整蘭德系數(shù)(ARI)及FM指數(shù)這3個(gè)評(píng)價(jià)指標(biāo)進(jìn)行統(tǒng)計(jì)。實(shí)驗(yàn)結(jié)果表明,NTKKM聚類算法的聚類效果以及穩(wěn)定性均優(yōu)于K?Means聚類算法和高斯核K?Means聚類算法。NTKKM聚類算法與傳統(tǒng)的K?Means聚類算法相比,準(zhǔn)確率分別提升了14.9%和9.4%,ARI分別提升了9.7%和18.0%,F(xiàn)M指數(shù)分別提升了12.0%和12.0%,驗(yàn)證了NTKKM聚類算法良好的聚類性能。
神經(jīng)正切核;K?Means;核聚類;特征空間;核函數(shù)
聚類算法是一種典型的無(wú)監(jiān)督的機(jī)器學(xué)習(xí)方法,是利用樣本的特征比較樣本的相似性,并將具有相似屬性的樣本劃分到同一類或簇中的算法[1-3]。聚類算法的目的是使每個(gè)類中的數(shù)據(jù)之間的相似度最大,不同類中的數(shù)據(jù)之間的相似度最小。聚類算法的應(yīng)用十分廣泛,主要在數(shù)據(jù)挖掘、信息檢索和圖像分割等方面發(fā)揮重要作用[4-6]。
為了在復(fù)雜多樣的數(shù)據(jù)中提取人們所需的有價(jià)值的信息,研究人員不斷改進(jìn)聚類算法。學(xué)者們將核方法引入聚類中,提出了核聚類算法,并對(duì)核聚類算法展開(kāi)了廣泛而深入的研究。Girolami[7]和張莉等[8]的研究對(duì)核特征空間中的聚類問(wèn)題有指導(dǎo)性意義;Ben?Hur等[9]在基于高斯核的支持向量數(shù)據(jù)域描述(Support Vector Domain Description, SVDD)算法基礎(chǔ)上拓展出無(wú)監(jiān)督非參數(shù)型的聚類算法。核聚類在圖像處理方面也有出色表現(xiàn):徐小來(lái)等[10]采用基于核改進(jìn)的模糊C?均值(Fuzzy C?Means,F(xiàn)CM)聚類算法,通過(guò)高斯核函數(shù)和歐氏距離分別對(duì)像素8?鄰域的灰度和空間信息進(jìn)行建模,實(shí)現(xiàn)了圖像的分割;楊飛等[11]提出了使用內(nèi)核誘導(dǎo)距離取代歐氏距離的核函數(shù)FCM算法,提高了傳統(tǒng)FCM算法處理噪聲圖像的能力。在多核聚類的研究方面:Xiang等[12]針對(duì)多核聚類方法在數(shù)據(jù)具有高缺值率時(shí),易落入局部最優(yōu)狀態(tài)的問(wèn)題提出了一種針對(duì)不完整數(shù)據(jù)的多核聚類(Absent Multiple Kernel Clustering , AMKC)方法;Liu等[13]提出了簡(jiǎn)單的多核K?均值(Simple Multiple Kernel K?Means, SimpleMKKM),它將廣泛使用的監(jiān)督核對(duì)齊準(zhǔn)則推廣到多核聚類中。在核聚類的其他研究方面:Liu等[14]研究了核K?Means的統(tǒng)計(jì)特性并獲得近乎最優(yōu)的過(guò)度聚類風(fēng)險(xiǎn)界限,大幅提高了現(xiàn)有聚類風(fēng)險(xiǎn)分析中的最新界限。核聚類通過(guò)非線性映射增加了數(shù)據(jù)點(diǎn)線性可分的概率,即能較好地分辨、提取并放大有用的特征,從而實(shí)現(xiàn)更準(zhǔn)確的聚類,算法收斂速度也較快。在經(jīng)典聚類算法失效時(shí),核聚類算法常常能得到較好的聚類結(jié)果[15];但學(xué)者們大多數(shù)使用淺層的核函數(shù)為基礎(chǔ)進(jìn)行研究,而淺層的核函數(shù)挖掘深層次信息時(shí)存在局限性。
Neal等[16]在1994年就提出了在無(wú)限寬度的神經(jīng)網(wǎng)絡(luò)下,具有參數(shù)為獨(dú)立同分布的單層全連接神經(jīng)網(wǎng)絡(luò)等價(jià)于高斯過(guò)程(Gaussian Process, GP),揭示了無(wú)限寬的神經(jīng)網(wǎng)絡(luò)與核方法之間的聯(lián)系。Lee等[17]和Matthews等[18]研究發(fā)現(xiàn)這些內(nèi)核對(duì)應(yīng)于無(wú)限寬的深度網(wǎng)絡(luò),參數(shù)都是隨機(jī)選擇的,只有頂層是通過(guò)梯度下降進(jìn)行訓(xùn)練。近幾年,Jacot等[19]研究表明參數(shù)化深層神經(jīng)網(wǎng)絡(luò)的訓(xùn)練與神經(jīng)正切核(Neural Tangent Kernel, NTK)有關(guān),一個(gè)正確地隨機(jī)初始化的、足夠?qū)挼?、由具有無(wú)窮小步長(zhǎng)大小的梯度下降訓(xùn)練的深度神經(jīng)網(wǎng)絡(luò)和一個(gè)帶有NTK的確定性核回歸預(yù)測(cè)器是等效的。NTK在無(wú)限寬極限下趨于一個(gè)確定的核,而且在梯度下降的訓(xùn)練過(guò)程中保持不變。NTK不僅可用于全連接網(wǎng)絡(luò),還可用于其他各種神經(jīng)網(wǎng)絡(luò)結(jié)構(gòu)。例如,在文獻(xiàn)[20]中,研究人員將NTK拓展到卷積神經(jīng)網(wǎng)絡(luò)中,稱為卷積神經(jīng)正切核(Convolutional Neural Tangent Kernel, CNTK);王梅等[21]將神經(jīng)正切核拓展到多核學(xué)習(xí)中,提出了一種基于神經(jīng)正切核的多核學(xué)習(xí)方法,能增強(qiáng)多核學(xué)習(xí)方法的表示能力;Arora等[22]通過(guò)對(duì)比實(shí)驗(yàn)證明NTK比高斯核和低次多項(xiàng)式核在低數(shù)據(jù)任務(wù)中表現(xiàn)更佳。
NTK可以看作是一種復(fù)雜的多層次結(jié)構(gòu)的神經(jīng)網(wǎng)絡(luò),它相比淺層的核函數(shù)能更好地表示數(shù)據(jù)之間的關(guān)系,捕捉深層信息,進(jìn)一步增加了數(shù)據(jù)點(diǎn)的線性可分的概率。因此本文提出了神經(jīng)正切核K?Means(Neural Tangent Kernel K?Means, NTKKM)算法,經(jīng)過(guò)NTK核處理后的數(shù)據(jù)集中的數(shù)據(jù)特征能夠更好地突顯出來(lái),從而能夠更好地聚類復(fù)雜數(shù)據(jù)。
核聚類的基本思想是利用Mercer核[23]把輸入空間中的樣本映射到高維特征空間中,在高維特征空間中得到更加理想的聚類效果,該方法是普適的。核K?Means聚類算法在聚類的準(zhǔn)確性、穩(wěn)定性及健壯性等方面相比K?Means聚類算法有一定程度的改進(jìn)[24-25]。
K?Means目標(biāo)函數(shù)為:
首先利用核函數(shù)計(jì)算出輸入數(shù)據(jù)集的一個(gè)核矩陣,在高維空間中核K?Means更新聚類中心為:
采用無(wú)窮小學(xué)習(xí)率的梯度下降算法作為損失函數(shù),即:
由上述可得,NTK相當(dāng)于一個(gè)多層結(jié)構(gòu)的神經(jīng)網(wǎng)絡(luò),它與其他淺層核函數(shù)相比可以更好地表示復(fù)雜數(shù)據(jù)的特征。這為后面的聚類工作奠定了良好的基礎(chǔ)。
2.2.1算法的目標(biāo)函數(shù)
2.2.2聚類中心的初始
2.2.3聚類中心的更新
2.2.4本文算法
綜上所述,NTKKM聚類算法如下所示:
end for
end for
end for
end while
2.2.5本文算法時(shí)間復(fù)雜度
為驗(yàn)證本文算法能夠在維度較高、數(shù)量較大以及聚類數(shù)目較多的數(shù)據(jù)集上有較好的表現(xiàn),選取UCI數(shù)據(jù)集中4個(gè)在維度、數(shù)量及聚類數(shù)目等方面具有代表性的數(shù)據(jù)集進(jìn)行實(shí)驗(yàn),分別為紅酒質(zhì)量數(shù)據(jù)集(winequality?red)、鳶尾花數(shù)據(jù)集(iris)、乳腺組織(breast?tissue)以及汽車評(píng)估數(shù)據(jù)集(car),它們的詳細(xì)信息如表1所示。
表1 實(shí)驗(yàn)數(shù)據(jù)集信息
本文選用準(zhǔn)確率(Accuracy, ACC)、調(diào)整蘭德系數(shù)(Adjusted Rand Index, ARI)及FM指數(shù)(Fowlkes and Mallows Index, FMI)三個(gè)評(píng)價(jià)指標(biāo)進(jìn)行評(píng)價(jià)。三種算法在每個(gè)數(shù)據(jù)集分別運(yùn)行20次,然后統(tǒng)計(jì)各評(píng)價(jià)指標(biāo)平均結(jié)果。上述三個(gè)評(píng)價(jià)指標(biāo)都是值越大、聚類效果越好[33]。
3.2.1根據(jù)評(píng)價(jià)指標(biāo)分析實(shí)驗(yàn)結(jié)果
表2 三種算法的準(zhǔn)確率
表3 三種算法的ARI
FMI作為衡量分類效果的標(biāo)準(zhǔn),是精確率(Precision)和召回率(Recall)的幾何平均值,取值范圍為[0,1]。精確率指模型判為正的所有樣本中有多少是真正的正樣本;召回率指所有正樣本有多少被模型判為正樣本。精確率和召回率分別從局部和全局考慮分類效果。三種算法在每個(gè)數(shù)據(jù)集上FMI的統(tǒng)計(jì)結(jié)果如表4所示。由表4可知,NTKKM聚類算法精確率和召回率的幾何平均數(shù)的值高于對(duì)比算法,說(shuō)明NTKKM聚類算法分類效果更優(yōu)。特別是在breast?tissue、car以及winequality?red數(shù)據(jù)集上,NTKKM聚類算法的FMI有顯著提升,比K?Means分別提升了12.0%、12.0%及20.0%。
表4 三種算法精確率和召回率的幾何平均數(shù)
3.2.2根據(jù)數(shù)據(jù)集特征分析實(shí)驗(yàn)結(jié)果
本節(jié)根據(jù)數(shù)據(jù)集的特征對(duì)NTKKM聚類算法的聚類性能進(jìn)行分析。實(shí)驗(yàn)選用的數(shù)據(jù)集特征包括樣本維度、類別個(gè)數(shù)以及樣本數(shù)量,評(píng)價(jià)指標(biāo)為準(zhǔn)確率。
隨著樣本維度的增加,三種聚類算法準(zhǔn)確率的變化情況如圖1所示,在NTKKM聚類算法聚類的準(zhǔn)確率都優(yōu)于對(duì)比的兩種算法的前提下,NTKKM聚類算法相較于K?Means聚類算法在car、breast?tissue以及winequality?red數(shù)據(jù)集上準(zhǔn)確率分別提升了14.9%、9.4%、9.4%及3.9%,在維度較低的iris數(shù)據(jù)集上準(zhǔn)確率僅提升3.9%,說(shuō)明了NTKKM聚類聚類算法在維度較高的數(shù)據(jù)集合上的聚類效果更佳。
隨著聚類個(gè)數(shù)的增加,三種聚類算法準(zhǔn)確率的變化情況如圖2所示,可以明顯看到在聚類個(gè)數(shù)相對(duì)較少的iris數(shù)據(jù)集上,NTKKM聚類算法的準(zhǔn)確率較其余兩個(gè)對(duì)比算法提升不明顯,但在聚類個(gè)數(shù)較多的數(shù)據(jù)集car、breast?tissue以及winequality?red上,NTKKM聚類算法的聚類性能相較于經(jīng)典的K?Means聚類算法和GKKM聚類算法有顯著提高。
隨著樣本數(shù)量的增加,三種聚類算法準(zhǔn)確率的變化情況如圖3所示,NTKKM在處理樣本數(shù)量較多數(shù)據(jù)集winequality?red和car上,聚類準(zhǔn)確性明顯優(yōu)于對(duì)比算法。
圖1 不同維度下三種算法的準(zhǔn)確率
圖2 不同聚類個(gè)數(shù)下三種算法的準(zhǔn)確率
圖3 不同樣本數(shù)量下三種算法的準(zhǔn)確率
3.2.3聚類結(jié)果圖
使用K?Means聚類算法、GKKM聚類算法以及NTKKM聚類算法分別對(duì)iris和breast?tissue數(shù)據(jù)集進(jìn)行聚類,結(jié)果圖如圖4、5所示。
圖4 在iris數(shù)據(jù)集上三種算法的聚類效果圖
圖5 在breast?tissue數(shù)據(jù)集上三種算法的聚類效果圖
由圖4、5分析可知,傳統(tǒng)的K?Means聚類算法沒(méi)有對(duì)數(shù)據(jù)進(jìn)行處里,數(shù)據(jù)點(diǎn)的分布十分密集不易聚類;GKKM聚類算法通過(guò)高斯核將數(shù)據(jù)進(jìn)行映射后進(jìn)行聚類,通過(guò)圖4(b)和圖5(b)可以看出,淺層的高斯核對(duì)數(shù)據(jù)的表達(dá)能力較弱,未能充分表達(dá)出數(shù)據(jù)之間的相關(guān)性,因此,GKKM聚類效果并不理想;而NTKKM聚類算法通過(guò)神經(jīng)正切核將數(shù)據(jù)集進(jìn)行高維映射后,充分挖掘數(shù)據(jù)之間的聯(lián)系,從圖4(c)和圖5(c)可以看出,原始空間中分布緊密的數(shù)據(jù)被映射到高維特征空間后數(shù)據(jù)之間的分布發(fā)生了改變,增加了數(shù)據(jù)之間的可分性,因此,能夠得到較好的聚類效果。
綜上所述,NTKKM聚類算法與傳統(tǒng)的K?Means聚類算法和GKKM聚類算法相比有更好的聚類效果和更強(qiáng)的穩(wěn)定性,并且在維度較高、分類數(shù)目較多以及樣本數(shù)量較多的數(shù)據(jù)集上表現(xiàn)更好。
本文基于核聚類的基本思想提出了一種神經(jīng)正切核K?Means的聚類算法。首先使用NTK對(duì)數(shù)據(jù)集進(jìn)行映射,挖掘數(shù)據(jù)之間在低維空間未顯示出的特征,同時(shí)對(duì)K?Means聚類算法進(jìn)行改進(jìn),最后在高維空間中對(duì)數(shù)據(jù)點(diǎn)聚類。相較于傳統(tǒng)的K?Means聚類算法和淺層核K?Means,NTKKM聚類算法有更好的表示能力和更強(qiáng)的聚類效果。實(shí)驗(yàn)結(jié)果表明,本文提出的NTKKM聚類算法在維度較高、分類數(shù)目較多以及聚類數(shù)量較多的數(shù)據(jù)集上的聚類結(jié)果較好,穩(wěn)定性也更強(qiáng),能夠更好地表達(dá)數(shù)據(jù)的特征。
基于以上分析,NTKKM聚類算法也適用于圖像聚類。圖像聚類是將圖像劃分成各具特征或者是某種特性的區(qū)域,實(shí)現(xiàn)對(duì)圖像的精準(zhǔn)分割,從而達(dá)到對(duì)特定實(shí)物識(shí)別和檢測(cè)的目的。由于NTK核具有多層次的網(wǎng)絡(luò)結(jié)構(gòu),因此NTK在對(duì)數(shù)據(jù)集進(jìn)行高維映射時(shí)所需要時(shí)間較長(zhǎng)。因此,下一步需要解決的問(wèn)題是減少其計(jì)算時(shí)間,以及將NTKKM聚類算法拓展到圖像聚類應(yīng)用中。
[1] FAHIM A M, SALEM A M, TORKEY F A, et al. An efficient enhanced?means clustering algorithm[J]. Journal of Zhejiang University — SCIENCE A (Applied Physics and Engineering), 2006, 7(10): 1626-1633.
[2] 汪敏,武禹伯,閔帆. 基于多種聚類算法和多元線性回歸的多分類主動(dòng)學(xué)習(xí)算法[J]. 計(jì)算機(jī)應(yīng)用, 2020, 40(12):3437-3444.(WANG M, WU Y B, MIN F. Multi?category active learning algorithm based on multiple clustering algorithms and multiple linear regression[J]. Journal of Computer applications, 2020, 40(12):3437-3444.)
[3] LAWRENCE L O. A Primer on Cluster Analysis by James C. Bezdck[J]. IEEE Systems, Man, and Cybernetics Magazine, 2018, 4(1):48-50.
[4] 于佐軍,秦歡. 基于改進(jìn)蜂群算法的K?means算法[J]. 控制與決策, 2018, 33(1):181-185.(YU Z J, QIN H. K?means algorithm based on improved artificial bee swarm algorithm[J]. Control and Decision, 2018, 33(1):181-185.)
[5] 覃華,詹娟娟,蘇一丹. 基于概率無(wú)向圖模型的近鄰傳播聚類算法[J]. 控制與決策, 2017, 32(10):1796-1802.(QIN H, ZHAN J J, SU Y D. Affinity propagation clustering algorithm based on probabilistic undirected graphical model[J]. Control and Decision, 2017, 32(10): 1796-1802.)
[6] 周濤,陸惠玲. 數(shù)據(jù)挖掘中聚類算法研究進(jìn)展[J]. 計(jì)算機(jī)工程與應(yīng)用, 2012, 48(12):100-111.(ZHOU T, LU H L. Clustering algorithm research advances on data mining[J]. Computer Engineering and Applications, 2012, 48(12):100-111.)
[7] GIROLAMI M. Mercer kernel?based clustering in feature space[J]. IEEE Transactions on Neural Networks, 2002, 13(3):780-784.
[8] 張莉,周偉達(dá),焦李成. 核聚類算法[J]. 計(jì)算機(jī)學(xué)報(bào), 2002, 25(6): 587-590.(ZHANG L, ZHOU W D, JIAO L C. Kernel clustering algorithm[J]. Chinese Journal of Computers, 2002, 25(6): 587-590.)
[9] BEN?HUR A, HORN D, SIEGELMANN H T, et al. Support vector clustering[J]. Journal of Machine Learning Research, 2001, 2:125-137.
[10] 徐小來(lái),房曉麗. 基于改進(jìn)的直覺(jué)模糊核聚類的圖像分割方法[J]. 計(jì)算機(jī)工程與應(yīng)用, 2019, 55(17):227-231.(XU X L, FANG X L. Image segmentation method based on improved intuitive fuzzy kernel c?means clustering algorithms[J]. Computer Engineering and Applications, 2019, 55(17):227-231.)
[11] 楊飛,朱志祥. 基于特征和空間信息的核模糊C-均值聚類算法[J]. 電子科技, 2016, 29(2):16-19.(YANG F, ZHU Z X. Kernelized fuzzy C?means clustering algorithm based on feature and spatial information[J]. Electronic Science and Technology, 2016, 29(2):16-19.)
[12] XIANG L Y, ZHAO G H, LI Q, et al. A fast and effective multiple kernel clustering method on incomplete data[J]. Computers, Materials & Continua, 2021, 67(1):267-284.
[13] LIU X W, ZHU E, LIU J Y, et al. SimpleMKKM: simple multiple kernel?means[EB/OL]. (2020-05-12)[2021-07-20]. https://arxiv.org/pdf/2005.04975.pdf.
[14] LIU Y, DING L. Nearly optimal risk bounds for kernel K?means[EB/OL]. (2020-03-09)[2021-07-01].https://arxiv.org/pdf/2003.03888v1.pdf.
[15] 孔銳,張國(guó)宣,施澤生,等. 基于核的K-均值聚類[J]. 計(jì)算機(jī)工程, 2004, 30(11):12-13, 80.(KONG R, ZHANG G X, SHI Z S, et al. Kernel?based K?means clustering[J]. Computer Engineering, 2004, 30(11):12-13, 80.)
[16] NEAL R M. Bayesian Learning for Neural Networks[M]. New York: Springer, 1996: 29-53.
[17] LEE J, BAHRI Y, NOVAK R, et al. Deep neural networks as Gaussian processes[EB/OL]. (2018-03-03)[2020-05-19].https://arxiv.org/pdf/1711.00165.pdf.
[18] MATTHEWS A G D G, ROWLAND M, HRON J, et al. Gaussian process behaviour in wide deep neural networks[EB/OL]. (2018-08-16)[2020-05-19].https://arxiv.org/pdf/1804.11271.pdf.
[19] JACOT A, GABRIEL F, HONGLER C. Neural tangent kernel: convergence and generalization in neural networks[C]// Proceedings of the 32nd International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2018:8580-8589.
[20] ARORA S, DU S S, HU W, et al. On exact computation with an infinitely wide neural net[C/OL]// Proceedings of the 33rd Conference on Neural Information Processing Systems. [2020-12-24].https://proceedings.neurips.cc/paper/2019/file/dbc4d84bfcfe2284ba11beffb853a8c4-Paper.pdf.
[21] 王梅,許傳海,劉勇. 基于神經(jīng)正切核的多核學(xué)習(xí)方法[J]. 計(jì)算機(jī)應(yīng)用, 2021,41(12):3462-3467.(WANG M, XU C H, LIU Y. Multi?kernel learning methods based on neural positive tangent kernel[J]. Journal of Computer Applications, 2021, 41(12):3462-3467.)
[22] ARORA S, DU S S, LI Z Y, et al. Harnessing the power of infinitely wide deep nets on small?data tasks[EB/OL]. (2019-10-27)[2021-01-08].https://arxiv.org/pdf/1910.01663.pdf.
[23] SCH?LKOPF B, SEBASTIAN MIKA S, BURGES C J C, et al. Input space versus feature space in kernel?based methods[J]. IEEE Transactions on Neural Networks, 1999, 10(5): 1000-1017.
[24] 王守志,何東健,李文,等. 基于核K-均值聚類算法的植物葉部病害識(shí)別[J]. 農(nóng)業(yè)機(jī)械學(xué)報(bào), 2009, 40(3):152-155.(WANG S Z, HE D J, LI W, et al. Plant leaf disease identification based on the nuclear K?means clustering algorithm[J]. Transactions of the Chinese Society for Agricultural Machinery, 2009, 40(3): 152-155.)
[25] 王宇,李曉利. 核-凝聚聚類算法[J]. 大連理工大學(xué)學(xué)報(bào), 2007, 47(5):763-766.(WANG Y, LI X L. Kernel?aggregate clustering algorithm[J]. Journal of Dalian University of Technology, 2007, 47(5): 763-766.)
[26] HUO J, BI Y H, Lü D R, et al. Cloud classification and distribution of cloud types in Beijing using Ka?band radar data[J]. Advances in Atmospheric Sciences, 2019, 36(8):793-803.
[27] TZORTZIS G, LIKAS A. The MinMax?Means clustering algorithm[J]. Pattern Recognition, 2014, 47(7):2505-2516.
[28] 翟東海,魚江,高飛,等. 最大距離法選取初始簇中心的K?means文本聚類算法的研究[J]. 計(jì)算機(jī)應(yīng)用研究, 2014, 31(3):713-715, 719.(ZHAI D H, YU J, GAO F, et al. K?means text clustering algorithm based on initial cluster centers selection according to maximum distance[J]. Application Research of Computers, 2014, 31(3):713-715, 719.)
[29] DENG Z H, CHOI K S, CHUNG F L, et al. Enhanced soft subspace clustering integrating within?cluster and between?cluster information[J]. Pattern Recognition, 2010, 43(3):767-781.
[30] HUANG X H, YE Y M, ZHANG H J. Extensions of kmeans?type algorithms: a new clustering framework by integrating intra?cluster compactness and inter?cluster separation[J]. IEEE Transactions on Neural Networks and Learning Systems, 2014, 25(8): 1433-1446.
[31] PANG N, ZHAO X, WANG W, et al. Few?shot text classification by leveraging bi?directional attention and cross?class knowledge[J]. Science China Information Sciences, 2021, 64(3): No.130103.
[32] 黃學(xué)雨,程世超. KNN優(yōu)化的密度峰值聚類算法[J]. 通信技術(shù), 2021, 54(7):1608-1618.(HUANG X Y,CHENG S C. KNN optimized density peak clustering algorithm[J]. Communication Technology, 2021, 54(7): 1608-1618.)
[33] 王芙銀,張德生,張曉. 結(jié)合鯨魚優(yōu)化算法的自適應(yīng)密度峰值聚類算法[J]. 計(jì)算機(jī)工程與應(yīng)用, 2021, 57(3):94-102.(WANG F Y, ZHANG D S, ZHANG X. Adaptive density peak clustering algorithm combining with whale optimization algorithm[J]. Computer Engineering and Applications, 2021, 57(3): 94-102.)
Neural tangent kernel K?Means clustering
WANG Mei1,2, SONG Xiaohui1, LIU Yong3,4*, XU Chuanhai1
(1,,163318,;2(),163318,;3,,100872,;4(),100872,)
Aiming at the problem that the clustering results of K?Means clustering algorithm are affected by the sample distribution because of using the mean to update the cluster centers, a Neural Tangent Kernel K?Means (NTKKM) clustering algorithm was proposed. Firstly, the data of the input space were mapped to the high?dimensional feature space through the Neural Tangent Kernel (NTK), then the K?Means clustering was performed in the high?dimensional feature space, and the cluster centers were updated by taking into account the distance between clusters and within clusters at the same time. Finally, the clustering results were obtained. On the car and breast?tissue datasets, three evaluation indexes including accuracy, Adjusted Rand Index (ARI) and FM index of NTKKM clustering algorithm and comparison algorithms were counted. Experimental results show that the effect of clustering and the stability of NTKKM clustering algorithm are better than those of K?Means clustering algorithm and Gaussian kernel K?Means clustering algorithm. Compared with the traditional K?Means clustering algorithm, NTKKM clustering algorithm has the accuracy increased by 14.9% and 9.4% respectively, the ARI increased by 9.7% and 18.0% respectively, and the FM index increased by 12.0% and 12.0% respectively, indicating the excellent clustering performance of NTKKM clustering algorithm.
Neural Tangent Kernel (NTK); K?Means; kernel clustering; feature space; kernel function
This work is partially supported by National Natural Science Foundation of China (51774090, 62076234), Postdoctoral Research Startup Fund of Heilongjiang Province (LBH?Q20080), Natural Science Foundation of Heilongjiang Province (LH2020F003), Higher Education Teaching Reform Key Entrusted Project of Heilongjiang Province (SJGZ20190011).
WANG Mei, born in 1976, Ph. D., professor. Her research interests include machine learning, kernel methods, model selection.
SONG Xiaohui, born in 1998, M. S. candidate. Her research interests include deep kernel learning.
LIU Yong, born in 1986, Ph. D., research associate. His research interests include large?scale machine learning, automatic machine learning, statistical machine learning theory.
XU Chuanhai, born in 1998, M. S. candidate. His research interests include deep kernel learning.
TP181
A
1001-9081(2022)11-3330-07
10.11772/j.issn.1001-9081.2021111961
2021?11?17;
2021?12?13;
2021?12?23。
國(guó)家自然科學(xué)基金資助項(xiàng)目(51774090, 62076234);黑龍江省博士后科研啟動(dòng)金資助項(xiàng)目(LBH?Q20080);黑龍江省自然科學(xué)基金資助項(xiàng)目(LH2020F003);黑龍江省高等教育教學(xué)改革重點(diǎn)委托項(xiàng)目(SJGZ20190011)。
王梅(1976—),女,河北保定人,教授,博士,CCF會(huì)員,主要研究方向:機(jī)器學(xué)習(xí)、核方法、模型選擇;宋曉暉(1998—),女,山東濟(jì)南人,碩士研究生,CCF會(huì)員,主要研究方向:深度核學(xué)習(xí);劉勇(1986—),男,湖南益陽(yáng)人,副研究員,博士,CCF會(huì)員,主要研究方向:大規(guī)模機(jī)器學(xué)習(xí)、自動(dòng)機(jī)器學(xué)習(xí)、統(tǒng)計(jì)機(jī)器學(xué)習(xí)理論;許傳海(1998—),男,黑龍江雞西人,碩士研究生,CCF會(huì)員,主要研究方向:深度核學(xué)習(xí)。