中圖分類號(hào):TP391 文獻(xiàn)標(biāo)志碼:A 文章編號(hào):1671-5489(2025)04-1137-06
Missing Data Filling Method in Time Series Based on Generalized Center Clustering
YU Yanpeng,HUI Xianghui (College of Information and Management Science (College of Software), HenanAgricultural University,Zhengzhou 45oo46,China)
Abstract: Aiming at the problem that the filling of missing values in time series usually relied on the predictions of existing data,and the complexity and uncertainty of time series often led to errors in the prediction results. In order to ensure the effectiveness of data filling,we proposed a time series missing data filling method based on generalized center clustering. Firstly,we calculated the distance between objects and classes,as well as between classes, quantified the relative positional relationship between data points and cluster centers,and obtained the spatial relationship between data. Secondly, we used information bottleneck algorithms to cluster the generalization centers in space,dividing time series datasets containing missing data into the same class. Finally,we calculated the cluster radius, divided the outlier data generated by the generalized center clustering into usable and weakly usable randomly damaged data,set a fluctuation threshold,and compared the randomly damaged data within the fluctuation threshold with a string of the unified attribute values in the cluster,achieving the missing data filing in the time series. The experimental results show that this method has high standardized mutual information and hit rate in the clustering process,and can ensure a data replenishment rate of over 80% when filling in missing data, indicating that this method can effectively improve the integrity of time series data.
Keywords: generalized center clustering; time series; missing data filling; information bottleneck;randomly damaged data;replenishment rate
隨著時(shí)間序列數(shù)據(jù)在經(jīng)濟(jì)學(xué)、金融學(xué)、氣象學(xué)等多個(gè)領(lǐng)域的廣泛應(yīng)用,數(shù)據(jù)的準(zhǔn)確性和完整性對(duì)分析和預(yù)測(cè)時(shí)間相關(guān)現(xiàn)象和趨勢(shì)至關(guān)重要[1].但在實(shí)際應(yīng)用中時(shí)間序列數(shù)據(jù)常面臨缺失問(wèn)題[-4],因此研究時(shí)間序列缺失數(shù)據(jù)的填補(bǔ)方法具有重要意義.其旨在通過(guò)合適的數(shù)學(xué)和統(tǒng)計(jì)方法,有效恢復(fù)缺失數(shù)據(jù),提高數(shù)據(jù)分析的質(zhì)量和效率,對(duì)推動(dòng)相關(guān)領(lǐng)域的研究和決策具有重要意義.
目前,對(duì)缺失數(shù)據(jù)填補(bǔ)方法的研究已取得了很多成果.例如:?jiǎn)谭堑萚5研究了面向多維特性數(shù)據(jù)的缺失值檢測(cè)及填補(bǔ)方法,該方法檢測(cè)了多維數(shù)據(jù)的缺失程度,在不同缺失程度下設(shè)計(jì)了不同填補(bǔ)方法,但該方法在進(jìn)行缺失數(shù)據(jù)填補(bǔ)時(shí)無(wú)法保證數(shù)據(jù)補(bǔ)齊率,導(dǎo)致填補(bǔ)結(jié)果不完善;任兵等[6研究了基于壓縮感知的相關(guān)性數(shù)據(jù)填補(bǔ)方法,該方法將填補(bǔ)問(wèn)題轉(zhuǎn)化為壓縮感知框架下的稀疏向量恢復(fù)問(wèn)題,通過(guò)快速迭代加權(quán)閾值算法實(shí)現(xiàn)缺失數(shù)據(jù)的填補(bǔ),雖然在填補(bǔ)過(guò)程中效率較高,但其并不適用于時(shí)間序列數(shù)據(jù);盧繼哲等通過(guò)神經(jīng)網(wǎng)絡(luò)模型預(yù)測(cè)缺失內(nèi)容,并對(duì)其進(jìn)行填補(bǔ),雖然缺失數(shù)據(jù)預(yù)測(cè)的結(jié)果較精準(zhǔn),但其在聚類過(guò)程中無(wú)法保證較高的標(biāo)準(zhǔn)化互信息,從而無(wú)法保證后續(xù)填補(bǔ)質(zhì)量;Sun等[8]研究了基于缺失率和異常度測(cè)量的不完全數(shù)據(jù)處理方法,該方法對(duì)不同缺失比率異常數(shù)據(jù)進(jìn)行檢測(cè),并通過(guò)填補(bǔ)方法對(duì)其填充,該方法雖然能實(shí)現(xiàn)不同缺失比率異常數(shù)據(jù)的精準(zhǔn)檢測(cè),但填補(bǔ)后仍存在大量缺失數(shù)據(jù).為保證缺失數(shù)據(jù)填補(bǔ)質(zhì)量,本文提出一種基于泛化中心聚類的時(shí)間序列缺失數(shù)據(jù)填補(bǔ)方法.
時(shí)間序列缺失數(shù)據(jù)填補(bǔ)
通過(guò)泛化中心聚類可將具有相似時(shí)間模式或特征的數(shù)據(jù)點(diǎn)聚集在一起,從而利用這些相似數(shù)據(jù)的信息更準(zhǔn)確地估計(jì)和填補(bǔ)缺失值[9-10].這種方法有助于捕捉時(shí)間序列數(shù)據(jù)的內(nèi)在結(jié)構(gòu),提高缺失數(shù)據(jù)填補(bǔ)的準(zhǔn)確性和可靠性.
1. 1 泛化中心距離計(jì)算
計(jì)算泛化中心距離的意義在于衡量不同數(shù)據(jù)點(diǎn)在特征空間中的相對(duì)位置關(guān)系,從而評(píng)估不同數(shù)據(jù)之間的相似性或差異性.對(duì)象(數(shù)據(jù)點(diǎn))與類(聚類)之間距離的計(jì)算公式為
其中 x 為對(duì)象(數(shù)據(jù)點(diǎn))的特征向量, σo 為泛化中心的特征向量, n 表示特征的數(shù)量.通常情況下,距離d 越小,說(shuō)明對(duì)象與該類中的其他對(duì)象越類似,設(shè)置距離上限為 U ,若某一對(duì)象的距離 d?U ,則將該對(duì)象添加到該類中,否則構(gòu)建一個(gè)新類[1].
當(dāng)計(jì)算完對(duì)象與類之間的距離后,需計(jì)算類與類之間的距離,分別設(shè)兩個(gè)類為 O1,O2 ,兩者的泛化中心分別為 OO1 和 OO2 ,則這兩個(gè)類之間距離的計(jì)算公式為
其中 |O1|,|O2| 分別表示類 O1,O2 中存在的對(duì)象數(shù)量,若某一類中僅存在一個(gè)對(duì)象,則可將該對(duì)象設(shè)為泛化中心.通過(guò)類間距離 d(O1,O2 )可以評(píng)估不同類之間的差異性, d(O1,O2) 越小兩個(gè)類之間越相似,若該值越大,則兩個(gè)類之間的差異就越明顯.
1.2 基于信息瓶頸算法的泛化中心聚類
通過(guò)上述計(jì)算得到了對(duì)象與類之間、類與類之間的距離,量化了數(shù)據(jù)點(diǎn)與聚類中心之間的相對(duì)位
置關(guān)系,得到了數(shù)據(jù)間的空間關(guān)系[1-13],然后可利用信息瓶頸(information bottleneck,IB)算法對(duì)空間中的泛化中心進(jìn)行聚類處理,信息瓶頸算法通過(guò)限制信息傳遞的瓶頸,能在保持?jǐn)?shù)據(jù)聚類質(zhì)量的同時(shí),減少不必要的信息損失[14],從而實(shí)現(xiàn)數(shù)據(jù)的有效壓縮和聚類.
設(shè)原始數(shù)據(jù)集為 X={x1,x2,…,xN} ,聚類中心為 C={c1,c2,…,cK} ,其中 K 為預(yù)設(shè)的聚類數(shù)量.定義一個(gè)編碼函數(shù) q(c|x) 和一個(gè)解碼函數(shù) ?(c|x) .使用互信息度量 X 與 C 之間的相關(guān)性:
其中 p(x),p(c) 分別為 X 和 C 的邊緣概率分布, ρ(x,c) 為先驗(yàn)聯(lián)合分布.使用條件熵度量 X 在給定 C 下的不確定性:
信息瓶頸的目標(biāo)是最小化 H(X∣C) 的同時(shí)最大化 I(X,C) ,可通過(guò)優(yōu)化一個(gè)加權(quán)和實(shí)現(xiàn):
minq(c∣x){βI(X,C)-H(X,C)},
其中 β 為一個(gè)權(quán)衡參數(shù).
利用迭代方法求解上述優(yōu)化問(wèn)題,迭代過(guò)程中更新編碼函數(shù) q(c|x) 和解碼函數(shù) p(c|x) ,以減少目標(biāo)函數(shù)的值.經(jīng)過(guò)多次迭代后,可得到優(yōu)化的編碼函數(shù) q(c|x) 和解碼函數(shù) ?(c|x) .對(duì)于每個(gè)聚類中心 ck ,可使用解碼函數(shù)生成或描述該簇的泛化中心.這種泛化中心聚類方法不僅考慮了聚類中心的位置,還考慮了數(shù)據(jù)在聚類中心周圍的分布情況,因此能提供更豐富的信息表示每個(gè)簇.
1.3基于泛化中心聚類的缺失數(shù)據(jù)填補(bǔ)
通過(guò)泛化中心聚類先將含有缺失數(shù)據(jù)的時(shí)間序列數(shù)據(jù)集劃分到同一類中,然后再對(duì)缺失數(shù)據(jù)進(jìn)行填補(bǔ).在數(shù)據(jù)填補(bǔ)過(guò)程中,通常會(huì)將完全缺失和部分缺失的情況都?xì)w為缺失數(shù)據(jù),這種方式忽視了部分缺失數(shù)據(jù)所蘊(yùn)含的潛在價(jià)值.因此,本文在處理缺失數(shù)據(jù)填補(bǔ)問(wèn)題時(shí),先對(duì)數(shù)據(jù)損壞程度進(jìn)行細(xì)致的劃分:若殘留字符與原始數(shù)據(jù)毫無(wú)關(guān)聯(lián),則該數(shù)據(jù)被界定為弱可用隨機(jī)損壞數(shù)據(jù);反之,為可用隨機(jī)損壞數(shù)據(jù).通過(guò)區(qū)分并有效利用這兩種數(shù)據(jù),可進(jìn)一步提升缺失數(shù)據(jù)填補(bǔ)的準(zhǔn)確性和可靠性.
基于上述數(shù)據(jù)劃分結(jié)果,采用簇半徑計(jì)算方法對(duì)泛化中心聚類后產(chǎn)生的離群點(diǎn)數(shù)據(jù)再次進(jìn)行可用、弱可用隨機(jī)損壞數(shù)據(jù)劃分.在實(shí)際劃分時(shí),給定波動(dòng)閾值 ,將位于波動(dòng)閾值內(nèi)的隨機(jī)損壞數(shù)據(jù)與聚類中統(tǒng)一屬性值進(jìn)行字符串對(duì)比,波動(dòng)閾值
的計(jì)算公式為
其中 Rc 表示泛化中心簇半徑, 表示簇頭競(jìng)爭(zhēng)半徑.簇中心點(diǎn) Cm 可利用下式計(jì)算:
其中 tip 為泛化中心任一點(diǎn), N 為被選擇的泛化中心點(diǎn)數(shù)量.此時(shí),簇半徑可表示為
通過(guò)上述計(jì)算,對(duì)泛化中心聚類后產(chǎn)生的離群點(diǎn)再次進(jìn)行缺失數(shù)據(jù)劃分,以此為基礎(chǔ),降低弱可用數(shù)據(jù)的占比.圖1為數(shù)據(jù)損壞程度劃分.
在進(jìn)行時(shí)間序列缺失數(shù)據(jù)填補(bǔ)過(guò)程中,若找到一個(gè)匹配的字符串,則將其記錄為1;如果未找到匹配的字符串,則記錄為0.然后將所有字符串對(duì)比得到的數(shù)值相加,得到一個(gè)總和, b 值越大,說(shuō)明對(duì)象之間的相似度越高.基于這一原理,利用b 值最大的項(xiàng)所對(duì)應(yīng)的數(shù)據(jù)對(duì)缺失數(shù)據(jù)進(jìn)行填補(bǔ).
該過(guò)程不僅考慮了部分缺失數(shù)據(jù)的重要性,還提高了數(shù)據(jù)填補(bǔ)的準(zhǔn)確性.
2 實(shí)驗(yàn)分析
為評(píng)估本文方法對(duì)時(shí)間序列缺失數(shù)據(jù)的填補(bǔ)能力,下面對(duì)該方法進(jìn)行性能測(cè)試.在UCI數(shù)據(jù)庫(kù)中選擇實(shí)驗(yàn)所用數(shù)據(jù)集,數(shù)據(jù)集信息列于表1.
標(biāo)準(zhǔn)化互信息(NMI)可評(píng)估聚類結(jié)果與數(shù)據(jù)實(shí)際類標(biāo)簽之間的相似性,本文利用NMI評(píng)估聚類質(zhì)量:
其中 S 表示目標(biāo)聚類結(jié)果, L 表示數(shù)據(jù)實(shí)際類標(biāo)簽, E(S,L) 表示 S 與 L 之間的互信息, H(S),H(L) 分別表示 S,L 的信息量.NMI值越大,說(shuō)明聚類結(jié)果與實(shí)際情況越相符.利用命中率(HR)指標(biāo)評(píng)估本文方法的聚類質(zhì)量:
其中 ψ 表示缺失數(shù)據(jù)總數(shù), hits 表示成功聚類數(shù)據(jù).利用補(bǔ)齊率(CR)評(píng)估本文方法的缺失數(shù)據(jù)填補(bǔ)能力:
其中 F 為錯(cuò)誤填補(bǔ)的數(shù)據(jù)總量, 5 為缺失數(shù)據(jù)比率, M 為數(shù)據(jù)總量.
選取文獻(xiàn)[5]方法、文獻(xiàn)[6]方法和文獻(xiàn)[7]方法與本文方法進(jìn)行對(duì)比,不同方法的標(biāo)準(zhǔn)化互信息對(duì)比結(jié)果列于表2.由表2可見(jiàn),文獻(xiàn)[5方法的標(biāo)準(zhǔn)化互信息相對(duì)較小,文獻(xiàn)[6]方法和文獻(xiàn)[7]方法的標(biāo)準(zhǔn)化互信息雖然高于文獻(xiàn)[5]方法,但并未超過(guò) 50% ,而本文方法在聚類過(guò)程中可提供較高的標(biāo)準(zhǔn)化互信息.可見(jiàn),相比于其他方法,本文方法具有良好的聚類能力.
選取數(shù)據(jù)集Abalone和Mushroom,用本文方法對(duì)該數(shù)據(jù)集進(jìn)行聚類和缺失數(shù)據(jù)填補(bǔ),分析本文方法在處理該數(shù)據(jù)集不同缺失比率時(shí)的命中率和補(bǔ)齊率,分析結(jié)果如圖2所示.由圖2可見(jiàn),當(dāng)數(shù)據(jù)集中缺失數(shù)據(jù)比率逐漸增大時(shí),本文方法對(duì)數(shù)據(jù)處理時(shí)的命中率和補(bǔ)齊率也隨之降低,但在本文方法處理下,命中率始終保持在 70% 以上,缺失數(shù)據(jù)補(bǔ)齊率始終保持在 80% 以上,可見(jiàn)本文方法具有良好的缺失數(shù)據(jù)填補(bǔ)能力.這是因?yàn)楸疚姆椒ㄔ诜夯行木垲惡?,?duì)產(chǎn)生的離群點(diǎn)數(shù)據(jù)進(jìn)行了再次處理,將其劃分為可用、弱可用隨機(jī)損壞數(shù)據(jù),該策略能更精細(xì)地處理數(shù)據(jù)集中的異常值,提高數(shù)據(jù)填補(bǔ)的精度.
利用本文方法對(duì)不同數(shù)據(jù)集進(jìn)行缺失數(shù)據(jù)填補(bǔ),并分析用該方法進(jìn)行填補(bǔ)后的數(shù)據(jù)缺失數(shù),以此評(píng)估該方法的數(shù)據(jù)填補(bǔ)能力,分析結(jié)果列于表3.
由表3可見(jiàn),本文方法將泛化中心聚類方法引人到時(shí)間序列缺失數(shù)據(jù)填補(bǔ)中,通過(guò)計(jì)算數(shù)據(jù)點(diǎn)與聚類中心之間的相對(duì)位置關(guān)系,得到數(shù)據(jù)間的空間關(guān)系,能更準(zhǔn)確地捕捉時(shí)間序列數(shù)據(jù)的內(nèi)在結(jié)構(gòu)和特征.因此,用本文方法對(duì)數(shù)據(jù)進(jìn)行填補(bǔ)后,不同數(shù)據(jù)集中的缺失數(shù)據(jù)數(shù)量明顯下降,其中,數(shù)據(jù)集Credit approval在填補(bǔ)后缺失數(shù)據(jù)數(shù)量?jī)H為 26個(gè),可見(jiàn)本文方法在進(jìn)行缺失數(shù)據(jù)填補(bǔ)時(shí)可有效保證填補(bǔ)質(zhì)量.
綜上所述,針對(duì)填補(bǔ)時(shí)間序列中的缺失值通常依賴于已有數(shù)據(jù)的預(yù)測(cè),由于時(shí)間序列的復(fù)雜性和不確定性導(dǎo)致預(yù)測(cè)結(jié)果常存在誤差的問(wèn)題,本文提出了一種基于泛化中心聚類的時(shí)間序列缺失數(shù)據(jù)填補(bǔ)方法.首先,該方法通過(guò)引入信息瓶頸算法對(duì)時(shí)間序列數(shù)據(jù)進(jìn)行泛化中心聚類,精確捕捉數(shù)據(jù)的內(nèi)在結(jié)構(gòu)和特征.其次,針對(duì)聚類后產(chǎn)生的離群點(diǎn)數(shù)據(jù),進(jìn)一步采用波動(dòng)閾值與字符串對(duì)比的策略進(jìn)行精細(xì)處理,實(shí)現(xiàn)了缺失數(shù)據(jù)的準(zhǔn)確填補(bǔ).最后,通過(guò)量化指標(biāo)如標(biāo)準(zhǔn)化互信息、命中率及數(shù)據(jù)補(bǔ)齊率,對(duì)本文方法的填補(bǔ)效果進(jìn)行了全面評(píng)估,結(jié)果表明,本文方法的命中率始終保持在 70% 以上,缺失數(shù)據(jù)補(bǔ)齊率始終保持在 80% 以上,有效提升了缺失數(shù)據(jù)填補(bǔ)效果.
參考文獻(xiàn)
[1]關(guān)李晶,何潔帆,張立勇,等.基于單輸出子網(wǎng)迭代學(xué)習(xí)的缺失值填補(bǔ)方法[J].大連理工大學(xué)學(xué)報(bào),2022,62(4): 427-432. (GUAN L J, HE JF, ZHANG L Y,et al. Missng Value Imputation Method Based on SingleOutput Sub-network with Iterative Learning [J].Journal of Dalian University of Technology,2O22,62(4):427-432.)
[2]陳俊揚(yáng),戴志江,李雪亮,等.基于強(qiáng)化學(xué)習(xí)的多變量時(shí)序數(shù)據(jù)缺失值補(bǔ)全方法[J].中國(guó)科技論文,2023,18(11):1205-1212. (CHEN J Y,DAI Z J,LI X L,et al. Reinforcement Learning Based Missing ValueCompletion Method for Multivariate Time Series Data [J]. China Sciencepaper,2023,18(11):1205-1212.)
[3]鄧明星,歐陽(yáng)含笑,錢楓,等.基于改進(jìn)LSTM的重型柴油車遠(yuǎn)程監(jiān)測(cè) NOx 濃度缺失數(shù)據(jù)填補(bǔ)[J].環(huán)境科學(xué)學(xué)報(bào),2023,43(11):245-257. (DENG M X,OUYANG H X,QIAN F,et al. Filling in Missing NOx Concentration Data for Remote Monitoring of Heavy-Duty Diesel Vehicles Based on Improved LSTM[J].ActaScientiaeCircumstantiae,2023,43(11):245-257.)
[4]劉兵,鄭承利.基于EMD特征提取的高頻面板數(shù)據(jù)自適應(yīng)聚類方法[J].統(tǒng)計(jì)與決策,2022,38(10):16-20.(LIU B,ZHENG C L. Adaptive Clustering Method for High Frequency Panel Data Based on EMD Feature
[5]喬非,翟曉東,王巧玲.面向多維特性數(shù)據(jù)的缺失值檢測(cè)及填補(bǔ)方法對(duì)比[J].同濟(jì)大學(xué)學(xué)報(bào)(自然科學(xué)版),2023,51(12): 1972-1982. (QIAO F, ZHAI X D,WANG Q L. Comparison of Imputation Methods Based onMissing Value Detection for Multidimensional Feature Data [J]. Journal of Tongji University (Natural Science),2023,51(12):1972-1982.)
[6]任兵,郭艷,李寧,等.基于壓縮感知的相關(guān)性數(shù)據(jù)填補(bǔ)方法[J].計(jì)算機(jī)科學(xué),2023,50(7):82-88.(REN B,GUO Y,LI N,et al. Method for Correlation Data Imputation Based on Compressed Sensing[J]. ComputerScience,2023,50(7):82-88.)
[7」盧繼哲,劉宣,唐悅,等.基于聚類和LSTM的電力分鐘凍結(jié)數(shù)據(jù)缺失值填充方法[J].控制工程,2022,29(4):611-616.(LU J Z,LIU X,TANG Y,et al. Missing Value Treatment for Minute Freezing Data ofElectricity Based on Clustering and LSTM[J]. Control Engineering of China, 2022,29(4): 611-616.)
[8]SUN Z G,GAO M M, JIANG A P,et al. Incomplete Data Processng Method Based on the Measurement ofMissing Rate and Abnormal Degree:Take the Loose Particle Localization Data Set as an Example[J].ExpertSystems with Applications,2023,216(4):119411-1-119411-22.
[9]肖釗,鄧杰文,劉曉明,等.基于運(yùn)行規(guī)律和 TICC算法的風(fēng)電SCADA 高維時(shí)序數(shù)據(jù)聚類方法[J].機(jī)械工程學(xué)報(bào),2022,58(23):196-207. (XIAO Z, DENG JW,LIU X M,et al. Clustering Method of High-DimensionalTime Series SCADA Data from Wind Turbines Based on Operational Laws and TICC Algorithm [J]. Journal ofMechanical Engineering,2022,58(23):196-207.)
[10]李建華,朱澤陽(yáng),徐禮勝,等.基于深度嵌入聚類的ICU患者生理數(shù)據(jù)缺失插補(bǔ)[J].東北大學(xué)學(xué)報(bào)(自然科學(xué)版),2022,43(5): 639-645.(LIJH, ZHU ZY,XUL S,et al. Interpolation of Missing Physiological Data ofICU Patients Based on Dep Embedded Clustering [J]. Journal of Northeastern University(Natural Science),2022,43(5):639-645.)
[11]劉恒孜,呂寧,姜侯,等.基于 DCT-PLS 算法的 MODIS LST缺值填補(bǔ)方法研究[J].地球信息科學(xué)學(xué)報(bào),2022,24(2): 378-390.(LIU H Z,LU N,JIANG H,et al. Research on Gaps Filing of MODIS LST Based onDCT-PLS[J]. Journal of Geo-information Science,2022,24(2):378-390.)
[12]趙林鎖,陳澤,丁琳琳,等.基于RELM的時(shí)間序列數(shù)據(jù)加權(quán)集成分類方法[J].計(jì)算機(jī)工程與科學(xué),2022,44(3): 545-553. (ZHAO L S, CHEN Z, DING L L,et al. A Weighted Ensemble Classification Method for TimeSeries Data Based on Regularized Extreme Learning Machine [J]. Computer Engineering amp; Science,2022,44(3):545-553.)
[13]LIU SS,HUR,WU JF,et al.Research on Data Classification and Feature Fusion Method of Cancer NucleiImage Based on Deep Learning [J]. International Journal of Imaging Systems and Technology,2O22,32(3):969-981.
[14]古險(xiǎn)峰,湯永利.基于群體智能算法的混合屬性大數(shù)據(jù)聚類仿真[J].計(jì)算機(jī)仿真,2023,40(9):458-461.(GU XF,TANG Y L. Simulation of Mixed Attribute Big Data Clustering Based on Swarm IntelligenceAlgorithm[J]. Computer Simulation,2023,40(9):458-461.)