朱躍龍 朱曉曉 王繼民
摘 要:針對時間序列模體發(fā)現(xiàn)算法計算復雜,并且無法發(fā)現(xiàn)多實例模體的問題,提出基于子序列全連接和最大團的時間序列模體發(fā)現(xiàn)(TSSJMC)算法。首先,使用快速時間序列子序列全連接算法求得所有子序列之間的距離,生成距離矩陣;然后,設置相似性閾值,將距離矩陣轉(zhuǎn)化為鄰接矩陣,構(gòu)造子序列相似圖;最后采用最大團搜索算法從相似圖中搜索最大團,最大團的頂點對應的時間序列為包含最多實例的模體。在公開的時間序列數(shù)據(jù)集上進行實驗,選用已有的能夠發(fā)現(xiàn)多實例模體的Brute Force和Random Projection算法作為對比對象,分別從準確性、效率、可擴展性和魯棒性對TSSJMC算法進行分析并獲得了客觀的評判結(jié)果。實驗結(jié)果表明,與Random Projection算法相比,TSSJMC算法在效率、可擴展性和魯棒性法方面均有明顯優(yōu)勢;與Brute Force算法相比,TSSJMC算法發(fā)現(xiàn)的模體實例數(shù)量雖略低,但其效率和可擴展性都優(yōu)于Brute Force算法。因此,TSSJMC是質(zhì)量和效率相平衡的算法。
關(guān)鍵詞:時間序列;時間序列子序列;子序列連接;最大團;模體發(fā)現(xiàn)
中圖分類號: TP311.13
文獻標志碼:A
Abstract: Existing time series motif discovery algorithms have high computational complexity and cannot find multi-instance motifs. To overcome these defects, a Time Series motif discovery algorithm based on Subsequence full Joins and Maximum Clique (TSSJMC) was proposed. Firstly, the fast time series subsequence full join algorithm was used to obtain the distance between all subsequences and generate the distance matrix. Then, the similarity threshold was set, the distance matrix was transformed into the adjacency matrix, and the sub-sequence similarity graph was constructed. Finally, the maximum clique in the similarity graph was extracted by the maximum clique search algorithm, and the time series corresponding to the vertices of the maximum clique were the motifs containing most instances. In the experiments on public time series datasets, TSSJMC algorithm was compared with Brute Force algorithm and Random Projection algorithm which also could find multi-instance motifs in accuracy, efficiency, scalability and robustness. The experimental results demonstrate that compared with Random Projection algorithm, TSSJMC algorithm has obvious advantages in terms of efficiency, scalability and robustness; compared with Bruce Force algorithm, TSSJMC algorithm finds slightly less motif instances, but its efficiency and scalability are better. Therefore, TSSJMC is an algorithm that balances quality and efficiency.
Key words: time series; time series subsequence; subsequence join; maximum clique; motif discovery
0 引言
時間序列是按時間順序排列的、具有相等時間間隔的一系列數(shù)據(jù)的集合[1]。時間序列無處不在,使其在各個行業(yè)獲得普遍的應用。例如金融領(lǐng)域的證券交易數(shù)據(jù)[2]、氣象領(lǐng)域的氣溫氣壓數(shù)據(jù)[3]、工業(yè)領(lǐng)域的用電數(shù)據(jù)[4]、醫(yī)學領(lǐng)域的腦電波和心電圖數(shù)據(jù)[5-6]等。在時間序列數(shù)據(jù)挖掘的諸多問題中,時間序列模式發(fā)現(xiàn)是一個基礎(chǔ)性問題。時間序列模式發(fā)現(xiàn)包括查找事先指定模式和預先未知的模式。查找事先指定模式的問題(即按內(nèi)容查詢)已有諸多解決方法[7-10]。然而,查找預先未知、重復出現(xiàn)的模式即時間序列模體發(fā)現(xiàn)(也稱為時間序列的序列主題發(fā)現(xiàn))問題則面臨更多挑戰(zhàn)[11]。模體發(fā)現(xiàn)對于時間序列挖掘具有重要意義,可以用于解決時間序列聚類、分類、關(guān)聯(lián)規(guī)則發(fā)現(xiàn)等問題[12]。
模體發(fā)現(xiàn)源于生物信息學,用于尋找脫氧核糖核酸(DeoxyriboNucleic Acid, DNA)序列中具有相似排列和功能的短核苷酸片段。2002年,Lin等[11]首次將“模體”一詞引入時間序列,并提出時間序列模體發(fā)現(xiàn)的概念,此后出現(xiàn)了許多模體發(fā)現(xiàn)方法。第一類常見的時間序列模體發(fā)現(xiàn)算法采用近似離散化方法,該類算法先采用字符串聚集近似(Symbolic Aggregate Approximation, SAX)算法將時間序列進行離散并符號化,然后在壓縮后的數(shù)據(jù)中尋找相似片段或者提取規(guī)則[13-14]。此類算法雖表現(xiàn)出良好的性能,但是SAX會對數(shù)據(jù)進行平均處理,可能丟失數(shù)據(jù)集中具有重要意義的峰、谷等信息。第二類常見的模體發(fā)現(xiàn)方法是采用聚類發(fā)現(xiàn)時間序列模體,先將時間序列進行分段,然后使用聚類算法進行模體發(fā)現(xiàn)[15-16]。Eamonn等[17]指出利用滑動窗口提取時間序列進行聚類挖掘是完全無意義的,因為通過滑動窗口平移提取的每個數(shù)據(jù)點對于整體貢獻為一條直線。第三類時間序列模體發(fā)現(xiàn)算法基于概率方法,該類算法主要采用滑動窗口提取子序列,然后將投影算法作用于子序列得到候選模體,再將候選模體作為基準從原始序列中查找多條模體[18-19]。此類算法雖效率高,但涉及參數(shù)過多并且計算復雜。第四類算法是子序列連接的時間序列模體發(fā)現(xiàn)算法,該類算法通過子序列連接得到所有子序列的最近鄰后,構(gòu)建子序列相似圖,再采用最大團或最近鄰算法計算相似序列作為模體[20-21]。該類算法在子序列連接和最大團搜索中計算成本很高。
2017年Yeh等[1]使用快速傅里葉變換提高子序列全連接的速度,然后尋找每個子序列的1-最近鄰(1-Nearest Neighbor, 1-NN),彼此間相似度最高的兩子序列即為模體,該算法無法發(fā)現(xiàn)多實例模體。結(jié)合文獻[1]中算法計算簡單、速度快的優(yōu)勢,本文提出基于子序列全連接和最大團的時間序列模體發(fā)現(xiàn)(Time Series motif discovery based on Subsequence full Joins and Maximum Cliques, TSSJMC)算法以高效地發(fā)現(xiàn)等長的多實例模體。實驗結(jié)果表明,本文算法能夠快速、準確地發(fā)現(xiàn)多實例模體,同時對噪聲具有較好的魯棒性。
4 結(jié)語
針對時間序列模體發(fā)現(xiàn)算法計算復雜,并且無法發(fā)現(xiàn)多實例模體的問題,本文提出了一種計算簡單并且能夠發(fā)現(xiàn)多實例模體的基于子序列全連接和最大團的時間序列模體發(fā)現(xiàn)(TSSJMC)算法。該算法通過子序列全連接,構(gòu)建子序列相似圖,尋找圖中的最大團三個步驟獲得時間序列中的多實例模體?;诙鄠€公開數(shù)據(jù)集的多組實驗表明,本文提出的算法能夠在較短時間內(nèi)發(fā)現(xiàn)多條模體,并且具有高效性、準確性和更強的可擴展性和魯棒性。文中算法發(fā)現(xiàn)的模體均為等長模體,未來我們將考慮發(fā)現(xiàn)不同長度的模體。
參考文獻:
[1] YEH C-C M, ZHU Y, ULANOVA L, et al. Matrix Profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets [C]// Proceedings of the 2016 IEEE 16th International Conference on Data Mining. Piscataway, NJ: IEEE, 2016: 1317-1322.
[2] 周博,嚴洪森.基于小波和多維泰勒網(wǎng)動力學模型的金融時間序列預測[J].系統(tǒng)工程理論與實踐,2013,33(10):2654-2662. (ZHOU B, YAN H S. Financial time series forecasting based on wavelet and multi-dimensional Taylor network dynamics model [J]. Systems Engineering ― Theory and Practice, 2013, 33(10): 2654-2662.)
[3] AILLIOT P, BESSAC J, MONBET V, et al. Non-homogeneous hidden Markov-switching models for wind time series [J]. Journal of Statistical Planning and Inference, 2015, 160: 75-88.
[4] 張淑清,師榮艷,李盼,等.基于混沌關(guān)聯(lián)積分的暫態(tài)電能質(zhì)量擾動分類[J].儀器儀表學報,2015,36(1):160-166. (ZHANG S C, SHI R Y, LI P, et al. Transient power quality disturbance classification based on chaos-correlation-integral [J]. Chinese Journal of Scientific Instrument, 2015, 36(1): 160-166.)
[5] ARES J, LARA J A, LIZCANO D, et al. A soft computing framework for classifying time series based on fuzzy sets of events [J]. Information Sciences, 2016, 330: 125-144.
[6] PADMAVATHI S, RAMANUJAM E. Naive bayes classifier for ECG abnormalities using multivariate maximal time series motif [J]. Procedia Computer Science, 2015, 47: 222-228.
[7] GE X, SMYTH P. Deformable Markov model templates for time-series pattern matching [C]// Proceedings of the 2000 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2000: 81-90.
[8] KALPAKIS K, GADA D, PUTTAGUNTA V. Distance measures for effective clustering of ARIMA time-series [C]// Proceedings of the 2001 IEEE International Conference on Data Mining. Washington, DC: IEEE Computer Society, 2001: 273-280.
[9] KEOGH E, CHAKRABARTI K, PAZZANI M, et al. Dimensionality reduction for fast similarity search in large time series databases [J]. Knowledge & Information Systems, 2001, 3(3): 263-286.
[10] CHAKRABARTI K, KEOGH E, MEHROTRA S, et al. Locally adaptive dimensionality reduction for indexing large time series databases [J]. ACM Transactions on Database Systems, 2002, 27(2): 188-228.
[11] LIN J, KEOGH E, LONARDI S, et al. Finding motifs in time series [C]// Proceedings of the 2nd Workshop on Temporal Data Mining. New York: ACM, 2002: 53-68.
[12] 馬百鳴.基于DTW度量的時間序列主旨模式提取[D].大連:大連理工大學,2011:1-3. (MA B M. Motif extraction algorithm of time series based on DTW [D]. Dalian: Dalian University of Technology, 2011:1-3.)
[13] PATEL P, KEOGH E, LIN J, et al. Mining motifs in massive time series databases [C]// Proceedings of the 2002 IEEE International Conference on Data Mining. Washington, DC: IEEE Computer Society, 2002: 370-377.
[14] LIN J, LI Y. Finding approximate frequent patterns in streaming medical data [C]// Proceedings of the 2010 IEEE 23rd International Symposium on Computer-Based Medical Systems (CBMS). Washington, DC: IEEE Computer Society, 2010: 13-18.
[15] MURAKAMI K, DOKI S, OKUMA S, et al. A study of extraction method of motion patterns observed frequently from time-series posture data [C]// Proceeding of the 2005 IEEE International Conference on Systems, Man and Cybernetics. Piscataway, NJ: IEEE, 2005: 3610-3615.
[16] FU T-C, CHUNG F-L, LUK R, et al. Preventing meaningless stock time series pattern discovery by changing perceptually important point detection [C]// FSKD 2005: Proceedings of the 2005 Second International Conference on Fuzzy Systems and Knowledge Discovery, LNCS 3613. Berlin: Springer, 2005: 1171-1174.
[17] EAMONN KEOGH J L. Clustering of time series subsequences is meaningless: implications for past and future research [J]. Knowledge & Information Systems, 2003, 8(2):154-177.
[18] CHIU B, KEOGH E, et al. Probabilistic discovery of time series motifs [C]// KDD03: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2003: 493-498.
[19] MINNEN D, ISBELL C, ESSA I, et al. Detecting subdimensional motifs: an efficient algorithm for generalized multivariate pattern discovery [C]// Proceedings of the 2007 Seventh IEEE International Conference on Data Mining. Washington, DC: IEEE Computer Society, 2007: 601-606
[20] LIN Y, MCCOOL M D, GHORBANI A A. Time series motif discovery and anomaly detection based on subseries join [J]. IAENG International Journal of Computer Science, 2010, 37(3): 259-271.
[21] SILVA D F, YEH C-C M, ENRIQUE G, et al. SiMPle: assessing music similarity using subsequences joins [C]// Proceedings of the 17th International Society for Music Information Retrieval Conference, New York: ISMIR, 2016:23-29.
[22] GRABOCKA J, SCHILLING N, SCHMIDTTHIEME L. Latent time-series motifs [J]. ACM Transactions on Knowledge Discovery from Data, 2016, 11(1): 1-20.
[23] YANKOV D, KEOGH E, MEDINA J, et al. Detecting timeseries motifs under uniform scaling [C]// KDD07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2007: 844-853.
[24] LI Y, LIN J. Approximate variable-length time series motif discovery using grammar inference [C]// MDMKDD10: Proceedings of the 10th International Workshop on Multimedia Data Mining. New York: ACM, 2010: Article No. 10.
[25] LI Y, LIN J, OATES T. Visualizing variable-length time series motifs [C]// Proceedings of the 12th SIAM International Conference on Data Mining. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM), 2012: 895-906.
[26] DUY T C, ANH D T. A fast method for motif discovery in large time series database under dynamic time warping [C]// Proceeding of the Sixth International Conference on Knowledge and Systems Engineering, AISC 326. Cham: Springer, 2015: 155-167.
[27] MUEEN A, ZHU Y, YEH M, et al. The fastest similarity search algorithm for time series subsequences under euclidean distance [EB/OL]. (2015-08-01) [2017-11-01]. http://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html.
[28] MUEEN A, NATH S, LIU J. Fast approximate correlation for massive time-series data [C]// Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. New York: ACM, 2010: 171-182.
[29] SAKURAI Y, PAPADIMITRIOU S, FALOUTSOS C. BRAID: stream mining through group lag correlations [C]// Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. New York: ACM, 2005: 599-610.
[30] RAKTHANMANON T, CAMPANA B, MUEEN A, et al. Searching and mining trillions of time series subsequences under dynamic time warping [C]// Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2012: 262-270.
[31] BELACHEW M T, GILLIS N. Solving the maximum clique problem with symmetric rank-one non-negative matrix approximation [J]. Journal of Optimization Theory and Applications, 2017, 173(1): 279-296.
[32] MUEEN A, KEOGH E. Online discovery and maintenance of time series motif [EB/OL]. (2010-06-25) [2017-11-01]. http://alumni.cs.ucr.edu/~mueen/OnlineMotif/.