SangSeok Lee ,HaeWon Moon and Lee Sael,2,★
1Department of Artificial Intelligence,Ajou University,Suwon-si,16499,Korea
2Department of Software and Computer Engineering,Ajou University,Suwon-si,16499,Korea
ABSTRACT How can we efficiently store and mine dynamically generated dense tensors for modeling the behavior of multidimensional dynamic data? Much of the multidimensional dynamic data in the real world is generated in the form of time-growing tensors.For example,air quality tensor data consists of multiple sensory values gathered from wide locations for a long time.Such data,accumulated over time,is redundant and consumes a lot of memory in its raw form.We need a way to efficiently store dynamically generated tensor data that increase over time and to model their behavior on demand between arbitrary time blocks.To this end,we propose a Block Incremental Dense Tucker Decomposition(BID-TUCKER)method for efficient storage and on-demand modeling of multidimensional spatiotemporal data.Assuming that tensors come in unit blocks where only the time domain changes,our proposed BID-TUCKER first slices the blocks into matrices and decomposes them via singular value decomposition(SVD).The SVDs of the time × space sliced matrices are stored instead of the raw tensor blocks to save space.When modeling from data is required at particular time blocks,the SVDs of corresponding time blocks are retrieved and incremented to be used for Tucker decomposition.The factor matrices and core tensor of the decomposed results can then be used for further data analysis.We compared our proposed BID-TUCKER with D-Tucker,which our method extends,and vanilla Tucker decomposition.We show that our BID-TUCKER is faster than both D-Tucker and vanilla Tucker decomposition and uses less memory for storage with a comparable reconstruction error.We applied our proposed BID-TUCKER to model the spatial and temporal trends of air quality data collected in South Korea from 2018 to 2022.We were able to model the spatial and temporal air quality trends.We were also able to verify unusual events,such as chronic ozone alerts and large fire events.
KEYWORDS Dynamic decomposition;tucker tensor;tensor factorization;spatiotemporal data;tensor analysis;air quality
Various real-world stream data come in the form of dense dynamic tensors,i.e.,time-growing multi-dimensional arrays.Accumulated over time,these data require efficient storage and efficient ondemand processing.Tensor decomposition methods have been widely used to analyze and model from unlabeled multidimensional data.Several scalable methods have been proposed for efficient tensor decomposition [1-4].There are also several dynamic tensor decomposition methods.Most dynamic tensor decomposition methods focus on the CANDECOMP/PARAFAC(CP)decomposition method[5-8].However,compared to Tucker decompositions,CP decompositions tend to produce higher reconstruction errors when the input tensors are skewed and dense[9].Therefore,recent studies have focused on dynamic Tucker decompositions [10,11].D-L1 Tucker [10] emphasizes outlier-resistant Tucker analysis of dynamic tensors,and D-TuckerO[11]focuses on the computational efficiency of dynamic Tucker decomposition.Overall,the need for efficient storage has been overlooked in previous dynamic methods,which focus on speed,accuracy,and scalability.Thus,there is still a need for scalable Tucker decomposition methods.These methods should allow efficient storage and partial on-demand modeling and analysis of dense and dynamically generated tensors.
Our proposed Block Incremental Dense Tucker Decomposition(BID-TUCKER)method efficiently stores and decomposes,on-demand,the multi-dimensional tensor data that are dynamically generated in the form of dense tensor blocks.Each incoming tensor block is transformed into slice matrices.Each of the slice matrices is then decomposed by a singular value decomposition (SVD) method.The approach of decomposing slice matrices was proposed in D-Tucker [12] for static tensors.We extend the approach to dynamically generated tensors.We store the block-wise SVD results of the slice matrices and use them,on-demand,to decompose the original tensor blocks(Fig.1).Storing the SVD results instead of the original tensor blocks significantly reduces memory requirements.The Tucker decomposition result of the queried tensor blocks can be further analyzed and used for modeling.In this paper,we apply our method to spatiotemporal air quality tensor data and perform analysis to detect spatial and temporal trends.
Figure 1:Storage phase and query phase of the proposed BID-TUCKER
Our contributions are summarized as follows:
· Propose an efficient storage method for block-wise generated tensor stream data.
· Propose an efficient on-demand dense tensor decomposition method for queried blocks for further analysis and modeling.
· Compare the reconstruction error and the computational speed of the proposed method with D-Tucker[12]and vanilla Tucker decomposition.
· Apply the proposed method to Korean air quality data and show how spatial and temporal trends can be modeled.
We begin by summarizing the tensor operation and the terms used in the paper.The notations and descriptions are consistent with the notations of Kolda et al.[9]and also D-Tucker[12].Table 1 lists the symbols used.
Table 1: Table of symbols consistent with the notations of Kolda et al.[9]
Matricizationtransforms a tensor into a matrix.The mode-nmatricization of a tensor X ∈is denoted as X(n).The mapping from an element (i1,...,iN) of X to an element (in,j) of X(n)is given as follows:
Given a tensor of the form X ∈,slice matricesare two-dimensional slices of the tensor defined by fixing all but two indices.For example,in a 3rdordertensor,the horizontal,lateral,and frontal slices of the tensor X ∈can be denoted asrespectively.A mode-1 matricesX(1)of a tensor X can be represented as slice matrices as follows:
whereXl∈RI1×I2,Lis equal toI3×···×IN,and the indexlis defined using the following equation[9]:
whereKmis the dimensionality of mode-m,N is the order of the input tensor,andequals 1 ifi-1<m.Note that on the right hand side of the Eq.(2)there is a slice matrixXldenoted by the indexl.
N-mode productallows multiplications between a tensor and a matrix.Then-mode product of a tensor X ∈RI1×···×INwith a matrix U ∈is denoted by X×nU.The resulting tensor has the shape ofI1×···×In-1×Jn×In+1×···×IN.The element-wisen-mode product of a tensor can be written as follows:
For more on tensor operations,please refer to a review by Kolda et al.[9].
Incremental singular value decomposition(incremental SVD)is a method that generates the SVD result of a growing matrix in an incremental manner.Let X consist of two vertically concatenated blocks XAand XB.Also,let Ui,Σi,andbe the SVD result of a given matrixi.We can compute the SVD of the vertically concatenated X as follows[13]:
Our proposed method is based on one of the most popular tensor decomposition methods,the Tucker decomposition [9].Tucker decomposition decomposes tensors into factor matrices of each mode and a core tensor.Given annthorder tensor X,the Tucker decomposition approximates X to a core tensor Gand factor matricesThe core tensor G is much smaller than the input tensor X and the factor matrices A(n)are normally column orthogonal.The Tucker decomposition with tensor operations is shown as follows:
A widely used technique for minimizing the reconstruction error in a standard Tucker decomposition is alternating least squares (ALS) [9],which updates a single factor matrix or core tensor while holding all others fixed.Algorithm 1 describes a vanilla Tucker factorization algorithm based on ALS,calledhigher-orderorthogonaliteration(HOOI)(see[9]for details).Algorithm 1 requires the storage of a full-density matrix Y(n),and the amount of memory needed to store Y(n)isAlso,computing a singlen-mode product between an input tensor X and the factor matricesA(n),as in Eq.(8),requirestime complexity.
As the order,the mode dimensionality,or the rank of a tensor increases,the amount of memory required grows rapidly.
Sequentially Truncated Higher Order SVD(ST-HOSVD)[14]calculates the HOSVD in sequentially truncated form as follows in the Algorithm(2)using randomized SVD.
The ST-HOSVD algorithm provides a way to sequentially obtain factor matrices without the repeated mode-nproducts between the original tensor and all other factor matrices,as in line 4 of the Algorithm 1.
Our proposed Block Incremental Dense Tucker Decomposition (BID-TUCKER) method assumes that only the time mode increases in dimension and dimensions remain for the rest of the modes.Our BID-TUCKER is divided into two phases (Fig.1 and Algorithm 3): the storage phase and the query phase.In the storage phase,incoming tensor blocks are stored as slice SVDs.The slice SVDs are then used in the query phase to efficiently decompose tensor blocks.In the query phase,a dense Tucker decomposition is performed on the user-specified query blocks.The decomposition results can be used for further analysis such as anomaly detection.We explain our approach for both phases in detail.
Our storage phase,inspired by the block incremental SVD in FAST[13]and the approximation step of D-Tucker[12],provides a method to efficiently store the incoming tensor blocks.When a new tensor block arrives at the input system,the tensor is first sliced,where the two unfixed indices are time and space.Next,each of the sliced matrices is decomposed by the SVD.Only the SVD results are stored:singular values containing rectangular matricesΣs,left singular vector matrices Us,and right singular vector matrices Vs.
Assuming there areBblocks and the SVD truncation size isR,the space requirement isO(BI1RL),whereI1is the time mode dimension andL=I3×···×INis the number of sliced matrices for the input tensor Xb∈I1×···×IN.IfO(BI1RL)is small enough,they can be stored in memory.Otherwise,the SVD results can be stored on disk for later retrieval.For our experiments,we assume that the SVD results stored in memory are used in the query phase for Tucker decomposition and analysis.
The storage phase is described in detail in the Algorithm 4.In the storage phase,given the spatiotemporal tensor data,the first mode is selected to be time,and the second mode is selected to be space (line 1).Once the two modes are selected,they are not changed during the storage and query phases.Afterward,the data is normalized so that all mode values are scaled,and a tensor is constructed(line 2).For each of the incoming tensor blocks,slice matrices are formed using the two modes,while the indices for the remaining modes are fixed(line 3).Each of theslicematricesis further decomposed with SVD,denoted asslice-SVDs(lines 4-6).Several SVD methods can be used.For smaller and less noisy tensor blocks,exact SVD can be used.For larger,noisy data,faster low-rank SVD methods such as randomized SVDs are recommended.Randomized SVDs are fast and memory efficient,and their low-rank approximation helps remove noise.Since most of the real-world sensor data is large and noisy,we used the randomized SVDs to generate the slice SVDs.We extend this process to each new incoming block of tensors.The process is shown in the storage phase of Fig.1.
In the query phase,core tensor and factor matrices of the queried block tensors are generated on-demand.The stored slice SVDs of blocks in the query time range are retrieved and processed in two steps:initialization and iteration.First,BID-TUCKER constructs a single slice-SVDs by block-wise incrementing the slice-SVDs of the requested time blocks[S,E],which are used to initialize the factor matrices.Our BID-TUCKER then iterates over the factor matrices to find optimal cell values.The core tensor is computed after the last iteration.
3.2.1Initialization
The left figure in the query phase of Fig.1 shows the block-wise incremented SVDs and the initialization of our BID-TUCKER.In the block-wise increment step,the block-wise slice-SVDs are used to generate slice-SVDs that cover the entire query blocks[S,E].
The specifics of block-wise incremented SVD are described in the following.Let[U(S,l);...;U(E,l)];be the left singular vectors of thel-th slice matrix between the start blockSand the end blockE.All left singular vectors in the requested time blocks are concatenated into the block diagonal form as in Eq.(9).
Then thel-th slice matrix corresponding to the time query blocks[S,E]is derived in the following:
The next step in the initialization process is to initialize the cell values for the factor matrices and the core tensor of the query time range for efficient computation.There are two factors to consider when initializing.The first factor is that we do not want to refer to the original tensor values since we discard them after the storage phase.The second factor is that we want to initialize well so that the number of iterations required in the alternating least squares(ALS)procedure is minimal.D-Tucker[12]provides a way to decompose the tensor blocks with slice SVDs,which addresses both purposes.The method showed tolerable reconstruction error and a reduced number of iterations in ALS[12].In fact,our factor matrix initialization and iteration steps follow the initialization and iteration phases of the D-Tucker[12],modified higher-order orthogonal iteration(HOOI)algorithm(Algorithm 1).We rewrite them here for clarity of the proposed BID-TUCKER.
The computational bottleneck of naive HOOI is the computation of a singlen-mode product between the input tensor X and the factor matrices (line 4 of Algorithm 1).Again,there are two problems with the naive HOOI approach: it requires the input tensor X and the space complexity is high.Fortunately,the input tensor X can be approximated with the stored slice SVDs for each block without reconstruction,which also improves the time and space complexity.This is possible by approximating the mode-ntensor with slice matrices and replacing the regular higher-order singular value decomposition(HOSVD)approach used in regular HOOI(lines 5 and 8 of Algorithm 1)with the sequentially truncated higher-order singular value decomposition (ST-HOSVD) [14] approach.ST-HOSVD sequentially assigns the truncated left singular vector matrix of the mode-ntensor as the mode-nfactor matrix.first mode More specifically,the first mode factor matrixA(1)can be obtained by first representing the mode-1 tensorX(1)as follows:
whereLis equal toI1×..×IN,lis as defined in Eq.(3),UΣVTis the SVD result of[U1Σ1;...;UlΣl;...;ULΣL],andblkdiagis the block diagonal form of right singular vectorsVls similar to Eq.(9).With the above mode-1 tensor X(1)representation and application of ST-HOSVD,we can obtain the initial factor matrix for the first mode as A(1)=U(line 8).
The initialization of the mode-2 factor matrix A(2)also follows the ST-HOSVD and respects the order of the tensor operations.As with the first mode factor matrix,A(2)is initialized to be the left singular vectors of the mode-2 matrices of X×1A(1),but without explicit reconstruction of X as in Eq.(12).
whereY(2),inter=A(1)T[U1;...;Ul;...;UL],Lis equal toI3×···×IN,blkdiagis a block diagonal matrix.Transforming the right-most equation in Eq.(12) toand taking the left singular vectors from mode-2 matricized from gives the initial values for A(2)(lines 9-11).
Initialization of the mode-3 factor matrix A(3),following the ST-HOSVD,is also similar.As with the mode-1 and mode-2 factor matrice,A(3)is initialized to be the left singular vectors of mode-3 matricization of X×1A(1)×2A(2)again without explicit reconstruction of X as in Eq.(13).
whereY(2),inter=A(1)T[U1;...;Ul;...;UL],Lis equal toI3×···×IN,blkdiagis a block diagonal matrix.Reshaping the rightmost equation in Eq.(13) toand taking left-singular vectors from mode-3 matricized from gets the initial values for A(3)(lines 15,16,19).
The remaining modesi=4,...,Nfollowing the ST-HOSVD can also be initialized by taking the left singular vectors of mode-iin the matricized form of X ×1A(1)T×2A(2)T×i-1A(i-1)T(line 13).For efficient computation,D-Tucker[12]avoids redundant computation by finding the recursive computation pattern of Yprev×i-1A(i-1)T.That is,the recursive computation of Yprevto mode-3(line 15)is just the mode-i-1 product between Yprevand A(i-1)T(line 18).
3.2.2Iteration
After initialization,to reduce reconstruction error,ALS is performed over the factor matrices to generate optimal cell values for factor matrices A(i)(i=1...N).We follow the iteration procedure of the D-Tucker[12],which follows the HOOI iteration process that(1)approximates the original tensor X with stored SVD results(2)carefully arranges the update for the first-and the second-factor matrices by utilizing SVD results and their transpose(3)and avoid duplicate calculations of the intermediate tensor.
The overall ALS iteration process is shown in Algorithm 6,which we describe in detail.Again,the algorithm and description follow those of D-Tucker[12]which is an approximate form of the HOOI Algorithm 1.HOOI constructs an intermediate tensor that Y byn-mode products of the original tensor and the factor matrices,except for the factor matrix that is to be updated.D-Tucker approximates this process for the updated first-factor matrix by computing the approximate of X×2A(2)Twith the previous factor matrix A(2)Tand transposing the SVD of stitched matrices(lines 3-5).Then,the rest of the factor matrices are multiplied normally (line 10),prior to the factor matrix update (line 11).The second-factor matrix is updated similarly(lines 7-11).Prior to the update of the rest of the factor matrices,intermediate tensor Yprevthat approximates X×1A(1)T×2A(2)Tis calculated for repeated reuse(lines 13,14).Then the algorithm follows the same procedure as the HOOI (lines 16-17).Standard SVD is used for finding the leading singular vector ofY(i)(lines 11,17).
To simplify the theoretical analysis of our BID-TUCKER,we assume that the ranks of the output tensor are allJdimensional and that the dimensionality of a block of the input tensor is as follows:the dimension of the time mode isIand the dimension of remaining modes are allK,whereI≤K≤J>0.
Thestorage phase takesO(BI2KN-2)time andO(BIRKN-2)space to process and storeBblocks of theNthorder input tensor Xb∈I×K×···×K.Because the randomized SVD takes[15]time,and the randomized SVD is performed on each of theL=KN-2slice matrices for a block of the input tensor.The space requirement for the storage phase,given the SVD rankR,isO((IR+RR+KR)KN-2),sinceIR,RR,KRstand for the sizes of the left singular vectors,the singular values,and the right singular vectors of each of theKN-2sliced matrices.The space complexity can be simplified toO(IRKN-2).
Thequery phasetakesO(MNZIKN-2J2)time andO(ZIKN-2J)space,whereIis the dimensionality of the input tensor,Jis the output rank,Zis the number of tensor time blocks processed,Mis the number of iterations,andNis the order of the input tensor.The query phase analysis of BID-TUCKER largely follows the complexity analysis of D-Tucker,which takesO(IKN-2J2) time for initialization,O(MNIKN-2J2)time for iteration,andO(IKN-2J)space for processing a single block,B=Z=1,of the input tensor[12].
After obtaining the core tensors and factor matrices of the desired time range[S,E],they can be used for spatial and temporal analysis.For the temporal analysis,we focus on finding unusual events.To do this,we define the anomaly score and the anomaly threshold based on the clusters of the time mode factor matrix.In a time mode factor matrix,a row contains latent values for the corresponding time point.Finding time-wise anomalies means finding unusual air quality that covers wide locations and features that differ from normal times.
Theanomaly score:for each row of the time mode factor matrix is calculated by computing the minimum distance-to-centroid after clustering the row vectors.An appropriate clustering method can be selected after examining the distribution of the row vectors.We used k-means clustering but other clustering methods can be used.Then,for each cluster,a centroid vector ci∈C is constructed fori=1...K,whereKis the number of clusters.Given a row vector v,the anomaly score is then defined as the distance of the vector to the nearest cluster.Specifically,we used the following score based on Euclidean distance:
Different distance measurements can be used.
Theanomaly thresholdcan be selected,after calculating the anomaly scores,to detect unusual events.Since the distribution of distances follows the Gaussian distribution,we use unbiased estimates of the Gaussian distribution,i.e.,the meanμand the varianceσ,to set the threshold.That is,the thresholdTis calculated as the two standard deviations 2σfrom the meanμof the anomaly values as follows:
We setA=2 and consider events with a higher anomaly score thanTto be unusual.However,to detect more extreme anomalies,a largerAvalue can be used.
We first describe our data set and the experimental setup.We then performed experiments to answer the following questions:
· Ablation Study:How does the error change due to SVD truncation sizes and Tucker ranks in our proposed BID-TUCKER?
· Scalability Test: How scalable is our proposed method compared to D-Tucker and vanilla Tucker decomposition?
· Application:What are the spatial and temporal properties of the air quality tensor that can be revealed by our proposed BID-TUCKER?
First,we describe the data preparation and experimental setup.
AirQuality Stream Data:We used Korean air quality data1https://www.airkorea.or.kr/from 2018.01.01 to 2022.12.31(41616-time points)to construct a 3rd-order temporal tensor for various spatial and temporal analyses.The Air Korea Air Quality Center measures and records particulate matter and air pollutants every hour from widely covered regions of South Korea,including remote islands.There are a total of 629 local monitoring stations with six types of measurements.The six types of measurements are ultrafine particulate matter (PM2.5;μgm2(1 h)),fine particulate matter (PM10;μgm2(1 h)),ozone (O3;parts per million (ppm)),nitrogen dioxide (NO2;ppm),carbon monoxide (CO;ppm),and sulfur dioxide(SO2;ppm).
Constructing Air Quality Tensor:After obtaining the raw air quality data,we constructed a spatiotemporal tensor.
First,a min-max normalization was performed for each measurement type after missing values were handled.Missing value handling is important in Tucker decomposition methods that use SVDs,where each input cell value is used to find the singular vectors and values.A small number of missing values can either be filled with zero or interpolated with mean values.When a large amount of data is missing.Then the resulting singular vectors and values may be misleading.Thus,for stations with less than 10%missing values,we filled the missing values with the average value of the corresponding normalized air quality measurement type for the corresponding station.Stations with more than 10%missing values were removed,leaving 316 stations.
We then constructed a 3rdorder tensor,where the modes are (time,location,type).Overall,the shape of the obtained (time,location,type) indexed air quality tensor was (41616,316,6).In the experiment,we assumed that the tensor blocks are generated weekly for each hour (7*24=168),so the incoming blocks were set to tensors with a shape of(168,316,6).
Experiment Settings:The study was conducted on a 12-core Intel(R)Core(TM)i7-6850K CPU@3.60 GHz.The following libraries were used with Python version 3.10:NumPy version 1.23.5,scikitlearn version 1.2.1,Pytorch 2.0.0,and Tensorly 0.8.1.
There are two hyperparameters in the BID-TUCKER:SVD truncation size and Tucker ranks.In the first ablation study,the SVD truncation size was varied and the corresponding normalized reconstruction error was measured to observe the effect of the SVD truncation size on the overall reconstruction error.In the second ablation study,Tucker ranks were varied,and normalized reconstruction errors were measured to evaluate the effect of rank sizes on overall reconstruction error.The normalized reconstruction error was calculated as||Xorg-Xrecon||F/||Xorg||F.Again,we assumed that the air quality tensor of(41616,316,6)comes in a tensor stream of the form(168,316,6).
Fig.2 (left) in the blue straight line shows the normalized reconstruction error as the SVD truncation sizes increase from 5 to 50 in the storage phase of the proposed BID-TUCKER.In this test,the Tucker rank was set at(30,30,6).The normalized error started to decrease slowly after the size of 20.Unless otherwise specified,the SVD truncation size was set to 20 in the remaining experiments.
Figure 2:Normalized reconstruction error in blue solid lines and query processing time in red dotted lines over SVD truncation size with tensor ranks(30,30,6)(left)and tensor rank in the form of(x,x,6)with SVD truncation size 20(right)
The Fig.2(right)in the blue straight line shows the normalized reconstruction error for different Tucker rank tuples.The tensor ranks for thetimeandlocationmodes were set equal.The rank for thetypemode was set to 6.The ranks were varied from(5,5,6)to(80,80,6)in increments of 5.A rank size of 30 showed a low error for the air quality tensor.Unless otherwise specified,the Tucker ranks were set to(30,30,6)in the remaining experiments.
In the scalability test,we first tested the scalability according to the SVD truncation size and the Tucker rank.Fig.2(left)in the red dotted lines shows the query processing time as the SVD truncation sizes increase from 5 to 50.The time complexity increased linearly as the SVD truncation size increased.Fig.2(right)in the red dotted lines shows the query processing time for different Tucker ranks.The Tucker ranks for thetimeandlocationmodes were set equal,and the rank for thetypemodes was set to 6.The ranks were varied from(5,5,6)to(80,80,6),increasing by 5.The processing time increased in a near quadratic form which is also evident from the complexity analysis(Section 3.3)that shows the complexity to be proportional to theJ2whereJis the Tucker rank.
In the second part of the scalability test,we answer the question of how scalable our proposed BID-TUCKER is compared to D-Tucker and regular Tucker decomposition.We varied the input tensor sizes and computed the error and query processing time for the proposed BID-TUCKER,D-Tucker,and HOOI-based vanilla Tucker decomposition.Fig.3 shows the normalized reconstruction error on the left and the query processing time on the right.We can see from the left plot that all three methods have similar reconstruction errors.However,as we can see from Fig.3(left)that the error differences are small,less than 0.001.With the comparable loss,we then looked at how scalable each method is to the size of the input tensor.
Figure 3: Comparison of normalized reconstruction error (left) and query processing time (right)of dynamically dense tucker decomposition (DDT) with D-tucker and HOOI-based vanilla tucker decomposition over different tensor sizes.The x axis is the length of the time mode for the input tensor,where the location mode dimension is fixed to 316 and the type mode dimension is fixed to 6
For a fair comparison,we included the processing time for both the storage and query phases and set the query range to the full range for our BID-TUCKER.Fig.3(right)shows that our proposed BID-TUCKER significantly outperformed vanilla Tucker and showed a lower growth rate compared to D-Tucker.The lower growth rate compared to D-Tucker comes from efficiency grained by block incremental SVD used in the storage phase.
Overall,our proposed method scales better than the compared methods,with a tolerable error difference.
We describe our analysis results for the Korean air quality data.First,we used the time mode factor matrix to experiment with whether we could detect unusual air quality.Then,we analyzed whether our method could detect regional similarities in air quality by analyzing the location mode factor matrix.
Temporal Modeling and Anomaly Detection:We clustered the time mode factor matrix of the decomposed air quality tensor for modeling the temporal trends and detecting unusual air quality events in Korean air quality.Specifically,we used the time mode factor matrixAT∈R(41616×30)over the entire data period from the year 2018 to 2022 over the latent dimensions to perform k-means clustering.The optimalk=3 was found by the elbow method,which plots the within-cluster sum of squares (WCSS) against the cluster sizek(Fig.4 (left)).Fig.4 (right) shows the clustering result visualized by t-SNE(t-distributed stochastic neighbor embedding),where each color corresponds to a cluster.
Figure 4:Temporal analysis of air quality data using k-means.The elbow method is used and plotted on the left to find optimal clusters of size three,the color-coded cluster distribution is plotted using t-SNE on the right
The three clusters map well to Korea’s four seasons,each of which has its own characteristics.Specifically,Cluster 1(yellow)maps mainly to summer periods when temperatures and humidity are high nationwide.Ozone alerts are often issued during this time.Cluster 2 (red) mostly maps to the spring and fall seasons.Yellow dust and fine particle alerts are common during this time.Cluster 3 (purple) was mainly associated with the winter seasons.This period is characterized by cold,dry weather across the country,with occasional fine particle alerts.
To detect unusual air quality over time,we calculated anomaly scores as described in Section 4 using the Euclidean distance to the nearest cluster centroid of the time mode factor matrix.We considered anomaly scores above themean+2×standarddeviationto be unusual events.The vertical line in Fig.5 corresponds to the threshold.We took a closer look at some of the unusual time points to see if there were any notable events in the news that could have led to the time point having a high anomaly score.
We found two interesting items,chronic ozone alerts issued during the late spring and early summer periods and large fire events.The period from May to August between the years 2019 to 2022 contains several days with unusually high ozone levels2List of days and times of ozone alerts issued:2019-05-23 25,2019-06-19 21,2019-08-02 05,2019-08-18 20,2020-06-06 09,2020-07-16 17,2020-08-18 21,2021-05-13 19,2021-06-05 09,2021-06-20 25,2021-07-20 24,2022-05-23 25,2022-06-20 22,2022-07-10 16,2022-07-25 26..The time of high ozone concentration was between 13:00 and 17:00.The high ozone concentration during these periods occurred between 13:00 and 17:00,and the concentrations were two to three times higher than on normal days.These chronic events are marked in red in Fig.5.
Figure 5:Temporal anomaly score plot with threshold and selected events.The x axis is the time points in chronological order from 2018-01-01 to 2022-12-31.The vertical line is the threshold at t=mean+2×standarddeviation.Red marks denote chronic high ozone concentration events.Blue marks denote the two construction fire events.Organ marks denote several forest fire events
For fire events,we found several.Two marked cases are as follows: large construction fires and forest fire events.On December 28,2018,there were two construction fires: the Jeonbuk sewage treatment plant fire on 2018-12-28 between 7:00 and 8:00,and the Busan factory fire on 2018-12-28 between 1:00 and 3:00.The anomaly values mean the effect of the fires on air quality lasted for a few days after the events.These events are marked in blue in Fig.5.We also confirmed that large forest fires were detectable by the anomaly values.During the period from March 04,2022 to March 11,2022,there were several small and large fire events throughout the east coast of South Korea.The forest fire events are marked in orange in Fig.5.
In addition,Fig.5 shows that there is a difference in the pattern of anomaly values before and after the COVID-19 pandemic break.Before the pandemic break in December 2019,there were much more high peaks of anomaly air quality than during the pandemic.We speculate that this is related to the decrease in factory and vehicle utilization rates in and around nations during the pandemic.
Spatial Modeling:For the spatial modeling based on the air quality tensor,we performed K-means clustering on the location mode factor matrix AL∈R316×30.Similar to the temporal analysis,we selected the optimalk=3 using the elbow method(Fig.6a)and visualized the clustering result using t-SNE(Fig.6b).
For each of the three clusters,we found distinct local similarities,which we list below.Cluster 1 (yellow) corresponded to the metropolitan areas,including Seoul and Incheon.Cluster 2 (purple)corresponded to the eastern half of South Korea,which is known to have better overall air quality throughout the year.The eastern half is mostly mountainous.Cluster 3 (red) corresponded to the western half of South Korea,which is known to have poor overall air quality compared to the eastern half,even in rural areas.The western half has fewer mountains and more plains.
Figure 6: Spatial analysis of air quality data using k-means.The elbow method is used and plotted on the left to find optimal clusters of size three,the color-coded cluster distribution is plotted using t-SNE on the right
Recognizing the need for efficient storage of spatiotemporal stream data in dynamic modeling,we proposed Block Incremental Dense Tucker Decomposition(BID-TUCKER),which can provide efficient storage and on-demand Tucker decomposition results.Our BID-TUCKER can be applied to any tensor data that comes in tensor blocks where the dimensions of modes stay the same except for the time mode.We also assume only partial time-ranged data needs to be analyzed and modeled.Our BID-TUCKER stores the singular value decomposition(SVD)results of each sliced matrix of a tensor block instead of the original tensor.The SVD results of each time block are concatenated by block incremental SVD[13] and used to initialize the Tucker decomposition[12].We have applied our BID-TUCKER to the Korean Air Quality data for modeling the air quality trends.When applying it to the air quality tensor,we provided an approach to find the optimal SVD truncation size and Tucker rank.We used the same optimal setting to compare and show benefits in the error rate and scalability of our BIDTUCKER to that of D-Tucker,which is an efficient Tucker decomposition method that our BID-TUCKER extends,and the vanilla Tucker method,which is known to give the best error although not efficient.We further applied our BID-TUCKER on the air quality tensor to model the air quality trends and detect unusual events.Also,we were able to verify that there are spatial similarities in the air qualities.Overall,we conclude that our proposed BID-TUCKER was able to provide efficient storage and Tucker decomposition results on-demand for time block tensors.Although we only applied the BID-TUCKER to the air quality tensor,it can be applied to various multi-dimensional spatiotemporal data,and the post-decomposition analysis process can be used for modeling spatial and temporal trends.
Acknowledgement:The authors wish to express their appreciation to the reviewers for their helpful suggestions which greatly improved the presentation of this paper.
Funding Statement:This work was supported by the Institute of Information&Communications Technology Planning&Evaluation(IITP)grant funded by the Korean government(MSIT)(No.2022-0-00369)and by the National Research Foundation of Korea Grant funded by the Korean government(2018R1A5A1060031,2022R1F1A1065664).
Author Contributions:The authors confirm their contribution to the paper as follows:study conception and design: S.Lee,L.Sael;data collection: H.Moon;analysis and interpretation of results: S.Lee,H.Moon,L.Sael;draft manuscript preparation:S.Lee,H.Moon,L.Sael.All authors reviewed the results and approved the final version of the manuscript.
Availability of Data and Materials:The Korean Air Quality Data has been obtained through the Air Korea Insitute portal at https://www.airkorea.or.kr/.Our codes for BID-TUCKER and D-Tucker,written in Python,are provided in the GitHub https://github.com/Ajou-DILab/BID-Tucker.
Conflicts of Interest:The authors declare that they have no conflicts of interest to report regarding the present study.
Computer Modeling In Engineering&Sciences2024年4期