Mingyuan Zhang ,Shenwen Chen ,Lijun Sun ,Wenbo Du ,*,Xianbin Cao ,*
a School of Electronic and Information Engineering,Beihang University,Beijing 100191,China
b National Engineering Laboratory for Comprehensive Transportation Big Data Application Technology,Beijing 100191,China
c Department of Civil Engineering and Applied Mechanics,McGill University,Montreal,QC H3A 0C3,Canada
Keywords:
Air traffic management Flight delay Latent class model Tensor decomposition
ABSTRACT In air traffic and airport management,experience gained from past operations is crucial in designing appropriate strategies when facing a new scenario.Therefore,this paper uses massive spatiotemporal flight data to identify similar traffic and delay patterns,which become critical for gaining a better understanding of the aviation system and relevant decision-making.However,as the datasets imply complex dependence and higher-order interactions between space and time,retrieving significant features and patterns can be very challenging.In this paper,we propose a probabilistic framework for highdimensional historical flight data.We apply a latent class model and demonstrate the effectiveness of this framework using air traffic data from 224 airports in China during 2014-2017.We find that profiles of each dimension can be clearly divided into various patterns representing different regular operations.To prove the effectiveness of these patterns,we then create an estimation model that provides preliminary judgment on the airport delay level.The outcomes of this study can help airport operators and air traffic managers better understand air traffic and delay patterns according to the experience gained from historical scenarios.
With the rapid development of the world’s civil aviation industry,severe flight delays have remained a major problem and a cause of inconvenience.Such delays not only discourage passengers from considering air transport or choosing the same airline again[1-4],but also force airlines to bear the additional costs of aircraft maintenance and fleet underutilization[5].Furthermore,flight delays lead to increases in fuel consumption and carbon dioxide emissions,bringing harm to the environment[6,7].Besides the direct impacts listed above,flight delays have a negative influence on various aspects of the overall economy[8].In summary,flight delays constitute a severe and widespread problem with significant negative impacts.
Many factors contribute to the complexity and intractability of this problem.These factors are generally summarized as abnormal weather[9,10]and technical reasons including air traffic control[11],insufficient facility capacity,poor scheduling[12],changes in procedures[13],and limited buffer time[14].This variety of factors makes it difficult to understand the underlying pattern of flight delays and design appropriate strategies[15].Recently,data-driven approaches based on historical observations have been shown to be free from previous constraints and to fit the potential dynamic properties[16].Therefore,one approach to facilitating system cognition and decision-making is to fully utilize and learn from historical data[17].For example,when faced with bad weather,it is possible to look up past days with similar weather conditions and refer to the actions taken by air traffic controllers on those days.Several previous studies[18-23]have focused on identifying fixed patterns in air traffic management.Liu et al.[19]introduced a semi-supervised learning algorithm that can determine groups of similar days as different patterns.The first step was to measure the distance between hourly weather forecasts,followed by identifying the days with a small total distance.The authors applied this method to conduct two case studies at Newark Liberty International Airport and proved its efficiency.Mukherjee et al.[20]proposed a way to classify patterns based on the impact of severe weather conditions.The authors used a weather index as the input and applied factor analysis to identify dominant weather patterns.Then,they clustered days using Ward’s minimum-variance method.Days belonging to the same cluster shared similar weather patterns.Apart from weather patterns,some studies have attempted to identify similar days from other perspectives.Grabbe et al.[21]used a k-means clustering algorithm to identify similar days in terms of ground delay programs.The authors applied an expectation-maximization(EM)algorithm on the start and end times of ground delay programs and scheduled arrival rate data.Other studies have focused on air traffic flow and flight delays to identify similar patterns[24].Gorripaty et al.[17]measured the principal components in demand and capacity data,but found that there were no natural patterns in demand or capacity after applying clustering analysis to the data.By identifying the periodic patterns of arrival delay for non-stop domestic flights,Abdel-Aty et al.[25]found that some patterns were not detected by statistical methods.
Despite these previous efforts,there are still gaps in understanding flight delay patterns.As mentioned above,an effective method is to find clusters or patterns in spatiotemporal historical data[26,27].However,due to the high dimensionality of this data,it is difficult to find distinct patterns in Euclidian space,as mentioned in previous studies[28].Methods of latent component analysis have thus been put forward to reveal the hidden patterns.Instead of directly mining the patterns,methods such as latent distribution analysis[29],latent trait analysis(including item response theory and Rasch models),and grade-of-membership analysis can utilize the signatures derived from the tensor factorization to form a projection subspace with much lower dimensionality,which strengthens the underlying cluster structure of spatiotemporal traffic dynamic patterns[30].These methods open up new frontiers in the field of transportation science[31],such as urban mobility analysis[32-34],traffic speed prediction[35],missing traffic data completion[36,37],and vessel track recovery[38].
Inspired by the methods described above,the main goal of this paper is to understand potential air traffic and flight delay patterns using massive air traffic data.In the first part of this work,we regard flight records as a multivariate observation sampled from a universal distribution and adopt a Tucker-like latent class model to identify principal patterns of each mode.We then propose an estimation model based on potential patterns to show the efficiency of this framework.The remainder of the paper is organized as follows.In Section 2,we introduce the modeling framework.Section 3 demonstrates a case study in which we apply the model to a dataset of the Chinese aviation system,and Section 4 gives conclusions.
In this section,we introduce the overall framework for modeling multi-way flight data in a probabilistic setting.The aim is to characterize the principal patterns of air traffic and flight delays from different dimensions and their joint interactions.In Section 2.1,we introduce the notation used in our framework.In Section 2.2,the nonnegative Tucker decomposition(NTD)method is presented as a method of understanding the tensor decomposition(TD).Finally,the latent class analysis(LCA)is described in Section 2.3.
We let xa=(xa1,...,xaq)Trepresent a flight record a,where q represents the number of dimensions—that is,the number of attributes in a single trip record.To characterize the features of flights,each element xa*may denote the departure airport of the flight(xa1),departure time(xa2),departure date(xa3),delay level(xa4),and so forth.
We map the values into discrete values for convenience.We define xaβ∈{1,...,wq}discrete values starting from 1 for attribute β(β=1,...,q).βdenotes the indices of attributes.wqis the discrete value of the vector of dimension q.Taking the departure airport as an example,xa1=1 represents airport 1 for flight record a.The departure time of a flight is then represented by xa2∈{1,...,24},with each value corresponding to a 1 h period in a day.We then present our method using flight record data from four years(2014-2017),with the days ordered by departure date as xa3∈{1,...,1461}.As the arrival delay is more related to airborne delay,which is difficult to analyze,we consider departure delay rather than arrival delay to investigate the flight delay patterns.Departure delay can better reflect the operation mode of the target airport.In addition,there is a clear definition for the level of departure delays.According to the rules set by the Federal Aviation Administration(FAA)and related research,delayed flights are those that depart more than 15 min later than the scheduled time.Considering practical circumstances,a number of 45 and 90 min later than scheduled are also selected as division thresholds.Thus,we classify the delay of each flight into four levels:①<15 min,②15-45 min,③45-90 min,and④>90 min.We use xa4∈{1,...,4}to represent‘‘on time,”‘‘slight delay,”‘‘moderate delay,”and‘‘serious delay,”respectively.
It has been shown that TD displays a number of advantages in various contexts,especially when data must be decomposed into a sum of additive components[39].TD was first proposed by Tucker[40]in 1963.Since then,NTD has been proposed[41,42]to deal with naturally nonnegative and observation data.NTD is a powerful tool for the extraction of nonnegative-parts-based and physically meaningful latent components from high-dimensional tensor data while preserving the natural multilinear structure of the data[24].In mathematical terms,it decomposes a tensor into a set of matrices and a core tensor.
where A(1),A(2),A(3),...,A(K)are the factor matrices and can be regarded as the principal components of mode K.The entries of? show the level of interaction and connections between different components.In this method,both the core tensor?and the factor matrices A(K)should be elementwisely nonnegative.Elementwise,they are given as follows:
where F is matrix norm.
LCA is a statistical method for finding subtypes of related cases(latent classes)from multivariate categorical data[43,44].The latent class model is written as follows:
whereθ(N)is the probability vectors characterizing the principal patterns of each mode N.
The flight operation data is used as a proxy for the collective features of flights in this analysis.The dataset analyzed in this paper was provided by the Civil Aviation Administration of China(CAAC).Given that the purpose of our study is to provide decision-making support for air traffic and airport management,flight delay conditions are of particular interest.Moreover,air traffic flow is the basis of delay conditions.Therefore,it is natural to choose air traffic flow and flight delays as the main objects of our study.Departure delays are chosen,as they can better reflect the level of congestion of a departure airport and airspace.The categorical values of flight data are shown in Table 1.The database contains 13 492 326 domestic flights.All flights connect 224 airports,among which Beijing Capital International Airport has the most flights,accounting for 6.3%of flights in all airports.The departure date of the flights is from 1 January 2014 to 31 December 2017,and the departure time could be in any hour within a day.As can be seen,all the flights can be divided into four groups according to the departure delay time.The percentage of flights in the four groups are 37%,38%,14%,and 11%,respectively.The average departure delay of these flights is 31.08 min.Canceled flights are ignored in this study.9 February 2016 was the day with the greatest number of flights,which was 12 419.1 January 2014 was the day with the least number of flights,which was 7009.The busiest time period for airports is from 8:00 to 9:00,with 6%of all flights taking off during this time period.
In our research,we assume that flight records are multivariate variables sampled from a universal distribution.The 13 492 326 flights are aggregated to the same tensor.Each observation contains four variables,including departure airport,order of the day during the four years,time of day,and level of delay.The total combinations are 224×1461×24×4.Here,we use a small 3(departure airport,A)×4(order of day in four years,D)×5(time of day,H)×4(level of delay,L)core tensorπto capture the interactions across different modes.Although a bigger core tensor could contain more information and reflect comprehensive relationships among different modes,this small core tensor can facilitate the interpretation of results.Moreover,existing research shows that the results are basically consistent across different sizes of core tensors[48].In the following paragraph,we present the main results based on the core tensor size[3×4×5×4]as an example.The hour,day,airport,and delay-level factor matrix allows us to interpret the pattern due to its distribution profile.
Fig.1(a)depicts hour profiles of the five patterns.The component pattern H1,covering 18.7%of all flights,reveals a gradual rise from 11:00 and reaches its peak at 24:00,followed by a decrease until the early morning.Pattern H2shows similarities to a Gaussian distribution with a peak at 17:00,and covers about 24.4%of all flights.Compared with patterns H1and H2,patterns H3and H5are more concentrated.They increase significantly in the morning and display a sudden drop by noon.Pattern H4reaches its peak at 11:00 and then decreases continuously over the rest of the day.The proportions of H3,H4,and H5are 22.9%,22.2%,and 11.8%,respectively.The day factor matrix has 1461 rows and four columns,which depict the corresponding relationship between original days and day patterns.We analyze the day patterns across different months of the year and days of the week.To do so,we sum up the daily traffic probability along different months and days of the week to identify how the traffic flow interacts with the dates.Patterns across days of the week are shown in Fig.1(b).W is the weekly pattern and M is the monthly pattern.Most traffic observations of pattern W1are concentrated on weekdays,while the peaks of pattern W3are on the two-day weekend.Interestingly,we found that patterns W2and W4also show opposite trends.W2is mainly concentrated on Monday,Saturday,and Sunday,while pattern W4is concentrated on Wednesday,Thursday,and Friday.Fig.1(c)shows patterns of different months.We can observe clear seasonal diversity in that no pattern is equally distributed across the seasons.Pattern M1is mainly concentrated in autumn and winter,while pattern M2is concentrated in winter and spring.Pattern M3is concentrated in spring,and pattern M4is mainly distributed in summer,which is the travel peak of the aviation system.
Table 1 Categorical values of flight data.
Fig.1.Principal patterns in different modes.(a)The Y-axis is the probability of each pattern(column)in the hour factor matrix Probability(H);(b)the Y-axis is the probability of day of the week along with the day factor matrix Probability(W);(c)the Y-axis is the probability along with the day factor matrix Probability(M).
Even though the estimation process does not take any spatial location information into account,we find that these components can be identified by the geographic locations of airports.As the traffic hub of China,Beijing Capital International Airport is obviously prominent in patterns A1and A2.Furthermore,pattern A1is mainly distributed in the southeast,while pattern A2is mainly distributed in the southwest.Pattern A3is mostly composed of airports in the midwest.
Fig.2 shows a composition of the level of delay for each pattern(column)in the delay-level factor matrix.L1,L2,L3,and L4represent flights of different levels,varying from‘‘on time”to‘‘serious delay”(as mentioned in Section 2.1).As can be seen,the delay-level patterns remain almost the same compared with the original tensor,with flights of different delay levels being identified.Fig.2 shows the composition of each delay-level pattern,which covers 40.7%,10.5%,29.1%,and 19.7%of total traffic flow,respectively.It is noted that delay is a manifestation of air traffic congestion,and delay level thus closely interacts with traffic flow characteristics in spatiotemporal factors.To conduct further analysis,we adopt conditional probability across the delay-level mode and other modes to investigate the interactions.We calculate the conditional probability distribution Probability(L|H)by Bayes’theorem.Given the hour patterns,the conditional distribution of delay-level patterns is
Fig.2.Delay-level composition probability for each pattern(column)in the delay factor matrix.
Probability(L|H)shows how the delay-level patterns interact with the hour patterns.As can be seen,pattern L1,denoting the on-time component,strongly interacts with all the hour patterns,which indicates that on-time flights occupy the mainstream at any time of day.To characterize the rules of delay,we now discuss the delay-level patterns excluding L1.H1,denoting the traffic pattern of an increasing trend throughout the entire day,is associated with delay pattern L2,which corresponds to slightly delayed flights.H2is mainly covered by the hour patterns L3and L4,denoting that the high-level delays are highly correlated to the traffic in the afternoon.H3,H4,and H5are mainly covered by L3and L4.This can be explained by the fact that flights during the travel peak in the morning are likely to be seriously delayed due to the heavy air traffic in most airports.
Based on the factor matrices of the day pattern and delay-level pattern,we can find that on-time and slight delay patterns occur the most in M1/W1,M2/W2,and M3/W3;however,pattern M4/W4is very unusual.Along with the week and month dimensions,M4/W4indicates that the traffic is mostly concentrated on weekdays and in the summer,and tends to interact with serious delays.
As shown in Probability(L|A)airport patterns are also correlated with the delay-level patterns.The obvious point is that pattern A3is mainly covered by L1,which may denote that air traffic from the airports in the midwest have fewer delays.Compared with A3,A1and A2are more likely to be seriously delayed.This can be explained by the midwest airports’relatively lighter traffic flow and adequate airspace resources.As shown in the analysis above,delay is heavily influenced by time and space.To further explore the interactions among the patterns,we present the disaggregated distribution along the temporal and spatial dimensions.As can be seen in Fig.3,the value of(L2,H1)is greater than that of the other cells in A1,indicating that when afternoon or evening flights take off from the southeast airports,they are accompanied by slight delays.We also observe a considerable amount of traffic within the A2pattern in(L3,H3),while such traffic rarely appears in A1and A3.This can be explained by the fact that flights from southwest airports during the morning peak often have moderate delays.As mentioned before,due to their excess capacity,the midwest airports(A3)have a low probability of flight delays.This phenomenon is consistent with cells(L1,H2-H4)in A3.
The numerical examples show that the factorization allows us to explain the complex dependence and higher-order interactions based on latent factors.The core tensorπcharacterizes the interactions between different modes in a very efficient and informative manner.This framework helps us to comprehend and interpret the latent interactions and complex dependence among the patterns in large datasets,enhancing our understanding of air traffic management.
(1)Prepare the training data.We prepareρ-dimensional the training data samples along with their class label.
(2)Select the parameters.E:maximum depth of each tree;C:minimum number of samples required to split an internal node;V:the variables at each split;ML:the minimum number of samples required to be at a leaf node.
(3)Grow a classification tree.For b=1 to B,draw a bootstrap sample Z*of size(Sd)from the training data.Use about two-thirds of the original training samples to grow a classification tree.Leave the remaining one-third of the samples as so-called out-of-bag(OOB)samples.
(4)Grow trees.For each bootstrap sample,grow a treeΓb(X)with the following modifications.At each node,pick the best variable/split-point and split the node into two nodes until the sample number of this node is less than C.The tree is grown to the maximum size E and not pruned back.Repeat the steps above until B such trees are grown.
To evaluate the performance of the estimation model,four indices are employed,referring to Eqs.(11)-(14).F1macrois the macro-averaged score and weights all the classes(1,...,u)equally,regardless of how many samples Subelong to a given class u.F1microis the micro-averaged score and weights all the samples equally,thus favoring the performance of common classes.Weighted score finds the average in each label,weighted by the number of true instances for each class.AccuracyOOBis the mean accuracy on each training sample Z*,using only the trees that do not have Z*in their bootstrap sample[52].
Fig.3.Interaction of delay level and hour for different airport patterns.
where Pmacrois the macro-averaged precision,Rmacrois the macroaveraged recall,Precisionuis the number of true positives divided by the total number of elements labelled as belonging to the positive class for a class u,and Recalluis defined as the number of true positives divided by the total number of elements that actually belong to the positive class for a class u.
where Pmicrois the micro-averaged precision,Rmicrois the microaveraged recall,TPuis the true positive rate of class u,FPuis the false positive rate of class u,and FNuis the false negative rate of class u.
where Suis number of samples of class u,S is number of samples.
where TPOOBis the true positive rate of OOB samples,FPOOBis the false positive rate of OOB samples,FNOOBis the false negative rate of OOB samples,and TNOOBis the true negative rate of OOB samples.
For this study,the data of latent spatiotemporal patterns are used.The classification problem involves the identification of four delay levels(on time,slight delay,moderate delay,and serious delay).To estimate how accurately a model will perform in practice,this paper adopts a five-fold cross-validation strategy.One round of cross-validation involves partitioning all the records into five complementary subsets,performing the training process on four subsets,and then validating the analysis on another testing set.Next,the validation results are averaged over the five rounds to give an estimation of the model’s performance.A total of 13 492 326 records are used.As mentioned above,the probability values of the flights in each mode(latent spatiotemporal patterns)are selected as the features;that is,{A1-A4,H2-H5,W1-W4,M1-M4}.Each feature ranges from 0 to 1.
The number of trees is the most important variable.It should be large enough to let the generalization error of the RF converge.In Fig.4,we find that AccuracyOOBchanges from 53.1%to 53.4%when B increases from 100 to 150,and increases slightly when B is greater than 150.This indicates that the RF classifier is almost insensitive to the increase of B when it is greater than 150.To obtain a better parameters set,a grid search method is used in this paper.Several combinations of other parameters are tried to determine the optimal values for this model.
The overall performance and the accuracy of individual classes are given in Fig.5.As prior studies have noted,delays are mostly determined by dynamic operational factors.It is not possible to give a judgment much earlier than the actual occurrence of the delay[53,54].Our model only considers the latent patterns revealed through basic flight information,which may not be viewed optimistically.In this case,the RFs achieved more than 50%in all performance indices and achieved accuracies of 60.0%,46.0%,44.0%,and 65.0%,respectively,for the individual classes,which may be a very positive sign.It signifies that the probability of different levels of flight delay can be estimated preliminarily based only on the time and airport information.Furthermore,this application supports the effectiveness of the potential patterns.We can also see that the‘‘on time”pattern and‘‘serious delay”pattern are classified more accurately due to their distinctive features.Even if the algorithm has more difficulty in classifying the‘‘slight delay”and‘‘moderate delay”classes,the deep color near the diagonal of the confusion matrix indicates that the wrong estimations have a small bias.
Fig.4.Model performance with different numbers of trees.
Fig.5.Classification results.(a)Overall performance through different indices;(b)the confusion matrix.
In this paper,we develop a probabilistic factorization framework that transforms massive flight record data into a spatiotemporal tensor.Our purpose is to investigate the spatiotemporal dynamic patterns of air traffic and flight delays.We assume that each flight observation is a sample generated from a universal joint distribution.Then,we formulate this joint task using nonnegative tensor factorization,which has been shown to be a useful analytical tool for high-dimensional massive data.The results show that clear patterns are identified in different modes.The interactions of different modes are also characterized in the core tensor,which interprets the relationship between delays and spatiotemporal patterns.The detailed analysis clearly shows that serious delays are strongly correlated to afternoon,weekday,and summer traffic.Flights from midwest airports have a low probability of delay at any time of day.
This framework can shed light on the modeling of flight delays by providing an understanding of flights in both the spatial and temporal dimensions.Moreover,the potential patterns have been shown to display a certain effect on delay estimation.As the integration of space and time information,potential patterns can give positive estimation results regarding the delay level.In the context of highly dynamic environments and the complexity of delays,this output gives us a new comprehension of how air traffic and flight delays interact with time and space.This framework provides a deep understanding of massive aviation data through the latent class model and probabilistic factorization method.The outcomes of this study can help airport operators and air traffic managers to better prepare air traffic and airport arrangements based on the experience gained from historical scenarios.Further analysis can be performed to tackle the challenge of understanding interactions involving more factors,such as weather and air route attributes.
Acknowledgements
The authors would like to thank Yu Zhang from the University of South Florida and Shenqi Zhang from Carnegie Mellon University for technical support and suggestions,which helped a lot to improve the quality of this work.This paper is supported by the National Key Research and Development Program of China(2019YFF0301400)and the National Natural Science Foundation of China(61671031,61722102,and 61961146005).
Compliance with ethics guidelines
Mingyuan Zhang,Shenwen Chen,Lijun Sun,Wenbo Du,and Xianbin Cao declare that they have no conflict of interest or financial conflicts to disclose.