Yutong CHEN, Minghu HU, Yan XU, Lei YANG
a College of Civil Aviation, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
b School of Aerospace, Transport and Manufacturing, Cranfield University, Cranfield MK43 0AL, United Kingdom
KEYWORDS: Air traffic flow management; Demand and capacity balancing; Deep Q-learning network; Flight delays; Generalisation; Ground delay program; Multi-agent reinforcement learning
Abstract: Reinforcement Learning (RL) techniques are being studied to solve Demand and Capacity Balancing (DCB) problems so as to fully exploit their computational performance. A locally generalised Multi-Agent Reinforcement Learning (MARL) method for real-world DCB problems is proposed. The proposed method can deploy trained agents directly to unseen scenarios in a specific Air Traffic Flow Management (ATFM) region to quickly obtain a satisfactory solution. In this method, the agents of all flights in a scenario form a multi-agent decision-making system based on partial observation. The trained agent with a customised neural network can be deployed directly on the corresponding flight, allowing the agents to solve the DCB problem jointly. A cooperation coefficient is introduced into the reward function to adjust an agent's cooperation preference in the multi-agent system, thereby controlling the distribution of flight delay time allocation. A multi-iteration mechanism is designed for the DCB decision-making framework to deal with problems arising from non-stationarity in MARL and to ensure that all hotspots are eliminated. Experiments based on large-scale, high-complexity real-world scenarios are conducted to verify the effectiveness and efficiency of the method. From a statistical point of view, it is shown that the proposed method generalises within the scope of the flights and sectors of interest, and that its optimisation performance outperforms standard Computer-Assisted Slot Allocation and state-of-the-art RL-based DCB methods. A sensitivity analysis preliminarily reveals the effect of the cooperation coefficient on delay time allocation.
Recently, one of the major concerns in the global development of civil aviation has been the growing imbalance between increasing total traffic volume and saturated airspace accommodation, also known as the demand-capacity mismatch. If demand exceeds capacity in a sector for a certain period, hotspots will arise, resulting in increased loads on controllers, congestion in airspace, and flight delays.1 As a result, balancing demand and capacity has become a vital issue for the aviation industry. DCB is one of the seven operational concepts of Air Traffic Management (ATM).2 According to Single European Sky ATM Research (SESAR), DCB will play an essential role in the future air traffic management system as part of network management and can help to reduce flight delays.3,4
DCB is also known as ATFM or Air Traffic Flow and Capacity Management (ATFCM).5,6 Depending on the advance time of operation, ATFM is divided into strategic (one year to one week), pre-tactical (one week to one day), and tactical (day of operation) phases.7 The main focus of this paper is on ATFM implemented on the day before (D-1) or on the day (D-0) of operation. Typical operational measures of ATFM include the Ground Delay Program (GDP),8,9 rerouting,10,11 separation management12,13 and their combinations.14,15
ATFM methods are classified into two types based on their solution approach: exact solution methods16,17 and approximate solution methods.18,19 In general, the advantage of exact solution methods is that they obtain a globally optimal solution. However, when the problem is too large, such methods cannot guarantee that the solution will be completed in a limited time. Besides, the computing time is highly case-dependent and can vary considerably between DCB problems of similar scale.16,17 As a result, exact solution methods are hardly ever applied in practice. On the other hand, Computer-Assisted Slot Allocation (CASA), an approximate algorithm, is usually used in practice. CASA is commonly used in Europe, and it is similar to the Ration By Schedule (RBS) approach applied in the United States. Approximate solution methods typically employ a heuristic framework or algorithm to find a locally optimal solution in a reasonable amount of time, and their computation time is less sensitive to problem scale than that of exact solution methods. However, locally optimal solutions are often not readily accepted because there is frequently a significant gap between the local and the global optimum. Thus, a DCB method capable of obtaining solutions with high optimisation performance in a short time is highly desired.
In recent years, reinforcement learning techniques have gradually been applied to DCB problems in search of a good balance between computing speed and optimisation performance. RL methods train agents to obtain strategies from a large number of training scenarios and then deploy the trained agents to an actual scenario to make quick decisions, so that a solution can be obtained in a short time. This is equivalent to transferring a significant amount of solution time to the training stage, allowing a quick response to actual scenarios in operation. Hence, RL-based methods have the advantage of faster computation compared with exact solution methods. Moreover, compared with approximate solution methods, RL-based methods have the potential to obtain a better approximate solution, because approximate solution methods such as CASA are rule-based algorithms designed from human experience, which is likely to limit optimality. RL methods have only begun to be explored in the field of DCB. However, to the best of our knowledge, it is challenging for existing RL-based DCB methods to solve large-scale, high-complexity real-world DCB problems in a short time by directly deploying trained agents. For some existing RL-based DCB methods, trained agents cannot be deployed to unseen scenarios at all because the model is not scalable. For others, it is not trivial to achieve satisfactory solutions in unseen scenarios because of the model design structure (please refer to Table 1 (Refs. 32–42) and the corresponding discussion for details). Therefore, these methods must retrain the agents to effectively solve the DCB problem for an unseen scenario. However, training agents is time-consuming, which offsets the advantage of RL methods.
Therefore, we propose a locally generalised MARL method for real-world DCB problems to fill this gap. Our method can deploy trained agents directly to unseen scenarios in a specific ATFM region to obtain a satisfactory solution quickly (where 'locally' corresponds to 'a specific ATFM region'). In this method, a distributed decision system is constructed by deploying a customised neural network on each flight to handle the specificity of each flight in DCB problems. A cooperation coefficient is introduced to control the degree of cooperation between flights. A multi-iteration mechanism is designed to deal with problems arising from non-stationarity in MARL.
This paper is organised as follows. In the rest of Section 1, we provide a brief review of related work and introduce the features and contributions of our method. Section 2 introduces notations and formulates the DCB problem. In Section 3, we discuss the construction of the reinforcement learning environment, the architecture of the network, the training method of the neural network, and the multi-iteration mechanism. In Section 4, we show the MARL training process, the performance test results, and the cooperation coefficient sensitivity analysis through simulation experiments. Finally, conclusions and future work are summarised in Section 5.
Much research has been done on DCB problems, and some review articles provide good summaries of the current research progress.20–22 RL methods have been explored in various fields of ATM, such as conflict resolution,23–25 collision avoidance26–28 and assistant control tools,29–31 because they can solve a problem in a short time once training is complete. Due to this advantage of responding quickly in real scenarios, RL methods have excellent research value and application potential for real-world DCB problems. Several representative RL-based DCB methods are summarised in Table 1 (Refs. 32–42), and the relevant explanations for this table are given as follows:
(1) Agent: Artificial intelligence with decision-making ability.
- Number: The number of agents in the system, where S and M refer to single and multiple, respectively.
- Role: The agent’s role in the DCB problem, where C and F refer to controllers and flights, respectively.
- Mode: The operating mode of agents, where C and D refer to centralized and decentralized (distributed), respectively.
(2) RL method: The method used to train the agent's policy, e.g., Q-table, Temporal-Difference (TD) learning, Q-learning, Deep Q-Learning (DQN), Proximal Policy Optimisation (PPO) and Asynchronous Advantage Actor-Critic (A3C).
Table 1 Features of DCB methods based on RL (Refs. 32–42).
(3) Sharing policy: Whether agents share the neural network parameters in the multi-agent system.
(4) Generalisation: Whether the method is generalised in a specific ATFM region. L1 means that the trained agent can be deployed directly to unseen scenarios (the model is scalable), but there is no guarantee that a satisfactory solution will be obtained. L2 means that, in addition to L1, the trained agent can obtain a satisfactory solution.
(5) ATFM method: Operational ways of ATFM, where GDP refers to ground delay program, and MIT refers to Miles-in-trail (a kind of separation management).
(6) Sector opening scheme: Whether the sector structure changes over time, as in the real world.
(7) Uncertainty: Whether to consider the uncertainty of demand and capacity forecasts and the uncertainty of flight.
(8) Elimination: Whether all hotspots can be guaranteed to be eliminated.
(9) Experimental scenario: The most complex and realistic experimental scenario in the study.
- Real-world: Whether the experiment was based on real-world data, including flights and sectors.
- Hotspot: The initial number of hotspots.
- Flight scale: The number of flights (rounded to the nearest 100).
- Sector scale: The number of sectors.
(10) Symbol descriptions: the checkmark (✓) and circle (○) respectively mean that the method has and does not have the corresponding feature. N/A means that the feature or parameter is not applicable to the method or is not disclosed in the study.
Please note that if there are several extended methods (or variants of the basic method) introduced in a study, only the one with the highest comprehensive performance is shown in the table.
Crespo et al.32 trained a single agent through RL to centrally assign flight delays to several airports (all flights at the same airport are delayed by the same time). Due to the limitation of the Q-table, this agent can only be deployed for problems with specific sectors and airports. Agogino and Tumer33 deployed agents on several route nodes around overloaded sectors, forming a distributed multi-agent system and employing the Miles-In-Trail (MIT) method to adjust the distance between flights passing through the same node for ATFM purposes. Kravaris et al.34 proposed a collaborative RL framework for DCB, which deploys an agent on each flight. In Kravaris' method, each flight makes decisions independently without communicating with other flights. However, the effectiveness of this method has only been demonstrated in one tiny toy case. Spatharis et al.35,36 further enhanced Kravaris' method and verified its effectiveness in real-world scenarios with high complexity. Duong et al.37 presented a blockchain-based RL method for ATFM. Like Crespo's method, the agent in Duong's method plays the role of a controller that assigns delays to flights at its corresponding airport, and each agent is responsible for only one airport. Spatharis et al.38,39 proposed and extended a hierarchical MARL scheme for DCB problems, which constructs two states for agents: the ground state and the abstract state. A common shortcoming of the six studies mentioned above is that they all employ RL algorithms as search algorithms; that is, the training process is used as the solving process. Therefore, these methods are not generalised, and the advantage of reinforcement learning in terms of rapid response is not available to them. Besides, because of the non-stationarity in MARL, they cannot guarantee that the agents will always be able to eliminate all hotspots. Thanks to the development of deep neural networks in reinforcement learning applications, DCB methods with deep neural networks have been further enhanced. Chen et al.40 validated the effectiveness of the Deep Q-learning Network (DQN) in an RL-based DCB method. However, the experiments were not based on real-world scenarios, and the generalisation of the method was not verified. Tang and Xu41 integrated an action supervisor into a rule-based time-step DCB method using Proximal Policy Optimisation (PPO). Tang's method forces the agent to change its action when it chooses an action that would cause a hotspot, ensuring that hotspots are eliminated. However, the addition of the action supervisor leads to a significant increase in delays. Huang and Xu42 presented a hotspot-based DCB method with Asynchronous Advantage Actor-Critic (A3C), which ensures that hotspots are eliminated by allocating multiple delays. Although Tang's and Huang's methods attempt to deploy trained agents to unseen scenarios, they cannot be considered highly generalised due to potential flaws in the model design. In Tang's method, the observation matrix seems so sparse that it limits agent learning. In Huang's method, sector IDs are used as part of the agent observations, and an ID as a code is not computationally meaningful.
In summary, it is difficult to deploy agents trained by the existing methods in Table 1 directly to an unseen scenario and obtain a satisfactory solution quickly. Therefore, there is an urgent need to improve the generalisation of RL-based DCB methods.
This paper serves as an extension of our previous study.40 In this paper, a locally generalised MARL method for DCB in which each agent has a customised neural network is proposed (the features of our method are also summarised in Table 1).
Generalisation is one of the most critical indicators of RL methods. For the DCB problem, however, generalisation presents a dilemma. On the one hand, the scale of each DCB problem (the number of flights or sectors) is different, while the dimension of the observation matrix in an RL method is usually required to be fixed. On the other hand, although the limitation on the observation matrix could perhaps be removed technically, completely homogenising individuals in large-scale DCB problems would significantly decrease the solver's optimisation performance and make it difficult to meet the differentiated preferences of different flights. Hence, a balance needs to be found in this dilemma to maximise the advantages of the RL method.
Considering that flight schedules for commercial flights are usually cyclical, we can train an agent with a customised neural network for each flight schedule and deploy it in any scenario that contains that flight. Our method thus achieves local generalisation together with the optimisation performance improvement brought by heterogeneous individuals in the multi-agent system. We introduce a cooperation coefficient into the reward function to adjust the cooperation preference of each flight, thereby adjusting the distribution of the global delay time allocation. Our neural network outputs only two discrete actions. We employ Rainbow DQN,43 a state-of-the-art DQN-based algorithm that significantly improves RL efficiency by integrating multiple DQN enhancement techniques. Besides, we design a multi-iteration mechanism for the DCB decision-making framework to deal with problems arising from non-stationarity in MARL, thereby enabling the solver to eliminate all hotspots. The main contributions of this paper are summarised as follows:
(1) Trained agents of the proposed method can be deployed directly to unseen scenarios in a specific ATFM region to obtain a satisfactory solution quickly.
(2) A cooperation coefficient is introduced to adjust the distribution of flight delay time allocation.
(3) A multi-iteration mechanism is designed for the DCB decision-making framework to enable the solver to eliminate all hotspots.
(4) Systematic experiments based on large-scale, high-complexity real-world scenarios are performed to verify the proposed method's effectiveness and efficiency.
The demand and capacity balancing problem handled in this article is to ensure that the number of flights entering each sector per unit time does not exceed its capacity (that is, all hotspots are eliminated) by implementing a ground delay program before operation.
All initial flight schedules can be obtained one day before flight operation. The flight schedule of the $i$th flight is denoted by $f_i$, which consists of a set of binary arrays, as shown in Fig. 1:
$$f_i = \{(e_{ij}, t_{ij}) \mid j = 1, 2, \ldots, M_i\}, \quad i \in I$$
where $e_{ij}$ denotes the $j$th sector through which the $i$th flight is scheduled to pass, $t_{ij}$ denotes the time of entry into that sector, $M_i$ denotes the number of sectors through which the $i$th flight is scheduled to pass, and $I$ denotes the set of flights. The set of initial flight schedules $F_{\rm Initial}$ is represented as
$$F_{\rm Initial} = \{f_i \mid i \in I\}$$
Fig. 1 Flight schedule.
For example, if the width of the time windows τ is 20 min, the time range of the 10th time window is [180, 200), which corresponds to 03:00–03:20 of the operation day (not including 03:20).
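To make the indexing explicit (our notation, not an equation from the paper): a time $t$, measured in minutes from the start of the operation day, falls into time window
$$k = \lfloor t/\tau \rfloor + 1, \qquad \text{where window } k \text{ covers } [(k-1)\tau,\ k\tau)$$
With τ = 20 min, any $t$ in [180, 200), e.g. $t$ = 185, gives $k$ = 10, matching the example above.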
In practice, sectors are divided into two types, elementary sectors and collapsed sectors. An elementary sector is the most fundamental sector unit in the airspace, while a collapsed sector comprises several adjacent elementary sectors. The basic sector units operate as elementary sectors or collapsed sectors depending on the sector opening scheme. In this paper, the sector opening scheme is considered, and the state of the sectors changes over time windows.
Fig. 2 Sector opening scheme.
In summary, the DCB problem in this paper is defined as balancing the traffic demand with the airspace capacity through shifting flight take-off time to avoid hotspots.The sector opening scheme is considered.The optimisation objective is to minimise the total delay time for all flights.
Fig. 3 Multi-agent decision-making framework (top-level framework of the proposed method), in which all agents are distributed.
To find an optimal policy, we adopt the objective of minimising the expectation of the average delay time of all flights in the same scenario, which is defined as
$$\min\ \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N} d_i\right]$$
where $N$ is the number of flights and $d_i$ denotes the delay time assigned to the $i$th flight. The average delay time will also be an essential metric to evaluate the trained policy in Section 4.
In this section, we first introduce the key ingredients of our reinforcement learning setup. Then, the architecture of the policy network based on a neural network is detailed. Next, we present the MARL training algorithm based on Rainbow DQN and the design of the training scenarios. Finally, we discuss the multi-iteration mechanism.
We assume all flights in the environment are heterogeneous in order to tackle the DCB problem with the MARL method. Each flight has a different aircraft type, route, and preferences. More importantly, most commercial flight plans generally repeat every week, and there is a limited number of flight routes in the real world. Thus, we treat all flights over time as a set of flights based on historical data and train an agent for each flight using reinforcement learning methods. We can then solve DCB problems defined on any subset of this set of flights by deploying the corresponding agents on the flights. Based on the problem formulation in Section 2.2, the DCB problem can be transformed into a Partially Observable Markov Decision Process (POMDP) solved with MARL. A POMDP can be described as a six-tuple $(S, A, P, R, \Omega, O)$, where $S$ is the state space, $A$ is the action space, $P$ is the state-transition model, $R$ is the reward function, $\Omega$ is the observation space ($o \in \Omega$), and $O$ is the observation probability distribution given the system state ($o \sim O(s)$), where $s$ is the global state. An agent is deployed on each aircraft in this environment, and all the agents form a distributed decision-making system. Therefore, a multi-aircraft state-transition model $P$ determined by the aircraft's delay times and uncertainty is unnecessary. The action space $A$, the observation space $\Omega$ and the reward function $R$ are defined below.
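As an illustration of this distributed setting, the sketch below shows how a per-flight agent could expose its decision step. The class and variable names are hypothetical, and the content of the local observation is an assumption for illustration (the actual observation design is described later in this section); only the two-action structure is taken from the method itself.

```python
# Minimal sketch (hypothetical names) of a per-flight decision step in the
# distributed multi-agent setting: each flight carries its own trained network
# and chooses between two discrete actions based on a partial observation.
from enum import IntEnum
import torch

class Action(IntEnum):
    DEPART_NOW = 0        # take off in the current time window
    DELAY_ONE_WINDOW = 1  # shift departure to the next time window

class FlightAgent:
    def __init__(self, policy_net: torch.nn.Module):
        self.policy_net = policy_net  # customised network for this flight

    def act(self, local_obs: torch.Tensor) -> Action:
        # Greedy action from the flight's own Q-network (o ~ O(s), partial view).
        with torch.no_grad():
            q_values = self.policy_net(local_obs.unsqueeze(0))  # shape: (1, 2)
        return Action(int(q_values.argmax(dim=1).item()))
```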
3.1.1. Action space
3.1.3. Reward design
3.3.1. Training algorithm
Because the proposed reinforcement learning model's action space is discrete with only two actions, DQN is ideal as an efficient reinforcement learning algorithm for discrete actions.45 In recent years, several DQN performance improvement techniques have been proposed. Inspired by a recently proposed advanced DQN method, Rainbow DQN,43 we adaptively employ and combine Double DQN,46 Duelling DQN47 and a Prioritised Replay Buffer48 to enhance the performance of DQN.
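The following PyTorch sketch shows two of the network-side techniques just named, a duelling Q-head and a Double DQN target. The layer sizes and names are assumptions for illustration and do not reproduce the exact architecture in Fig. 4.

```python
# Sketch of a duelling Q-head and a Double DQN target under assumed layer
# sizes (obs_dim, hidden); not the paper's exact architecture.
import torch
import torch.nn as nn

class DuellingQNet(nn.Module):
    """Duelling DQN head: Q(o, a) = V(o) + A(o, a) - mean_a A(o, a)."""
    def __init__(self, obs_dim: int, n_actions: int = 2, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # state-value stream
        self.advantage = nn.Linear(hidden, n_actions)  # advantage stream

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = self.trunk(obs)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=1, keepdim=True)

def double_dqn_target(reward, next_obs, done, online_net, target_net, gamma=0.9):
    # Double DQN: the online network selects the next action, while the target
    # network evaluates it, reducing over-estimation of Q-values.
    with torch.no_grad():
        next_action = online_net(next_obs).argmax(dim=1, keepdim=True)
        next_q = target_net(next_obs).gather(1, next_action).squeeze(1)
    return reward + gamma * (1.0 - done) * next_q
```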
Fig. 4 Architecture of policy network, showing the working process of an agent and focusing on the neural network.
Algorithm 1. Learning to solve DCB problems.46–49
The learning algorithm for DCB is summarised in Algorithm 1 (Refs. 46–49). In this algorithm, each episode consists of two parts: one is the simulation and collection of transitions (lines 4–29), and the other is the policy update (lines 31–37).
As agents in our MARL model are distributed, several parts of the algorithm can run in parallel, including lines 7–16, lines 18–22 and lines 31–37.
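A high-level skeleton of this episode structure might look as follows. The environment rollout and the per-agent update routines are hypothetical helper names standing in for the corresponding lines of Algorithm 1, which remains the authoritative definition.

```python
# Skeleton of one training episode (hypothetical helper names).
import random

def train(scenarios, agents, rollout_fn, n_episodes, target_update_cycle=100):
    """agents: dict mapping flight id to its agent; rollout_fn(scenario, agents)
    is an assumed helper yielding (flight_id, transition) pairs."""
    for episode in range(1, n_episodes + 1):
        scenario = random.choice(scenarios)        # Algorithm 1, line 4
        # Part 1: simulate and collect transitions (lines 4-29); per-flight
        # decisions are independent, so this can be parallelised across flights.
        for flight_id, transition in rollout_fn(scenario, agents):
            agents[flight_id].replay_buffer.add(transition)
        # Part 2: per-agent policy update (lines 31-37); also parallelisable.
        for agent in agents.values():
            agent.update_from_replay()             # prioritised sampling + DQN loss
            if episode % target_update_cycle == 0:
                agent.sync_target_network()
```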
Fig. 5 French and Spanish sectors.
3.3.2. Training scenarios
This paper’s scenario data consists of the sector opening scheme and flight schedule.We use the real data of French and Spanish airspace, as shown in Fig.5.Specifically, 12187 flights on 2019-08-27 and 12138 flights on 2019-08-28 are selected for training (a total of 24325 flights), and a unique agent is deployed on each of them.The sector opening schemes for each day in August 2019 (31 sector opening schemes in total) are used for training scenarios.Three hundred ninetysix sectors are considered, including elementary sectors and collapsed sectors.Based on the real data mentioned above,each training scenario comprises a set of flight schedules and a sector opening scheme for any day.Thus,there are 62 different training scenarios in total, and one of them is randomly selected in each episode for training(refer to Algorithm 1,line 4).
Due to the non-stationarity in MARL, the trained agents in our model do not guarantee that hotspots will be completely eliminated in just one iteration. One iteration means that the trained agents make action decisions for all flights of the day in the order of time windows until every flight is assigned a departure time or delayed to the next day. Note that, to date, there are few proven methods that handle non-stationarity in MARL effectively.50 Therefore, a multi-iteration mechanism is introduced to deal with the problem arising from the non-stationarity in MARL. If there are still hotspots that have not been eliminated after the first iteration, multiple iterations are performed based on the assigned departure times until all hotspots are eliminated. Please note that the iterations following the first iteration differ from the first iteration. For legibility, we call an iteration after the first iteration a subsequent iteration. The framework of the subsequent iteration in the multi-iteration mechanism is shown in Fig. 6. In each time window of a subsequent iteration, flights are divided into two categories:
(1) One category is flights (represented by Flight A in Fig. 6) that do not cause hotspots based on the currently assigned departure time. Their departure time remains unchanged; they are scheduled to take off in the current time window.
(2) The other category is flights (represented by Flight B in Fig. 6) that cause hotspots based on the currently assigned departure time. Through its trained neural network, the agent deployed on such a flight decides whether the flight takes off in the current time window or is delayed to the next time window.
Please note that this multi-iteration mechanism is only used when trained agents solve a DCB problem, not during training. In training, only one iteration is performed. In solving, the first iteration (refer to Fig. 3) is performed first. If, after the first iteration, there are still hotspots that have not been eliminated, subsequent iterations (refer to Fig. 6) are performed until all hotspots have been eliminated.
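The solving procedure with the multi-iteration mechanism can be sketched as follows. The environment helpers are hypothetical names, and the iteration cap is our own safeguard for illustration rather than part of the method.

```python
# Sketch of solving with the multi-iteration mechanism (hypothetical helpers).
# After the first iteration, only flights whose assigned departure time still
# causes a hotspot are asked to act again; all other flights keep their times.
def solve_with_multi_iteration(env, agents, max_iterations=100):
    schedule = env.first_iteration(agents)            # initial departure assignment
    for _ in range(max_iterations):
        hotspot_flights = env.flights_causing_hotspots(schedule)
        if not hotspot_flights:                       # all hotspots eliminated
            return schedule
        # Subsequent iteration: proceed time window by time window.
        for time_window in env.time_windows():
            for flight in hotspot_flights.in_window(time_window):
                obs = env.observe(flight, schedule)
                action = agents[flight.id].act(obs)   # depart now or delay one window
                schedule = env.apply(flight, action, schedule)
    return schedule
```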
In this section, we first introduce the training setup and present the training results. Next, we test the proposed method's generalisation and compare it with state-of-the-art RL-based DCB methods. Finally, we discuss the sensitivity analysis of the cooperation coefficient.
The neural networks are constructed with the PyTorch library in a Python environment and are trained on a computer with one i5-8300H CPU. The Adam optimisation algorithm51 updates the network parameters. We determine the values of the hyper-parameters in Algorithm 1 through tuning experiments, and they are summarised in Table 2. Specifically, the exploration rate ε is set to 0.95; the buffer capacity $n_{\rm BC}$ is set to 200; the batch size $n_{\rm BS}$ is set to 20; the discount rate γ is set to 0.9; the learning rate α is set to $1\times10^{-3}$; and the target network update cycle $n_{\rm TN}$ is set to 100. In each training episode, one of the 62 scenarios in the training dataset is randomly input to the reinforcement learning environment as the training scenario (refer to Section 3.3.2 for details).
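For reference, the tuned values from Table 2 can be collected in a single configuration object (the variable names are ours):

```python
# Hyper-parameter values from Table 2, gathered as a plain dictionary.
HYPER_PARAMS = {
    "exploration_rate": 0.95,     # epsilon
    "buffer_capacity": 200,       # n_BC
    "batch_size": 20,             # n_BS
    "discount_rate": 0.9,         # gamma
    "learning_rate": 1e-3,        # alpha (Adam)
    "target_update_cycle": 100,   # n_TN, in episodes
}
```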
To study the training process and the effectiveness of each DQN enhancement technique in our model, we introduce and compare the training results of five models: Model 1, Model 2, Model 3, Model 4 and Model 5. The features of each model are summarised in Table 3, where the checkmark (✓) means that the model uses the corresponding DQN enhancement technique and the circle (○) means that it does not.
Fig. 6 Framework of subsequent iteration in multi-iteration mechanism.
Table 2 Hyper-parameter settings for training algorithm.
Table 3 Model features.
The training results are shown in Fig. 7. The light curves are the actual result data, and the dark curves are fitted to them with a 90% smoothing coefficient for legibility. The trend of these curves is the same: after a short fluctuation, each rises rapidly, then tends to stabilise and gradually converges. Model 1 and Model 2 grow the fastest and converge around the 800th episode. The other three curves stabilise around the 1600th episode. In terms of performance, according to the average reward, Model 1 is the best and Model 2 is the second best. Model 3 and Model 4 are nearly identical. Model 5 is significantly inferior to the others. In summary, the training of our neural networks is effective, and the DQN enhancement techniques in our model effectively improve training efficiency and model performance.
Fig. 7 Training result.
4.2.1. Benchmark and test scenario
To verify the optimisation performance of the proposed MARL method, we use CASA as a benchmark.52 CASA is a DCB solver based on a heuristic algorithm, and it is often used in actual ATM operations. It has the advantages of requiring little computing time and being simple to use in the real world. CASA is a general term for a class of methods whose specific algorithms are often fine-tuned to suit specific scenarios. For ease of comparison, in this paper we use the standard CASA introduced in the ATFCM Operations Manual (Network Manager) by EUROCONTROL,53 and our Python 3 implementation of CASA is available on our GitHub.
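For readers unfamiliar with the benchmark, the sketch below illustrates only the first-planned-first-served slot allocation principle that CASA shares with RBS, under assumed inputs. It is a simplified illustration, not the operational CASA algorithm used in our experiments (see the cited manual and our repository for that).

```python
# Simplified illustration of first-planned-first-served slot allocation for one
# regulated sector (not the full operational CASA algorithm).
def allocate_slots(flights, capacity_per_window, tau=20):
    """flights: list of (flight_id, scheduled_entry_minute);
    capacity_per_window: maximum number of entries per tau-minute window."""
    delays = {}
    window_load = {}
    for flight_id, t_sched in sorted(flights, key=lambda f: f[1]):
        t = t_sched
        # Push the flight back window by window until a free slot is found.
        while window_load.get(t // tau, 0) >= capacity_per_window:
            t = (t // tau + 1) * tau
        window_load[t // tau] = window_load.get(t // tau, 0) + 1
        delays[flight_id] = t - t_sched               # ground delay in minutes
    return delays
```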
To test the performance of our proposed method in diverse scenarios different from the training scenarios, we use the sector opening schemes of July 2019 and combine the flights of 2019-08-27 and 2019-08-28 to generate test scenarios, as shown in Fig. 8. Three sets of test scenarios are generated for three corresponding tests to verify the generalisation of our method for sector opening schemes, flight schedules and problem scales, respectively. The numbers shown in Fig. 8 represent the number of the corresponding elements. For example, in Test 2 there are 7 sets of flight schedules and 31 sector opening schemes, making up a total of 217 scenarios. Besides, we compare our method with two state-of-the-art RL-based DCB methods. The experimental design and results are given as follows.
4.2.2. Generalisation for sector opening scheme (Test 1)
Fig. 8 Scenario generation for performance tests.
Test 1 aims to verify the generalisation of our method for the sector opening scheme. Although commercial flights are repeated over a certain period, the flight schedules of the corresponding days will not be identical. Therefore, we randomly select 90% of the flights on 2019-08-27 and 10% of the flights on 2019-08-28 to form a new set of flight schedules containing 12182 flights. To differentiate from the training scenarios, we use the 31 sector opening schemes of July 2019 in the test scenarios. The new set of flight schedules and the 31 sector opening schemes form 31 test scenarios (also refer to Fig. 8). Two performance indicators are used to reflect the optimisation performance of the method, namely the number of delayed flights and the average delay time. Please note that the average delay time is the mean over all flights in a scenario. The 31 test scenarios are solved by our proposed MARL method and by CASA. The results are shown in Fig. 9, where the sector opening scheme ID corresponds to the day in July 2019 (for example, 01 refers to 2019-07-01). Compared with CASA, MARL decreases the average delay time and the number of delayed flights by about 72.5% and 74.6%, respectively. Since the performance of our method relative to the benchmark does not vary significantly across the scenarios of Test 1, it can be considered generalised for various sector opening schemes.
4.2.3. Generalisation for flight schedule (Test 2)
Test 2 aims to verify the generalisation of our method for flight schedules. We generate seven sets of flight schedules by combining the flights of 2019-08-27 and 2019-08-28 in different proportions. We use the rate of difference to represent the proportion of non-same-day flights. Assuming that the flights on 2019-08-27 are taken as the main flights in a test scenario and the test set of flight schedules consists of $x$ flights on 2019-08-27 and $y$ flights on 2019-08-28, the rate of difference is $y/(x+y)$. We randomly select flights on 2019-08-27 and 2019-08-28 based on seven specific rates of difference and test them with the 31 sector opening schemes of July 2019; there are 217 test scenarios in total (also refer to Fig. 8). The results of Test 2 are shown in Fig. 10, where C and M in the boxes below the figure represent CASA and MARL, respectively (grey represents CASA and red represents MARL; this representation is also used in Fig. 11), and IQR is the interquartile range. Although the average delay time and the number of delayed flights are smallest on average when the rate of difference is 0, the two performance indicators, when compared with the results of CASA, show no significant uptrend as the rate of difference increases. In other words, the method does not significantly lose performance as the rate of difference of flights rises. Therefore, it is demonstrated that the proposed method can cope with varying combinations of the flights of interest.
4.2.4. Generalisation for problem scale (Test 3)
Test 3 aims to verify the generalisation of our method for problem scale. To obtain several scenarios with different scales, we first select the sector opening scheme of 2019-07-27 and then randomly select a specific number of sectors to form a new sector opening scheme. We generate 6 sets of 10 sector opening schemes in total, with each set containing a different number of sectors. The flight schedule of Test 1 is tested with each opening scheme separately, giving 60 scenarios in Test 3 (also refer to Fig. 8). The results of Test 3 are shown in Fig. 11. In addition to the average delay time and the number of delayed flights, each scenario's computing time and initial conditions are also presented for reference. For the 60 scenarios in Test 3, the initial number of hotspots and the initial number of flights increase with the number of sectors, as shown in Fig. 11(d). As expected, the average delay time and the number of delayed flights increase as the problem scale increases. The growth trends of the two optimisation indicators for MARL are linear, similar to the growth trend of the problem scale. Therefore, the optimisation performance of our proposed method is similar across scenarios of different problem scales, which demonstrates its generalisation for problem scale. As shown in Fig. 11(c), although MARL's computing time is much longer than CASA's in the same scenario, its growth is almost linear rather than exponential as the problem scale increases, and the computing time for the scenario with 300 sectors and more than 10000 flights is about 30 s, which is acceptable for a DCB problem of this scale.
4.2.5. Comparison with state-of-the-art RL-based DCB methods
Fig. 9 Experimental results of method generalisation test for sector opening schemes.
Fig. 10 Experimental results of method generalisation test for flight schedules.
Fig. 11 Experimental results of method generalisation test for problem scales.
This test aims to compare the proposed method with state-of-the-art RL-based DCB methods. Because Tang's41 and Huang's42 methods have the potential for generalisation, we compare our method with them in terms of computing time and average delay time. They are trained in the same way as the proposed method. Because the two methods do not consider sector opening schemes, we upgrade them to deal with DCB problems with sector opening schemes. The comparison is based on ten random scenarios of Test 2, and the experimental results are shown in Fig. 12, where MARL represents the proposed method, T represents Tang's method41 and H represents Huang's method.42 Although this test compares RL-based DCB methods, the performance data of CASA are also added to Fig. 12 for reference. We find that the proposed method outperforms the two state-of-the-art RL-based DCB methods. In terms of average delay time, Tang's method and Huang's method are about 116% and 79% higher than the proposed method, respectively. The reason may be that their methods are not sufficiently generalised for the test scenarios. In terms of computing time, Tang's method and Huang's method are about 43% and 620% higher than the proposed method, respectively. The delay time allocated by the agent in Huang's method is tiny at each iteration, so a considerable number of iterations are needed to eliminate hotspots, which is significantly time-consuming. The decision-making framework of Tang's method is similar to ours; its longer average delay time means that agents take more actions, which is also time-consuming. This tends to explain why the computing time of Tang's method is longer than ours but very close.
A sensitivity analysis on the cooperation coefficient (refer to Eq. (20)) is performed to explore its impact on our proposed method. We re-train another four sets of neural networks in the training scenarios (mentioned in Section 3.3) with different cooperation coefficients. Together with the set of neural networks of the base model, there are five sets of neural networks for the training flights, with the cooperation coefficient set to 0.25, 0.5, 1, 2 and 4, respectively. Then, the five sets of neural networks are deployed on the corresponding flights and tested in the scenarios of Test 1. The experimental results in terms of average delay time and number of delayed flights are shown in Fig. 13. From a statistical point of view, as the cooperation coefficient increases, the average delay time decreases while the number of delayed flights increases. The reason is that the greater the cooperation coefficient, the lower the proportion of the holding penalty in the total reward, and the more likely the flight is to promote the achievement of the global optimisation goal even if its own delay time is longer. Thus, in practice, an appropriate cooperation coefficient should be set according to the actual needs of air traffic operations to balance the average delay time and the number of delayed flights.
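To make this mechanism concrete, an illustrative reward form with the structure described above (our notation for illustration only, not the paper's Eq. (20)) is
$$r_i = c\, r_i^{\text{coop}} + r_i^{\text{hold}}$$
where $r_i^{\text{hold}} \le 0$ penalises flight $i$'s own holding (delay) time, $r_i^{\text{coop}}$ rewards progress toward the global goal of hotspot elimination, and $c$ is the cooperation coefficient. As $c$ grows, the holding penalty makes up a smaller share of the total reward, so an agent is more willing to accept its own delay in favour of the global objective, which is consistent with the trends observed in Fig. 13.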
Fig. 12 Experimental results of the comparison of RL-based DCB methods.
Fig. 13 Experimental results of sensitivity analysis on coefficient of cooperation.
In this paper, we have presented a locally generalised multi-agent reinforcement learning method for demand and capacity balancing with customised neural networks. The DQN-based enhancement techniques (i.e., Double DQN, Duelling DQN and Prioritised Replay Buffer) have been shown to improve the learning efficiency of the agents. We have evaluated the performance of our method in a series of comprehensive experiments based on large-scale, high-complexity real-world scenarios. By analysing the experimental results, we draw the following conclusions:
(1) The proposed method is generalised for sector opening schemes, flight schedules and problem scales within the scope of interest.
(2) The proposed method outperforms CASA in optimisation performance.
(3) The proposed method outperforms the state-of-the-art RL-based DCB methods41,42 in computing time and optimisation performance.
(4) The average delay time and the number of delayed flights are negatively correlated as the cooperation coefficient changes in the proposed method.
This work serves as our first step toward reducing the gap between the theory and practice of multi-agent reinforcement learning methods for demand and capacity balancing. We are fully aware that the situation in practice is far more complicated than in our experiments. Compared with traditional methods (such as exact solution methods), reinforcement learning methods have the potential to deal with uncertainty because they are inherently based on probability. In the future, uncertainty should be taken into account in RL-based DCB methods to improve their potential for practical application. Furthermore, without compromising optimisation performance, we will further enhance the generalisation of the technique to reduce the training cost, for example by letting all agents with the same flight path share the same neural network.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This study was co-funded by the National Natural Science Foundation of China (No.61903187), the National Key R&D Program of China (No.2021YFB1600500), the China Scholarship Council (No.202006830095), the Natural Science Foundation of Jiangsu Province (No.BK20190414) and the Jiangsu Province Postgraduate Innovation Fund (No.KYCX20_0213).The research described in this paper has been partly conducted in the framework of a project that has received funding from the SESAR Joint Undertaking within the framework of the European Union’s Horizon 2020 research and innovation programme under grant agreement No 891965.