Siqing Sun, Defu Cai, Hai-Tao Zhang, and Ning Xing
Dear Editor,
As a promising operation of multi-agent systems (MASs), autonomous interception, in which defenders prevent intruders from reaching their destinations, has attracted increasing attention in recent years. So far, most of the relevant methods are applied in ideal environments without agent damages. As a remedy, this letter proposes a more realistic interception method for MASs subject to damages, where the defenders are fewer than the intruders. Firstly, a multi-agent interception framework (MAIF) is proposed, enabling the defenders to take actions and interact with the environment. To address the nonstationarity issue induced by MAIF, a multi-agent reinforcement learning-based interception method (MAIM) is developed by carefully designing a reward function. Sufficient conditions are derived to guarantee the convergence of MAIM. Finally, numerical simulations are conducted to verify the effectiveness of the proposed method.
Recent years have witnessed tremendous development in research on autonomous interception of MASs [1], [2]. Lowe et al. [3] establish a multiple-particle environment in which predators seek to intercept preys that are foraging for food. Zhang et al. [4] investigate a pursuit-evasion game in which multiple quadcopters intercept a randomly moving drone. Yu et al. [5] address a typical dynamic combat scenario in which two hostile drone swarms intercept each other to prevent the destruction of their military bases. To achieve more agile collective interception, some scholars seek assistance from reinforcement learning [6], [7].
The main challenge lies in designing a suitable reward function, which has a significant influence on the interception performance [8]. In this regard, some interception studies define shape rewards by distances [9]–[11], so as to find an action that maximizes the distances between the intruders and the defensive area. However, the oversimplifications of the aforementioned studies have hindered their further applications. For example, the intruders are fewer than the defenders, agent damages are not considered, etc. This motivates us to develop a more realistic collective interception scheme to address the challenging antagonistic interception problem for MASs, where the shape reward may change frequently upon the emergence of events such as agent damages. Besides, the reward function is usually non-convex in such a scenario [12], making the learning procedure apt to be trapped in local optima.
To address this dilemma, we propose MAIM. Firstly, shape rewards are carefully designed and assigned to the agents with attention weights. In this way, each defender does not have to consider too many intruders simultaneously, which reduces the dimension of its decision space. Additionally, to ensure the convergence of the learning process upon the emergence of intermediate events, an event reward is designed to revise the stored reward history, instead of directly adding a large value to the reward at a single moment.
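To make these two ideas concrete, the following Python sketch illustrates one possible realization: a softmax attention over defender-intruder distances that weights per-intruder shape rewards, and a geometric backward spreading of the event reward over recently stored rewards. The function names, the softmax weighting, and the decay constants are illustrative assumptions, not the letter's exact formulas.

import numpy as np

def shape_reward(defender_pos, intruder_pos, target_pos, temperature=1.0):
    """Attention-weighted shape reward for one defender (assumed form).

    Each intruder contributes its distance to the protected area, weighted by a
    softmax attention over how close that intruder is to this defender, so the
    defender focuses on a few nearby intruders rather than all of them.
    """
    d_def = np.linalg.norm(intruder_pos - defender_pos, axis=1)   # defender-intruder distances
    attn = np.exp(-d_def / temperature)
    attn /= attn.sum()                                            # attention weights over intruders
    d_target = np.linalg.norm(intruder_pos - target_pos, axis=1)  # intruder-target distances
    return float(np.dot(attn, d_target))                          # larger when attended intruders stay far from the target

def revise_buffer(reward_history, event_reward, horizon=10, gamma=0.95):
    """Spread an event reward backward over the recent reward history.

    Instead of adding one large value at the event instant, the last `horizon`
    stored rewards are revised with geometrically decaying credit (assumption).
    """
    revised = list(reward_history)
    for k in range(1, min(horizon, len(revised)) + 1):
        revised[-k] += (gamma ** (k - 1)) * event_reward
    return revised

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    defender = rng.uniform(-1, 1, size=2)
    intruders = rng.uniform(-1, 1, size=(5, 2))  # more intruders than defenders, as in the letter's scenario
    target = np.zeros(2)
    print("shape reward:", shape_reward(defender, intruders, target))
    print("revised tail:", revise_buffer([0.1] * 20, event_reward=5.0)[-5:])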
In brief, the contribution of this work is two-fold: 1) developing a reinforcement learning-based MAIM with a suitable reward function, which enables defenders to intercept intruders in antagonistic environments; 2) deriving sufficient convergence conditions for MASs governed by the proposed MAIM.
Fig. 1. Illustration of the present MAIF.
…replay buffer D. The goal of each defender i becomes optimizing its policy π_i(·) for generating an action a_i^t with the maximized long-term discounted cumulative reward, i.e.,

R_i = Σ_{t=0}^{∞} γ^t r_i^t

where γ ∈ [0,1] denotes a discount factor. To obtain the optimum policy π_i^*, the index is expressed as

J(θ_i) = E[R_i], with π_i^* = arg max_{θ_i} J(θ_i).
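For concreteness, a small worked example (illustrative only, in Python) of computing the discounted cumulative reward R_i = Σ_t γ^t r_i^t from a finite reward sequence by backward accumulation:

def discounted_return(rewards, gamma=0.99):
    """Backward accumulation of the discounted cumulative reward."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62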
Design a critic network parameter training law ε_i = f_1(Q̂_i, R_i) such that the critic estimate Q̂_i converges.
Algorithm 1 The Training Process of MAIM
Step 1: Select actions a_i^t = π_{θ_i}(o_i^t) + η_t for the defenders w.r.t. the current policy π_{θ_i} and exploration noise η_t, and execute a_i^t in MAIF by (1);
Step 2: Obtain the state s_i^t and the shape reward r_{shape,i}^t by (5) and (6), and store the transitions (s_i^t, a_i^t, r_i^t, s_i^{t+1}) in the replay buffer D;
Step 3: Calculate the event rewards r_{event,i}^t by (7) if the intermediate events happen, and add them to revise the buffer D;
Step 4: Sample a random minibatch of samples from D at every instant t, and train the actor-critic network parameters θ_i, ε_i, i.e., θ_i^{k+1} = θ_i^k + α_c ∇_{θ_i} J(θ_i), ε_i^{k+1} = ε_i^k − α_c ∇_{ε_i} L(ε_i), k = 0, 1, …
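The sketch below shows how the four steps of Algorithm 1 could be wired together for a single defender, using a standard actor-critic update in PyTorch. The network sizes, noise scale, learning rate, event-reward revision scheme, and the env_step placeholder for the MAIF transition (1) are all illustrative assumptions rather than the letter's exact design.

import random
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-3)   # plays the role of alpha_c (assumed value)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)
buffer, gamma = [], 0.95

def env_step(obs, act):
    """Placeholder for the MAIF transition (1); returns next obs, shape reward, event flag."""
    nxt = obs + 0.01 * torch.randn(obs_dim)
    return nxt, float(-nxt.norm()), random.random() < 0.02

obs = torch.zeros(obs_dim)
for t in range(200):
    # Step 1: action from the current policy plus exploration noise eta_t.
    act = actor(obs).detach() + 0.1 * torch.randn(act_dim)
    # Step 2: execute in the environment and store the transition with its shape reward.
    nxt, r_shape, event = env_step(obs, act)
    buffer.append([obs, act, r_shape, nxt])
    # Step 3: on an intermediate event, revise recently stored rewards (assumed decaying scheme).
    if event:
        for k, trans in enumerate(buffer[-10:]):
            trans[2] += (gamma ** (9 - k)) * 5.0
    # Step 4: minibatch update of the critic (TD loss) and the actor (gradient through Q).
    if len(buffer) >= 32:
        batch = random.sample(buffer, 32)
        o = torch.stack([b[0] for b in batch]); a = torch.stack([b[1] for b in batch])
        r = torch.tensor([b[2] for b in batch]); o2 = torch.stack([b[3] for b in batch])
        with torch.no_grad():
            target = r + gamma * critic(torch.cat([o2, actor(o2)], dim=1)).squeeze(-1)
        td_loss = ((critic(torch.cat([o, a], dim=1)).squeeze(-1) - target) ** 2).mean()
        opt_critic.zero_grad(); td_loss.backward(); opt_critic.step()
        actor_loss = -critic(torch.cat([o, actor(o)], dim=1)).mean()
        opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
    obs = nxt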
Table 1. Structures of Actor and Critic Networks
The training processes are presented in Fig. 3, where ten independent interception simulations are performed after every training episode to reduce the influence of the randomness of the initial values. The cumulative rewards R_i of MAIM gradually increase, whereas R_i of MARL-S fluctuates continuously, which is caused by the high environmental nonstationarity. In the statistical results of 1×10^5 simulations, MAIM improves the win rate from 0 to 89.76%, whereas MARL-S usually intercepts only 2 or 3 intruders and thus loses the confrontation. To further explore the reason for the performance difference, Fig. 4 presents interception examples. For MAIM, intruders 5, 7 and 4 are intercepted by defenders 1, 2 and 3 in the initial period. Then, defender 1 hits its nearest intruder 6, and all defenders head toward the last remaining intruder 8. Finally, MAIM safely guides the defenders to win. By contrast, for MARL-S the defenders approach the targets during the initial period. However, according to the trajectories of defender 1 and intruder 6, the defenders are apt to lose the interception ability once the intruders execute evasive moves. Although the defenders slow down and turn around, it is too late to catch up with the intruders. Finally, no intruder is intercepted. In this scenario, MARL-S is trapped in a local optimum. In summary, the above numerical simulations have verified the effectiveness of the present MAIM.
Fig. 3. Average cumulative reward of MAIM and MARL-S.
Fig. 4. An example of the interception processes of MAIM and MARL-S.
Conclusion and future works: This letter proposes a more efficient reinforcement learning-based interception method in antagonistic environments, namely MAIM, by carefully designing a reward function. Significantly, sufficient conditions are derived to guarantee the convergence of MAIM. Finally, the effectiveness of the proposed method is verified by extensive numerical simulations. Future work will focus on more challenging collective interception scenarios in higher-dimensional spaces with dynamic obstacles.
Acknowledgments: This work was supported by the Science and Technology Project of State Grid Corporation of China (5100-202199557A-0-5-ZN).