
    Task assignment in ground-to-air confrontation based on multiagent deep reinforcement learning

Defence Technology, 2023, Issue 1

    Jia-yi Liu,Gang Wang,Qiang Fu,Shao-hua Yue,Si-yuan Wang

    Air and Missile Defense College,Air Force Engineering University,Xi'an,710051,China

Keywords: Ground-to-air confrontation; Task assignment; General and narrow agents; Deep reinforcement learning; Proximal policy optimization (PPO)

ABSTRACT The scale of ground-to-air confrontation task assignment is large, and many concurrent task assignments and random events must be handled. Existing task assignment methods applied to ground-to-air confrontation suffer from low efficiency in dealing with complex tasks and from interactive conflicts in multiagent systems. This study proposes a multiagent architecture based on one general agent with multiple narrow agents (OGMN) to reduce task assignment conflicts. Considering the slow speed of traditional dynamic task assignment algorithms, this paper proposes the proximal policy optimization for task assignment of general and narrow agents (PPO-TAGNA) algorithm. Based on the idea of the optimal assignment strategy algorithm and combined with the training framework of deep reinforcement learning (DRL), the algorithm adds a multihead attention mechanism and a stage reward mechanism to the bilateral band clipping PPO algorithm to solve the problem of low training efficiency. Finally, simulation experiments are carried out in the digital battlefield. The multiagent architecture based on OGMN combined with the PPO-TAGNA algorithm obtains higher rewards faster and achieves a higher win ratio. By analyzing agent behavior, the efficiency, superiority and rationality of resource utilization of this method are verified.

    1.Introduction

The core feature of the modern battlefield is a strong confrontation game. Relying on human judgment and decision-making cannot meet the requirements of fast-paced and high-intensity confrontations, and relying on a traditional analytical model cannot meet the requirements of complex and changeable scenes. In the face of future operations, intelligence is considered to be the key to solving the problem of a strong confrontation game and realizing the transformation of the information advantage into a decision advantage [1]. Task assignment is a key problem in modern air defense operations. The main purpose is to allocate each meta-task to the appropriate elements for execution to achieve target interception [2] and give full play to the maximum resource efficiency ratio. The key to solving this problem is establishing a task assignment model and using an assignment algorithm. Weapon target assignment (WTA) is an important problem to be solved in air defense interception task assignment [3]. The optimal assignment strategy algorithm is one of the methods for solving the WTA problem. It uses the Markov decision process (MDP) as its main idea and considers that new targets appear randomly with a certain probability during the assignment process. Hang et al. [4] believe that dynamic weapon target assignment (DWTA) can be divided into two stages, strategy optimization and matching optimization, and that Markov dynamics can be used to solve DWTA. On this basis, Chen et al. [5] improved the hybrid iterative method based on the policy iteration method and the value iteration method in Markov process strategy optimization to solve large-scale WTA problems. He et al. [6] transformed the task assignment problem into a phased decision-making process through an MDP. This method has good results in small-scale optimization problems.

Although the optimal assignment strategy algorithm has been continuously improved, its speed in solving large-scale WTA problems is still somewhat insufficient. In addition, the WTA method based on an MDP cannot immediately deal with emergencies because it needs to wait for processing of the previous stage to end. This paper combines the optimal assignment strategy algorithm for WTA with the DRL method, applying a neural network to the MDP-based WTA method to solve the problems of slow solution speed and the inability to deal with emergencies.

This research aims to apply a multiagent system and DRL to the task assignment problem of ground-to-air confrontation. Because a multiagent system has superior autonomy and unpredictability, it has the advantages of strong solution ability, fast convergence speed and strong robustness in dealing with complex problems [7]. The QMIX algorithm can train decentralised policies in a centralized end-to-end fashion and performs well in StarCraft II micromanagement tasks [8]. Cao J et al. [9] proposed the LINDA framework to achieve better collaboration among agents. Zhang Z proposed a PGP algorithm using the policy gradient potential as the information source for guiding strategy updates [10] and a learning automata-based algorithm known as LA-OCA [11]. Both algorithms perform well in common cooperative tasks and can significantly improve the learning speed of agents. However, due to the self-interest of each agent in the multiagent system and the complexity of the capacity scheduling problem of the multiagent system [12], interactive conflict is prone to occur when dealing with the task assignment of large-scale ground-to-air confrontation. The multiagent architecture based on OGMN is proposed in this paper, which can reduce the system complexity and eliminate the shortcoming that multiagent systems are prone to interactive conflict when dealing with complex problems. Task assignment is a typical sequential decision-making process for incomplete information games. Deep reinforcement learning (DRL) provides a new and efficient method to solve the problem of incomplete information games [13]; it relies less on external guidance information, there is no need to establish an accurate mathematical model of the environment and tasks, and it has the characteristics of fast reactivity and high adaptability [14]. However, in the ground-to-air confrontation scenario, there is a large quantity of data, and it is difficult to directly convert the combat task target into a reasonable reward function, so decision-making faces the problems of sparse, delayed and inaccurate feedback, resulting in low training efficiency. According to the characteristics of large-scale ground-to-air confrontation, this paper proposes the PPO-TAGNA algorithm, which effectively improves the training efficiency and stability through a multihead attention mechanism and a phased reward mechanism. Finally, experiments in the digital battlefield verify the feasibility and superiority of the multiagent architecture based on OGMN and the PPO-TAGNA algorithm for solving the task assignment problem of ground-to-air confrontation.

    2.Background

    2.1.Multiobjective optimization method

Since there may be conflicts or constraints between the various objectives, a multiobjective optimization problem has no unique solution but rather an optimal solution set. The main solution methods for the multiobjective problem in a multiagent system are as follows [15]:

    (1) Linear weighted sum method

The difficulty of solving this problem lies in how to allocate the weights, as follows:

$$\min_{x} F(x)=\sum_{i=1}^{n}\omega_i f_i(x),\qquad \sum_{i=1}^{n}\omega_i=1,\ \ \omega_i\ge 0$$

where $f_i(x)$ is one of the multiobjective functions and $\omega_i$ is its weight. The advantage of this algorithm is that it is easy to calculate, but the disadvantage is that the weight parameters are uncertain, and whether the weights are set appropriately directly determines the quality of the optimal solution. A minimal scalarization sketch is given after this list.

    (2) Reward function method

The reward function is used as the solution method of the optimization problem. Its design idea comes from the single-agent system and the rod balance (pole-balancing) system. In the rod balance system, the reward function is designed so that after the agent transitions to a new state, the reward value for failure is -1 and the reward value for success is 0. This design has several obvious defects. ① During task execution, the agent cannot determine whether a given state transition contributes to the final benefit, nor the size of that contribution. ② The reward function design gives the task a goal but only provides a reward value at the last step, so the reward is too sparse. The probability of the agent exploring the winning state and learning a strategy is very low, which is not conducive to achieving the task goal.
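As referenced above for the linear weighted sum method, the following is a minimal scalarization sketch; the objective values and weights are purely illustrative.

```python
# Minimal sketch of linear weighted-sum scalarization for a multiobjective
# problem; objective values and weights below are illustrative only.
import numpy as np

def weighted_sum(objective_values, weights):
    """Scalarize several objective values with nonnegative weights summing to 1."""
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0) and (weights >= 0).all()
    return float(np.dot(weights, objective_values))

# Two hypothetical objectives f_1(x), f_2(x) evaluated at a candidate solution x
print(weighted_sum([0.8, 0.3], [0.6, 0.4]))  # -> 0.6
```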

    2.2.Reinforcement learning

The idea of reinforcement learning (RL) is to use trial-and-error and rewards to train agent behavior [16]. The basic reinforcement learning environment is an MDP. An MDP contains five quantities, namely, (S, A, R, P, γ), where S is a finite set of states, A is a finite set of actions, R is the reward function, P is the state transition probability, and γ is the discount factor. The agent perceives the current state from the environment, then takes the corresponding action and obtains the corresponding reward.
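As a minimal illustration of this interaction loop, the sketch below assumes a hypothetical environment with reset() and step() methods and a policy function; it is not tied to the scenario in this paper.

```python
# Minimal sketch of the MDP interaction loop: the agent perceives a state,
# takes an action from its policy and obtains a reward; env/policy are
# hypothetical stand-ins.
def rollout(env, policy, gamma=0.99, max_steps=1000):
    state = env.reset()
    discounted_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)                  # a ~ pi(.|s)
        state, reward, done = env.step(action)  # transition under P(s'|s, a)
        discounted_return += discount * reward
        discount *= gamma
        if done:
            break
    return discounted_return
```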

    2.3.Proximal policy optimization

Deep learning (DL) uses a deep neural network as a function fitter; combining it with RL produces DRL, which effectively solves the dimension disaster problem that traditional RL methods face in large-scale and complex problems. Proximal policy optimization (PPO) belongs to a class of DRL optimization algorithms [17]. Unlike value-based methods such as Q-learning, it directly calculates the policy gradient of the cumulative expected return by optimizing the policy function to solve for the policy parameters that maximize the overall return. The PPO algorithm is described in Table 1.

The objective function defining the cumulative expected return of the PPO is the clipped surrogate objective:

$$L^{CLIP}(\theta)=\hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),1-\varepsilon,1+\varepsilon\big)\hat{A}_t\right)\right]$$

where the probability ratio is as follows:

$$r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}$$

$\hat{A}_t$ is the advantage (dominance) estimation function, computed from the temporal-difference residuals as follows:

$$\hat{A}_t=\sum_{l=0}^{\infty}(\gamma\lambda)^{l}\delta_{t+l},\qquad \delta_t=r_t+\gamma V(s_{t+1})-V(s_t)$$

    Table 1 PPO algorithm.
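To make the clipped surrogate objective above concrete, here is a minimal PyTorch sketch of the policy loss; it is not the authors' implementation, and the value-function and entropy terms of a full PPO update are omitted.

```python
# Minimal PyTorch sketch of the clipped PPO policy loss (policy term only).
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    ratio = torch.exp(log_probs_new - log_probs_old)                 # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # negate because optimizers minimize, while PPO maximizes this objective
    return -torch.min(unclipped, clipped).mean()
```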

    2.4.Attention mechanism

The attention mechanism enables agents to focus on some information at a certain point in time and ignore other information. It can enable agents to make better decisions faster and more accurately in local areas [18].

When the neural network is faced with a large amount of input situation information, only some key information is selected for processing through the attention mechanism. In the model, maximum pooling and gating mechanisms can be used as approximate simulations, which can be regarded as a bottom-up saliency-based attention mechanism. In addition, top-down convergent attention is also an effective information selection method. For example, when a large text such as an article is given, its content is extracted and a certain number of questions are asked. The questions raised are only related to part of the article's content and have nothing to do with the rest [19]. To reduce the solution pressure, only the relevant content needs to be selected and processed by the neural network.

$a_i$ can be defined as the degree to which the $i$th piece of information is attended to when the task-related input information $X$ and query vector $q$ are given, and the input information is summarized as follows:

$$\mathrm{att}(X,q)=\sum_{i=1}^{N}a_i x_i$$

Fig. 1 shows an example of the attention mechanism. In the red-blue confrontation scene, the computation of actions can adopt the attention mechanism. For example, the input data x are all currently selectable enemy targets, while q is the query vector output by the front part of the network.

    3.Problem modeling

    3.1.Problem formulation

Task assignment is the core link in command and control. It means that, on the basis of situational awareness and according to certain criteria and constraints, our own multiple types of interception resources are used efficiently to reasonably allocate and intercept multiple targets, avoiding missed key targets and repeated shooting, so as to achieve the best effect. The task assignment problem follows the principles of the maximum number of intercepted targets, the minimum damage to protected objects and the minimum use of resources. This paper studies the task assignment of ground-to-air confrontation based on the principle of using the least resources with minimum damage to the protected object.

Fig. 1. Attention mechanism.

The ground-to-air confrontation system is a loosely coupled system with a wide deployment range. It has a large scale and needs to deal with many concurrent task assignments and random events. The main tasks include target identification, determining the firing sequence, determining the interception strategy, missile guidance, killing targets, etc. The purpose of task assignment in this paper is to use the least resources while protecting the object with the least damage. Therefore, the task assignment of ground-to-air confrontation is a many-to-many optimization problem. It requires fast search ability and strong dynamic performance to deal with regional large-scale air attacks, and the task assignment results must change dynamically according to the situation. DRL relies less on external guidance information and does not need an accurate mathematical model of the environment and task. At the same time, it has the characteristics of fast reactivity and high adaptability, which makes it suitable for ground-to-air confrontation task assignment.

    3.2.MDP modeling

In order to combine the optimal assignment strategy of task assignment with DRL, we establish an MDP model of ground-to-air confrontation, which is divided into the states, the actions and the reward function.

States: (1) The status of the red base, including base information and the status of the base when it is being attacked. (2) The status of the red interceptors, including the current configuration of the interceptor, the working status of the radar, the attack status of the radar, and the information on the enemy units that the interceptor can attack. (3) The status of the enemy units, including the basic information of the enemy units and their status when being attacked by red missiles. (4) The status of enemy units that can be attacked, i.e., the status of being attackable by red interceptors.

Actions: (1) When to turn on the radar. (2) Which interception unit to choose. (3) Which launch vehicle to choose and at what time to intercept. (4) Which enemy target to choose.

Reward function: the reward function in this paper, based on the principle of using the least resources with minimum damage to the protected object, is as follows:

$$R=50W+5m+n-0.05i \qquad (10)$$

The reward value is 50 points for winning ($W=1$ if the red side wins, otherwise $W=0$), 5 points for intercepting a manned target such as a blue fighter, 1 point for intercepting a UAV, and 0.05 points are deducted for each missile launched. In Eq. (10), m is the number of blue manned units intercepted, n is the number of blue UAVs intercepted, and i is the number of missiles launched. Since each of the above reward stages is a task goal that the red side must achieve to win, the function can guide the agent to learn step by step and stage by stage.
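The sketch below simply evaluates Eq. (10); the function signature is an assumption for illustration.

```python
# Evaluates the reward of Eq. (10): +50 for a win, +5 per intercepted manned
# target, +1 per intercepted UAV, -0.05 per launched missile.
def episode_reward(win: bool, manned_intercepted: int,
                   uavs_intercepted: int, missiles_launched: int) -> float:
    return (50.0 * float(win)
            + 5.0 * manned_intercepted
            + 1.0 * uavs_intercepted
            - 0.05 * missiles_launched)
```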

    4.Research on a multiagent system model based on OGMN

    4.1.The general and narrow agents architecture design

The task assignment of large-scale ground-to-air confrontation needs to deal with many concurrent task assignments and random events, and the whole battlefield situation is full of complexity and uncertainty. The fully distributed multiagent architecture has poor global coordination for random events, which makes it difficult to meet the needs of ground-to-air confrontation task assignment. The current centralized assignment architecture can achieve globally optimal results [20], but for large-scale complex problems it is not practical because of the high solving time cost. Aiming at the command-and-control problem of ground-to-air confrontation distributed cooperative operations, combined with the DRL development architecture and based on the idea of dual data-rule driving, a general and narrow agents command-and-control system is designed and developed. The global situation over a certain period of time is taken as the input of the general agent, which has strong computing power, to obtain combat tasks and achieve tactical objectives. The narrow agent, based on tactical rules, decomposes the tasks from the general agent according to its own situation and outputs specific actions. The main idea of the OGMN architecture proposed in this paper is to retain the global coordination ability of the centralized method while combining it with the efficiency advantage of multiple agents. The general agent assigns tasks to the narrow agents, and each narrow agent divides its tasks into sub-tasks and selects the appropriate time to execute them according to its remaining resources, the current number of channels and other information. This method largely retains the overall coordination ability of the general agent and avoids missing key targets, repeated shooting, wasted resources and so on. It also reduces the computing pressure on the general agent and improves assignment efficiency. The general agent assigns tasks to the narrow agents according to the global situation. Each narrow agent decomposes its tasks into instructions (such as intercepting a target at a certain time) and assigns them to an interceptor according to its own situation. The structural design of the general and narrow multiagent system is shown in Fig. 2.

Fig. 2. The framework of the task assignment decision model of general and narrow agents.

Based on the existing theoretical basis, this paper vertically combines rules and data to drive the general agent and uses rules to drive the narrow agents. The purpose is to improve the speed at which a multiagent system solves complex tasks, reduce system complexity, and eliminate the shortcomings of a multiagent system in dealing with complex problems. It is expected that, in a short time, a general agent with strong computing power obtains situation information and quickly allocates tasks, and then multiple narrow agents select appropriate times and interceptors to intercept enemy targets according to their specific tasks and their own states, saving as many resources as possible on the premise of achieving the tactical objectives.
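To illustrate this division of labor, the sketch below models a general agent assigning tasks and rule-driven narrow agents decomposing them into timed interception instructions; the class and field names are assumptions, not the authors' implementation.

```python
# Illustrative OGMN task flow (not the authors' implementation): the general
# agent maps the global situation to per-narrow-agent tasks; each rule-driven
# narrow agent decomposes its task into a timed interception instruction
# according to its own remaining resources. All names/fields are assumptions.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Task:            # output of the general agent
    narrow_agent_id: int
    target_id: int

@dataclass
class Instruction:     # output of a narrow agent
    interceptor_id: int
    target_id: int
    launch_time: float

class GeneralAgent:
    def assign(self, threats: Dict[int, float], narrow_ids: List[int]) -> List[Task]:
        # Toy rule standing in for the trained policy: hand the most
        # threatening targets to narrow agents in round-robin order.
        ranked = sorted(threats, key=threats.get, reverse=True)
        return [Task(narrow_ids[i % len(narrow_ids)], t) for i, t in enumerate(ranked)]

class NarrowAgent:
    def __init__(self, agent_id: int, missiles_left: Dict[int, int]):
        self.agent_id = agent_id
        self.missiles_left = missiles_left   # interceptor_id -> remaining missiles

    def decompose(self, task: Task, now: float) -> List[Instruction]:
        # Rule: pick any interceptor with missiles remaining and schedule a launch.
        for interceptor_id, left in self.missiles_left.items():
            if left > 0:
                self.missiles_left[interceptor_id] -= 1
                return [Instruction(interceptor_id, task.target_id, now + 5.0)]
        return []   # no resources: the task cannot be executed now
```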

    4.2.Markov’s decision process of cooperative behavior

Traditional multiagent collaborative decision-making research mainly focuses on model-based research [21], namely, rational agent research. Traditional task assignment research relies too much on the accuracy of the underlying model; it only focuses on the design from the model to the actuator and does not focus on the model's generation process. In the intelligent countermeasure environment, there are many kinds of agents. For multiple agents, it is difficult to obtain accurate decision-making models because of complex task environments and situation disturbances. Additionally, environmental models present certain randomness and time variability [22-24]. All these factors require studying how to control agents when model information is lacking.

The essence of the framework shown in Fig. 3 is to solve the large-scale task assignment problem based on the idea of an optimal assignment strategy algorithm and a DRL method.

The four elements of the MDP are set as $(S, A, r, p)$: the state $S$, the action $A$, the reward $r$ and the transition probability $p$, with the Markov property $p(s_{t+1}\mid s_0,a_0,\dots,s_t,a_t)=p(s_{t+1}\mid s_t,a_t)$ and the strategy function $\pi: S\to A$ or $\pi: S\times A\to[0,1]$.

Optimization objective: the objective is to solve the optimal strategy function $\pi^{*}$ that maximizes the expected cumulative reward value as follows:

$$\pi^{*}=\arg\max_{\pi}\ \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r_t\right]$$

Fig. 3. The research framework of the cooperative behavior decision model of general and narrow agents.

The method uses a reinforcement learning algorithm to solve the MDP when $p(s_{t+1}\mid s_t,a_t)$ is unknown. The core idea is to use temporal-difference (TD) learning to estimate the action-value function as follows:

$$Q(s_t,a_t)\leftarrow Q(s_t,a_t)+\alpha\left[r_t+\gamma Q(s_{t+1},a_{t+1})-Q(s_t,a_t)\right]$$
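A minimal tabular sketch of this temporal-difference update follows; the tabular form is purely illustrative, since the paper approximates the action-value function with a neural network.

```python
# Minimal tabular sketch of the TD update above; purely illustrative.
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)] -> estimated action value

def td_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    td_target = r + gamma * Q[(s_next, a_next)]   # bootstrapped return estimate
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])  # move estimate toward target
```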

Compared with Alpha C2 [25], the model framework optimizes the agent state, which better meets the conditions of rationality and integrity. Rationality requires that states with similar physical meaning also have small numerical differences. For example, for the launch angle θ of the interceptor, since θ is a periodic variable, it is unreasonable to directly take θ as a part of the state. Thus, the launch angle θ should be changed to [cos θ, sin θ].

Integrity requires that the state contain all the information required for the agent's decision-making. For example, in the agent's trajectory tracking problem, the trend information of the target trajectory needs to be added. However, if this information cannot be observed, the state needs to be expanded to include historical observations, such as the observed wake of the blue drone.

    5.PPO for task assignment of general and narrow agents(PPO-TAGNA)

To solve the task assignment problem in large-scale scenarios, this paper combines the optimal assignment strategy algorithm with the DRL training framework, designs a stage reward mechanism and a multihead attention mechanism for the traditional PPO algorithm, and proposes a PPO algorithm for the task assignment of general and narrow agents.

    5.1.Stage reward mechanism

The reward function design is key to applying DRL to ground-to-air confrontation task assignment and must be analyzed in detail. For the ground-to-air confrontation task assignment problem, the reward value design idea in Alpha C2 [25] is to set a corresponding reward value for each unit type: if a unit is lost, the reward value of the corresponding unit is given, and at the end of each deduction round the reward values of each step are added as the final reward value. However, in practice, the reward values for lost units offset each other at each step, resulting in a small reward value and low learning efficiency. Conversely, if only the reward value of victory or failure is given at the last step of each game and the reward value of all other steps is 0, no artificial prior knowledge is added, which gives the neural network the maximum learning space. However, it leads to too sparse a reward value, and the probability of the neural network exploring the winning state and learning strategies is very low [26]. Therefore, the ideal reward value should be neither too sparse nor too dense and should clearly guide the agent to learn in the winning direction.

The stage reward mechanism adopts the method of decomposing task objectives and giving reward values periodically to guide the neural network to find strategies to win. For example, a phased reward can be given at one time after successfully resisting the first attack, the corresponding reward value is given after the loss of a blue high-value unit, and the winning reward value is given after the red side wins. On this basis, the reward function is optimized according to different objectives in the actual task, such as maximum accuracy, minimum damage, minimum response time, interception and condition constraints, to increase the effect of maximizing global revenue on the revenue of each agent and reduce the self-interest of the agents as much as possible.

    5.2.The multihead attention mechanism

    5.2.1.Neural network structure

The neural network structure of the multiagent command-and-control model is shown in Fig. 4. The situation input data are divided into four categories. The first category is the status of the red base, including base information and the status of the base when it is being attacked. The second category is the status of the red interceptors, including the current configuration of the interceptor, the working status of the radar, the attack status of the radar, and the information on the enemy units that the interceptor can attack. The third category is the status of the enemy units, including the basic information of the enemy units and their status when being attacked by red missiles. The fourth category is the status of enemy units that can be attacked, i.e., the status of being attackable by red interceptors. The number of units in each category is not fixed and changes with the battlefield situation.

Each type of situation data undergoes feature extraction through two layers of fully connected rectified linear units (FC-ReLU); all feature vectors are then concatenated before generating global features through one layer of FC-ReLU and gated recurrent units (GRUs) [27]. When making decisions, the neural network should consider not only the current situation but also historical information, so the network continuously interacts with the global situation through the GRU and chooses to retain or forget information. The global feature and the selectable blue unit feature vectors are processed through the attention mechanism to select the interception unit. Each intercepting unit then selects the interception time and the enemy units through an attention operation according to its own state, combined with the rule base.
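The following is a hedged PyTorch sketch of this structure (not the authors' code); the layer sizes and the four category input widths are assumptions, and the attention-based selection step is omitted here.

```python
# Sketch of Fig. 4: each of the four situation-data categories passes through
# two FC-ReLU layers, the features are concatenated, fused by one FC-ReLU layer
# and a GRU cell that carries historical information, and the resulting global
# feature later feeds the attention-based target selection.
import torch
import torch.nn as nn

class SituationEncoder(nn.Module):
    def __init__(self, input_dims, feat_dim=64, hidden_dim=128):
        super().__init__()
        # one two-layer FC-ReLU extractor per situation-data category
        self.extractors = nn.ModuleList([
            nn.Sequential(nn.Linear(d, feat_dim), nn.ReLU(),
                          nn.Linear(feat_dim, feat_dim), nn.ReLU())
            for d in input_dims])
        self.fuse = nn.Sequential(
            nn.Linear(feat_dim * len(input_dims), hidden_dim), nn.ReLU())
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)  # retains / forgets history

    def forward(self, category_inputs, hidden):
        # category_inputs: list of [batch, d_i] tensors, one per category
        feats = [net(x) for net, x in zip(self.extractors, category_inputs)]
        fused = self.fuse(torch.cat(feats, dim=-1))
        hidden = self.gru(fused, hidden)               # global feature with memory
        return hidden

# Example with assumed category widths
encoder = SituationEncoder(input_dims=[16, 32, 24, 24])
h = torch.zeros(1, 128)
obs = [torch.randn(1, d) for d in (16, 32, 24, 24)]
h = encoder(obs, h)
```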

5.2.2.Standardization and filtering of state data

State data standardization is a necessary step before entering the network. The original status data include various data, such as radar vehicle position, aircraft speed, aircraft bomb load, and threat degree of enemy units. The units and magnitudes of these data are different, and thus they must be normalized before being input into the neural network. In the battle process, some combat units join the battle later and some units are destroyed, so their data are missing. The neural network needs to be compatible with these situations.

Fig. 4. Neural network structure.

Different units have different states at various time points. Therefore, when selecting units to perform a task, it is necessary to eliminate those participating units that cannot perform the task at this time point. For example, there must be a certain time interval between two missile launches by an interceptor, and the interceptor must be connected to a radar vehicle to launch a missile.
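A small illustrative sketch of this preprocessing follows; the unit field names are assumptions.

```python
# Illustrative preprocessing: normalize heterogeneous unit features to a common
# scale and mask out units that cannot act now (destroyed, not linked to a
# radar, or still inside the minimum launch interval).
import numpy as np

def normalize(features: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    return (features - mean) / (std + 1e-8)    # zero mean, unit variance

def action_mask(units: list, now: float) -> np.ndarray:
    mask = np.zeros(len(units), dtype=bool)
    for i, u in enumerate(units):
        ready = (now - u["last_launch_time"]) >= u["min_launch_interval"]
        mask[i] = u["alive"] and u["radar_linked"] and ready
    return mask                                 # True = selectable this step
```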

    5.2.3.The attention mechanism and target selection

In this paper, the decision-making action is processed by multiple heads as the network output, meaning that the action is divided into an action subject (which interception unit to choose), an action predicate (which launch vehicle to choose and at what time to intercept), and an action object (which enemy target to choose).

When selecting interception targets, the network needs to focus on some important targets in local areas. In this paper, the state of each fire unit and the feature vector of the incoming target are used to realize the attention mechanism with an additive model.

$X=[x_1,\dots,x_N]$ is defined as $N$ pieces of input information, and the probability $a_i$ of selecting the $i$th piece of input information under the given $q$ and $X$ is calculated first. Then $a_i$ is defined as follows:

$$a_i=\mathrm{softmax}\big(s(x_i,q)\big)=\frac{\exp\big(s(x_i,q)\big)}{\sum_{j=1}^{N}\exp\big(s(x_j,q)\big)}$$

where $a_i$ is the attention distribution and $s(x_i,q)$ is the attention scoring function, whose calculation model is the additive model as follows:

$$s(x_i,q)=u^{\mathrm{T}}\tanh\big(W[x_i;q;v]\big)$$

where the query vector $q$ is the feature vector of each fire unit, $x_i$ is the $i$th attack target that is currently selectable, $W$ and $u$ are trainable neural network parameters, and $v$ is the global situation feature vector, namely, the conditional attention mechanism; thus, the global situation information can participate in the calculation. The attention score of each fire unit for each target is obtained, each element of the score vector is sampled through a sigmoid, and finally the overall decision is generated.
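A hedged PyTorch sketch of this conditional additive attention with bitwise sigmoid sampling follows; the exact parameterization and dimensions are assumptions based on the description above.

```python
# Sketch of conditional additive attention: each fire unit's feature vector q
# scores every selectable target x_i, conditioned on the global situation
# feature v; scores pass through a sigmoid and are sampled bitwise.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, x_dim, q_dim, v_dim, hidden=64):
        super().__init__()
        self.W = nn.Linear(x_dim + q_dim + v_dim, hidden, bias=False)
        self.u = nn.Linear(hidden, 1, bias=False)

    def forward(self, targets, q, v):
        # targets: [N, x_dim], q: [q_dim], v: [v_dim]
        n = targets.size(0)
        qv = torch.cat([q, v]).expand(n, -1)          # broadcast q and v to N rows
        scores = self.u(torch.tanh(self.W(torch.cat([targets, qv], dim=-1))))
        probs = torch.sigmoid(scores.squeeze(-1))     # per-target selection probability
        return torch.bernoulli(probs), probs          # sampled decision, probabilities

# Example with assumed dimensions
att = AdditiveAttention(x_dim=8, q_dim=16, v_dim=32)
decision, probs = att(torch.randn(5, 8), torch.randn(16), torch.randn(32))
```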

    5.3.The ablation experiment

To study the impact of the two mechanisms on algorithm performance, this paper designs an ablation experiment. By adding or removing the two mechanisms from the basic PPO algorithm, four different algorithms are set up to compare the differences in their effects. The experimental setup is shown in Table 2:

Based on the general and narrow agents' framework in Section 4 of this paper, all algorithms are iteratively trained 100,000 times under the same scenario. The experimental results are shown in Fig. 5. The performance of the basic PPO algorithm can be improved by adding the stage reward mechanism or the multihead attention mechanism alone. The multihead attention mechanism can increase the average reward from 10 to approximately 38. The stage reward mechanism has a slightly larger and more stable effect and can increase the average reward from 10 to approximately 42. When the two mechanisms are added simultaneously, the algorithm performance is considerably improved, and the average reward value increases to approximately 65, which shows that the PPO-TAGNA algorithm proposed in this paper is effectively applicable to the task assignment problem under the framework of general and narrow agents.

    6.Experiments and results

    6.1.Experimental environment setting

The neural network training environment in this paper is the virtual digital battlefield. In the hypothetical combat area, facing a certain number of blue offensive forces, and with important places to protect and limited forces, the red agent needs to make real-time decisions according to the battlefield situation and allocate tasks according to the threat degree of the enemy and other factors, while trying to preserve its strength and protect the important places from destruction. In this paper, the task assignment strategy of the red side is trained by a DRL method. Considering physical constraints such as earth curvature and ground object shielding, the key elements of the digital battlefield are close to those of the real battlefield. The red-blue confrontation scenario is shown in Fig. 6.

    Table 2 Ablation experimental design.

    6.1.1.The red army force setting

Fig. 5. Performance comparison of ablation experimental algorithms.

There are 2 important places, the headquarters and the airport.

    One early warning aircraft has a detection range of 400 km.

The long-range interception unit consists of 1 long-range radar and 8 long-range interceptors (each interceptor carries 3 long-range missiles and 4 short-range missiles).

    The short-range interception unit consists of 1 short-range radar and 3 short-range interceptors (each interceptor is loaded with 4 short-range missiles).

Three long-range interception units and three short-range interception units are deployed to defend the red headquarters in a sector, while four long-range interception units and two short-range interception units are deployed to defend the red airport in a sector, for a total of 12 interception units.

    6.1.2.Blue army force setting

    There are 18 cruise missiles.

    There are 20 UAVs,each carrying 2 antiradiation missiles and 1 air-to-ground missile.

    There are 12 fighter planes,each carrying 6 antiradiation missiles and 2 air-to-ground missiles.

    There are 2 jammers for long-distance support jamming outside the defense area.

    6.1.3.Confrontation criterion

If the radar is destroyed, the unit loses combat capability. The radar needs to be on during the whole guidance process. When the radar is turned on, it radiates electromagnetic waves, which can be captured by the opponent and expose its position. The radar is subject to physical limitations, such as earth curvature and ground object shielding, and the missile flight trajectory is the best energy trajectory. The interception distances are 160 km (long range) and 40 km (short range). For UAVs, fighters, bombers, antiradiation missiles and air-to-ground missiles, the high-kill probability in the kill zone is 75% and the low-kill probability is 55%; for cruise missiles, the high-kill probability in the kill zone is 45% and the low-kill probability is 35%. The antiradiation missile has a range of 110 km and a hit rate of 80%. The air-to-ground missile has a range of 60 km and a hit rate of 80%. The jamming sector of the blue jammer is 15°, and after the red radar is jammed, its kill probability is reduced according to the jamming level.

    6.2.Experimental hardware configuration

The simulation environment runs on an Intel Xeon E5-2678 v3 CPU with 88 cores and 256 GB of memory; neural network training runs on two GPUs, each an NVIDIA GeForce 2080 Ti with 72 cores and 11 GB of video memory. In PPO, the clipping hyperparameter is ε = 0.2, the learning rate is 10^-4, the batch size is 5,120, and the numbers of hidden layer units in the neural network are 128 and 256.
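For reference, the reported training settings can be gathered into a small configuration dictionary; the key names are illustrative, not from the authors' code.

```python
# Training settings reported in this section (dictionary keys are illustrative).
ppo_config = {
    "clip_epsilon": 0.2,         # PPO clipping hyperparameter
    "learning_rate": 1e-4,
    "batch_size": 5120,
    "hidden_units": (128, 256),  # hidden layer sizes in the neural network
}
```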

    6.3.Analysis of the experimental results

    6.3.1.Agent architecture comparison

The agent architecture based on OGMN proposed in this paper and the framework of Alpha C2 [25] are iterated 100,000 times in the digital battlefield with the PPO, A3C and DDPG algorithms, and the reward value and win ratio are compared. The comparison results are shown in Fig. 7.

Fig. 6. Schematic diagram of an experimental scenario.

It can be seen that the agent architecture based on OGMN proposed in this paper obtains a higher reward value faster in the training process, and the effect with PPO is the best, substantially improving the average reward value and stability. In terms of the win ratio, the agent architecture in this paper shows the same behavior: it achieves a higher win ratio and is more stable. The experiments show that the OGMN architecture not only largely retains the overall coordination ability of the general agent to ensure the stability of training but also has the efficiency of a multiagent system, which improves training efficiency.

    6.3.2.Algorithm performance comparison

Under the same scenario, the PPO-TAGNA algorithm proposed in this paper is compared with the Alpha C2 algorithm [25] and a rule base based on expert decision criteria [5,28-30]. The three algorithms are iterated 100,000 times in the digital battlefield, and the comparison results are shown in Fig. 8.

    6.4.Agent behavior analysis

In the deduction on the digital battlefield, some strategies and tactics emerge in this study. Fig. 9 shows the performance of the red agents before training. At this time, only the unit closest to the target defends, without awareness of sharing the defense pressure, and the value of the targets is not distinguished. Finally, when high-value targets attack, the resources of the units that can intercept are exhausted and the red side fails.

Fig. 10 shows the performance of the red agents after training. At this time, the agents can distinguish the high-threat units of the blue side, share the defense pressure, make more rational use of resources, defend the key areas more efficiently, and finally take the initiative to attack the high-value targets of the blue side to win.

    7.Conclusions

Aiming at the low processing efficiency and slow solution speed of large-scale task assignment, based on the idea of the optimal assignment strategy algorithm and combined with the DRL training framework, this paper proposes the agent architecture based on OGMN and the PPO-TAGNA algorithm for ground-to-air confrontation task assignment. By comprehensively analyzing large-scale task assignment requirements, a reasonable state space, action space and reward function are designed. Using the real-time confrontation of the digital battlefield, experiments such as algorithm ablation, agent framework comparison and algorithm performance comparison are carried out. The experimental results show that the OGMN task assignment method based on DRL has a higher win ratio than the traditional methods, uses resources more reasonably and achieves better results under limited training times. The multiagent architecture based on OGMN and the PPO-TAGNA algorithm proposed in this paper is applicable and superior in ground-to-air confrontation task assignment and has important application value in the field of intelligent aided decision-making.

Fig. 7. Comparison of agent architecture training effect.

Fig. 8. Algorithm performance comparison.

Fig. 9. Performance of agents before training.

Fig. 10. Performance of agents after training.

    Declaration of competing interest

    The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

    Acknowledgements

The authors would like to acknowledge the Project of National Natural Science Foundation of China (Grant No. 62106283), the Project of Natural Science Foundation of Shaanxi Province (Grant No. 2020JQ-484) and the Project of National Natural Science Foundation of China (Grant No. 72001214) for providing funds for conducting the experiments.
