
An Intelligent Algorithm for Solving Weapon-Target Assignment Problem: DDPG-DNPE Algorithm

Computers Materials & Continua, 2023, Issue 9

Tengda Li, Gang Wang, Qiang Fu, Xiangke Guo, Minrui Zhao and Xiangyu Liu

Air Defense and Antimissile College, Air Force Engineering University, Xi'an, 710051, China

ABSTRACT Aiming at the problems of traditional dynamic weapon-target assignment algorithms in command decision-making, such as a large computational burden, slow solution speed, and low calculation accuracy, and drawing on deep reinforcement learning theory, an improved Deep Deterministic Policy Gradient algorithm with dual noise and prioritized experience replay is proposed. The algorithm uses a dual-noise mechanism to expand the search range of actions and introduces a prioritized experience replay mechanism to improve data utilization. Finally, the algorithm is simulated and validated on a ground-to-air countermeasures digital battlefield. The experimental results show that, under the deep neural network framework for intelligent weapon-target assignment proposed in this paper, agents trained with reinforcement learning algorithms such as the Deep Deterministic Policy Gradient algorithm, the Asynchronous Advantage Actor-Critic algorithm, and the Deep Q Network algorithm perform better than the traditional RELU (rule-based) algorithm, indicating that using deep reinforcement learning to solve the weapon-target assignment problem in air defense operations is scientifically sound. Compared with the other reinforcement learning algorithms, the agent trained by the improved Deep Deterministic Policy Gradient algorithm achieves a higher win rate and reward in confrontation and uses weapon resources more efficiently, which shows that the model and algorithm have a certain superiority and rationality. The results of this paper provide new ideas for solving the weapon-target assignment problem in air defense combat command decisions.

KEYWORDS Weapon-target assignment; DDPG-DNPE algorithm; deep reinforcement learning; intelligent decision-making; GRU

    1 Introduction

Weapon-target assignment (WTA) is the core link in the command and control of air defense operations and has a significant impact on combat effectiveness. It refers to the efficient use of one's own multi-type, multi-platform weapon resources based on battlefield situational awareness to rationally allocate and intercept multiple incoming targets while avoiding the omission of key targets, repeated shooting, and other phenomena, so as to achieve the best combat effect [1–3]. This problem has been proven to be a non-deterministic polynomial complete (NP-complete) problem [4,5]. Efficient WTA is more than 3 times more effective than free fire [6], acting as a force multiplier.

Experts and scholars at home and abroad regard WTA as a class of mathematical problems and have carried out related model construction and solution work at multiple levels [7–13]. Davis et al. [14] focused on ballistic missile defense with multiple rounds of launches under dynamic conditions from the perspective of maximizing the protection of strategic points. Xu et al. [15] fully considered the communication and cooperation between sensor platforms and weapon platforms and studied the multi-stage air defense WTA problem with the optimization goals of maximizing the intercepted target threat and minimizing the total cost of interception. Guo et al. [16] mainly studied the problem of many-to-many missile interception and proposed two strategies, fixed and adaptive grouping, which were solved by an artificial swarm algorithm. Aiming at the WTA problem of regional defense, Kim et al. [17] proposed two new algorithms, a rotation-fixed strategy and a rotation strategy, to deal with multi-target attacks, and the effectiveness of the algorithms was verified by experiments. Li et al. [18] summarized and analyzed the static WTA problem and introduced an improved Multi-objective Evolutionary Algorithm Based on Decomposition (MOEA/D framework) to solve it. Based on an analysis of the difficulties of regional air defense decision-making, Severson et al. [19] adopted the idea of multi-layer defense to establish an assignment model that maximizes the interference effectiveness index. Ouyang et al. [20] proposed a distributed allocation method to effectively manage radar resources and used probabilistic optimization algorithms to allocate radar targets under limited radar early-warning resources. To maximize the kill probability, Feng et al. [21] divided the fire units into groups, considered compound strikes within the same group of fire units, and constructed a dynamic WTA model with multiple interception timings. Bayrak et al. [22] studied how to efficiently solve the firepower cooperative allocation problem using the genetic algorithm (GA), which has good convergence and search speed. Li et al. [23] designed a particle swarm optimization (PSO) algorithm with perturbation by attractors and inverse elements for the anti-missile interceptor-target allocation problem. To realize a rapid solution of WTA, Fan et al. [24] introduced a variable neighborhood search factor into the solution process, mimicked the natural behavior of bees, and proposed a memory mechanism that improves the global search efficiency of the artificial bee colony (ABC) algorithm. Fu et al. [25] proposed a multi-objective PSO algorithm based on the co-evolution of multiple populations to construct a co-evolution model.

However, the above studies are based on traditional analytical models and algorithms. Because the battlefield environment and the opponent's strategies change rapidly and are difficult to quantify, traditional analytical models and algorithms face bottlenecks when dealing with the uncertainty and nonlinearity of decision-making and struggle to adapt to a changing battlefield. Facing the needs of air defense operations, where decision-making advantage is the core, it is urgent to study new WTA methods to improve the level of intelligent decision-making. WTA is a typical sequential decision-making process under an incomplete-information game, which can be reduced to solving a Markov decision process (MDP) problem. Deep reinforcement learning (DRL) provides an efficient solution to this problem: DRL realizes end-to-end learning from perception to action, and its learning mechanism is in line with the experience-based learning and decision-making of combat commanders, giving it obvious advantages for solving sequential decision-making problems under game confrontation conditions. Good results have been achieved in Go [26–28], real-time strategy games [29,30], automatic driving [31], intelligent recommendation [32], and other fields.

In summary, this paper applies DRL theory and algorithms to the WTA problem in air defense operation command decision-making. By introducing an event-based reward mechanism (EBR), a multi-head attention mechanism (MHA), and a gated recurrent unit (GRU), a new deep neural network framework for intelligent WTA is constructed. It is solved by the improved Deep Deterministic Policy Gradient algorithm with dual noise and prioritized experience replay (DDPG-DNPE) to improve the auxiliary decision-making ability of intelligent WTA for air defense operations in highly dynamic, uncertain, and complex battlefield environments, transform information advantages into decision-making advantages, and provide commanders with more accurate WTA decision support. Finally, red and blue sides are designed on the simulation deduction platform to verify the network architecture and algorithm proposed in this paper. The experimental results demonstrate the practicability and effectiveness of the proposed method.

    2 Related Theories

    2.1 DRL

The goal of reinforcement learning (RL) is to enable the agent to obtain the maximum cumulative reward during its interaction with the environment and to learn the optimal control of its actions. RL introduces the concepts of agent and environment, expanding the optimal control problem into the more general sense of sequential decision-making problems, and the agent can autonomously interact with the environment and obtain training samples rather than relying on a limited number of expert samples. The RL model consists of five key parts: agent, environment, state, action, and reward. Each interaction between the agent and environment produces information that is used to update the agent's knowledge; this perception-action-learning cycle is shown in Fig. 1.

Figure 1: Schematic diagram of RL
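The perception-action-learning cycle in Fig. 1 can be made concrete with a minimal interaction-loop sketch; the `env.reset`/`env.step` interface and the random agent below are illustrative assumptions, not the simulation platform or policy used in this paper.

```python
import random

class RandomAgent:
    """Placeholder agent that picks actions at random; a trained policy would replace this."""
    def __init__(self, n_actions):
        self.n_actions = n_actions

    def act(self, state):
        return random.randrange(self.n_actions)      # perception -> action

    def learn(self, state, action, reward, next_state, done):
        pass                                         # parameter updates would happen here

def run_episode(env, agent):
    """One episode of the perception-action-learning cycle."""
    state, done, total_reward = env.reset(), False, 0.0
    while not done:
        action = agent.act(state)
        next_state, reward, done = env.step(action)  # environment transition and reward
        agent.learn(state, action, reward, next_state, done)
        state, total_reward = next_state, total_reward + reward
    return total_reward
```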

DRL is a reinforcement learning method that uses a deep neural network to represent the agent's policy. In the field of intelligent air defense WTA, the perception ability of deep learning (DL) can be used for battlefield situation recognition, and the RL algorithm can be used to assist decision-making, improving the efficiency of WTA and gaining a competitive advantage.

    2.2 Markov Decision Process

An MDP is represented by the five-tuple $\langle S, A, P, R, \gamma \rangle$: $S = (s_1, s_2, \cdots, s_n)$ is the set of states; $A = (a_1, a_2, \cdots, a_m)$ is the action set; $P$ is the state transition matrix; $R$ is the reward; $\gamma$ is the discount factor.

In an MDP, the agent starts in the initial state $s$ of the environment, executes an action $a$, and the environment then outputs the next state $s'$ and the reward $r$ obtained by the current action $a$. The agent interacts with the environment continuously in this way.

$R_t$ is the cumulative reward, i.e., the discounted sum of rewards obtained after time step $t$:

$$R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}$$

The policy $\pi$ is the probability of selecting an action $a$ in the state $s$:

$$\pi(a \mid s) = P\left[A_t = a \mid S_t = s\right]$$

The state-value function $V_{\pi}(s)$ is the expected total reward of following policy $\pi$ from the initial state $s$:

$$V_{\pi}(s) = \mathbb{E}_{\pi}\left[R_t \mid S_t = s\right]$$

The action-value function $Q_{\pi}(s,a)$ is the expected total reward obtained when action $a$ is performed in state $s$ and subsequent actions follow policy $\pi$:

$$Q_{\pi}(s,a) = \mathbb{E}_{\pi}\left[R_t \mid S_t = s, A_t = a\right]$$

The goal of RL is to solve for the optimal policy $\pi^{*}$ of the MDP that maximizes the return, while the optimal state-value function $V^{*}(s)$ and the optimal action-value function $Q^{*}(s,a)$ are expressions of the optimal policy $\pi^{*}$:

$$V^{*}(s) = \max_{\pi} V_{\pi}(s), \qquad Q^{*}(s,a) = \max_{\pi} Q_{\pi}(s,a)$$
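As a concrete check of the return definition above, the discounted cumulative reward can be computed backwards over a reward sequence; the reward values in the example are made up purely for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = sum_k gamma^k * r_{t+k+1}, accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Made-up rewards [1, 0, 0, 5]: R_0 = 1 + 0.9*0 + 0.81*0 + 0.729*5 = 4.645
print(discounted_return([1, 0, 0, 5], gamma=0.9))
```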

    2.3 DDPG Algorithm

    2.3.1 Introduction of the DDPG Algorithm

Traditional RL algorithms suffer from the curse of dimensionality; by combining the representation advantages of DL with the decision-making advantages of RL, a better DRL algorithm can be obtained for solving large-scale decision-making tasks.

The DDPG algorithm has strong deep neural network fitting ability, generalized learning ability, and sequential decision-making ability, which is consistent with the decision-making thinking of air defense operations, so this paper adopts the DDPG algorithm for air defense intelligent decision-making.

The objective function of the DDPG algorithm is defined as the expectation of the cumulative reward:

$$J_{\beta}(\mu) = \mathbb{E}_{s \sim \rho^{\beta}}\left[Q^{\mu}\left(s, \mu(s)\right)\right]$$

Finding the optimal deterministic behavior policy $\mu^{*}$ is equivalent to maximizing the objective function $J_{\beta}(\mu)$:

$$\mu^{*} = \arg\max_{\mu} J_{\beta}(\mu)$$

The Actor network is updated as follows:

$$\nabla_{\theta^{\mu}} J_{\beta}(\mu) \approx \mathbb{E}_{s \sim \rho^{\beta}}\left[\nabla_{a} Q\left(s, a; \theta^{Q}\right)\big|_{a=\mu(s)} \nabla_{\theta^{\mu}} \mu\left(s; \theta^{\mu}\right)\right]$$

$Q^{\mu}(s, \mu(s))$ is the action value generated when the action is selected according to the deterministic policy $\mu$ in state $s$; $\mathbb{E}_{s \sim \rho^{\beta}}[\cdot]$ represents the expectation of $Q$ when the state $s$ follows the $\rho^{\beta}$ distribution. The gradient ascent algorithm is used to optimize the above equation so as to continuously improve the expected discounted cumulative reward. Finally, the algorithm updates the parameters $\theta^{\mu}$ of the policy network in the direction that increases the action value $Q(s, a; \theta^{Q})$.

The critic network is updated in the same way as the value network in the Deep Q Network (DQN); the value network is trained by minimizing the following loss:

$$L\left(\theta^{Q}\right) = \mathbb{E}\left[\left(r + \gamma Q'\left(s', \mu'\left(s'; \theta^{\mu'}\right); \theta^{Q'}\right) - Q\left(s, a; \theta^{Q}\right)\right)^{2}\right]$$

The neural network parameters $\theta^{\mu'}$ and $\theta^{Q'}$ in the target $Q$ value are the parameters of the target policy network and the target value network, respectively, and the gradient descent algorithm is used to update the parameters of the network model. Training the value network amounts to finding the optimal parameters $\theta^{Q}$ of the value network.

Therefore, the training objective of the DDPG algorithm is to maximize the objective function $J_{\beta}(\mu)$ and minimize the loss of the value network $Q$.
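A compact PyTorch-style sketch of one DDPG update step matching the actor and critic objectives above; the critic signature, the soft-update rate tau, and the other hyperparameters are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    """One DDPG step: critic minimizes the TD error, actor ascends Q(s, mu(s))."""
    s, a, r, s_next, done = batch   # tensors sampled from the replay buffer

    # Critic update: target value from the target policy/value networks.
    with torch.no_grad():
        target_q = r + gamma * (1.0 - done) * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: gradient ascent on J_beta(mu) = E[Q(s, mu(s))].
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of the target networks.
    with torch.no_grad():
        for p, tp in zip(actor.parameters(), target_actor.parameters()):
            tp.mul_(1.0 - tau).add_(tau * p)
        for p, tp in zip(critic.parameters(), target_critic.parameters()):
            tp.mul_(1.0 - tau).add_(tau * p)
```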

    2.3.2 DDPG-DNPE Algorithm

The traditional DDPG algorithm uses an experience replay mechanism and samples uniformly from the experience replay pool, that is, all experiences are considered equally important. In the actual simulation process, however, it is found that samples differ in importance, and the data on which the network performs poorly during interaction are more valuable for learning. Therefore, this paper introduces a prioritized experience replay mechanism and gives different data different weights, so that the training network can concentrate on learning high-value data as much as possible.

If $|\delta_i|$ (the TD error of experience $i$) is larger, the experience is given a higher weight. The sampling probability of experience $i$ can be defined as:

$$P(i) = \frac{D_i^{\alpha}}{\sum_{k} D_k^{\alpha}}$$

where $D_i = \frac{1}{\mathrm{rank}(i)}$, and $\mathrm{rank}(i)$ is the rank of experience $i$ in the experience pool when sorted by $|\delta_i|$. The larger $|\delta_i|$ is, the higher the ranking, that is, the greater the probability of experience $i$ being drawn. $\alpha$ mainly determines how strongly the priorities are used.

High-frequency sampling of experiences with high weights changes the distribution of samples, making it difficult for the model to converge. Importance sampling is therefore applied, and the importance sampling weight is:

$$w_i = \left(\frac{1}{S} \cdot \frac{1}{P(i)}\right)^{\beta}$$

$S$ is the size of the experience replay pool, and $\beta$ is a hyperparameter that controls the degree of correction for priority-based experience replay.
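A NumPy sketch of the rank-based sampling probability and importance-sampling weight defined above; here the replay pool is assumed to store (|delta_i|, transition) pairs, and the alpha/beta values are illustrative.

```python
import numpy as np

def sample_prioritized(pool, batch_size, alpha=0.7, beta=0.5):
    """Rank-based prioritized sampling: P(i) = D_i^alpha / sum_k D_k^alpha, D_i = 1/rank(i)."""
    S = len(pool)
    order = np.argsort([-abs_delta for abs_delta, _ in pool])  # rank 1 = largest |delta|
    D = 1.0 / (np.arange(S) + 1.0)
    P = D ** alpha / np.sum(D ** alpha)

    ranks = np.random.choice(S, size=batch_size, p=P)          # draw by priority
    weights = (1.0 / (S * P[ranks])) ** beta                   # importance-sampling weights
    weights /= weights.max()                                   # normalize for stability
    transitions = [pool[order[i]][1] for i in ranks]
    return transitions, weights
```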

As shown in Fig. 2, in the training process, to better balance exploration and updating, OU (Ornstein-Uhlenbeck) random noise and Gaussian noise are introduced to change the action decision-making process from deterministic to stochastic. The differential form of the OU random noise $N_t$ is:

$$dN_t = \theta\left(\mu - N_t\right)dt + \delta \, dB_t$$

where $\mu$ is the mean value, $\theta$ is the speed at which the noise reverts to the mean, $\delta$ is the fluctuation degree of the noise, and $B_t$ is a standard Brownian motion.

Figure 2: Exploration strategy based on OU noise and Gaussian noise

Gaussian noise is directly superimposed on the action during exploration in the form $\varepsilon \sim N(0, \delta^{2})$.
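A sketch of the dual-noise exploration: a discretized OU process following $dN_t = \theta(\mu - N_t)dt + \delta\,dB_t$, with an independent Gaussian term added on top; all coefficients are placeholder values.

```python
import numpy as np

class OUNoise:
    """Discretized Ornstein-Uhlenbeck process dN = theta*(mu - N)dt + delta*dB."""
    def __init__(self, size, mu=0.0, theta=0.15, delta=0.2, dt=1e-2):
        self.mu, self.theta, self.delta, self.dt = mu, theta, delta, dt
        self.n = np.full(size, mu)

    def sample(self):
        dn = (self.theta * (self.mu - self.n) * self.dt
              + self.delta * np.sqrt(self.dt) * np.random.randn(*self.n.shape))
        self.n = self.n + dn
        return self.n

def explore(action, ou_noise, sigma=0.1):
    """Deterministic action + OU noise + Gaussian noise, clipped to the action range."""
    gaussian = np.random.normal(0.0, sigma, size=action.shape)
    return np.clip(action + ou_noise.sample() + gaussian, -1.0, 1.0)
```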

In summary, the structural block diagram of the improved DDPG algorithm is shown in Fig. 3.

The calculation flow of the DDPG-DNPE algorithm is as follows:

Figure 3: Block diagram of the DDPG-DNPE algorithm

    2.3.3 DRL Training Framework

Before using DRL to solve the WTA problem, it is necessary to collect training samples through the interaction between the agent and the environment and then optimize the neural network parameters through the RL algorithm so that the agent can learn the optimal strategy. The agent training framework is shown in Fig. 4.

Figure 4: Schematic diagram of the agent training framework

DRL focuses on both sampling and training. When sampling, the input of the neural network used by the agent is state information and reward, and the output is action information, while the simulation environment takes operating instructions as input and outputs battlefield situation information. Therefore, when the agent interacts with the environment for sampling, the data output from the simulation environment are transformed into state information and reward, and the action output from the neural network is transformed into action commands according to the parameters of the MDP model. During training, the collected samples are fed into the RL algorithm, and the parameters are continuously optimized to finally obtain the optimal strategy.

    3 Intelligent Decision-Making Model and Training

    3.1 Deep Neural Network Framework for WTA

Based on key factors such as air defense combat boundaries, engagement criteria, and physical constraints, the deep neural network structure used in this paper is shown in Fig. 5. In the red-blue confrontation, the red agent comprehensively evaluates the threat degree of incoming blue targets according to the real-time state of both sides, considers the deployment of the red side's air defense fire units, and decides which blue targets to intercept at which points in time. The input of the network model is mainly the real-time state of the red and blue sides, and the output specifies which interceptor weapons are used to intercept which blue targets in the current state. The network structure can be divided into three parts: battlefield situation input, decision-making action calculation, and decision-making action output.

Figure 5: Deep neural network framework for WTA

    3.1.1 Battlefield Situation Input

The battlefield situation input mainly feeds the network state space. The network state space is integrated and reduced by combining the air defense WTA combat elements. The battlefield situation is divided into four categories and input to the neural network in the form of semantic information. The specific classification and state information are shown in Table 1.

Table 1: Classification and information of the state

In a complex battlefield environment, there are many air defense combat entities and operational constraints, and the battlefield situation changes in space over time, so the number of entities of each type changes dynamically.

    3.1.2 Decision-Making Action Calculation

After the state of the red key targets, the state of the red fire units, the state of the observable incoming blue targets, and the state of the interceptable blue targets are input into the neural network, situation features are extracted from each type of state data through two fully connected-rectified linear unit (FC-ReLU) layers, and all the features are then concatenated. After another FC-ReLU layer and a GRU, the global situation features are output, and decision reasoning and action calculation are then carried out.

Due to the complex battlefield environment and random disturbances, the battlefield situation exhibits dynamic uncertainty, and the temporal attributes of the situation and the spatial attributes of the operational nodes must be fully considered. Moreover, the red and blue adversarial data often carry historical value, that is, decision-making in the current state is related to historical information. The GRU network can selectively forget unimportant historical information, which alleviates the problems of gradient vanishing and gradient explosion in long-sequence training. The structure of the GRU network is shown in Fig. 6.

Figure 6: GRU

$z_t$ and $r_t$ represent the update gate and the reset gate, respectively. The update gate controls the extent to which the state information of the previous moment is brought into the current state; the larger the value of the update gate, the more state information from the previous moment is brought in. The reset gate controls how much information from the previous state is written to the current candidate set $\tilde{h}_t$; the smaller the reset gate, the less information from the previous state is written. The update mechanism of each gate is:

$$z_t = \delta\left(W_z \cdot \left[h_{t-1}, x_t\right]\right)$$
$$r_t = \delta\left(W_r \cdot \left[h_{t-1}, x_t\right]\right)$$
$$\tilde{h}_t = \tanh\left(W \cdot \left[r_t \odot h_{t-1}, x_t\right]\right)$$
$$h_t = \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where $x_t$ is the input information at the current moment; $h_{t-1}$ denotes the hidden state of the previous moment, which acts as the network memory and contains information about the data seen by previous nodes; $h_t$ represents the hidden state passed to the next moment; $\tilde{h}_t$ is the candidate hidden state; $\delta$ stands for the sigmoid function, by which the data are mapped to the range $[0, 1]$; $\tanh$ is the hyperbolic tangent function, by which the data are mapped to the range $[-1, 1]$; and $W_z$, $W_r$, $W$ are the weight matrices.

After the introduction of the GRU, high-value historical information can be retained effectively, so that the neural network can store and retrieve information, rationally use the effective information in the strategy to correlate events across time, carry out comprehensive analysis and judgment, and improve the prediction accuracy of the neural network strategy in a time-varying environment.
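A minimal GRU cell written out with explicit weight matrices to mirror the gate equations above (biases omitted); the dimensions and random initialization are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MinimalGRUCell:
    """One GRU step: update gate z_t, reset gate r_t, candidate state, new hidden state."""
    def __init__(self, input_dim, hidden_dim, scale=0.1):
        concat = input_dim + hidden_dim
        self.Wz = np.random.randn(hidden_dim, concat) * scale   # update gate weights
        self.Wr = np.random.randn(hidden_dim, concat) * scale   # reset gate weights
        self.W = np.random.randn(hidden_dim, concat) * scale    # candidate state weights

    def step(self, x_t, h_prev):
        xh = np.concatenate([h_prev, x_t])
        z_t = sigmoid(self.Wz @ xh)                              # how much new state to let in
        r_t = sigmoid(self.Wr @ xh)                              # how much old state to reuse
        h_tilde = np.tanh(self.W @ np.concatenate([r_t * h_prev, x_t]))
        return (1.0 - z_t) * h_prev + z_t * h_tilde              # hidden state passed onward
```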

    3.1.3 Decision-Making Action Output

By integrating the network action space, the action output in the WTA process can be divided into three categories: 1. Action subject: the selectable red fire unit; 2. Action predicate: the interception timing of the red fire unit and the type of weapon launched; 3. Action object: the interceptable blue target.

In the red-blue confrontation, the combat units of the two sides change dynamically as the situation develops, the intention of a blue target is closely related to its state, and different characteristics of the target state have different degrees of influence on the analysis of target intention. To improve training efficiency, the multi-head attention mechanism is considered. It simulates the human brain's differing attention to different objects in the same field of view by assigning correlation weights to the input sequence features, and it further analyzes the importance of different target features so that the agent can focus on blue targets with a higher threat degree at certain moments, give priority to important information, and make accurate decisions quickly.

According to how attention is distributed over the input features, attention mechanisms can be divided into hard attention and soft attention. The soft attention mechanism assigns attention to every input feature and learns the weight of each feature through continuous training. At the same time, the whole model based on the soft attention mechanism is differentiable, that is, it can be learned by backpropagation. Therefore, the soft attention mechanism is chosen in this paper.

The attention variable $z$ is used to represent the position of the selected information, and the probability of selecting the $i$-th input is defined as $a_i$; then

$$a_i = p\left(z = i \mid X, q\right) = \mathrm{softmax}\left(f\left(x_i, q\right)\right) = \frac{\exp\left(f\left(x_i, q\right)\right)}{\sum_{j=1}^{N} \exp\left(f\left(x_j, q\right)\right)}, \qquad f\left(x_i, q\right) = v^{\mathrm{T}} \tanh\left(W x_i + U q\right)$$

where $X = [x_1, \ldots, x_N]$ is the input information, namely the feature vectors of the interceptable blue targets; $q$ is the feature vector of the selectable red fire unit, namely the hidden state obtained by the GRU; $f(x_i, q)$ is the attention scoring function, representing the attention score of the red fire unit on the blue target; $W$ and $U$ are neural network parameters; and $v$ is the global situation feature vector. The current situation is processed by the softmax function to obtain the relative importance of each piece of information, realizing the focusing of local situation information.

After the global situation features are generated, attention scoring is performed on the feature vectors of the red fire units and the interceptable blue targets, producing a score for each red fire unit with respect to each interceptable blue target. Finally, sigmoid sampling of the score vector is carried out to generate the attack target of each red fire unit. When making decisions, the algorithm outputs the command and control for each unit, collects the status and overall situation of each unit, and then issues the next decision command.
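A sketch of the soft attention scoring and the sigmoid-based target selection described above, assuming the additive scoring function $f(x_i, q) = v^{\mathrm{T}}\tanh(W x_i + U q)$; parameter shapes and the Bernoulli-style sampling are illustrative assumptions.

```python
import numpy as np

def soft_attention(X, q, W, U, v):
    """a_i = softmax(f(x_i, q)) with additive scoring over blue-target features X."""
    scores = np.array([v @ np.tanh(W @ x_i + U @ q) for x_i in X])
    exp_scores = np.exp(scores - scores.max())        # numerically stable softmax
    return exp_scores / exp_scores.sum()

def select_targets(score_vector):
    """Sigmoid 'sampling' of a fire unit's score vector: 1 = engage that blue target."""
    probs = 1.0 / (1.0 + np.exp(-score_vector))
    return (np.random.rand(*probs.shape) < probs).astype(int)
```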

    3.2 Red Agent Training Method

In the simulation, the data themselves are unstable: each round of iteration may produce fluctuations that immediately affect the next round, making it difficult to obtain a stable model. This paper therefore decouples the intelligent WTA neural network during training, as shown in Fig. 7, dividing it into an inference module and a training module. The two modules use the same network structure and parameters. The inference module is responsible for interacting with the simulation environment to obtain interaction data; based on these data, the training module continuously updates the network parameters through the improved DDPG algorithm and synchronizes the network parameters to the inference module when the training module completes $N_i$ iterations.

Figure 7: Schematic representation of the decoupling training

Since the parameters of the inference module are fixed during the $N_i$ iterations, data differences are reduced and network fluctuation can be effectively avoided. The value of $N_i$ is affected by the fluctuation range of the training module, for which the threshold $T$ is considered: when the fluctuation range is less than $T$ and $N_i$ reaches its lower limit, the parameters of the inference module are updated synchronously.
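A small sketch of this synchronization rule: parameters are copied from the training module to the inference module only when the fluctuation range falls below the threshold T and at least N_i iterations have passed. The parameter dictionaries and the max-min fluctuation measure are assumptions.

```python
def maybe_synchronize(inference_params, training_params, recent_losses,
                      iters_since_sync, n_i_min, T):
    """Copy training-module parameters to the inference module when training is stable.

    recent_losses: recent training-loss values used to measure fluctuation (assumed max-min).
    Returns the updated iteration counter (reset to 0 after a synchronization).
    """
    fluctuation = max(recent_losses) - min(recent_losses)
    if iters_since_sync >= n_i_min and fluctuation < T:
        inference_params.update(training_params)   # synchronize network parameters
        return 0
    return iters_since_sync + 1
```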

    4 Simulation Deduction and Verification

This paper uses an intelligent simulation deduction platform to compile the confrontation operation plans of the red and blue sides and to realize data collection and verification of the air defense command and control agent. The deduction platform provides a variety of models such as UAVs, cruise missiles, bombers, early warning aircraft, long-range and short-range fire units, and radars, and supports operations such as aircraft takeoff and landing, flight along designated routes, bombing, missile launching, fire unit firing, and radar switching on and off. It can carry out countermeasure deduction in real time and evaluate the decision-making level of the agent.

    4.1 WTA Platform Architecture

As shown in Fig. 8, the training environment and the extrapolation environment are physically separated. The training environment is constructed in the digital battlefield according to the combat concept, and the training environment and the agent are deployed on the training cloud of a large-scale data center. After training in the learning environment of the training cloud for some time, the agent acquires a certain real-time decision-making ability. A corresponding extrapolation system is then constructed in the digital battlefield, running on the extrapolation cloud composed of a small-scale server cluster. The agents trained on the training cloud are also deployed on the extrapolation cloud, and the countermeasures learned during training are applied to the extrapolation system. Through intuitive adversarial deduction, the decision-making level of the agent is evaluated, the defects and deficiencies of the agent are analyzed, the hyperparameters of the neural network in the training environment are adjusted in a targeted manner, and the agent is trained iteratively.

Figure 8: Platform architecture diagram

    4.2 Simulation Environment

The whole simulation deduction process is based on a virtual digital battlefield close to the real battlefield, using a real elevation digital map, which can be configured with the physical constraints and performance indicators of the equipment, including the radar detectable area, missile kill zone, and kill probability. At the same time, the combat damage of both sides and other confrontation results can be recorded in real time.

The simulation environment is shown in Fig. 9. The confrontation is divided into two camps, red and blue. In the combat area, a certain number of blue forces attack the command post and airfield of the red side, and the task of the red side is to protect strategic locations such as the command post and airfield. The task of the blue side is to destroy the strategic points of the red side and attack the exposed red fire units. The red agent receives the battlefield situation in real time and issues decision-making instructions according to the situation to strike incoming blue targets and protect important locations. The training process uses the reward and punishment mechanism to continuously modify the behavior of the decision-making brain, finally enabling it to generate correct decision-making instructions for the situation in the environment.

Figure 9: Simulation environment

    4.3 Troop Setting

The main goal of the red side is to plan the use of ammunition rationally and defend the command post and airfield with minimal interceptor missile resources. The main goal of the blue side is to destroy the command post and airfield of the red side while attacking the exposed fire units. The force settings and performance indicators of the red and blue sides are shown in Table 2.

Table 2: Troop setting

In the initial deployment stage, taking into account the fire connection and overlap of each air defense position while ensuring a certain depth of destruction, some troops are selected for advance deployment. A total of nine interception units are deployed to protect the command post and airfield: three long-range interception units and two short-range interception units are deployed to defend the command post of the red side, and three long-range interception units and one short-range interception unit are deployed to defend the airfield, as shown in Fig. 9.

    4.4 Reward Function Setting

The reward signal is the only supervised information in RL, and whether an objective and appropriate reward function can be given is crucial for training an excellent model. The design of the reward function is closely related to the combat mission and directly affects the update of the strategy parameters. Because there are many units on both the red and blue sides, the state space and action space are correspondingly large. If the neural network only receives feedback from the reward function after each round of confrontation, exploration efficiency will be reduced and each action will face the problem of sparse feedback. That is, the neural network performs many “correct” actions, while a small number of “incorrect” actions lead to combat failure, and the “correct” actions are not rewarded, making strategy exploration and optimization difficult. Because of the complexity of the air defense combat command decision-making task, the probability of the agent exploring the winning state by itself is very low, so it is necessary to design the reward function reasonably, clarify the key events that trigger rewards, formulate the final indicators of each component of the reward function, and closely associate the reward trigger mechanism with the air defense WTA combat process.

Considering that bombers and fighters pose a greater threat to the red side, the interception of high-value blue targets is taken as the key reward and punishment trigger: a one-time stage reward is given after the first wave of blue attacks is successfully intercepted; when blue targets of different values are successfully intercepted, corresponding reward values are given; and when the red side wins, a winning bonus is given. The reward function is set as follows:

where $i, j, k, m$ respectively represent the number of intercepted bombers, fighters, cruise missiles, and UAVs. The red side is awarded 8 points, 5 points, 1 point, and 0.5 points for intercepting a bomber, fighter, cruise missile, and UAV, respectively. Since the events that trigger rewards for the red agent are all goals that the red agent must achieve to win, the reward function can gradually guide the agent toward the right learning direction.
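The per-interception rewards listed above can be collected into a simple event-based reward routine; the stage reward for clearing the first attack wave and the winning bonus are placeholders, since their values are not given in the text.

```python
def interception_reward(i, j, k, m, first_wave_cleared=False, red_wins=False,
                        stage_bonus=10.0, win_bonus=50.0):
    """Event-based reward: 8 / 5 / 1 / 0.5 points per intercepted bomber, fighter,
    cruise missile, and UAV; stage_bonus and win_bonus are assumed placeholder values."""
    reward = 8.0 * i + 5.0 * j + 1.0 * k + 0.5 * m
    if first_wave_cleared:
        reward += stage_bonus    # one-time reward for intercepting the first blue wave
    if red_wins:
        reward += win_bonus      # winning bonus
    return reward
```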

    4.5 Antagonism Criterion and Winning Condition Setting

The radar needs to be switched on throughout the guidance process. The red side's fire control radar radiates electromagnetic waves, which can be captured by the blue side and thus expose its position. If a red fire control radar is destroyed, the corresponding red fire unit cannot fight. The interception rate of the anti-aircraft missiles launched by the red fire units is about 45%–75% within the kill zone and fluctuates with the type of blue combat unit. If a red radar is jammed, its kill probability is reduced accordingly.

When the red command post and airfield are both destroyed, or the radar loss exceeds 60%, the red side fails; when the blue side loses more than 50% of its fighters, the blue side fails.
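These termination criteria can be expressed as a simple check; the function and argument names are illustrative.

```python
def check_outcome(command_post_destroyed, airfield_destroyed,
                  radar_loss_ratio, blue_fighter_loss_ratio):
    """Red fails if both strategic points are destroyed or radar losses exceed 60%;
    blue fails if it loses more than 50% of its fighters."""
    if (command_post_destroyed and airfield_destroyed) or radar_loss_ratio > 0.60:
        return "red_fails"
    if blue_fighter_loss_ratio > 0.50:
        return "blue_fails"
    return "ongoing"
```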

    5 Analysis of the Simulation Results

    5.1 Ablation Experiment

To compare and analyze the effects of the event-based reward mechanism, GRU, and multi-head attention mechanism on training effectiveness, this section designs an ablation experiment as shown in Table 3.

Table 3: Design of experiment

The results of the ablation experiment are shown in Figs. 10 and 11.

Figure 10: Average reward comparison

The horizontal axis represents the training round, and the vertical axis represents the average reward obtained. It can be seen from Fig. 10 that, as the number of training rounds increases, the average reward obtained by the algorithms using the three mechanisms increases, indicating that the three proposed mechanisms all play a role, although to different degrees. The DDPG algorithm has a low reward and rises extremely slowly, probably because it uses no mechanism and its training bottleneck is more obvious. The average reward of DDPG+G is also low, which may be due to the lack of real-time analysis of the battlefield situation, the difficulty of grasping battlefield dynamics in real time, and delayed rewards, making it difficult to obtain better training results. The higher rewards obtained by the DDPG+E and DDPG+M algorithms indicate that the event-based reward mechanism and the multi-head attention mechanism have a greater influence, with the event-based reward mechanism being the more significant. When the three mechanisms are used at the same time, the average reward obtained by the agent increases from 15 to about 80, an increase of 81.25%, indicating that the three mechanisms can significantly improve the performance of the algorithm, accelerate the training of the agent, and improve the quality of decision-making.

Figure 11: Comparison of ablation results

Fig. 11 compares the win rate and average reward obtained with different mechanisms. Consistent with the above analysis, when the three mechanisms are all used, an effective understanding of the air situation characteristics and a grasp of the real-time situation are ensured, and the agent can be rewarded in time; the winning rate of the red agent is then the highest, reaching 78.14%. If only one mechanism is used, the win rate of the agent also increases, indicating that the introduction of the mechanisms is necessary and critical for improving the win rate.

    5.2 Comparison of Different Algorithms

Under the neural network architecture proposed in this paper, the comparison of the improved DDPG algorithm, the Asynchronous Advantage Actor-Critic (A3C) algorithm, the DQN algorithm, and the RELU algorithm is shown in Fig. 12. Here, the RELU algorithm refers to solving the model with an expert rule base, serving as a comparison between the traditional method and the agent models.

In the horizontal comparison, the reward and win-rate curves of the rule algorithm are relatively stable and fluctuate only within a very limited range, indicating that the expert agent plays stably. However, agents trained with RL algorithms (such as DDPG, A3C, and DQN) achieve higher win rates and reward values than the rule algorithm, which shows that it is scientific and reasonable to use deep reinforcement learning algorithms to solve WTA problems in the field of air defense operations. Because the battlefield environment and the opponent's strategy change rapidly, it is difficult to handle complex situations with traditional rules alone, and such problems cannot be solved well in that way. The network model trained with a neural network and a DRL algorithm can provide good solutions to such problems and has a strong ability to adapt to complex battlefields.

In the longitudinal comparison, under the same network architecture, the improved DDPG algorithm obtains a higher win rate and reward than the A3C and DQN algorithms, indicating that the improved DDPG algorithm can effectively deal with such problems and that the algorithm proposed in this paper is effective. It is worth noting that the win-rate and reward curves jitter considerably because the scenario contains a large number of uncertainties, resulting in overall fluctuation of the curves.

As can be seen from Fig. 13, consistent with the above analysis, the improved DDPG algorithm achieves a higher win rate and average reward than RELU, A3C, and DQN. This shows that the improved DDPG algorithm is, to a certain extent, more suitable for solving the red-blue confrontation problem in air defense operations.

Figure 13: Comparison of the final reward and win rate for different algorithms

    5.3 Analysis of Confrontation Details

In the simulation and deduction on the virtual digital battlefield, the trained red agent uses ammunition more reasonably and tactics emerge, so it can better complete the task of defending key locations. This section reviews and analyzes the data obtained during the simulation and summarizes the strategies that emerge from the red agent in the confrontation.

(1) Reasonable planning of ammunition and first-line interception

As shown in Fig. 14, in the initial stage the red agent has almost no strategy, and each firepower unit fires freely whenever the firing conditions are met. The firepower units use too much ammunition to intercept incoming enemy aircraft at the same time, resulting in excessive ammunition consumption in the early stage and an extremely low efficiency-cost ratio, which wastes resources. When the blue side's important and threatening targets approach, the available ammunition is extremely limited, a very conservative shooting and interception strategy has to be adopted, and finally the red side fails due to insufficient ammunition.

After a period of training, as shown in Fig. 15, the red agent can better adapt to the blue side's offensive rhythm, master certain rules, and correctly plan the use of ammunition. After a blue target enters the kill zone, the firepower units cooperate to complete the interception with the least ammunition, reflecting the effectiveness of the strategy. When the blue side's important and threatening targets approach, the firepower units still have a large ammunition stock and can flexibly adapt the shooting strategy to complete the defense task with low ammunition consumption. Without training, the red fire units can only complete an interception when a blue target is about to enter the strategic hinterland; after training, the red firepower units can detect and intercept targets as early as possible, which further verifies the rationality and effectiveness of the network structure trained by the improved DDPG algorithm.

Figure 14: Agent performance before training

Figure 15: Agent performance after training

(2) Cooperation between short-range and long-range units for efficient interception

As shown in Fig. 16, before training, the red firepower units fight independently and only perform their own defensive tasks; when neighboring friendly units are in danger, they fail to respond in time to counterattack, showing no tactics or battle methods.

After a period of training, as shown in Fig. 17, the red agent can command the firepower units to carry out coordinated defense. While completing its own defense task, each unit provides timely and appropriate fire support to neighboring units, which greatly relieves the defensive pressure of other firepower units and improves the overall defense efficiency. When the long-range firepower units are seriously damaged, they turn off their radars in time and defend themselves in a silent state; the short-range firepower units then react actively, and when a blue target enters the ambush circle, they cooperate with the long-range firepower units to quickly and efficiently destroy the incoming blue targets. Since the blue side's strategy is not fixed, that is, the blue strategy is random, the trained red agent strategy has a certain generalization ability and can be adapted to other battle scenarios.

Figure 16: Agent performance before training

Figure 17: Agent performance after training

    6 Conclusion

Aiming at the difficulty traditional models and algorithms have in handling the uncertainty and nonlinearity of WTA, this paper constructs a new deep neural network framework for intelligent WTA, analyzes the network structure in detail, and introduces an event-based reward mechanism, a multi-head attention mechanism, and a GRU. Then, on a virtual digital battlefield close to the real battlefield, real-time confrontation simulation experiments are carried out, and the improved DDPG algorithm with dual noise and prioritized experience replay is used to solve the problem. The results show that, under the new deep neural network framework, the agent trained by the improved DDPG algorithm achieves a higher win rate and reward than the A3C, DQN, and RELU algorithms, and its planning and use of ammunition is more reasonable, demonstrating a high level of decision-making. The framework proposed in this paper therefore has a certain degree of rationality.

Acknowledgement: We thank our teachers, friends, and other colleagues for their discussions on the simulation and comments on this paper.

Funding Statement: This research was funded by the Project of the National Natural Science Foundation of China, Grant Number 62106283.

Author Contributions: Study conception and design: Gang Wang, Qiang Fu; data collection: Minrui Zhao, Xiangyu Liu; analysis and interpretation of results: Tengda Li, Xiangke Guo; draft manuscript preparation: Tengda Li, Qiang Fu. All authors reviewed the results and approved the final version of the manuscript.

Availability of Data and Materials: The datasets used or analysed during the current study are available from the corresponding author Qiang Fu on reasonable request.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
