
Soft-HGRNs: soft hierarchical graph recurrent networks for multi-agent partially observable environments

2023-02-06

Yixiang REN, Zhenhui YE, Yining CHEN, Xiaohong JIANG, Guanghua SONG

1 School of Aeronautics and Astronautics, Zhejiang University, Hangzhou 310027, China

2 College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China

Abstract: The recent progress in multi-agent deep reinforcement learning (MADRL) makes it more practical in real-world tasks, but its relatively poor scalability and the partially observable constraint raise more challenges for its performance and deployment. Based on our intuitive observation that human society could be regarded as a large-scale partially observable environment, where everyone has the functions of communicating with neighbors and remembering his/her own experience, we propose a novel network structure called the hierarchical graph recurrent network (HGRN) for multi-agent cooperation under partial observability. Specifically, we construct the multi-agent system as a graph, use a novel graph convolution structure to achieve communication between heterogeneous neighboring agents, and adopt a recurrent unit to enable agents to record historical information. To encourage exploration and improve robustness, we design a maximum-entropy learning method that can learn stochastic policies of a configurable target action entropy. Based on the above technologies, we propose a value-based MADRL algorithm called Soft-HGRN and its actor-critic variant called SAC-HGRN. Experimental results based on three homogeneous tasks and one heterogeneous environment not only show that our approach achieves clear improvements compared with four MADRL baselines, but also demonstrate the interpretability, scalability, and transferability of the proposed model.

Key words: Deep reinforcement learning; Graph-based communication; Maximum-entropy learning; Partial observability; Heterogeneous settings

    1 Introduction

Human society achieves efficient communication and collaboration, and can be regarded as a large-scale partially observable multi-agent system. In recent years, multi-agent deep reinforcement learning (MADRL) has been facilitated to solve real-life problems such as package routing (Adler et al., 2005; Ye DY et al., 2015) and unmanned aerial vehicle (UAV) control (Zhang Y et al., 2020), which are typically large-scale and partially observable tasks. To solve the environmental instability in the multi-agent system training process, Lowe et al. (2017) proposed multi-agent deep deterministic policy gradient (MADDPG), in which the centralized training and decentralized execution (CTDE) framework was introduced, leading to many variants (Iqbal and Sha, 2019; Ye ZH et al., 2022a). However, centralized training (CT) brings high computational complexity and poor scalability, and decentralized execution (DE) limits the ability of agents to obtain the necessary information for collaboration under partial observability. To handle the information insufficiency in partially observable environments, Foerster et al. (2016) proposed differentiable inter-agent learning (DIAL), pioneering the paradigm of communication learning and performing communication among all agents. An alternative is to allow agents to communicate within a certain range. Wang et al. (2020) proposed recurrent-MADDPG (R-MADDPG), which allows agents to communicate with a fixed number of nodes, and thus could restrain the range of cooperation. In this case, to achieve efficient neighboring communication, a scalable and flexible information aggregation mechanism is needed. Moreover, the recent progress in deep learning and graph learning provides a new idea for multi-agent reinforcement learning (MARL), i.e., to regard the multi-agent system as a graph. Based on this intuition, DGN (Jiang et al., 2020) and HAMA (Ryu et al., 2020) adopt a graph convolutional network with a self-attention kernel (Veličković et al., 2018) as the communication structure and achieve better performance and scalability. Based on these works, this study further explores a highly scalable MADRL algorithm for a large-scale partially observable Markov decision process (POMDP).

Inspired by the observation that everyone in human society obtains the necessary information for collaboration by communicating with his/her colleagues and recalling past experience, we propose a network structure called the hierarchical graph recurrent network (HGRN). HGRN regards the multi-agent system as a graph and each agent as a node. Each node encodes its local observation as the node embedding. Prior knowledge, such as distance and network connectivity, is included when connecting the nodes. To achieve communication between heterogeneous agents, we modify the hierarchical graph attention (HGAT) layer (Ryu et al., 2020) to better process graph data with heterogeneous nodes. Specifically, we employ self-attention as the convolutional kernel and extract valuable information from different groups of neighboring agents separately. With the HGAT-based communication channel, each agent can aggregate information from its heterogeneous neighbors to alleviate the information loss in a POMDP. In addition, recalling valuable information from temporal histories is helpful for solving partially observable problems. Thus, we use the gated recurrent unit (GRU) (Cho et al., 2014) to record long-term temporal information. In conclusion, HGRN is a spatial-temporal-aware network that can make decisions based on information aggregated from the agent's heterogeneous neighbors and long-term memory.

Apart from the network structure, the exploration strategy is also critical in a POMDP. Instead of learning deterministic policies with heuristic exploration methods such as ε-greedy, as many previous works have done (Jiang et al., 2020; Ryu et al., 2020), we propose a maximum-entropy method that can learn stochastic policies of a configurable target action entropy. As the demand for exploration varies across scenarios, the optimal target action entropy changes accordingly. Therefore, our strategy is more interpretable, and it is more convenient to find the optimal setting compared with the previous maximum-entropy MARL method (Iqbal and Sha, 2019), which learns stochastic policies with a fixed temperature parameter. We name the value-based policy trained in this way Soft-HGRN, and its actor-critic variant SAC-HGRN.

    The main contributions of this study are summarized as follows:

1. We introduce a novel network structure named HGRN, which is the first to combine the advantages of a graph convolutional network and a recurrent unit in MADRL. It uses spatio-temporal information to handle heterogeneous partially observable environments.

2. We propose two maximum-entropy MADRL algorithms (Soft-HGRN and SAC-HGRN) that introduce a learnable temperature parameter to learn an HGRN-structured policy with a configurable target action entropy.

3. Experiments show that our approach outperforms four state-of-the-art MADRL baselines in several homogeneous and heterogeneous environments. Case studies and ablation studies are implemented to validate the effectiveness of each component of our methods.

    2 Related works

To solve the problem of multi-agent cooperation, the simplest and most straightforward method is to use a single-agent deep reinforcement learning (SADRL) algorithm to train each agent independently. This method belongs to the decentralized training and decentralized execution (DTDE) paradigm and is known as independent learning. However, training multiple policies simultaneously may make the environment too unstable to converge. To solve the environmental instability, MADDPG (Lowe et al., 2017) extends DDPG (Lillicrap et al., 2015) by learning a centralized critic network with full observation to update the decentralized actor network with partial observability. This paradigm is known as CTDE and has led to many variants such as multi-actor-attention-critic (MAAC) (Iqbal and Sha, 2019) and PEDMA (Ye ZH et al., 2022a). However, because the input space of the centralized critic expands exponentially with the scale of the multi-agent system, it cannot converge easily in large-scale multi-agent tasks. As a consequence, many large-scale multi-agent cooperative tasks (Rui, 2010; Chu et al., 2020) are still handled by independent learning methods such as deep Q-network (DQN) (Mnih et al., 2015) and A3C (Mnih et al., 2016).

Communication learning aims to learn a communication protocol among agents to enhance cooperation. DIAL (Foerster et al., 2016) is the first work that proposes a learnable communication protocol in the multi-agent partially observable environment. CommNet (Sukhbaatar et al., 2016) uses the average embedding of all agents as the global communication value. Both DIAL and CommNet assume that communication is available among all agents, which is essentially impractical. The development of the graph neural network (GNN) in recent years offers a scalable and flexible communication structure for MADRL. Networked agents (Zhang KQ et al., 2018) construct the multi-agent system as a graph and transfer the network parameters along the edges. DGN (Jiang et al., 2020) stacks two graph attention network (GAT) layers to achieve inter-agent communication in a two-hop perception region. HAMA (Ryu et al., 2020) proposes a novel GNN structure that achieves communication among heterogeneous agents with an agent-level and a group-level self-attention structure, sequentially.

Similar to DGN and HAMA, the proposed Soft-HGRN and SAC-HGRN use the graph attention mechanism for inter-agent communication. However, DGN considers only communication among homogeneous agents, and HAMA has not devoted much attention to designing the overall network structure. By contrast, Soft-HGRN improves the graph convolution structure of HAMA to better communicate with heterogeneous agents and properly designs the overall network structure. As for memory modules, DRQN (Hausknecht and Stone, 2015), QMIX (Rashid et al., 2018), and R-MADDPG (Wang et al., 2020) also use recurrent units to store historical information to address POMDP problems. The novelty of our approach is that the stored historical information is the aggregated information obtained by graph convolution. In MAAC, Iqbal and Sha (2019) used maximum-entropy learning (Haarnoja et al., 2017, 2018) to train stochastic policies and set a fixed temperature parameter to control exploration; in our approach, the temperature parameter is learned according to the configurable target action entropy, which shows better interpretability and controllability.

    3 Preliminaries and notations

A partially observable Markov game (POMG) is a multi-agent extension of the Markov decision process. At each timeslot of a POMG environment with N agents, each agent i obtains a local observation o_i and executes an action a_i, and then receives a scalar reward r_i from the environment. The objective of reinforcement learning (RL) is to learn a policy π_i(a_i|o_i) for agent i that maximizes its discounted return R_i = Σ_{t=0}^{T} γ^t r_i^t, where γ ∈ [0, 1] is a discount factor and T is the total number of interaction steps of the Markov decision process. Our work is based on the POMG framework with extra neighboring communication.
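The discounted return defined above can be computed with a one-line sketch; the function name is ours, not the paper's.

```python
def discounted_return(rewards, gamma=0.99):
    """R_i = sum_t gamma^t * r_i^t: per-step rewards weighted by powers of gamma."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

For example, `discounted_return([1.0, 1.0, 1.0], gamma=0.5)` evaluates to 1 + 0.5 + 0.25 = 1.75.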

Q-learning (Watkins and Dayan, 1992) is a popular method in RL and has been widely used in multi-agent domains (Claus and Boutilier, 1998). It learns a value function Q(o, a) that estimates the expected return after taking action a under observation o, which can be recursively defined as Q(o_t, a_t) = E_{a_{t+1}}[r_t + γQ(o_{t+1}, a_{t+1})]. DQN (Mnih et al., 2015) is the first work that learns a Q-function using a deep neural network as its approximator, introducing experience replay (Lin, 1992) and a target network to stabilize training. At each environmental timeslot t, it stores the transition tuple (namely, the experience) (o_t, a_t, r_t, o_{t+1}) in a large sliding-window container (namely, the replay buffer), and resamples a mini-batch of experience from the replay buffer every τ steps. Then it updates the Q-function by minimizing the following loss:

L(θ) = E_{(o_t, a_t, r_t, o_{t+1})}[(r_t + γ max_{a'} Q'(o_{t+1}, a') − Q(o_t, a_t))²],

where Q' is the target network whose parameters are periodically updated by copying the parameters of the learned network Q. Once an action-value function Q is trained, an optimal policy can be obtained by selecting the action with the largest Q-value: π*(o) = argmax_a Q(o, a). As the greedy policy can easily converge to a sub-optimum, DQN is generally trained and executed with heuristic exploration strategies such as ε-greedy.
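The TD target and mini-batch loss described above can be sketched in a few lines; we use a tabular Q (dicts mapping (observation, action) pairs to values) purely for clarity, and the function names are ours.

```python
def td_target(r, next_obs, q_target, actions, gamma=0.99):
    """y = r + gamma * max_a' Q'(o', a'), computed with the frozen target network."""
    return r + gamma * max(q_target[(next_obs, a)] for a in actions)

def td_loss(batch, q, q_target, actions, gamma=0.99):
    """Mean squared TD-error over a mini-batch of (o, a, r, o') transitions."""
    errors = [td_target(r, o2, q_target, actions, gamma) - q[(o, a)]
              for (o, a, r, o2) in batch]
    return sum(e * e for e in errors) / len(errors)
```

In DQN proper, `q` would be the learned network and `q_target` its periodically copied frozen twin; the structure of the update is unchanged.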

GAT (Veličković et al., 2018) is a remarkable GNN and serves as a powerful network structure for calculating the relationships between agents in MADRL. Generally, agent i in the environment can be represented as a node with its local information e_i as the node embedding. The connections between nodes can be determined by distance, network connectivity, or other metrics. For convenience, we use G_i to represent the set consisting of agent i and its neighbors. To aggregate valuable information from its neighbors, agent i calculates its relationship to each neighbor j (j ∈ G_i) with a bilinear transform (Vaswani et al., 2017).
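A minimal sketch of this bilinear relationship, in the scaled dot-product form of Vaswani et al. (2017); matrices are plain lists of rows and all names are ours.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relation_scores(e_i, neighbor_embeddings, W_q, W_k):
    """score_ij = (W_q e_i) . (W_k e_j) / sqrt(d): agent i's relationship
    to each neighbor j, before normalization."""
    query = matvec(W_q, e_i)
    d = len(query)
    return [sum(qc * kc for qc, kc in zip(query, matvec(W_k, e_j))) / math.sqrt(d)
            for e_j in neighbor_embeddings]

def softmax(xs):
    """Normalize raw scores into attention weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    return [e / sum(exps) for e in exps]
```

Applying `softmax` to the scores yields the attention weights that later weight each neighbor's value vector.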

    Important notations used in this study are presented in Table 1.

    Table 1 List of important notations in the study

    4 Soft hierarchical graph recurrent networks

In this section, we introduce our MADRL approach for large-scale partially observable problems, including a novel network structure named HGRN and two maximum-entropy MADRL methods named Soft-HGRN and SAC-HGRN.

4.1 HGRN: aggregating information from neighbors and histories

The design of the HGRN is inspired by human society, in which each individual can communicate with his/her (logical or physical) neighbors and recall valuable information from his/her memory. We construct the multi-agent system as a graph, where each agent in the environment is represented as a node and its local information is the node embedding e_i. The node embedding can pass along the edges during forward propagation. For each agent i, we define G_i as the set consisting of agent i and the neighboring agents that interconnect with agent i. As for heterogeneous environments, there could be K different groups of agents, and we represent the set of agents in group k as C_k.

There are mainly two challenges in designing a network structure for communication among heterogeeous agents. First, the graph structure dynamically changes over time, which requires each node to process its neighboring nodes' information with good scalability and robustness. Second, the feature representations of different agent groups may vary greatly, which makes it challenging to use the features from heterogeneous agents.

To handle the first challenge, we adopt the idea of HGAT in HAMA (Ryu et al., 2020) to achieve communication among heterogeneous agents. The key idea of HGAT is to communicate by group. Specifically, for agent i, HGAT extracts information from agent group k and computes a separate node embedding vector g_i^k (k = 1, 2, ..., K). We first compute the individual relationship between agent i and each of its neighbors j in group k, represented as

α_ij^k = softmax_j((W_Q^k e_i)^T (W_K^k e_j) / √d).

Then we aggregate the information of each neighbor j by

g_i^k = σ(Σ_{j ∈ G_i ∩ C_k} α_ij^k · W_V^k e_j),

where W_Q^k, W_K^k, and W_V^k are learnable matrices of group k that transform the node embedding into the query, key, and value vectors, respectively.

To address the second challenge mentioned above, unlike the ordinary HGAT method, which uses a shared self-attention unit to integrate the group-level embeddings, we concatenate the embeddings of each group and process them with a linear transform:

g_i = W(g_i^1 | g_i^2 | ... | g_i^K),

where (·|·) denotes the concatenation operation. This is because the characteristics of different groups of agents can be quite different, and using shared parameters to process their embeddings directly may lead to training instability. Note that our version of HGAT is equivalent to GAT in homogeneous environments. The detailed structure of our modified HGAT is presented in Fig. 1.
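The two aggregation steps (attention-weighted sum within a group, then concatenation across groups followed by a linear transform) can be sketched as below; parameter shapes and names are illustrative, not the paper's.

```python
def aggregate_group(weights, values):
    """g_i^k: attention-weighted sum of one group's (already transformed)
    neighbor value vectors."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def merge_groups(group_embeddings, W_out):
    """Concatenate the per-group embeddings and apply a linear transform,
    as in our modified HGAT's inter-group step."""
    concat = [x for g in group_embeddings for x in g]
    return [sum(w * x for w, x in zip(row, concat)) for row in W_out]
```

Keeping a separate output row per concatenated position (rather than shared attention across groups) is what lets each group's statistics be handled by its own parameters.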

Fig. 1 Forward propagation of the proposed hierarchical graph attention (HGAT) layer to calculate agent i's node embedding. The process is composed of two steps: inter-agent and inter-group aggregation. Parameters in the purple rectangles are trainable. References to color refer to the online version of this figure

After using HGAT to realize spatial information aggregation, we hope that the agent can retain temporal information in memory, which requires the agent to judge and select critical information in its history. To this end, we apply the GRU to process the node embedding g_i^t at time t: h_i^t = GRU(g_i^t, h_i^{t-1}), where h_i^{t-1} is the hidden state stored in the GRU at time t-1, which records valuable information from historical node embeddings.
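The GRU step h_i^t = GRU(g_i^t, h_i^{t-1}) can be made concrete with a one-dimensional cell (Cho et al., 2014); biases are omitted and the weight names are ours.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_cell(g_t, h_prev, w):
    """One-dimensional GRU step; w maps six scalar weight names to values."""
    r = sigmoid(w['xr'] * g_t + w['hr'] * h_prev)        # reset gate
    z = sigmoid(w['xz'] * g_t + w['hz'] * h_prev)        # update gate
    n = math.tanh(w['xn'] * g_t + r * w['hn'] * h_prev)  # candidate state
    return (1.0 - z) * n + z * h_prev                    # blend new and old
```

The reset gate r is the quantity inspected in Section 5.6.1: when r is small the candidate state ignores the stored history, and when the update gate z is large the old hidden state is carried forward almost unchanged.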

The overall network structure of our proposed HGRN is shown in Fig. 2. For scalability and sample efficiency, agents in the same group share the same parameters. To reduce the computational complexity, HGRN takes only observations as the input and outputs the Q-value of each possible action. We use linear encoders to map the raw observations of heterogeneous agents to the same dimension, and then use the encodings as the node embeddings in HGAT. We performed a network architecture search to design the HGAT-based communication structure; details can be found in Section 5.4. Compared with HAMA (Ryu et al., 2020), which uses only one HGAT layer, our proposed HGRN stacks two HGAT layers and therefore has a two-hop perceptive field. Skip connections (He et al., 2016) are also created over the two HGAT layers to achieve better performance and faster convergence. After the GRU, a linear transform is applied to infer the Q-values of all actions.

Fig. 2 Overall structure of the hierarchical graph recurrent network (HGRN) for heterogeneous environments. For simplicity, we consider three agents from two groups: agent 1 and agent 2 from the same group, and agent 3 from the other group. Agent 2 is interconnected with agent 1 and agent 3. Enc: encoding; HGAT: hierarchical graph attention; Concat: concatenation; GRU: gated recurrent unit

For clarity, we use O_i^t to represent the set of observations of agent i and its neighbors at time t, i.e., O_i^t = {o_j^t | ∀j ∈ G_i^t}. Hence, the HGRN-structured Q-function can be represented as Q_i: O_i^t → R^A, where A is the dimension of the action space.

4.2 Soft-HGRN: learning energy-based policies with configurable action entropy

Deterministic policies, such as those of DQN and DGN, can easily fall into a local optimum, so ε-greedy is often used in the training and execution phases to encourage exploration. However, the exploration induced by ε-greedy is completely random and may incur performance degradation during execution. Our intuition is that a stochastic policy can replace ε-greedy to aid exploration, which allows the model to learn when and how to explore. To this end, we adopt maximum-entropy learning in the multi-agent setting to train an energy-based stochastic model. Specifically, we first obtain the action probability π(a_t|O_t) by applying a softmax to the Q-values produced by HGRN. Then we redefine the value function V(O_t) to learn the energy-based policy. Detailed derivations can be found in the supplementary materials.
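A sketch of the two quantities this paragraph names, in the standard maximum-entropy RL form (Haarnoja et al., 2017): the softmax policy over temperature-scaled Q-values and the corresponding soft value. The exact redefinition of V is deferred to the supplementary materials, so this is our assumption of the usual form, computed with the log-sum-exp trick for stability.

```python
import math

def soft_policy_and_value(q_values, alpha):
    """pi(a|O) = softmax(Q(O, .) / alpha) and V(O) = alpha * logsumexp(Q(O, .) / alpha)."""
    scaled = [q / alpha for q in q_values]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    pi = [e / z for e in exps]
    v = alpha * (m + math.log(z))
    return pi, v
```

As alpha shrinks, the policy concentrates on the greedy action; as it grows, the policy approaches uniform, which is exactly the lever the learnable temperature pulls.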

Then the Q-function is updated by minimizing the mean squared temporal-difference error (TD-error):

L(θ) = (1/S) Σ_t (y_t − Q(O_t, a_t))²,

where S is the size of the mini-batch and y_t = r_t + γV(O_{t+1}) is the learning target of Q(O_t, a_t).

In general, action entropy is widely used to measure the degree of exploration of a policy. It is defined as the information entropy of the policy's action probability:

H_π(O) = −Σ_a π(a|O) log π(a|O).

As mentioned above, the action entropy is controlled by the temperature hyper-parameter α. However, the effect of α on the action entropy varies with the network structure and environment. By contrast, setting the action entropy as the goal of the learned model is more intuitive and interpretable. Therefore, we set a target action entropy H_target, and then adaptively adjust the temperature parameter α to approximate the goal E[H_π(O)] → H_target. Specifically, we learn α with gradient descent by

α ← α + λ_α · f(E[H_π(O)] − H_target),

where f is a predefined function that satisfies f(x) · x ≤ 0 and λ_α is the learning rate of α. Also note that H_target is the target entropy value for the policy to approximate; we generally represent it as H_target = p_α · max H_π, where max H_π is determined by the action space of policy π, and p_α ∈ [0, 1] is a new hyper-parameter to be tuned. We call the approach that learns the HGRN-structured stochastic model with a configurable action entropy Soft-HGRN. An intuitive flowchart of the Soft-HGRN learning process is shown in Fig. 3.
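One concrete instance of this update, under our assumptions: f(x) = −x (which satisfies f(x) · x ≤ 0), with the step clipped to 0.01α as described for Soft-HGRN in Section 5.6.2. The function name and learning-rate value are illustrative.

```python
def update_alpha(alpha, mean_entropy, target_entropy, lr=1e-3, clip_ratio=0.01):
    """Move alpha toward the configured target entropy: alpha rises when the
    policy's entropy is below target (more exploration needed) and falls when
    it is above. f(x) = -x is one choice satisfying f(x) * x <= 0."""
    step = lr * -(mean_entropy - target_entropy)  # lr * f(E[H] - H_target)
    limit = clip_ratio * alpha
    step = max(-limit, min(limit, step))          # clip the step to 0.01 * alpha
    return alpha + step
```

With entropy below target the returned α is larger, matching the behavior reported for the training curves in Fig. 7.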

Fig. 3 Flowchart of the Soft-HGRN learning process. Solid lines denote forward propagation to calculate the loss, and dashed lines denote backward propagation (HGRN: hierarchical graph recurrent network; TD: temporal-difference)

4.3 SAC-HGRN: actor-critic styled stochastic policy

In some complicated scenarios, decoupling the tasks of value estimation and action selection can improve the performance of the model. Adapted from soft actor-critic (SAC) (Haarnoja et al., 2018), we design an actor-critic styled variant named SAC-HGRN. The detailed derivations and pseudocode for training SAC-HGRN are available in the supplementary materials.

    5 Experiments

    We applied our methods in four multi-agent partially observable environments with the main goal of answering the following questions:

Q1: How do Soft-HGRN and SAC-HGRN perform in homogeneous and heterogeneous tasks compared with other state-of-the-art baseline algorithms?

Q2: How does the HGRN structure extract the necessary features for effective cooperation?

Q3: How does the learnable temperature of soft Q-learning work to control the action entropy and improve the performance of the policy?

We also perform additional experiments, including a network architecture search, parameter analysis, an interpretability study, and an ablation study.

    5.1 Environments

In this subsection, we describe each environment in detail. The four simulation environments are all partially observable, including three homogeneous scenarios (unmanned aerial vehicle-mobile base station (UAV-MBS), Surviving, and Pursuit) and one heterogeneous scenario (cooperative treasure collection, CTC). Fig. 4 shows screenshots of the four tested environments.

1. UAV-MBS (Ye ZH et al., 2022b): a group of UAVs (solid blue circles) collaboratively serve as mobile base stations that fly around a target region to cover and provide communication services to the randomly distributed ground users (green squares). The dashed green circles, yellow circles, and blue circles represent the coverage range, observation range, and communication range of each UAV, respectively. The dashed blue arrow between UAV 0 and UAV 1 represents GAT connectivity.

2. Surviving (Jiang et al., 2020): a group of agents (green circles) cooperate to explore a big map and collect the randomly distributed food to prevent starvation. The dashed green square is the agent's observation area. Green squares with different shades denote the different numbers of food items scattered from the food resources (yellow squares).

3. Pursuit (Zheng et al., 2018): an adversarial environment in which a group of predators (blue circles) controlled by the model is rewarded for attacking the prey (green circle). The prey is trained to escape from the predators. The gray square represents an obstacle, which can be used by the predators to lock the prey.

4. Cooperative treasure collection (CTC) (Iqbal and Sha, 2019): a heterogeneous environment containing three types of controllable agents, the treasure hunters (gray circles), red banks (red circles), and blue banks (blue circles). The goal of the agents is to collect the colored treasure (represented as squares) and deposit it in the bank of the matching color. The dashed colored arrows denote HGAT connectivity.

    Detailed descriptions and settings for these environments are available in supplementary materials.

    5.2 Baseline methods

We compared our approach with four MADRL baselines: DQN, CommNet, MAAC, and DGN.

DQN is a simple but strong decentralized approach for large-scale multi-agent tasks. CommNet is a centralized approach that performs communication among all agents; we used it for comparison to show the superiority of our HGAT-based communication structure. MAAC is a popular CTDE method that learns a centralized critic with full observability to update the decentralized stochastic policy. Therefore, we used MAAC for comparison to show the need to perform communication during execution, and the fact that there is no need to provide global information during training. DGN is a state-of-the-art MADRL algorithm that also uses GAT to communicate. We compared DGN with HGN (DGN with HGAT) to show the benefit of enabling communication with heterogeneous agents, and compared HGN with Soft-HGRN to examine the necessity of the GRU and the maximum-entropy-based stochastic policy. A summary comparison of these algorithms is shown in Table 2.

    Table 2 Comparison of the characteristics of each algorithm

    5.3 Implementation details

We conducted experiments on the four environments with our methods and the four baselines. Each model was updated until convergence. We used Adam (Kingma and Ba, 2015) as the network optimizer, and used stochastic gradient descent (SGD) with the same learning rate to update the learnable temperature parameter α. To make a fair comparison, the communication structures of DGN and MAAC were implemented as two stacked GAT layers with skip connections, similar to HGRN as shown in Fig. 2. The hyper-parameter settings of all tested models in all environments are available in the supplementary materials.

    5.4 Network architecture search of communication structure

In the HGRN network, we process each node's local observation with an encoder, and then aggregate information from its (homogeneous/heterogeneous) neighbors through our HGAT-based communication structure. During the design of the communication structure, we tested three candidates: (1) one HGAT layer; (2) two stacked HGAT layers; (3) two stacked HGAT layers with skip connections.

We tested each version of the communication structure with Soft-HGRN in UAV-MBS. The two stacked HGAT layers with skip connections proved to converge faster and perform better. Therefore, we applied the third communication structure in DGN, MAAC, Soft-HGRN, and SAC-HGRN. The evaluation results are available in the supplementary materials.

    5.5 Performance evaluation and ablation studies

After designing the overall network structure, we first compared the performance of Soft-HGRN and SAC-HGRN with the other baselines to answer Q1.

For homogeneous environments, we used the mean episodic reward over 2000 testing episodes as our evaluation metric. We performed ablation studies by removing each component (HGAT, GRU, and the stochastic policy) from our approach to validate the necessity of each component. For each case, we conducted three repeated runs and calculated the average results along with their standard deviations (Table 3). Note that HGAT is equivalent to GAT in homogeneous tasks, and hence DGN is equivalent to Soft-HGRN-G and Soft-HGRN-S in the table. It can be seen that our approach outperforms the other baselines by a significant margin, and removing any component causes performance degradation. All learning curves are available in the supplementary materials.

We also noticed that the necessity of each component varies across environments, possibly due to the different nature of each task. First, removing the GRU in UAV-MBS and Pursuit significantly degraded performance, whereas the effect was not that obvious in Surviving. This finding shows that temporal memory is needed in UAV-MBS and Pursuit, possibly because the positions of ground users in UAV-MBS are fixed during an episode, and predators in Pursuit should construct a consistent closure to lock the prey, whereas in Surviving, the food is quickly consumed and regenerated at another position. Second, HGAT-based communication is extremely important in Surviving, possibly because the dynamic food distribution requires cooperative exploration among the agents. A notable finding is that disabling communication leads to the best performance in Pursuit. Our insight is that the key factor in this task is to form a stable closure with neighboring predators, which requires only information about the very nearest neighbors in the local observation range, while information from far away may distract the agent. Third, the stochastic policies trained with maximum-entropy learning outperform the corresponding deterministic policies, proving their capability to substitute for ε-greedy and provide more intelligent exploration. Finally, we discuss the difference between the proposed Soft-HGRN and SAC-HGRN. It can be seen in Table 3 that SAC-HGRN outperforms Soft-HGRN in every homogeneous and heterogeneous scenario. This is possibly because Soft-HGRN uses the Q-values with a softmax to output the action probabilities, which leads to a small temperature value (typically around 10^(-3)) in environments where the action-values are very close; hence, the performance may be sensitive and fragile to the update of α.

    Table 3 Evaluated episodic reward of different methods in three homogeneous tasks

In the heterogeneous environment CTC, we focused mainly on the impact of HGAT's communication between different types of agents. To this end, we also tested DGN with HGAT (namely HGN), and Soft-HGRN/SAC-HGRN without heterogeneous communication (namely Soft-DGRN and SAC-DGRN). The learning curves of all tested models are shown in Fig. 5. It can be seen that every HGAT-based model (solid lines) performs better than its GAT-based variant (dashed lines), which demonstrates the effectiveness of our designed hierarchical communication structure.

Fig. 5 Learning curves of various algorithms in cooperative treasure collection (CTC). References to color refer to the online version of this figure

    5.6 Case studies about interpretability

    In this subsection,we perform a series of case studies to investigate the principle behind the performance of our approach.

    5.6.1 How does HGRN extract valuable spatiotemporal features?

To answer Q2, we studied the effectiveness of the proposed HGRN structure, which includes two key components, HGAT and GRU. Ideally, each HGRN agent can extract valuable embeddings from its neighbors and recall necessary information from its memory to construct features for decision-making. To provide an intuitive example and obtain some meaningful insights, we considered a special situation in Surviving. A screenshot of the considered case is shown in Fig. 6a, in which we controlled agent 0 to pass by agent 1 and agent 2. We then inspected the graph attention weights (which describe the predicted importance of each agent's information to agent 0) and the mean value of the GRU's reset gate (which controls the weight of history features in forming the output embedding). The results are shown in Figs. 6b and 6c. At time 2, we observed that HGRN managed to pay the most attention to agent 1, who observed the most food. Between time 2 and time 4, during which the HGAT connection was established, the first observation is that the agent pays more attention to itself, possibly because HGRN has already stored the information from its neighbors; meanwhile, the reset-gate value of the GRU grows rapidly, which indicates that the model relies more on its memory to make decisions.

    5.6.2 How does the self-adaptive temperature work?

To answer Q3, we first illustrate the training loop of the learnable temperature parameter α. As can be seen in Fig. 7, during the Soft-HGRN training process, when the action entropy is smaller than the target entropy, the model tends to increase α to encourage exploration, and vice versa. To prevent instability, we clipped the gradient of α when it exceeded the value of 0.01α, which contributes to the stable convergence of the action entropy. As for the principle behind the better performance of the stochastic policy, our insight is that it enables the agent to learn when and how to explore and exploit.

    Fig. 7 Learning curves of action entropy, temperature parameter α, and its gradient ∇α during the training process of Soft-HGRN in Surviving (References to color refer to the online version of this figure)

    To support this idea, we considered the same Surviving situation as shown in Fig. 6a to demonstrate that our approach enables the agent to learn when and how to explore and exploit. We tested the Soft-HGRN model in this scenario and inspected its output action probability distribution and the corresponding action entropy, as shown in Figs. 6d and 6e. Note that we trained Soft-HGRN with p_α=0.9 and that the dimension of the action space is 5; thus, the target entropy of the policy is H_target = 0.9 max H = 1.448494. At time 0 and time 1, because agent 0 could not observe any food or neighbor, its action probability approximated a uniform distribution, which has a large action entropy. At time 2, agent 0 established a GAT connection with agent 1 and agent 2, and because agent 1 observed more food, agent 0 tended to move rightward or upward to approach agent 1. At time 3, agent 1 was on the upward side; hence, "upward" was the action with the largest probability, and the output action probability distribution had a smaller action entropy. These observations demonstrate that Soft-HGRN has learned when to explore (by controlling the action entropy of the output action distribution based on the observation) and how to explore (by assigning the supposed optimal action a relatively large probability).
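    The target-entropy figure quoted above follows directly from the entropy of a uniform distribution: with 5 discrete actions, max H = ln 5, and the target is p_α · max H:

```python
import math

p_alpha = 0.9
n_actions = 5
h_max = math.log(n_actions)   # maximum entropy (uniform distribution): ln 5
h_target = p_alpha * h_max    # configurable target entropy
print(round(h_target, 6))     # 1.448494
```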

    Fig. 6 An intuitive example to show how the hierarchical graph recurrent network (HGRN) extracts valuable features from neighbors and histories: (a) a screenshot of the intuitive example; (b) graph attention weights of agent 0; (c) mean value of GRU's reset gate; (d) action probability distribution of agent 0; (e) action entropy of agent 0. GAT: graph attention network; GRU: gated recurrent unit

    In addition, we noticed that the optimal action-entropy factor p_α depends on the exploration demand of the environment. For instance, in Surviving, agents need to consistently explore for food, so the action entropy should be large; in Pursuit, a stable enclosure formed by multiple agents is required to lock the prey, and thus the action entropy should be small. To verify this intuition, and to choose the optimal entropy target p_α for our proposed stochastic policies, we performed a parameter analysis to determine how the target entropy factor p_α affects the performance in each environment, as shown in Fig. 8.

    Fig. 8 Performance of Soft-HGRN with different target entropy factors in UAV-MBS, Surviving, and Pursuit (References to color refer to the online version of this figure)

    5.7 Transferability

    Weak generalization ability is a pressing problem for RL algorithms. To test the generalization ability of our model, we trained all models in the Surviving environment with 100 agents, and then deployed them in environments with different agent scales for testing, as shown in Fig. 9. It can be seen that our approach outperforms the baselines at all tested agent scales. An interesting finding is that Soft-HGRN performs better when the agent scale is <100, while SAC-HGRN shows the best performance when the scale is >100. Our insight is that it is easier for Soft-HGRN to estimate the return value of a situation with a smaller agent scale, whereas the discriminative policy of SAC-HGRN is better when the environment becomes too unstable to estimate the return due to the ultra-large agent scale.

    Fig. 9 Comparison of different algorithms in Surviving with various agent scales (References to color refer to the online version of this figure)

    6 Conclusions

    In this paper, we have proposed a value-based MADRL model (namely, Soft-HGRN) and its actor-critic variant SAC-HGRN to address large-scale multi-agent partially observable problems. The proposed approach consists of two key components: a novel network structure named HGRN that aggregates information from neighbors and history, and a maximum-entropy learning technique that self-adapts the temperature parameter to learn a stochastic policy with configurable action entropy. Experiments on four multi-agent environments demonstrated that the learned model outperforms other MADRL baselines, and ablation studies showed the necessity of each component in our approach. We have also analyzed the interpretability, scalability, and transferability of the learned model. As for future work, we will investigate transferring our methods to continuous action spaces and implementing more challenging adversarial environments, such as StarCraft II.

    Contributors

    Zhenhui YE, Yixiang REN, and Guanghua SONG designed the research. Zhenhui YE and Yixiang REN processed the data. Zhenhui YE drafted the paper. Yixiang REN helped organize and revise the paper. Yining CHEN, Xiaohong JIANG, and Guanghua SONG revised and finalized the paper.

    Compliance with ethics guidelines

    Yixiang REN, Zhenhui YE, Yining CHEN, Xiaohong JIANG, and Guanghua SONG declare that they have no conflict of interest.

    Data availability

    The data that support the findings of this study are available from the corresponding author upon reasonable request.
