
    Embedding expert demonstrations into clustering buffer for effective deep reinforcement learning*


    Shihmin WANG, Binqi ZHAO, Zhengfeng ZHANG, Junping ZHANG, Jian PU

1Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai 200433, China

    2Institute of Science and Technology for Brain-inspired Intelligence, Fudan University, Shanghai 200433, China

E-mail: wangshimin20@fudan.edu.cn; bqzhao20@fudan.edu.cn; jpzhang@fudan.edu.cn; jianpu@fudan.edu.cn

Abstract: As one of the most fundamental topics in reinforcement learning (RL), sample efficiency is essential to the deployment of deep RL algorithms. Unlike most existing exploration methods that sample an action from different types of posterior distributions, we focus on the policy sampling process and propose an efficient selective sampling approach to improve sample efficiency by modeling the internal hierarchy of the environment. Specifically, we first employ clustering methods in the policy sampling process to generate an action candidate set. Then we introduce a clustering buffer for modeling the internal hierarchy, which consists of on-policy data, off-policy data, and expert data to evaluate actions from the clusters in the action candidate set in the exploration stage. In this way, our approach is able to take advantage of the supervision information in the expert demonstration data. Experiments on six different continuous locomotion environments demonstrate superior reinforcement learning performance and faster convergence of selective sampling. In particular, on the LGSVL task, our method can reduce the number of convergence steps by 46.7% and the convergence time by 28.5%. Furthermore, our code is open-source for reproducibility. The code is available at https://github.com/Shihwin/SelectiveSampling.

Key words: Reinforcement learning; Sample efficiency; Sampling process; Clustering methods; Autonomous driving

https://doi.org/10.1631/FITEE.2300084

CLC number: TP181

    1 Introduction

Reinforcement learning (RL) algorithms have recently attained significant performance in various domains, such as autonomous vehicles (Li et al., 2023), traffic signal control (Dai et al., 2022), active object detection (Liu et al., 2022), and StarCraft II (Vinyals et al., 2019). By using the reward function, which defines what an agent should do, RL algorithms determine how to perform an action through trial-and-error learning under supervision, maximizing the cumulative rewards. Then, agents interact with the environment using the learned policy from the RL algorithms.

However, the effectiveness of RL is heavily constrained by sample efficiency. The reason is that during the RL training process, the agents constantly collect experience by interacting with the environment according to the latest learned policy, and then use the experience to update the policy (Sutton and Barto, 1998) to maximize the expected future returns. Thus, millions of samples are often required to train a satisfactory policy that attains remarkable performance improvement and robust behavior, which results in an expensive environment interaction process.

One way to alleviate the need for samples is to solve the credit assignment problem to improve sample efficiency. As one of the most successful on-policy algorithms, proximal policy optimization (PPO) (Schulman et al., 2017) uses a truncated version of the generalized advantage estimator (GAE) (Schulman et al., 2016), which employs importance sampling to reduce sample complexity, and updates the policy with new samples acquired from the environment interaction per gradient step. Considering that the sample complexity of an effective policy trained with PPO increases with task complexity, soft actor–critic (SAC) (Haarnoja et al., 2018), one of the remarkable off-policy algorithms, addresses this issue by introducing a replay buffer that stores all transition samples.

Nevertheless, these improvements are still too expensive to use in some costly domains, such as robotic manipulation and autonomous driving. Alternatively, model-based RL (Moerland et al., 2023) learns a dynamics model of the environment from interaction experience to produce generated samples. Another alternative is to regard expert demonstration as effective supervision added during training, which is called learning from demonstration (LfD) (Ravichandar et al., 2020). The aforementioned studies aim to solve the credit assignment problem by formulating a Q-value to represent the expected future return, and then use it as a critic to guide the direction of policy optimization. However, a robust and accurate dynamics model is difficult to learn.

Another problem for sample efficiency is the exploration–exploitation dilemma. Given a fixed amount of exploration, the ideal goal of bounding the expected return is unrealistic in real-world applications because the degree of exploration is hard to quantify. Therefore, various heuristic strategies have been proposed to select an appropriate action that balances exploration and exploitation throughout training. Several commonly used strategies, including optimistic exploration (Cheung et al., 2020), posterior sampling (Wang and Li, 2020), and information gain (Houthooft et al., 2016), are used to select the next action to better maintain this delicate balance. However, these methods neglect the importance of expert demonstration in exploration.

In this paper, we propose a novel and efficient sampling approach, selective sampling, which evaluates the next action to interact with the environment. Specifically, we model the internal hierarchy of environments with a clustering buffer in which expert demonstration is combined with on-policy and off-policy data. Selective sampling is performed based on cluster values to improve sample efficiency. Our experiments indicate that, compared with other algorithms, our proposed method can help deep RL algorithms achieve better performance on continuous locomotion tasks. Furthermore, from the application point of view, our method can achieve performance that surpasses those of other methods in a simulated autonomous driving environment. The main contributions of this paper include the following: (1) introduction of the principle of internal hierarchy to evaluate samples with a clustering buffer during the sampling process; (2) provision of a two-stage perspective of the sampling process and integration of expert demonstration with a clustering buffer in the exploration stage.

    2 Related works

In past research, many RL algorithms have tried to use expert data to achieve good results, which is called human–machine augmented intelligence (HAI) (Xue et al., 2022; Zhang et al., 2023a; Zhou et al., 2023). The use of expert data in past methods can be divided into expert feedback and expert data; among them, the learning method from expert data is teaching–learning. The core motivation of teaching–learning is to use expert data to guide the two critical processes in RL, "credit assignment" and "exploration and exploitation," during the interaction between the agent and the environment, so that the value network and the policy network can iterate with each other through alternating policy evaluation and policy improvement until the optimum is achieved.

    2.1 Guidance for value networks

In prior approaches, expert data have guided the value network. The primary motivation for using expert data to guide the value network is the hope that the value function can reasonably distribute credit to the expert's state–action pairs in the policy evaluation process. For example, Hester et al. (2018) used expert data to pre-train the Q-value network in a deep Q-network (DQN) by introducing expert data into the replay buffer, which is called deep Q-learning from demonstration (DQfD). Vecerik et al. (2017) applied similar ideas to deep deterministic policy gradient (DDPG) and proposed DDPG from demonstration (DDPGfD). Soft actor–critic from demonstration (SACfD) (Haarnoja et al., 2018) used a similar guidance approach for Q-value networks in SAC.

However, the above methods require a large sample size, which hinders their further development. To solve the problem of high sample complexity, subsequent research naturally uses the supervision information contained in expert data more directly.

    2.2 Guidance for policy networks

Expert data can guide not only the value network but also, naturally, the policy network. The use of expert data to guide the policy network is intended to provide guidance in the early stage of policy exploration. This is achieved by using a pre-training method to give the policy a good initial point in the state–action pair space, making it closer to the expert policy. Then, the parameters of the pre-trained policy model are used as the initialization parameters of the policy model in the RL agent. For example, the SAC-based behavior cloning approach (Nair et al., 2018), the Tencent King of Glory game AI JueWu (Ye et al., 2020), the Go AI AlphaGo (Silver et al., 2016), and the StarCraft II game AI AlphaStar (Vinyals et al., 2019) all belong to this category.

    2.3 Guidance for replay buffers

Expert data can directly guide the replay buffers. This method usually modifies the sampling mechanism of the interactive experience pool. Instead of random sampling, it uses a playback mechanism to preferentially sample the expert data, for example, prioritized experience replay (PER) (Schaul et al., 2016), hindsight experience replay (HER) (Andrychowicz et al., 2017), and attentive experience replay (AER) (Sun et al., 2020).

    2.4 Guidance for the reward function

The expert data can also guide the reward function. This is done by modeling the reward function and extracting the reward information from the expert trajectory data. This method of extracting knowledge from the heuristic information of expert trajectories is called inverse reinforcement learning (IRL). After the generative adversarial network (GAN) was proposed (Goodfellow et al., 2020), adversarial training techniques came to the attention of RL researchers. Ho and Ermon (2016) proposed generative adversarial imitation learning (GAIL), a generative adversarial network based imitation learning algorithm. GAIL generates the predicted actions of the network with the generator and uses the discriminator to determine whether the actions are derived from the expert data distribution or the network predictions, which largely extracts the supervisory information contained in the expert data. GAIL achieved the best results at the time on many continuous control tasks. However, GAIL has the disadvantage of being sensitive to environmental noise. Fu et al. (2017) proposed adversarial inverse reinforcement learning (AIRL), which is more robust to the dynamic characteristics of the environment.

    3 Proposed method

    3.1 Preliminaries

RL usually formulates sequential decision-making problems as discounted finite horizon Markov decision processes, defined by M = (S, A, R, T, γ, T0), where s ∈ S represents a state from a set of finite states, a ∈ A denotes an action in a set of finite actions, R: S×A → R is the reward function, T: S×A×S → R is the transition probability distribution, T0: S → R is the initial state distribution, and γ ∈ (0, 1) is a discount factor. Let π represent the current policy π: S → A and τ represent a trajectory of length L, illustrated as τ = (s0, a0, r0, s1, a1, r1, ..., sL); then the trajectory distribution of the current policy is described as
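pπ(τ) = T0(s0) ∏_{t=0}^{L-1} π(at | st) T(st+1 | st, at)  (the standard product form under the definitions above).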

    Thus, the objective of the expected policy return is
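J(π) = E_{τ~pπ} [ Σ_{t=0}^{L-1} γ^t r(st, at) ]  (written here in its common discounted-return form).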

In SAC (Haarnoja et al., 2018), the function approximation of the Q-value with parameter θ is updated toward the target Q-value by using transitions (st, at, rt, st+1) ∈ D sampled uniformly from the replay buffer to minimize the Bellman residual B(Q):
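B(Q) = E_{(st,at,rt,st+1)~D} [ (Qθ(st, at) − (rt + γ E_{a~πφ(·|st+1)} [Qθ-(st+1, a) − α log πφ(a | st+1)]))^2 ]  (the standard soft Bellman residual of SAC, with entropy temperature α),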

where θ- and φ are the parameters of the target Q-network and the current policy, respectively. As in expression (4), a ~ πφ(· | st+1) is the formalism of the sampling process of the policy. We can explicitly model the policy network's output with a Gaussian distribution. In the implementation, the input to the policy network is st+1, and the output is a Gaussian distribution with mean function μ and variance function σ, which together form the action distribution. Normally, the reparameterization trick (Kingma and Welling, 2014) is employed to sample a continuous action from this Gaussian distribution, shown as
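a = μ(st+1) + σ(st+1) ⊙ ε,  ε ~ N(0, I)  (the usual reparameterized form, which keeps the sampling step differentiable with respect to φ).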

For the delicate balance of exploration and exploitation, the well-known optimistic exploration method, upper confidence bound (UCB) (Bellemare et al., 2016), argues that new states are potentially good, so we should encourage the agent to choose an action leading to unknown states, formalized as
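a = argmax_{a∈A} [ Q(st, a) + c·sqrt(ln t / N(a)) ]  (one common count-based form, with exploration coefficient c and time step t),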

    where N(a) is a count-based function to record the visiting frequency for action a.

    3.2 Overview of the algorithm structure

In this subsection, we introduce our proposed approach, selective sampling, in detail. Our goal is to model the internal hierarchy of the environment by clustering methods on the replay buffer and to introduce expert data for guidance. Furthermore, selective sampling is integrated with SAC because it achieves better performance on sample complexity than on-policy algorithms. For better understanding, we illustrate the two stages of the sampling process, the target-Q stage and the exploration stage, within deep RL in Fig. 1.

Fig. 1 Structure of selective sampling. Integrated with soft actor–critic (SAC), which determines the target-Q stage, learner, and policy network, selective sampling with a clustering buffer consists of on-policy data, off-policy data, and expert data that model the internal hierarchy of three environments. It takes effect mainly in the sampling process of the exploration stage to evaluate the candidate set for the selected action

In the exploration stage, we first concatenate the on-policy data, off-policy data, and expert data, and subsequently perform clustering operations on these data. This process divides many high-dimensional data points, which spread throughout the state–action pair space, into several clear hierarchical clusters. At the same time, the clustering process yields a classifier that can classify any of the state–action pairs in the space. From this perspective, this process models the internal hierarchy of the environment. After the above modeling, we use the model as an evaluation criterion for state–action pairs from the candidate set to select an appropriate action. The candidate set is composed of sampling data at the current state, denoted as Ds = {(st, a1), (st, a2), ..., (st, am)}, where the samples are acquired by applying the reparameterization trick in Eq. (5) to the sampling process of the policy, ai ~ πφ(· | st), with isotropic noise ε sampled from the standard Gaussian distribution.
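As a concrete illustration, a minimal sketch of this candidate-set construction is given below. It assumes a PyTorch-style Gaussian policy head; the names policy_net and m are illustrative, not taken from the released code.

import torch

def build_candidate_set(policy_net, s_t, m=8):
    """Draw m reparameterized actions for the current state s_t.

    policy_net(s) is assumed to return the mean and log-std of a
    diagonal Gaussian over actions (a common SAC-style head).
    """
    mu, log_std = policy_net(s_t)                        # both of shape (action_dim,)
    std = log_std.exp()
    eps = torch.randn(m, mu.shape[-1])                   # m draws of standard Gaussian noise
    actions = mu.unsqueeze(0) + std.unsqueeze(0) * eps   # reparameterization trick
    # Candidate set Ds = {(s_t, a_1), ..., (s_t, a_m)}
    return [(s_t, a) for a in actions]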

It is noticeable that most existing methods mentioned above (Houthooft et al., 2016; Cheung et al., 2020; Wang and Li, 2020) lack explicit expert guidance, and the selection of samples is based on the principle of optimism, posterior, or information, rather than on the internal environmental hierarchy, which is more valuable to the subsequent action. Unlike the evaluation process of the target-Q stage, where the Q-value function is used for evaluation, we model the expected return of the state–action pair under the current policy toward maximizing cumulative rewards. Then, the Q-function approximator is updated by experience algorithms such as HER (Andrychowicz et al., 2017), PER (Schaul et al., 2016), and AER (Sun et al., 2020). One advantage of this is that some rarely visited but valuable states in the replay buffer do not affect the overall performance of the current policy. However, the target-Q stage requires an enormous replay buffer containing millions of samples for evaluation to guide the current policy. With a sample transition denoted as (st, at, rt, st+1), the target Q-value is formalized as follows:
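Qtarget(st, at) = rt + γ E_{a~πφ(·|st+1)} [ Qθ-(st+1, a) − α log πφ(a | st+1) ]  (assuming the standard SAC target with entropy temperature α).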

    3.3 Algorithm details

As illustrated in Algorithm 1, a clustering buffer is added with expert demonstration to use the advantages of the Q-value function and alleviate sample complexity. We employ clustering methods to model the internal hierarchy of the environment as an evaluation criterion for the state–action pairs in the candidate set. Let DQ represent the replay buffer that updates the Q-value function in SAC. Typically, DQ contains all experience and is sampled uniformly.

Algorithm 1 Selective sampling
Require: on-policy data Don, off-policy data Doff, expert data DE
Ensure: learned policy πφ(a|s)
1: Initialize network parameters θ, θ-, φ
2: for each environment step do
3:     Ds ← a ~ πφ(·|st)
4:     Dc ← (Don, Doff, DE)
5:     Nc ← AgglomerativeClustering(Dc)
6:     C ← K-means(Dc, Nc, Ds)
7:     for each c in C do
8:         Vη(c) ← (1/nh) Σ_{k=i}^{i+nh-1} η(sk, ak)
9:     end for
10:    aselected ← Random(Softmax_{c∈C} Vη(c))
11:    st+1 ~ T(·|st, aselected)
12:    DQ ← DQ ∪ {st, aselected, rt, st+1}
13:    for each gradient step do
14:        θ ← θ - λQ ∇θ Bθ(Q)
15:        φ ← φ - λπ ∇φ Jφ(π)
16:        θ- ← λtarget θ + (1 - λtarget) θ-
17:    end for
18: end for
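For readers who prefer code, the exploration-stage core of Algorithm 1 (lines 3-10) can be sketched as follows. This is a minimal illustration using scikit-learn; the data layout, the helper name select_action, the distance threshold, and the use of mean reward as the assessment function η are assumptions rather than the exact released implementation.

import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

def select_action(candidates, D_on, D_off, D_E):
    """One exploration step of selective sampling.

    candidates: array of shape (m, d) with the state-action pairs of the candidate set Ds
    D_on, D_off, D_E: tuples (features, rewards) holding state-action features and rewards
    """
    feats = np.concatenate([D_on[0], D_off[0], D_E[0]])      # clustering buffer Dc
    rewards = np.concatenate([D_on[1], D_off[1], D_E[1]])    # assessment values η(s, a)

    # Step 1: agglomerative (Ward) clustering to obtain the number of clusters Nc
    agg = AgglomerativeClustering(n_clusters=None, distance_threshold=50.0,
                                  linkage="ward")            # illustrative threshold
    n_clusters = agg.fit(feats).n_clusters_

    # Step 2: K-means with Nc clusters models the internal hierarchy of Dc
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(feats)

    # Step 3: value of each cluster = mean assessment of its members
    cluster_values = np.array([
        rewards[km.labels_ == c].mean() if np.any(km.labels_ == c) else rewards.min()
        for c in range(n_clusters)
    ])

    # Step 4: assign each candidate to a cluster and score it by that cluster's value
    cand_values = cluster_values[km.predict(candidates)]

    # Step 5: softmax selection over candidate values
    probs = np.exp(cand_values - cand_values.max())
    probs /= probs.sum()
    return candidates[np.random.choice(len(candidates), p=probs)]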

With an explicit reward function, it would be easy to evaluate each sample based on the confidence C(a) = r(st, a) of each action in the collected candidate set, followed by selecting the action with the highest confidence. However, in most situations, there is no explicit reward signal for different actions in the current state without interacting with the environment, and this type of situation can be regarded as the problem of implicit reward. Therefore, one feasible way to solve this issue is to use clustering methods as a similarity measure to quantify how good an action in the candidate set is. We call this the implicit value evaluation operation. To quantify this goodness, we introduce a clustering buffer Dc, consisting of state–action pairs from on-policy data, off-policy data, and expert data, denoted as Don, Doff, and DE, respectively.

These three kinds of data have their own roles to play in Dc. First, an appropriate proportion of expert data allows the policy network to be guided when sampling from the action candidate set, accelerating convergence and improving network performance. Second, the on-policy data are helpful in improving the accuracy of value evaluation because they ensure unbiasedness for the current policy, and compensate for possible biases introduced during training. Using the on-policy data can help the policy converge to the optimal space faster while using the expert data for directed exploration. Third, the off-policy data can improve sample efficiency. The cooperation of the above three kinds of data enables selective sampling to achieve good results.

Clustering methods are used to discover the internal structure of the data itself. Compared with neural-network-based supervised classification methods, they require no labels, are simple to implement, and are easy to use. Among the latest clustering methods, the more representative ones are deep clustering methods that combine representation learning with unsupervised fine-tuning, for example, deep embedded clustering (DEC) (Xie et al., 2016). Huang et al. (2023) proposed ProPos, which combines the advantages of contrastive and non-contrastive representation learning to deal with the class collision issue and the collapse of clustering in current deep clustering methods. Niu et al. (2022) proposed SPICE, a semantic pseudo-labeling-based image-clustering framework that divides the clustering network into a feature model for measuring instance-level similarity and a clustering head for identifying cluster-level discrepancy. Also, some approaches try to eliminate the multi-stage module and use a single network for deep clustering. For example, Niu et al. (2020) proposed GATCluster, a self-supervised Gaussian attention network that eliminates the extraction of intermediate features followed by traditional clustering algorithms and instead outputs semantic clustering labels directly without further post-processing.

Selective sampling improves the sampling efficiency of RL by (1) selectively sampling the actions used for exploration and (2) introducing expert demonstrations. To achieve this, we introduce clustering methods to model the replay buffer consisting of three kinds of data as an easy-to-use action value assessment criterion. Considering implementation difficulty and the feasibility of model reuse, we use agglomerative clustering (Murtagh and Legendre, 2014) and K-means instead of a deep clustering approach. Agglomerative clustering is a clustering method that constructs the data into a hierarchical structure by gradually merging clusters. Compared with other clustering algorithms, which assign data to a fixed number of clusters, agglomerative clustering creates a hierarchy between clusters. Also, it does not require a prior determination of the number of clusters, and the results can be visualized and analyzed with a tree diagram, as shown in Fig. 1. In the beginning, we regard each state–action pair in the clustering buffer as a cluster ci to model the internal hierarchy of the environment by minimizing the change in variance, where ‖·‖1 denotes the L1 norm and ‖·‖F is the Frobenius norm:

As a result, we acquire the number of clusters Nc on the clustering buffer. Then we employ the K-means algorithm to classify the clustering buffer into the cluster set C. During the evaluation of each cluster, we use the assessment function η of the state–action pair to evaluate the cluster. Let h ∈ C represent one cluster of the cluster set containing nh transitions: (si, ai, ri, si+1), ..., (si+nh-1, ai+nh-1, ri+nh-1, si+nh). Then the cluster value is Vη(h) = (1/nh) Σ_{k=i}^{i+nh-1} η(sk, ak), as in line 8 of Algorithm 1.

Here, the assessment function η, as a reward, provides confidence in state–action pairs, such as the reward function, Q-value function, or target-Q value.

Similarly, we employ the K-means model on the candidate set Ds and perform the implicit value evaluation operation for each of these clusters. Specifically, we use the model to classify each (si, ai) ∈ Ds and evaluate the value of the corresponding (si, ai) based on the mean reward of its cluster in Dc. After that, we use the value of each (si, ai) to evaluate the value of the clusters in Ds. Finally, we use softmax selection, taking the value of each cluster in Ds as the selection probability. This soft selection allows the clustering buffer to be used more fully for those samples whose value function is underestimated but that actually have a higher value, described as
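P(c) = exp(Vη(c)) / Σ_{c'} exp(Vη(c'))  (the usual softmax form, consistent with line 10 of Algorithm 1),

where the sum runs over all clusters of Ds and the selected action is then drawn at random from the chosen cluster.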

The selected action contains some similarities to the expert policy and then affects the subsequent state–action distribution of the policy. Consequently, the replay buffer experience used to recover the Q-value function would be more promising.

    4 Experiments

    4.1 Mujoco

Experiments were performed on five challenging continuous control tasks of the benchmark suite Mujoco (Brockman et al., 2016) to evaluate the performance of selective sampling. Compared with the original SAC algorithm, under different environments and configurations, we focused mainly on the sample complexity and convergence performance in five Mujoco environments, including Hopper-v3, Ant-v3, HalfCheetah-v3, Reacher-v2, and Walker2d-v3. In this subsection, we will first introduce our experimental environments and implementation. After that, we will present the results and discussion.

    4.1.1 Environments and implementation

The goal of the Reacher-v2 task is to move the robot's end effector close to a target spawned at a random position. Moreover, the four other locomotion tasks aim to move forward as quickly as possible. In these environments, the state represents position, velocity, and acceleration, while the action space encapsulates joint forces or torques to control the system. The dimensions of these states and actions are listed in Table 1.

    Table 1 Environmental dimensions

Each experiment with a specified configuration was initialized with 10 random seeds to guarantee stability, and the test episode reward during training was averaged for evaluation. Beginning with random exploration, the learning process was launched after 50 000 steps. Then, every four steps, we performed a gradient update for the Q-network and policy network with policy delay, where the Adam optimizer was initialized with a 3×10⁻⁴ learning rate, and a batch size of 512 was used to perform the update. Meanwhile, to ensure the stability of the clustering buffer, the clustering buffer was updated every 1000 steps. In the Hopper-v3, Ant-v3, and HalfCheetah-v3 environments, we conducted ablation experiments by adjusting the ratio of on-policy, off-policy, and expert data composing the clustering buffer. Specifically, the on-policy degree was 2048, which measures how many of the latest interaction samples lie in the on-policy buffer. With the size of the clustering buffer fixed at 4096, the 112-configuration in our method means that 1024 on-policy state–action pairs were sampled from the on-policy buffer (of size 2048), 1024 off-policy state–action pairs were sampled from the replay buffer, and 2048 expert state–action pairs were sampled from the expert buffer. In addition, the data in the expert buffer were generated by running an expert policy trained with TRPO (Schulman et al., 2015), in place of human demonstration. Similarly, we performed a 211-configuration of selective sampling in our experiments.
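As an illustration of the 112-configuration described above, the clustering buffer could be assembled roughly as follows (a sketch only; the buffer variable names are illustrative and the buffers are assumed to be NumPy arrays with at least as many entries as requested).

import numpy as np

def build_clustering_buffer(on_policy_buf, replay_buf, expert_buf,
                            ratio=(1, 1, 2), total_size=4096):
    """Assemble the clustering buffer Dc from the three data sources.

    For the 112-configuration with total_size=4096 this draws
    1024 on-policy, 1024 off-policy, and 2048 expert state-action pairs.
    """
    unit = total_size // sum(ratio)
    sizes = [r * unit for r in ratio]
    parts = []
    for buf, n in zip((on_policy_buf, replay_buf, expert_buf), sizes):
        idx = np.random.choice(len(buf), size=n, replace=False)
        parts.append(buf[idx])
    return np.concatenate(parts, axis=0)

# 112-configuration: D_c = build_clustering_buffer(on_buf, replay_buf, expert_buf, ratio=(1, 1, 2))
# 211-configuration: D_c = build_clustering_buffer(on_buf, replay_buf, expert_buf, ratio=(2, 1, 1))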

    4.1.2 Results and discussion

Fig. 2 shows that our method learned rapidly at first and achieved better performance and higher sample efficiency. In addition, as shown in Fig. 2c, the ablation study demonstrates the importance of introducing expert data into the clustering buffer, because the model with added expert data converged to better performance. In particular, when only off-policy clustering was used as the evaluation criterion, the ablation setting struggled like the SAC algorithm. The above results suggest that our method can exploit valuable supervision, in the form of an internal hierarchy, from the clustering buffer in the exploration stage.

Based on the results of the ablation experiments with the 112-configuration and the 211-configuration in different environments, we found that in the HalfCheetah-v3 environment, the 112-configuration worked better than the 211-configuration, but in the Ant-v3 and Hopper-v3 environments, the results were reversed. This phenomenon occurs because in different environments, the agent's skeletal structure is different, resulting in different proportional derivative (PD) control of the agent's motion. Specifically, in the Hopper-v3 environment, the agent's skeletal structure is the simplest. In the Ant-v3 environment, the agent has a quadrupedal structure, which allows it to remain stable at the beginning and also makes it easier for the agent to move forward to obtain the reward. In contrast, in the HalfCheetah-v3 environment, the agent has a bipedal structure, and its front and rear foot structures are different. At the same time, its rear foot has the highest degree of freedom among all the agents above, making it the most difficult to walk stably. As we discussed earlier, expert data allow the policy network to be guided when selecting actions from the action candidate set. When the structure of the agent makes it more difficult to obtain rewards, the reliance on expert guidance is to some extent higher. This explains why the conclusion as to which configuration achieves the best performance differs between environments.

Fig. 2 Training curves in three environments of continuous control tasks: (a) Hopper-v3; (b) Ant-v3; (c) HalfCheetah-v3. For these plots, the x axis denotes the number of interaction steps, which reflects the sample complexity. The solid curves correspond to the mean, and the shaded region to the range of one standard deviation from the mean episode reward over 10 random seeds. We average the value taken from each point itself with the value taken from the next point as the final value for the current point. The SAC+on-policy setting refers to constructing the clustering buffer using only the on-policy data, while the SAC+off-policy setting and the SAC+expert setting are defined similarly. The numbers in selective sampling (112, 211) indicate the proportions of on-policy, off-policy, and expert data in the clustering buffer, respectively

As shown in Figs. 2a and 2b, in the Hopper-v3 and Ant-v3 environments, the 211-configuration outperformed the other settings. Similarly, in the ablation study of the HalfCheetah-v3 environment, as shown in Fig. 2c, the 112-configuration outperformed the on-policy and expert data setups. Meanwhile, the 211-configuration, although slightly worse at the beginning of training, was able to reach the same level as the first three at the end of training. In other words, the setup containing all three kinds of data has the potential to outperform all other setups while not falling far behind in the slightly worse cases. Therefore, we believe that a setup with three kinds of data is more general. This illustrates that better performance requires the interplay of all three kinds of data, as discussed above. Note that the 112-configuration and the 211-configuration performed significantly better than the baseline in the Hopper-v3 and HalfCheetah-v3 environments.

To give more insight into how each of the on-policy data, off-policy data, and expert data influences the performance of selective sampling, we conducted supplementary ablation experiments in the HalfCheetah-v3 environment, each omitting one of the three kinds of data, as shown in Fig. 3a.

We can see that the performance of the 112-configuration was still optimal, and that the model performance decreased after removing any of the three kinds of data. At the early stage of training, the 112-configuration, the SAC+on-policy+expert setting, and the SAC+off-policy+expert setting converged at similar rates, faster than the other settings between 0 and 2×10⁵ steps, because they had the same amount of expert data. This indicates that more expert data can guide the model to converge faster in the early stage. However, the convergence speeds of the SAC+on-policy+expert setting and the SAC+off-policy+expert setting decreased in the subsequent training phase. This illustrates that in the absence of one kind of data, the advantage brought by expert data in the early stage of training may not be maintained. In addition, although the SAC+on-policy+off-policy setting achieved performance similar to that of the 112-configuration in the middle part of training, it converged more slowly than the 112-configuration in the early stage and had worse final performance than both the 112-configuration and the 211-configuration. This experiment again illustrates that better performance requires interplay among all three kinds of data.

Fig. 3 Training curves in three environments of continuous control tasks: (a) HalfCheetah-v3; (b) Reacher-v2; (c) Walker2d-v3. For these plots, the x axis denotes the number of interaction steps, which reflects the sample complexity. The solid curves correspond to the mean and the shaded region to the range of one standard deviation from the mean episode reward over 10 random seeds. The SAC+on-policy+off-policy setting refers to constructing the clustering buffer using the on-policy and off-policy data. The SAC+off-policy+expert setting and the SAC+on-policy+expert setting are defined similarly

    To further demonstrate the effectiveness of selective sampling, we conducted experiments with the 112-configuration and 211-configuration in two additional Mujoco environments, Reacher-v2 and Walker2d-v3.

As shown in Fig. 3b, in the Reacher-v2 environment, both the 112-configuration and the 211-configuration converged faster initially and eventually reached a better level. The lower difficulty of this task led to similar results for both configurations. The reason for the different starting heights is that the first point was drawn at the 20 000th step in our implementation. As shown in Fig. 3c, in the Walker2d-v3 environment, both the 112-configuration and the 211-configuration achieved faster initial convergence than the baseline, and the 112-configuration had the best final performance. The above results show that selective sampling can improve model performance in these two environments, making the previous conclusions more convincing.

To show more details of the clustering buffer, we provide the trend of the ratio of expert data in the clustering buffer Dc over time in the three Mujoco environments, as shown in Fig. 4.

The percentage of expert data was high at the beginning in all three environments. This is because the rewards of expert data were much larger than those of the other two kinds of data in the early stage, and the positions of expert data were closer to each other in the feature space. As a result, when we selected the top five clusters with the highest mean reward, we naturally selected several clusters mainly composed of expert data. In addition, in the 112-configuration, the percentage of expert data in the clustering buffer was twice as large as that in the 211-configuration. We can also see that this advantage was maintained until the convergence stage in all three environments, indicating that our settings for the percentage of expert data in the clustering buffer were consistently effective. This advantage was more beneficial for a task like HalfCheetah-v3, which depends significantly on expert demonstration, and explains why it performed better with the 112-configuration.

To summarize, the results demonstrate the importance of introducing expert data into the clustering buffer, and our method can exploit valuable supervision from it. The amount of expert data is crucial, but it also needs to be analyzed for the specific environment. When using selective sampling, our suggestion is to use all three kinds of data, which is more general. In particular, in environments with a high dependency on expert demonstration, we can consider increasing the proportion of expert data.

There are four main linkage criteria in agglomerative clustering. The ward setting minimizes the variance of the clusters being merged; it has been used in the previous experiments. The average setting uses the average distance between observations of the two sets. The complete setting uses the maximum distance between all observations of the two sets, and the single setting uses the minimum distance. To explore which linkage criterion is the best, we conducted supplementary ablation experiments in the HalfCheetah-v3 environment, with one training session for each linkage criterion setting and 1×10⁶ steps per training session. Table 2 shows the test episode reward that the model can obtain as training proceeds using different linkage criteria.

    Table 2 Ablation study of linkage criteria

The ward setting had the highest initial convergence speed, significantly ahead of the other settings. The average setting was second, and the single and complete settings were the lowest. However, the average setting caught up to a level similar to the ward setting at around 3×10⁵ steps and performed better than the ward setting after 5×10⁵ steps.

In selective sampling, a different linkage criterion causes a different number of clusters for the K-means method, which affects the final structure of the clustering buffer. The following conclusions can be drawn: among the four linkage criteria, the ward setting achieves the best balance between the initial speed and the later performance. The average setting sacrifices part of the initial speed but can obtain better performance in the later period. The performances of the single and complete settings are similar to that of the original SAC algorithm. According to the application requirements, implementations of selective sampling can choose between the ward setting and the average setting.

    4.2 LGSVL

Autonomous driving is an essential application of RL, often involving a significant amount of data (Zhang et al., 2023b), which makes sample efficiency critical. To further evaluate the performance of selective sampling, we performed full experiments on an end-to-end autonomous driving task on the LGSVL (SVL Simulator by LG) simulation platform (Rong et al., 2020). LGSVL is an open-source simulation platform developed by LG for autonomous driving testing. Its Python application programming interface (API) enables us to easily adjust maps, vehicles, sunlight, and weather to obtain large-scale, diverse, and customized autonomous driving testing scenarios. In our experiments, we used the Python API to set up the experimental task, providing the environment abstraction needed for selective sampling. The version of LGSVL used here was svlsimulator-linux64-2021.2.2.

In this subsection, we will first introduce the specific task details and the experimental environments, including the reward function settings, which are an important part of the task setup. After that, we will present the experimental implementation. At the end of this subsection, we will show the results and discuss the ablation experiments.

    4.2.1 Task details

Current implementations of autonomous driving systems follow two main technical routes: modular and end-to-end autonomous driving systems. A modular autonomous driving system is artificially divided into three major modules: perception, planning, and control. In contrast, an end-to-end autonomous driving system uses a single model as a direct solver from sensor input to controller output. Considering that the focus of this paper is on RL with high sampling efficiency rather than the coordinated cooperation of different algorithms among different modules, we set up an end-to-end autonomous driving task.

We set up the experiments in the "Shalun" map provided by the LGSVL simulation platform, a digital twin of a realistic closed test site for autonomous vehicles built by Taiwan CAR Lab in Guiren District, Tainan City, China. As shown in Fig. 5a, this autonomous vehicle test site was designed specifically for Asian road conditions and contains rich map elements: straight lanes, curved lanes, intersections, traffic signals, circular lanes, tunnels, railroad level crossings, etc. The section of road shown in Fig. 5b was selected as the task scope for our experiments: a composite road consisting of a straight segment and a curved segment.

Here we call the vehicle equipped with our autonomous driving system the Ego Car. The specific task in this subsection is to train an autonomous driving system that can compute the control signals the Ego Car should take at the current moment to perform safe, legal, start-to-finish driving, using the signals from the front camera mounted on the Ego Car. To make the whole task more difficult, the initial starting point of the Ego Car (yellow "Start" in Fig. 5b) was placed in the right lane of a straight section of the road, while the endpoint (red "End" in Fig. 5b) was placed in the left lane of a curved section of the road. In other words, the autonomous driving system needed to drive the Ego Car along the lane in a straight line at the beginning, change to the left lane at the appropriate time, and drive along the curve of the lane near the endpoint. No collision with obstacles and no illegal driving behavior were permitted. Finally, the Ego Car had to arrive at the endpoint successfully. Overall, this composite autonomous driving task contained several sub-tasks with a certain degree of difficulty and training significance.

Fig. 5 Details of the LGSVL experiment: (a) full-size bird's-eye view of Shalun; (b) experimental range chosen for our experiments, where the yellow point represents the starting point of the Ego Car and the red point represents the end point of the Ego Car; (c) front camera view of the Ego Car; (d) overhead view of the Ego Car. References to color refer to the online version of this figure

    4.2.2 Environments

In the experiments, we still used the OpenAI Gym (Brockman et al., 2016) RL environment abstraction format, including state, action, and reward functions.

1. State

We used the image from the front-view camera mounted on the front of the Ego Car as the state representation of the environment. Specifically, at any moment, it is a 1920×1080 image in JPEG format, divided into three RGB channels, with each pixel in each channel taking values in the range [0, 255]. For ease of computation and to compress the state dimension, we compressed this original image to a 128×128×3 matrix representation.
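A minimal sketch of this preprocessing step is given below, using OpenCV; the final scaling to [0, 1] is an implementation choice assumed here, not stated in the text.

import cv2
import numpy as np

def preprocess_frame(jpeg_bytes):
    """Decode the 1920x1080 front-camera JPEG and compress it to a 128x128x3 matrix."""
    frame = cv2.imdecode(np.frombuffer(jpeg_bytes, dtype=np.uint8), cv2.IMREAD_COLOR)
    frame = cv2.resize(frame, (128, 128), interpolation=cv2.INTER_AREA)
    return frame.astype(np.float32) / 255.0   # optional scaling from [0, 255] to [0, 1] (assumed)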

2. Action

We defined the output action of the policy network in the RL model as a 1×2 vector, where the two dimensions correspond to the steering wheel and the pedal operations of the Ego Car in the LGSVL simulation platform. First, we implemented the steering operation with the first dimension, which takes values in the range [-1, 1]. The value -1 represents the maximum degree of leftward steering wheel rotation, the value 0 represents keeping the steering wheel straight, and the value 1 represents the maximum degree of rightward steering wheel rotation. The remaining values apply a corresponding degree of leftward or rightward steering wheel rotation according to the absolute value of the variable. Second, we implemented the throttle pedal and the brake pedal using the second dimension, which also takes values in the range [-1, 1]. If this dimension takes values in (0, 1], the brake pedal application degree is set to 0, and the throttle pedal application degree is set according to the variable; specifically, 1 represents the maximum application degree of the throttle pedal, and a value closer to 0 represents a smaller application degree. Similarly, if this dimension is in [-1, 0), the application degree of the throttle pedal is set to 0, and the application degree of the brake pedal is set according to the absolute value of the variable; specifically, -1 represents the maximum application degree of the brake pedal, and a value closer to 0 represents a smaller application degree.
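The mapping from the 2-D policy output to the control signals can be sketched as follows (an illustrative helper; the simulator-side control object itself is not shown here).

def action_to_controls(action):
    """Map a policy output in [-1, 1]^2 to steering, throttle, and brake commands.

    action[0]: steering, -1 = full left, 0 = straight, 1 = full right
    action[1]: pedal,    (0, 1] = throttle only, [-1, 0) = brake only
    """
    steering = float(action[0])          # passed through directly
    pedal = float(action[1])
    throttle = max(pedal, 0.0)           # positive values drive the throttle
    brake = max(-pedal, 0.0)             # negative values drive the brake
    return steering, throttle, brake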

3. Reward function

The reward function is usually set based on prior human knowledge about safe driving, such as avoiding collisions, not violating traffic rules, and driving to the specified target. An appropriate reward function setting can significantly improve the convergence speed of the RL network and the final model performance. The reward function in the experiments was divided into four parts: risk-driving behavior penalties, bad-driving behavior penalties, moving-forward rewards, and goal rewards. The specific implementation is summarized in Table 3.

We gave a one-time penalty of -30 and terminated the current epoch when we detected a collision with other elements of the map (located on the right side of the Ego Car) or driving across the double yellow lines into the opposite lane (located on the left side of the Ego Car). In addition, we calculated the vertical projection distance (in meters) between the current Ego Car position and the road centerline at each step, multiplied it by a factor of -1.4, and counted it in the reward function. We also calculated the distance (in meters) traveled by the Ego Car in this step, multiplied it by 2.4, and counted it in the reward function. In the goal detection part, we added a well-designed reward calculation method, which was divided into four parts.

    Table 3 Reward function settings

First, we designed the reward credit assignment within 5 m near the endpoint, described as

where Xgoal represents the coordinates of the endpoint and Xego represents the coordinates of the Ego Car. The left part of the denominator calculates the Euclidean distance between the two coordinates at this step. A visualization of the reward is shown in Fig. 6, where the reward obtained by the Ego Car increases faster as it gets closer to the endpoint.

    Fig.6 Reward credit assignment

Second, we judged whether the distance was decreasing or increasing, multiplied the reward by the corresponding factor of 1.0 or -0.5, and counted it in the reward function. Without this, the Ego Car could receive a huge cumulative reward by repeatedly passing through the area between 2 m and 5 m from the endpoint, which is not what we expect. Third, we counted the number of steps for which the credit assignment reward was obtained and terminated the current epoch after reaching the limit of 10. Without this, the Ego Car could stop 2 m away from the endpoint to accumulate a huge reward, which is not what we expect. Fourth, we judged that the goal was reached when the Ego Car was within 2 m of the endpoint and terminated the current epoch.
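Putting these pieces together, the per-step reward can be sketched as below. The coefficients follow the description above; the near-goal credit-assignment term is abstracted into a goal_bonus helper whose exact shape is an assumption, and the info dictionary is an assumed wrapper around the simulator state.

def goal_bonus(distance, shrinking):
    """Illustrative stand-in for the near-goal credit assignment of Fig. 6:
    the bonus grows faster as the Ego Car approaches the endpoint, scaled by
    1.0 when the distance is shrinking and by -0.5 when it is growing."""
    base = 1.0 / (distance + 0.5)        # assumed shape only
    return base * (1.0 if shrinking else -0.5)

def step_reward(info):
    """Per-step reward assembled from the four parts described above."""
    # Risk-driving penalties: collision or crossing the double yellow lines
    if info["collision"] or info["crossed_double_yellow"]:
        return -30.0, True               # one-time penalty, terminate the epoch

    reward = 0.0
    reward += -1.4 * info["centerline_offset"]   # bad-driving penalty (m from centerline)
    reward += 2.4 * info["step_distance"]        # moving-forward reward (m traveled this step)

    # Goal rewards: credit assignment within 5 m of the endpoint,
    # limited to 10 rewarded steps, and termination within 2 m
    d = info["distance_to_goal"]
    done = False
    if d <= 5.0 and info["steps_near_goal"] < 10:
        reward += goal_bonus(d, info["distance_shrinking"])
    if d <= 2.0 or info["steps_near_goal"] >= 10:
        done = True
    return reward, done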

    4.2.3 Implementation

In this part, each experiment with a specified configuration was initialized with 10 random seeds. The initial random exploration lasted 2048 steps. The Adam optimizer was initialized with a 3×10⁻⁴ learning rate, and a batch size of 256 was used to perform gradient updates for the Q-network and policy network. The clustering buffer was updated every 200 steps. We also conducted ablation experiments by adjusting the ratio of on-policy, off-policy, and expert data. Specifically, the on-policy degree was 1200, which measures how many of the latest interaction samples were in the on-policy buffer. With the size of the clustering buffer fixed at 2048, the 112-configuration in our method means that 512 on-policy state–action pairs were sampled from the on-policy buffer, 512 off-policy state–action pairs were sampled from the replay buffer, and 1024 expert state–action pairs were sampled from the expert buffer. Similarly, we performed the 116-configuration, 134-configuration, and 314-configuration of selective sampling in our experiments. In addition, the expert buffer data were generated by running an expert policy trained with SAC, in place of human demonstration.

    4.2.4 Results and discussion

Fig. 7a shows that our approach provided better guidance signals for the policy network, allowing the model to converge faster and achieve better performance and higher sample efficiency than the baseline. A higher convergence value means that the Ego Car will have a lower speed before reaching the goal, so it can obtain the nearing-goal reward for more steps, which is exactly what we expect.

Fig. 7 Training curves of three different configurations: (a) baseline and 112; (b) baseline, 112, and 116; (c) baseline, 112, 134, and 314. For these plots, the x axis denotes the number of interaction steps, which reflects the sample complexity. The solid curves correspond to the mean, and the shaded region corresponds to the range of one standard deviation from the mean episode reward over 10 random seeds. The numbers in selective sampling indicate the proportions of on-policy, off-policy, and expert data in the clustering buffer, respectively

In addition, in the 116-configuration experiments shown in Fig. 7b, we found that settings too biased toward expert data introduced greater volatility. This is because too much expert data crowds out the sampling of on-policy and off-policy data during training, and the large difference between the current policy and the expert policy can cause instability.

According to Fig. 7c, the 314-configuration outperformed the 134-configuration. This is because the autonomous driving task is much more complex in terms of the environment state compared to the previous experiments. Therefore, it relies more on the on-policy data, which can eliminate the bias.

The number of steps required to reach the same level (a test episode reward ≥450) was reduced from 3×10⁴ to 1.6×10⁴, a 46.7% reduction. Considering the additional computation caused by our method, we also calculated the actual time cost of both approaches. In terms of training time, the time required to reach the same level was reduced by 28.5% (from an average of 10 768.25 s to 7701.4 s). This result shows that our approach remains practical for application scenarios that require high algorithm generalizability, such as autonomous driving.

    5 Conclusions

We propose to separate on-policy, off-policy, and expert demonstration data into a clustering buffer and then use the selective sampling model and the internal environment hierarchy as an evaluation criterion. Our experimental results demonstrate that sample complexity is improved, yielding faster convergence and better prediction performance on continuous locomotion tasks, which makes deploying RL algorithms in new environments more practical. In future studies, we will investigate a more suitable structure for retaining the internal hierarchy of clustering buffers and develop a more advanced and computationally efficient method to select the next action.

    Contributors

Shihmin WANG, Binqi ZHAO, Zhengfeng ZHANG, and Junping ZHANG designed the research. Shihmin WANG and Binqi ZHAO processed the data. Shihmin WANG drafted the paper. Junping ZHANG and Jian PU helped organize the paper. Shihmin WANG, Junping ZHANG, and Jian PU revised and finalized the paper.

    Compliance with ethics guidelines

Junping ZHANG and Jian PU are an editorial board member and a corresponding expert of Frontiers of Information Technology & Electronic Engineering, respectively, and they were not involved with the peer review process of this paper. All authors declare that they have no conflict of interest.

    Data availability

The code is available at https://github.com/Shihwin/SelectiveSampling. The other data that support the findings of this study are available from the corresponding author upon reasonable request.
