
    Target‐driven visual navigation in indoor scenes using reinforcement learning and imitation learning

2022-05-28 15:17:02

Qiang Fang | Xin Xu | Xitong Wang | Yujun Zeng

Department of Intelligence Science and Technology, College of Intelligence Science and Technology, National University of Defense Technology, Changsha, China

Abstract Here, we focus on the challenges of sample efficiency and navigation performance in deep reinforcement learning for visual navigation and propose a deep imitation reinforcement learning approach. Our contributions are mainly threefold. First, a framework combining imitation learning with deep reinforcement learning is presented, which enables a robot to learn a stable navigation policy faster in the target-driven navigation task. Second, the surrounding images are taken as the observation instead of sequential images, which improves navigation performance by providing more information. Moreover, a simple yet efficient template matching method is adopted to determine the stop action, making the system more practical. Simulation experiments in the AI2-THOR environment show that the proposed approach outperforms previous end-to-end deep reinforcement learning approaches, which demonstrates the effectiveness and efficiency of our approach.

    1|INTRODUCTION

Autonomous navigation plays an important role in the field of robotics. In recent decades, due to the benefits of low cost, light weight and rich information, visual navigation has attracted considerable attention, and many studies have been presented. In general, most traditional methods for visual navigation are rule-based methods [1-5] that locate the robot's position accurately for the planning and control modules. However, these rule-based methods are relatively complex, involving considerable manual design work and computational resources. Considering human navigation, there is no need to calculate the exact position; we only need to determine where to go according to the scenes we have seen. Inspired by such behaviour, there has been increasing interest in end-to-end visual navigation approaches that map images directly to actions. Deep learning (DL) can extract useful features directly from pixels and has made great progress in many fields, such as image classification [6], object detection [7] and image segmentation [8]. Reinforcement learning (RL) is similar to the learning processes of humans, where policies are optimized through continuous trial and error and feedback from the environment to achieve a higher level of intelligence [9, 10]. Moreover, deep reinforcement learning (DRL), which combines DL with RL, has achieved exciting performance on problems with continuous high-dimensional state inputs [11], so using DRL to solve end-to-end visual navigation is a natural choice. Zhu [12] proposed a target-driven DRL (TD-DRL) architecture to solve end-to-end visual navigation problems and achieved good navigation performance; however, there are some limitations. Firstly, four sequential images are considered as the observation; in fact, the robot has access to the surrounding observations, which might provide more information. Secondly, the stop action is not considered. Although it can be automatically determined in the simulation environment by judging whether the ID number of the current state is that of the target state, such an approach is not suitable in reality, where there are no ID numbers but only images. Last but most importantly, DRL requires a large number of samples to learn a good navigation policy, which leads to a long training time and low training efficiency. To improve sample efficiency during training, imitation learning (IL) is an alternative that can imitate expert experience accurately and requires less training time to obtain a navigation model. However, using IL alone might result in overfitting compared to DRL.

Inspired by the limitations of TD-DRL and the advantages of IL, here, we focus on the challenges of sample efficiency and navigation performance in RL for the visual navigation task and propose a deep imitation reinforcement learning approach (DIRL), which combines IL with DRL. The main contributions are summarized as follows:

· A framework combining IL with RL is presented, which enables a robot to learn a stable navigation policy faster in the target-driven navigation task.

· For the observation, the surrounding images are considered instead of sequential images, which can improve navigation performance and accelerate the learning process by providing more information. Additionally, a simple template matching method is adopted to determine the stop action, which makes the system more practical.

· Simulation experiments are conducted, which demonstrate the effectiveness of our approach compared with state-of-the-art approaches.

The remainder of the paper is organized as follows: Section 2 introduces the related work, and Section 3 describes our proposed approach. Experiments in the AI2-THOR environment are evaluated and the results are discussed in Section 4. Finally, a conclusion is drawn in Section 5.

    2|RELATED WORK

Robot visual navigation methods can be categorized into two classes: rule-based and learning-based. Here, we focus on the latter, so we will introduce the related work on learning-based navigation, especially the IL and DRL methods used in visual navigation.

IL is the process of learning a behaviour policy from a set of demonstrations; a survey of IL can be found in [13]. Some IL methods treat the task as a special case of supervised learning, that is, behaviour cloning (BC). NVIDIA [14] proposed a typical BC-based end-to-end visual navigation method used in unmanned driving, which collects a large number of expert samples from three cameras. Based on IL, Yang [15] proposed a multi-task architecture that can simultaneously predict the vehicle's speed and steering angles using image sequences. To drive a vehicle, Wang [16] proposed a novel angle-branch network, where the inputs include sequential images, the vehicle speed and the desired angle of the subgoal. Codevilla [17] proposed a conditional-imitation-learning method based on high-level input, which might partly address the generalization ability of models trained using IL. To address the limitation that most IL methods need a specifically calibrated actuation setup for the training dataset, Xu [18] proposed a novel framework (FCN-LSTM) to learn a driving policy from uncalibrated large-scale videos. Although these imitation-learning-based methods are useful for visual navigation, systems trained with IL alone are prone to overfitting.

RL has recently been applied in robot navigation tasks due to its ability to interact with the environment, especially DRL. One typical DRL method is the deep Q-learning algorithm (DQN) proposed by DeepMind, which successfully enabled agents to learn control policies at the human level in Atari games [19]. Subsequently, many algorithms improving the DQN network model were presented and achieved impressive results in different fields [20-23]. Other methods based on the policy gradient include the deep deterministic policy gradient (DDPG) [24] and A3C [25]. Based on these DRL methods, DeepMind proposed a dual-agent architecture [26] that trains different long short-term memories (LSTMs) through RL to achieve outdoor visual navigation and improve the mobility of the model. In [27], the authors aimed to learn a deep successor representation and proposed a feedforward architecture, which is flexible to changing rewards and was tested in the MazeBase gridworld and 3D VizDoom. The AutoRL method [28], combining DRL with gradient-free hyperparameter optimization, was also proposed to solve navigation tasks. However, in many cases, the reward function required by RL is difficult to design, and its optimality cannot be guaranteed. Zhu [12] proposed a target-driven DRL (TD-DRL) architecture to solve end-to-end visual navigation problems and achieved good navigation performance. Although DRL can be used in visual navigation, the authors in [29] summarized nine challenges in RL; one of them is low sample efficiency, meaning that large numbers of samples are needed for training, which is not suitable for real-world systems and limits its application in many tasks. Approaches combining DRL with IL have been proposed for many other tasks, such as grasping [30], MuJoCo control [31, 32] and traffic control [33]; however, such methods utilize sequential trajectory data for IL, while here we propose an approach combining DRL with IL to address sample efficiency for visual navigation tasks.

    3|PROPOSED APPROACH

    3.1|Markov decision process

The process of end-to-end visual navigation can be modelled as a Markov decision process (MDP) [9], where the agent takes the current observed images and the goal as input and obtains a new state and the corresponding reward from the environment by performing an action. The MDP can be expressed as <sk, ak, sk+1, rk+1>, where sk represents the state at step k, ak is the corresponding action, and rk+1 is the reward. The main goal is to find an optimal policy π* that obtains the maximum return during the navigation process:
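(The displayed equation is missing in this version; a standard discounted-return formulation consistent with the MDP notation above, with discount factor γ ∈ (0, 1], would be:)

$$\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{k+1}\right]$$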

The next state depends only on the current state and on the action taken by the agent; in other words, once the system knows its present state, the future is conditionally independent of the past.

    3.2|Actor‐critic algorithm

To deal with MDPs with large (continuous) state or action spaces, we adopt the actor-critic algorithm [9] for the end-to-end visual navigation task due to its ability to learn a policy and a state-value function simultaneously. The state-value function Vπ(s) is the expected return and can be formulated as:
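(The equation is missing here; the standard definition of the state-value function under policy π, consistent with the notation above, is:)

$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{i=0}^{\infty} \gamma^{i}\, r_{k+i+1} \,\middle|\, s_{k} = s\right]$$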

Since the optimal state value V*(s) corresponds to the optimal policy π*, the function V*(s) can be further defined as:
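(The equation is missing here; the standard relation is:)

$$V^{*}(s) = \max_{\pi} V^{\pi}(s)$$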

Note that in the actor-critic learning algorithm shown in Figure 1, a critic network is designed to learn the value functions of the MDP, and an actor network is used to learn near-optimal policies using the estimated value functions in the critic to perform policy search.

The current state value can be obtained from the value function network in experiments, and the values of actions can be used to evaluate the quality of actions as follows:
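(The equation is missing here; a standard action-value formulation consistent with the surrounding text is:)

$$Q(s_{k}, a_{k}) = \mathbb{E}\!\left[r_{k+1} + \gamma\, V(s_{k+1}) \mid s_{k}, a_{k}\right]$$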

Since the advantage function A(s, a) = Q(s, a) - V(s) eliminates the effect of the current state value, it reflects the relative values of actions better. Some states are inherently better than others; if the quality of each action is judged only by raw action values, every action taken in a good state receives a high value, leading to exaggerated evaluations in which all actions appear good. However, after the state value is subtracted, the advantage function lets us evaluate the quality of actions more objectively: the advantage of an action is positive if the action is better than the state's average value and negative if it is worse. The n-step update method for the advantage function is defined as:
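(The equation is missing here; the standard n-step advantage estimate is:)

$$A(s_{k}, a_{k}) = \sum_{i=0}^{n-1} \gamma^{i}\, r_{k+i+1} + \gamma^{n}\, V(s_{k+n}) - V(s_{k})$$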

The multi-step update method makes changes in the advantage function propagate faster. If the agent earns an occasional reward, the value function would move backwards by only one step per iteration under a one-step update, whereas each iteration propagates back n steps with the n-step update, which is faster. The estimated V(s) of the algorithm should converge. The errors can then be calculated by the following equation:
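(The equation is missing here; the n-step TD error consistent with the preceding description is:)

$$\delta_{k} = \sum_{i=0}^{n-1} \gamma^{i}\, r_{k+i+1} + \gamma^{n}\, V(s_{k+n}; \theta_{v}) - V(s_{k}; \theta_{v})$$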

According to the policy gradient method, the policy loss function is:
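(The equation is missing here; the standard A3C policy loss with entropy regularization, consistent with the where-clause that follows, is:)

$$f_{\pi}(\theta) = \log \pi(a_{k} \mid s_{k}; \theta)\, A(s_{k}, a_{k}) + \beta\, H\big(\pi(\cdot \mid s_{k}; \theta)\big)$$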

    FIGURE 1 Actor-critic algorithm

where H represents the entropy, and the parameter β adjusts the strength of the entropy regularization term. The value loss function can be calculated as follows:
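(The equation is missing here; the standard value loss, the squared n-step TD error, is:)

$$f_{v}(\theta_{v}) = \frac{1}{2}\left(\sum_{i=0}^{n-1} \gamma^{i}\, r_{k+i+1} + \gamma^{n}\, V(s_{k+n}; \theta_{v}) - V(s_{k}; \theta_{v})\right)^{2}$$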

    The update formula of the policy network is defined as:

The parameters θv of the value function network can be updated by calculating the square of the TD error:

    3.3|Imitation learning

Here, we first collect expert data (s, a) to obtain navigation experience [13]. The state s is the input to the expert network, and the action a is the expected output. Note that the output of the expert network is a four-dimensional vector obtained by the softmax regression function. Assuming that the original outputs are y1, y2, …, yn, the softmax regression is as follows:
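(The equation is missing here; the standard softmax regression over the raw outputs y1, …, yn is:)

$$p_{i} = \frac{e^{y_{i}}}{\sum_{j=1}^{n} e^{y_{j}}}, \qquad i = 1, \ldots, n$$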

The cross entropy is used as the loss function for the neural network. The cross entropy reflects the difference between two sample distributions: if the difference between the distributions is large, the cross entropy will also be large, and vice versa. The cross entropy H(p, q) can be expressed as follows:
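(The equation is missing here; the standard cross entropy between distributions p and q is:)

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$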

where q(x) is the actual output and p(x) is the desired output; the cross entropy measures the deviation between them. After obtaining the loss function, we use the Adam optimizer to train the network. Adam is a specialized gradient descent algorithm that uses the computed gradient, its statistics and its historical values to take small steps in the opposite direction in the parameter space so as to minimize the loss function.

    3.4|Template matching

Template matching is a general approach for determining the similarity of two feature vectors (or matrices), especially image features. Since the features of the observation and target images are both extracted by the same networks, we use the normalized cross-correlation (NCC) [34] method to match the two inputs and thereby determine whether the robot has arrived at the goal location. The method is formulated as follows:

where N is the dimension of the feature vectors, si and gi represent the ith feature values of the observation and target, respectively, and VNCC is the value of the NCC, which lies in [-1, 1]; the higher the value, the more similar the two inputs, and vice versa.
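As a minimal sketch, the NCC-based stop check described above can be implemented as follows (the function names and the 0.9 threshold, which follows the matching coefficient in Section 4.2, are our own illustration):

```python
import numpy as np

def ncc(s, g):
    """Normalized cross-correlation of two feature vectors; the result lies in [-1, 1]."""
    s = np.asarray(s, dtype=float)
    g = np.asarray(g, dtype=float)
    # V = sum_i(s_i * g_i) / sqrt(sum_i(s_i^2) * sum_i(g_i^2))
    return float(np.dot(s, g) / np.sqrt(np.dot(s, s) * np.dot(g, g)))

def is_done(obs_feat, goal_feat, threshold=0.9):
    """Declare the done action when the observation feature closely matches the target feature."""
    return ncc(obs_feat, goal_feat) >= threshold
```

Because the check is a closed-form computation over already-extracted features, it runs online at every step and needs no training.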

    3.5|Deep imitation reinforcement learning

Our purpose is to find the best actions to move a robot from its initial location to a goal using only an RGB camera. We focus on learning a good visual navigation policy function π via DRL, and the action at at time t can be formulated as [9]:
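(The equation is missing here; consistent with the where-clause that follows, the target-conditioned policy is:)

$$a_{t} = \pi(s_{t}, g \mid \theta)$$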

where θ is the system parameter, st is the current observation, and g is the observation of the target.

Our network, shown in Figure 2, is inspired by the TD-DRL method proposed in [12] and consists of two parts. The first part is the generic Siamese layers, a pre-trained ResNet-50 whose outputs are embedded into a 512-d space. The second part is the scene-specific layers that capture the common representations of the same scene. The output of the network contains four policy outputs and a single value output. However, there are some improvements, which are marked in red. First, the initial parameters of the network are copied from the training results of the IL. Second, since the robot can capture the four surrounding views in AI2-THOR, which is also possible in the real world for a low-speed robot or with a panoramic camera, we adopt the surrounding images to represent the current observation of the robot instead of four sequential images. Since both use four images, the computational complexities are the same. Moreover, we add a template matching module, applied just after feature extraction, to determine whether to stop. We attempted to add the done action directly into the policy output, but during training we found it is not a good choice. The reason might be that the done action is too sparse, occurring only at the end of each episode.

The full deep imitation reinforcement learning (DIRL) algorithm is summarized in Algorithm 1. The initial weights come from IL. Here, we use the simple BC method, where most of the network is the same as the DRL network shown in Figure 2. The difference is that the policy outputs are known and labelled by experts. Moreover, the parameters related to the value output are frozen and not trained during imitation training. Note that the done action is determined by an additional template matching process, which can be calculated online using Equation (13), so there is no need to train it.

FIGURE 2 Network architecture of our DIRL model. The generic Siamese layers in black squares are shared by all the targets, and the feature extraction layers (ResNet-50) are pre-trained on ImageNet and frozen during training. Four surrounding images are adopted as inputs instead of four sequential images, and the initial parameters of the scene-specific layers come from imitation learning. Moreover, a simple yet efficient template matching method is added to determine whether the task is finished. DIRL, deep imitation reinforcement learning

    4|EXPERIMENTS AND RESULTS

We evaluate our approach on visual navigation tasks based on the AI2-THOR [12] environment. The main purpose of visual navigation here is to find the minimum number of steps from a random initial location to the goal location. We first train and evaluate the navigation results using our model. Then, an additional experiment is performed to investigate the ability of our approach to transfer knowledge to new targets. All of the models are implemented in TensorFlow and trained on an Nvidia GeForce RTX 2080Ti GPU.

    4.1|The AI2‐THOR framework

To efficiently train and evaluate our model, we use The House Of InteRactions (AI2-THOR) framework proposed in [19], which is a good representation of the real world in terms of the physics of the environment. We use four scenes of the simulated environment [1]: bathroom 02, bedroom 04, kitchen 02 and living room 08. Note that the number of indoor scenes is scalable in our model. Figure 3 illustrates some sample images from different scenes, and each scene space is represented as a grid world with a constant step size of 0.5 m. To be more specific, the following information is listed:

State space. In our experiments, the input includes two parts, S = <F, T>, where F represents the first-person views of the agent, with an image resolution of 224 × 224 × 3, and T represents the goal. We use four surrounding images to represent the observation. Each image is embedded as a 2048-d feature vector using a ResNet-50 pretrained on ImageNet.

FIGURE 3 AI2-THOR environment. There are four common scene types: kitchen, living room, bathroom and bedroom

Action space. We consider five discrete actions: ahead, backwards, left, right and done. The robot takes 0.5 m as a fixed step length, and its turning angle is 90° in the environment. Note that the former four actions are produced by the proposed network, while the done action, which is the termination action, is determined by the template matching method.

    4.2|Implementation details

For training, navigation is performed on 20 goals randomly chosen from four indoor navigation environments in our data set. All of the learning models are trained within 10 million frames, and it takes about 1.5 h to pass through one million training frames across all threads. During training, an episode ends either when the robot arrives at the goal location or after it takes more than 5000 steps. For evaluation, we perform 4000 different episodes (1000 for each scene). In the matching process, the matching coefficient is set to 0.9. We select 5% of the total states as the expert data for IL training. The expert data are collected and stored automatically based on the shortest length to the target point and are selected at a fixed interval across the IDs in each scene. The reward is set as 10.0 once the robot reaches the goal and, to encourage shorter trajectories and collision avoidance, immediate rewards of -0.01 (per step) and -0.1 (on collision) are given.

    4.3|Evaluation metrics

The performance of our approach is evaluated using three metrics: the average trajectory length (ATL), success rate (SR) and success weighted by path length (SPL). All of the metrics reflect navigation efficiency. ATL represents the average number of steps per episode that it takes a robot to reach a goal. SR is defined as SR = (1/N) Σi Si, and SPL is defined as SPL = (1/N) Σi Si · Li / max(Pi, Li), where N is the number of episodes, Si is a binary indicator of success in episode i, Pi denotes the evaluated step length and Li is the shortest length of the trajectory.
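The SR and SPL metrics can be sketched in a few lines (a minimal illustration; the variable names are our own):

```python
def success_rate(S):
    # S: list of 0/1 success indicators over N episodes
    return sum(S) / len(S)

def spl(S, P, L):
    # S: success indicators, P: actual path lengths, L: shortest path lengths
    # Each successful episode contributes L_i / max(P_i, L_i), so longer-than-optimal
    # paths are penalized while failures contribute zero.
    return sum(s * l / max(p, l) for s, p, l in zip(S, P, L)) / len(S)
```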

    4.4|Baselines

Here, five baselines are chosen for comparison, as follows:

1. Shortest path. The shortest path from the initial location to the goal location. Here, it is considered the ideal result.

2. Random action. The action is randomly sampled from a uniform distribution at each step. Here, the robot randomly selects the action from the four movement actions during navigation.

3. IL. The output of the policy is directly determined using the networks trained with expert data.

4. TD-DRL. The original target-driven method using DRL proposed in [12]. All targets use and update the same Siamese parameters but different scene-specific parameters.

5. A3C. An asynchronous advantage actor-critic approach. It has been demonstrated that the more threads the system uses, the higher the data efficiency during training. Here, we use four threads for training, and the observation is the same as in TD-DRL.

    4.5|Results

In this subsection, we focus on three results: the sample efficiency during training, the navigation performance of our approach on the trained targets and the generalization to new targets. Additionally, the baselines are evaluated for comparison.

    4.5.1|Sample efficiency

One of the main purposes of our proposed approach is to improve the sample efficiency during training. To analyse the sample efficiency in the training process, for all learning models, the training performance is measured by the ATL over all targets, and the performance is reported after training within 10 million frames (across all threads).

Figure 4 shows the sample efficiency during training for the different approaches: DIRL, TD-DRL and A3C. Note that the shortest path, random action and IL baselines are not considered here because these approaches can be directly evaluated in the testing process, so they are used in the following navigation comparison. All of the approaches are trained for 10 million frames, and the curves in the figure represent the ATL over all training targets versus the number of training frames. It can be seen that our approach takes only approximately 1.5 million frames (2.25 h) to achieve a stable navigation policy. Although the ATL also decreases after some training frames in the other two approaches, it costs nearly 9.0 million frames and 6.1 million frames to train A3C and TD-DRL, respectively. The reason that TD-DRL performs better than A3C is the scene-specific layers used in TD-DRL, as demonstrated in [12]. Therefore, the results indicate that the performance of our approach is the best and demonstrate its efficient learning.

    4.5.2|Navigation performance on trained targets

To analyse the navigation performance of our approach, we first analyse the performance on the trained targets. We compare our method with five baselines: shortest path, random action, IL, A3C and TD-DRL. Three evaluation metrics (ATL, SR and SPL) are used to reflect the navigation performance. For each target, we randomly select the initial location of the robot and evaluate 500 episodes. An episode ends either when the robot arrives at the target or after it takes more than 200 steps. We focus on the influence of the number of training frames, so two models are used, whose parameters are trained for 5 and 10 million frames. The navigation results are given in Tables 1 and 2.

FIGURE 4 Comparison of sample efficiency with the baselines: DIRL (blue), TD-DRL (red) and A3C (black). All of the approaches are trained for 10 million frames, and the ATL over all targets is adopted to reflect the performance. DIRL needs about 1.5 million frames to learn a stable navigation policy, whereas it costs nearly 6.1 million frames and 9.0 million frames to train TD-DRL and A3C, respectively, which indicates that our approach learns a good navigation policy fastest. ATL, average trajectory length; DIRL, deep imitation reinforcement learning; TD-DRL, target-driven deep reinforcement learning

    TABLE 1 Quantitative results on trained targets after 5 million training frames

Table 1 shows the results on trained targets after 5 million training frames. Our approach performs best in terms of the ATL, SR and SPL metrics. For ATL, the fewer steps taken, the better the performance. The shortest path is approximately 12.66 steps, and our method takes approximately 13.62 steps, which is the closest to the ideal result; however, TD-DRL, A3C and IL take more navigation steps, approximately 3.7 times, 13 times and 10.7 times ours, respectively. The SR is 1 using our approach, meaning the system can always arrive at the goal within 200 steps, because after 5 million training frames our model has learned a good navigation policy, which can also be seen in Figure 4. For the other methods, the SR is very low due to imperfectly learned policies, and the same holds for the SPL performance. These results indicate that our approach learns good performance faster.

We further analyse the navigation performance on trained targets after 10 million training frames, by which point both TD-DRL and A3C have learned good navigation policies. Table 2 shows the comparison results. We find that although the SR is 1 for TD-DRL, A3C and our approach, our DIRL approach performs best in SPL, with approximately a 6% improvement over TD-DRL and roughly twice the SPL of A3C; the reason might be the different observations. The results further indicate the efficiency of our approach.

    TABLE 2 Quantitative results on trained targets after 10 million training frames

    4.5.3|Generalization across new targets

To analyse the generalization ability for navigation, we evaluate our model on new targets for the visual navigation task. Note that these new targets are not trained in advance, but they might share the same routes as the trained targets. During testing, we randomly select eight new targets from a series of locations that have a fixed distance (one, two, four or eight steps) from the nearest trained targets. For each new target, we randomly select the initial location of the robot and evaluate 500 episodes. An episode ends either when the robot arrives at the target or after it takes more than 500 steps. We compare our method with A3C and TD-DRL and use the SR metric to analyse the navigation performance, as shown in Figure 5. We find that all the approaches have the ability to generalize to new targets, but our approach performs best under different conditions. This indicates that our learned model has a stronger ability to understand the surrounding regions of the trained targets than the other approaches.

We can also intuitively analyse what the systems have learned from the navigation policies of the different approaches in Figure 6, which shows the explicit control state at three different steps. Note that the initial location and the goal are set the same for all of the approaches during testing. We take visual navigation in the bathroom as an example. It takes just 55 steps to arrive at the goal location using our approach; however, it takes 151 steps and 432 steps with TD-DRL and A3C, respectively. Moreover, we can further investigate the policies of the different approaches. In TD-DRL, the robot rotates frequently, shown in three consecutive steps (t = 18, 19, 20). In A3C, the robot frequently moves in the wrong direction, also shown in three consecutive steps (t = 46, 47, 48). It can be seen that the control policy learned by our approach is good, which further indicates the generalizability of the system learned by our proposed approach.

    4.5.4|Ablation study

In this subsection, we perform ablations on our approach from different perspectives to obtain further insight into the results.

For the done action, we introduce an additional module using the template matching method to determine whether the task is finished; in TD-DRL and A3C, by contrast, the done action is automatically determined by the environment. This is not a problem in the simulation environment: we find that in simulation the matching coefficient always equals 1 once the robot arrives at the goal location, because the actions are discrete and the scenes are stationary in AI2-THOR. However, in real-world navigation tasks, the matching coefficient falls below 1 due to image noise or dynamic environments; moreover, the navigation system itself should determine the done action. Therefore, our approach is more practical.

FIGURE 5 Generalization across new targets. Each histogram group represents the SR of navigation to new targets at a fixed number of steps from the trained targets. It indicates that our learned model has a stronger ability to understand the surrounding regions of the trained targets than the other approaches. SR, success rate

For the observation, Figure 7 shows the ablation results using two observations: sequential (red) and surrounding (black). Note that the former is the method proposed in TD-DRL, where four sequential images are considered as the input, and the latter is our proposed approach using four surrounding images. We again compare the sample efficiency; both approaches are trained for 10 million frames, and the ATL over all targets is adopted to reflect the performance. The former needs nearly 6.1 million frames to learn a stable navigation policy, but our approach costs only approximately 2.3 million frames, which indicates that our proposed approach learns faster.

FIGURE 6 Visualization of the control processes. It takes just 55 steps to arrive at the goal location using our approach; however, it takes 151 steps and 432 steps with TD-DRL and A3C, respectively. Moreover, the robot frequently rotates around in TD-DRL and moves in the wrong direction in A3C. It can be seen that the control policy learned by our approach is good, which further indicates the generalization ability of the system learned by our proposed approach. TD-DRL, target-driven deep reinforcement learning

FIGURE 7 Observation ablation. Sample efficiency using different observations: surrounding observation (black) and sequential observation (red). We compare the sample efficiency; both approaches are trained for 10 million frames, and the ATL over all targets is adopted to reflect the performance. The former needs about 2.3 million frames to learn a stable navigation policy, whereas the latter costs nearly 6.1 million frames, which indicates that the observation affects the navigation performance and our proposed approach learns faster. ATL, average trajectory length

FIGURE 8 Ablation of the number of expert samples. We report the performance for different amounts of expert data: none, 5%, 10% and 50% of the total states. It can be seen that the sample efficiency improves with an increasing amount of expert data for IL, requiring about 2.3 million frames, 1.5 million frames, 0.7 million frames and 0.2 million frames, respectively, to learn a stable navigation policy. This further indicates that it is useful to combine DRL with IL to deal with visual navigation tasks. DRL, deep reinforcement learning; IL, imitation learning

Intuitively, the more expert samples we select from the state distribution, the better the performance we might achieve during training. We report the performance for different amounts of expert data: none, 5%, 10% and 50% of the total states. Figure 8 shows the ablation results. It can be seen that the sample efficiency improves with an increasing amount of expert data for IL, requiring approximately 2.3 million frames, 1.5 million frames, 0.7 million frames and 0.2 million frames, respectively, to learn a stable navigation policy. The different ATLs at the beginning are caused by the random initial locations and the different navigation policies, but in the end, all approaches achieve nearly the same ATL. The results further indicate that it is useful to combine DRL with IL to deal with visual navigation tasks.

    5|CONCLUSIONS AND FUTURE WORK

    Here, we proposed an approach for indoor visual navigation combining IL and DRL. The proposed navigation system improved not only the data efficiency during the training process but also the navigation performance. Unlike traditional visual navigation methods, which require calculating the robot's position accurately, our approach is an end-to-end method, so it might provide an alternative method for visual navigation. Moreover, simulation results demonstrate that the proposed DIRL approach is general and efficient compared with previous DRL methods, such as TD-DRL and A3C. In general, the proposed framework can be adapted to other DRL networks, such as DDPG and PPO. Our future work includes applying meta-learning within our framework to improve generalization to new scenes and evaluating our model in real-world navigation environments.

    ACKNOWLEDGEMENT

    The authors would like to thank Yichuan Zhang and Pinghai Gao for their helpful discussion and feedback. This study was supported by the National Natural Science Foundation of China (Grant Nos. 61703418 and 61825305).

    ORCID

    Qiang Fang https://orcid.org/0000-0002-5063-6889
