
      A Recurrent Attention and Interaction Model for Pedestrian Trajectory Prediction

      2020-09-02 03:58:42
      IEEE/CAA Journal of Automatica Sinica, 2020, Issue 5

      Xuesong Li, Yating Liu, Kunfeng Wang, Senior Member, IEEE, and Fei-Yue Wang, Fellow, IEEE

      Abstract—The movement of pedestrians involves temporal continuity, spatial interactivity, and random diversity. As a result, pedestrian trajectory prediction is rather challenging. Most existing trajectory prediction methods tend to focus on just one aspect of these challenges, ignoring the temporal information of the trajectory and making too many assumptions. In this paper, we propose a recurrent attention and interaction (RAI) model to predict pedestrian trajectories. The RAI model consists of a temporal attention module, a spatial pooling module, and a randomness modeling module. The temporal attention module assigns different weights to the input sequence of a target and reduces the speed deviation of different pedestrians. The spatial pooling module models not only the social information of neighbors in historical frames, but also the intention of neighbors at the current time. The randomness modeling module models the uncertainty and diversity of trajectories by introducing random noise. We conduct extensive experiments on several public datasets. The results demonstrate that our method outperforms many state-of-the-art methods.

      I. Introduction

      PEDESTRIAN trajectory prediction is the task of predicting the future trajectory of a pedestrian from his/her positions over a past period of time, and it is usually treated as a sequence generation task. Pedestrian trajectory prediction is crucial for path planning of autonomous devices [1], [2]. For example, in an autonomous driving scenario, an autonomous vehicle needs to accurately predict the motion trajectories of pedestrians from their positions in order to make its next decision. Moreover, research on pedestrian trajectory prediction yields models of pedestrian behavior, which can be used for crowd evacuation [3], abnormal target detection [4] and other specific tasks. The predicted object can also be a vehicle, an animal, or another target, but most research has been performed with pedestrians, perhaps because pedestrian trajectories are more difficult to predict and the predictions have more uses. Therefore, it is necessary to conduct in-depth research on pedestrian trajectory prediction.

      Pedestrian trajectory prediction has attracted much attention, and many researchers have proposed methods for it. Although pedestrian movement is full of randomness, it has a certain regularity. Generally, pedestrian trajectory prediction methods fall into model-driven methods and data-driven methods. Model-driven methods usually predict external behavior according to underlying principles, while data-driven methods mainly model internal correlation through statistical analysis of data.

      In early research on pedestrian trajectory prediction, many works focused on model-driven methods, which typically include the social force model [5] and the hidden Markov model [6]. The social force method predicts the behavior of pedestrians according to attraction and repulsion forces: attraction forces draw a pedestrian towards a target, and repulsion forces prevent collisions between pedestrians. The hidden Markov method predicts the trajectory of pedestrians in terms of spatio-temporal probability. Nevertheless, these methods are too sensitive to parameters and are unable to describe the diverse social behavior of pedestrians, e.g., people walking in groups. Even worse, the representation ability of these methods is not strong.

      In recent years, with the development of deep learning, data-driven methods have become a research hotspot. Such approaches usually treat pedestrian trajectory prediction as a time-series prediction problem that takes into account the interaction of pedestrians. Some recent works have used recurrent neural networks (RNNs) to solve this problem, such as social long short-term memory (S-LSTM) [7] and group LSTM [8]. The S-LSTM model presents a social pooling module that divides space into a grid to capture the interactive information of adjacent pedestrians. The group LSTM is an improved version of S-LSTM; it uses motion coherence to cluster trajectories with similar movement trends and to group pedestrians. However, the assumption made by S-LSTM and group LSTM, that the effect on an individual is determined by location alone, is not sufficient. Some works have used generative adversarial networks (GAN) for model training, such as social GAN (SGAN) [9], SoPhie [10] and social ways [11]. The SGAN model uses a new pooling strategy to describe the influences among neighbors and is the first to predict trajectories using a GAN. The SoPhie model uses the path history of all the agents and the scene context information to predict the future path. The social ways model uses InfoGAN to produce samples and integrates a few hand-crafted interaction features. Although they have achieved competitive performance, they do not model temporal information, and their training processes are cumbersome, which makes the results hard to reproduce. Some works add auxiliary information to help with trajectory prediction, such as the multi-task learning model named Next [12], scene-LSTM [13] and trajectory prediction with attribute and environment (TAE) [14]. The Next model is an end-to-end multi-task learning system that jointly predicts future paths and activities of pedestrians. The scene-LSTM model incorporates scene information and individual pedestrian movement to predict the trajectory. The TAE model completes path predictions by using object attributes and semantic environments. However, these methods need additional auxiliary information, which is difficult to obtain and learn.

      Pedestrian motion is inherently multimodal and complex, so pedestrian trajectory prediction is a challenging task. As described in Fig.1, first, pedestrians are very flexible and subjective in the decision-making process, so the trajectory prediction model should predict multiple reasonable trajectories to cover the possibilities; second, a pedestrian is often not independent when making decisions, and the trajectories of multiple pedestrians influence each other, so we also need to model the pedestrian interaction pattern; last but not least, the speed of the same target differs over time, and the speeds of different targets also differ, so it is necessary to capture trajectory information in the temporal domain.

      Fig.1. Illustration of pedestrian trajectory prediction.

      In order to solve the above problems, we present a recurrent attention and interaction (RAI) model to model the spatio-temporal information of pedestrian motion. The RAI model includes a temporal attention module, a spatial pooling module and a randomness modeling module. Considering the impact of different historical locations of pedestrians and the speed bias of different pedestrians, the temporal attention module acts on the input sequence of the target in the temporal domain. Considering the trajectory interaction among different pedestrians, the spatial pooling module models not only the social information of neighbors in the historical frames, but also the intention of neighbors in the current frame. To model the multimodality and flexibility of the prediction process, we also add a randomness modeling module to the RAI model and adopt the variety loss of SGAN [9]. Our method outperforms many state-of-the-art models on the publicly available ETH [15] and UCY [16] datasets.

      The main contributions in this paper are three-fold.

      1) A new trajectory prediction method is proposed by combining a temporal attention module, a spatial pooling module and a randomness modeling module. Compared with previous methods, the proposed method can sufficiently mine spatio-temporal information to model attention mechanisms, interactions and multimodality of pedestrian motion.

      2) A temporal attention module is introduced to assign different weights to the input sequence of the target. It can focus on the importance of different moments and weaken the speed deviation of different pedestrians.

      3) A spatial pooling module is presented to model the social behaviors of neighbors. It is used to model interactions among different targets and improve the accuracy and robustness of our proposed model.

      The remainder of this paper is arranged as follows. In Section II, we review related works on pedestrian trajectory prediction. In Section III, we describe the details of our RAI model and its various modules. In Section IV, we conduct an experimental evaluation of our RAI model on multiple datasets and analyze our method through an ablation study. Finally, a conclusion is drawn in Section V.

      II. Related Works

      Pedestrian trajectory prediction methods usually include model-driven and data-driven methods. Data-driven methods often lead to stronger feature representation and robustness than model-driven methods. Our work focuses on data-driven methods, and the related works are discussed as follows.

      A. RNNs for Trajectory Prediction

      The recurrent neural network (RNN) and its variants have been widely used in sequence prediction tasks, such as speech recognition [17], traffic flow forecasting [18] and stock prediction [19]. As described in Fig.2, an RNN calculates the state $h_t$ of the current moment based on the state $h_{t-1}$ of the previous moment and the current input $X_t$, so each state $h_t$ contains all useful information before it. Long short-term memory (LSTM) [20] and the gated recurrent unit (GRU) [21] were proposed to solve the problems of gradient vanishing and gradient explosion in the backward propagation of traditional RNN training. In the task of trajectory prediction, there are many methods based on LSTM due to its high efficiency [7], [9]. State refinement for LSTM (SR-LSTM) [22] proposes a data-driven state refinement module for the LSTM network. Crowd interaction with deep neural network (CIDNN) [23] proposes an LSTM-based deep neural network for crowd interaction. In order to facilitate comparison with other methods, we also design a trajectory prediction framework based on LSTM.
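
      The recurrence described above, in which the current state is computed from the previous state and the current input, can be sketched as follows. This is a minimal illustration of a vanilla RNN step with a tanh activation, not the LSTM used in the paper; the weight names `W_h`, `W_x` and `b` are assumptions introduced for the example.

```python
import math

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One vanilla RNN update: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    h_t = []
    for i in range(len(b)):
        z = b[i]
        z += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        z += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        h_t.append(math.tanh(z))
    return h_t

def run_rnn(xs, W_h, W_x, b):
    """Fold a sequence of inputs into the list of hidden states h_1..h_T."""
    h = [0.0] * len(b)  # initial state h_0 (assumed to be zero)
    states = []
    for x_t in xs:
        h = rnn_step(h, x_t, W_h, W_x, b)
        states.append(h)
    return states
```

      Because each state is fed back into the next step, the last state depends on every earlier input, which is why it can serve as a summary of the observed sequence.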

      Fig.2. Illustration of a recurrent neural network (RNN) and its variants.

      B. Attention Mechanism

      Attention models have been widely used in recent years for various types of tasks in image processing [24]–[26], speech recognition [27] and natural language processing [28], where their core goal is to select the key information for the current task from a large amount of information. Attention models have also improved performance in trajectory prediction [29], [30]. Vemula et al. [31] present an attention module that assigns weights to the hidden states of neighbors for each pedestrian. Fernando et al. [4] present a combined attention module that uses both “soft attention” and “hard-wired attention” to capture trajectory information from the neighbors. These works design attention models only in the spatial domain to weigh the effects of different neighbors. In contrast, we design an attention model in the temporal domain to weigh the positions of the target at different moments and align the time intervals of different targets.

      C. Human-Human Interactions

      The trajectory prediction task usually needs to consider interactions among different targets in a scene. The social force model [5] extracts human-human interaction features according to attraction and repulsion forces. A social pooling module [7] captures the interactive information of adjacent pedestrians. Furthermore, a multi-layer perceptron (MLP) network followed by max pooling [9] is employed to extract interactive features. StarNet [32] uses the Hub network to extract the interactive information of all targets, and each Host network independently outputs the prediction of its target’s trajectory. However, these methods do not make full use of the trajectory information in the spatio-temporal domain. In this paper, we design a novel interaction module, which considers not only the social behavior in the historical frames, but also the interactions of different objects at the current moment. In addition, we design a spatial pooling module to weigh the effects of different neighbors.

      III. Proposed Method

      As shown in Fig.1, the task of pedestrian trajectory prediction is to observe the locations of all targets from time 1 to time $T_{obs}$, and predict their locations from time $T_{obs}+1$ to time $T_{pred}$. In this section, we describe the RAI model to solve the problem of pedestrian trajectory prediction in crowded scenes. Our model considers not only the motion change in the temporal domain, but also the trajectory interaction of different targets in the spatial domain.
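
      The problem setup can be made concrete with a small sketch that splits one track into the observed part (time 1 to the last observed step) and the part to be predicted. The function name and the representation of a track as a list of (x, y) tuples are assumptions made for illustration.

```python
def split_trajectory(track, t_obs, t_pred):
    """Split one pedestrian track into observed and future segments.

    track  : list of (x, y) positions for time steps 1..t_pred
    t_obs  : last observed time step
    t_pred : last time step to predict
    """
    if len(track) < t_pred or not 0 < t_obs < t_pred:
        raise ValueError("track must cover steps 1..t_pred with 0 < t_obs < t_pred")
    observed = track[:t_obs]        # time steps 1 .. t_obs
    future = track[t_obs:t_pred]    # time steps t_obs+1 .. t_pred
    return observed, future
```

      For instance, calling `split_trajectory(track, 8, 20)` on a 20-step track returns 8 observed positions and 12 positions to predict; these step counts are illustrative and are not taken from the text above.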

      A. Overall Architecture of RAI Model

      As shown in Fig.3, we employ an encoder-decoder framework that integrates a temporal attention module, spatial pooling module and randomness modeling module. The temporal attention module is applied to capture the trajectory information of the target in the temporal domain. The spatial pooling module is applied to capture the effects of other nearby targets in the spatial domain. The randomness modeling module is proposed to model the multimodality and uncertainty of the trajectory.

      Fig.3. The framework of our proposed recurrent attention and interaction (RAI) model.

      We encode the trajectory information of each pedestrian through the encoder. However, the output of the encoder does not utilize a temporal attention mechanism, and cannot capture the interactive information of pedestrians. If the decoder takes the output of encoder as input, the network model is called vanilla LSTM (V-LSTM), which assumes all pedestrians are independent.

      To model temporal information and social behavior among pedestrians, the encoder and decoder are connected via the temporal attention module and spatial pooling module. The temporal attention module can not only weigh the trajectory information at different times, but also align different pedestrians in the temporal domain. The spatial pooling module is used to capture the interaction between different targets in the historical moments and the current frame.

      B. Temporal Attention Module

      In the task of pedestrian trajectory prediction, the number $k$ of historical observation time steps is difficult to determine, and this hyper-parameter is usually selected from experience. In addition, a pedestrian has different speeds at different instants of time, and different pedestrians move at different speeds. In order to solve these problems, we propose a temporal attention module that is inspired by the soft attention mechanism. The temporal attention module assigns different weights to the output of the encoder at each time step. It also extracts the trajectory information in the temporal domain to reduce the impacts of different time steps and temporal misalignment, thereby making the predictions more accurate.

      Fig.4 shows the network structure of the temporal attention module, where the $k$ feature vectors $h_j$ ($j = 1, 2, \ldots, k$) output by the encoder are the input to the temporal attention module. In order to focus on different parts of the temporal sequence with different degrees of attention, we assign a weight $a_j$ to each feature vector $h_j$. The weight $a_j$ is computed by

      $$a_j = \mathrm{softmax}(\mathrm{BN}(s_j))$$

      where BN refers to Batch Normalization [33]. The score $s_j$ is normalized by Batch Normalization and the softmax function to get the final weight $a_j$. And $s_j$ is an intermediate state, which is computed by

      We improve the accuracy and robustness of the model by using the attention mechanism.
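
      The weighting step described above, in which each score is normalized and passed through softmax to obtain one weight per encoder output, can be sketched as follows. Several parts are assumptions made only for illustration: the score is taken to be a dot product with a hypothetical vector `w`, Batch Normalization is replaced by a plain zero-mean, unit-variance normalization over the $k$ scores, and the weighted outputs are summed into a single context vector.

```python
import math

def normalize(scores, eps=1e-5):
    """Stand-in for Batch Normalization: zero mean, unit variance over the scores."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    return [(s - mean) / math.sqrt(var + eps) for s in scores]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention(hidden_states, w):
    """Weight k encoder outputs h_1..h_k; return (weights, context vector)."""
    # Hypothetical score s_j = w . h_j; the paper's own score equation is not used here.
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in hidden_states]
    weights = softmax(normalize(scores))
    dim = len(hidden_states[0])
    # Assumed aggregation: weighted sum of the encoder outputs.
    context = [sum(a * h[d] for a, h in zip(weights, hidden_states)) for d in range(dim)]
    return weights, context
```

      The weights sum to one, so time steps with larger scores contribute more to the context vector while no single step is discarded outright.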

      Fig.4. The network structure of the temporal attention module.

      Fig.6. The illustration of our RAI model for predicting pedestrian trajectory. Best viewed with zooming.

      The smaller the values of ADE and FDE, the smaller the distance between model prediction and ground truth, and the more accurate the trajectory prediction.
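
      For reference, the two metrics can be computed as in the sketch below. It follows the usual definitions, with ADE as the mean Euclidean distance over all predicted time steps and FDE as the Euclidean distance at the final predicted step; the function names and the list-of-tuples representation are assumptions.

```python
import math

def ade(pred, truth):
    """Average displacement error: mean Euclidean distance over all predicted steps."""
    dists = [math.hypot(p[0] - t[0], p[1] - t[1]) for p, t in zip(pred, truth)]
    return sum(dists) / len(dists)

def fde(pred, truth):
    """Final displacement error: Euclidean distance at the last predicted step."""
    p, t = pred[-1], truth[-1]
    return math.hypot(p[0] - t[0], p[1] - t[1])
```

      A prediction that drifts steadily away from the ground truth has an FDE larger than its ADE, since the final step carries the largest error.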

      B. Quantitative Evaluation

      We adopt the leave-one-out approach as used in S-LSTM [7] to train our model. For each of the five subsets, our model is trained and validated on the other four subsets, and tested on the remaining one. For the other methods used for comparison, we adopt the same strategy for fairness. Fig.6 shows the visualization of our RAI model for predicting pedestrian trajectories in five real-life scenarios. The first row shows the predicted results of our method, the second row is the ground truth of the trajectory, and the last row shows the scene map for each scene.

      As shown in Table I, we compare our method with state-of-the-art methods under the same settings. The baseline and state-of-the-art methods are described below.
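
      The leave-one-out protocol can be sketched as a loop over the five subsets, holding each one out for testing in turn. The subset names follow those used in the results discussion; the function name is an assumption, and how the four remaining subsets are divided between training and validation is not specified here.

```python
SUBSETS = ["ETH-univ", "ETH-hotel", "UCY-zara01", "UCY-zara02", "UCY-univ"]

def leave_one_out(subsets):
    """Yield (train_and_val, test) pairs, holding out each subset once."""
    for held_out in subsets:
        train_val = [s for s in subsets if s != held_out]
        yield train_val, held_out

splits = list(leave_one_out(SUBSETS))
```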

      1) Linear Model: We minimize the least square error to estimate linear parameters for extrapolating trajectories under the assumption of linear acceleration.

      2) V-LSTM: A simple setting of our model without the pooling mechanism and attention mechanism, which assumes that all targets are independent of each other. V-LSTM is used as our baseline model.

      3) S-LSTM: The method of Alahi et al. [7], which presents a novel social pooling module to model human-human interactions.

      4) Scene-LSTM: The method of Huynh et al. [13], which incorporates scene information and individual pedestrian movement to predict the trajectory.

      5) SGAN: The method of Gupta et al. [9], which presents a novel trajectory prediction framework with a generative adversarial network (GAN) and introduces a new pooling mechanism with a multi-layer perceptron (MLP).

      6) SoPhie: The method of Sadeghian et al. [10], which presents an interpretable network based on GAN that uses the trajectory history of all agents in the same scene together with the scene context information. It leverages both social and physical information.
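
      As an illustration of the linear baseline in item 1), the sketch below fits a straight line to each coordinate of the observed track by ordinary least squares and extends it into the future. Fitting x(t) and y(t) separately with a first-degree model is an assumption about how the baseline is set up, and the function names are hypothetical.

```python
def fit_line(ts, vs):
    """Ordinary least-squares fit v = a * t + b; returns (a, b)."""
    n = len(ts)
    t_mean = sum(ts) / n
    v_mean = sum(vs) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(ts, vs))
    a = sxy / sxx
    return a, v_mean - a * t_mean

def extrapolate(observed, n_future):
    """Fit x(t) and y(t) on the observed track (at least 2 points), then extend it."""
    ts = list(range(1, len(observed) + 1))
    ax, bx = fit_line(ts, [p[0] for p in observed])
    ay, by = fit_line(ts, [p[1] for p in observed])
    start = len(observed) + 1
    return [(ax * t + bx, ay * t + by) for t in range(start, start + n_future)]
```

      Such a model has no notion of neighbors, which is consistent with the observation that it does well mainly where pedestrians move independently along nearly straight paths.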

      Table I shows that our method outperforms previous methods on both metrics, ADE and FDE. Following SGAN [9], RAI-NV-N refers to a model trained with the variety loss with k=N and evaluated by taking N samples during the testing phase. When the variety loss is not adopted, our model RAI (1V-1) performs the best on the ETH-univ, UCY-zara01, UCY-zara02 and UCY-univ subsets, which shows that our network design is effective. However, we notice that in the ETH-hotel scenario, the linear model and the Scene-LSTM model perform better. Possible reasons are that pedestrians are relatively independent in this scene, that most trajectories can be linearized, and that the Scene-LSTM model uses additional scene information.
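
      The idea behind the variety loss, as introduced in SGAN [9], is to draw several candidate trajectories and keep only the error of the one closest to the ground truth. A minimal sketch follows; using the sum of per-step Euclidean distances as the error is an assumption made for simplicity, and the function names are hypothetical.

```python
import math

def l2_error(pred, truth):
    """Sum of Euclidean distances between predicted and true positions."""
    return sum(math.hypot(p[0] - t[0], p[1] - t[1]) for p, t in zip(pred, truth))

def variety_loss(samples, truth):
    """Return the smallest error among the sampled trajectories."""
    return min(l2_error(s, truth) for s in samples)
```

      Penalizing only the best sample leaves the remaining samples free to cover other plausible futures, which is why more samples tend to yield lower reported errors.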

      SoPhie (20V-20), SGAN (20V-20) and our RAI (20V-20) use the variety loss during the training phase and sample multiple times during the testing phase in order to model diversity and randomness. Thus, these models perform better than the others. Although SoPhie and SGAN use an adversarial training mechanism and model social information, our temporal attention module and spatial pooling module are much more effective. This is mainly because our model can extract trajectory information from both the temporal domain and the spatial domain, has a stronger ability of feature representation, and incorporates more prior knowledge of pedestrian trajectory prediction. The AVG metric is the average result across the five subsets, and it is clear that our method performs better with both single sampling and multiple samplings.

      TABLE I Performance Comparison of Various Trajectory Prediction Methods in Terms of ADE and FDE

      A comprehensive comparison of the results in Table I shows that the prediction accuracies of the neural network models are higher than that of the traditional linear model, because neural network models are more complex and are better at feature representation. Comparing the results of V-LSTM and S-LSTM shows that the model with the interactive module is better than the model with the independence assumption. It also seems that the adversarial training mechanism of SGAN contributes little by itself; the performance of SGAN is good mainly because of its variety loss and multiple samplings. Inspired by these methods, our RAI model improves the interaction module and introduces the variety loss and multiple samplings of SGAN. Meanwhile, due to the introduction of an attention mechanism, our model further improves the average ADE and average FDE.

      C. Ablation Study

      In order to further illustrate the effectiveness of the different components of our RAI model, including the temporal attention module, spatial pooling module and randomness modeling module, we conduct ablation experiments for each component. Our ablation experiments are also evaluated on the above five subsets, in terms of ADE and FDE. For each ablation experiment, we strictly keep all the hyper-parameters consistent.

      1) Ablation Study of Different Components: Table II shows the effects of our proposed components. “Attention”, “Pooling” and “Randomness” indicate whether we add these modules to the baseline model. “N” is the number of samples taken during the testing phase. ID 1 refers to the experiment of V-LSTM without the pooling module and attention mechanism. ID 2 refers to V-LSTM with the attention mechanism, which achieves an average improvement of ADE/FDE = 0.14/0.33 on the five subsets. ID 3 refers to V-LSTM with the pooling module, which achieves an average improvement of ADE/FDE = 0.12/0.28 on the five subsets. ID 4 refers to V-LSTM with both the attention mechanism and the pooling module, which results in further performance improvements. It can be seen from Table II that the attention module performs better than the pooling module, and that our RAI model combining the attention mechanism and pooling module performs better than with a single module.

      Table II also shows the effect of the randomness modeling module. When Gaussian noise is added to the baseline model to represent diversity and randomness, we can use the variety loss and generate multiple samples. In this ablation experiment, the dimension of the noise is 4, and we compare the performance of noise in different dimensions later. IDs 5-8 correspond to different numbers of samples, and the more samples are taken, the better the performance. ID 8 achieves an average performance of ADE/FDE = 0.49/1.02 on the five subsets, which shows that the variety loss greatly improves accuracy.
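
      One simple way to inject such noise, sketched below, is to append a Gaussian noise vector to a feature vector and repeat this once per sample. Concatenation as the injection point, the function names and the fixed seed are assumptions made for illustration; only the noise dimension of 4 comes from the text.

```python
import random

def add_noise(feature, noise_dim=4, rng=random):
    """Append a Gaussian noise vector z ~ N(0, 1) to a feature vector."""
    return list(feature) + [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]

def sample_inputs(feature, n_samples, noise_dim=4, seed=0):
    """Build n_samples decoder inputs that differ only in their noise part."""
    rng = random.Random(seed)
    return [add_noise(feature, noise_dim, rng) for _ in range(n_samples)]
```

      Each noisy input would then be decoded into its own trajectory, giving the N samples over which the best prediction is selected.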

      2) Comparison of Different Pooling Mechanisms: Table III shows the comparison of different pooling methods. “Social Pooling” is proposed by S-LSTM [7], and “Max Pooling” is proposed by SGAN [9]. Our spatial pooling performs better on ETH-univ, ETH-hotel and UCY-univ, indicating that our model has a strong ability of interactive feature modeling. Spatial pooling also achieves an average performance of ADE/FDE = 0.55/1.16 on the five subsets, indicating that our network design is effective. Compared to the social pooling and max pooling methods, our spatial pooling extracts the interactive information of the historical frames and the current moment, and assigns different weights to different neighbors, which gives it a stronger feature representation ability.

      3) Ablation Study of Random Noise: Table IV shows the comparison of the randomness modeling module with different noise dimensions. Although 16-dimensional noise achieves the best average performance of ADE/FDE = 0.48/1.00 on the five subsets, it outperforms the other settings only on the ETH-hotel subset. So changing the noise dimension has only a small impact on the performance of the model. The main reason may be that although increasing the noise dimension increases the randomness of the model, it decreases the proportion of useful trajectory information.

      D. Qualitative Evaluation

      Fig.7 shows the visualization results of the comparison between our RAI model and V-LSTM in four typical scenarios: people merging, person following, group avoiding and group walking. Given a scenario, there may be many reasonable predictions based on past trajectories. Compared to V-LSTM, which does not consider the temporal trajectory information and spatial interactive information, RAI gives a reasonable prediction by using the attention mechanism and pooling module. In Fig.7, the first row is the prediction of RAI, while the second row is the prediction of V-LSTM.

      TABLE II The Competing Results of Different Components

      TABLE III Comparison of Different Pooling Mechanisms

      TABLE IV Comparison of Randomness Modeling Module With Different Noise Dimensions

      Fig.7. Visualization results of comparison between our RAI model and V-LSTM in four typical scenarios. Best viewed with zooming.

      The first column of Fig.7 shows one person meeting a group; in this situation people commonly avoid collision while continuing to move in the same direction. Our RAI model can predict the changes of pedestrian speed and direction needed for people to merge, while V-LSTM makes inaccurate predictions that lead to collision.

      The second column shows one person behind another. Normally, the person in front is not affected by the trajectory of the person behind, but the person behind is affected by the trajectory of the person in front, for example when the person behind wants to overtake the person in front. Based on historical track information, our RAI model predicts that the person behind is following the person in front, but V-LSTM predicts that the person behind moves directly away from the person in front.

      The third column shows two people avoiding four people who are standing still. According to the historical trajectory information, the two moving people are also walking together. Our RAI method can avoid obstacles while keeping the two people moving together. However, V-LSTM cannot both avoid the obstacles and keep the two people walking together.

      The fourth column shows group walking. Our RAI method predicts that three persons will walk together according to historical trajectory information and performs better than V-LSTM.

      V. Conclusion

      In this paper, we propose a recurrent attention and interaction (RAI) model that combines a temporal attention module, spatial pooling module and randomness modeling module for pedestrian trajectory prediction. The temporal attention module is designed to assign different weights to different times and capture trajectory information in the temporal domain. The spatial pooling module is proposed to model the social information of neighbors in the historical frame and current frame. In addition, the randomness modeling module is used to model the diversity and randomness of our RAI model. Extensive experiments on multiple datasets demonstrate the effectiveness of our RAI model. Our method can not only be applied to pedestrian trajectory prediction, but also to trajectory prediction of other targets, which is of great significance.

      However, due to the subjectivity and randomness of pedestrian movement as well as the diversity of the environment, the task of pedestrian trajectory prediction remains very challenging. There are also some shortcomings in our method. First, pedestrians should not be simply viewed as points, since this causes errors in the prediction process. Even worse, we lose information about the pixels of the pedestrian regions, which we should consider in the prediction model. Second, we do not consider additional auxiliary information, such as scene information and pedestrian attribute information, so determining how to obtain such information and add it to the trajectory prediction model is a good research direction. Third, there can be multiple categories of targets in a scene, but we only consider trajectory prediction for a single category, so we should further consider interactions among targets of different categories.
