Xiaodong Zhao, Yaran Chen, Jin Guo, and Dongbin Zhao
Abstract—Human trajectory prediction is essential to many applications. It is challenging because human behavior is uncertain and is influenced not only by the person but also by the surrounding environment. Recent works based on long short-term memory (LSTM) models have brought tremendous improvements to trajectory prediction. However, most of them focus on the spatial influence of humans and ignore the temporal influence. In this paper, we propose a novel spatial-temporal attention (ST-Attention) model, which studies spatial and temporal affinities jointly. Specifically, we introduce an attention mechanism to extract temporal affinity, learning the importance of historical trajectory information at different time instants. To explore spatial affinity, a deep neural network is employed to measure the different importance of the neighbors. Experimental results show that our method achieves competitive performance compared with state-of-the-art methods on publicly available datasets.
HUMAN trajectory prediction aims to predict the future path from the historical trajectory. The trajectory is represented by a set of consecutive sampled location coordinates. Trajectory prediction is a core building block for autonomous moving platforms, and prospective applications include autonomous driving [1]–[3], mobile robot navigation [4], assistive technologies [5], and smart video surveillance [6].
When a person is walking in a crowd, the future path is determined by various factors such as the person's intention, social conventions, and the influence of nearby people. For instance, people prefer to walk along the sidewalk rather than cross the highway. A person adjusts his path by estimating the future paths of the people around him, and those people do the same, which in turn affects the target. This complexity makes human trajectory prediction an extremely challenging problem. Benefiting from powerful deep learning methods [7], [8], human trajectory prediction has improved significantly in the last few years. Yagi et al. [5] present a multi-stream convolution-deconvolution architecture for first-person videos, which verifies that pose, scale, and ego-motion cues are useful for future person localization. Pioneering works [9], [10] show that long short-term memory (LSTM) has the capacity to learn general human movements and predict future trajectories.
Although tremendous efforts have been made to address these challenges, there are still two limitations:
1) The historical trajectory information at different time instants has different levels of influence on the target human. Most works ignore this, although it plays an important role in predicting the future path. For the target human, the latest trajectory information usually has a higher level of influence on the future path, as shown in Fig. 1(a). For the neighbors, the trajectory information has a great impact whenever the neighbor is close to the target, as shown in Fig. 1(b). Thus, the historical trajectory information at different time instants ought to be given different weights. The attention mechanism is capable of learning such weights according to their importance.
Fig. 1. Illustration of the influences at different time instants. (a) For the target human (P_T), the trajectory information at times t-1 and t may affect the future path more than that at times t-2 and t-3. (b) The neighbor (P_N) turns away from P_T at time t. The trajectory information of P_N at time t-1 has a greater influence on P_T, considering that P_T cannot occupy the position that P_N has just left.
2) Most trajectory prediction methods fail to capture the global context of the environment. Some methods capture the global context through an annotation text, provided by the dataset, that records people's location coordinates. However, the text annotates only a few people, so it is not the real global information. A pre-trained detection model [11] can be used to extract all people in the image rather than relying on the annotation text.
In this work, we propose a spatial-temporal attention network to predict future human trajectories. We adopt an LSTM, called the ego encoder, to model the ego motion of the target human. We also consider all people in the scene, using the pre-trained detection model to extract the positions of neighbors. The positions are fed into a multi-layer perceptron (MLP) to obtain high-dimensional features. Then the inner product is used to obtain the weights that measure the importance of neighbors to the target. Further, another LSTM, called the interaction encoder, models human-human interaction. Most existing models treat the trajectory information at different time instants equally, which is not suitable for complex trajectory prediction. Motivated by this, we introduce an attention mechanism to obtain weights that represent the levels of influence of trajectory information at different time instants. Finally, an LSTM decoder is employed to generate human trajectories for the next few frames.
Our contributions can be summarized as follows:
1) We introduce an attention mechanism to automatically learn the weights. They dynamically determine which time instant's trajectory information we should pay more attention to.
2) We utilize a pre-trained detection model [11] to capture global context instead of retrieving local context from the dataset; an MLP and the inner product are then used to weight different neighbors.
3) Based on the above two ideas, a spatial-temporal attention (ST-Attention) model is proposed to tackle the challenges of trajectory prediction. ST-Attention achieves competitive performance on two benchmark datasets: ETH & UCY [12], [13] and ActEV/VIRAT [14].
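The neighbor weighting named in contribution 2) can be illustrated with a minimal sketch. It assumes 2-D positions as input, a shared ReLU MLP for the target and the neighbors, and a softmax over the inner products; none of these details are confirmed by the text, and the names `init_mlp`, `mlp`, and `neighbor_weights` are hypothetical.

```python
import numpy as np


def init_mlp(sizes, rng):
    """Random weights for a small MLP; `sizes` lists the layer widths."""
    weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(b) for b in sizes[1:]]
    return weights, biases


def mlp(x, weights, biases):
    """Apply a ReLU MLP row-wise to x of shape (n, in_dim)."""
    h = x
    for W, b in zip(weights, biases):
        h = np.maximum(h @ W + b, 0.0)
    return h


def neighbor_weights(target_pos, neighbor_pos, weights, biases):
    """Weight each neighbor by the inner product of its feature with the target's.

    Assumption: the scores are normalized with a softmax so they sum to 1.
    """
    f_target = mlp(target_pos[None, :], weights, biases)[0]  # (d,)
    f_neighbors = mlp(neighbor_pos, weights, biases)          # (N, d)
    scores = f_neighbors @ f_target                           # (N,)
    scores = scores - scores.max()                            # numerical stability
    w = np.exp(scores)
    return w / w.sum()
```

A typical call would build the MLP with `init_mlp([2, 32, 64, 128], rng)` and pass one target position of shape `(2,)` together with an `(N, 2)` array of neighbor positions.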
The Kalman filter [15], [16], which has proven to be an efficient recursive filter, can be deployed to forecast the future trajectory in the case of linear acceleration. It is capable of estimating the state of a dynamic system from a series of incomplete and noisy measurements, especially in the analysis of time sequences. Williams [17] proposes to use Gaussian process distributions to estimate motion parameters such as the velocity and the angle offset, from which a motion pattern of the pedestrian is built. Further, researchers began to associate energy with pedestrians. One representative work is the social forces model proposed by Helbing and Molnár [18], which transforms the attraction and the repulsion between pedestrians and obstacles into energy to predict the pedestrian trajectory. The attractive force guides the target to the destination, and the repulsive force keeps a safe distance and avoids collision. Subsequently, some methods [19] fit the parameters of the energy functions to improve the social forces model.
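As an illustration of the recursive filtering idea above, the following is a minimal sketch of a Kalman filter used for path extrapolation. It assumes a constant-velocity state model, which is a simplification chosen for brevity and not the setting of [15], [16]; the noise levels `q` and `r` and the function name `kalman_predict_path` are hypothetical.

```python
import numpy as np


def kalman_predict_path(observations, n_future, dt=0.4, q=1e-3, r=1e-2):
    """Filter observed (x, y) points, then extrapolate n_future steps.

    State is [x, y, vx, vy]; q and r are assumed process and measurement noise levels.
    """
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)   # constant-velocity transition
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # only positions are observed
    Q = q * np.eye(4)
    R = r * np.eye(2)

    x = np.array([observations[0][0], observations[0][1], 0.0, 0.0])
    P = np.eye(4)
    for z in observations[1:]:
        # Predict step.
        x = F @ x
        P = F @ P @ F.T + Q
        # Update step with the new measurement.
        innovation = np.asarray(z, dtype=float) - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ innovation
        P = (np.eye(4) - K @ H) @ P

    future = []
    for _ in range(n_future):
        x = F @ x                      # extrapolate without measurements
        future.append(x[:2].copy())
    return np.array(future)
```

Because the extrapolation applies only the linear transition, the predicted path is a straight line, which is the limitation in complex scenarios that hand-crafted models of this kind share.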
However, the above methods rely on hand-crafted features. This is an obstacle to improving trajectory prediction performance, because these methods can capture simple interactions but fail in complex scenarios. In contrast, data-driven methods based on convolutional neural networks (CNN) and recurrent neural networks (RNN) overcome these limitations of traditional methods.
CNN [20] has proven to be powerful at extracting rich context information, which is a salient cue for the trajectory prediction task. Behavior-CNN [21] employs a large receptive field to model the walking behaviors of pedestrians and learn the location information of the scene. Yagi et al. [5] develop a deep neural network that utilizes the pose, the location-scale, and the ego-motion cues of the target human, but they do not consider human-human interaction. Huang et al. [22] introduce the spatial matching network and the orientation network. The former generates a reward map representing the reward of every pixel in the scene image, and the latter outputs an estimated facing orientation. However, this method considers only the static scene and not the dynamic information of pedestrians.
RNN [23] has also proven to be efficient for time sequence tasks. RNN models have shown dominant capability in various domains such as neural machine translation [24], speech recognition [25], image description generation [26], and DNA function prediction [27]. Some recent works have attempted to use RNNs to forecast trajectories. Social-LSTM [9] introduces a social pooling layer to learn typical interactions among pedestrians, but this pooling solution fails to capture global context. Besides, Social-LSTM predicts the distribution of the trajectory locations instead of directly predicting the locations. This makes training difficult because the sampling process is non-differentiable. Gupta et al. [10] propose Social-GAN, which combines trajectory prediction approaches with generative adversarial networks, but the performance does not improve noticeably when only one sample is drawn at test time. Liang et al. [28] present Next, an end-to-end learning framework that extracts rich visual information to recognize pedestrian behaviors. Furthermore, the focal attention [29] is incorporated into the framework. It was originally proposed to tackle visual question answering by projecting different features into a low-dimensional space. However, the focal attention used in Next is hard-wired and fails to learn from the data. Xu et al. [30] design a crowd interaction deep neural network that considers all pedestrians in the scene as well as their spatial affinity for trajectory prediction. However, they ignore the influence of temporal affinity. In our work, we take into account both spatial affinity and temporal affinity.
Fig. 2. Overview of the proposed ST-Attention. The model uses the encoder to extract the ego feature Ego(h_1, ..., h_{T_obs}) and the interaction feature Inter(h_1, ..., h_{T_obs}) from L_t^i and B_t^j, respectively (t ∈ [1, T_obs]); the decoder then outputs the future path L_{t'}^i (t' ∈ [T_obs + 1, T_pred]).
Some approaches for trajectory prediction employ the attention mechanism to differentiate the influence of neighbors on the target. Su et al. [31] update the LSTM memory cell state with a coherent regularization, which computes the pairwise velocity correlation to weight the dependency between the trajectories. Further, a social-aware LSTM unit [32] is proposed, which incorporates the nearby trajectories to learn a representation of the crowd dynamics. Zhang et al. [33] utilize a motion gate to highlight the important motion features of neighboring agents. Sadeghian et al. [34] apply soft attention similarly to [26], emphasizing the salient regions of the image and the more relevant agents. However, the above works focus on the spatial influence of the neighboring agents and ignore their temporal influence, which is also valuable for human trajectory prediction. The attention mechanism in our model connects the decoder state and the temporal encoder state, which allows it to assign an importance value to each time instant's trajectory state for both the neighboring humans and the target human.
With the advancement of pedestrian re-identification (Re-ID) [35], the same person with different appearances can be identified accurately, which facilitates the extraction of human trajectories. Köstinger et al. [36] consider that the difference among various image features of the same pedestrian conforms to a Gaussian distribution, and propose keep-it-simple-and-straightforward metric learning (KISSME). However, KISSME encounters the small sample size problem when calculating the covariance matrices of the various classes, which limits the improvement of Re-ID performance. Han et al. [37] verify that virtual samples can alleviate the small sample size problem of KISSME. In addition, the re-extraction of virtual sample features is eliminated by a genetic algorithm, which greatly improves the matching rate of pedestrian Re-ID. Further, the KISS+ algorithm [38] is proposed to generate virtual samples by using an orthogonal basis vector; its simplicity, fast execution, and easy operation make it well suited to real-time pedestrian Re-ID in open environments. These works are of great significance to human trajectory prediction.
When walking in a crowd, a person adjusts his trajectory based on the destination he has in mind and the influence of neighbors. On the one hand, the future trajectory of the target human depends on historical trajectories at different time instants, which we refer to as temporal affinity. On the other hand, the future trajectory hinges on the distances, the velocities, and the headings of neighbors, which we refer to as spatial affinity. This idea motivates us to study trajectory prediction jointly with temporal and spatial affinities. In this section, we present our spatial-temporal attention model for this problem.
The overall network architecture is illustrated in Fig. 2. Our model employs an encoder-decoder framework. Specifically, the encoder consists of the ego module and the interaction module, and the decoder includes the attention module and the prediction module. We feed the locations into the ego module to get the ego feature, which models the motion of the target. At the same time, the observed boxes are fed into the interaction module to get the interaction feature, which explores the relationship among neighbors. These feature vectors are weighted and summed along the temporal dimension by the attention module. Then the prediction module employs an LSTM to generate the future trajectory. In the rest of this section, we detail these modules.
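The weighted sum along the temporal dimension performed by the attention module can be sketched as follows. This is a minimal illustration that assumes dot-product scoring between the decoder state and each encoder state followed by a softmax; the scoring function actually used by the model is not given in this text, and the name `temporal_attention` is hypothetical.

```python
import numpy as np


def temporal_attention(decoder_state, encoder_states):
    """Weight encoder states over time and return their weighted sum.

    decoder_state:  shape (d,)
    encoder_states: shape (T_obs, d), one hidden state per observed time instant
    Assumption: scores are inner products normalized by a softmax.
    """
    scores = encoder_states @ decoder_state      # (T_obs,)
    scores = scores - scores.max()               # numerical stability
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum()                  # one weight per time instant
    context = alpha @ encoder_states             # (d,)
    return context, alpha
```

With a zero decoder state all time instants receive equal weight and the context reduces to the temporal mean, which corresponds to the equal treatment that the attention module is meant to replace.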
The ego module aims to explore the intention of the target human, which is reflected by motion characteristics such as the velocity, the acceleration, and the direction. Because LSTM is well suited to sequence data, it is chosen as the ego module architecture. For the pedestrian p_i, we embed the location into a vector e_t. The embedding is then fed into the ego encoder, whose hidden state h_t is computed by
In this section, we analyze our model on pedestrian trajectory datasets based on the world plane and the image plane. Specifically, we evaluate meter values on the ETH [12] and UCY [13] datasets, and report pixel values on the ActEV/VIRAT dataset [14]. Experimental results demonstrate that our model performs well on both the world plane and the image plane.
Similar to prior works [10], [28], we use two metrics to report prediction error:
1) Average Displacement Error (ADE): The average L2 distance between the ground truth coordinates and the predicted coordinates over all predicted time instants.
2) Final Displacement Error (FDE): The L2 distance between the ground truth point and the predicted point at the final time instant T_pred.
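The two metrics can be computed as in the following minimal sketch. It assumes predictions and ground truth are arrays whose last two axes are time and (x, y); the function name `ade_fde` is hypothetical.

```python
import numpy as np


def ade_fde(pred, gt):
    """Return (ADE, FDE) for arrays of shape (..., T, 2).

    ADE averages the L2 error over all predicted time instants;
    FDE uses only the final time instant. Both are averaged over any
    leading axes (for example, over pedestrians).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=-1)    # (..., T)
    return float(dist.mean()), float(dist[..., -1].mean())
```

For example, a prediction that is offset from the ground truth by (3, 4) at every time instant has a per-step error of 5, so both ADE and FDE equal 5.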
We compare the results of our model with the following state-of-the-art methods under the same conditions:
1) Linear [10]: A linear regression model whose parameters are determined by minimizing the least L2 distance.
2) S-LSTM [9]: Alahi et al. [9] build one LSTM for each person and share the information between the LSTMs through the social pooling layer. At each time instant t during the prediction period, the LSTM hidden state represents a bivariate Gaussian distribution described by the mean μ, the standard deviation σ, and the correlation coefficient ρ. Then the predicted trajectory at time t+1 is sampled from the distribution.
3) Next [28]: Liang et al. [28] encode a person through rich visual features rather than oversimplifying a human as a point. The person behavior module is proposed to capture the visual information, modeling appearance and body movement. The person interaction module is used to capture information about other objects and the surroundings. The future trajectory is predicted by the LSTM with the focal attention [29].
4) LiteNext: We implement a simplified version of the Next model which takes into account only the person's trajectory and the person-objects. All other settings are kept the same. In this way, the input of LiteNext is the same as ours.
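The sampling step described for S-LSTM in item 2) can be sketched as follows. The covariance construction from σ and ρ is the standard one for a bivariate Gaussian; the function names and the use of NumPy's random generator are assumptions for illustration and are not taken from [9].

```python
import numpy as np


def gaussian_cov(sigma, rho):
    """Build the 2x2 covariance matrix from (sigma_x, sigma_y) and correlation rho."""
    sx, sy = sigma
    return np.array([[sx * sx, rho * sx * sy],
                     [rho * sx * sy, sy * sy]])


def sample_next_position(mu, sigma, rho, rng, size=None):
    """Draw the position(s) at time t+1 from the predicted bivariate Gaussian."""
    return rng.multivariate_normal(np.asarray(mu, dtype=float),
                                   gaussian_cov(sigma, rho), size=size)
```

Drawing a sample in this way is a stochastic operation, which is why a model that predicts distribution parameters needs extra care during training compared with one that outputs locations directly.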
The ETH dataset consists of 750 pedestrians and contains two sets (ETH and HOTEL). The UCY dataset contains 786 pedestrians and comprises three sets (ZARA1, ZARA2, and UNIV). These datasets contain rich and challenging real-world scenarios, including walking in company, giving way to each other, and lingering. The numbers of tags in the datasets, including frame, pedestrian, group, and obstacle, are summarized in Table I.
TABLE I The Number of Tags
1) Setup: Following the same experimental setup as [10], we use the leave-one-out strategy, that is, training on 4 sets and testing on the remaining set. With a sampling period of 0.4 s, we observe the trajectory for 8 frames (3.2 s) and predict the next 12 frames (4.8 s), namely T_obs = 8, T_obs + 1 = 9, and T_pred = 20.
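The setup above can be sketched in a few lines. The scene names and the values T_obs = 8 and T_pred = 20 come from the text; the window stride of one frame and the function names are assumptions made for illustration.

```python
T_OBS, T_PRED = 8, 20          # observe 8 frames, predict frames 9..20
SCENES = ["ETH", "HOTEL", "UNIV", "ZARA1", "ZARA2"]


def leave_one_out_splits(scenes):
    """Yield (training scenes, held-out test scene) for every scene."""
    return [([s for s in scenes if s != held_out], held_out) for held_out in scenes]


def sliding_windows(track, t_obs=T_OBS, t_pred=T_PRED):
    """Cut one pedestrian track into (observed, future) pairs.

    Assumption: windows advance by one frame; tracks shorter than
    t_pred frames produce no sample.
    """
    windows = []
    for start in range(len(track) - t_pred + 1):
        observed = track[start:start + t_obs]
        future = track[start + t_obs:start + t_pred]
        windows.append((observed, future))
    return windows
```

Each window therefore covers 20 frames, or 8 s at the 0.4 s sampling period, of which the first 3.2 s are observed and the remaining 4.8 s are predicted.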
TABLE II Quantitative Results of Different Methods on ETH & UCY Datasets. We Use ADE and FDE to Measure Prediction Error in Meter Values, and Lower Is Better
2) Implementation Details: In the interaction module, a multi-layer perceptron with 3 layers is employed. The node sizes of these layers are set to 32, 64, and 128, respectively, and the dimension of the embedding layer is 128. The LSTM hidden size d is set to 256. A batch size of 64 is used and the training stage runs for 100 epochs. We use the Adam optimizer with an initial learning rate of 0.001. To facilitate training, we clip the gradients at a value of 10. A single NVIDIA GeForce GTX Titan-Xp GPU is used for training.
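The hyperparameters listed above can be collected in a small configuration object, sketched below. The values are those stated in the text; the class and field names are hypothetical, and interpreting "clip the gradients at a value of 10" as clipping by global norm is an assumption (element-wise clipping to [-10, 10] is the other plausible reading).

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class TrainConfig:
    mlp_sizes: tuple = (32, 64, 128)   # interaction-module MLP layer widths
    embedding_dim: int = 128
    lstm_hidden: int = 256
    batch_size: int = 64
    epochs: int = 100
    learning_rate: float = 1e-3        # initial learning rate for Adam
    grad_clip: float = 10.0


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm.

    Assumption: global-norm clipping; the text does not specify the variant.
    """
    total = float(np.sqrt(sum(np.sum(g * g) for g in grads)))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total
```

Gradients whose joint norm is already below the threshold pass through unchanged, so clipping only affects the occasional large update.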
3) Quantitative Results: We report the ADE and FDE results for all methods across the crowd sets in Table II. The linear regressor presents high prediction errors since it cannot model curved trajectories, unlike the other methods, which employ LSTM networks to overcome this deficiency. Moreover, the Next model and ours are better than S-LSTM because they consider global human-human interaction, whereas S-LSTM does not perform as well as expected since it fails to consider global context.
Our ST-Attention achieves state-of-the-art performance on the ETH and UCY benchmarks. Throughout Table II, the evaluation error on the single ETH crowd set is much higher than those on the other sets. This crowd set contains many pedestrians in a narrow passage who walk in disorder with different velocities and headings. Compared with LiteNext, ST-Attention receives the same input information but performs significantly better, especially on the ETH dataset. This verifies the effectiveness of our method. Compared with Next, ST-Attention lacks two input feature channels and has a lighter network structure. Nevertheless, ST-Attention remains competitive and achieves results no worse than those of Next. This is because the focal attention [29] used in Next is hard-wired and cannot make full use of the input features.
Computational time is crucial for real-world applications. For instance, real-time prediction of pedestrian trajectories in front of vehicles is necessary in autonomous driving. In Table III, we compare our model with other models in terms of speed. S-LSTM has fewer parameters, but its computational time is not as fast as expected. The decrease in speed occurs because S-LSTM adopts a recursive method to predict future trajectories, which means that it needs to compute occupancy grids to implement social pooling at each time instant. Compared with Next, our method reduces the number of parameters by almost half, since ST-Attention uses fewer input channels. Correspondingly, our method is 2.5x faster than Next, taking about 0.02 s on the premise of thatandare obtained. Due to the efficient interaction model, our model is also faster than LiteNext.
TABLE III Speed Comparison with Baseline Models.Our Method with Fewer Number of Parameters Gets 2.5x Speedup Compared to Next
Fig. 5. Visualization of the paths predicted by our ST-Attention on the ETH & UCY datasets: history trajectory (orange), ground truth (red), predicted trajectory of the Next model (blue) and of our model (green). The first three rows show some successful cases and the last row presents some failure examples. In most cases our predicted trajectory coincides with the ground truth.
4) Qualitative Results: We visualize the qualitative results of the Next model and our ST-Attention for comparison in Fig. 5. The results demonstrate the effectiveness of our model. When a person meets a couple, as shown in Fig. 5(a1) and Fig. 5(a3), ST-Attention is able to pass through the gap while the Next model might collide with one of them. In the second row of Fig. 5, we present some scenes where people walk in a group, and ST-Attention is able to jointly predict their trajectories with lower error than the Next model. Note that in Fig. 5(c2), the path predicted by the Next model passes through the wall even though the Next model encodes scene semantic information, which indicates that the focal attention [29] cannot fully utilize the rich visual features. In the last row of Fig. 5, several failure cases are shown. In Fig. 5(d1), a pedestrian waiting at the station moves slightly as he paces back and forth. ST-Attention assumes that he will make a small movement along the previous trend, while the ground truth has a sudden turn. In Fig. 5(d2), people move in opposite directions; ST-Attention predicts that the target human slows down to avoid collision, but he actually gives way toward the right. In Fig. 5(d3), ST-Attention predicts a change of direction toward a wide space, whereas the pedestrian goes straight ahead. Although the predicted paths do not correspond to the ground truth in the failure cases, the outputs are still acceptable trajectories that pedestrians may take.
ActEV/VIRAT [14] is a benchmark dataset devoted to human activity research. This dataset is natural, realistic, and challenging in the field of video surveillance in terms of its resolution and diversity of scenes, and includes more than 400 videos at 30 frames/s.
1) Implementation Details: In the experiment, the training set includes 57 videos and the validation set includes 5 videos. Besides, 55 videos are used for testing. To remain consistent with the baseline models based on ETH & UCY, the activity label is not used in this experiment. Other parameter settings are the same as those for ETH & UCY.
2) Results: Table IV shows quantitative results compared with the baseline methods. Combined with Table III, we can see that our model outperforms the other methods with fewer parameters. We also visualize the results to reflect the performance of the algorithm intuitively. As shown in Fig. 6, our predicted trajectory matches the ground truth better. When a person walks at a normal pace, our model is able to predict his future path, including situations in which the person turns to change the direction of the trajectory, such as Fig. 6(b) and Fig. 6(c). In Fig. 6(c), the historical trajectory of the target human turns right and then curves left. The Next model predicts that the human will continue to turn left, but in fact he turns right onto the main road, as our model predicts. However, our model performs poorly when the human makes a large change of direction due to obvious external interference or other purposes, such as the failed cases in Fig. 6. Even though such cases are hard for our model, it still achieves better performance than the other methods.
TABLE IV Quantitative Results on ActEV/VIRAT.We Report ADE and FDE in Pixel Values
Fig. 6. Visualization results on the ActEV/VIRAT dataset: history trajectory (orange), ground truth (red), predicted trajectory of the Next model (blue) and of our model (green).
TABLE V Ablation Experiments of the Interaction Module and the Attention Module.We Report ADE and FDE (ADE/FDE) on ETH&UCY (Meter Values) and ActEV/VIRAT (Pixel Values)
To explore the role of each module in trajectory prediction, we conduct an ablation study on the ETH & UCY and ActEV/VIRAT datasets.
1) Effectiveness of the Interaction Module: To verify the importance of the interaction module, we train a network with the interaction branch removed. The results of the comparison with and without the interaction module are shown in Table V. The model with the interaction module achieves better performance, because the interaction module measures the influence of neighbors on the target.
2) Effectiveness of the Attention Module: To evaluate the effectiveness of our attention module, we compare it with the focal attention [29], which is not learnable. The comparison result is shown in Table V, and our attention module performs better than the focal attention. This is because our soft attention can automatically learn the weights while the focal attention cannot, which suggests that our attention module is effective for trajectory prediction.
In this paper, we present ST-Attention, a spatial-temporal attention model for trajectory prediction. To explore spatial affinity, we use an MLP and the inner product to assign different weights to all pedestrians. To make full use of temporal affinity, we introduce the key component, an efficient attention module. Our model is fully differentiable and accurately predicts the future trajectory of the target; it automatically learns the importance of historical trajectories at different time instants and weights the influence of nearby people on the target. Comprehensive experiments on two publicly available datasets demonstrate that ST-Attention achieves competitive performance.
Our approach is designed for human trajectory prediction. Future work can extend the model to vehicle trajectory prediction, given the many similarities between the two prediction tasks. Meanwhile, their differences should also be distinguished. For example, pedestrians can turn back easily while vehicles cannot, and vehicles can change speed rapidly while pedestrians cannot. In particular, it is critical in the autonomous driving field to predict human trajectories jointly with vehicle trajectories. Besides, intelligent optimization algorithms [41], [42] can be used to learn all the parameters.
IEEE/CAA Journal of Automatica Sinica, 2020, Issue 4