
    Action Recognition with Temporal Scale-Invariant Deep Learning Framework

China Communications, 2017, Issue 2

    Huafeng Chen , Jun Chen , Ruimin Hu *, Chen Chen, Zhongyuan Wang

    1 State Key Laboratory of Software Engineering, Wuhan University, Wuhan 430072, China

    2 National Engineering Research Center for Multimedia Software, Computer School of Wuhan University, Wuhan 430072, China

    3 Center for Research in Computer Vision, University of Central Florida, Orlando, FL 32816, USA

    * The corresponding author, email: hrm@whu.edu.cn

    I. INTRODUCTION

Human action recognition aims to enable computers to automatically recognize human actions in video through related features. Visual features extracted from videos are crucial for addressing this problem and designing effective action recognition systems [1-8]. Traditional action features can be divided into two categories: hand-crafted features and deep-learned features.

The hand-crafted features are manually designed based on the statistical properties of the frame pixels in a video. Laptev [9] developed space-time interest points (STIP) by extending the image Harris detector to video, and integrated the HOG/HOF descriptors to represent local spatio-temporal features around STIPs [10]. Scovanner et al. [11] proposed a 3D version of the scale-invariant feature transform (3D-SIFT) descriptor. Willems et al. [12] extended the image SURF descriptor to the video domain by computing weighted sums of uniformly sampled responses of spatio-temporal Haar wavelets. Klaser et al. [13] designed a 3D version of the HOG descriptor (HOG3D) for video action recognition. Yeffet and Wolf [14] proposed Local Trinary Patterns for videos as an extension of Local Binary Patterns (LBP). Wang et al. [15] introduced motion boundary histograms in Dense Trajectories (DT) to suppress camera motion, and further proposed a camera motion estimation method in [16] to explicitly rectify the image and remove the camera motion. Benefiting from this twofold suppression of camera motion, the iDT in [16] achieves the best action recognition accuracy among the hand-crafted features. However, the shortcoming of hand-crafted descriptors is that they are not optimized for visual representation and lack discriminative capacity for action recognition [17].

Encouraged by the success of CNNs in image classification [18], researchers have exploited deep-learned features for video action recognition. Taylor et al. [19] proposed the convGRBM algorithm to learn spatio-temporal features in an unsupervised manner using a Gated Restricted Boltzmann Machine (GRBM). Ji et al. [20] extended 2D CNNs to 3D CNNs for action recognition, feeding gray maps, gradient maps and optical flow maps of raw video frames into multi-channel CNNs. Karpathy et al. [21] introduced several CNN architectures based on stacked RGB images for video classification. Simonyan and Zisserman [22] designed the two-stream architecture, which exploits two CNNs to model the static appearance and the motion variation of an action, respectively. Based on two-stream CNNs and iDT, Wang et al. [17] designed Trajectory-pooled Deep-Convolutional (TDD) descriptors, which enjoy the merits of both CNNs and trajectory-based methods. While deep-learned features with CNN architectures distinctly improve the accuracy of action recognition compared to hand-crafted features, they are not able to model the long-term temporal structure among the frames of a video.

Recently, Ng et al. [23] utilized the recurrent Long Short-Term Memory (LSTM) architecture to capture the temporal structure of consecutive frames. Donahue et al. [24] proposed a temporal LSTM structure similar to [23]. Wu et al. [25] fed the spatial CNN feature and the motion CNN feature [22] into two sets of LSTM networks to model long-term temporal clues; they also employed a regularized feature fusion network to fuse the video-level features. Srivastava et al. [35] learned unsupervised deep action representations based on a composite LSTM model. Generally, the LSTM-based frameworks model temporal clues among video frames and attain state-of-the-art performance by integrating the deep learning architectures of CNN and LSTM.

Nevertheless, current LSTM-based methods sample keyframes by serial [23, 24] or fixed-number [25] sampling strategies, which cannot cope with changes in action speed. Actions differ greatly in execution time: taking the UCF101 dataset [26] as an example, the frame numbers of different action sequences vary from 200 to more than 400. An action consists of a number of sub-actions in order. For example, the action of “clean and jerk” can be divided into four sub-actions: clean the weight to the chest, squat, jerk, and keep standing. Videos of the same action class have an equal number of sub-actions while having different frame numbers. The temporal scales of the sub-actions in an action can change, thereby affecting the action speed. The traditional serial (or fixed-number) keyframe sampling strategy is mechanical and is not robust to the temporal scale variation of the sub-actions in an action video.

To this end, in this paper we propose a novel temporal scale-invariant deep learning framework for action recognition, as shown in Figure 1. The motivation is that, instead of using serial or fixed-number sampling, we extract keyframes of the sub-action clips in an action video to represent the sub-actions. As shown in Figure 1, we firstly split a video into multiple sub-action clips. A keyframe is sampled from every sub-action clip to represent that sub-action, since the frames within a sub-action are similar in appearance while frames from different sub-actions differ significantly. Based on the keyframe and its adjacent frames, we calculate optical flow maps. We feed the keyframe and the corresponding optical flows into the two-stream CNN architecture [22] separately, and fuse the two-stream CNN features. The fused features of the sampled frames of all clips serve as the input of the LSTM framework. The outputs of the sequence-based LSTM are combined into the final predictions.

    The contributions of this paper are summarized as follows:

1) We find that sub-action based keyframe sampling is invariant to the temporal scale of an action (i.e., its speed) and yields better action recognition than serial or fixed-number sampling.

2) We propose a sub-action segmentation method based on the recent Deepbit feature [29], and evaluate the effect of the keyframe location within a sub-action on action recognition accuracy.

3) We introduce convolutional fusion into the LSTM framework for action recognition.

    The rest of this paper is organized as follows. In Section II, we present the proposed temporal scale-invariant deep learning framework in detail. We conduct the experiments on UCF101 [26] and HMDB51 [27] action recognition databases in Section III. Finally, we conclude this paper in Section IV.

    II. PROPOSED APPROACH

In this section, we start with the presentation of temporal sub-action clustering and keyframe sampling, and then introduce the training/testing method of the spatial and motion CNNs. Based on the fused descriptors of the spatial and motion CNN features, we describe the long-term LSTM-based temporal modeling framework and the combination of video-level prediction scores.

    2.1 Sub-action clustering

Formally, given a video sequence V = {f_t}, t = 1, …, n (where t is the frame index of the video and n is the number of frames), sub-actions group the n video frames into m sequential segments:

V = {C_1, C_2, …, C_m},

where C_j denotes the set of consecutive frames belonging to the j-th sub-action.

We expect a sub-action in a video to contain enough motion information to express a part of the action, while at the same time not being so long that it includes irrelevant motions. Fig. 2 illustrates the process of sub-action clustering. Specifically, we create a binary code for each frame based on the Deepbit feature [29]. We select a 32-bit length for the binary codes since it is the optimal length for image matching [29]. These binary codes have small Hamming distances for frames that are close in the CNN feature space. Moving across the frames, we pick each frame whose binary code differs from that of its previous frame as the start of the next clip. By repeating this process, a video is divided into several segments.
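To make the splitting step concrete, the sketch below (not the authors' code) segments a video given per-frame 32-bit Deepbit codes; the array name `deepbit_codes` and the Hamming threshold `theta` are illustrative assumptions, with `theta = 0` reproducing the rule of starting a new clip whenever the code changes.

```python
import numpy as np

def split_by_binary_codes(deepbit_codes: np.ndarray, theta: int = 0) -> list:
    """Split a video into clips of consecutive frames with similar Deepbit codes.

    deepbit_codes: (num_frames, 32) array of 0/1 bits, one row per frame.
    theta: Hamming-distance threshold; a new clip starts when the distance
           between a frame and its predecessor exceeds theta.
    Returns a list of (start, end) frame-index pairs, end exclusive.
    """
    clips, start = [], 0
    for t in range(1, len(deepbit_codes)):
        hamming = int(np.count_nonzero(deepbit_codes[t] != deepbit_codes[t - 1]))
        if hamming > theta:          # code changed -> this frame opens a new clip
            clips.append((start, t))
            start = t
    clips.append((start, len(deepbit_codes)))
    return clips

# toy example: 6 frames whose code flips after the third frame
codes = np.array([[0] * 32, [0] * 32, [0] * 32, [1] * 32, [1] * 32, [1] * 32])
print(split_by_binary_codes(codes))   # [(0, 3), (3, 6)]
```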

Affected by the variation of camera angle and other factors, the number k of segmented clips of a video may be larger or smaller than the actual number of sub-actions m. When k > m, we merge some of the short segments to approach the m sub-actions. We start the merging process from the shortest clip, based on the assumption that long segments represent the sub-actions better than short clips. The shortest clip is merged into whichever of its adjacent clips has more frames, and the merging ends when the number of segments equals the number of sub-actions. If k < m (which rarely happens in our experiments), we split some of the long segments instead: the longest clip is divided into two equal parts, and the splitting ends when the number of segments equals the number of sub-actions.
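A minimal sketch of this merge/split adjustment, assuming clips are (start, end) index pairs with exclusive ends; the tie-breaking when both neighbours are equally long is an assumption not specified in the text.

```python
def adjust_to_m_segments(clips, m):
    """Merge or split clips (list of (start, end) pairs, end exclusive) until
    exactly m segments remain, mirroring the strategy described above."""
    clips = list(clips)
    # k > m: repeatedly merge the shortest clip into its longer neighbour
    while len(clips) > m:
        i = min(range(len(clips)), key=lambda j: clips[j][1] - clips[j][0])
        if i == 0:
            left, right = None, 1
        elif i == len(clips) - 1:
            left, right = i - 1, None
        else:
            left_len = clips[i - 1][1] - clips[i - 1][0]
            right_len = clips[i + 1][1] - clips[i + 1][0]
            left, right = (i - 1, None) if left_len >= right_len else (None, i + 1)
        if right is not None:      # absorb the right neighbour
            clips[i] = (clips[i][0], clips[right][1]); del clips[right]
        else:                      # absorb into the left neighbour
            clips[left] = (clips[left][0], clips[i][1]); del clips[i]
    # k < m (rare): repeatedly split the longest clip into two equal halves
    while len(clips) < m:
        i = max(range(len(clips)), key=lambda j: clips[j][1] - clips[j][0])
        s, e = clips[i]
        mid = (s + e) // 2
        clips[i:i + 1] = [(s, mid), (mid, e)]
    return clips

print(adjust_to_m_segments([(0, 3), (3, 6), (6, 20)], 2))   # [(0, 6), (6, 20)]
```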

    2.2 Keyframe sampling

A keyframe sampled from a sub-action clip can effectively describe a stage in the evolution of the action [28]. Moreover, using one keyframe to represent a sub-action adapts naturally to variations in sub-action speed. Theoretically, we can sample the keyframe at any location in the sub-action clip C_j:

t_j = b_j + ⌊(α/2)(e_j − b_j)⌋,   α ∈ [0, 2],

where b_j and e_j are the first and last frame indices of C_j, and α is the step-size parameter (α = 1 selects the middle frame of the clip). We evaluate the effect of α on action recognition accuracy in Section 3.2.
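Under this formulation, keyframe sampling reduces to a one-line index computation; the sketch below assumes the clip is given as an exclusive-end index range and that α ranges over [0, 2] with α = 1 selecting the middle frame.

```python
def sample_keyframe(start: int, end: int, alpha: float = 1.0) -> int:
    """Pick a keyframe index inside the clip [start, end) controlled by the
    step-size parameter alpha in [0, 2]; alpha = 1 returns the middle frame."""
    assert 0.0 <= alpha <= 2.0
    offset = int(alpha * (end - start - 1) / 2)   # 0 -> first frame, 2 -> last frame
    return start + offset

print(sample_keyframe(10, 20, alpha=1.0))   # 14 (~middle of frames 10..19)
```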

    2.3 Feature extracting

We follow the two-stream architecture [22] and extract the spatial and short-term motion features from the sampled keyframes. For training/testing the CNN models, we select the VGG-16 model instead of the VGG-M-2048 model, as the former achieves significantly higher action recognition accuracy than the latter [27, 28].

In order to learn robust features from CNNs, we adopt three data augmentation strategies [32]. Firstly, we randomly crop a 224×224 patch from the keyframe. Secondly, the cropped patches are horizontally flipped at random. Finally, we use a scale jittering strategy to help the CNN learn robust features [31]: we crop patches at three scales (1, 0.875, and 0.75), which yields scaled patches of size 224×224, 196×196, and 168×168 that are then resized to 224×224. In the testing phase, we do not use data augmentation and simply crop one 224×224 patch from the center of each testing frame.
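A minimal sketch of the training-time augmentation and the testing-time center crop, assuming NumPy frames and OpenCV for resizing; it folds the random crop and the scale jittering into a single step and is only an illustration of the strategy described above.

```python
import random
import numpy as np
import cv2

CROP = 224
SCALES = (1.0, 0.875, 0.75)            # -> 224, 196 and 168 pixel patches, as in the text

def augment_keyframe(frame: np.ndarray) -> np.ndarray:
    """Training-time sketch: scale-jittered random crop plus random horizontal
    flip, with the patch resized back to 224x224 before entering the CNN."""
    size = int(round(CROP * random.choice(SCALES)))      # 224 / 196 / 168
    h, w = frame.shape[:2]
    y, x = random.randint(0, h - size), random.randint(0, w - size)
    patch = np.ascontiguousarray(frame[y:y + size, x:x + size])
    if random.random() < 0.5:                            # random horizontal flip
        patch = patch[:, ::-1].copy()
    return cv2.resize(patch, (CROP, CROP))

def center_crop(frame: np.ndarray) -> np.ndarray:
    """Testing-time: a single 224x224 crop from the frame centre, no augmentation."""
    h, w = frame.shape[:2]
    y, x = (h - CROP) // 2, (w - CROP) // 2
    return frame[y:y + CROP, x:x + CROP]

frame = np.zeros((256, 340, 3), dtype=np.uint8)          # a typical resized video frame
print(augment_keyframe(frame).shape, center_crop(frame).shape)   # (224, 224, 3) twice
```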

    Fig.2 Deepbit feature based sub-action clustering

The spatial CNN is pre-trained on the ImageNet dataset [33] and its parameters are then fine-tuned on the action recognition datasets. During spatial CNN training, the learning rate is set to 10^-3, decreased to 1/10 of its value every 4K iterations, and training stops at 10K iterations [31]. We set the dropout ratio to 0.9 for the two fully connected layers of the spatial CNN.
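The learning-rate schedule is a plain step decay; a small sketch (the helper name is ours) with the spatial CNN settings:

```python
def step_decay_lr(iteration: int, base_lr: float = 1e-3, decay_every: int = 4000) -> float:
    """Step decay: divide the learning rate by 10 every `decay_every` iterations."""
    return base_lr * (0.1 ** (iteration // decay_every))

# spatial CNN: 1e-3, divided by 10 every 4K iterations, training stopped at 10K
print([step_decay_lr(i) for i in (0, 4000, 8000)])   # ~[1e-3, 1e-4, 1e-5]
```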

For training the temporal CNN, we use optical flow stacking with L=10 frames [22] to learn the short-term motion features around the keyframe. We pre-compute the optical flow using the GPU implementation of [34] from the OpenCV toolbox. The horizontal and vertical components of the optical flow are linearly rescaled to the [0, 255] range, like two-channel images. The training procedure of the temporal CNN is similar to that of the spatial CNN, while the input data is a 224×224×20 sub-volume. The learning rate is initialized to 10^-3, decreased to 1/10 of its value every 10K iterations, and training stops at 30K iterations [31]. We set the dropout ratios to 0.9 and 0.8 for the two fully connected layers of the temporal CNN.
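A sketch of how the 224×224×20 motion input can be assembled; OpenCV's Farneback flow is used here only as a CPU stand-in for the GPU implementation of [34], and the per-field min-max rescaling to [0, 255] is one possible realization of the linear scaling described above.

```python
import numpy as np
import cv2

def stacked_flow_input(gray_frames, L=10):
    """Build the 20-channel motion input around a keyframe: L=10 consecutive
    optical-flow fields, horizontal and vertical components interleaved and
    linearly rescaled to [0, 255]. `gray_frames` holds L+1 grayscale frames
    centred on the keyframe."""
    channels = []
    for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        for comp in (flow[..., 0], flow[..., 1]):             # horizontal, vertical
            lo, hi = comp.min(), comp.max()
            scaled = (comp - lo) / (hi - lo + 1e-8) * 255.0   # linear map to [0, 255]
            channels.append(scaled.astype(np.uint8))
    return np.stack(channels, axis=-1)                        # H x W x 2L

frames = [np.random.randint(0, 256, (224, 224), dtype=np.uint8) for _ in range(11)]
print(stacked_flow_input(frames).shape)                       # (224, 224, 20)
```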

    2.4 Feature fusing

In this section, we consider fusing the two CNN streams to establish the correspondence between the channel responses at the same pixel position. As an example, consider discriminating between the actions of opening a door and closing a door. The two actions are identical in appearance and only opposite in motion direction. When a hand moves at some spatial location in the video, the temporal CNN can recognize that motion (opening or closing), the spatial CNN can recognize the location of the hand, and their combination can discriminate the two actions.

For feature fusion within the LSTM framework, the experimental results in [24] show that sum fusion at the first fully connected (FC) layer performs better than at the last FC layer. Wu et al. [25] also adopt the first FC layer as the feature fusion layer. Recently, Feichtenhofer et al. [30] verified that the convolutional (Conv) fusion method performs better than the traditional sum fusion at the FC layer [22] for CNN-based action recognition. In this paper, we introduce the Conv fusion method into the LSTM-based action recognition framework and compare it with the traditional sum fusion method.

We briefly introduce the concatenation fusion, as it is the basis of the Conv fusion [30]. The concatenation fusion function y^cat = f^cat(x^a, x^b) stacks the two feature maps at the same spatial location (h, w) across the feature channels d:

y^cat_{h,w,2d} = x^a_{h,w,d},   y^cat_{h,w,2d−1} = x^b_{h,w,d},    (4)

where x^a, x^b ∈ R^{H×W×D} are the spatial and temporal feature maps and y^cat ∈ R^{H×W×2D}.

The Conv fusion function y^conv = f^conv(x^a, x^b) first combines the two feature maps at the same spatial locations h, w across the feature channels d as in (4), and then convolves the stacked data [30] with a bank of filters f:

y^conv = y^cat ∗ f + b,

where the filter f has the dimension 1 × 1 × 2D × D, b ∈ R^D is the bias term, and the number of output channels is D.
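As an illustration, a PyTorch sketch of Conv fusion for the RELU5_3 feature maps (D = 512 with 14×14 spatial size for a 224×224 input, which are standard VGG-16 facts); the class and variable names are ours, and plain channel concatenation is used in place of the interleaved stacking of (4), which differs only by a permutation of the filter weights.

```python
import torch
import torch.nn as nn

class ConvFusion(nn.Module):
    """Conv fusion sketch: stack the spatial and temporal feature maps along
    the channel axis (concatenation fusion), then apply a learned 1x1x2DxD
    filter bank that maps the 2D stacked channels back to D output channels."""
    def __init__(self, channels: int = 512):      # D = 512 at VGG-16 RELU5_3
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1, bias=True)

    def forward(self, x_spatial: torch.Tensor, x_temporal: torch.Tensor) -> torch.Tensor:
        y_cat = torch.cat([x_spatial, x_temporal], dim=1)   # N x 2D x H x W
        return self.fuse(y_cat)                             # N x D x H x W

# RELU5_3 of VGG-16 on a 224x224 input yields 14x14x512 feature maps
xa = torch.randn(1, 512, 14, 14)
xb = torch.randn(1, 512, 14, 14)
print(ConvFusion()(xa, xb).shape)    # torch.Size([1, 512, 14, 14])
```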

    2.5 Sequence modeling

The sequential keyframes of an action video represent the temporal evolution of the action. We select the LSTM framework to model this action evolution. The LSTM is a kind of Recurrent Neural Network (RNN) with controllable memory units and is effective in many long-term sequential modeling tasks without suffering from the vanishing and exploding gradient problems of traditional RNNs. Typically, the LSTM recursively maps the input representation at the current time step to an output label via a sequence of hidden states, so the learning of the LSTM proceeds in a sequential manner (as shown in Fig. 1).

As a neural network, the LSTM model can easily go deep by stacking the hidden states of layer l−1 as inputs of the next layer l. Considering a framework of K layers, the fused feature vector x_j of the keyframe in the j-th sub-action segment is fed into the first layer of the LSTM framework together with the hidden state of the same layer obtained from the previous time step, producing an updated hidden state, which is then used as the input of the following layer.

Following [25], we adopt a two-layer LSTM framework for temporal modeling (as shown in Fig. 1). The upper layer of the LSTM has 1,024 hidden units and the bottom layer has 512 hidden units. The network weights are learned using a parallel implementation of back propagation through time (BPTT). The mini-batch size is set to 10, the learning rate and momentum are set to 10^-4 and 0.9 respectively, and training is stopped at 150K iterations. The outputs of the LSTM model form the prediction score set {s_1, s_2, …, s_m}, one score vector per sub-action.
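A PyTorch sketch of this two-layer LSTM head with a per-step classifier; the layer sizes follow the text (512 bottom, 1,024 upper), while the input feature dimension and the module/variable names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TwoLayerLSTM(nn.Module):
    """Sketch of the two-layer LSTM head: the fused per-keyframe features enter
    a 512-unit bottom LSTM layer, then a 1,024-unit upper layer, and a linear
    classifier produces a prediction score vector at every time step (sub-action)."""
    def __init__(self, feat_dim: int, num_classes: int = 101):
        super().__init__()
        self.bottom = nn.LSTM(feat_dim, 512, batch_first=True)
        self.upper = nn.LSTM(512, 1024, batch_first=True)
        self.classifier = nn.Linear(1024, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.bottom(x)        # N x m x 512
        h, _ = self.upper(h)         # N x m x 1024
        return self.classifier(h)    # N x m x num_classes, one score set per sub-action

# m = 7 fused keyframe descriptors of (assumed) dimension 512
scores = TwoLayerLSTM(feat_dim=512)(torch.randn(1, 7, 512))
print(scores.shape)                  # torch.Size([1, 7, 101])
```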

In order to combine the LSTM frame-level prediction scores into a single video-level prediction, Ng et al. [23] evaluated four approaches: 1) returning the prediction at the last time step, 2) max-pooling the predictions over time, 3) summing the predictions over time and returning the max, and 4) linearly weighting the predictions over time by 0…1, then summing and returning the max. The accuracy differences among the four methods were less than 1%, and the weighted predictions always resulted in the best performance, supporting the idea that the LSTM's hidden states become progressively more informed as a function of the number of frames. Wu et al. [25] adopted the first method, returning the prediction at the last time step. The video-level prediction method in [24] is similar to the third approach, but it averages the summed scores. In this paper, we follow [23] and select the weighted-prediction method for video-level prediction. The final prediction score is defined as

S = Σ_{j=1}^{m} (j/m) · s_j,

where m is the sub-action number of an action and s_j is the prediction score at the j-th time step.
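A small NumPy sketch of this weighted combination (the function name and the score-array layout are ours):

```python
import numpy as np

def video_level_prediction(step_scores: np.ndarray):
    """Combine per-sub-action LSTM scores (shape m x num_classes) into one
    video-level prediction: weight step j by j/m, sum over time, take the argmax."""
    m = step_scores.shape[0]
    weights = np.arange(1, m + 1) / m                 # 1/m, 2/m, ..., 1
    video_score = (weights[:, None] * step_scores).sum(axis=0)
    return video_score, int(np.argmax(video_score))

scores = np.random.rand(7, 101)                       # m = 7 steps, 101 classes
video_score, label = video_level_prediction(scores)
print(video_score.shape, label)                       # (101,) and the predicted class index
```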

    III. EXPERIMENTS AND RESULTS

In this section, we show the action recognition performance of the proposed method. We firstly introduce the datasets used to evaluate our approach. Then we analyze the effects of the parameters m and α on action recognition accuracy. Thirdly, the keyframe sampling strategies are discussed. We also compare the convolutional fusion with the traditional sum fusion under the LSTM framework. Finally, the proposed method is compared with the state-of-the-art methods.

    3.1 Datasets and parameter settings

We evaluate the proposed approach on two popular action recognition datasets: UCF101 [26] and HMDB51 [27]. They are among the largest datasets for action recognition and remain extremely challenging [35-41].

The UCF101 dataset contains 101 action classes with at least 100 video clips per class, 13,320 video clips in total. Following the evaluation scheme of the THUMOS14 challenge, we adopt the three training/testing splits for evaluation. Results are measured by classification accuracy on each split, and we report the mean accuracy over the three splits.

The HMDB51 dataset is a large collection of realistic videos from movies and web videos. It is composed of 6,766 video clips from 51 action categories, with each category containing at least 100 clips. Our experiments follow the evaluation scheme in [27] and use its three training/testing splits. In each split, every action class has 70 clips for training and 30 clips for testing. We report the average accuracy over the three splits as the final recognition performance. Since UCF101 is larger than HMDB51, we first train the two-stream CNNs on the UCF101 dataset and then transfer the learned model for two-stream feature extraction on the HMDB51 dataset.

    3.2 Parameter analysis

    Fig.3 Influence of sub-action number m on recognition accuracy on UCF101(split1)

    Fig.4 Influence of sampling parameter α on recognition accuracy on UCF101(split1)

We first analyze the influence of the sub-action number m on the recognition accuracy. The experiments are carried out on the first split (split1) of UCF101. The sampling parameter α is fixed at 1, which ensures that the frame is sampled from the middle position of a sub-action segment. The spatial and motion features are fused by Conv fusion at the RELU5_3 layer. The experimental results are presented in Fig. 3. We observe that the recognition accuracy rises rapidly as m varies from 2 to 6 and reaches its peak when m is 7. Therefore, we fix m at 7 in the following experiments.

We then discuss the influence of the sampling step size α on the recognition accuracy. These experiments are also conducted on the first split (split1) of UCF101. From the experimental results (shown in Fig. 4), we observe that the recognition accuracy reaches its peak when α=1, which shows that the keyframe sampled from the middle of the sub-action segment (α=1) is the most representative image for the sub-action. More interestingly, the accuracy curve is nearly symmetric about α=1. Thus, we use α=1 in the following experiments.

    3.3 CNN feature fusion evaluation

In this section, we compare the convolutional fusion with the sum fusion. Summing the spatial and the motion CNN features at the FC6 layer is adopted by the traditional LSTM-based framework in [21, 22]. In this paper we select the Conv fusion at the RELU5_3 layer, following [30]. The comparison results are reported as the average accuracy over all three splits of UCF101 (as shown in Table I). The recognition accuracy of the Conv fusion method is 93.7%, a 1.4% improvement over the sum fusion. Moreover, fusing the features at the RELU5_3 layer requires only about half as many training/testing parameters (149.6M) as fusing at the FC6 layer (252.1M), because the parameters in the FC6 layer account for more than 70% of the total parameters of the CNN framework.
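The ">70%" statement can be checked with a quick parameter count of a single VGG-16 stream (using the standard 138.36M total); the exact 149.6M/252.1M figures in Table I additionally depend on the paper's two-stream and fusion configuration.

```python
# Rough parameter count for one VGG-16 stream, illustrating why fusing at
# RELU5_3 (before FC6) removes the bulk of the parameters per stream.
VGG16_TOTAL = 138_357_544                 # standard VGG-16 parameter count
fc6 = 7 * 7 * 512 * 4096 + 4096           # FC6: 7x7x512 inputs -> 4096 units (+ biases)
print(f"FC6 alone: {fc6 / 1e6:.1f}M / {VGG16_TOTAL / 1e6:.1f}M = {fc6 / VGG16_TOTAL:.0%}")
# -> FC6 alone: 102.8M / 138.4M = 74%
```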

    3.4 Sampling strategy evaluation

To evaluate the effectiveness of the proposed sub-action based frame sampling strategy, we compare it with two traditional sampling strategies: serial sampling and fixed-number sampling. For serial sampling, we follow the method in [24], extracting 16-frame clips with a stride of 8 frames from each video and averaging across clips. For fixed-number sampling, we sample a fixed number of 25 frames for the deep learning framework, following the sampling method in [19, 22]. The results are reported as the average accuracy over all three splits of UCF101 and HMDB51 in Table II. On the UCF101 dataset, the recognition accuracy of our method is 93.7%, which is 1.9% higher than that of the fixed-number sampling approach and 2.4% higher than that of the serial sampling method. On the HMDB51 dataset, the improvement of the proposed method over the second-best method is 3.5%. In general, the proposed temporal scale-invariant deep learning framework clearly improves the action recognition accuracy.

3.5 Comparison with the state-of-the-art

Finally, we compare our method with the state-of-the-art over all three splits of UCF101 and HMDB51. As shown in Table III, our temporal scale-invariant deep learning framework achieves the highest performance on both datasets. On both UCF101 and HMDB51, many works with competitive results are based on two-stream CNN features. Compared with the result of the original Conv fusion method in [30] (92.5%, 65.4%), our framework (93.7%, 69.5%) is better, thanks to the additional LSTM framework that explores the temporal clues among the keyframes of a video. The recent LSTM-based works [20, 21] adopted different LSTM neural networks, so their results are not directly comparable. Our deep learning framework is similar to [25], but our results are 2.4% higher than those of [25].

    IV. CONCLUSIONS

In this paper, we have proposed a novel temporal scale-invariant deep learning framework for action recognition. The main idea is to sample a keyframe from every sub-action sequence in an action video to represent that sub-action, and thus to learn deep sequential features from the action-related keyframes. We also introduced the Conv fusion method to characterize the pixel-level correlation between the spatial and motion CNN features. The proposed method significantly improves the accuracy of action recognition. Experimental results on two public benchmark databases have demonstrated its superiority over the state-of-the-art methods.

    Table I CNN feature fusion evaluation on UCF101

    Table II Sampling strategy evaluation on UCF101 and HMDB51

    Table III Comparison with state-of-the-art results on UCF101 and HMDB51

    ACKNOWLEDGEMENTS

This work was supported in part by the National High Technology Research and Development Program of China (863 Program) (2015AA016306), the National Nature Science Foundation of China (61231015), the Technology Research Program of the Ministry of Public Security (2016JSYJA12), the Shenzhen Basic Research Projects (JCYJ20150422150029090), and the Applied Basic Research Program of Wuhan City (2016010101010025).

    [1] P Afsar, P Cortez, H Santos, “Automatic visual detection of human behavior: a review from 2000 to 2014”,Expert Systems with Applications,vol. 42, no. 20, pp. 6935-6956, 2015.

    [2] D Weinland, R Ronfard, E Boyer, “A survey of vision- based methods for action representation, segmentation and recognition”,Computer Vision and Image Understanding, vol. 115, no. 2,pp. 224-241, 2011.

    [3] Y Shan, Z Zhang, K Huang, “Visual human action recognition: history, status and prospects”,Journal of Computer Research and Development,no. 1, pp. 93-112, 2016.

    [4] C Chen, Z Gan, “Action recognition from a different view”,China Communications, vol. 10, no.12, pp. 139-148, 2013.

    [5] W Song, N.N Liu, G.S Yang, et al, “A novel human action recognition algorithm based on decision level multi-feature fusion”,China Communications, vol. 12, no. 2, pp. 93-102, 2015.

    [6] C Chen, R Jafari, N Kehtarnavaz, “Fusion of depth, skeleton, and inertial data for human action recognition”,Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2712-2716, 2016.

    [7] C Chen, R Jafari, N Kehtarnavaz, “Action recognition from depth sequences using depth motion maps-based local binary patterns”Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 1092-1099,2015.

    [8] C Chen, R Jafari, N Kehtarnavaz, “UTD-MHAD:A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor”,Proceedings of IEEE International Conference on Image Processing, pp. 168-172,2015.

    [9] I Laptev, “On space-time interest points”,International Journal of Computer Vision, vol. 64, no.2-3, pp. 107-123, 2005.

    [10] I Laptev, M Marszalek, C Schmid, et al, “Learning realistic human actions from movies”,Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8, 2008.

[11] P Scovanner, S Ali, M Shah, “A 3-dimensional sift descriptor and its application to action recognition”, Proceedings of the 15th International Conference on Multimedia, pp. 357-360, 2007.

    [12] G Willems, T Tuytelaars, G.L Van, “An efficient dense and scale-invariant spatio-temporal interest point detector”,Proceedings of the 10th European Conference on Computer Vision, pp.650-663, 2008.

[13] A Klaser, M Marszałek, C Schmid, “A spatio-temporal descriptor based on 3d-gradients”, Proceedings of the 19th British Machine Vision Conference, pp. 271-210, 2008.

    [14] L Yeffet, L Wolf, “Local trinary patterns for human action recognition”,Proceedings of IEEE 12th International Conference on Computer Vision,pp. 492-497, 2009.

[15] H Wang, A Kläser, C Schmid, et al, “Dense trajectories and motion boundary descriptors for action recognition”, International Journal of Computer Vision, vol. 103, no. 1, pp. 60-79, 2013.

    [16] H Wang, D Oneata, J Verbeek, et al, “A robust and efficient video representation for action recognition”,International Journal of Computer Vision, pp. 1-20, 2015.

    [17] L Wang, Y Qiao, X Tang, “Action recognition with trajectory-pooled deep-convolutional descriptors”,Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp.4305-4314, 2015.

    [18] O Russakovsky, J Deng, H Su, et al, “ImageNet large scale visual recognition challenge”,International Journal of Computer Vision, vol. 115,no. 3, pp. 211-252, 2015.

    [19] G.W Taylor, R Fergus, Y Lecun, et al, “Convolutional learning of spatio-temporal features”,Proceedings of European conference on computer vision, pp. 140-153, 2010.

    [20] S Ji, W Xu, M Yang, et al, “3D convolutional neural networks for human action recognition”,IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221-231, 2013.

    [21] A Karpathy, G Toderici, S Shetty, et al, “Largescale video classification with convolutional neural networks”,Proceedings of IEEE conference on Computer Vision and Pattern Recognition,pp. 1725-1732, 2014.

    [22] K Simonyan, A Zisserman, “Two-stream convolutional networks for action recognition in videos”,Proceedings of Advances in Neural Information Processing Systems, pp. 568-576, 2014.

    [23] J.Y Ng, H Matthew, S Vijayanarasimhan, et al,“Beyond short snippets: deep networks for video classification”,Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 4694-4702, 2015.

    [24] J Donahue, H.L Anne, S Guadarrama, et al,“Long-term recurrent convolutional networks for visual recognition and description”,Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625-2634, 2015.

    [25] Z Wu, X Wang, Y Jiang, et al, “Modeling spa-tial-temporal clues in a hybrid deep learning framework for video classification”,Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, pp. 461-470, 2015.

    [26] K Soomro, A.R Zamir, M Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild”,arXiv preprint, 2012.

    [27] H Kuehne, H Jhuang, E Garrote, et al, “HMDB: a large video database for human motion recognition”,Proceedings of International Conference on Computer Vision, pp. 2556-2563, 2011.

    [28] M Ravanbakhsh, H Mousavi, M Rastegari, et al,“Action recognition with image based cnn features”,arXiv preprint, 2015.

    [29] K Lin, J Lu, C.S Chen, et al, “Learning compact binary descriptors with unsupervised deep neural networks”,Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.

    [30] C Feichtenhofer, A Pinz, A Zisserman, “Convolutional two-stream network fusion for video action recognition”,Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.

    [31] L Wang, Y Xiong, Z Wang, et al, “Towards good practices for very deep two-stream convnets”,arXiv preprint, 2015.

    [32] B Zhang, L Wang, Z Wang, et al, “Real-time action recognition with enhanced motion vector cnns”,Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[33] K Chatfield, K Simonyan, A Vedaldi, et al, “Return of the devil in the details: delving deep into convolutional nets”, arXiv preprint, 2014.

    [34] T Brox, A Bruhn, N Papenberg, et al, “High accuracy optical flow estimation based on a theory for warping”,Proceedings of European conference on computer vision, pp. 25-36, 2004.

    [35] N Srivastava, E Mansimov, R Salakhudinov, “Unsupervised learning of video representations using lstms”,Proceedings of the 32nd International Conference on Machine Learning, pp.843-852, 2015.

    [36] W Zhu, J Hu, G Sun, et al, “A key volume mining deep framework for action recognition”,Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.

    [37] H Bilen, B Fernando, E Gavves, et al, “Dynamic image networks for action recognition”,Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2016.

    [38] B Fernando, P Anderson, M Hutter, et al, “Discriminative hierarchical rank pooling for activity recognition”,Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2016.

    [39] X Wang, A Farhadi, A Gupta, “Actions - transformations”,Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2016.

    [40] J Jiang, R Hu, Z Wang, Z Han, and J Ma, “Facial image hallucination through coupled-Layer neighbor embedding”,IEEE Transactions on Circuits and Systems for Video Technology, vol. 26,no.9, pp. 1674-1684, 2016.

    [41] Z Wang, R Hu, C Liang, et al. “Zero-shot person re-identification via cross-view consistency”,IEEE Transactions on Multimedia, vol. 18, no. 2,pp. 260-272, 2016.
