
Action Recognition and Detection Based on Deep Learning: A Comprehensive Summary

Computers, Materials & Continua, 2023, Issue 10

Yong Li, Qiming Liang, Bo Gan and Xiaolong Cui

1 College of Information Engineering, Engineering University of PAP, Xi'an, 710086, China

2 PAP of Heilongjiang Province, Heihe Detachment, Heihe, 164300, China

3 National Key Laboratory of Science and Technology on Electromagnetic Energy, Naval University of Engineering, Wuhan, 430033, China

4 Joint Laboratory of Counter Terrorism Command and Information Engineering, Engineering University of PAP, Xi'an, 710086, China

ABSTRACT Action recognition and detection is an important research topic in computer vision that can be divided into action recognition and action detection. At present, the distinction between the two is not always made clearly, and the existing reviews are not comprehensive. This paper therefore summarizes deep learning-based action recognition and detection methods and datasets in order to present the research status of the field accurately. First, according to how temporal and spatial features are extracted, the commonly used action recognition models are divided by architecture into two-stream models, temporal models, spatiotemporal models, and transformer models. The characteristics of these four families are briefly analyzed, and the accuracy of representative algorithms on common datasets is reported. Then, from the perspective of the task to be completed, action detection is further divided into temporal action detection and spatiotemporal action detection, and the commonly used datasets are introduced. Temporal action detection algorithms are reviewed from the perspectives of two-stage and one-stage methods, and spatiotemporal action detection algorithms are summarized in detail. Finally, the relationship between the different parts of action recognition and detection is discussed, the difficulties faced by current research are summarized, and future development directions are outlined.

KEYWORDS Action recognition; action detection; deep learning; convolutional neural networks; dataset

    1 Introduction

Being widely used in fields such as security, video content review, and human-computer interaction, action recognition and detection is one of the important research directions in computer vision [1]. Its two components, action recognition and action detection, are distinct concepts [2]. Action recognition judges the category of human action in a given video clip, while action detection additionally determines the start and end times of actions in the video and, beyond classifying the action, locates the spatial position of the actors in the frame. More specifically, action detection can be divided into temporal action detection and spatiotemporal action detection: temporal action detection only determines the start and end times of actions, while spatiotemporal action detection must further determine the position of the actors in the frame.

Previous reviews of action recognition and detection focus mainly on action recognition alone and fail to summarize the current research status comprehensively and accurately. Some of the literature does not clearly explain the difference and connection between action recognition and action detection. For example, Hassner [3] reviewed the early development of action recognition, focusing on the commonly used datasets. Luo et al. [4] reviewed the algorithms commonly used in action recognition from the perspective of descriptors. Zhao et al. [5] provided insight into traditional recognition methods and deep learning-based recognition methods from two aspects: input content and network depth. Chai et al. [6] focused on the comparison between descriptor-based and deep learning-based action recognition methods before discussing the future development of action recognition. Zhang [2] provided a comprehensive summary of the research status of both action recognition and action detection, but the latest results are not mentioned. Zhu et al. [7] reviewed the research status of action recognition and detection in detail, but the concepts of action recognition and action detection are not clearly distinguished. Sun et al. [8] summarized the current research on action recognition and detection in detail from the perspective of data modality.

In recent years, transformer models, typified by the Vision Transformer (ViT) [9], have made remarkable achievements and reveal a new trend in the field of action recognition and detection. The existing literature rarely reviews action recognition and action detection side by side, and introductions to transformer-based models are scarce. Therefore, starting from a division of the action recognition and detection landscape, this paper summarizes the field in detail from the perspective of model structure, with particular emphasis on the prominent transformer models. In addition, this paper surveys the various algorithms for action recognition and detection, points out the difficulties faced by current research, and explores subsequent trends.

    2 Action Recognition

As shown in Fig. 1, action recognition can be implemented with either traditional frameworks or deep learning-based frameworks, which mainly consist of three steps: preprocessing, action expression, and classification [10]. Preprocessing includes serializing the video into frames and extracting optical flow features. In traditional frameworks, action expression mainly consists of feature extraction and encoding, while deep learning frameworks use various deep neural networks to extract features. For classification, traditional frameworks mainly use algorithms such as Support Vector Machines (SVM) and random forests, while deep learning frameworks mainly use Softmax or SVM classifiers.
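
As a concrete illustration of this preprocessing step, the following sketch (assuming OpenCV is available; the video path, frame stride, and Farnebäck parameters are placeholder choices) serializes a video into frames and computes dense optical flow between consecutive sampled frames.

```python
import cv2
import numpy as np

def serialize_and_flow(video_path, stride=2):
    """Read a video, keep every `stride`-th frame, and compute dense optical flow
    between consecutive kept frames with the Farneback method."""
    cap = cv2.VideoCapture(video_path)
    frames, flows = [], []
    prev_gray = None
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev_gray is not None:
                # Dense optical flow: an H x W x 2 displacement field (dx, dy)
                flow = cv2.calcOpticalFlowFarneback(
                    prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
                flows.append(flow)
            frames.append(frame)
            prev_gray = gray
        idx += 1
    cap.release()
    return np.array(frames), np.array(flows)
```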

    2.1 Action Recognition Datasets

The earliest published dataset for action recognition was Kungliga Tekniska Högskolan (KTH) [11]. In recent years, UCF-101 [12] and HMDB51 [13] have been widely used. The KTH dataset contains 6 types of actions performed by 25 people in 4 different scenarios, with a total of 2,391 video samples; UCF-101 contains 13,320 video samples in 101 categories; and HMDB51 includes 6,849 video samples in 51 action categories. As research has deepened, the scenarios, action categories, and sample sizes covered by UCF-101 and HMDB51 have become insufficient for current studies, and they are gradually being phased out. Table 1 shows a comparison of action recognition algorithms on UCF-101 and HMDB51.

Table 1: Comparison of action recognition algorithms

Figure 1: Action recognition flow chart

In recent years, some research institutions have released datasets with larger sample sizes and richer scenarios, such as the Kinetics [38] series and the Something-Something [39] series. The Kinetics-400 [38] dataset was released in 2017 and includes 306,245 video samples in 400 categories. DeepMind then expanded Kinetics-400 to release the larger Kinetics-600 [41] and Kinetics-700 [44]. The Something-Something v1 dataset was released at ICCV 2017; each video lasts 2–6 seconds, the data are divided into training, test, and validation sets at a ratio of 8:1:1, and the dataset contains more than 100,000 samples. Something-Something v2 was released at CVPR 2020 with a further expanded sample size of more than 200,000, and the format was updated from JPG image sequences to WebM video.

At present, action recognition datasets are being expanded around specific scenarios for fine-grained motion analysis so as to meet various task requirements. Table 2 lists the basic information of commonly used action recognition datasets.

Table 2: Common datasets for action recognition

Table 3: Comparison of algorithms in ActivityNet-1.3

Table 4: Comparison of algorithms in THUMOS'14

    2.2 Descriptor-Based Action Recognition

Before deep learning was widely used, action recognition mainly adopted descriptor-based methods, which can be divided into global feature-based and local feature-based methods. Global feature extraction initially used the histogram of oriented gradients before developing into two approaches: the contour silhouette method and the human joint point method. For example, Bobick et al. [48] generated Motion History Images (MHI) by constructing two-dimensional motion energy maps to achieve action classification, and Yang et al. [49] constructed joint point coordinates and combined static posture, motion attributes, and overall dynamics for action recognition. Local feature extraction mainly includes two methods: spatiotemporal interest point sampling and dense trajectory tracking. For example, Willems et al. [50] proposed an action recognition method based on 3D Harris corner detection, and Wang et al. [16,51] proposed Dense Trajectories (DT) and Improved Dense Trajectories (IDT), which are based on dense trajectory tracking: feature points obtained by multi-scale dense sampling are tracked closely in the temporal dimension to form trajectories, from which the action category is eventually judged.

    2.3 Deep Learning-Based Action Recognition

When judging the category of an action, humans usually need both the static information of the actor and the dynamic information of how the actor's movement changes. Accordingly, action recognition relies on the static spatial information and the dynamic temporal information of the action in the video. According to the network structures used to obtain these two types of information, deep learning-based action recognition models can generally be divided into four categories: two-stream models, temporal models, spatiotemporal models, and the latest transformer models.

    2.3.1 Two Stream Models

To obtain spatial and temporal features, the two-stream model uses two parallel pathways to extract spatial and temporal features, respectively, fuses the spatiotemporal feature information through an appropriate fusion method, and finally classifies the action. This idea was first proposed in 2014 by Simonyan et al. [17], who fed preprocessed video frames and optical flow maps into two parallel pathways and used AlexNet [52] for feature extraction on both: static spatial features are obtained from the video frames and dynamic temporal features from the optical flow maps. The feature information is fused at the end of the two pathways to achieve action classification.
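
A minimal sketch of the two-stream idea in PyTorch is given below. The ResNet-18 backbones, the 10-frame flow stack, and the score-averaging late fusion are illustrative assumptions; the original network [17] used AlexNet and different fusion details.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TwoStreamNet(nn.Module):
    """Late-fusion two-stream network: one CNN on an RGB frame, one CNN on a
    stack of optical-flow fields, with class scores averaged at the end."""
    def __init__(self, num_classes=101, flow_frames=10):
        super().__init__()
        self.spatial = models.resnet18(weights=None)
        self.spatial.fc = nn.Linear(self.spatial.fc.in_features, num_classes)
        self.temporal = models.resnet18(weights=None)
        # The flow stack has 2 * L channels (x and y displacement for L frames).
        self.temporal.conv1 = nn.Conv2d(flow_frames * 2, 64, kernel_size=7,
                                        stride=2, padding=3, bias=False)
        self.temporal.fc = nn.Linear(self.temporal.fc.in_features, num_classes)

    def forward(self, rgb, flow):
        # rgb:  (B, 3, H, W) single frame; flow: (B, 2*L, H, W) stacked flow
        return 0.5 * (self.spatial(rgb) + self.temporal(flow))

logits = TwoStreamNet()(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 224, 224))
```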

Feichtenhofer et al. [21] improved the feature fusion method on this basis. Feature fusion is performed earlier, in the convolutional layers of one pathway, and additional fusion is performed at the prediction layer, replacing the previous end-of-network fusion. This fusion method not only reduces the number of model parameters but also improves recognition accuracy. Feichtenhofer et al. [53] then introduced He et al.'s residual network (ResNet) [54] into the two-stream model, adding residual connections between the two streams to enhance the interaction and fusion of spatiotemporal features.

Also focusing on the fusion of two-stream features, Ng et al. [55] introduced the Long Short-Term Memory network (LSTM) [56] into the two-stream architecture. An LSTM is used to fuse the outputs of the two-stream CNNs; its memory cells effectively express the order of consecutive frames, strengthening the extraction of temporal information and enabling the recognition of actions in long videos.

To extract features from long videos, Wang et al. [24] used a sparse temporal sampling strategy that samples multiple clips from the entire video at the input, produces a preliminary action-category prediction for each segment, and then combines the results of the segments through a "consensus" function to classify the action, thereby enabling the recognition of long-range actions.
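
The sparse-sampling-and-consensus idea can be sketched as follows; the ResNet-18 backbone, three segments, and average consensus are simplifying assumptions rather than the exact published configuration of [24].

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SegmentConsensus(nn.Module):
    """Sample K short snippets across the whole video, score each snippet
    independently with a shared CNN, then average ('consensus') the scores."""
    def __init__(self, num_classes=101, num_segments=3):
        super().__init__()
        self.num_segments = num_segments
        self.backbone = models.resnet18(weights=None)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_classes)

    def forward(self, snippets):
        # snippets: (B, K, 3, H, W), one frame per sampled segment
        b, k = snippets.shape[:2]
        per_snippet = self.backbone(snippets.flatten(0, 1))   # (B*K, num_classes)
        return per_snippet.view(b, k, -1).mean(dim=1)          # consensus over K

scores = SegmentConsensus()(torch.randn(2, 3, 3, 224, 224))
```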

The two-stream model preprocesses video frames and optical flow maps at the input, which requires considerable time and computing power before training, so the model is far from end-to-end recognition. For this reason, Zhu et al. [21] built a network called MotionNet on top of the two-stream model; it models the temporal features of video frames directly, replacing the role of the optical flow maps.

Feichtenhofer et al. [57] made substantial improvements to the two-stream model and built a lightweight two-stream recognition network named SlowFast. The slow pathway operates at a low frame rate to capture spatial semantics, while the fast pathway operates at a high frame rate to capture temporal information at fine temporal resolution. Lateral connections fuse features from the fast pathway into the slow pathway to achieve action classification.
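
A toy illustration of the SlowFast idea is given below. The channel widths, the temporal stride alpha, and the single lateral connection are illustrative assumptions; the published model uses full 3D ResNets and several fusion points.

```python
import torch
import torch.nn as nn

class TinySlowFast(nn.Module):
    """Two pathways over the same clip: a slow path sees few frames with many
    channels (spatial semantics), a fast path sees all frames with few channels
    (motion), and a lateral connection fuses fast features into the slow path."""
    def __init__(self, num_classes=101, alpha=4):
        super().__init__()
        self.alpha = alpha                        # temporal stride of the slow path
        self.slow = nn.Conv3d(3, 64, kernel_size=(1, 7, 7), padding=(0, 3, 3))
        self.fast = nn.Conv3d(3, 8, kernel_size=(5, 7, 7), padding=(2, 3, 3))
        # Lateral connection: time-strided conv maps fast features onto the slow time axis.
        self.lateral = nn.Conv3d(8, 16, kernel_size=(5, 1, 1),
                                 stride=(alpha, 1, 1), padding=(2, 0, 0))
        self.head = nn.Linear(64 + 16 + 8, num_classes)

    def forward(self, clip):
        # clip: (B, 3, T, H, W)
        slow = self.slow(clip[:, :, ::self.alpha])             # temporally subsampled frames
        fast = self.fast(clip)                                  # all frames
        slow = torch.cat([slow, self.lateral(fast)], dim=1)     # lateral fusion
        pool = lambda x: x.mean(dim=(2, 3, 4))                  # global average pooling
        return self.head(torch.cat([pool(slow), pool(fast)], dim=1))

logits = TinySlowFast()(torch.randn(2, 3, 16, 56, 56))
```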

Compared with temporal models, spatiotemporal models, and the latest transformer models, two-stream models are more complex and their training is cumbersome, so it is difficult to achieve truly end-to-end recognition. The idea of the two-stream model, however, has provided important inspiration for algorithm innovation in action detection and has promoted its development. The two-stream model is an architectural compromise made at the early stage of action recognition research, and some algorithms even need to train the two pathways step by step, which is very time-consuming. The design of the two-stream architecture makes it clear how difficult it is to extract action features.

    2.3.2 Temporal Models

To obtain spatial and temporal features, the temporal model adopts a cascading approach: spatial semantic information is first extracted by a convolutional neural network (CNN), and temporal feature information is then extracted by a recurrent neural network (RNN). Simple RNNs suffer from exploding or vanishing gradients when processing long-term feature information, so practical temporal models use LSTMs with forget gates.

In the study of Donahue et al. [15], an AlexNet and an LSTM are cascaded to model spatial and temporal features, respectively, before classification by a fully connected layer at the end, forming Long-term Recurrent Convolutional Networks (LRCN) for action recognition. To better represent spatial relationships and eliminate redundant information, Sudhakaran et al. [58] introduced ConvLSTM [59] to replace the traditional LSTM for violent-scene recognition, which fuses spatiotemporal information and further improves recognition accuracy.
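
A minimal CNN-LSTM cascade in the spirit of LRCN is sketched below; the ResNet-18 feature extractor and hidden size are assumptions, since [15] originally used AlexNet features.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CnnLstm(nn.Module):
    """Temporal model: a 2D CNN extracts per-frame spatial features, an LSTM models
    their temporal evolution, and the last hidden state is classified."""
    def __init__(self, num_classes=101, hidden=256):
        super().__init__()
        backbone = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])   # drop the fc layer
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, clip):
        # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).flatten(1)   # (B*T, 512) frame features
        out, _ = self.lstm(feats.view(b, t, -1))           # (B, T, hidden)
        return self.fc(out[:, -1])                         # classify the last time step

logits = CnnLstm()(torch.randn(2, 8, 3, 224, 224))
```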

Li et al. [19] introduced an attention mechanism into the LSTM network and constructed a new action recognition model, VideoLSTM, by fusing ConvLSTM with Attention LSTM. VideoLSTM applies motion features and attention to spatiotemporal positions, focusing on preserving spatial feature information across video frames. Wang et al. [60] combined I3D with LSTM, modeling the high-level temporal features produced by the I3D model with an LSTM.

When the CNN becomes too deep, the constructed spatial feature maps become highly abstract, temporal feature information is lost, and the LSTM's ability to process temporal information is limited. This accounts for the declining popularity of temporal-model-based action recognition research. In addition, since RNNs and their variants cannot be parallelized across GPUs, temporal models cannot be trained in parallel on multiple devices. Action recognition algorithms place high demands on hardware during training, so the inability to train models with multi-device parallelism causes great trouble for researchers, which is another important factor in the development bottleneck of current temporal models.

    2.3.3 Spatiotemporal Models

Spatiotemporal models use an integrated structure that obtains spatial and temporal feature information at the same time. A spatiotemporal model usually uses 3D convolution, applying the 3D convolution operation over data that includes the time dimension. In recent years, some scholars have proposed specially designed data processing methods that fuse spatiotemporal feature information in advance and then model it with 2D convolution, which can also realize action recognition.

The use of 3D convolution for action recognition was first proposed by Ji et al. [61]. Du et al. [14] further extended 3D convolution to the pooling operation to form 3D pooling and established the C3D (Convolutional 3D) model. Diba et al. [20] adopted transfer learning during model construction and proposed a new temporal transition layer (TTL), embedding TTL into a DenseNet [62] extended to a 3D structure, thereby constructing the Temporal 3D ConvNets (T3D).
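
A small 3D-convolutional block in the C3D style is sketched below; the depth and channel widths are illustrative and do not reproduce the published C3D configuration.

```python
import torch
import torch.nn as nn

class TinyC3D(nn.Module):
    """3D convolution and 3D pooling operate jointly over time and space,
    so spatiotemporal features are extracted in a single pathway."""
    def __init__(self, num_classes=101):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),            # keep early temporal resolution
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),                     # 3D pooling over time and space
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, clip):
        # clip: (B, 3, T, H, W)
        return self.classifier(self.features(clip).flatten(1))

logits = TinyC3D()(torch.randn(2, 3, 16, 112, 112))
```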

3D convolution is computationally intensive. To alleviate this problem, Qiu et al. [18] formed a new convolutional block, Pseudo-3D (P3D), by factorizing the convolution on top of ResNet, and built the P3D ResNet action recognition model on this structure. Experiments show that P3D ResNet significantly improves action recognition performance.

Transfer learning is widely used in deep learning and makes it easier to train new models. For example, in object recognition, transfer learning from a model pretrained on ImageNet can accelerate the convergence of a new model. A similar approach can be used in action recognition to reduce the training workload. To make pretrained models usable in a 3D convolutional network, Carreira et al. [25] inflated the two-dimensional convolution and pooling kernels of Inception-v1 [63] to 3D, implicitly pretrained the three-dimensional model on ImageNet, and then obtained a pretrained 3D convolutional model on Kinetics. After pretraining, the Inflated 3D (I3D) model obtained by transfer learning greatly improves action recognition accuracy and also greatly reduces the difficulty of model training.

To make 3D convolution models more lightweight, scholars have proposed many innovative methods, but it remains difficult for 3D convolution to outperform 2D convolution in efficiency. In 2019, Lin et al. [64] creatively shifted and spliced feature maps along the temporal dimension and proposed the Temporal Shift Module (TSM), which processes temporal features before feature extraction. TSM fuses temporal information into spatial features invisibly, so that 2D convolution alone can achieve the effect of 3D convolution, trading a small amount of storage for reduced computational overhead. Shao et al. [65] then proposed a new deformable shift module, the temporal interlacing network (TIN), based on TSM, which further strengthens the fusion of spatiotemporal information. Fan et al. [66] proposed RubiksNet, a learnable 3D shift network that shifts simultaneously in the spatial and temporal dimensions and dynamically learns the proportion of shifted channels, obtaining a wider range of spatiotemporal information as well as higher accuracy.
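
The temporal shift operation at the heart of TSM can be sketched as follows; the commonly cited split of shifting one eighth of the channels forward and one eighth backward in time is assumed here, and the residual insertion into a 2D backbone is omitted.

```python
import torch

def temporal_shift(x, shift_div=8):
    """Shift a fraction of channels along the time axis so that a following
    2D convolution mixes information from neighbouring frames at zero extra FLOPs.
    x: (B, T, C, H, W)"""
    b, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                     # first fold: shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]     # second fold: shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]                # remaining channels untouched
    return out

shifted = temporal_shift(torch.randn(2, 8, 64, 56, 56))
```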

In addition, Li et al. [67] extracted adjacent-frame information and multi-frame global information by building a temporal excitation and aggregation (TEA) block. Both short-term motion and long-term feature aggregation are considered, which effectively reduces network complexity and avoids the drawbacks of 3D CNNs.

Before 2019, 3D convolution was the mainstream choice for spatiotemporal models. Since then, through ingenious data preprocessing, 2D convolution has also been able to achieve accurate action classification. By sacrificing some storage, the computational overhead is greatly reduced, and spatiotemporal-model-based action recognition has become a prominent direction of current research.

    2.3.4 Transformer Models

The transformer is an attention-based encoder-decoder model that originated in natural language processing (NLP) and began to achieve high accuracy in computer vision applications after the release of the ViT model in 2021. It is a commonly used architecture in NLP and has advantages in extracting "contextual" correlation information. It is now thriving in computer vision, including action recognition, and is becoming an important cross-modal architecture.

The transformer was first used for action recognition after the release of the Video Vision Transformer (ViViT) [68] in 2021. Similar to ViViT, Ullah et al. [69] completely abandoned CNNs, building on ViT and adopting an attention structure to achieve action recognition. To alleviate the redundancy of temporal information, Patrick et al. [70] introduced trajectory information into the transformer and obtained high accuracy on multiple datasets.
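
A compact sketch of the attention-only recipe these models share is shown below. The tubelet size, embedding width, and joint space-time attention are assumptions; ViViT [68] explores several factorizations of this design.

```python
import torch
import torch.nn as nn

class TinyVideoTransformer(nn.Module):
    """Split a clip into spatiotemporal patches ('tubelets'), embed them as tokens,
    and classify with a standard Transformer encoder plus a class token."""
    def __init__(self, num_classes=101, dim=192, depth=4, heads=3,
                 frames=8, size=64, tubelet=(2, 16, 16)):
        super().__init__()
        self.embed = nn.Conv3d(3, dim, kernel_size=tubelet, stride=tubelet)
        num_tokens = (frames // tubelet[0]) * (size // tubelet[1]) * (size // tubelet[2])
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, num_tokens + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, clip):
        # clip: (B, 3, T, H, W)
        tokens = self.embed(clip).flatten(2).transpose(1, 2)       # (B, N, dim)
        tokens = torch.cat([self.cls.expand(len(clip), -1, -1), tokens], dim=1) + self.pos
        return self.head(self.encoder(tokens)[:, 0])               # classify the class token

logits = TinyVideoTransformer()(torch.randn(2, 3, 8, 64, 64))
```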

Truong et al. proposed DirecFormer [71], an end-to-end transformer structure that introduces ordinal temporal learning into the transformer and helps the model understand the chronological order of actions. To strengthen the ability to model different spatiotemporal views, Google proposed Multiview Transformers for Video Recognition (MTV) [72], which consists of multiple independent encoders representing different dimensional views of the input video; MTV fuses information between the views through lateral connections. The Self-Supervised Video Transformer (SVT) [73] is a new self-supervised approach that trains a teacher-student model using similarity objectives, matching representations along the spatiotemporal dimensions via spatiotemporal attention.

Although the transformer achieves very high accuracy in computer vision, including action recognition, it incurs huge computational overhead, which is a burden for research institutions or researchers with ordinary resources. The Recurrent Vision Transformer (RViT) [74] therefore introduces a recurrence mechanism and integrates an attention gate to establish a connection between the current frame and the previous hidden state, extracting global spatiotemporal features across frames and alleviating the problem of insufficient computing power to a certain extent.

    3 Action Detection

    3.1 Action Detection Datasets

The datasets commonly used for temporal action detection are mainly THUMOS'14 [75], MEXaction2 [76], and ActivityNet [77]. The THUMOS'14 dataset includes an action recognition part and a temporal action detection part: the action recognition part covers all the categories of UCF-101, while the temporal action detection part includes 20 categories divided into a training set, a validation set, a background-clip set, and a test set. The MEXaction2 dataset includes two categories, horseback riding and bullfighting; its background clips are relatively long and the proportion of labeled action clips is low, which makes temporal action detection more challenging. ActivityNet is currently the largest database and also contains two tasks, action classification and temporal action detection; it has a very large sample size of more than 20,000 videos covering 200 action categories and can only be downloaded by writing a script against the official YouTube links. The above datasets label temporal action information only coarsely, which easily leads to unclear action boundaries during experiments, so the carefully annotated FineAction [78] dataset was released at ECCV 2022. FineAction contains nearly 17,000 untrimmed videos and 103,000 fine-grained temporal annotations across 106 action categories, with clearer category definitions and more accurate temporal annotations.

J-HMDB-21 [79], UCF101-24 [80], and Atomic Visual Actions (AVA) [81] are commonly used spatiotemporal action detection datasets. J-HMDB-21 is a subset of the HMDB dataset, containing 21 categories and 960 video samples. UCF101-24 is a subset of UCF-101, containing 24 action categories and 3,207 video samples. Compared to either of these, the labels of the AVA dataset are much sparser: AVA consists of 300 movies, each sampled for 15 minutes and labeled second by second. The MultiSports [82] dataset, newly released at ECCV 2022, further increases the sample size and includes more complex scenes; it is a large-scale spatiotemporal action detection dataset mainly covering basketball, football, gymnastics, and volleyball.

    3.2 Temporal Action Detection

Temporal action detection differs from action recognition in that it not only needs to classify the action itself but also needs to locate its temporal position in the video; specifically, it must accurately locate the start and end times of actions in a long video containing background clips and determine the category of each action. Temporal action detection usually requires video data with a long time span during model training; such data are huge and consume a lot of time and computing resources during preprocessing and training, so temporal action detection is very difficult for research institutions with limited resources or small teams.

Tables 3 and 4 show the accuracy of commonly used temporal action detection algorithms on the ActivityNet-1.3 and THUMOS'14 datasets, respectively, where mAP@k denotes the mean average precision of an algorithm at an intersection-over-union (IoU) threshold of k.
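
For clarity, the following small example shows the temporal intersection-over-union used as the threshold k in mAP@k; the precision/recall machinery built on top of it is omitted.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) in seconds.
    A detection counts as correct at mAP@k only if IoU >= k and the predicted
    class matches the ground truth."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction of [12 s, 20 s] against a ground-truth action of [10 s, 18 s]:
print(temporal_iou((12.0, 20.0), (10.0, 18.0)))   # 6 / 10 = 0.6 -> correct at mAP@0.5
```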

    3.2.1 Action Detection Based on Descriptors

Traditional temporal action detection methods use descriptors to generate target clips and thereby detect temporal actions. For example, Richard et al. [102] identified action types by merging two models: a length model that incorporates action duration information and a language model that incorporates contextual information. Yuan et al. [103] extracted a pyramid of score distribution features (PSDF) based on IDT features, processed the PSDF sequence with an LSTM network, and obtained action-clip predictions from the output frame-level action-category confidence scores. By training on videos, Hou et al. [104] automatically determined the number and types of sub-actions in each action; to localize an action, an objective function combining the appearance, duration, and temporal structure of the sub-actions is optimized as a shortest-path problem in a network-flow formulation, and the best combination is selected by considering both the sub-action scores and the distances between sub-actions.

    3.2.2 Deep Learning-Based Action Detection

The other main approach to temporal action detection uses deep neural networks. Object detection algorithms can be divided into one-stage and two-stage algorithms according to whether target candidate regions need to be extracted in a separate step. Similarly, temporal action detection algorithms can be divided into one-stage and two-stage methods according to whether temporal candidate regions, in which an action may occur, must be extracted independently.

    Two-Stage Method

Inspired by the classic object detection algorithm R-CNN, Shou et al. [93] proposed a sliding-window-based temporal action detection method, Segment-CNN (S-CNN), in 2016. S-CNN cuts the original video into clips of different lengths and feeds them, via pooling operations, into a C3D network to perform action detection. The flexibility of S-CNN in judging the start and end times of actions is limited by the sliding-window mechanism. The following year, Shou et al. [94] drew on the idea of the Fully Convolutional Network (FCN) [105] and introduced Convolutional-De-Convolutional (CDC) filters on top of the C3D architecture; frame-level fine-grained temporal action detection is then achieved through the joint effect of upsampling in the temporal dimension and downsampling in the spatial dimension.
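
The multi-scale sliding-window proposal generation that S-CNN-style methods build on can be sketched as follows; the window lengths and overlap ratio are illustrative choices.

```python
def sliding_window_proposals(num_frames, window_lengths=(16, 32, 64, 128), overlap=0.75):
    """Generate candidate temporal segments (start_frame, end_frame) at several
    scales; each proposal is later scored by a clip-level classifier such as C3D."""
    proposals = []
    for length in window_lengths:
        stride = max(1, int(length * (1.0 - overlap)))
        for start in range(0, max(1, num_frames - length + 1), stride):
            proposals.append((start, start + length))
    return proposals

# e.g., a 300-frame video yields windows of 16/32/64/128 frames with 75% overlap
print(len(sliding_window_proposals(300)))
```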

Also inspired by object detection, Xu et al. [83] built the Region Convolutional 3D Network (R-C3D) based on Faster R-CNN [106]; it encodes the video stream with a 3D fully convolutional network, generates temporal candidate regions that may contain actions, and then classifies and refines the candidates. Unlike S-CNN, R-C3D can perform end-to-end detection on videos of arbitrary length.

Under a similar influence, Chao et al. [85] used a Faster R-CNN-based multiscale framework to improve the alignment of receptive fields, making the method resilient to extreme variations in action duration. A two-stream network is then constructed to fuse red-green-blue (RGB) and optical-flow features, and action classification is carried out with a late-fusion mechanism.

In real scenarios, the video to be detected may contain more than one action clip, so actions can be judged comprehensively by combining the action categories of multiple proposals. Zeng et al. [88] therefore used Graph Convolutional Networks (GCN) [107] to explore the connections among proposals and constructed a GCN-based temporal action detection framework, Proposal-GCN (P-GCN).

Liu et al. [99] proposed using both coarse-grained and fine-grained features to build an end-to-end multi-granularity generator (MGG) for finding action clips, composed of a segment proposal producer (SPP) and a frame actions producer (FAP). Gao et al. [108] proposed a relation-aware pyramid network (RapNet) that enhances the representation of global feature information and locates action clips of different lengths. Lin et al. [109] established a novel Dense Boundary Generator (DBG), which extracts spatiotemporal features in the manner of a two-stream model and establishes an action-aware completeness regression branch and a temporal boundary classification branch to achieve fast action detection.

The two-stage method can achieve high detection accuracy by first obtaining temporal proposals and then classifying the action. However, apart from its low speed, the two-stage network model is complex and requires substantial computing resources.

    One-Stage Method

The conventional one-stage approach to temporal action detection uses convolutional layers to generate proposals for recognition and boundary regression, so the resulting action proposals are assigned the same receptive fields even though the temporal lengths of different actions vary. To solve this problem, Long et al. [110] proposed Gaussian kernel learning, which expresses temporal information by learning Gaussian kernels. Piergiovanni et al. [111] proposed a new convolutional layer, the Temporal Gaussian Mixture (TGM) layer, which also adopts a Gaussian model; it captures long-range dependencies in video effectively by using Gaussian kernels to aggregate features from time points near the current temporal window. Inspired by object detection algorithms, Lin et al. combined I3D with an anchor-free object detection algorithm, proposed a boundary-consistency learning loss, and constructed the Anchor-Free Saliency-Based Detector (AFSD), an anchor-free action detection method based on learning salient boundary features.

In recent years, with the rapid development of the transformer, some scholars have also begun to solve temporal action detection with transformers. Liu et al. [100] embedded the video features and positions extracted by CNNs as input and decoded a set of action predictions in parallel with a transformer; by adaptively attending to certain segments of a video, it extracts the contextual information required for action prediction, which greatly simplifies temporal action detection and increases detection speed. Shi et al. [112] also proposed a DETR-like transformer-based temporal action detection method with IoU-decay-aware attention, an action classification enhancement loss, and segment quality prediction, analyzed from three aspects: the attention mechanism, the training loss, and network inference. Zhang et al. [113] used the transformer as a basic module to design a minimalist temporal action detection scheme, in which a feature pyramid and local self-attention model long-range temporal features, and classification and regression are realized without generating proposals or predefined bounding boxes. In addition, Liu et al. [114] combined the transformer to propose an end-to-end temporal action detection scheme that obtains higher accuracy and a faster detection rate by constructing a medium-resolution baseline detector.

    3.3 Spatiotemporal Action Detection

Spatiotemporal action detection determines the temporal and spatial position of actors in a video containing background clips and classifies their actions. In other words, on top of temporal action detection, spatiotemporal action detection must also mark the position of the actor in the frame. It can often be divided into multiple stages, including action recognition, target tracking, and object detection, so it usually requires constructing several network-level models, which makes training difficult. Table 5 shows the accuracy of spatiotemporal action detection algorithms on the J-HMDB-21 and UCF101-24 datasets.

Table 5: Comparison of spatiotemporal action detection algorithms

Puscas et al. [115] employed a selective search method to produce an initial segmentation of still-image-based video frames; this initial proposal set is pruned and temporally extended using optical flow and transductive learning.

Inspired by object detection algorithms, Kalogeiton et al. [117] built an Action Tubelet detector (ACT) based on the SSD framework. ACT focuses on temporal features between successive frames, reduces the ambiguity of action prediction, and improves the accuracy of spatiotemporal localization. Gu et al. [118] used I3D for contextual temporal modeling and Faster R-CNN for end-to-end localization and action classification, which is likewise derived from object detection.

Feichtenhofer et al. [57] used SlowFast for action recognition, DeepSORT for object tracking, and YOLO for object detection, realizing action detection through the combination of the three algorithms. Inspired by the human visual nervous system, Köpüklü et al. [119] proposed You Only Watch Once (YOWO), a unified architecture for spatiotemporal action detection. The network structure of YOWO resembles the two-stream model: a 3D-CNN branch and a 2D-CNN branch run in parallel to extract spatiotemporal feature information, and feature fusion as well as candidate-region definition is carried out at the end. YOWO uses 3D-ResNet-101 [120] to extract spatiotemporal features and solves the classification problem with the 3D-CNN branch, while DarkNet-19 [121] extracts the two-dimensional features of keyframes to solve the spatial localization problem.

Based on YOWO, Mo et al. [122] proposed using LinkNet to introduce a connection between the 2D and 3D convolutional structures. They also use custom bounding boxes similar to YOLOv2 to achieve precise localization of the actors, updating the YOWO network to its second version, which effectively reduces model complexity and further improves accuracy.

The Holistic Interaction Transformer (HIT) [116] network is a comprehensive dual-modal framework based on the transformer that includes an RGB stream and a pose stream, each modeling human, object, and hand interactions. Within each sub-network, an Intra-Modal Aggregation module (IMA) selectively merges individual interaction units. An Attentive Fusion Mechanism (AFM) then glues together the features produced by each modality. Finally, HIT extracts cues from the temporal context using cached memory to better classify the possible actions.

    4 Discussion

Action recognition and action detection have been widely used in practical scenarios. Action recognition can be applied to human-computer interaction, video content review, and other fields, while action detection can be applied to intelligent security, video content localization, video search, and other fields. Action recognition is the precursor of action detection; only when the relevant action recognition algorithms mature can action detection develop well. Fig. 2 shows the important algorithms for action recognition and detection: above the arrow are the action recognition algorithms, and below are the action detection algorithms. In action recognition, the mainstream models include the two-stream model, the temporal model, the spatiotemporal model, and the transformer model.

Structurally, the two-stream model uses parallel CNNs to extract spatial and temporal feature information separately, and training two CNN pathways separately makes model training difficult. The temporal model uses a cascading approach to extract spatial and temporal information in turn; its structure is simple, training is less difficult, and end-to-end recognition is achievable. The traditional spatiotemporal model uses 3D convolutional networks, raising the model from two dimensions to three so that spatiotemporal features can be extracted at the same time, but model complexity increases. In recent years, spatiotemporal models have used specially designed data preprocessing to reduce the dimensionality of the convolution, which greatly simplifies the model structure and reduces training difficulty. The recently popular transformer model comes from NLP and mainly adopts an attention mechanism that differs from the other three model families; its complexity is high and training is difficult.

In terms of development trends, the transformer is currently the most popular model; it achieves high accuracy but is limited by model complexity, and there is still a large gap before practical deployment. After 2019, spatiotemporal models developed in a new direction, using data preprocessing to fuse spatiotemporal features, which has provided new ideas for many researchers. The temporal model relies on the LSTM network and its variants to obtain temporal information, but it has reached a bottleneck, with few notable innovations recently. As one of the earliest deep learning models in action recognition, the two-stream model has structural drawbacks, but it is still an important research direction. Table 6 compares the characteristics of the four models.

Table 6: Comparison of action recognition models

Action detection emerged later than action recognition, but it has developed faster under the influence of action recognition algorithms. Temporal action detection can be divided into one-stage and two-stage methods according to whether candidate regions are obtained in a separate step; the latter achieve higher accuracy with more complex models, while the former have a more concise structure but lower accuracy. Action detection plays the same role in the video domain that object detection plays in the image domain, and many action detection algorithms have been influenced by object detection.

In terms of the tasks to be completed, action recognition is the premise of action detection and its most important step. From the algorithmic perspective, research on action recognition is the basis of research on action detection, and action detection must solve the series of problems faced by action recognition. Action detection not only needs to handle the action recognition step but also needs to solve the problem of locating the actor in time and space. In general, many action detection algorithms are still in their early stages, and there remains large room for improvement in efficiency and accuracy.

    4.1 Difficulties and Challenges

    4.1.1 Difficulty in Data Collection

At present, deep learning-based action recognition and action detection mainly rely on supervised learning, which depends strongly on data [123], so the requirements on sample size and on the scenarios covered by datasets keep increasing. Action recognition and detection datasets are usually video data, which are larger than images and more cumbersome to collect, trim, and label.

Action detection datasets need to label not only the categories of actions but also their spatial and temporal locations, which is an arduous task. Temporal action detection requires the start and end times of each action to be annotated at the frame level, and spatiotemporal action detection additionally requires accurate labeling of the spatial location of the actor. This greatly increases the workload of dataset annotation, so some datasets have to adopt compromises such as sparse annotation to ease the pressure.

    4.1.2 High Hardware Requirements

At present, action recognition and action detection face massive computing costs. As deep learning models become more complex, and especially since the transformer entered computer vision, training models has become more difficult and the requirements on GPU computing power have increased. The required computing power cannot be provided by ordinary hardware, which prevents large-scale application of current action recognition and detection algorithms [124].

With the development of action recognition and detection, the scale of the corresponding datasets has kept expanding so that features can be obtained from videos effectively. Some open-source datasets now reach hundreds of gigabytes or even several terabytes, which is a huge burden on the storage and read-write capacity of computers. Moreover, most datasets need to undergo preprocessing such as frame serialization or optical flow extraction before training, bringing a heavy computational and I/O burden.

    4.1.3 Difficulty Judging Action Features

    Action recognition faces the following difficulties:

The first is the complexity of fine-grained recognition. Just as there are a thousand Hamlets in the eyes of a thousand viewers, human action is complex and diverse and may have different meanings from different perspectives, so it is difficult to divide action categories strictly. For example, for a striking action, its speed directly determines whether the action itself is violent.

The second is the complexity of spatial information. Lighting variations, occlusion, and noise caused by the video background can adversely affect feature extraction, and different viewing angles of the actor cause scale-transformation problems, which also complicate the judgment of action characteristics.

The third is the complexity of temporal information. Modeling the temporal dimension is the core problem of action recognition and also a key difficulty; judging from the current state of development, the extraction of temporal information remains very difficult [125].

    Action detection also faces several difficulties:

The first is that action detection is limited by the performance of action recognition. Action recognition is the basis of action detection, but many problems remain to be solved in current action recognition tasks, which creates fundamental difficulties for subsequent action detection.

The second is the ambiguity of localizing actions in time and space. In the temporal dimension, the definitions of the start and end points of actions are vague, and action lengths vary. In the spatial dimension, motion must be considered when localizing the actor, and multiple frames must be combined to avoid jitter.

    4.2 Future Research Trends

    4.2.1 Enrich the Dataset

Action detection and recognition mainly adopt supervised learning, so sufficient data support must be ensured. Because of the complexity of human action and the fine-grained requirements of practical applications, existing datasets need to be further expanded. There are already many datasets for action recognition, but for specific task scenarios, the data covering the relevant types of actions still need to be enlarged. Research on action detection started later, so its datasets are few and the scenarios they cover remain incomplete; enriching the relevant datasets is therefore necessary to overcome data scarcity. For specific task scenarios, existing datasets can also be extended with data augmentation such as rotation, Mixup, and adding noise, which can alleviate the problem of insufficient data [126].
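
A hedged example of such clip-level augmentation is shown below; the rotation angle, Mixup coefficient, and noise level are arbitrary choices, and in a real training loop the labels would be mixed with the same Mixup coefficient.

```python
import torch
import torchvision.transforms.functional as TF

def augment_clip(clip, other_clip, angle=10.0, mixup_lambda=0.7, noise_std=0.02):
    """Simple clip-level augmentation: rotate every frame, mix with a second clip
    (Mixup), and add Gaussian noise. clip, other_clip: (T, 3, H, W) float tensors."""
    rotated = torch.stack([TF.rotate(frame, angle) for frame in clip])
    mixed = mixup_lambda * rotated + (1.0 - mixup_lambda) * other_clip
    return mixed + noise_std * torch.randn_like(mixed)

aug = augment_clip(torch.rand(8, 3, 112, 112), torch.rand(8, 3, 112, 112))
```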

    4.2.2 Few-Shot Learning

Given the above problems, in specific action recognition and detection scenarios, such as recognizing and detecting violent acts in security settings or illegal operations in industrial production, few-shot learning can be used to relieve the pressure. The basic idea of few-shot learning is to train the network to learn meta-knowledge from a large number of prior tasks and then use this prior knowledge to guide the model to learn faster on a new task [127]. Few-shot learning can obtain data features from a small number of samples, reducing the heavy dependence on data in action recognition and action detection.

    4.2.3 Model Lightweight

Because existing algorithms have huge computing overheads and are difficult to promote and deploy at scale, lightweighting operations such as model pruning are an important direction for subsequent research. It is also necessary to consider reducing model complexity when designing the network structure [128]; for example, algorithms such as SlowFast and TSM effectively promote research progress by reducing computing overhead. From the perspective of practical application, many action recognition and action detection algorithms cannot be deployed in real task scenarios because their architectures are too complex and place high demands on hardware. Therefore, from the perspective of real needs, model lightweighting is a task that must be completed.
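
As one concrete lightweighting technique, the sketch below applies magnitude-based weight pruning with PyTorch's built-in pruning utility; the 30% sparsity level and the toy model are arbitrary examples, not a recipe from the cited works.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Conv2d(3, 64, 3), nn.ReLU(), nn.Conv2d(64, 64, 3))

# Zero out the 30% smallest-magnitude weights in every convolutional layer.
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")   # make the pruning permanent

sparsity = float((model[0].weight == 0).float().mean())
print(f"conv1 sparsity: {sparsity:.2f}")
```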

    4.2.4 Transformer Model

As the preceding sections show, the transformer model is now extensively involved in action recognition and action detection. As a popular model across NLP and computer vision, the transformer will play an important role in the future development of action recognition and detection. At present, the transformer achieves higher accuracy than CNNs in action recognition and detection tasks and can connect or collaborate better with NLP in future large-model research. However, the transformer still needs to be optimized in terms of model parameters to reduce hardware requirements, so optimization on this basis is a promising direction.

    5 Conclusion

This paper systematically reviews the current research status of action recognition and detection, focuses on the four commonly used model families for action recognition, divides action detection into temporal action detection and spatiotemporal action detection, and elaborates on the development of the algorithms in each setting. Finally, this paper summarizes action recognition and detection, sorts out the differences and connections between the various algorithms, and expounds the prominent problems faced by current research and the general direction of future development.

Acknowledgement: None.

Funding Statement: This work was supported by the National Educational Science 13th Five-Year Plan Project (JYKYB2019012), the Basic Research Fund for the Engineering University of PAP (WJY201907), and the Basic Research Fund of the Engineering University of PAP (WJY202120).

Author Contributions: Study conception and design: Y. Li, X. Cui; data collection: Q. Liang; analysis and interpretation of results: B. Gan; draft manuscript preparation: Q. Liang. All authors reviewed the results and approved the final version of the manuscript.

Availability of Data and Materials: All data in this paper can be found in Google Scholar.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
