
    Scribble-Supervised Video Object Segmentation

    2022-01-25 12:51:00
    IEEE/CAA Journal of Automatica Sinica, 2022, Issue 2

    Peiliang Huang, Junwei Han, Nian Liu, Jun Ren, and Dingwen Zhang

    Abstract: Recently, video object segmentation has received great attention in the computer vision community. Most existing methods rely heavily on pixel-wise human annotations, which are expensive and time-consuming to obtain. To tackle this problem, we make an early attempt to achieve video object segmentation with scribble-level supervision, which greatly reduces the human labor needed to collect manual annotations. However, conventional network architectures and learning objectives do not work well in this scenario because the supervision information is highly sparse and incomplete. To address this issue, this paper introduces two novel elements for learning the video object segmentation model. The first is the scribble attention module, which captures more accurate context information and learns an effective attention map to enhance the contrast between foreground and background. The second is the scribble-supervised loss, which optimizes the unlabeled pixels and dynamically corrects inaccurately segmented areas during the training stage. To evaluate the proposed method, we conduct experiments on two video object segmentation benchmark datasets, YouTube-video object segmentation (VOS) and densely annotated video segmentation (DAVIS)-2017. We first generate the scribble annotations from the original per-pixel annotations. Then, we train our model and compare its test performance with the baseline models and other existing works. Extensive experiments demonstrate that the proposed method works effectively and approaches the performance of methods that require dense per-pixel annotations.

    I. INTRODUCTION

    VIDEO object segmentation (VOS) aims at segmenting one or multiple foreground objects from a video sequence and has been actively studied for computer vision applications such as object tracking [1], video summarization [2], and action recognition [3]. Segmenting objects in videos is partially inspired by research on semantic segmentation, e.g., semantic segmentation in urban scenes [4], [5]. However, VOS needs to explore spatial and temporal information simultaneously.

    Based on the level of human intervention in test scenarios, existing VOS approaches can be divided into two main categories: semi-supervised approaches [6]–[10] and unsupervised approaches [11]–[15]. The semi-supervised VOS approaches need ground-truth annotations of the objects in the first frame of each test video sequence, while the unsupervised approaches do not need such annotations. However, both semi-supervised and unsupervised approaches need per-pixel annotations of each video frame in the training stage, and per-pixel annotation of large-scale video data is prohibitively expensive and hard to obtain. Moreover, due to the lack of prior knowledge of the primary objects in the first frame, the unsupervised approaches cannot flexibly select the objects of interest.

    In this paper, to achieve a trade-off between annotation efficiency and model performance, we propose a novel video object segmentation method that needs only scribble-level supervision. Specifically, our method requires a scribble annotation on each video frame in training, but only on the first video frame in testing. Consequently, compared to conventional semi-supervised and unsupervised VOS methods, our scribble-supervised VOS method has two advantages. Firstly, as shown in Fig. 1(a) and 1(b), drawing a scribble costs far less labeling effort than producing per-pixel annotations. Secondly, the annotated scribble on the first frame of each test video provides informative guidance for segmenting the desired objects, which is beyond the capacity of the unsupervised VOS approaches.

    Implementing the scribble-supervised VOS process is very challenging. As shown in Fig. 1(a), per-pixel annotations allow fully supervised VOS methods to achieve reliable segmentation results since they provide supervision on all aspects of the targeted objects, including structure, position, and boundary detail. However, scribble-level annotations (see Fig. 1(b)) only provide a small number of labeled pixels inside the objects; they are highly sparse, and most of the pixel information is missing. This missing supervision limits the performance of a deep model trained on scribble-level annotations. Directly using conventional network architectures and objective functions to learn video object segmentation models from these sparse scribbles may lead to segmentation results with poor boundaries and details, as shown in Fig. 1(c).

    To achieve high-quality video object segmentation results, we introduce a novel scribble attention module and a scribble-supervised loss to build the video object segmentation model.

    Fig. 1. Illustration of different annotations and the results of different methods supervised by scribbles. (a) Manual annotation cost with per-pixel annotations. (b) Manual annotation cost with scribble-level annotations. (c) Baseline model: trained only on scribbles with partial cross-entropy loss [16]. (d) Ours.

    First, we design a novel scribble attention module that learns an effective attention map to enhance the contrast between the foreground and background. Most existing self-attention models use global context regions to construct contextual features, aggregating information at each location. However, not all contextual information is reliable for the final decision. Instead of learning dense pairwise relationships globally, the scribble attention module selectively learns the relationship between reliable foreground positions and the query position with the help of the foreground scribble annotation. Specifically, the scribble attention module computes the response at a query position as a weighted sum of the features at the scribble positions in the input feature maps. Intuitively, if the query position belongs to the background region, the pairwise affinities between it and all scribble positions are relatively low, and so is the aggregated response at this query position. Conversely, if the query position belongs to the foreground region, the pairwise affinities between it and all scribble positions are relatively high, and so is the aggregated response. Therefore, the scribble attention module can capture more accurate context information and learn an effective attention map to enhance the contrast between the foreground and background.

    Second, we propose a novel scribble-supervised loss which optimizes both the labeled and unlabeled pixels by making full use of pixel positions and red-green-blue (RGB) color. This scribble-supervised loss can be seen as a regularized loss that cooperates with the partial cross-entropy loss [16] to optimize the training process. Specifically, we apply the partial cross-entropy (pCE) loss [16] to the scribble-labeled pixels and the scribble-supervised loss to both the labeled and unlabeled pixels in "shallow" segmentation [16], [17]. Our scribble-supervised loss penalizes disagreements between pairs of pixels and softly enforces output consistency among all pixels based on predefined pairwise affinities. These pairwise affinities are based on the dense conditional random field (CRF) [18], which makes full use of pixel position and RGB color. By adding this regularization term, all pixel information is used throughout the training process, which compensates for the information missing from scribble-level annotations. Furthermore, we dynamically divide the predicted mask into three confidence regions and assign a different scribble-supervised loss weight to each region. Thus, we can achieve better segmentation of the object's boundary by dynamically correcting inaccurately segmented areas throughout the training process.

    The contributions of this paper can be summarized as follows:

    1) We propose a novel weakly-supervised VOS method which trains the model with only scribble-level annotations. To the best of our knowledge, this is one of the earliest VOS frameworks working under scribble-level supervision.

    2) We design a novel scribble attention module which captures more accurate context information and learns an effective attention map to enhance the contrast between foreground and background.

    3) We propose the scribble-supervised loss and integrate it into the training process to compensate for the information missing from scribble-level annotations and to dynamically correct inaccurately segmented areas.

    4) We evaluate the proposed algorithm on two benchmark datasets quantitatively and qualitatively. The extensive experiments show that the proposed algorithm effectively reduces the gap to approaches trained with per-pixel annotations.

    The remainder of this paper is organized as follows. A brief review of related works on video object segmentation, scribble-supervised semantic segmentation, and self-attention modules is presented in Section II. Then, the problem formulation and detailed explanations of the proposed method are presented in Section III, followed by the experimental results in Section IV. Finally, conclusions are drawn in Section V.

    II. RELATED WORKS

    We start with an overview of related works on video object segmentation, followed by an overview of scribble-supervised semantic segmentation, and finally review several related works on self-attention models.

    A. Video Object Segmentation

    Semi-Supervised Methods: In the semi-supervised setting, the ground-truth masks of the foreground objects are given for the first frame of the video sequence, and the goal is to propagate the masks of the given objects through the rest of the sequence. Perazzi et al. [6] combined offline and online learning strategies, where the former produces a refined mask from the estimation on previous frames, and the latter captures the appearance of the specific object instance. Caelles et al. [7] proposed a one-shot video object segmentation (OSVOS) method, which pre-trains a convolutional network for foreground-background segmentation and fine-tunes it on the first-frame ground truth of the video sequence at test time. After that, Maninis et al. [19] improved OSVOS with semantic information from an instance segmentation network. Voigtlaender and Leibe [20] extended OSVOS by updating the network online using training examples selected based on the confidence of the network and the spatial configuration.

    These online methods greatly improve segmentation results but sacrifice running efficiency. To address the time cost of the fine-tuning stage, some recent works achieved a better runtime and satisfactory segmentation performance without fine-tuning. Chen et al. [21] proposed a blazingly fast video object segmentation method with pixel-wise metric learning (PML), which uses a pixel-wise embedding space learned with a triplet loss together with a nearest neighbor classifier. Inspired by PML, Voigtlaender et al. [10] proposed a fast end-to-end embedding learning method for video object segmentation, which uses a semantic pixel-wise embedding together with global and local matching mechanisms to transfer information from the first frame and the previous frame of the video to the current frame. Johnander et al. [9] proposed a generative appearance model (A-GAME) which learns a powerful representation of the target and background appearance in a single forward pass. Although semi-supervised VOS methods have achieved tremendous progress, they rely on costly per-pixel annotations.

    Unsupervised Methods: In the unsupervised setting, the ground-truth masks of the foreground objects are not given at all. Early unsupervised VOS approaches mainly leveraged object proposals [11], [21]–[25], analyzed long-term motion information (trajectories) [12], [26]–[29], or utilized saliency information [30]–[33] to infer the target. Recently, with the success of deep learning, many convolutional neural network (CNN)-based models have been proposed. For example, Jain et al. [34] proposed a two-stream architecture which fuses motion and appearance in a unified framework for segmenting generic objects in videos. Cheng et al. [35] proposed an end-to-end trainable SegFlow network which simultaneously predicts optical flow and object segmentation in video sequences. Optical flow and object segmentation information are propagated bidirectionally in this unified framework. This method needs to fine-tune the pre-trained segmentation network for specific objects. Li et al. [36] proposed an instance embedding network to produce an embedding vector for each pixel. This embedding vector can identify all pixels belonging to the same foreground. Then, they combined motion-based bilateral networks for identifying the background. Song et al. [13] used a video salient object detection method to segment objects, which fine-tunes the pre-trained segmentation network for extracting spatial saliency features and trains a ConvLSTM to capture temporal dynamics. Lu et al. [14] proposed a novel co-attention siamese network to address the unsupervised video object segmentation task from a holistic view. This method emphasizes the importance of the inherent correlation among video frames and incorporates a global co-attention mechanism. Wang et al. [37] conducted a systematic study on the role of visual attention in the unsupervised VOS task, which decouples unsupervised VOS into two sub-tasks: attention-guided object segmentation in the spatial domain and unsupervised VOS-driven dynamic visual attention prediction in the spatiotemporal domain. However, due to the lack of prior knowledge about the primary objects, these unsupervised VOS methods have difficulty distinguishing the primary objects from complex backgrounds in real-world scenarios. Additionally, like the semi-supervised methods, they need fully annotated masks in the training stage, which are expensive and time-consuming to obtain.

    Our method uses scribble-level annotations as the supervision for each video frame in the training stage. We also use a scribble-level annotation as the object's mask for the first frame in the testing stage. This greatly reduces the cost of labeling during both the training and testing phases. Furthermore, it avoids the lack of prior knowledge about the primary objects.

    B. Scribble-Supervised Semantic Segmentation

    To avoid requiring expensive pixel-wise labels, some semantic segmentation methods attempt to learn segmentation from low-cost scribble-level annotations [16], [17], [38], [39]. Lin et al. [38] first proposed scribble-supervised convolutional networks for semantic segmentation. Tang et al. [16] combined a partial cross-entropy loss on labeled pixels with a normalized cut loss for unlabeled pixels. After that, Tang et al. [17] further explored the role of different regularized losses for scribble-supervised segmentation tasks. Wang et al. [39] proposed a boundary perception guided method, which employs class-agnostic edge maps as supervision to guide the segmentation network. To the best of our knowledge, there is no published work on scribble-supervised VOS. Compared with scribble-supervised semantic segmentation, the scribble-supervised VOS task is more complex, since it is still costly to manually draw scribble-level labels for each object in a large-scale video dataset. Further, the model has to detect and segment the objects along the video sequence given a scribble-level label only in the first frame. In this paper, we simulate human annotations on the original per-pixel annotated datasets, which further reduces the time cost of manual labeling. Furthermore, we propose a novel scribble attention module and a scribble-supervised loss to build our video object segmentation model.

    C. Self-Attention Mechanisms in Neural Networks

    A self-attention module computes the response at a position as a weighted sum of all the features in an embedding space, which can capture long-range information. Vaswani et al. [40] first proposed the self-attention mechanism to draw global dependencies between input and output and applied it to machine translation. This attention mechanism has since attracted attention in the computer vision field. Wang et al. [41] proposed a non-local network for capturing long-range dependencies, which computes the response at a position as a weighted sum of the features at all positions. Inspired by [41], Yi et al. [42] proposed an improved non-local operation to fuse temporal information from consecutive frames for video super-resolution, which was designed to avoid complex motion estimation and motion compensation and to make better use of spatiotemporal information. Fu et al. [43] proposed a dual attention network to model the semantic interdependencies in the spatial and channel dimensions, respectively. Zhang et al. [44] proposed an aggregated co-occurrent feature module for semantic segmentation, which learns a fine-grained spatially invariant representation to capture co-occurrent context information across the scene. The difference between the co-occurrent feature module and the non-local network is that the co-occurrent feature module learns a prior distribution conditioned on the semantic context [44].

    Our scribble attention module is inspired by the success of the above attention modules. The key difference is that the scribble attention module computes the response at a position as a weighted sum of the features at the scribble positions labeled in objects, obtained by multiplying the scribble-level ground truth with the feature maps.

    III. METHOD

    This work aims to design an algorithm for weakly supervised video object segmentation with scribbles as supervision. We introduce the mask-propagation module [9] to propagate the object appearance and the appearance module [9] to provide discriminative posterior class probabilities. To achieve high-quality segmentation results with sparsely annotated scribbles, we add a scribble attention module in the two backbones to enhance the contrast between the foreground and background and propose a scribble-supervised loss to take both the labeled and unlabeled pixels into account. In this section, we first present an overview of our proposed VOS method in Subsection A. Then, we discuss the mask-propagation module and the appearance module in Subsection B. In Subsections C and D, we introduce in detail the scribble attention module and the scribble-supervised loss, respectively. Finally, we elaborate on the generation process of scribble annotations in Subsection E. For easy retrieval, we summarize the symbols used in this paper in Table I.

    A. Overview

    The flowchart of the complete VOS pipeline in this work is shown in Fig. 2. It can be divided into two stages: the training stage and the test stage. In the training stage, we generate scribble annotations for all frames; the detailed generation process is described in Subsection E. The scribble annotations of each frame are used to train our scribble-supervised video object segmentation (SSVOS) model. In the test stage, we only generate the scribble annotation for the first video frame and predict the segmentation mask for each frame.

    The proposed SSVOS model is illustrated in Fig. 3. Given a video frame $I_t$, the features are first extracted with a backbone network. Then, they are input to the mask-propagation module and the appearance module. The features and the scribble annotation of the initial frame $I_0$ are input to the mask-propagation module to provide a rough prior. The coarse segmentation mask of frame $I_{t-1}$ is input to the mask-propagation module to propagate the object's appearance information from frame $I_{t-1}$ to frame $I_t$. The coarse segmentation mask and the mean and covariance of frame $I_{t-1}$ are used to update the appearance module. The two features from the mask-propagation module and the appearance module are combined in the fusion module, which comprises two convolutional layers. We first concatenate these two features along the channel dimension to obtain aggregated feature maps. Then, we fuse the aggregated features to generate a coarse mask encoding by using two convolution layers with 3×3 and 1×1 convolutional kernels, respectively. The coarse mask encoding is input to a convolution layer with 3×3 convolutional kernels to generate a coarse segmentation mask. This mask is further used to update the mask-propagation module to propagate the object's appearance information from frame $I_t$ to frame $I_{t+1}$. The coarse segmentation mask is also used to update the appearance module to provide posterior probabilities for frame $I_{t+1}$. Finally, the coarse mask encoding from the fusion module is passed through the upsampling module and a convolution layer with 3×3 convolutional kernels to produce the final segmentation mask.
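    The per-frame recurrence described above can be sketched as plain control flow. This is only an illustration of how the coarse mask and the appearance parameters are carried from one frame to the next; every callable passed in (`extract`, `propagate`, `appearance`, `update_appearance`, `fuse`, `to_coarse_mask`, `refine`) is a hypothetical stand-in for the corresponding network component, not the authors' implementation.

```python
def segment_video(frames, first_scribble, extract, propagate, appearance,
                  update_appearance, fuse, to_coarse_mask, refine):
    """Run the frame-by-frame loop; all module arguments are hypothetical callables."""
    feats0 = extract(frames[0])                # backbone features of the initial frame I_0
    prev_mask = first_scribble                 # the scribble plays the role of the first "mask"
    params = update_appearance(feats0, first_scribble)  # initial appearance parameters
    outputs = []
    for frame in frames[1:]:
        feats = extract(frame)
        # Mask propagation sees I_0 features, the I_0 scribble, and the previous coarse mask.
        prop = propagate(feats, feats0, first_scribble, prev_mask)
        # Appearance module scores the current features with the previous parameters.
        post = appearance(feats, params)
        encoding = fuse(prop, post)            # coarse mask encoding
        coarse = to_coarse_mask(encoding)      # coarse segmentation mask
        outputs.append(refine(encoding))       # upsampled final mask
        # State carried to the next frame.
        prev_mask = coarse
        params = update_appearance(feats, coarse)
    return outputs
```

    Note that only the coarse mask, not the refined one, is fed back, which mirrors the description above.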

    TABLE I. A SUMMARY OF SYMBOLS USED IN THIS PAPER

    Fig. 2. Flowchart of the complete video object segmentation pipeline in this work.

    Fig. 3. The architecture of the proposed scribble-supervised video object segmentation model (SSVOS). The scribble attention (SA) module is only embedded into the third block of the backbone network. The features and the scribble annotation of the initial frame $I_0$ are first concatenated with the features of the current frame $I_t$ and the coarse segmentation mask of the previous frame $I_{t-1}$. Then, the concatenated features are input to the mask-propagation module. The features of $I_t$ and the coarse segmentation mask of $I_{t-1}$ are input to the appearance module. Furthermore, the mean and covariance of $I_{t-1}$ are also input to the appearance module. The two features from the mask-propagation module and the appearance module are fused in the fusion module to generate a coarse mask encoding. This coarse mask encoding is input to a convolution layer to generate a coarse segmentation mask, which is then used to update the mask-propagation module and the appearance module for the next frame $I_{t+1}$. The coarse mask encoding is then refined by the upsampling module and a convolution layer to generate the final segmentation. The pCE loss and the auxiliary pCE loss only consider the labeled pixels, while the scribble-supervised (SS) loss considers both the labeled and unlabeled pixels. "Conv" indicates "convolution". "r" indicates dilation rate [45]. "Cov" indicates covariance.

    B. Mask-Propagation and Appearance Modules

    Inspired by A-GAME [9], we use a mask-propagation module to propagate the object's appearance and keep temporal-spatial consistency. As shown in Fig. 3, the mask-propagation module consists of three convolutional layers, where the first and third layers are convolution layers with 1×1 and 3×3 convolutional kernels, respectively. The middle layer is a dilation pyramid [45], which has three 3×3 convolution branches with dilation rates of 1, 3, and 6, respectively. In the first frame $I_1$, the scribble annotation of the initial frame $I_0$ is used as input to the mask-propagation module to propagate the object's appearance from frame $I_0$ to frame $I_1$. In the subsequent frames, where scribble annotations are not available, the coarse segmentation mask from $I_{t-1}$ is used as the input to update the mask-propagation module. Thus, the mask-propagation module can propagate the object's appearance from frame $I_{t-1}$ to frame $I_t$. As the displacements between adjacent frames are very small, temporal-spatial consistency can be kept, and the mask-propagation module can provide an accurate foreground prior estimate for frame $I_t$.

    Furthermore, we also use an appearance module [9] to return the posterior class probabilities at each image location. These posterior class probabilities form an extremely strong cue for foreground/background discrimination. As in [9], let us denote the set of features extracted from the video frames as $\{M_p\}$. The feature $M_p$ at each pixel location $p$ is a $D$-dimensional vector. We model these feature vectors as i.i.d. samples drawn from the underlying distribution:

    where $z_p$ is a discrete random variable which assigns the observation $M_p$ to a specific component $z_p = k$. We use a uniform prior $\Psi(z_p = k) = 1/K$ for $z_p$. $K$ is the number of components, and each component exclusively models the feature vectors of either the foreground or the background. In (1), each class-conditional density is a multivariate Gaussian with mean $\mu_k$ and covariance matrix $\Sigma_k$:
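    Under the definitions in this subsection, the mixture in (1) and its class-conditional density take the standard Gaussian mixture form below. This is a sketch reconstructed from the surrounding text; the exact notation of the original equations is an assumption.

```latex
p(M_p) = \sum_{k=1}^{K} \Psi(z_p = k)\, p(M_p \mid z_p = k),
\qquad \Psi(z_p = k) = \frac{1}{K}

p(M_p \mid z_p = k) = \mathcal{N}(M_p \mid \mu_k, \Sigma_k)
  = \frac{1}{(2\pi)^{D/2}\, |\Sigma_k|^{1/2}}
    \exp\!\left( -\tfrac{1}{2} (M_p - \mu_k)^{T} \Sigma_k^{-1} (M_p - \mu_k) \right)
```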

    In the first frame, our module parameters are inferred from the extracted features and the scribble annotation of the initial frame $I_0$. In subsequent frames, we update the module using the network's coarse segmentation predictions as soft class labels. We compute the appearance module parameters of frame $I_t$ as follows:

    Given the module parameters $\tau_{t-1}$ computed in the previous frame $I_{t-1}$, our appearance module can predict the posterior map of frame $I_t$ as follows:
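    As a rough illustration of the appearance module's two operations, the following sketch estimates one Gaussian per class from soft labels and then evaluates the posterior under the uniform prior. It assumes a diagonal covariance to stay dependency-free (the text uses a full covariance matrix $\Sigma_k$), and the function names and the `eps` stabilizer are hypothetical.

```python
import math

def weighted_gaussian(features, soft_labels, eps=1e-6):
    """Estimate mean and diagonal variance of one component from soft class labels."""
    total = sum(soft_labels)
    dim = len(features[0])
    mean = [sum(a * f[d] for a, f in zip(soft_labels, features)) / total
            for d in range(dim)]
    var = [sum(a * (f[d] - mean[d]) ** 2 for a, f in zip(soft_labels, features)) / total + eps
           for d in range(dim)]
    return mean, var

def log_density(x, mean, var):
    """Log of a diagonal-covariance Gaussian density evaluated at x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def posterior(x, components):
    """Posterior over components for feature x; the uniform prior 1/K cancels out."""
    logs = [log_density(x, mean, var) for mean, var in components]
    top = max(logs)                      # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logs]
    norm = sum(weights)
    return [w / norm for w in weights]
```

    For example, with two unit-variance components centered at -1 and 1, a feature at 0 is equally likely under both, while a feature at 1 is assigned mostly to the second component.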

    C. Scribble Attention Module

    As illustrated in Fig. 4, given an input feature $X \in \mathbb{R}^{C \times H \times W}$, we first transform it into three new feature maps $\Phi(X)$, $\Theta(X)$, and $\Omega(X)$, where $\{\Phi(X), \Theta(X), \Omega(X)\} \in \mathbb{R}^{C/2 \times H \times W}$. We resize the scribble-level ground truth to $\mathbb{R}^{1 \times H \times W}$. Then, we perform an element-wise multiplication between $\Phi(X)$ and the scribble-level ground truth. Here, the foreground scribble pixels are set to 1 and the rest are set to 0. After that, we calculate the attention as

    where $A_{ji}$ indicates the extent to which the model attends to the $i$th position when synthesizing the $j$th position, $\delta_{fs}$ indicates the positions of the foreground scribble pixels, and $T$ indicates the transpose operation. $\Phi(X_i) = W_\phi X_i$ and $\Theta(X_j) = W_\theta X_j$, where $W_\phi$ and $W_\theta$ are two weight matrices learned via 1×1 convolution. Then, the output of the attention layer at the $j$th position can be represented as:

    where $\Omega(X_i) = W_\omega X_i$ and $W_\omega$ is a weight matrix learned via 1×1 convolution. Finally, we multiply $O_j$ by a scale parameter $\gamma$ and perform an element-wise sum with the input feature $X$ to obtain the final output $Z \in \mathbb{R}^{C \times H \times W}$ as follows:

    where $\gamma$ is initialized as 0 and gradually learned to assign more weight [43], [46]. The output feature $Z_j$ of the scribble attention module at each position is a weighted sum of the features at the positions of the foreground scribbles and the original features.

    The scribble attention module computes the response at a position as a weighted sum of the features at the scribble positions labeled in objects. Each pixel in the scribble belongs to the foreground object. Thus, the module can capture more accurate context information and learn an effective attention map to enhance the contrast between foreground and background, which plays an important role in the segmentation task.
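    The idea of attending only to foreground scribble positions can be sketched on flattened features as follows. This is a minimal illustration under stated assumptions: the learned 1×1 projections $\Phi$, $\Theta$, and $\Omega$ are replaced by the identity, the attention weights are normalized with a softmax restricted to scribble positions, and the function name is hypothetical. It is not the authors' implementation.

```python
import math

def scribble_attention(features, scribble, gamma):
    """features: N feature vectors (flattened H*W positions); scribble: N flags (1 = foreground scribble)."""
    fg = [i for i, s in enumerate(scribble) if s == 1]
    if not fg:                               # no scribble: fall back to the input features
        return [list(f) for f in features]
    outputs = []
    for x_j in features:
        # Affinity between the query position j and every scribble position i.
        scores = [sum(a * b for a, b in zip(features[i], x_j)) for i in fg]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        attn = [e / norm for e in exps]      # assumed softmax over scribble positions only
        # Weighted sum of the scribble-position features.
        o_j = [sum(a * features[i][c] for a, i in zip(attn, fg)) for c in range(len(x_j))]
        # Residual connection scaled by gamma, as in Z_j = gamma * O_j + X_j.
        outputs.append([gamma * o + x for o, x in zip(o_j, x_j)])
    return outputs
```

    With `gamma = 0`, the value at which $\gamma$ is initialized, the module returns its input unchanged, so training starts from the plain backbone features.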

    Fig.4.Illustration of the proposed scribble attention module.

    D. Scribble-Supervised Loss

    As shown in Fig. 3, the annotated scribbles are sparse, and the pixels that are not annotated are considered unknown. We use the (auxiliary) partial cross-entropy loss [16] on the partially labeled pixels, which effectively ignores the unknown regions. This partial cross-entropy loss can be seen as a sampling of the cross-entropy loss with per-pixel masks by rewriting it as follows:

    where $q(i)$ is a binary indicator, which equals 1 if pixel $i$ is labeled on the foreground scribble and 0 otherwise; $P(i)$ is the softmax output probability of the network; and $\delta_{fs}$ denotes the set of scribble-labeled pixels.
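    A minimal sketch of a partial cross-entropy loss in this spirit is given below. It assumes a binary foreground/background setting, averages over labeled pixels, and encodes unknown pixels as `None`; these conventions and the function name are assumptions made for illustration.

```python
import math

def partial_cross_entropy(probs, labels):
    """probs: predicted foreground probability per pixel.
    labels: 1 (foreground scribble), 0 (background scribble), or None (unknown)."""
    total, count = 0.0, 0
    for p, y in zip(probs, labels):
        if y is None:                        # unknown pixels contribute nothing
            continue
        p_true = p if y == 1 else 1.0 - p    # probability assigned to the labeled class
        total += -math.log(max(p_true, 1e-12))
        count += 1
    return total / count if count else 0.0
```

    Changing the prediction at an unlabeled pixel leaves this loss unchanged, which is why an additional term is needed to supervise those pixels.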

    So far, all scribble-labeled pixels have been used for training with the partial cross-entropy loss. However, compared with per-pixel annotations, scribbles are usually drawn in the interior of objects and therefore lack definite boundary information and confident background regions. With so few pixels as supervision, object structure and boundary details cannot be easily inferred. Furthermore, since scribble-level annotations label extremely few pixels, training CNNs on scribbles alone would lead to overfitting. The proposed scribble-supervised loss optimizes the training process on both the scribble-labeled and unlabeled pixels, which compensates for the missing information at unlabeled pixels. Thus, the scribble-supervised loss can be seen as a regularized loss that helps prevent overfitting. Our scribble-supervised loss penalizes disagreements between pairs of pixels and softly enforces output consistency among all pixels based on predefined pairwise affinities. These pairwise affinities are based on the dense CRF [18], which makes full use of pixel position and RGB color. To this end, we first define the energy formulation between pixel $i$ and pixel $j$ based on [47], [48] as follows:

    where $c_a$ and $c_b$ are the class labels, and $P(i)$ and $P(j)$ are the softmax outputs of the network at pixel $i$ and pixel $j$, respectively. $g(i,j)$ is a Gaussian kernel bandwidth filter which considers both pixel spatial positions and RGB colors and can be represented as follows:

    where $N$ is the normalization factor, and $d$ and $r$ are the pixel spatial position and RGB color, respectively. $\delta_d$ and $\delta_r$ are the hyperparameters that control the bandwidths of the Gaussian kernels. We use the Potts model [49] as a discontinuity-preserving function to penalize disagreements between pairs of pixels and encourage the pixels in the same region to have equal labels. Assuming that the segmentation results $P$ are restricted to binary class indicators $P \in \{0,1\}$, the standard Potts model [49] can be represented via Iverson brackets $[\cdot]$, as on the left-hand side below:

    where $g(i,j)$ is a matrix of pairwise discontinuity costs, or an affinity matrix. The right-hand side above is a particularly straightforward quadratic relaxation of the Potts model [49] that works for relaxed $P \in [0,1]$, corresponding to a typical softmax output of CNNs [17].

    Inspired by [48], we use a modified quadratic relaxation of the Potts model [49] to simplify the energy formulation:

    The affinity matrix $g(i,j)$ is based on a dense Gaussian kernel which makes full use of pixel position and RGB color. We can see that the Potts model [49] penalizes disagreements between pairs of pixels based only on the spatial and appearance information of the pixels, and there is no need for any ground-truth supervision.
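    To make the pairwise term concrete, the sketch below computes a Gaussian affinity from position and color and a quadratic relaxation of the Potts energy over a small set of pixels. It is an illustration under assumptions: the normalization factor $N$ is omitted, the relaxation used is the plain squared difference $g(i,j)\,(P(i)-P(j))^2$ rather than the modified form, each unordered pair is counted once, and the function names are hypothetical.

```python
import math

def affinity(pos_i, rgb_i, pos_j, rgb_j, delta_d, delta_r):
    """Unnormalized Gaussian kernel over spatial position and RGB color."""
    d2 = sum((a - b) ** 2 for a, b in zip(pos_i, pos_j))
    r2 = sum((a - b) ** 2 for a, b in zip(rgb_i, rgb_j))
    return math.exp(-d2 / (2.0 * delta_d ** 2) - r2 / (2.0 * delta_r ** 2))

def relaxed_potts_energy(probs, positions, colors, delta_d, delta_r):
    """Sum of g(i, j) * (P(i) - P(j))^2 over all unordered pixel pairs."""
    energy = 0.0
    for i in range(len(probs)):
        for j in range(i + 1, len(probs)):
            g = affinity(positions[i], colors[i], positions[j], colors[j], delta_d, delta_r)
            energy += g * (probs[i] - probs[j]) ** 2
    return energy
```

    Two nearby pixels with similar colors but different predictions incur a large penalty, while pixels that differ strongly in color can disagree almost for free. The exhaustive double loop is quadratic in the number of pixels and is only meant for small examples.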

    The dense CRF loss [17],[47],[48] can be represented as follows:

    It is well known that the main difficulty in segmentation lies in the object's boundary. In this task, since scribbles provide only a few pixels as supervision, it is even more difficult to obtain accurate boundary segmentation masks. To better segment the boundary regions, we divide the predicted mask into three subregions according to the predicted mask scores. Specifically, we identify pixels with scores larger than 0.3 as reliable foreground pixels and pixels with scores smaller than 0.05 as reliable background pixels. The remaining pixels are classified as unreliable areas. Then, we assign different loss weights to these areas. Based on this idea, we design a scribble-supervised (SS) loss:

    where $w(i)$ is an adaptive region weight and can be represented as follows:

    where β controls the weight of the scribble-supervised loss on unreliable regions.
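    The three-way split of the predicted mask can be sketched as follows. The thresholds 0.3 and 0.05 come from the text; the weight of 1 for the two reliable regions is an assumption made for illustration (only the role of $\beta$ on unreliable regions is stated), and the function names are hypothetical.

```python
def region_of(score, fg_threshold=0.3, bg_threshold=0.05):
    """Classify a predicted mask score into one of the three confidence regions."""
    if score > fg_threshold:
        return "foreground"
    if score < bg_threshold:
        return "background"
    return "unreliable"

def region_weight(score, beta, fg_threshold=0.3, bg_threshold=0.05):
    """Adaptive weight w(i): assumed 1.0 in reliable regions, beta in unreliable ones."""
    if region_of(score, fg_threshold, bg_threshold) == "unreliable":
        return beta
    return 1.0
```

    Because the regions are recomputed from the current prediction, a pixel can move between regions as training progresses, which is what makes the weighting dynamic.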

    For each prediction branch, we deploy a new joint loss function combining the partial cross-entropy loss with the designed scribble-supervised loss function:

    where λ controls the weight of the scribble-supervised loss.
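    Read literally, this description suggests a weighted sum of the two terms. A sketch in LaTeX, where the symbol names $\mathcal{L}_{pCE}$ and $\mathcal{L}_{SS}$ are assumed for illustration and may differ from the original notation:

```latex
\mathcal{L} = \mathcal{L}_{pCE} + \lambda\, \mathcal{L}_{SS}
```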

    E.Scribble Simulation Generation

    Compared with per-pixel labels, scribbles greatly reduce the cost of annotation. However, it is still costly to manually draw scribble-level labels for each object in a large-scale video dataset. To reduce the annotation cost of video tasks, Liu et al. [50] designed an interactive computer vision system that allows a user to annotate motion efficiently. Karasev et al. [51] proposed a frame selection method for video annotation that naturally allows for uncertainty in the measurements and is thus applicable to both automated and manual annotation. Luo et al. [52] used an active learning framework to obtain higher segmentation accuracy with less user annotation. Different from the above methods, and inspired by [53], we simulate scribble-level annotations on each video frame of two available per-pixel annotated datasets, YouTube-VOS [54] and densely annotated video segmentation (DAVIS)-2017 [55]. As all video frames in these two datasets are provided with per-pixel annotations, we can obtain the scribbles of all video frames with the proposed scribble simulation generation process. For each video frame, we use morphological skeletonization [56] to automatically generate realistic foreground object scribbles and randomly generate a rectangular box as the background scribble. As shown in Fig. 5, the whole process consists of four steps: 1) Foreground skeletonization: we skeletonize the per-pixel annotated masks with a fast implementation of the thinning algorithm [56] to get foreground scribbles. 2) Morphological dilation: the scribbles obtained directly from the thinning algorithm in Step 1 are a single pixel wide, which differs from manually annotated scribbles. Therefore, we use a morphological dilation algorithm to obtain five-pixel-wide scribbles. Then, we multiply the original per-pixel annotated mask element-wise with the obtained scribble-level mask to eliminate the pixels that the dilation extended into the background. We use these scribbles as the final foreground object scribbles. 3) Background generation: we randomly generate a five-pixel-wide rectangular box as the background scribble in the background area around the foreground scribbles. When the foreground object touches an edge of the image, the side of the rectangle along that image edge is omitted, e.g., the background scribble in Fig. 3. 4) We label the areas outside the foreground and background scribble regions as unknown areas.
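    Steps 2) to 4) above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the single-pixel skeleton from Step 1 is assumed to have been computed beforehand by a thinning routine, the mask is assumed to be non-empty and to contain a single object, the dilation uses a square structuring element, and the label values `FG`, `BG`, `UNKNOWN`, the helper names `dilate` and `simulate_scribbles`, and the random gap between the object and the background box are all hypothetical choices.

```python
import random

# Label values are illustrative assumptions, not taken from the paper.
FG, BG, UNKNOWN = 1, 0, 255


def dilate(skel, radius=2):
    """Square dilation; radius 2 widens a 1-pixel line to 5 pixels."""
    h, w = len(skel), len(skel[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if skel[y][x]:
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = 1
    return out


def simulate_scribbles(mask, skel, width=5, max_gap=3, rng=random):
    """Build a scribble label map from a non-empty binary mask and its skeleton."""
    h, w = len(mask), len(mask[0])
    labels = [[UNKNOWN] * w for _ in range(h)]  # Step 4: unknown by default

    # Step 3: rectangular frame placed a random gap outside the object's
    # bounding box; sides that fall outside the image are simply dropped.
    ys = [y for y in range(h) for x in range(w) if mask[y][x]]
    xs = [x for y in range(h) for x in range(w) if mask[y][x]]
    gap = rng.randint(1, max_gap)
    top, bottom = min(ys) - gap - width, max(ys) + gap + width
    left, right = min(xs) - gap - width, max(xs) + gap + width
    for y in range(max(0, top), min(h, bottom + 1)):
        for x in range(max(0, left), min(w, right + 1)):
            on_frame = (y - top < width or bottom - y < width
                        or x - left < width or right - x < width)
            if on_frame and not mask[y][x]:
                labels[y][x] = BG

    # Step 2: dilate the skeleton, then multiply by the mask so that no
    # foreground scribble pixel leaks into the background.
    thick = dilate(skel, radius=width // 2)
    for y in range(h):
        for x in range(w):
            if thick[y][x] * mask[y][x]:
                labels[y][x] = FG
    return labels


# Toy example: a 24x40 frame with one rectangular object and a straight skeleton.
mask = [[1 if 8 <= y <= 15 and 12 <= x <= 27 else 0 for x in range(40)] for y in range(24)]
skel = [[1 if y == 9 and 14 <= x <= 25 else 0 for x in range(40)] for y in range(24)]
labels = simulate_scribbles(mask, skel, rng=random.Random(0))
```

    In the toy example the skeleton lies close to the upper boundary of the object, so part of the dilated scribble would fall outside the mask; the element-wise multiplication removes exactly those pixels.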

    Fig. 5. Illustration of scribble simulation generation. From left to right: RGB images, per-pixel ground-truth annotations, foreground scribbles, dilated foreground scribbles, and final scribble annotations (best viewed zoomed in).

    IV. EXPERIMENTS

    The proposed algorithm is evaluated on two recent VOS benchmarks: YouTube-VOS [54] and DAVIS-2017 [55]. We first conduct ablation studies on YouTube-VOS [54]. Then, we compare our method with state-of-the-art VOS methods on both the YouTube-VOS [54] and DAVIS-2017 [55] datasets.

    A. Datasets

    YouTube-VOS: YouTube-VOS consists of 3471 and 474 videos in the training and validation sets, respectively, and is the largest video object segmentation benchmark. Each video sequence is 20 to 180 frames long, and every fifth frame is annotated. The training set has 65 unique object categories, which are regarded as seen categories. The validation set contains 91 unique object categories, comprising all the seen categories and 26 unseen categories.

    DAVIS2017: DAVIS2017 is a popular benchmark dataset for VOS tasks and a multi-object extension of DAVIS2016 [57]. The training set has 60 video sequences with multiple annotated instances. The validation and test-dev sets each include 30 video sequences with multiple annotated instances.

    B. Evaluation Metrics

    We evaluate the experimental results using three common evaluation measures suggested by [58]: 1) Region similarity J: the mean intersection-over-union (mIoU) between the ground truth and the predicted segmentation masks. 2) Contour accuracy F: the F-measure of the contour-based precision and recall between the contour points of the ground truth and the predicted masks. 3) Overall score G: the average of J and F. For the YouTube-VOS dataset, both J and F are divided into two measures, depending on whether the categories in the validation set are included in the training set (J_seen and F_seen) or not (J_unseen and F_unseen).
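    The three measures can be illustrated with a short sketch. It is a simplified illustration rather than the benchmark's official evaluation code: masks are nested lists of 0/1, the contour precision and recall are assumed to be already computed (extracting and matching contour points is omitted), and the function names are hypothetical.

```python
def region_similarity(pred, gt):
    """J for one frame: intersection-over-union of two binary masks."""
    inter = union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks are treated as a perfect match (an assumption).
    return inter / union if union else 1.0


def contour_accuracy(precision, recall):
    """F: harmonic mean of contour-based precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def overall_score(j, f):
    """G: average of J and F."""
    return (j + f) / 2


pred = [[1, 1, 0], [0, 1, 0]]
gt = [[1, 0, 0], [0, 1, 1]]
j = region_similarity(pred, gt)   # 2 shared pixels out of 4 in the union
f = contour_accuracy(0.5, 1.0)
g = overall_score(j, f)
```

    In practice J and F are averaged over all annotated frames and objects before G is computed.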

    C. Implementation Details

    Our framework is based on A-GAME [9], which is the state of the art in semi-supervised VOS. Our network is implemented in PyTorch [59]. We choose Adam [60] as the optimizer to minimize the joint loss. Adam is an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. Following [9], we divide network training into two stages. In the first stage, we train for 80 epochs on half-resolution images (240×480). We set the batch size to 4 video snippets with 8 frames in each snippet. We use a learning rate of 10⁻⁴, an exponential learning rate decay of 0.95 per epoch, and a weight decay of 10⁻⁵. In the second stage, we train for 100 epochs at full image resolution. The batch size is set to 2 snippets to accommodate longer sequences of 14 frames. We use a learning rate of 10⁻⁵, an exponential learning rate decay of 0.985 per epoch, and a weight decay of 10⁻⁶. In both stages, we set the hyper-parameters γ_d and γ_r of the Gaussian kernel bandwidths to 100 and 15, respectively. We set the scribble-supervised loss weight λ to 2×10⁻¹⁵ and the unreliable region weight β to 0.1. We determine all hyper-parameters via experience and grid search. To examine these settings, we vary the hyper-parameters in different ways and measure the change in performance on the YouTube-VOS dataset [54]. In rows A and B of Table II, we vary the learning rates lr_1 and lr_2 of the two training stages, respectively; the overall score decreases to varying degrees in each case. In rows C and D, the overall score likewise decreases when we vary the bandwidths γ_d and γ_r of the Gaussian kernels. In rows E and F, we vary the loss weight λ and the region weight β, and the overall score again decreases. These ablation results indicate that the settings chosen via experience and grid search perform best among the tested variants.
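    To make the two-stage schedule concrete, the sketch below computes the learning rate that results from the per-epoch exponential decay. It assumes that the decay factor multiplies the learning rate once per epoch and that the exponents of the learning rates and weight decays are negative (10⁻⁴, 10⁻⁵, and so on); the dictionary layout and the name `lr_at_epoch` are hypothetical and are not part of the original training code.

```python
# Hyper-parameters as read from the text; the structure is illustrative.
STAGES = {
    "stage1": {"epochs": 80, "base_lr": 1e-4, "decay": 0.95,
               "weight_decay": 1e-5, "batch_size": 4, "frames": 8},
    "stage2": {"epochs": 100, "base_lr": 1e-5, "decay": 0.985,
               "weight_decay": 1e-6, "batch_size": 2, "frames": 14},
}


def lr_at_epoch(stage, epoch):
    """Learning rate after `epoch` multiplicative decay steps."""
    cfg = STAGES[stage]
    return cfg["base_lr"] * cfg["decay"] ** epoch


for name, cfg in STAGES.items():
    final_lr = lr_at_epoch(name, cfg["epochs"])
    print(f"{name}: start {cfg['base_lr']:.1e}, end {final_lr:.2e}")
```

    Under these assumptions the learning rate at the end of each stage is on the order of 10⁻⁶, so the second stage mostly fine-tunes the model at full resolution.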

    D. Ablation Studies

    We conduct ablation studies on YouTube-VOS to verify the effectiveness of the scribble attention module and the scribble-supervised loss. We compare our baseline with different variant models; the quantitative results are shown in Table III, and qualitative comparisons are shown in Fig. 6. “pCE” denotes our model trained on the scribble labels with the partial cross-entropy loss, which serves as our baseline. “SA” denotes the proposed scribble attention module. “PSA” denotes the point-wise spatial attention module [61]. “NL” denotes the non-local module [41]. “CCA” denotes the criss-cross attention module [62]. “SS” denotes the proposed scribble-supervised loss. “DCRF” denotes the dense CRF loss [17], [47], [48]. “✓” denotes that the model includes the corresponding module.
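    The partial cross-entropy baseline can be illustrated as follows: the cross-entropy is evaluated only at pixels that carry a scribble label, and unlabeled pixels contribute nothing. This is a minimal sketch of the general idea, not the authors' code; the `IGNORE` marker, the flat per-pixel layout, and the averaging over labeled pixels are assumptions.

```python
import math

IGNORE = 255  # hypothetical marker for pixels without a scribble label


def partial_cross_entropy(probs, labels, ignore=IGNORE):
    """Mean negative log-likelihood over scribble-labeled pixels only.

    probs[i]  : predicted class probabilities of pixel i
    labels[i] : class index of pixel i, or `ignore` if it is unlabeled
    """
    total, count = 0.0, 0
    for p, y in zip(probs, labels):
        if y == ignore:
            continue  # unlabeled pixels receive no supervision
        total += -math.log(max(p[y], 1e-12))  # clamp to avoid log(0)
        count += 1
    return total / count if count else 0.0


probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
labels = [0, 1, IGNORE]  # the third pixel is not covered by any scribble
loss = partial_cross_entropy(probs, labels)
```

    Because the third pixel is ignored, changing its prediction leaves the loss unchanged, which illustrates why this baseline provides no supervision on unlabeled regions.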

    Analysis on Scribble Attention Module: We first analyze the impact of the proposed scribble attention module. To confirm its effectiveness, we compare the baseline with a variant that adds the proposed scribble attention module. The second row of Table III shows the result of adding the scribble attention module to our baseline model. The overall score increases by more than 3.5% compared with our baseline model (first row of Table III). Fig. 6 shows some qualitative results; the segmentation results with our scribble attention module (second column of Fig. 6) are greatly improved compared with the baseline model. These quantitative and qualitative comparisons demonstrate the usefulness of the proposed scribble attention module for scribble-supervised video object segmentation. To further demonstrate its effectiveness in this scribble-supervised task, we replace our scribble attention module with three recent attention modules, “PSA” [61], “NL” [41], and “CCA” [62]. As shown in Table III, “PSA”, “NL”, and “CCA” all achieve better performance than our baseline model, which demonstrates that the self-attention mechanism is able to capture contextual dependencies and is helpful for the VOS task. Compared with the proposed scribble attention module, however, the overall scores of “PSA”, “NL”, and “CCA” drop from 47.8 to 46.5, 46.6, and 46.9, respectively. The second and third columns of Fig. 6 show the qualitative results of “SA” and “CCA”, respectively. The segmentation results of “SA” are better than those of “CCA”, for example in the human leg area. These quantitative and qualitative comparisons further demonstrate the superior performance of our scribble attention module.

    TABLE II OVERALL SCORE G UNDER DIFFERENT HYPER-PARAMETER SETTINGS. “BASE” INDICATES THE BASE HYPER-PARAMETER SETTINGS USED IN THIS PAPER. UNLISTED VALUES ARE IDENTICAL TO THOSE OF THE BASE HYPER-PARAMETER SETTINGS

    TABLE III ABLATION STUDIES ON YOUTUBE-VOS DATASET

    Scribble-Supervised Loss: Another ablation study examines the proposed scribble-supervised loss by adding it to the baseline model. The sixth row of Table III shows the quantitative result of our baseline with the proposed scribble-supervised loss. The overall score increases by more than 7.1% compared with the baseline model (first row of Table III). In addition, the other four indicators are also greatly improved with the proposed scribble-supervised loss. As shown in Fig. 6, compared with the baseline model (first column of Fig. 6), the segmentation quality (fourth column of Fig. 6) becomes much better when the scribble-supervised loss is used. The baseline model can only segment the general shape of the objects, whereas the model with the scribble-supervised loss can segment their boundaries and details. For example, the baseline model produces rough boundaries for the man and the tennis racket, and our scribble-supervised loss greatly improves these results. These quantitative and qualitative comparisons demonstrate that the proposed scribble-supervised loss can indirectly compensate for the missing information in weak scribble-level labels. To further show the effect of our scribble-supervised loss on boundary areas, we replace it with the dense CRF loss. The overall score decreases from 49.5 to 48.7. As shown in Fig. 6, the boundary segmentation results of the man with the dense CRF loss (fifth column of Fig. 6) are worse than those with the scribble-supervised loss (fourth column of Fig. 6). These quantitative and qualitative comparisons demonstrate that our scribble-supervised loss can dynamically correct inaccurately segmented areas.

    Fig. 6. Qualitative comparison of different variant modules on YouTube-VOS. The first to sixth columns show “pCE”, “pCE” with “SA”, “pCE” with “CCA”, “pCE” with “SS”, “pCE” with “DCRF”, and our complete model, respectively.

    TABLE IV QUANTITATIVE RESULTS ON THE YOUTUBE-VOS VALIDATION SET. “OL” DENOTES ONLINE LEARNING

    We also show the quantitative (eighth row of Table III) and qualitative (sixth column of Fig. 6) results of our complete model, which demonstrate that combining the scribble attention module and the scribble-supervised loss further improves the segmentation results.

    E. Comparison With State-of-the-Art

    To the best of our knowledge, there are no published results for weakly supervised VOS tasks to compare with on YouTube-VOS [54] and DAVIS-2017 [55]. We first compare our approach with state-of-the-art semi-supervised methods on YouTube-VOS [54]. Then, we compare our approach with state-of-the-art unsupervised and semi-supervised methods on DAVIS-2017 [55].

    YouTube-VOS: The validation set of this large-scale dataset contains 474 sequences with 91 categories, 26 of which are not included in the training set. We run our method on this validation set and obtain the scores from the open evaluation server [54]. Table IV shows the comparison with previous state-of-the-art semi-supervised methods on YouTube-VOS [54]. “OL” denotes online learning in the inference stage. Our method achieves an overall score of 51.1% using scribbles as supervision and without online learning. We obtain almost the same overall score as [63]. Our method reaches results comparable to [6], [20], [64]. The gap between our method and [20], [8] is very small. Compared with [7], [9], [54], [58], [65], although there is a certain gap in overall scores, this result is acceptable in view of the huge gap in supervision information. Our method narrows the gap to approaches trained on per-pixel annotations.

    Besides, to the best of our knowledge, the end-to-end recurrent network for video object segmentation (RVOS) [8] is the only method with published results for zero-shot (unsupervised) VOS on YouTube-VOS. Table V shows the quantitative comparison of our scribble-supervised method with RVOS. Although RVOS uses per-pixel annotations as supervision during the training phase, our results are still far better than those of RVOS. One possible reason is that our method is given the scribble annotation of the first frame in the testing phase, which provides certain prior knowledge of the objects.

    DAVIS2017: The dataset comprises 60 videos for training, 30 videos for validation, and 30 videos for testing. Following [9], we train on both the YouTube-VOS and DAVIS2017 training sets to boost performance before evaluation. We evaluate our method on both the validation and test-dev sets, and obtain the test-dev results from the open evaluation server [55]. Table VI shows the quantitative comparison of our method with recent state-of-the-art unsupervised and semi-supervised methods on the DAVIS-2017 validation set. Our method achieves a J&F Mean of 55.7% using scribbles as supervision. Compared with unsupervised methods, our method exceeds [8] by a large margin and exceeds [13] to some extent. The gap between ours and [37] is very small. Compared with semi-supervised methods that use per-pixel annotations as supervision, our method exceeds [63]. Our method reaches results comparable to [15], [66]. The gap between our method and [7], [8] is very small. Table VII shows the quantitative comparison on the DAVIS-2017 test-dev set. Our method achieves a J&F Mean of 41.7% and exceeds most of the unsupervised methods [8], [13], [15]. The large performance gap is due to the difference in supervision information, which is acceptable. Compared with semi-supervised methods that use per-pixel annotations as supervision, our method exceeds the semi-supervised method [63]. The gap between our method and [67], [66] is very small.

    TABLE V QUANTITATIVE RESULTS FOR ZERO-SHOT METHODS ON THE YOUTUBE-VOS DATASET

    TABLE VI QUANTITATIVE RESULTS ON THE DAVIS2017 VALIDATION SET

    TABLE VII QUANTITATIVE RESULTS ON THE DAVIS2017 TEST-DEV SET

    F. Qualitative Results

    Fig. 7 shows the qualitative results of our method on YouTube-VOS [54] and DAVIS-2017 [55]. These videos all contain at least one object and vary in object size, shape, motion, and occlusion. In many cases our method produces high-quality segmentation results. For example, our method successfully segments the challenging motorcycle sequence (first row), in which the objects move very fast. In the bmx-trees sequence (fourth row), the objects move fast and are partly occluded by the tree in some frames, yet our method successfully segments the people and the bike in these occluded areas. In the last row, our method fails to segment some bottom edges of the car in the first frame but recovers from that error in later frames.

    G. Failure Case

    Fig. 7. Qualitative results on videos from YouTube-VOS and DAVIS2017. The first and second rows are from the YouTube-VOS validation set; the latter two are from the DAVIS2017 validation set and test-dev set, respectively.

    Fig. 8. Illustration of two failure cases on DAVIS-2017. The first row is the paragliding-launch video sequence, where the paraglider's ropes are not segmented properly. The second row is the pigs video sequence, where the segmentation results of the later frames are worse.

    While our approach achieves satisfactory results in both the quantitative and qualitative evaluations on YouTube-VOS [54] and DAVIS-2017 [55], we find a few failure cases, as shown in Fig. 8. In the first row, our method fails to segment the paraglider's ropes. This is because our simulation fails to generate scribbles for very small objects, so scribbles for the ropes are not well provided in the first frame during the inference stage. We believe that a good future direction is to generate more reliable scribbles with a better simulation generation algorithm. Other data processing methods, such as [68]–[71], could also be used to handle the problems of the scribbles. In the second row, we find that our method may be less stable when segmenting frames that are far from the first frame. Since the scribbles serve as the guidance information in the first frame, they cover only a limited number of the objects' pixels. It is therefore difficult for the model to propagate this limited guidance information from the first frame to distant frames in very challenging scenes. Such challenging scenarios may be addressed by incorporating a spatiotemporal information encoding module [72], which will be our future direction. In addition, other advanced image segmentation and deep learning techniques [73]–[76] could be used to further improve performance on the failure cases.

    V. CONCLUSIONS AND FUTURE WORK

    In this paper, we propose a scribble-supervised method for video object segmentation. The first contribution of our method is the scribble attention module, which is designed to selectively learn the relationship between the context regions and the query position. Unlike conventional self-attention modules, which consider all the context information, our attention module uses this selective strategy to capture more accurate context information and learn an effective attention map that enhances the contrast between the foreground and the background. Furthermore, to address the missing information in scribble-level annotations, we propose a novel scribble-supervised loss that can compensate for the missing information and dynamically correct inaccurate prediction regions. The proposed method requires far less human labor than per-pixel annotation and narrows the performance gap to approaches trained on per-pixel annotations, achieving a trade-off between annotation efficiency and model performance. Therefore, it is of great significance to further exploit weakly supervised methods [77]–[80] to learn video object segmentation. In the future, we plan to explore a new simulation method to generate more reliable scribbles and to incorporate a spatiotemporal information encoding module [72], [81] to better propagate guidance information throughout the video sequence.
