
    Scribble-Supervised Video Object Segmentation

    IEEE/CAA Journal of Automatica Sinica, 2022, Issue 2

    Peiliang Huang, Junwei Han, Nian Liu, Jun Ren, and Dingwen Zhang

    Abstract—Recently, video object segmentation has received great attention in the computer vision community. Most of the existing methods heavily rely on pixel-wise human annotations, which are expensive and time-consuming to obtain. To tackle this problem, we make an early attempt to achieve video object segmentation with scribble-level supervision, which can greatly reduce the human labor required to collect manual annotations. However, conventional network architectures and learning objective functions do not work well under this scenario because the supervision information is highly sparse and incomplete. To address this issue, this paper introduces two novel elements to learn the video object segmentation model. The first one is the scribble attention module, which captures more accurate context information and learns an effective attention map to enhance the contrast between foreground and background. The other one is the scribble-supervised loss, which can optimize the unlabeled pixels and dynamically correct inaccurately segmented areas during the training stage. To evaluate the proposed method, we implement experiments on two video object segmentation benchmark datasets, YouTube-video object segmentation (VOS) and densely annotated video segmentation (DAVIS)-2017. We first generate the scribble annotations from the original per-pixel annotations. Then, we train our model and compare its test performance with the baseline models and other existing works. Extensive experiments demonstrate that the proposed method works effectively and approaches the methods requiring dense per-pixel annotations.

    I. INTRODUCTION

    VIDEO object segmentation (VOS) aims at segmenting one or multiple foreground objects from a video sequence, and it has been actively studied for computer vision applications such as object tracking [1], video summarization [2], and action recognition [3]. Segmenting objects in videos is partially inspired by research on semantic segmentation, e.g., semantic segmentation in urban scenes [4], [5]. However, VOS needs to simultaneously explore both spatial and temporal information.

    Based on the level of human intervention in test scenarios, existing VOS approaches can be divided into two main categories: semi-supervised approaches [6]–[10] and unsupervised approaches [11]–[15]. The semi-supervised VOS approaches need the ground-truth annotation of the objects in the first frame of each test video sequence, while the unsupervised approaches do not need such annotation. However, both the semi-supervised and unsupervised approaches need per-pixel annotations of each video frame in the training stage, and per-pixel annotation on large-scale video data is prohibitively expensive and hard to obtain. In addition, due to the lack of prior knowledge of the primary objects in the first frame, the unsupervised approaches cannot select the objects of interest flexibly.

    In this paper, to achieve a trade-off between annotation efficiency and model performance, we propose a novel method for video object segmentation that needs only scribble-level supervision. Specifically, our method requires scribble annotations on each video frame during training, while it only requires a scribble annotation on the first video frame during testing. Consequently, compared to conventional semi-supervised and unsupervised VOS methods, our scribble-supervised VOS method has two advantages. First, as shown in Figs. 1(a) and 1(b), unlike per-pixel annotation, drawing a scribble greatly reduces the cost of labeling. Second, the annotated scribble on the first frame of each test video can provide informative guidance for segmenting the desired objects, which is beyond the capacity of the unsupervised VOS approaches.

    Implementing the scribble-supervised VOS process is very challenging. As shown in Fig. 1(a), per-pixel annotations allow fully supervised VOS methods to achieve reliable segmentation results since they provide supervision on all aspects of the targeted objects, including structure, position, boundary detail, etc. However, scribble-level annotations (see Fig. 1(b)) only provide a limited number of labeled pixels inside the objects, which are highly sparse, and most of the pixel information is missing. In this case, the missing supervision information limits the performance of a deep model trained on scribble-level annotations. Directly using conventional network architectures and objective functions to learn video object segmentation models with these sparse scribbles may lead to segmentation results with poor boundaries and details, as shown in Fig. 1(c).

    To achieve high-quality video object segmentation results, we introduce a novel scribble attention module and a scribble-supervised loss to build the video object segmentation model.

    Fig. 1. The illustration of different annotations and the results of different methods supervised by scribbles. (a) Manual annotation cost with per-pixel annotations. (b) Manual annotation cost with scribble-level annotations. (c) Baseline model: trained only on scribbles with the partial cross-entropy loss [16]. (d) Ours.

    First, we design a novel scribble attention module that learns an effective attention map to enhance the contrast between the foreground and background. Most existing self-attention models utilize global context regions to construct contextual features, where the information at every location is aggregated. However, not all contextual information is reliable for the final decision. Instead of learning dense pairwise relationships globally, the scribble attention module selectively learns the relationship between reliable foreground positions and the query position with the help of the foreground scribble annotation. Specifically, the scribble attention module computes the response at a query position as a weighted sum of the features at the scribble positions in the input feature maps. Intuitively, if the query position belongs to the background region, the pairwise affinities between it and all scribble positions are relatively low, and so is the aggregated response of this query position. On the contrary, if the query position belongs to the foreground region, the pairwise affinities between it and all scribble positions are relatively high, and so is the aggregated response. Therefore, the scribble attention module can capture more accurate context information and learn an effective attention map to enhance the contrast between the foreground and background.

    Second, we propose a novel scribble-supervised loss which optimizes both the labeled and unlabeled pixels by making full use of pixel positions and red-green-blue (RGB) colors. This scribble-supervised loss can be seen as a regularized loss and cooperates with the partial cross-entropy loss [16] to optimize the training process. Specifically, we leverage the partial cross-entropy (pCE) loss [16] on the scribble-labeled pixels and the scribble-supervised loss on both the labeled and unlabeled pixels, as in “shallow” segmentation [16], [17]. Our scribble-supervised loss penalizes disagreements between pairs of pixels and softly enforces output consistency among all pixels based on predefined pairwise affinities. These pairwise affinities are based on the dense conditional random field (CRF) [18], which makes full use of pixel positions and RGB colors. By adding this regularization term, all pixel information is used throughout the training process, which compensates for the missing information in scribble-level annotations. Furthermore, we dynamically divide the predicted mask into three confidence regions and assign different scribble-supervised loss weights to these three regions. Thus, we can achieve better segmentation of the object's boundary by dynamically correcting the inaccurately segmented areas throughout the training process.

    The contributions of this paper can be summarized as follows:

    1) We propose a novel weakly-supervised VOS method which trains the model with only scribble-level annotations. To the best of our knowledge, this is one of the earliest VOS frameworks working under scribble-level supervision.

    2) We design a novel scribble attention module which can capture more accurate context information and learn an effective attention map to enhance the contrast between foreground and background.

    3) We propose the scribble-supervised loss and integrate it into the training process to compensate for the missing information of scribble-level annotations and to dynamically correct inaccurately segmented areas.

    4) We evaluate the proposed algorithm on two benchmark datasets quantitatively and qualitatively. The extensive experiments show that the proposed algorithm effectively reduces the gap to approaches trained with per-pixel annotations.

    The remainder of this paper is organized as follows. A brief review of the related works on video object segmentation, scribble-supervised semantic segmentation, and self-attention modules is presented in Section II. Then, the problem formulation and detailed explanations of the proposed method are presented in Section III, followed by the experimental results in Section IV. Finally, the conclusions are drawn in Section V.

    II. RELATED WORKS

    We start by providing an overview of related works on video object segmentation, followed by an overview of scribble-supervised semantic segmentation, and finally review several related works on self-attention models.

    A. Video Object Segmentation

    Semi-Supervised Methods: In the semi-supervised setting, the ground truth masks of the foreground objects are given for the first frame of the video sequence, and the goal is to propagate the object masks of the given objects to the rest of the video sequence. Perazzi et al. [6] combined offline and online learning strategies, where the former produces a refined mask from the estimation on previous frames, and the latter captures the appearance of the specific object instance. Caelles et al. [7] proposed a one-shot video object segmentation (OSVOS) method, which pre-trains a convolutional network for foreground-background segmentation and fine-tunes it on the first-frame ground truth of the video sequence at test time. After that, Maninis et al. [19] improved OSVOS with semantic information from an instance segmentation network. Voigtlaender and Leibe [20] extended OSVOS by updating the network online using training examples selected based on the confidence of the network and the spatial configuration.

    These online methods greatly improve segmentation results but sacrifice running efficiency. To address the time cost of the fine-tuning stage, some recent works achieved a better runtime and satisfactory segmentation performance without fine-tuning. Chen et al. [21] proposed a blazingly fast video object segmentation method with pixel-wise metric learning (PML), which uses a pixel-wise embedding space learned with a triplet loss together with a nearest-neighbor classifier. Inspired by PML, Voigtlaender et al. [10] proposed a fast end-to-end embedding learning method for video object segmentation, which uses a semantic pixel-wise embedding mechanism together with global and local matching mechanisms to transfer information from the first frame and the previous frame of the video to the current frame. Johnander et al. [9] proposed a generative appearance model (A-GAME) which learns a powerful representation of the target and background appearance in a single forward pass. Although semi-supervised VOS methods have achieved tremendous progress, they rely on costly per-pixel annotations.

    Unsupervised Methods: In the unsupervised setting, the ground truth masks of the foreground objects are not given at all. Early unsupervised VOS approaches mainly leveraged object proposals [11], [21]–[25], analyzed long-term motion information (trajectories) [12], [26]–[29], or utilized saliency information [30]–[33] to infer the target. Recently, with the success of deep learning, many convolutional neural network (CNN)-based models have been proposed. For example, Jain et al. [34] proposed a two-stream architecture which fuses motion and appearance in a unified framework for segmenting generic objects in videos. Cheng et al. [35] proposed an end-to-end trainable SegFlow network which simultaneously predicts optical flow and object segmentation in video sequences; the information of optical flow and object segmentation is propagated bidirectionally in this unified framework, and the pre-trained segmentation network needs to be fine-tuned for specific objects. Li et al. [36] proposed an instance embedding network to produce an embedding vector for each pixel, where the embedding vector can identify all pixels belonging to the same foreground; they then combined it with motion-based bilateral networks for identifying the background. Song et al. [13] used a video salient object detection method to segment objects, which fine-tunes the pre-trained segmentation network to extract spatial saliency features and trains a ConvLSTM to capture temporal dynamics. Lu et al. [14] proposed a novel co-attention siamese network to address the unsupervised video object segmentation task from a holistic view; this method emphasizes the importance of the inherent correlation among video frames and incorporates a global co-attention mechanism. Wang et al. [37] conducted a systematic study on the role of visual attention in the unsupervised VOS task, which decouples unsupervised VOS into two sub-tasks: attention-guided object segmentation in the spatial domain and unsupervised VOS-driven dynamic visual attention prediction in the spatiotemporal domain. However, due to the lack of prior knowledge about the primary objects, it is difficult for these unsupervised VOS methods to correctly distinguish the primary objects from the complex background in real-world scenarios. Additionally, like the semi-supervised methods, they also need fully annotated masks in the training stage, which is expensive and time-consuming.

    Our method uses scribble-level annotations as the supervision for each video frame in the training stage. We also use the scribble-level annotation of the first frame as the object reference mask in the testing stage. This greatly reduces the cost of labeling during both the training and testing phases. Furthermore, it avoids the lack of prior knowledge about the primary objects.

    B. Scribble-Supervised Semantic Segmentation

    To avoid requiring expensive pixel-wise labels, semantic segmentation methods attempt to learn segmentation from low-cost scribble-level annotations [16], [17], [38], [39]. Lin et al. [38] first proposed scribble-supervised convolutional networks for semantic segmentation. Tang et al. [16] combined a partial cross-entropy loss on labeled pixels with a normalized cut loss for unlabeled pixels. After that, Tang et al. [17] further explored the role of different regularized losses for scribble-supervised segmentation tasks. Wang et al. [39] proposed a boundary perception guided method, which employs class-agnostic edge maps as supervision to guide the segmentation network. To the best of our knowledge, there is no published work on scribble-supervised VOS. Compared with scribble-supervised semantic segmentation, the scribble-supervised VOS task is more complex: it is still costly to manually draw scribble-level labels for each object in a large-scale video dataset, and the model has to detect and segment the objects along the video sequence depending only on the scribble-level label in the first frame. In this paper, we simulate the human annotations on the original per-pixel annotated datasets, so the time cost of manual labeling is further reduced. Furthermore, we propose a novel scribble attention module and a scribble-supervised loss to build our video object segmentation model.

    C. Self-Attention Mechanisms in Neural Networks

    A self-attention module computes the response at a position as a weighted sum of all the features in an embedding space, which can capture long-range information. Vaswani et al. [40] first proposed the self-attention mechanism to draw global dependencies between input and output and applied it to machine translation. This attention mechanism has since received much attention in the computer vision field. Wang et al. [41] proposed a non-local network for capturing long-range dependencies, which computes the response at a position as a weighted sum of the features at all positions. Inspired by [41], Yi et al. [42] proposed an improved non-local operation to fuse temporal information from consecutive frames for video super-resolution, which was designed to avoid complex motion estimation and motion compensation and to make better use of spatiotemporal information. Fu et al. [43] proposed a dual attention network to model the semantic interdependencies in the spatial and channel dimensions, respectively. Zhang et al. [44] proposed an aggregated co-occurrent feature module for semantic segmentation, which learns a fine-grained spatially invariant representation to capture co-occurrent context information across the scene. The difference between the co-occurrent feature module and the non-local network is that the co-occurrent feature module learns a prior distribution conditioned on the semantic context [44].

    Our scribble attention module is inspired by the success of the above attention modules. The key difference is that the scribble attention module computes the response at a position as a weighted sum of the features at the scribble positions labeled inside the objects, by multiplying the scribble-level ground truth with the feature maps.

    III. METHOD

    This work aims to design an algorithm for weakly supervised video object segmentation with scribbles as supervision. We introduce the mask-propagation module [9] to propagate the object appearance and the appearance module [9] to provide discriminative posterior class probabilities. To achieve high-quality segmentation results with sparsely annotated scribbles, we add a scribble attention module to the two backbones to enhance the contrast between the foreground and background, and propose a scribble-supervised loss that takes both the labeled and unlabeled pixels into account. In this section, we first present an overview of our proposed VOS method in Subsection A. Then, we discuss the mask-propagation module and the appearance module in Subsection B. In Subsections C and D, we introduce in detail the scribble attention module and the scribble-supervised loss, respectively. Finally, we elaborate on the generation process of scribble annotations in Subsection E. For easy retrieval, we summarize the symbols used in this paper in Table I.

    A. Overview

    The flowchart of the complete VOS pipeline in this work is shown in Fig. 2, which can be divided into two stages: the training stage and the test stage. In the training stage, we generate the scribble annotations for all the frames; the detailed generation process is described in Subsection E. The scribble annotations of each frame are used to train our scribble-supervised video object segmentation (SSVOS) model. In the test stage, we only generate the scribble annotation for the first video frame and predict the segmentation mask for each frame. The proposed SSVOS model is illustrated in Fig. 3. Given a video frame I_t, the features are first extracted with a backbone network. Then, they are input to the mask-propagation module and the appearance module. The features and the scribble annotation of the initial frame I_0 are input to the mask-propagation module to provide a rough prior. The coarse segmentation mask of frame I_{t-1} is input to the mask-propagation module to propagate the object's appearance information from frame I_{t-1} to frame I_t. The coarse segmentation mask and the mean and covariance of frame I_{t-1} are used to update the appearance module. The two features from the mask-propagation module and the appearance module are combined in the fusion module, which comprises two convolutional layers. We first concatenate these two features along the channel dimension to obtain aggregated feature maps. Then, we fuse the aggregated features to generate a coarse mask encoding by using two convolution layers with 3×3 and 1×1 convolutional kernels, respectively. The coarse mask encoding is input to a convolution layer with 3×3 convolutional kernels to generate a coarse segmentation mask. This mask is further used to update the mask-propagation module to propagate the object's appearance information from frame I_t to frame I_{t+1}. The coarse segmentation mask is also used to update the appearance module to provide posterior probabilities for frame I_{t+1}. Finally, the coarse mask encoding from the fusion module is passed through the upsampling module and a convolution layer with 3×3 convolutional kernels to produce the final segmentation mask.
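
    To make the fusion step concrete, the following is a minimal PyTorch sketch of the fusion module and the coarse mask head described above. The channel widths and class/variable names are illustrative assumptions, not values specified in the paper.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Fuses mask-propagation and appearance features into a coarse mask encoding.

    A minimal sketch: the channel widths (256/128) are illustrative assumptions.
    """
    def __init__(self, in_channels=256, mid_channels=128):
        super().__init__()
        # Two convolution layers (3x3 then 1x1), as described in the overview.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * in_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=1),
        )
        # 3x3 convolution producing the coarse segmentation mask (one channel).
        self.coarse_head = nn.Conv2d(mid_channels, 1, kernel_size=3, padding=1)

    def forward(self, prop_feat, app_feat):
        # Concatenate the two feature maps along the channel dimension and fuse.
        x = torch.cat([prop_feat, app_feat], dim=1)
        encoding = self.fuse(x)                   # coarse mask encoding
        coarse_mask = self.coarse_head(encoding)  # coarse segmentation logits
        return encoding, coarse_mask
```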

    TABLE I: A SUMMARY OF SYMBOLS USED IN THIS PAPER

    Fig. 2. Flowchart of the complete video object segmentation in this work.

    Fig. 3. The architecture of the proposed scribble-supervised video object segmentation model (SSVOS). The scribble attention (SA) module is only embedded into the third block of the backbone network. The features and the scribble annotation of the initial frame I_0 are first concatenated with the features of the current frame I_t and the coarse segmentation mask of the previous frame I_{t-1}. Then, the concatenated features are input to the mask-propagation module. The features of I_t and the coarse segmentation mask of I_{t-1} are input to the appearance module. Furthermore, the mean and covariance of I_{t-1} are also input to the appearance module. The two features from the mask-propagation module and the appearance module are fused in the fusion module to generate a coarse mask encoding. This coarse mask encoding is input to a convolution layer to generate a coarse segmentation mask which is then used to update the mask-propagation module and the appearance module for the next frame I_{t+1}. The coarse mask encoding is then refined by the upsampling module and a convolution layer to generate the final segmentation. The pCE loss and the auxiliary pCE loss only consider the labeled pixels, while the scribble-supervised (SS) loss considers both the labeled and unlabeled pixels. “Conv” indicates “convolution”, “r” indicates the dilation rate [45], and “Cov” indicates covariance.

    B. Mask-Propagation and Appearance Modules

    Inspired by A-GAME [9], we use a mask-propagation module to propagate the object's appearance and keep temporal-spatial consistency. As shown in Fig. 3, the mask-propagation module consists of three convolutional layers, where the first and third layers are convolution layers with 1×1 and 3×3 convolutional kernels, respectively. The middle layer is a dilation pyramid [45], which has three 3×3 convolution branches with dilation rates of 1, 3, and 6, respectively. In the first frame I_1, the scribble annotation of the initial frame I_0 is used as input to the mask-propagation module to propagate the object's appearance from frame I_0 to frame I_1. In the subsequent frames, where the scribble annotations are not available, the coarse segmentation mask of I_{t-1} is used as the input to update the mask-propagation module. Thus, the mask-propagation module can propagate the object's appearance from frame I_{t-1} to frame I_t. As the displacements between adjacent frames are very small, temporal-spatial consistency can be kept, and the mask-propagation module can provide an accurate foreground prior estimation for frame I_t.
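
    As a concrete reference, below is a minimal PyTorch sketch of the mask-propagation layout described above (a 1×1 convolution, a dilation pyramid of 3×3 branches with rates 1, 3, and 6, and a final 3×3 convolution). The channel widths and the summation used to merge the pyramid branches are assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class MaskPropagationModule(nn.Module):
    """Mask-propagation sketch: 1x1 conv -> dilation pyramid (rates 1, 3, 6) -> 3x3 conv.

    Channel widths and the summation of the pyramid branches are illustrative choices.
    """
    def __init__(self, in_channels, mid_channels=128):
        super().__init__()
        # The prior mask (scribble map of I_0 or coarse mask of I_{t-1}) adds one channel.
        self.reduce = nn.Conv2d(in_channels + 1, mid_channels, kernel_size=1)
        # Three 3x3 branches with dilation rates 1, 3, and 6.
        self.pyramid = nn.ModuleList([
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=r, dilation=r)
            for r in (1, 3, 6)
        ])
        self.out = nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1)

    def forward(self, features, prior_mask):
        # features: (B, C, H, W); prior_mask: (B, 1, H, W).
        x = torch.cat([features, prior_mask], dim=1)
        x = self.reduce(x)
        x = sum(branch(x) for branch in self.pyramid)  # merge the pyramid branches
        return self.out(x)
```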

    Furthermore, we also use an appearance module [9] to return the posterior class probabilities at each image location. These posterior class probabilities form an extremely strong cue for foreground/background discrimination. Following [9], let us denote the features extracted from the video frames as {M_p}. The feature M_p at each pixel location p is a D-dimensional vector. We model these feature vectors as i.i.d. samples drawn from the underlying distribution:

    where z_p is a discrete random variable which assigns the observation M_p to a specific component z_p = k. We use a uniform prior Ψ(z_p = k) = 1/K for z_p, where K is the number of components and each component exclusively models the feature vectors of either the foreground or the background. In (1), each class-conditional density is a multi-variate Gaussian with mean μ_k and covariance matrix Σ_k:

    In the first frame, the module parameters are inferred from the extracted features and the scribble annotation of the initial frame I_0. In subsequent frames, we update the module using the coarse segmentation predictions of the network as soft class labels. We compute the appearance module parameters of frame I_t as follows:

    Given the module parameters τ_{t-1} computed in the previous frame I_{t-1}, our appearance module can predict the posterior map of frame I_t as follows:
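
    The NumPy sketch below follows the textual description of this appearance model: the coarse mask predictions act as soft class labels for estimating a mean and covariance per component, and the posterior over components is computed under a uniform prior. The weighted update formulas, the regularization term, and the function names are assumed forms consistent with that description rather than the paper's exact equations.

```python
import numpy as np

def update_appearance_model(features, soft_labels, eps=1e-5):
    """Estimate per-component mean and covariance from soft class assignments.

    features:    (N, D) feature vectors M_p, one per pixel location p.
    soft_labels: (N, K) soft assignments, e.g. coarse mask predictions used as
                 soft class labels (foreground / background components).
    """
    N, D = features.shape
    K = soft_labels.shape[1]
    means, covs = [], []
    for k in range(K):
        w = soft_labels[:, k]
        w_sum = w.sum() + eps
        mu = (w[:, None] * features).sum(axis=0) / w_sum
        centered = features - mu
        sigma = (w[:, None] * centered).T @ centered / w_sum
        sigma += eps * np.eye(D)  # small diagonal term for numerical stability
        means.append(mu)
        covs.append(sigma)
    return means, covs

def posterior_map(features, means, covs):
    """Posterior p(z_p = k | M_p) under a uniform prior over the K components."""
    K = len(means)
    log_liks = []
    for k in range(K):
        diff = features - means[k]
        inv = np.linalg.inv(covs[k])
        _, logdet = np.linalg.slogdet(covs[k])
        maha = np.einsum('nd,de,ne->n', diff, inv, diff)
        log_liks.append(-0.5 * (maha + logdet))  # shared constants cancel below
    log_liks = np.stack(log_liks, axis=1)        # (N, K)
    log_liks -= log_liks.max(axis=1, keepdims=True)
    liks = np.exp(log_liks)
    return liks / liks.sum(axis=1, keepdims=True)
```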

    C. Scribble Attention Module

    As illustrated in Fig. 4, given an input feature X ∈ R^{C×H×W}, we first transform it into three new feature maps Φ(X), Θ(X), and Ω(X), where {Φ(X), Θ(X), Ω(X)} ∈ R^{C/2×H×W}. We resize the scribble-level ground truth to R^{1×H×W}. Then, we perform an element-wise multiplication between Φ(X) and the scribble-level ground truth. Here, the foreground scribble pixels are set to 1 and the rest are set to 0. After that, we calculate the attention as

    where A_{ji} indicates the extent to which the model attends to the i-th position when synthesizing the j-th position, δ_fs denotes the foreground scribble pixel positions, and T indicates the transpose operation. Φ(X_i) = W_φ X_i and Θ(X_j) = W_θ X_j, where W_φ and W_θ are two weight matrices learned via 1×1 convolutions. Then, the output of the attention layer at the j-th position can be represented as:

    where Ω(X_i) = W_ω X_i and W_ω is a weight matrix learned via a 1×1 convolution. Finally, we further multiply O_j by a scale parameter γ and perform an element-wise sum with the input feature X to obtain the final output Z ∈ R^{H×W×C} as follows:

    where γ is initialized as 0 and gradually learned to assign more weight [43], [46]. The output feature Z_j of the scribble attention module at each position is a weighted sum of the features at the positions of the foreground scribbles and the original feature.

    The scribble attention module computes the response at a position as a weighted sum of the features at the scribble positions labeled inside the objects. Since every pixel in the scribble belongs to the foreground object, the module can capture more accurate context information and learn an effective attention map to enhance the contrast between foreground and background, which plays an important role in the segmentation task.
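
    A minimal PyTorch sketch of the scribble attention module is given below. Two details are assumptions made for illustration: the value branch Ω keeps C channels so that the residual sum with X is well defined, and the restriction of the weighted sum to the scribble positions δ_fs is realized by masking the softmax to those positions (which presumes at least one scribble pixel is present in the mask).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScribbleAttention(nn.Module):
    """Sketch of the scribble attention (SA) module (illustrative, not the authors' code)."""
    def __init__(self, channels):
        super().__init__()
        self.phi = nn.Conv2d(channels, channels // 2, kernel_size=1)    # keys
        self.theta = nn.Conv2d(channels, channels // 2, kernel_size=1)  # queries
        self.omega = nn.Conv2d(channels, channels, kernel_size=1)       # values (assumed C channels)
        self.gamma = nn.Parameter(torch.zeros(1))  # initialized to 0, gradually learned

    def forward(self, x, scribble_mask):
        # x: (B, C, H, W); scribble_mask: (B, 1, H, W) float map, 1 on foreground scribbles.
        b, c, h, w = x.shape
        mask = F.interpolate(scribble_mask, size=(h, w), mode='nearest')
        phi = (self.phi(x) * mask).flatten(2)   # keys, zeroed outside the scribbles
        theta = self.theta(x).flatten(2)        # queries at every position
        omega = self.omega(x).flatten(2)        # values

        energy = torch.bmm(theta.transpose(1, 2), phi)        # (B, HW, HW): query j vs key i
        key_mask = mask.flatten(2)                            # (B, 1, HW)
        # Restrict attention to scribble positions i in delta_fs.
        energy = energy.masked_fill(key_mask == 0, float('-inf'))
        attn = torch.softmax(energy, dim=-1)                  # A_{ji}

        out = torch.bmm(omega, attn.transpose(1, 2))          # O_j = sum_i A_{ji} Omega(X_i)
        out = out.view(b, c, h, w)
        return self.gamma * out + x                           # Z = gamma * O + X
```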

    Fig. 4. Illustration of the proposed scribble attention module.

    D. Scribble-Supervised Loss

    As shown in Fig. 3, the annotated scribbles are sparsely provided, and the pixels that are not annotated are considered unknown. We use the (auxiliary) partial cross-entropy loss [16] to consider the partially labeled pixels, which effectively ignores unknown regions. This partial cross-entropy loss can be seen as a sampling of the cross-entropy loss with per-pixel masks by rewriting it as follows:

    where q(i) is a binary indicator which equals 1 if pixel i is labeled on the foreground scribble and 0 otherwise, P(i) is the softmax output probability of the network, and δ_fs denotes the scribble-labeled pixels.
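
    In practice, the pCE loss amounts to a standard cross-entropy evaluated only on the labeled pixels. The sketch below, which assumes unlabeled pixels are marked with an ignore index, is one minimal way to realize this in PyTorch; it is not the authors' implementation.

```python
import torch.nn.functional as F

def partial_cross_entropy(logits, scribble_labels, ignore_index=255):
    """Partial cross-entropy (pCE): cross-entropy evaluated only on scribble-labeled
    pixels; unlabeled (unknown) pixels are skipped.

    logits:          (B, num_classes, H, W) raw network outputs.
    scribble_labels: (B, H, W) integer labels, with `ignore_index` marking
                     unlabeled pixels (the value 255 is an assumption).
    """
    return F.cross_entropy(logits, scribble_labels, ignore_index=ignore_index)
```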

    So far, all scribble-labeled pixels have been used for training with the partial cross-entropy loss. However, compared with per-pixel annotations, scribbles are often drawn in the interior of the objects and lack definite object boundary information and confident background regions. Therefore, with only limited pixels as supervision, object structure and boundary details cannot be easily inferred. Furthermore, since scribble-level annotations label only very few pixels, training CNNs on scribbles alone would lead to overfitting. The proposed scribble-supervised loss optimizes the training process on both the scribble-labeled and unlabeled pixels, which compensates for the missing information at unlabeled pixels. Thus, the scribble-supervised loss can be seen as a regularized loss and can prevent overfitting. Our scribble-supervised loss penalizes disagreements between pairs of pixels and softly enforces output consistency among all pixels based on predefined pairwise affinities. These pairwise affinities are based on the dense CRF [18], which makes full use of pixel positions and RGB colors. To this end, we first define the energy formulation between pixel i and pixel j based on [47], [48] as follows:

    where c_a and c_b are the class labels, P_{c_a}(i) and P_{c_b}(j) are the soft-max outputs of the network at pixel i and pixel j, respectively, and g(i, j) is a Gaussian kernel bandwidth filter which considers both pixel spatial positions and RGB colors and can be represented as follows:

    where N is the normalization factor, d and r are the pixel spatial position and RGB color, respectively, and δ_d and δ_r are the hyper-parameters that control the bandwidths of the Gaussian kernels. We use the Potts model [49] as a discontinuity-preserving function to penalize disagreements between pairs of pixels and encourage pixels in the same region to have equal labels. Assuming that the segmentation results P are restricted to binary class indicators P ∈ {0, 1}, the standard Potts model [49] can be represented via Iverson brackets [·], as on the left-hand side below:

    where g(i, j) is a matrix of pairwise discontinuity costs, i.e., an affinity matrix. The right-hand side above is a particularly straightforward quadratic relaxation of the Potts model [49] that works for relaxed P ∈ [0, 1] corresponding to a typical soft-max output of CNNs [17].

    Inspired by [48], we use a modified quadratic relaxation of the Potts model [49] to simplify the energy formulation:

    The affinity matrix g(i, j) is based on a dense Gaussian kernel which makes full use of pixel positions and RGB colors. We can see that the Potts model [49] penalizes disagreements between pairs of pixels based only on the spatial and appearance information of the pixels, and there is no need for any ground-truth supervision.

    The dense CRF loss [17], [47], [48] can be represented as follows:
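
    To make the affinity and the relaxed Potts regularizer concrete, a naive NumPy sketch is given below. Both the kernel normalization and the quadratic relaxation used here are assumed forms consistent with the description above, not necessarily the paper's exact equations, and a practical implementation would rely on fast high-dimensional filtering rather than an explicit O(N²) affinity matrix.

```python
import numpy as np

def gaussian_affinity(positions, colors, delta_d, delta_r):
    """Dense Gaussian kernel g(i, j) over pixel positions d and RGB colors r.

    positions: (N, 2) pixel coordinates; colors: (N, 3) RGB values.
    delta_d, delta_r: bandwidths of the spatial and color kernels.
    """
    d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    r2 = ((colors[:, None, :] - colors[None, :, :]) ** 2).sum(-1)
    g = np.exp(-d2 / (2 * delta_d ** 2) - r2 / (2 * delta_r ** 2))
    np.fill_diagonal(g, 0.0)   # no self-affinity
    return g / g.sum()         # assumed normalization factor

def relaxed_potts(prob_fg, g):
    """One common quadratic relaxation of the Potts energy for soft outputs.

    prob_fg: (N,) soft foreground probabilities in [0, 1]. Pairs of similar
    pixels (large g(i, j)) that receive different predictions are penalized.
    """
    p = prob_fg
    disagreement = np.outer(p, 1.0 - p) + np.outer(1.0 - p, p)
    return float((g * disagreement).sum())
```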

    A well-known difficulty in segmentation is the accurate delineation of object boundaries. In this task, since scribbles provide only limited pixels as supervision, it is even more difficult to obtain accurate boundary segmentation masks. In order to better segment the boundary regions, we divide the predicted mask into three subregions according to the predicted mask scores. Specifically, we identify pixels with scores larger than 0.3 as reliable foreground pixels and those smaller than 0.05 as reliable background pixels; the remaining pixels are classified as unreliable areas. Then, we assign different loss weights to these areas. Based on this idea, we design the scribble-supervised (SS) loss:

    where w(i) is an adaptive region weight and can be represented as follows:

    where β controls the weight of the scribble-supervised loss on unreliable regions.
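
    The sketch below illustrates how the adaptive region weights and the joint objective could be assembled from the quantities described above. The weight of 1 on the reliable regions and the simple per-pixel weighting of the pairwise term are assumptions; the paper's exact definition of w(i) may differ.

```python
import numpy as np

def adaptive_region_weights(prob_fg, beta, hi=0.3, lo=0.05):
    """Adaptive region weight w(i) for the scribble-supervised (SS) loss.

    Pixels with score > 0.3 are reliable foreground, < 0.05 reliable background,
    and the rest unreliable; beta controls the weight on unreliable regions.
    Assigning weight 1 to reliable regions is an assumption.
    """
    reliable = (prob_fg > hi) | (prob_fg < lo)
    return np.where(reliable, 1.0, beta)

def joint_loss(pce_loss, ss_pixel_terms, weights, lam):
    """Joint objective: partial cross-entropy plus the weighted SS regularizer.

    ss_pixel_terms: per-pixel contributions of the pairwise SS term (e.g. row
    sums of the affinity-weighted disagreements); lam: the loss weight lambda.
    """
    ss = float((weights * ss_pixel_terms).sum())
    return pce_loss + lam * ss
```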

    For each prediction branch, we deploy a new joint loss function combining the partial cross-entropy loss with the designed scribble-supervised loss:

    where λ controls the weight of the scribble-supervised loss.

    E. Scribble Simulation Generation

    Compared with per-pixel labels, scribbles greatly reduce the expensive cost of annotation. However, it is still costly to manually draw scribble-level labels for each object in a large-scale video dataset. In order to reduce the annotation cost of video tasks, Liu et al. [50] designed an interactive computer vision system to allow a user to efficiently annotate motion. Karasev et al. [51] proposed a frame selection method for video annotation, which naturally allows for uncertainty in the measurements and is thus applicable to both automated and manual annotation. Luo et al. [52] used an active learning framework to obtain higher segmentation accuracy with less user annotation. Differently from the above methods, we are inspired by [53] to simulate scribble-level annotations on each video frame from two available per-pixel annotated datasets, YouTube-VOS [54] and densely annotated video segmentation (DAVIS)-2017 [55]. Since all video frames in these two datasets are provided with per-pixel annotations, we can obtain the scribbles of all video frames by the proposed scribble simulation generation process. For each video frame, we use morphological skeletonization [56] to automatically generate realistic foreground object scribbles and randomly generate a rectangular box as the background scribble. As shown in Fig. 5, the whole process consists of four steps: 1) Foreground skeletonization: we skeletonize the per-pixel annotated masks to obtain foreground scribbles by using a fast implementation of the thinning algorithm [56]. 2) Morphological dilation: the scribbles directly obtained through the thinning algorithm in Step 1 are a single pixel wide, and such single-pixel-wide scribbles differ from manually annotated scribbles. Therefore, we use a morphological dilation algorithm to obtain five-pixel-wide scribbles. Then, we multiply the original per-pixel annotated mask element-wise with the obtained scribble-level mask to eliminate the pixels which are extended into the background by the morphological dilation. We use these scribbles as the final foreground object scribbles. 3) Background generation: we randomly generate a five-pixel-wide rectangular box as the background scribble in the background area around the foreground scribbles. When the foreground object touches an edge of the image, the rectangular edge corresponding to that image edge is not needed, e.g., the background scribble in Fig. 3. 4) We label the areas outside of the foreground and background scribble regions as unknown areas.
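
    A minimal Python sketch of this simulation pipeline is given below, using skimage's skeletonize as a stand-in for the thinning algorithm of [56]. The dilation radius, the margin of the random background box, and the label encoding are illustrative assumptions.

```python
import numpy as np
from skimage.morphology import skeletonize, binary_dilation, disk

def simulate_scribbles(mask, rng=None):
    """Simulate foreground/background scribbles from a per-pixel object mask.

    mask: (H, W) boolean per-pixel annotation of one object.
    Returns an (H, W) uint8 map: 1 = foreground scribble, 2 = background
    scribble, 0 = unknown.
    """
    rng = rng or np.random.default_rng()
    h, w = mask.shape

    # 1) Foreground skeletonization of the per-pixel mask.
    skel = skeletonize(mask)
    # 2) Morphological dilation to roughly five-pixel-wide scribbles, then keep
    #    only pixels inside the original mask (element-wise product).
    fg_scribble = binary_dilation(skel, disk(2)) & mask

    # 3) Background generation: a five-pixel-wide rectangular box around the
    #    foreground, clipped to the image; edges touching the border are cut off.
    ys, xs = np.nonzero(mask)
    margin = int(rng.integers(10, 30))
    top, bottom = max(ys.min() - margin, 0), min(ys.max() + margin, h - 1)
    left, right = max(xs.min() - margin, 0), min(xs.max() + margin, w - 1)
    bg_scribble = np.zeros_like(mask)
    bg_scribble[top:top + 5, left:right + 1] = True
    bg_scribble[max(bottom - 4, 0):bottom + 1, left:right + 1] = True
    bg_scribble[top:bottom + 1, left:left + 5] = True
    bg_scribble[top:bottom + 1, max(right - 4, 0):right + 1] = True
    bg_scribble &= ~mask  # keep the box strictly in the background

    # 4) Everything else remains unknown.
    scribbles = np.zeros((h, w), dtype=np.uint8)
    scribbles[fg_scribble] = 1
    scribbles[bg_scribble] = 2
    return scribbles
```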

    Fig. 5. Illustration of scribble simulation generation. From left to right: RGB images, per-pixel ground-truth annotations, foreground scribbles, dilated foreground scribbles, and the final scribble annotations (best viewed with magnification).

    IV. EXPERIMENTS

    The proposed algorithm is evaluated on two recent VOS benchmarks: the YouTube-VOS [54] and DAVIS-2017 [55] datasets. We first conduct ablation studies on YouTube-VOS [54]. Then, we compare our method with state-of-the-art VOS methods on both the YouTube-VOS [54] and DAVIS-2017 [55] datasets.

    A. Datasets

    YouTube-VOS: YouTube-VOS consists of 3471 and 474 videos in the training and validation sets, respectively, and is the largest video object segmentation benchmark. Each video sequence is 20 to 180 frames long and every fifth frame is annotated. In the training set, there are 65 unique object categories, which are regarded as seen categories. The validation set contains 91 unique object categories, comprising all the seen categories and 26 unseen categories.

    DAVIS2017: DAVIS2017 is a popular benchmark dataset for VOS tasks and is a multi-object extension of DAVIS2016 [57]. The training set contains 60 video sequences with multiple annotated instances. The validation and test-dev sets each include 30 video sequences with multiple annotated instances.

    B. Evaluation Metrics

    We evaluate the experimental results using the three usual evaluation measures suggested by [58]: 1) Region similarity J: the mean intersection-over-union (mIoU) between the ground truth and the predicted segmentation masks. 2) Contour accuracy F: the F-measure of the contour-based precision and recall between the contour points of the ground truth and the predicted masks. 3) Overall score G: the average of J and F. For the YouTube-VOS dataset, both J and F are divided into two different measures, depending on whether the categories of the validation set are included in the training set (J_seen and F_seen) or not (J_unseen and F_unseen).
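
    The three measures can be summarized with the sketch below. Note that the contour accuracy here is a simplified, tolerance-based approximation of the official DAVIS boundary measure, not the evaluation code behind the reported numbers.

```python
import numpy as np
from skimage.morphology import binary_dilation, disk
from skimage.segmentation import find_boundaries

def region_similarity_j(pred, gt):
    """Region similarity J: intersection-over-union of binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def contour_accuracy_f(pred, gt, tol=2):
    """Contour accuracy F: F-measure of boundary precision and recall,
    with boundaries matched within a `tol`-pixel tolerance via dilation."""
    pb, gb = find_boundaries(pred), find_boundaries(gt)
    gb_d, pb_d = binary_dilation(gb, disk(tol)), binary_dilation(pb, disk(tol))
    precision = (pb & gb_d).sum() / max(pb.sum(), 1)
    recall = (gb & pb_d).sum() / max(gb.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)

def overall_score_g(pred, gt):
    """Overall score G: the average of J and F."""
    return 0.5 * (region_similarity_j(pred, gt) + contour_accuracy_f(pred, gt))
```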

    C. Implementation Details

    Our framework is based on A-GAME [9], which is the state-of-the-art in semi-supervised VOS. Our network is implemented in PyTorch [59]. We choose Adam [60] as the optimizer to minimize the joint loss; Adam is an algorithm for first-order gradient-based optimization of stochastic objective functions based on adaptive estimates of lower-order moments. Following [9], we divide network training into two stages. In the first stage, we train for 80 epochs on half-resolution images (240×480). We set the batch size to 4 video snippets with 8 frames in each snippet, and use a learning rate of 10^{-4}, an exponential learning rate decay of 0.95 per epoch, and a weight decay of 10^{-5}. In the second stage, we train for 100 epochs at full image resolution. The batch size is set to 2 snippets to accommodate longer sequences of 14 frames, and we use a learning rate of 10^{-5}, an exponential learning rate decay of 0.985 per epoch, and a weight decay of 10^{-6}. In both stages, we set the Gaussian kernel bandwidth hyper-parameters δ_d and δ_r to 100 and 15, respectively. We set the scribble-supervised loss weight λ to 2×10^{-15} and the unreliable region weight β to 0.1. We determine all hyper-parameters via experience and grid search. To show that these are the best settings, we vary the hyper-parameters in different ways and measure the change in performance on the YouTube-VOS dataset [54]. In Table II, rows A and B, we vary the learning rates lr_1 and lr_2 of the two training stages, respectively; each overall score decreases to varying degrees. In rows C and D, we observe that the overall score decreases to varying degrees when varying the Gaussian kernel bandwidth hyper-parameters δ_d and δ_r. In rows E and F, we vary the loss weight λ and the region weight β; each overall score again decreases to varying degrees. These ablation results confirm that the chosen hyper-parameter settings are the best among the tested ones.
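
    For reference, the two-stage optimizer settings above map directly onto standard PyTorch components. In the sketch below, the small convolution stands in for the SSVOS network and the epoch bodies are placeholders; this only illustrates the schedule and is not the authors' training script.

```python
import torch
import torch.nn as nn

# A placeholder network standing in for the SSVOS model.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)

# First stage: lr 1e-4, weight decay 1e-5, 0.95 decay per epoch, 80 epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
for epoch in range(80):
    # ... run one training epoch of the joint loss on half-resolution snippets ...
    scheduler.step()  # exponential learning rate decay per epoch

# Second stage: lr 1e-5, weight decay 1e-6, 0.985 decay per epoch, 100 epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=1e-6)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.985)
for epoch in range(100):
    # ... run one training epoch at full image resolution ...
    scheduler.step()
```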

    D. Ablation Studies

    We conduct ablation studies on YouTube-VOS to verify the effectiveness of the scribble attention module and the scribble-supervised loss. We compare our baseline with different variant models, and the quantitative results for the different models are shown in Table III. We also show qualitative comparison results in Fig. 6. “pCE” denotes our model trained on the scribble labels with the partial cross-entropy loss, which can be seen as our baseline. “SA” denotes the proposed scribble attention module. “PSA” denotes the point-wise spatial attention module [61]. “NL” denotes the non-local module [41]. “CCA” denotes the criss-cross attention module [62]. “SS” denotes the proposed scribble-supervised loss. “DCRF” denotes the dense CRF loss [17], [47], [48]. “✓” denotes the model with the corresponding module.

    Analysis of the Scribble Attention Module: We first analyze the impact of the proposed scribble attention module. To confirm its effectiveness, we compare the performance of the baseline with that of its variant obtained by adding the proposed scribble attention module. The second row of Table III shows the result of adding the scribble attention module to our baseline model. The overall score increases by more than 3.5% compared with our baseline model (first row of Table III). Fig. 6 shows some qualitative results: the segmentation results with our scribble attention module (second column of Fig. 6) are greatly improved compared to the baseline model. These quantitative and qualitative comparisons demonstrate the usefulness of the proposed scribble attention module for scribble-supervised video object segmentation. To further demonstrate the effectiveness of the proposed scribble attention module in this particular scribble-supervised task, we replace our scribble attention module with three recent attention modules, “PSA” [61], “NL” [41], and “CCA” [62]. As shown in Table III, “PSA”, “NL”, and “CCA” all achieve better performance than our baseline model, which demonstrates that the self-attention mechanism is able to capture contextual dependencies and is helpful for the VOS task. Compared with the proposed scribble attention module, the overall scores of “PSA”, “NL”, and “CCA” drop from 47.8 to 46.5, 46.6, and 46.9, respectively. The second and third columns of Fig. 6 show the qualitative results of “SA” and “CCA”, respectively. We can see that the segmentation results of “SA” are better than those of “CCA”, e.g., in the human leg area. These quantitative and qualitative comparison results further demonstrate the superior performance of our scribble attention module.

    TABLE II: OVERALL SCORE G UNDER DIFFERENT HYPER-PARAMETER SETTINGS. “BASE” INDICATES THE BASE HYPER-PARAMETER SETTINGS USED IN THIS PAPER. UNLISTED VALUES ARE IDENTICAL TO THOSE OF THE BASE SETTINGS

    TABLE III: ABLATION STUDIES ON THE YOUTUBE-VOS DATASET

    Scribble-Supervised Loss: Another ablation study is carried out on the proposed scribble-supervised loss by adding it to the baseline model. The sixth row of Table III shows the quantitative result of our baseline with the proposed scribble-supervised loss. Experimental results show that the overall score increases by more than 7.1% compared with the baseline model (first row of Table III). In addition, the other four indicators are also greatly improved with the proposed scribble-supervised loss. As shown in Fig. 6, compared with the baseline model (first column of Fig. 6), the segmentation quality (fourth column of Fig. 6) becomes much better when using the scribble-supervised loss. The baseline model can only segment out the general shape of the objects, while the model with the scribble-supervised loss can segment out the boundaries and details of the objects. For example, the boundary segmentation results of the man and the tennis racket are rough for the baseline model, and our scribble-supervised loss greatly improves these results. These quantitative and qualitative comparisons demonstrate that the proposed scribble-supervised loss can indirectly compensate for the missing information in weak scribble-level labels. In order to further show the effect of our scribble-supervised loss on the boundary area, we replace our scribble-supervised loss with the dense CRF loss. The overall score decreases from 49.5 to 48.7. As shown in Fig. 6, the boundary segmentation results of the man with the dense CRF loss (fifth column of Fig. 6) are worse than those with the scribble-supervised loss (fourth column of Fig. 6). These quantitative and qualitative comparisons demonstrate that our scribble-supervised loss can dynamically correct inaccurately segmented areas.

    Fig. 6. Qualitative comparison of different variant modules on YouTube-VOS. The first to the sixth columns are “pCE”, “pCE” with “SA”, “pCE” with “CCA”, “pCE” with “SS”, “pCE” with “DCRF”, and our complete model, respectively.

    TABLE IV: QUANTITATIVE RESULTS ON THE YOUTUBE-VOS VALIDATION SET. “OL” DENOTES ONLINE LEARNING

    We also show the quantitative (eighth row of Table III) and qualitative (sixth column of Fig. 6) results of our complete model, which demonstrate that combining the scribble attention module and the scribble-supervised loss further improves the segmentation results.

    E. Comparison With State-of-the-Art

    To the best of our knowledge, there are no published results for weakly-supervised VOS to compare against on YouTube-VOS [54] and DAVIS-2017 [55]. We therefore first compare our approach with state-of-the-art semi-supervised methods on YouTube-VOS [54]. Then, we compare our approach with state-of-the-art unsupervised and semi-supervised methods on DAVIS-2017 [55].

    YouTube-VOS: The validation set of this large-scale dataset contains 474 sequences with 91 categories, 26 of which are not included in the training set. We evaluate our method on this validation set and obtain the results from the open evaluation server [54]. Table IV shows the comparison with previous state-of-the-art semi-supervised methods on YouTube-VOS [54]. “OL” denotes online learning in the inference stage. Our method achieves an overall score of 51.1% using scribbles as supervision and without online learning. We obtain almost the same overall score as [63], and our method reaches results comparable to [6], [20], [64]. The gap between our method and [20], [8] is very small. Compared with [7], [9], [54], [58], [65], although there is a certain gap in the overall scores, this result is acceptable in view of the huge gap in supervision information. Our method narrows the gap to approaches trained on per-pixel annotations.

    Besides, to the best of our knowledge, the end-to-end recurrent network for video object segmentation (RVOS) [8] is the only published work reporting results for zero-shot (unsupervised) VOS on YouTube-VOS. Table V shows the quantitative comparison between our scribble-supervised method and RVOS. We can see that although RVOS uses per-pixel annotations as supervision during the training phase, our results are still far better than those of RVOS. One possible reason is that our method uses the scribble annotation of the first frame in the testing phase, which provides a certain amount of prior knowledge about the objects.

    DAVIS2017: The dataset comprises 60 videos for training, 30 videos for validation, and 30 videos for testing. Following [9], we first train on the YouTube-VOS and DAVIS2017 training sets to boost the performance and then evaluate. We evaluate our method on both the validation and test-dev sets, and the results of the test-dev set are obtained from the open evaluation server [55]. Table VI shows the quantitative comparison of our method with recent state-of-the-art unsupervised and semi-supervised methods on the DAVIS-2017 validation set. We can see that our method achieves a J&F mean of 55.7% using scribbles as supervision. Compared with unsupervised methods, our method exceeds [8] by a large margin and exceeds [13] to some extent, while the gap between ours and [37] is very small. Compared with semi-supervised methods with per-pixel annotations as supervision, our method exceeds [63], reaches results comparable to [15], [66], and shows only a very small gap to [7], [8]. Table VII shows the quantitative comparison results on the DAVIS-2017 test-dev set. Our method achieves a J&F mean of 41.7% and exceeds most unsupervised methods [8], [13], [15]. The remaining performance gap is due to the difference in supervision information, which is acceptable. Compared with semi-supervised methods with per-pixel annotations as supervision, our method exceeds the semi-supervised method [63], and the gap between our method and [67], [66] is very small.

    TABLE V: QUANTITATIVE RESULTS FOR ZERO-SHOT METHODS ON THE YOUTUBE-VOS DATASET

    TABLE VI: QUANTITATIVE RESULTS ON THE DAVIS2017 VALIDATION SET

    TABLE VII: QUANTITATIVE RESULTS ON THE DAVIS2017 TEST-DEV SET

    F. Qualitative Results

    Fig. 7 shows the qualitative results of our method on YouTube-VOS [54] and DAVIS-2017 [55]. These videos all contain at least one object with diverse sizes, shapes, motions, occlusions, etc. It can be seen that in many cases our method produces high-quality segmentation results. For example, our method successfully segments the challenging motorcycle sequence (first row), in which the objects move very fast. In the bmx-trees sequence (fourth row), the objects move fast and parts of the objects are occluded by the tree in some frames; our method can still successfully segment out the people and the bike in these occluded areas. In the last row, our method fails to segment some bottom edges of the car in the first frame; however, it is able to recover from that error in later frames.

    G. Failure Cases

    Fig. 7. Qualitative results showing objects in videos from YouTube-VOS and DAVIS2017. The first two rows are from the YouTube-VOS validation set; the latter two are from the DAVIS2017 validation set and test-dev set, respectively.

    Fig. 8. Illustration of two failure cases on DAVIS-2017. The first row is the paragliding-launch video sequence, where the paraglider's ropes are not segmented properly. The second row is the pigs video sequence, where the segmentation results of the later frames are worse.

    While our approach achieves satisfactory results in both the quantitative and qualitative evaluations on YouTube-VOS [54] and DAVIS-2017 [55], we find a few failure cases, as shown in Fig. 8. In the first row, our method fails to segment out the paraglider's ropes. This is because our simulation fails to generate scribbles for such thin structures, and thus scribbles for the paraglider's ropes are not provided in the first frame during the inference stage. We believe that a good future direction is to generate more reliable scribbles with a better simulation generation algorithm. We can also use other data processing methods to handle such problems with the scribbles, such as [68]–[71]. In the second row, we find that our method may be less stable when segmenting frames that are far from the first frame. Since we take the scribbles as the guidance information in the first frame, these scribbles cover only a limited number of the objects' pixels, and it is difficult for the model to propagate this limited guidance information from the first frame to distant frames in very challenging scenes. Such challenging scenarios could be resolved by incorporating a spatiotemporal information encoding module [72], which will be our future direction. In addition, we can also use other advanced image segmentation and deep learning technologies [73]–[76] to further improve the performance in these failure cases.

    V. CONCLUSIONS AND FUTURE WORK

    In this paper, we propose a scribble-supervised method for video object segmentation. The first contribution of our method is the scribble attention module, which is designed to selectively learn the relationship between the context regions and the query position. Unlike conventional self-attention modules which consider all the context information, with this selective strategy our attention module can capture more accurate context information and learn an effective attention map to enhance the contrast between the foreground and the background. Furthermore, to address the missing information of scribble-level annotations, we propose a novel scribble-supervised loss that can compensate for the missing information and dynamically correct inaccurate prediction regions. The proposed method requires far less human labor than per-pixel annotation and narrows the performance gap to approaches trained on per-pixel annotations, achieving a trade-off between annotation efficiency and model performance. Therefore, it is of great significance to further exploit weakly supervised methods [77]–[80] to learn video object segmentation. In the future, we plan to explore the potential of using a new simulation method to generate more reliable scribbles and of incorporating a spatiotemporal information encoding module [72], [81] to better propagate the guidance information throughout the video sequence.
