
    Depth-Guided Vision Transformer With Normalizing Flows for Monocular 3D Object Detection

IEEE/CAA Journal of Automatica Sinica, 2024, Issue 3

Cong Pan, Junran Peng, and Zhaoxiang Zhang

Abstract—Monocular 3D object detection is challenging due to the lack of accurate depth information. Some methods estimate pixel-wise depth maps from off-the-shelf depth estimators and then use them as an additional input to augment the RGB images. Depth-based methods attempt to convert estimated depth maps to pseudo-LiDAR and then use LiDAR-based object detectors, or focus on the perspective of image and depth fusion learning. However, they demonstrate limited performance and efficiency as a result of depth inaccuracy and complex fusion modes with convolutions. Different from these approaches, our proposed depth-guided vision transformer with normalizing flows (NF-DVT) network uses normalizing flows to build priors in depth maps to achieve more accurate depth information. Then we develop a novel Swin-Transformer-based backbone with a fusion module that processes RGB image patches and depth map patches with two separate branches and fuses them using cross-attention to exchange information with each other. Furthermore, with the help of pixel-wise relative depth values in depth maps, we develop new relative position embeddings in the cross-attention mechanism to capture more accurate sequence ordering of input tokens. Our method is the first Swin-Transformer-based backbone architecture for monocular 3D object detection. The experimental results on the KITTI and the challenging Waymo Open datasets show the effectiveness of our proposed method and superior performance over previous counterparts.

I. INTRODUCTION

OBJECT detection is a fundamental task in computer vision. With the development of deep learning [1]-[4], significant progress has been achieved in the field of 2D object detection [5]-[10]. It is also crucial for applications that require 3D spatial information, such as robot perception, mixed reality, and autonomous driving. Most existing 3D object detection methods rely on LiDAR sensors to provide depth information [11]-[16]. However, such sensors cannot be widely used due to their high cost. In contrast, 3D object detection relying on cameras infers depth from monocular RGB images, whose rich semantic information can be used to boost object classification and localization. For monocular 3D object detection, recovering the bounding boxes of objects in 3D space remains challenging due to the lack of depth cues, which makes monocular object depth estimation naturally ill-posed.

We conduct pilot experiments in Table I using predictions of representative monocular 3D object detection methods [17]-[19] without depth maps or using depth maps in different auxiliary styles. We compare their 3D object detection performance by substituting sub-task predictions with their corresponding ground-truth values. The table shows that depth estimation is the most crucial sub-task for 3D object detection compared to rotation (R), horizontal and vertical location (X, Y), and shape (H, W, L) prediction. Inspired by the analysis above, our work focuses on leveraging depth features to guide RGB features for high-quality estimation of 3D properties.

Recent works transform depth maps estimated from off-the-shelf depth estimators [20], [21] trained on larger datasets to a pseudo-LiDAR representation [19], [22]. Then they apply state-of-the-art LiDAR-based detection methods to 3D detection tasks and achieve impressive results on the challenging KITTI 3D benchmark [23]. However, the performance of pseudo-LiDAR based methods suffers from the gap between inaccurately estimated depth and real-world depth. The depth maps generated from monocular images are often coarse, which leads to worse pseudo point clouds and inaccurate 3D detection results. Meanwhile, LiDAR-based detectors increase the algorithmic complexity of 3D detection heads. Moreover, the generated LiDAR-resembling point cloud representations discard high-level semantic information from RGB images, which is vital for object classification.

Another line of work directly leverages depth maps to augment the input RGB images [18], [24], [25]. They treat the depth maps as assistance to generate depth-aware features that enhance 3D detection performance in a learning style. These approaches use the relative depth relationship instead of absolute depth values, which can be easily generated by a scale transformation of depth maps. This is mainly because coarse and inaccurate absolute depth values are useless in their frameworks. Nevertheless, even though the multi-modal fusion of RGB images and depth maps uses relative depth values, it still heavily relies on the precision of the estimated depth maps. Furthermore, the multi-modal fusion of these convolutional neural network (CNN)-based methods is limited due to the locally confined receptive fields of convolutional layers.

Therefore, we propose to solve this task by tackling two issues: 1) determining how to obtain more accurate depth maps without complex architecture design and additional computational consumption; and 2) determining how to find a better multi-modal fusion method for depth maps and RGB images so as to fully use the generated depth maps.

For the first problem, we use normalizing flows to establish the relationship between the distribution of ground-truth depth values (z_3d) of the training dataset and the distribution of the corresponding absolute depth values. It is a depth transformation that uses prior knowledge to generate more accurate depth maps and takes only a few minutes to train on the CPU. For the second problem, we build a cross-attention-based fusion module using the Swin Transformer [26] as a novel Transformer-based backbone. Considering that the Transformer is better suited for multi-modal fusion due to its global attention mechanism, using the cross-attention mechanism for image-depth fusion helps us achieve notable improvements on the KITTI dataset [23] and especially the large-scale Waymo Open dataset [27]. Also, with the help of more accurate depth maps processed by normalizing flows, we develop new relative position embeddings in the cross-attention mechanism to re-weight the cross-attention map and boost the fusion of depth maps and RGB images.

    Our contributions can be summarized as follows.

1) We establish the relationship between the distribution of the ground-truth depth values (z_3d) of the training dataset and the distribution of the corresponding absolute depth values with normalizing flows. This depth transformation method, employed as a pre-processing step for the depth maps obtained from off-the-shelf monocular depth estimators, incorporates prior knowledge to enhance the accuracy of the depth maps.

2) We build a cross-attention-based fusion module as a novel Swin-Transformer-based backbone to achieve better fusion of depth maps and RGB images.

    3) We develop new relative position embeddings in the cross-attention mechanism to re-weight the cross-attention map to boost the fusion of depth maps and RGB images with the help of more accurate depth maps processed by normalizing flows.

4) Extensive evaluations on the KITTI and the challenging Waymo Open datasets demonstrate that our proposed NF-DVT model outperforms different strong baselines and achieves better performance than previous counterparts.

II. RELATED WORK

    A. Image-Based Monocular 3D Detection

Image-based approaches [28]-[31] typically use geometry constraints, including object shape, ground plane, and keypoints. MonoGRNet [31] decomposes the 3D detection problem into sub-tasks, namely 2D object detection, instance depth estimation, 3D localization, and local corner regression. GS3D [30] exploits an off-the-shelf 2D object detector and computes a coarse cuboid for each predicted 2D box to refine detection results. M3D-RPN [17] generates 2D and 3D object proposals simultaneously and proposes depth-aware convolution to improve 3D parameter estimation. SMOKE [32] is a deep architecture that predicts 3D bounding boxes by relying on keypoint estimation as an intermediate task. MonoRCNN [33] proposes a geometry-based distance decomposition to enable interpretability, accuracy and robustness of distance prediction. Homo [34] introduces a homography loss, which leverages geometric interrelationships among objects within a given scene and imposes global constraints on the mutual locations of objects by utilizing the homography between the image view and the bird's eye view.

Other works, such as MonoPair [35], consider the relationship of paired samples to improve monocular 3D object detection. MonoDIS [36] proposes to improve both training convergence and detection accuracy of 3D detection networks by considering loss disentanglement. Zhang et al. [37] point out the importance of dimension estimation in the recovery of 3D locations. To address this concern, they propose to learn a dimension-aware embedding space and utilize this space to learn representative shape templates to improve dimension estimation. MonoCon [38] proposes to learn auxiliary monocular contexts in training and discard these auxiliary branches for better inference efficiency.

Some methods use external data sources, e.g., Deep MANTA [39] uses keypoints of vehicles to encode 3D vehicle information. LPCG [40] generates pseudo labels from unlabeled LiDAR point clouds, then trains the monocular 3D detectors with the pseudo-labeled data and benefits from the augmented training set.

    B. Depth-Aware Monocular 3D Detection

An alternative solution for improving monocular 3D detection performance is to introduce depth information. Existing depth-aware monocular 3D object detection methods can be categorized into three types: pseudo-LiDAR based approaches, depth-image fusion-based approaches, and other depth-aware approaches.

The pseudo-LiDAR based methods [19], [22], [24], [41], [42] first perform off-the-shelf monocular depth estimation, such as DORN [20] or BTS [21], and lift the input image to a point cloud representation, which is called a pseudo-LiDAR point cloud. Then they predict oriented 3D bounding boxes using a point-cloud-based 3D object detector. Mono3D-PLiDAR [41] observes that noise in the pseudo-LiDAR hinders 3D detection performance and uses the instance mask instead of the bounding box frustum for lifting. Pseudo-LiDAR++ [22] points out that pseudo-LiDAR relies heavily on the quality of depth estimation and that more accurate depth estimation can therefore be implemented to aid pseudo-LiDAR methods with the help of a stereo network architecture. DD3D [43] uses a large private dataset, DDAD15M, for depth pre-training to transfer effective information between depth estimation and 3D detection.

Some works are dedicated to improving 3D detection performance by fusing depth maps and RGB images in a learning style with traditional convolutional neural networks (CNNs) [18], [25], [44], [45]. D4LCN [18] carefully learns dynamic-depthwise-dilated kernels from depth features to integrate with image context. CaDDN [45] predicts categorical depth distributions to generate accurate bird's-eye-view representations and then uses a BEV detection network for 3D detection.

Other depth-aware methods, such as DEVIANT [46], propose depth equivariant networks built with scale equivariant steerable blocks to solve the depth translation issue in the projective manifold. DID-M3D [47] reformulates the instance depth as the combination of instance visual depth and instance attribute depth and performs data augmentation based on the decoupled instance depth. MonoDDE [48] proposes a depth solving system with a depth selection and combination strategy to fully use visual clues from subtasks for monocular 3D object detection.

Compared with the methods mentioned above that use off-the-shelf monocular depth estimators, our method introduces a depth prior via normalizing flows, thereby achieving more accurate depth maps. Training a model that processes depth images with normalizing flows only requires a CPU and takes only a few minutes, without complex architecture design or additional computational consumption. Moreover, unlike the methods mentioned above that fuse depth maps and RGB images with CNNs, our emphasis is on learning a good fusion network for depth maps and RGB images via the global attention mechanism of the vision transformer. Meanwhile, the more accurate depth maps obtained by normalizing flows can be used to construct new relative position embeddings to aid the cross-attention-based fusion module.

    C. Vision Transformer

A transformer with the fundamental component of a self-attention mechanism was first applied in the natural language processing (NLP) field and achieved great success [49], [50]. Recently, with the capability of capturing long-range dependencies, transformer architectures have been successfully leveraged in computer vision tasks, such as image classification [51]-[53], object detection [54], [55], semantic segmentation [56], [57], and other vision-related tasks [58]-[61]. However, transformers have limited applications in 3D object detection tasks. MonoDTR [62] is the first method to use a transformer encoder-decoder structure to model the relationship between context- and depth-aware features for the monocular 3D object detection task. Our NF-DVT method utilizes the hierarchical architecture of the Swin Transformer [26] to extract image features at various scales while fusing depth information, which is extracted from a pre-trained depth estimator and refined with normalizing flows.

    D. Normalizing Flows

Fig. 1. The overall framework of our proposed approach. The depth map is first estimated from the RGB image and refined by building priors in instance depth estimation using normalizing flows. Then the RGB image and the depth map are fed into the Swin Transformer (tiny version, referred to as Swin-T) with cross-attention to extract features. Finally, multiple heads are attached for 3D bounding box estimation with non-maximum suppression (NMS). W-MSA and SW-MSA are multi-head self-attention modules with regular and shifted windowing configurations, respectively. Cross W-MSA and Cross SW-MSA are W-MSA and SW-MSA modules with cross-attention.

A normalizing flow [63] describes the transformation of a probability density through a sequence of invertible mappings. Normalizing flows can transform a complex data point into a simple Gaussian distribution or vice-versa. Some recent works utilize normalizing flows to build priors in vision tasks, such as density estimation [64], [65]; image, audio, and video generation [66]-[68]; and 3D human pose estimation [69]-[71]. Reference [69] proposes a regression paradigm, named residual log-likelihood estimation (RLE), that leverages normalizing flows to estimate the underlying distribution and boosts human pose estimation. In this paper, we integrate the prior distribution of the ground-truth depth values into the distribution of estimated depth maps with normalizing flows to assist the depth-image fusion module.

III. METHODOLOGY

    A. Framework Overview

As shown in Fig. 1, our framework consists of three key components: a depth map extractor and processor with normalizing flows, a cross-attention-based transformer backbone with depth-guided relative position embedding, and a 2D-3D detection head. Specifically, we utilize BTS [21] as the depth map estimator and use neural spline flows (NSF) [72] to build priors into the generated depth maps. As for the transformer-based backbone, we use the Swin Transformer [26] (tiny version, referred to as Swin-T), which is pre-trained on the ImageNet-1K classification dataset [73].

    B. Depth Map Preprocessing

First, we obtain the absolute depth value d by projecting the 3D center location [x, y, z]_3d in camera coordinates into the image plane, given a known projection matrix K ∈ R^{3×4}, as

$$z_{3d}\,[x,\, y,\, 1]_p^{\top} = K\,[x,\, y,\, z,\, 1]_{3d}^{\top} \tag{1}$$

where [x, y, z]_3d represents the horizontal position, height, and depth of the 3D center of each object in camera coordinates, and [x, y]_p is the projection of the 3D point onto the 2D image plane. With the same coordinate [x, y]_p in the depth map, we can obtain the pixel depth and convert it to the absolute depth value d with a scale. There is a one-to-one correlation between the ground-truth z_3d and the absolute depth value d.
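To make the projection and depth look-up concrete, the following is a minimal NumPy sketch of (1) under the usual pinhole model. The matrix values, depth-map size, and the scale factor used to recover an absolute depth are illustrative placeholders rather than the calibration or normalization actually used.

```python
import numpy as np

# Hypothetical 3x4 projection matrix K (KITTI-style intrinsics + translation).
K = np.array([[721.5, 0.0, 609.6, 44.9],
              [0.0, 721.5, 172.9, 0.2],
              [0.0, 0.0, 1.0, 0.003]])

def project_center(center_3d, K):
    """Project a 3D object center [x, y, z] (camera coords) onto the image plane."""
    xyz1 = np.append(center_3d, 1.0)              # homogeneous coordinates
    proj = K @ xyz1                               # [z*u, z*v, z]
    u, v = proj[0] / proj[2], proj[1] / proj[2]
    return np.array([u, v]), proj[2]              # pixel location [x, y]_p and depth

center = np.array([2.1, 1.5, 25.3])               # example 3D center in camera coords
xy_p, z3d = project_center(center, K)

# Read the (relative) depth map at [x, y]_p and rescale it to an absolute value d.
depth_map = np.random.rand(375, 1242)             # stand-in for an estimated depth map
scale = 80.0                                      # assumed max depth used for normalization
d = depth_map[int(round(xy_p[1])), int(round(xy_p[0]))] * scale
```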

As shown in Fig. 2, we observe that there is a certain relationship between the distribution of the ground-truth z_3d and the distribution of the corresponding absolute depth value d. To model this specific relationship, we use the NSF auto-regressive density estimator, which is fully differentiable and has an analytic inverse.

A normalizing flow models input data x as the output of an invertible, differentiable transformation f of noise u

$$\mathbf{x} = f(\mathbf{u}), \qquad \mathbf{u} \sim p_u(\mathbf{u}).$$

The probability density of x under the flow is obtained by a change of variables

$$p_x(\mathbf{x}) = p_u\!\left(f^{-1}(\mathbf{x})\right)\left|\det\!\left(\frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}}\right)\right|.$$

Fig. 2. The illustration of the relationship between the ground-truth z_3d distribution of the training dataset and the corresponding absolute depth d distribution. The vertical axis is the distribution of the ground-truth z_3d (gt_z3d) of the training dataset and the horizontal axis is the distribution of the corresponding absolute depth values.

Given a dataset D = {x^(n)}, the flow is trained by maximizing the total log-likelihood Σ_n log p(x^(n)) with respect to the parameters of the transformation f. All flows in our module use a standard-normal noise distribution. We also constrain the distribution of the absolute depth value d and the distribution of the ground-truth z_3d after transformation through the L1 loss, making the two distributions similar in order to introduce more accurate depth prior information into the depth maps. To sum up, the overall transformation loss using normalizing flows is the sum of the negative log-likelihood and the L1 constraint between the two distributions.

Fig. 3 shows the distribution of the ground-truth depth z_3d in the blue curve and the distribution of the generated absolute depth values d in the red curve before and after transformation using NSF. We use only three simple fully connected neural networks in the coupling transform of NSF, which is easy to train and suitable for the one-dimensional situation. We can see that, after transformation, the generated distribution of absolute depth values d is more similar to the distribution of the ground-truth z_3d. With the help of normalizing flows, we can model complex distributions and introduce flexible priors into depth maps.
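As a concrete illustration of the training objective only, the sketch below fits a toy one-dimensional flow to samples of the absolute depth d by maximizing the change-of-variables log-likelihood. A stack of learnable affine transforms stands in for the neural spline flow, and the L1 constraint between the transformed d distribution and the z_3d distribution is omitted, so this is a mechanism sketch rather than the model used in the paper.

```python
import torch
import torch.nn as nn

class AffineFlow1D(nn.Module):
    """Toy 1D flow: each layer is x = exp(log_s) * u + t, so u = (x - t) * exp(-log_s)."""
    def __init__(self, n_layers=3):
        super().__init__()
        self.log_s = nn.Parameter(torch.zeros(n_layers))
        self.t = nn.Parameter(torch.zeros(n_layers))

    def log_prob(self, x):
        # Change of variables: log p(x) = log N(u; 0, 1) + sum log |d f^{-1} / dx|.
        u, log_det = x, torch.zeros_like(x)
        for log_s, t in zip(self.log_s, self.t):
            u = (u - t) * torch.exp(-log_s)
            log_det = log_det - log_s          # each layer contributes -log_s to the log-det
        base = torch.distributions.Normal(0.0, 1.0)
        return base.log_prob(u) + log_det

# Hypothetical absolute-depth samples d (e.g., gathered at projected object centers).
d = 5.0 + 20.0 * torch.rand(4096)

flow = AffineFlow1D()
opt = torch.optim.Adam(flow.parameters(), lr=5e-3)
for step in range(2000):
    opt.zero_grad()
    nll = -flow.log_prob(d).mean()             # maximize the total log-likelihood
    nll.backward()
    opt.step()
```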

    C. Depth-Guided Transformer

Fig. 3. The illustration of the distribution of the ground-truth depth z_3d (blue curve) and the distribution of absolute depth d (red curve) before and after transformation through normalizing flows.

In CNN-based architectures, attention modules are used to fuse different modal features. However, their empirical design cannot guarantee the discriminative ability of the model, and the local receptive fields of convolutional layers are not able to fully capture the object context under perspective projection and occlusion. The trend of using cross-attention-based transformers to model multi-modal fusion has gained initial traction as transformers took the computer vision field by storm since mid-2020. Transformers are more suitable for multi-modal fusion due to the global attention mechanism. In this paper, we extend a popular vision transformer, the Swin Transformer [26], which serves as a general-purpose backbone for computer vision. Following the same pipeline, we use the tiny version of the Swin Transformer (Swin-T), which has 4 stages. Meanwhile, we upsample the last layer and then fuse it with the third layer to obtain a larger field of view and keep the network stride at 16. Different from the Swin Transformer, we formulate our backbone as a two-branch network: one is the feature extraction network for RGB images, and the other is the depth-guided feature fusion network.
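The sketch below illustrates one way the last stage of a Swin-T-style backbone could be upsampled and merged with the third stage so that the fused feature map keeps a stride of 16; the channel widths, 1x1 projections, and additive fusion are assumptions for illustration, not the exact layers of our backbone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Stride16Fusion(nn.Module):
    """Fuse stage-3 (stride 16) and upsampled stage-4 (stride 32) features."""
    def __init__(self, c3=384, c4=768, c_out=384):      # Swin-T channel widths (assumed)
        super().__init__()
        self.proj4 = nn.Conv2d(c4, c_out, kernel_size=1)  # align stage-4 channels
        self.proj3 = nn.Conv2d(c3, c_out, kernel_size=1)

    def forward(self, f3, f4):
        f4_up = F.interpolate(self.proj4(f4), size=f3.shape[-2:],
                              mode="bilinear", align_corners=False)  # 2x upsample to stride 16
        return self.proj3(f3) + f4_up                     # simple additive fusion (assumed)

fusion = Stride16Fusion()
f3 = torch.randn(1, 384, 24, 80)   # stage-3 features of a 384x1280 input (stride 16)
f4 = torch.randn(1, 768, 12, 40)   # stage-4 features (stride 32)
out = fusion(f3, f4)               # (1, 384, 24, 80): the network stride stays at 16
```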

1) Depth-Guided Cross-Attention: Nevertheless, simply using the revised Swin-T as our backbone gives rise to worse performance compared to ResNet50, as will be shown in the ablation study in Section IV-F. This is mainly due to insufficient training images in the KITTI dataset, since the transformer-based network is data-driven. We therefore propose a cross-attention module (two successive Swin Transformer blocks with cross-attention in the bottom half of Fig. 1) to solve this problem. There are two types of attention mechanisms in each Transformer block: self-attention for the image branch and cross-attention for the image-depth branch. In self-attention, the query Q, key K, and value V inputs are the same, whereas in cross-attention Q comes from a different modality than K and V. In computing self-attention, we follow [26] and compute the similarity for each head as

$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(QK^{\top}/\sqrt{n} + B\right)V$$

where Q, K, V ∈ R^{M²×n} are the query, key and value matrices; n is the query/key dimension; B ∈ R^{M²×M²} is the relative position bias; and M² is the number of patches in a window. Similarly, the cross-attention mechanism for computing similarity is formulated as

$$\mathrm{Attention}(Q_d, K_{img}, V_{img}) = \mathrm{SoftMax}\!\left(Q_d K_{img}^{\top}/\sqrt{n} + B\right)V_{img}$$

where Q_d is the query matrix of the depth map and K_img, V_img are the key and value matrices of the image feature map.
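A minimal single-head sketch of one window's cross-attention is given below: the depth tokens supply the query and the image tokens supply the key and value, with a learnable relative position bias B added to the similarity as in the formulation above. The direct M²×M² parameterization of B and the tensor shapes are simplifications for illustration.

```python
import torch
import torch.nn as nn

class WindowCrossAttention(nn.Module):
    """Single-head cross-attention inside one M x M local window (illustrative)."""
    def __init__(self, dim, window=7):
        super().__init__()
        self.scale = dim ** -0.5
        self.q = nn.Linear(dim, dim)            # query from depth-branch tokens
        self.kv = nn.Linear(dim, 2 * dim)       # key/value from image-branch tokens
        # Relative position bias B, one entry per (i, j) pair in the window.
        self.bias = nn.Parameter(torch.zeros(window * window, window * window))

    def forward(self, depth_tokens, img_tokens):
        q = self.q(depth_tokens)                          # (M^2, dim)
        k, v = self.kv(img_tokens).chunk(2, dim=-1)       # (M^2, dim) each
        attn = (q @ k.transpose(-2, -1)) * self.scale + self.bias
        return attn.softmax(dim=-1) @ v                   # (M^2, dim)

attn = WindowCrossAttention(dim=96)
depth_win = torch.randn(49, 96)   # 7x7 window of depth-branch tokens
img_win = torch.randn(49, 96)     # corresponding image-branch tokens
fused = attn(depth_win, img_win)
```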

2) Depth-Guided Relative Position Embedding: As shown in Fig. 4, the depth values of two arbitrary positions i and j differ within the same local window. Position j is closer to position i and position k is further away from position i. It is intuitive that the attention similarity should be greater when the depth distance is smaller, and smaller when it is larger. So we use the cosine of the depth difference as the weight coefficient to re-weight the relative position bias in the cross-attention. The resulting attention weight between the query position i and the key position j can then be written as

$$A_{ij} = \frac{\exp\!\left(q_i k_j^{\top}/\sqrt{n} + \cos(d_i - d_j)\, B_{ij}\right)}{\sum_{k}\exp\!\left(q_i k_k^{\top}/\sqrt{n} + \cos(d_i - d_k)\, B_{ik}\right)}$$

where j and k represent different positions, d_i denotes the relative depth value at position i, and q_i and k_j are the query and key vectors of the depth and image branches, respectively.
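To make the re-weighting concrete, the sketch below scales each relative position bias entry by the cosine of the depth difference between the corresponding query and key positions, so that positions at similar depths receive larger similarity; the [0, 1] scaling of the relative depths is an assumption for illustration.

```python
import torch

def depth_guided_bias(bias, window_depth):
    """Re-weight the relative position bias with cos(d_i - d_j) inside a 7x7 window.

    bias:         (M^2, M^2) learnable relative position bias B
    window_depth: (M^2,) relative depth value of each patch in the window
    """
    diff = window_depth[:, None] - window_depth[None, :]   # pairwise depth differences
    return bias * torch.cos(diff)                          # closer depths -> larger weight

M2 = 49
bias = torch.randn(M2, M2)
window_depth = torch.rand(M2)          # relative depths scaled to [0, 1] (assumed)
weighted = depth_guided_bias(bias, window_depth)
# `weighted` then replaces B in the cross-attention similarity of the previous sketch.
```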

Fig. 4. An illustration of a window for self-attention in the Swin Transformer architecture. A local window consists of 7 × 7 patches. i, j, and k represent different patches.

    D. 2D-3D Detection Head

Inspired by [17], we utilize 2D-3D anchors with priors as our default anchor boxes. The proposed framework is based on a region proposal network (RPN) with 2D-3D anchors. The RPN places the predefined anchors at every spatial location of the feature map; the detection head then regresses the positive samples.

2D-3D Anchor: Equation (1) represents the 3D-to-2D projection. Firstly, 2D anchors [x, y, w, h]_2D are defined in the 2D space for the 2D bounding box, where [x, y]_p represents the 2D location projected from the 3D center onto the image plane. Then, for each anchor, the default values of the 3D parameters are the mean statistics calculated from the corresponding priors in the training dataset. Specifically, we use 12 different scales ranging from 32 to 384 pixels and 3 aspect ratios (0.5, 1.0, 1.5) to define a total of 36 anchors. We then project all 3D ground truth boxes to the 2D space, calculate their intersection over union (IoU) with each 2D anchor, and assign each 3D box to the anchors that have an IoU ≥ 0.5. The 3D parameters [z, w, h, l, α]_3D are then the statistics across all matching 3D ground truth boxes, which denote the 3D object center depth, physical dimensions, and rotation, respectively. Note that in the practical implementation, we use the minimum enclosing rectangle of the projected 3D box as our ground truth 2D bounding box.
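The following sketch outlines how the per-anchor 3D priors described above could be computed: projected 3D ground-truth boxes are matched to 2D anchors by IoU ≥ 0.5, and the mean [z, w, h, l, α] statistics are taken over the matched boxes. The helper names and data layout are hypothetical.

```python
import numpy as np

def iou_2d(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def anchor_3d_priors(anchors_2d, gt_boxes_2d, gt_params_3d, iou_thr=0.5):
    """Mean [z, w, h, l, alpha] statistics per anchor from the matched ground truths."""
    priors = np.zeros((len(anchors_2d), 5))
    for i, anchor in enumerate(anchors_2d):
        matched = [p for box, p in zip(gt_boxes_2d, gt_params_3d)
                   if iou_2d(anchor, box) >= iou_thr]
        if matched:
            priors[i] = np.mean(matched, axis=0)
    return priors
```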

Formulation: Following [18], we utilize n_a to represent the number of anchors and n_c to represent the number of classes. Our detection head predicts 35 + n_c parameters for each position.

Losses: The network loss of our framework is formed as a multi-task learning problem composed of a classification loss L_c, a 2D regression loss L_2D, a 3D regression loss L_3D, and a 2D-3D corner loss L_corner. For the classification loss, we employ the cross-entropy loss.

For both 2D and 3D regression, we use the smooth L1 regression losses.

The total loss is formulated as

$$L = L_{c} + L_{2D} + L_{3D} + \gamma L_{corner}$$

where γ = 0.5 in all experiments.
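A minimal sketch of assembling the multi-task loss under the formulation above is shown below, with cross-entropy for classification, smooth-L1 for the 2D and 3D regression terms, and the corner loss weighted by γ = 0.5; the corner-loss term itself is taken as a precomputed placeholder.

```python
import torch
import torch.nn.functional as F

def total_loss(cls_logits, cls_targets, reg2d_pred, reg2d_gt,
               reg3d_pred, reg3d_gt, corner_loss, gamma=0.5):
    """L = L_c + L_2D + L_3D + gamma * L_corner (combination assumed from the text)."""
    l_cls = F.cross_entropy(cls_logits, cls_targets)   # classification
    l_2d = F.smooth_l1_loss(reg2d_pred, reg2d_gt)      # 2D box regression
    l_3d = F.smooth_l1_loss(reg3d_pred, reg3d_gt)      # 3D parameter regression
    return l_cls + l_2d + l_3d + gamma * corner_loss
```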

IV. EXPERIMENTS

    A. Dataset and Evaluation Metrics

    We evaluate our proposed framework on the challenging KITTI [23] and Waymo Open datasets [27].

The KITTI dataset consists of three tasks: 2D detection, 3D detection, and bird's eye view (BEV). It is divided into 7481 training samples and 7518 testing samples. The training samples are commonly divided into a train set with 3712 samples and a val set with 3769 samples following [83]. For a fair comparison with other baseline methods, we evaluate our NF-DVT method on the test set by training our model on both the train and val sets, and evaluate on the val set by training our model on only the train set. The evaluation is separated by difficulty setting (Easy, Moderate, and Hard) and by object class (Car, Pedestrian, and Cyclist). Results on the KITTI dataset are evaluated using precision-recall curves with an IoU threshold of 0.5 or 0.7. The 3D object detection and 3D localization performances are evaluated by AP_3d and AP_bev, respectively. Following [36], we use the 40 recall positions-based metric AP|R40.
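For reference, the sketch below computes average precision by sampling the precision-recall curve at 40 equally spaced recall positions, following the AP|R40 protocol of [36]; the max-to-the-right interpolation rule is assumed from the standard KITTI evaluation.

```python
import numpy as np

def ap_r40(recalls, precisions):
    """AP over 40 recall positions (1/40, 2/40, ..., 1.0) from a PR curve."""
    recall_points = np.linspace(1.0 / 40, 1.0, 40)
    ap = 0.0
    for r in recall_points:
        mask = recalls >= r
        p = precisions[mask].max() if mask.any() else 0.0   # interpolated precision
        ap += p / 40.0
    return ap
```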

The Waymo Open Dataset is another large-scale challenging dataset for autonomous driving. The official protocol provides 798 training sequences and 202 validation sequences from different scenes, and 150 test sequences without labels. Following PCT [82], we only use the data from the front camera for monocular 3D object detection. We sample one frame out of every three frames from the 798 training sequences (52 386 images) and use all 202 validation sequences (39 848 images). The evaluation is separated into two difficulty settings according to the number of LiDAR points contained in the bounding box: samples with 5 or fewer points are classified as "LEVEL_2", and the others are classified as "LEVEL_1". Results on the Waymo Open Dataset are evaluated using mean average precision (mAP) and mean average precision weighted by heading (mAPH). Different distances (0-30 m, 30-50 m, 50 m-∞) to the sensor are also considered in the evaluation.

    B. Details on Baselines

In order to make a fair comparison with the existing state-of-the-art methods and illustrate that our model can be plugged into different 3D object detectors as a backbone, we use the anchor-based method D4LCN [18] and the anchor-free method MonoFlex [80] as our base models, respectively. Note that all the ablation studies on KITTI and experiments on the Waymo Open dataset are implemented on the anchor-based baseline.

Anchor-Based Method: D4LCN [18] is a one-stage anchor-based method that leverages a set of pre-defined 2D-3D anchors to derive the estimated 3D boxes. Our proposed method replaces the convolutional backbone with a depth-guided vision Transformer (DVT) module for feature extraction, which concurrently fuses depth maps and regresses object properties on these feature maps.

Anchor-Free Method: MonoFlex [80] is a one-stage anchor-free method that utilizes multiple branches to predict 2D bounding boxes, orientation, dimension, depth, and keypoints. As it operates as an anchor-free detector, the representative box location is automatically assigned to the 3D projected center in the heatmap head without the need for selection. Following this architecture, we introduce a Transformer-based backbone to extract image features and fuse depth maps. Finally, multiple heads are employed on these feature maps to regress object properties.

    C. Implementation Details

We use the tiny version of the Swin Transformer (Swin-T) as our backbone to extract features from RGB images and depth maps, which has 4 stages with [2, 2, 6, 2] blocks per stage. For the KITTI dataset, similar to [85], we double the training set by utilizing images from both the left and right RGB cameras (only RGB images from the left camera are used in the val and test sets). We generate both the left and right depth maps using [21]. The input images are scaled to 384×1280 with horizontal flipping and photometric distortion data augmentation. Non-maximum suppression (NMS) with an IoU threshold of 0.4 is applied to the network output in 2D space. Following [17], we use a hill-climbing post-processing step to optimize the 3D rotation α. The network is optimized by stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0005. We use the "one-cycle" learning rate policy and set the base learning rate to 0.01.
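For reference, a hedged sketch of the optimizer setup described above (SGD with momentum 0.9, weight decay 0.0005, base learning rate 0.01, one-cycle schedule) is given below; the model object, the total step count, and the mapping of the base rate to the schedule's peak rate are placeholders and assumptions.

```python
import torch

model = torch.nn.Linear(10, 10)    # placeholder for the NF-DVT network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)
# "One-cycle" learning-rate policy with the base rate taken as the peak (assumed mapping).
total_steps = 40000                # placeholder step count
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.01,
                                                total_steps=total_steps)
for step in range(total_steps):
    # ... forward / backward on a batch would go here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```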

Following PCT [82], we utilize the AdaBins [86] monocular depth estimator, trained on the Waymo Open dataset, to derive depth maps for the Waymo Open dataset. The input images are scaled to 640×960 with horizontal flipping and photometric distortion data augmentation. The network is optimized by the AdamW [87] optimizer with an initial learning rate of 3e-4 and a weight decay of 1e-5.

As for the normalizing flows, we use NSF [72], which has a tractable Jacobian determinant and can be inverted exactly in a single pass. The neural network in our practical implementation consists of 3 fully connected (FC) layers, followed by a tanh activation function. The network is optimized by Adam with a learning rate of 0.005 and trained for 10 000 iterations.

    D. Results on the KITTI Dataset

Results of the Car category on the KITTI test set. Table II shows the results of our NF-DVT method on the KITTI test set compared to state-of-the-art monocular methods with or without depth as an extra input. For AP|R40 on the test set, our method exceeds the two baselines by a large margin and surpasses existing methods on the 3D object detection and bird's eye view tasks for the Car category under the easy, moderate and hard levels. Specifically, compared with the recently proposed DEVIANT [46], our model based on MonoFlex achieves 2.54%, 2.49%, and 2.11% gains on the easy, moderate, and hard settings with a 0.7 IoU threshold for the 3D detection task, respectively. Furthermore, our method based on MonoFlex improves with respect to the recently proposed DID-M3D [47] (which uses extra data) by 0.02%, 0.66%, and 0.25% on the three difficulty settings for the 3D detection task, which is superior in Moderate and Hard but yields only minimal gain in Easy. Since our model improves the accuracy of the depth map by normalizing flows, the depth estimation of distant objects is better, thereby improving the 3D detection results of distant objects, which is demonstrated by the significant improvement under the moderate and hard settings. As previous methods like DID-M3D have already estimated the depth of close objects accurately, the improvement of our model under such a setting is not obvious.

Results of the Car category on the KITTI val set. Table III illustrates the performance of our model on the KITTI val set for better comparison, including different tasks and IoU thresholds. As our work aims at achieving more accurate depth maps and better modeling of the fusion of depth maps and RGB images, our approach is mainly compared to D4LCN [18], which carefully designs a local convolutional network to learn dynamic-depthwise-dilated kernels for images with the guidance of depth maps. Though the results are higher in their paper [18], we show our replicated results with their provided official code and parameters in Table III with a star sign. Specifically, our method improves with respect to D4LCN* by 1.4%/2.11% for 3D/BEV detection under the moderate setting at a 0.7 IoU threshold. Moreover, compared with the recently proposed DEVIANT [46], our method based on MonoFlex achieves 0.97%, 1.88%, and 0.8% gains on the easy, moderate, and hard settings with a 0.7 IoU threshold for the 3D detection task, respectively. Finally, our method outperforms all existing advanced methods under different tasks and IoU thresholds.

TABLE II AP|R40 SCORE ON THE TEST SET OF 3D OBJECT DETECTION AND BIRD'S EYE VIEW FOR THE CAR CATEGORY WITH IOU ≥ 0.7. BEST RESULTS ARE MARKED IN BOLD

TABLE III AP|R40 SCORE ON THE VAL SET OF 3D OBJECT DETECTION AND BIRD'S EYE VIEW FOR THE CAR CATEGORY. "*" INDICATES OUR REPLICATED RESULTS FOR D4LCN [18] WITH OFFICIAL CODE AND CONFIGURATION. BEST RESULTS ARE MARKED IN BOLD

TABLE IV PERFORMANCE OF 3D OBJECT DETECTION AND BIRD'S EYE VIEW FOR PEDESTRIANS AND CYCLISTS ON THE TEST SET, AT A 0.5 IOU THRESHOLD. BEST RESULTS ARE MARKED IN BOLD

Results of Pedestrian/Cyclist on the KITTI test set. We further show the evaluation results for Pedestrian and Cyclist on the KITTI test set in Table IV. The Pedestrian and Cyclist categories are much more challenging than the Car category for monocular 3D object detection due to their small scale. Our method outperforms existing advanced methods on both the 3D object detection and localization tasks to some degree.

    E. Results on the Waymo Open Dataset

Table V shows the results of our NF-DVT method for the vehicle class on the challenging Waymo Open validation set. We also compare the proposed method to current state-of-the-art methods with or without depth as an extra input in the table. Our proposed method yields improvements over most of the recent approaches. Compared to the baseline method M3D-RPN [17] without an extra input, our proposed method achieves 2.41%/2.31% mAP gains in LEVEL_1/LEVEL_2 under an IoU threshold of 0.7, respectively. Compared to the second-best approach DEVIANT [46], our proposed method achieves 0.07%/0.12% mAP gains in LEVEL_1/LEVEL_2 under an IoU threshold of 0.7, respectively; under an IoU threshold of 0.5, it achieves 0.34%/0.95% mAP gains in LEVEL_1/LEVEL_2, respectively. Although our method gives worse results than DEVIANT [46] at distances of 0-30 m, our results are better at distances of 30-50 m and 50 m-∞. This indicates that our method perceives objects far from the camera better and can achieve better performance overall.

    F. Ablation on Different Components

To explain how much improvement each component of our NF-DVT approach provides, we perform ablation studies on the KITTI val set for the Car category. The main results are shown in Table VI. We observe that: 1) Simply using Swin-T as the backbone without depth maps cannot achieve better performance compared to the baseline using ResNet50 (a → b), which is mainly because the number of images in the KITTI val set is too small; the transformer-based feature extractor is more suitable for training on large-scale datasets. 2) The model using Swin-T as the backbone to fuse image and depth maps with cross-attention outperforms the baseline (b → c) by a large margin (+5.58 for moderate), which demonstrates the effectiveness of transformer-based multi-modal fusion for monocular 3D object detection. 3) Our transformer-based backbone for depth-image fusion is more effective than the CNN-based backbone (c → results of D4LCN* shown in Table III). 4) The proposed depth-guided relative position embedding module (c → d) achieves remarkable improvement on the 3D detection task (+0.66 under the easy level, +0.5 under the moderate level and +0.57 under the hard level). 5) By comparing setting (d → e), we find that normalizing flows can effectively introduce priors into the depth information and thereby improve overall performance, e.g., a 0.46 point improvement under the moderate level.

Moreover, to explain how our proposed method improves the baseline method M3D-RPN [17] on the Waymo Open dataset, we perform ablation studies on the Waymo Open val set for the Car category in Table VII. Most of the observations are consistent with those on KITTI. In addition, we also find that, owing to the increased volume of data in the Waymo Open dataset, the Transformer model demonstrates superior performance compared to the conventional convolutional model (a → b).

    G. Ablation on Different Depth Estimators

To explore the impact of different depth estimators on the performance of our method and to make a fair comparison between our proposed NF-DVT and existing methods, we extract depth maps using two different methods, DORN [20] and BTS [21]. Table VIII illustrates the evaluation results on the KITTI Eigen split (KITTI provides the dataset [90] with 61 scenes from the "city", "residential", "road" and "campus" categories, and existing works for monocular depth estimation commonly use the split proposed by Eigen et al. [91] for training and testing). It can be seen that BTS outperforms DORN by a narrow margin in both the 0-80 m and 0-50 m capturing ranges.

TABLE V EXPERIMENTAL RESULTS ON THE WAYMO OPEN VALIDATION SET. WE EVALUATE ON THE VEHICLE CATEGORY WITH 3D MAP AND 3D MAPH METRICS UNDER IOU THRESHOLDS OF 0.7 AND 0.5, RESPECTIVELY. "LEVEL_1" DENOTES THE EVALUATION OF THE BOUNDING BOXES THAT CONTAIN MORE THAN 5 LIDAR POINTS. "LEVEL_2" DENOTES THE EVALUATION OF ALL OF THE BOUNDING BOXES. THE BEST RESULTS ARE MARKED IN BOLD

TABLE VI ABLATION STUDIES ON THE KITTI VAL SET FOR THE CAR CATEGORY. VT IS THE VISION TRANSFORMER, CA IS THE CROSS-ATTENTION TRANSFORMER, DVT IS THE DEPTH-GUIDED VISION TRANSFORMER AND NF IS THE NORMALIZING FLOW. THUS (A) USES RESNET50 AS THE BACKBONE WITHOUT DEPTH MAPS. (B) USES SWIN-T AS THE BACKBONE WITHOUT DEPTH MAPS. (C) USES SWIN-T AS THE BACKBONE WITH IMAGE-DEPTH CROSS-ATTENTION. (D) USES THE SWIN-T BASED CROSS-ATTENTION TRANSFORMER WITH DEPTH-GUIDED RELATIVE POSITION EMBEDDING. (E) USES NORMALIZING FLOWS TO INTRODUCE PRIORS INTO THE DEPTH INFORMATION

The evaluation results on the KITTI validation set for the 3D object detection task are shown in Table IX. They show that our proposed NF-DVT method outperforms D4LCN [18] by a significant margin under different depth estimators. Specifically, our NF-DVT method improves over D4LCN* by 1.36%/1.4% under the moderate setting at a 0.7 IoU threshold when using DORN and BTS as depth estimators, respectively. Meanwhile, compared to DORN, depth maps generated by BTS bring limited gains (0.08% under the moderate setting) in 3D object detection performance.

TABLE VII ABLATION STUDIES ON THE WAYMO OPEN DATASET VAL SET FOR THE CAR CATEGORY. VT IS THE VISION TRANSFORMER, CA IS THE CROSS-ATTENTION TRANSFORMER, DVT IS THE DEPTH-GUIDED VISION TRANSFORMER AND NF IS THE NORMALIZING FLOW. THUS (A) USES RESNET101 AS THE BACKBONE WITHOUT DEPTH MAPS. (B) USES SWIN-T AS THE BACKBONE WITHOUT DEPTH MAPS. (C) USES SWIN-T AS THE BACKBONE WITH IMAGE-DEPTH CROSS-ATTENTION. (D) IS THE SWIN-T BASED CROSS-ATTENTION TRANSFORMER WITH DEPTH-GUIDED RELATIVE POSITION EMBEDDING. (E) USES NORMALIZING FLOWS TO INTRODUCE PRIORS INTO THE DEPTH INFORMATION

    TABLE VIII PERFORMANCE OF DIFFERENT DEPTH ESTIMATORS ON EIGEN KITTI

TABLE IX COMPARISON OF DIFFERENT DEPTH ESTIMATORS ON THE VAL SET OF 3D OBJECT DETECTION FOR THE CAR CATEGORY. D4LCN* REPRESENTS OUR REPLICATED RESULTS FOR D4LCN [18] WITH OFFICIAL CODE AND CONFIGURATION. ALL RESULTS ARE GIVEN BY AP|R40

    H. Ablation on Different Backbones

In Table X, we conduct various experiments with different backbones to examine the effectiveness of the transformer-based backbone. Note that we use D4LCN [18] as our baseline when constructing a CNN-based backbone. To build a Swin-based architecture without extra input data, we substitute ResNet with the Swin Transformer [26]. Furthermore, we only implement the cross-attention transformer when utilizing depth maps as extra input for the transformer-based architecture.

Swin-S and Swin-B are versions with approximately 2× and 4× the model size and computational complexity of Swin-T, respectively. The architecture hyper-parameters of these model variants are:

    TABLE X AP|R40 SCORE OF DIFFERENT BACKBONES WITH OR WITHOUT THE HELP OF DEPTH ON THE VAL SET OF 3D OBJECT DETECTION FOR THE CAR CATEGORY

- Swin-T: C = 96, layer numbers = {2, 2, 6, 2};

- Swin-S: C = 96, layer numbers = {2, 2, 18, 2};

- Swin-B: C = 128, layer numbers = {2, 2, 18, 2};

where C is the channel number of the hidden layers in the first stage. The complexities of Swin-T and Swin-S are similar to those of ResNet-50 and ResNet-101, respectively, as mentioned in [26].

From the results, we can see that the performance improvement is very limited when using more hidden layers in the backbone (ResNet-50 → ResNet-101 and Swin-T → Swin-S), with or without depth maps. However, the most significant improvement is made by increasing the channel number C in the transformer-based backbone (Swin-S → Swin-B): Swin-B improves over Swin-S by 3.56%/0.54% under the moderate setting with and without extra input, respectively. Moreover, simply using a transformer-based network instead of a CNN-based network as the backbone results in worse performance on the 3D detection task. Nevertheless, the transformer-based backbones still achieve improvements after image-depth fusion with the cross-attention mechanism: when using the depth map as an extra input, Swin-T improves over ResNet-50 by 0.44% under the moderate setting and Swin-S improves over ResNet-101 by 0.39% under the moderate setting.

    I. Ablation on Different Transformer-Based Backbones

As shown in Fig. 5, we build a new framework using the transformer encoder-decoder structure [49], [54] to verify the effectiveness of our proposed method. The framework mainly consists of four components: the backbone, the transformer encoder module, the transformer decoder module, and the 2D-3D detection head. We use ResNet-50 [1] as the backbone with a network stride factor of 16 to extract features from RGB images and depth maps. Similar to [54], the self-attention mechanism in the transformer encoder module is used to enhance the image features. The transformer decoder module is the same as the standard structure in [54], except that the object queries are the depth feature maps. The predefined 2D-3D anchor-based detection head is the same as mentioned before.

Fig. 5. The new framework of our proposed approach when using a transformer encoder-decoder structure [49].

In Table XI, we conduct ablation experiments on the KITTI validation set using the new framework. The results show that our proposed depth-guided positional encoding and depth map enhancement with normalizing flows are still effective under the new transformer-based structure.

TABLE XI PERFORMANCE OF OUR PROPOSED NF-DVT ON THE KITTI VALIDATION SET WHEN USING OTHER TRANSFORMER-BASED BACKBONES. DPE MEANS DEPTH-GUIDED POSITIONAL ENCODING AND NF MEANS USING NORMALIZING FLOWS TO BUILD PRIORS IN DEPTH MAPS

    J. Ablation on Inference Speed

In Table XII, we show the runtime and performance of the CNN-based methods with or without depth as extra input (D4LCN and MonoFlex) and our transformer-based methods (DVT and NF-DVT) on a single RTX 2080Ti GPU. For engineering considerations, the time consumption of the depth estimation and normalizing flow modules is taken into account in both the CNN-based model (D4LCN) and our transformer-based models (DVT and NF-DVT), resulting in a much slower inference speed than a model without additional input such as MonoFlex.

The results illustrate the trade-off of our model between speed and performance. Firstly, they show that our NF module for depth map preprocessing only brings a 3 ms increase during inference, which is almost negligible but brings a performance improvement. Moreover, our NF-DVT method with the Swin-T backbone achieves better performance and faster inference speed than D4LCN with the ResNet-101 backbone. Finally, the inference speed of our model can also be improved with a faster base model; for example, our NF-DVT-MonoFlex model is 40 ms faster than the NF-DVT-D4LCN model.

    K. Qualitative Results

Qualitative visualizations on the KITTI validation set are presented in Fig. 6. The predicted 3D bounding boxes of our model are shown in the images. It can be seen from the results that our model is effective for large objects such as a car and small objects such as a pedestrian. For better visualization, we also provide the results in bird's-eye-view and 3D space, which demonstrate that our proposed model can recover object distances and predict the rotation of objects accurately.

Some failure cases of our proposed NF-DVT method on the KITTI validation set are provided in Fig. 7. We provide two typical cases: 1) small objects such as pedestrians and cyclists, and 2) severely truncated objects close to the camera. These two scenarios are still challenging for current monocular 3D object detection methods.

    Moreover, qualitative visualization of the Waymo Open Dataset validation set is presented in Fig.8.The results suggest that our proposed model is capable of achieving accurate predictions in complex environmental conditions, including varying weather patterns (sun, rain and fog) and different times of day (Day and Night).Nevertheless, the model’s predictive capacity seems to falter in the case of truncated objects and distant entities situated beyond a 30 m range.This setback is consistent with earlier research and presents a formidable challenge for future investigations.

V. CONCLUSION

In this paper, we propose a novel depth-guided vision transformer framework with normalizing flows (NF-DVT) for monocular 3D object detection. We use normalizing flows to build priors in pixel-wise depth maps to achieve more accurate depth information. Then we develop a novel transformer-based backbone with a cross-attention fusion module that processes the depth map and RGB image with two separate branches and fuses them to exchange information with each other. Meanwhile, we re-weight the original relative position embeddings in the cross-attention module with the help of the refined depth maps. Extensive experiments on the KITTI dataset and the large-scale Waymo Open dataset show the effectiveness of our proposed method and its superiority over previous state-of-the-art monocular-based methods.

Fig. 6. Qualitative results on the KITTI validation set. The ground truths and our prediction results are shown in green boxes and red boxes, respectively. We provide (a) 3D bounding boxes in camera-view images; (b) projections of the 3D bounding boxes in bird's-eye-view; and (c) 3D bounding boxes in 3D space.

Fig. 7. Visualization of failure cases of our proposed NF-DVT method on the KITTI validation set. The ground truths and our prediction results are shown in green boxes and red boxes, respectively. We provide two typical cases: 1) small objects (pedestrian and cyclist); 2) severely truncated objects close to the camera.

Fig. 8. Qualitative results on the Waymo Open Dataset validation set with varying weather patterns (sun, rain and fog) and different times of day (day and night). The ground truths and our prediction results are shown in green boxes and red boxes, respectively. We provide (a) 3D bounding boxes in camera-view images; and (b) projections of the 3D bounding boxes in bird's-eye-view.
