
    Traffic Scene Captioning with Multi-Stage Feature Enhancement

Computers, Materials & Continua, 2023, Issue 9

Dehai Zhang, Yu Ma, Qing Liu, Haoxing Wang, Anquan Ren and Jiashu Liang

School of Software, Yunnan University, Kunming, 650091, China

ABSTRACT Traffic scene captioning technology automatically generates one or more sentences to describe the content of a traffic scene by analyzing the input traffic scene image, ensuring road safety while providing an important decision-making function for sustainable transportation. In order to provide a comprehensive and reasonable description of complex traffic scenes, a traffic scene semantic captioning model with multi-stage feature enhancement is proposed in this paper. In general, the model follows an encoder-decoder structure. First, multi-level granularity visual features are used for feature enhancement during the encoding process, which enables the model to learn more detailed content in the traffic scene image. Second, the scene knowledge graph is applied to the decoding process, and the semantic features provided by the scene knowledge graph are used to enhance the features learned by the decoder again, so that the model can learn the attributes of objects in the traffic scene and the relationships between objects to generate more reasonable captions. This paper reports extensive experiments on the challenging MS-COCO dataset, evaluated by five standard automatic evaluation metrics. The results show that the proposed model improves significantly on all metrics compared with state-of-the-art methods, in particular achieving a score of 129.0 on the CIDEr-D evaluation metric, which indicates that the proposed model can effectively provide a more reasonable and comprehensive description of the traffic scene.

KEYWORDS Traffic scene captioning; sustainable transportation; feature enhancement; encoder-decoder structure; multi-level granularity; scene knowledge graph

    1 Introduction

Sustainable transportation is "the provision of services and infrastructure for the movement of people and goods in a safe, affordable, convenient, efficient and resilient manner." To ensure sustainable transportation, big data analysis through methods such as [1] and data support through IoT data security transmission technologies such as [2] are used to inform transportation planning decisions. Deep learning also plays an indispensable role here. Natural language description of traffic scenes is important for assisting visually impaired people in their daily lives and in participating in traffic [3,4], as well as for generating rich semantic information for drivers, thereby supporting intelligent decision suggestions, reducing driver decision time, and lowering the risk of accidents [5]. By ensuring road safety, this maintains both the resilience and the sustainability of traffic. The process of describing traffic scenes in natural language is called traffic scene caption generation. The task spans computer vision and natural language processing, and its goal is to find the most efficient pipeline to process the input scene image, represent its content, and convert it into a sequence of words by building connections between visual and textual elements while maintaining the fluency of the language [6]. This requires accurate recognition of the objects in the scene and the ability to detect the attributes of multiple targets in the scene and the relationships among them. The most fundamental sub-task is object detection, a very important technology in computer vision with a wide range of applications, not only in the field of autonomous driving [7] but also in medical applications [8]. Building on object detection, describing complex traffic scenes accurately and comprehensively remains a challenge.

Existing methods take an image as input and generate a simple caption from visual features alone; this is known as image captioning, as shown in Fig. 1, left. However, these methods can only recognize the objects in the image and produce a simple caption. When applied to a scene such as a traffic scene, the generated captions are insufficient to support decision generation for the people involved in the traffic and offer little interpretability. For example, in Fig. 1, image captioning only generates "a man is riding a motorcycle"; if this is used for auxiliary decision generation, it is easy to miss important information and generate wrong auxiliary decisions, which can lead to accidents. The traffic scene captioning proposed in this paper adds the relationships and attributes of the traffic objects, so the caption generated by the proposed method states that "a man" is "beside" the "stop sign". Such information is very important for decision generation and provides stronger interpretability for the traffic scene image. In image captioning, each individual object region can only represent its own category and attributes; it cannot represent the relationships between object regions in the traffic scene. Ignoring the semantic relationships that prevail between regions, which help build a more reasonable and richer semantic understanding of the traffic scene, leads to a visual-semantic gap. For visual feature processing, most existing image captioning methods treat visually salient regions as a collection of individual object regions using bottom-up visual features [9]. Compared with directly using the extracted image features as the input to the decoder, a more effective approach is to process the bottom-up features further, retaining important information and discarding unimportant information. However, although these methods retain most of the important information, using only the processed coarse-grained features as the decoder input still loses detailed information.

First, to bridge the visual-semantic gap, a knowledge graph is used to provide prior knowledge for each traffic scene image in our proposed traffic scene captioning method. A knowledge graph describes the concepts of the physical world and their mutual relations in symbolic form. Its basic units are "entity-relationship-entity" triples together with entities and their attribute-value pairs. Entities are connected through relations to form a net-like knowledge structure, as shown in Fig. 2a, which is the basic structural unit of the knowledge graph. Circles (nodes) represent entities, and arrows represent relations. The semantic information represented by this triple can be expressed in natural language as "A man is holding a ball". At the same time, the entity represented by each node has attributes. For example, as shown in Fig. 2b, some basic information about the "ball", such as color and size, is treated as attributes. In this paper, the scene knowledge graph is applied to the decoding process; it is a graph-based representation of an image that encodes objects and the relationships between them, allowing a comprehensive understanding of the image [10]. In this paper, a scene knowledge graph fusion module (SKGF) is designed to fuse the scene knowledge graph context-aware embeddings: it assigns different attention strengths to the three kinds of semantic information (object, attribute, and relation) through three different attention modules before generating word sequences. The attention mechanism appears in many settings [11]. Specifically, if the contextual information about the current text unit suggests that the subject or object part should be generated at this stage, the fusion module assigns more attention to the object information, and if the predicate part should be generated, it assigns more attention to the relationship. In this way, it adaptively adjusts the attention paid to the different kinds of semantic information according to the current context, aligning the semantic feature units with the text so that the fusion achieves a better fit.

Figure 1: The difference between image captioning and our traffic scene captioning

Figure 2: Basic structure of the knowledge graph

Second, to address the problem of losing detailed information when processing visual features, this paper proposes a multi-level granularity fusion module (MLGF) that fuses features at different levels of granularity, produced by encoder layers at different depths, to enhance the visual features output by the encoder. Specifically, the fine-grained features generated by low-level encoder layers effectively preserve the detailed parts of the traffic scene, the coarse-grained features generated by high-level encoder layers preserve the overall features, and the new features generated by fusing features at different levels effectively combine the overall and detailed features to ensure more comprehensive traffic scene captioning.

In this paper, the model is evaluated by quantitative and qualitative results on the MS-COCO dataset. The experimental results show that our method achieves better results on various evaluation metrics and is effective in generating comprehensive and reasonable captions for traffic scenes. The contributions of this paper are outlined below:

1. This paper proposes a multi-stage feature enhancement method, which includes the following:

■ This paper proposes a multi-level granularity feature fusion module (MLGF) that enhances the visual features learned by the encoder to help generate more comprehensive traffic scene captions.

    ■This paper proposes a scene knowledge graph fusion module (SKGF) to enhance the features learned by the decoder to help generate richer and more reasonable traffic scene captions.

2. Extensive experiments on the MS-COCO dataset validate the effectiveness of our method. In the quantitative analysis, the model proposed in this paper achieves strong performance, and the qualitative analysis in particular demonstrates the superiority of our model through comparison of the generated captions.

    2 Related Work

The comparison between the related work and the model proposed in this paper is shown in Table 1. This section reviews the related work from the following two aspects:

Visual features only. Some early approaches [12,13] using only visual features as the source of information achieved good results by encoding images with convolutional neural networks (CNNs) and decoding them with recurrent neural networks (RNNs) to generate sentences. Subsequently, references [14,15] pioneered the introduction of an attention mechanism to allow the decoder to extract image features better. Reference [16] proposed a backtracking method to integrate the attention weights of the previous time step into the attention measure of the current time step, which is more consistent with human visual behavior. In recent years, reference [9] proposed a combined bottom-up and top-down visual attention mechanism called Up-Down. The bottom-up mechanism utilizes a set of salient image regions extracted by Faster R-CNN [17], each represented by pooled convolutional feature vectors. Some approaches have achieved good results with model improvements based on this bottom-up feature [18,19]. For visual feature extraction, some methods improve on Faster R-CNN; reference [20] proposed an end-to-end scale-invariant head detection model to count the flow of people in high-density crowds and achieved good results. In addition, some methods combine long short-term memory (LSTM) language models with self-attention mechanisms to pursue high accuracy. Among them, reference [21] extended the LSTM with an attention-on-attention module, which computes another step of attention on top of visual self-attention, and [22] improved the visual encoding and language models by enhancing self-attention through second-order interactions. Both employ gated linear units (GLUs) [23] to filter out valid information, and, similar to them, our approach also employs GLUs to control information transfer. In [24], a method for learning to configure neural modules is proposed to generate "internal patterns" connecting the visual encoder and the language decoder. These methods, which use only visual features as input, can achieve good results with relatively high accuracy, but the generated sentences lack richness and rationality because they do not take into account the relationships between objects.

Graphs based. Inspired by representation research in the graphics domain, studies such as [25] are devoted to generating graphs to represent scenes, and some recent approaches introduce graphs on top of visual features, hoping to improve the generated sentences by encoding object attributes and relationships. Among them, reference [26] treated relations as edges in a graph and improved region-level features by using object relations and graph convolutional networks (GCNs). Reference [27] achieved good results by fusing language-induced bias and graphs into the decoder. Reference [28] introduced a hierarchical attention-based module to learn the distinguishing features at each time step for word generation. Reference [29] treated relations as nodes in a graph and combined both semantic and geometric graphs. Reference [30] decomposed the graph into a set of subgraphs, each capturing one semantic component of the input image. In addition, some works, such as [31], use only graph features without visual features, reducing the computational cost while obtaining good results by bridging the semantic gap between the image scene graph and the caption scene graph. As in most approaches that use graphs, GCNs [32,33] are commonly used to integrate the semantic information in the graph, and the features aggregated over the whole graph are then further decoded into sentences.

Table 1: Comparison of related models and the proposed model

    3 Approach

    3.1 Approach Overview

The entire flow of our model for traffic scene captioning is shown in Fig. 3. The model generally follows the encoder-decoder architecture and generates comprehensive and reasonable traffic scene descriptions through multi-stage feature enhancement. First, the visual and semantic features of the scene are obtained: (1) for the visual features, bottom-up features extracted by Faster R-CNN are used. Bottom-up features are the feature vectors of the regions where the elements of the scene are located. Instead of simply dividing the image into regions of the same size, bottom-up features focus on content in a way that is more in line with human observation habits, which helps generate more reasonable traffic scene captions; (2) the semantic features are provided by the scene knowledge graph constructed from the traffic scene image. The scene knowledge graph contains the objects, the attributes of the objects, and the relationships between the objects; this paper represents relationships and attributes as nodes in the same way as objects and embeds the object, relationship, and attribute nodes through the scene knowledge graph embedding method. Second, the visual features are processed by the encoder and fused into multi-level granularity visual features, which are input to the decoder; the decoder decodes them by fusing the semantic features provided by the scene knowledge graph with the new visual contextual features to finally generate the semantic description of the traffic scene.

    3.2 Encoder

The encoder refines the visual features extracted by the visual feature extractor. Because the object features extracted directly by the visual feature extractor lack interaction, instead of feeding the acquired bottom-up feature vectors directly to the decoder, as done in [9], the method of this paper uses an encoder consisting of multiple identical encoder layers, similar to the transformer, to further process the feature vectors and deeply capture the relationships between the visual regions of the traffic scene image.

Figure 3: Overall framework of our proposed model

Our encoder is shown in Fig. 4. Based on extensive experiments and verification, and in order to control the model scale, the encoder is composed of six identical encoder layers: using more than six layers makes the model too large, while using fewer than six layers reduces accuracy. The output of each encoder layer is passed through the multi-level granularity feature fusion module, and the fused features are used as the decoder input.

Figure 4: The framework of the encoder. Only two encoder layers are shown; in fact, our encoder consists of six identical encoder layers

    3.2.1 Encoder Layer

First, after a traffic scene image is input, the regions of interest in the traffic scene are detected by the pre-trained Faster R-CNN, feature extraction is performed on the detected image regions, and the resulting set of visual feature vectors is represented as Z = {z1, z2, z3, ..., zn}, where Z ∈ R^(n×d) and d is the dimensionality of each of the n vectors. The feature vectors are linearly projected into queries Q, keys K and values V, and then fed into the multi-head attention mechanism, which is an integration of h identical self-attention heads. The adopted multi-head attention mechanism is formulated as follows, where W^O is the trainable weight matrix, i ∈ (1, h), and h is the number of heads of the multi-head attention mechanism. In Eq. (1), MultiHead denotes multi-head attention, and in Eq. (2), Attention denotes the self-attention operation:
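The display equations referenced as Eqs. (1) and (2) were lost in this version of the text; as a reference, the standard multi-head attention formulation matching the description above (with $W_i^Q$, $W_i^K$, $W_i^V$ the per-head projection matrices, notation assumed) is:

$$\mathrm{MultiHead}(Q,K,V)=\mathrm{Cat}\big(\mathrm{head}_1,\ldots,\mathrm{head}_h\big)\,W^{O},\qquad \mathrm{head}_i=\mathrm{Attention}\big(QW_i^{Q},\,KW_i^{K},\,VW_i^{V}\big) \tag{1}$$

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V \tag{2}$$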

After obtaining the output of the multi-head attention mechanism, the input query Q is concatenated with that output and passed through a linear layer into the gated linear unit (GLU), which filters out the invalid features in the multi-head attention output and in the query Q and passes on the useful feature information. Finally, the output of the encoder layer is produced by the Add & Norm layer. One encoder layer is formulated as follows, where i ∈ (1, N), N is the number of encoder layers in the encoder, LN denotes the LayerNorm operation and Cat denotes the concatenation operation:
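As an illustration of this gated encoder layer, the following is a minimal PyTorch sketch assuming d_model = 1024 and h = 8 heads (the values used in Section 4.4); the size of the linear layer feeding the GLU is an assumption, and this is not the authors' released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GLUEncoderLayer(nn.Module):
    """Minimal sketch of the gated encoder layer described above.

    Assumes d_model = 1024 and 8 heads; the width of the linear layer feeding
    the GLU is an assumption, not taken from the authors' code.
    """

    def __init__(self, d_model=1024, num_heads=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # The GLU halves its last dimension, so the linear layer maps the
        # concatenated [query; attention output] (2*d_model) to 2*d_model.
        self.gate_proj = nn.Linear(2 * d_model, 2 * d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention over the region features (queries = keys = values = x).
        attn_out, _ = self.mha(x, x, x)
        # Concatenate the query with the attention output, project, and gate.
        gated = F.glu(self.gate_proj(torch.cat([x, attn_out], dim=-1)), dim=-1)
        # Residual connection followed by LayerNorm (the Add & Norm step).
        return self.norm(x + gated)


# Example: 36 region features of dimension 1024 for one image.
layer = GLUEncoderLayer()
out = layer(torch.randn(1, 36, 1024))  # shape: (1, 36, 1024)
```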

    3.2.2 Encoder with Multi-Level Granularity Feature Fusion Module

The encoder is composed of multiple encoder layers with the same structure. In the conventional transformer, the decoder only ever takes the output of the last encoder layer as its input:

In this paper, the output of the i-th encoder layer is denoted as X_i, and the i-th encoder layer takes the output X_{i-1} of the (i-1)-th encoder layer as input, where i ∈ (1, N) and N is the number of encoder layers in the encoder. In the traditional transformer structure, the input of the decoder is therefore X_N, which is formulated as follows, where Y is the semantic sequence output by the decoder:

However, the output of the last encoder layer can only represent the global features of the traffic scene, i.e., coarse-grained features, and lacks the finer and more precise details of the traffic scene, i.e., fine-grained features. Using only the output of the last encoder layer as the decoder input inevitably loses details and thus leads to incomplete traffic scene captioning. Therefore, to enable the decoder to understand the more detailed and precise parts of the traffic scene, this paper proposes the multi-level granularity feature fusion module (MLGF), which focuses not only on the global, coarse-grained features of the traffic scene but also on features at different levels of granularity; in particular, the first encoder layer focuses on the finest-grained features in the traffic scene. The MLGF consists of an element-level summation unit and a LayerNorm unit, takes the output of each encoder layer as input, and generates new visual features for the encoder that reflect the different levels of granularity of the original traffic scene. The fused features are then used as the input of the decoder. At this time, the input of the decoder is:
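With $X_i$ denoting the output of the $i$-th encoder layer as above, the fusion performed by the element-level summation unit and the LayerNorm unit can be written compactly as follows (any learned weighting of the layers is not specified in the text, so an unweighted sum is assumed):

$$X_{\mathrm{MLGF}}=\mathrm{LN}\!\left(\sum_{i=1}^{N}X_i\right),\qquad Y=\mathrm{Decoder}\big(X_{\mathrm{MLGF}}\big)$$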

    3.3 Decoder

The decoder fuses the visual features provided by the encoder with the semantic relation information provided by the scene knowledge graph to generate a text representation of the traffic scene image, that is, the traffic scene caption.

    3.3.1 Scene Knowledge Graph Representation

Following [27], the scene knowledge graph is extracted by an image parser consisting of three components: the object detector Faster R-CNN, the relationship detector MOTIFNET [34], and an attribute classifier consisting of FC-ReLU-FC-Softmax layers. The extracted scene knowledge graph can be represented as G = (N, E), where N is the set of nodes. Treating attributes and relationships as nodes in the same way as objects, N contains object nodes O = {o_i}, attribute nodes A = {a_{i,k}} and relationship nodes R = {r_{ij}}, where o_i denotes the i-th object, a_{i,k} is the k-th attribute of o_i, and r_{ij} is the relationship between o_i and o_j. E is the set of edges, each of which corresponds to a triple ⟨o_i, r_{ij}, o_j⟩.

    3.3.2 Scene Knowledge Graph Embedding with GCN

For the scene knowledge graph, there are three types of nodes whose semantic context-aware embeddings need to be learned, i.e., the embeddings of the o_i, a_{i,k} and r_{ij} nodes, and each node in N can be represented by a d-dimensional vector. Similar to the work in [29], for the object node, the contextual embedding cannot be learned using only the object node representation provided by the scene knowledge graph. We fuse the visual features detected by Faster R-CNN with the object node features to obtain the object node feature representation, and the object node context-aware embedding is calculated as follows:

For the attribute node, since an attribute belongs to an object and is defined relative to it, the attribute is integrated with its object context. The context-aware embedding of the attribute node is as follows:

For the relationship node, similar to the attribute node, it is defined relative to the subject node o_i and the object node o_j, so it is integrated with its object nodes. The context-aware embedding of the relationship node is as follows:

where g_o, g_a and g_r are spatial graph convolutions with independent parameters but the same structure, using an FC-ReLU-Dropout layer structure. f_o, f_a and f_r are feature projection layers; similarly, their input vectors are passed through a fully connected layer followed by a ReLU. The outputs are the learned semantic context-aware embeddings.
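Collecting the three cases, a compact reconstruction of the context-aware embeddings (with $[\,\cdot\,;\,\cdot\,]$ denoting concatenation, $v_i$ the visual feature of region $i$, and the exact argument ordering assumed) is:

$$e_{o_i}=g_o\big(f_o([v_i;o_i])\big),\qquad e_{a_{i,k}}=g_a\big(f_a([o_i;a_{i,k}])\big),\qquad e_{r_{ij}}=g_r\big(f_r([o_i;r_{ij};o_j])\big)$$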

    3.3.3 Decoder with Scene Knowledge Graph Fusion Module

The decoder uses the encoder-enhanced visual features to generate sentences describing the traffic scene. To accurately describe the relationships and attributes of the objects in the traffic scene, this paper proposes the scene knowledge graph fusion module (SKGF). The semantic features of objects, attributes, and relationships provided by the scene knowledge graph are fused to generate a more reasonable caption of the traffic scene. In a sentence describing a traffic scene, a word usually corresponds to a semantic unit in the image; for example, the subject and object parts correspond to object units, and the predicate part corresponds to relationship units. Therefore, when fusing semantic features, the alignment of text words with semantic units must be considered, and SKGF implements the alignment of text with object units, attribute units, and relationship units to fuse the semantic features in a suitable way.

where MHA_O, MHA_A and MHA_R are multi-head attention mechanisms. In these multi-head attention mechanisms, the input is no longer the linear projections Q, K and V of the feature vectors but the output h_t of the LSTM and the semantic feature embeddings of objects, attributes and relationships, calculated as follows:

The self-attention used inside these multi-head attention mechanisms is calculated as follows, where d is the vector dimension of h_t and of the semantic features:
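A sketch of the underlying single-head computation, with $h_t$ acting as the query and the semantic embeddings $e^{x}$ serving as keys and values (the splitting into $h$ heads is omitted and the notation $\hat{e}^{x}_{t}$ is assumed):

$$\hat{e}^{x}_{t}=\mathrm{softmax}\!\left(\frac{h_t\,(e^{x})^{\top}}{\sqrt{d}}\right)e^{x},\qquad x\in\{o,a,r\}$$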

Our decoder is shown in Fig. 5. In particular, at decoding time step t, the input to the LSTM is the current input word W_t, the average-pooled visual features provided by the encoder plus the previously saved context vector c_{t-1}, and the previous LSTM hidden state h_{t-1}. The current LSTM output is calculated as follows:

where W_e ∈ R^(E×Σ) is a word embedding matrix for a vocabulary of size Σ, and Π_t is the one-hot encoding of the input word at time step t. h_t is used as the query vector for computing the attended object, attribute, and relationship features.
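A reconstruction of this LSTM update consistent with the description, writing $\bar{v}$ for the average-pooled encoder output (the symbol is assumed):

$$x_t=\big[W_e\Pi_t;\ \bar{v}+c_{t-1}\big],\qquad h_t=\mathrm{LSTM}\big(x_t,\,h_{t-1}\big)$$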

The output h_t of the LSTM is then fused with the aligned context vectors of the three semantic units and filtered with a gating unit, so the model learns which semantic unit the current word concerns and decides what the decoder should pay more attention to at this moment. The final filtered context vector is used as input to the generator, which contains a linear layer and a softmax layer that produce probability scores to predict the next word:

where W_p is a learned weight matrix and b_p ∈ R^Σ is a learned bias.
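The generator step described above therefore takes the standard form (with $c_t$ the filtered context vector):

$$p\big(y_t\mid y_{1:t-1}\big)=\mathrm{softmax}\big(W_p\,c_t+b_p\big)$$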

    4 Experiments

    4.1 Dataset

This paper carries out experiments on the popular MS-COCO dataset [35], which contains 123,287 images: 82,783 training images and 40,504 validation images, each annotated with five sentences describing the image. In this paper, the Karpathy data split [13] is used for evaluation, with 113,287 training images, 5,000 validation images, and 5,000 testing images. For the caption text, this paper converts all words to lowercase, removes rare words that occur fewer than five times, and limits each sentence to a maximum length of 16 words.

    4.2 Evaluation Metrics

In this paper, five standard automatic evaluation metrics are used to evaluate the proposed method, namely BLEU [36], METEOR [37], ROUGE-L [38], CIDEr-D [39], and SPICE [40].

    4.3 Objective

For a given sequence of ground-truth captions, we train our model by minimizing the following cross-entropy loss, where θ is the model parameter set and T is the word sequence length:
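The cross-entropy objective referred to here is the standard one over the ground-truth word sequence $y^{*}_{1:T}$:

$$L_{XE}(\theta)=-\sum_{t=1}^{T}\log p_{\theta}\big(y^{*}_{t}\mid y^{*}_{1:t-1}\big)$$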

Then, the following negative expected reward is minimized, using the CIDEr-D [39] score as the reward:
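This corresponds to the standard expected-reward objective over sampled captions $y_{1:T}$:

$$L_{R}(\theta)=-\mathbb{E}_{y_{1:T}\sim p_{\theta}}\big[r(y_{1:T})\big]$$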

where r is the scoring function, and the gradient of this loss is approximated according to self-critical sequence training (SCST) [41] as:
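In the standard SCST form [41], with $y^{s}_{1:T}$ a sampled caption and $\hat{y}_{1:T}$ the greedily decoded baseline caption, this approximation reads:

$$\nabla_{\theta}L_{R}(\theta)\approx-\big(r(y^{s}_{1:T})-r(\hat{y}_{1:T})\big)\,\nabla_{\theta}\log p_{\theta}\big(y^{s}_{1:T}\big)$$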

    4.4 Implementation Details

Visual Genome (VG) [42] has rich scene graph annotations such as object classes, object attributes, and pairwise relationships, so this paper uses the Visual Genome (VG) dataset to train the image parser, including the object detector, attribute classifier and relationship detector. After filtering the dataset by objects, attributes and relationships, 305 object, 103 attribute, and 64 relationship categories remain for training.

For the input word, object, attribute, and relationship categories, this paper uses word embeddings of size 1024, 128, 128, and 128, respectively. For the feature projection layers (f_o, f_a, f_r) and GCNs (g_o, g_a, g_r), we set the output dimension to 1024. Similarly, the dimension of the visual features input to the encoder is set to 1024, which is also the hidden size of the LSTM in the decoder. In this paper, the number of encoder layers is set to N = 6 and the number of attention heads to h = 8. During training, the Adam [43] optimizer is used for 35 epochs under cross-entropy loss, with a batch size of 10, an initial learning rate of 2×10^-4, and a decay factor of 0.8 every three epochs. Then, by optimizing the CIDEr-D reward, reinforcement learning [41] is used to train the model up to 50 epochs with an initial learning rate of 3×10^-5. In the inference stage, a beam search with a beam size of 2 is used.
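As an illustration of this schedule, a minimal PyTorch sketch of the cross-entropy phase is shown below; the model placeholder and the loop body are stand-ins, not the authors' released training code:

```python
import torch
import torch.nn as nn

# Sketch of the cross-entropy training schedule described above; the captioning
# model is replaced by a dummy module so the schedule runs stand-alone.
model = nn.Linear(1024, 1024)  # placeholder for the full captioning model

optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
# Decay the learning rate by a factor of 0.8 every three epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.8)

for epoch in range(35):
    # ... one epoch of training with batch size 10 under cross-entropy loss ...
    scheduler.step()
    print(epoch + 1, scheduler.get_last_lr())
```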

In this paper, the training and inference processes are completed on a server equipped with an RTX A5000. Under cross-entropy loss training, one iteration takes an average of 0.156 s, an epoch contains 11,328 iterations, and the total time to train an epoch averages 30 min, counting the simple inference tests done during training. Under CIDEr-D score optimization, one iteration takes an average of 0.280 s, an epoch also contains 11,328 iterations, and the total time to train an epoch averages 54 min. In the inference stage, 5,000 images are used; the average inference time for one image is 0.046 s, and the inference time for 5,000 images is 230 s.

    4.5 Performance Comparison and Analysis

    4.5.1 Comparison Methods

This paper compares the proposed traffic scene captioning model with the following image captioning models; for a fair comparison, all methods are experimented with and validated on the same COCO Karpathy test split: (1) SCST [41] proposes a reinforcement learning method based on the idea of self-criticism to train sequence generation models; (2) Up-Down [9] proposes a visual attention mechanism combining bottom-up and top-down attention with a two-layer LSTM as the core sequence generation model; SCST and Up-Down are two powerful baselines that use more advanced self-critical rewards and visual features; (3) the hierarchical attention network (HAN) [44] uses hierarchical features to predict different words; (4) CIDErBtw [45] proposes a new training strategy to encourage the uniqueness of the caption generated for each image; (5) human consensus-oriented image captioning (HCO-IC) [46] explicitly uses human consensus to measure the quality of ground-truth captions in advance and directly encourages the model to learn high-quality captions with high priority; (6) the recurrent fusion network (RFNet) [47] uses a recurrent fusion network to fuse features from different sources and exploit complementary information from multiple encoders; (7) spatio-temporal memory attention (STMA) [48] uses an attention model to learn the spatio-temporal relationships of image captions; (8) SubGC [30] decomposes the graph into a series of subgraphs that capture meaningful objects; (9) graph convolutional networks plus long short-term memory (GCN-LSTM) [26] treats visual relations as edges in a graph to help refine region-level features; (10) the scene graph auto-encoder (SGAE) [27] introduces graph self-encoding into the model; (11) visual semantic units alignment (VSUA) [29] fuses semantic and geometric graphs; (12) SG2Caps [31] uses only graph labels to bridge the semantic gap between image scene graphs and caption scene graphs; (13) divergent-convergent attention (DCA) [49] proposes a new divergent-convergent attention to focus on fine-grained semantic information.

    4.5.2 Quantitative Analysis

This paper compares the performance of our proposed traffic scene captioning model with image captioning models on the MS-COCO dataset trained with cross-entropy loss; the results are summarized in Table 2. The top part of Table 2 shows the models that use only visual features as the feature source, the middle part shows the models that employ the semantic information of graphs, and the bottom part shows our proposed model, which utilizes both visual features and the semantic information provided by the scene knowledge graph. The results show that, with cross-entropy loss training, our approach improves significantly over the models using only visual features and outperforms the models using semantic information on all evaluation metrics. This demonstrates the effectiveness of our model, which fuses visual features with semantic features while also incorporating multi-level granularity encoder features.

To compare the performance of our model fairly, this paper also reports the performance of image captioning models trained under CIDEr-D score optimization against our proposed traffic scene captioning model, as shown in Table 3. The reported results show that the model trained with CIDEr-D score optimization improves significantly on all metrics relative to cross-entropy loss; in particular, the CIDEr-D score improves from 119.3 to 129.0, a gain of 9.7 points, with gains of 0.6 to 2.9 points on the other metrics. Meanwhile, our model outperforms the other models on most evaluation metrics.

Table 2: Performance comparison with existing methods on the MS-COCO Karpathy test split. All models are trained with cross-entropy loss. All values are reported as percentages (%), where B@N, M, R, C, and S are short for the BLEU@N, METEOR, ROUGE-L, CIDEr-D, and SPICE scores. "-" indicates that the value is not reported in the published work

    4.5.3 Qualitative Analysis

To qualitatively evaluate the validity of our proposed model, Fig. 6 shows some examples of generated traffic scene captions. This paper adopts the strong Up-Down model as the baseline. In particular, we reimplemented the Up-Down model; our reimplementation achieved better performance than the official Up-Down model and is used as our baseline, as shown in Table 4, where "*" denotes the result of the reimplementation. "Base", "Ours", and "GT" in Fig. 6 represent the captions from the baseline, our model, and the ground truth, respectively.

Figure 6: Example results of the captions generated by our model, the Up-Down baseline, and the ground truth

In general, the eight examples in Fig. 6 show that the baseline can generate fluent and relatively scene-appropriate captions, but our model can generate more reasonable and comprehensive traffic scene captions, i.e., captions enriched by object attributes, objects and their relationships, and finer-grained feature processing. Specifically, Figs. 6a and 6b provide more detailed descriptions of object attributes in the traffic scene; e.g., in Fig. 6a, our approach describes the attribute of the "traffic light" as "red" and the attribute of the "buildings" as "red". The baseline can only identify the "traffic light" and "buildings" objects but cannot describe their attributes. Figs. 6c and 6d describe the relationships between objects in the traffic scene; for example, in Fig. 6c, our method generates a caption in which the relationship between "people" and "motorcycles" is "riding", and the relationship between "motorcycles" and "road sign" is "next to", which the baseline cannot express. Figs. 6e and 6f show that our model can describe more detailed and comprehensive information in the traffic scene; for example, in Fig. 6e, the baseline only attends to "a man" in the traffic scene, while our model also attends to another object, namely "two men". The "street corner" in Figs. 6g and 6h shows that our model can also identify and describe objects in the traffic scene more accurately. In summary, our model has the following advantages: (1) it identifies and describes objects in the traffic scene more accurately; (2) it describes the attributes of objects in detail; (3) the relationships between objects are clearer and more accurate; and (4) it attends to more fine-grained information in the traffic scene.

    4.5.4 Failure Cases Analysis

To analyze the model proposed in this paper comprehensively, its failure cases are examined. Fig. 7 shows some examples of generated scene captions. "Ours" and "GT" in Fig. 7 respectively represent the captions generated by the model proposed in this paper and the ground truth.

In Fig. 7a, although the model proposed in this paper recognizes "busy street", the word "street" appears twice, and the same problem also appears in Fig. 7b. Extensive experimental study shows that this happens because "street" appears frequently in the scene knowledge graph, so "street" is introduced into the model as noise. As a result, the same word is generated twice during caption generation, making the sentence awkward and even grammatically incorrect. In Fig. 7c, although the model describes the color of a "bus", only the color of one "bus" is recognized. In Fig. 7d, due to the influence of noise, the model incorrectly identifies the cyclist as a person running in the rain. From the above examples, it can be concluded that while the semantic features provided by the scene knowledge graph are used to describe semantic relations, the introduction of noise is an unavoidable problem that needs to be addressed. However, such problems occur infrequently and have little impact on the performance of the model.

    4.5.5 Ablative Analysis

To demonstrate the effectiveness of the two core modules (MLGF and SKGF) of our model for traffic scene captioning, this paper quantifies the contribution of each proposed module through an ablation study. All models participating in the ablation study use the same hyperparameters and are evaluated after cross-entropy loss training. The results of our ablation study are shown in Table 5.

In model 1, the encoder does not have the multi-level granularity feature fusion (MLGF) module and the decoder does not have the scene knowledge graph fusion (SKGF) module. Specifically, for the encoder, after removing the MLGF module, the output of the last encoder layer is used directly as the decoder input, as shown in Fig. 8b; for the decoder, after removing the SKGF module, the output of the LSTM is used directly as the input of the linear layer, as shown in Fig. 8c. Compared with model 1, model 2 adds the MLGF module to the encoder; similarly, compared with model 3, model 4 adds the MLGF module to the encoder. The results in Table 5 show that model 2 and model 4 improve on most metrics compared with model 1 and model 3, which lack the encoder MLGF module. This indicates that our encoder MLGF module attends not only to the global information of the traffic scene but also to its fine-grained information, which helps improve the performance of the model. Compared with model 1, model 3 uses SKGF to fuse the scene knowledge graph information in the decoder; similarly, compared with model 2, model 4 also fuses the scene knowledge graph information in the decoder. The results in Table 5 show that, compared with model 1 and model 2, which do not fuse the scene knowledge graph in the decoder, model 3 and model 4 improve by 1.1%~7.8% and 1.1%~8.2%, respectively, on all metrics, which indicates that the semantic information about objects, attributes, and relationships provided by the scene knowledge graph plays a very important role in traffic scene captioning.

Figure 8: Frameworks for ablative analysis: (a) encoder with the multi-level granularity feature fusion module (MLGF); (b) encoder without the multi-level granularity feature fusion module (MLGF); (c) decoder without the scene knowledge graph fusion module (SKGF)

    5 Conclusions

This paper proposes a multi-stage feature enhancement approach, i.e., a deeper refinement of scene information through visual and semantic feature enhancement in the encoding and decoding stages, which helps generate traffic scene captions more comprehensively and rationally. Experimental results show that our model achieves better performance on most metrics, in particular scores of 129.0 and 22.4 on the CIDEr-D and SPICE evaluation metrics, respectively, a considerable improvement over other methods that use only visual features or scene graphs. However, the model proposed in this paper is still large, with a total of about 104 M parameters, which may reduce efficiency, so in future work we will try to reduce the size of the model to improve efficiency. In the meantime, we will explore more effective feature enhancement methods for further improvement, especially better fusion methods for multi-level granularity features.

Acknowledgement: This research was supported by the Yunnan Provincial Key Laboratory of Software Engineering and the Kunming Provincial Key Laboratory of Data Science and Intelligent Computing.

Funding Statement: This research was funded by (i) the Natural Science Foundation of China (NSFC) under Grant Nos. 61402397, 61263043, 61562093 and 61663046; (ii) the Open Foundation of the Key Laboratory in Software Engineering of Yunnan Province, No. 2020SE304; and (iii) the Practical Innovation Project of Yunnan University, Project Nos. 2021z34, 2021y128 and 2021y129.

Author Contributions: The authors confirm contribution to the paper as follows: study conception and design: D. Zhang, Y. Ma; data collection: Y. Ma; analysis and interpretation of results: D. Zhang, Y. Ma, H. Wang; draft manuscript preparation: H. Wang, A. Ren, J. Liang. All authors reviewed the results and approved the final version of the manuscript.

Availability of Data and Materials: The authors confirm that the data supporting the findings of this study are available within the article.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
