
    Generative Multi-Modal Mutual Enhancement Video Semantic Communications


Yuanle Chen, Haobo Wang, Chunyu Liu, Linyi Wang, Jiaxin Liu and Wei Wu*

1 The College of Communication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China

2 The College of Science, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China

ABSTRACT Recently, there have been significant advancements in the study of semantic communication in single-modal scenarios. However, the ability to process information in multi-modal environments remains limited. Inspired by the research and applications of natural language processing across different modalities, our goal is to accurately extract frame-level semantic information from videos and ultimately transmit high-quality videos. Specifically, we propose a deep learning-based Multi-Modal Mutual Enhancement Video Semantic Communication system, called M3E-VSC. Built upon a Vector Quantized Generative Adversarial Network (VQGAN), our system leverages mutual enhancement among different modalities by using text as the main carrier of transmission. With it, semantic information is extracted from the key-frame images and audio of the video and subjected to differential processing, ensuring that the extracted text conveys accurate semantic information with fewer bits and thus improving the capacity of the system. Furthermore, a multi-frame semantic detection module is designed to facilitate semantic transitions during video generation. Simulation results demonstrate that our proposed model maintains high robustness in complex noise environments, particularly under low signal-to-noise ratio conditions, improving the accuracy and speed of semantic transmission in video communication by approximately 50 percent.

KEYWORDS Generative adversarial networks; multi-modal mutual enhancement; video semantic transmission; deep learning

    1 Introduction

Over the past 30 years, wireless communication systems have undergone rapid development. From the introduction of the first-generation (1G) mobile communication system to the gradual commercial deployment of the current fifth-generation (5G) mobile communication system [1], wireless communication technology has experienced generational upgrades approximately every ten years, continuously advancing towards higher efficiency and intelligence.

In recent years, with the emergence of the concept of semantic communication, communication is no longer confined to the mere transmission of complete bit data. Semantic communication is a revolutionary communication approach based on artificial intelligence technology, aiming to achieve more efficient, accurate, and natural information exchange through the understanding and processing of language and semantics [2]. By leveraging techniques such as natural language processing, machine learning, and deep learning, semantic communication enables communication systems to understand and analyze the meaning of language and generate appropriate responses and interactions based on this understanding. This concept has been demonstrated in many related works [3–5].

In semantic communication, textual or voice information is transformed into interpretable and processable semantic representations. Through the use of semantic analysis and inference techniques, the system can understand and extract the key elements of the information, implicitly embedding semantic knowledge in the training parameters. This deep learning-based approach enhances the efficiency and flexibility of communication, as information can be transmitted and processed more rapidly, without overreliance on extensive bandwidth.

Video communication plays a crucial role in global communication as a significant means of transmitting information. According to recent statistical data, video content constitutes a substantial portion of internet traffic, and platforms like YouTube and Netflix, due to their delivery of a large volume of video content, have contributed to a significant increase in global bandwidth consumption [6]. Just as [7] emphasizes the importance and research value of real-time communication and virtual meetings through video conferencing, the need to update video transmission methods is evident.

However, as discussed in [8], traditional methods of video communication transmission predominantly rely on the transmission of audiovisual signals and depend on source-channel coding. These conventional approaches have limitations and issues concerning video quality and bandwidth constraints.

Firstly, video quality presents a significant challenge for traditional video communication. As shown in [9], transmitting high-resolution videos requires substantial bandwidth, while network bandwidth is typically limited. This can lead to buffering, freezing, or a decrease in video quality. Applications that require high-definition imagery and smooth video, such as remote healthcare or video conferences, might experience severe user experience issues due to video quality problems. Secondly, reference [9] indicated that most wireless video transmissions employ a modular approach, and this separation-based design can lead to the so-called “cliff effect.” When channel conditions cannot meet the transmission expectations, the overall communication performance of the system is significantly reduced, and the channel capacity decreases. Additionally, extra error correction processing is required to maintain the quality of video transmission.

As research deepens, semantic communication has demonstrated significant potential in single-modal scenarios, offering precise information transmission in text-related studies [10–12], image-related studies [13–15], audio-related studies [16,17], and more; however, the effectiveness of processing information in multi-modal settings remains limited. In multi-modal scenarios, semantic communication encounters several challenges. One of the key difficulties is achieving consistency across different modalities, ensuring that information conveyed through various channels aligns and complements each other. This necessitates the efficient integration and synchronization of textual, visual, and auditory elements to create a cohesive and meaningful communication experience [18]. Furthermore, understanding and interpreting the rich and complex information present in multiple modalities poses another challenge. It involves extracting and processing relevant features, patterns, and semantics from different modal inputs and integrating them to form a comprehensive understanding of the communicated content. Additionally, the fusion of heterogeneous modalities, each with its distinct features and representations, presents a challenge in effectively combining and leveraging information from different sources. This requires advanced multi-modal fusion techniques, such as feature-level fusion, decision-level fusion, or semantic-level fusion, to maximize the advantages of each modality and enhance overall communication performance.

The research on multi-modal semantic communication has gained increasing attention in academia. In this paper, an end-to-end model for extracting and transmitting multi-modal information is considered at the semantic level. The purpose is to accurately extract visual and audio information from video clips and transmit semantic information based on the textual modality, aiming to achieve efficient information recovery with low transmission overhead. For the multi-modal video semantic communication we are studying, the following issues are taken into account:

Question 1 Semantic Representation: How to represent and encode the extracted information in a semantic-rich format that facilitates efficient transmission and interpretation.

Question 2 Modality Fusion: How to effectively integrate and fuse information from different modalities, such as video, audio and text, to enhance the overall semantic understanding and transmission.

Question 3 Multi-Modal Alignment: How to align and synchronize the different modalities to ensure coherent and meaningful communication across modalities.

Question 4 Robustness and Adaptability: How to handle challenges such as noisy or incomplete input data and variations in input quality, and how to adapt to different communication scenarios.

In light of the advancements in natural language processing across various modalities, this paper proposes a deep learning-based Multi-Modal Mutual Enhancement Video Semantic Communication model (M3E-VSC). The objective of this model is to leverage mutual enhancement among different modalities to achieve accurate semantic transmission while minimizing spectrum resource consumption. The main contributions of this paper are as follows:

• A novel M3E-VSC model is proposed, which advances the field of multi-modal communication by addressing the difficulties in multi-modal information interaction, the weak correlation between audio and images, and the potential loss of semantic information during multi-modal transitions, offering a promising solution for maintaining consistency and coherence across different modalities.

• A multi-modal mutual enhancement network is designed and trained that leverages generative adversarial networks and information differential mechanisms to achieve complementarity and error correction among different modalities, optimizing the extraction of semantic information. Ultimately, the most essential semantic features of the image frames are preserved and aligned with the crucial semantic information from the other modalities, enabling high-quality video restoration.

• Simulation results demonstrate that our proposed M3E-VSC performs well in the extraction of semantic information from multi-modal data. It exhibits significantly improved restoration performance and robustness compared to traditional communication methods under low signal-to-noise ratio transmission conditions, resulting in approximately 30% to 40% enhancement in the quality of the restored images and audio. Additionally, the overall transmission overhead of the entire communication framework is reduced by 1 to 2 orders of magnitude. These results highlight the effectiveness and robustness of our approach in realistic communication scenarios.

The remainder of this paper is organized as follows: Section 2 provides a review of the related literature on semantic communication and multi-modal information processing. Section 3 introduces the proposed M3E-VSC model, including its model architecture and implementation details. The experimental results and analysis are presented in Section 4. Finally, Section 5 concludes the paper and discusses potential directions for future research.

    2 Related Work

    2.1 Semantic Communication

In 1948, Weaver and Shannon first introduced the concept of semantic communication in their seminal work [19], expanding the scope of research in the field of communication beyond the transmission of traditional bit characters. With the advancements in natural language processing, semantic communication gained increased attention in the 1980s and 1990s. Building upon the foundations laid by Carnap and Bar-Hillel in their seminal work on semantic information theory in 1952, a general model for semantic communication was established by Shi et al. [20], who also introduced the concept of semantic noise in semantic encoding and decoding.

In recent years, with the advancement of artificial intelligence, semantic communication has entered a new stage. Leveraging the powerful feature extraction and modeling capabilities of large-scale deep neural networks, there has been significant progress in the quantitative analysis and communication of semantic information. The current paradigm of semantic communication involves embedding the original data into a low-dimensional space that encapsulates semantic information, aiming to achieve compression of the source information [21]. In [22], the authors proposed a joint source-channel coding framework based on natural language processing techniques, utilizing bidirectional long short-term memory (BiLSTM) or Transformer networks to extract and compress the semantic information contained in the original data. In addition, in literature [14], the receiver utilized a shared knowledge base to achieve the inverse recovery of information from the sender for implementing semantic communication. By leveraging inference rules, the transmission of semantic information can be achieved more robustly [10].

Semantic communication has been continuously developed in the field of single modality, primarily focusing on the text, image, or audio modalities. In the context of the text modality, researchers have been devoted to utilizing natural language processing and text encoding techniques to achieve the transmission and understanding of semantic information. By transforming text data into vector representations or utilizing pre-trained language models for encoding, semantic information can be extracted from text and transmitted. Weng et al. [23] proposed a semantic communication approach by designing an end-to-end speech recognition system called DeepSC-SR. This system aims to learn and extract semantic features related to text while exploring robust models to cope with complex environments. Yan et al. [24] introduced the concept of semantic spectral efficiency (S-SE) as an optimization criterion for resource allocation in text semantic communication. They optimized the allocation of resources in terms of the number of transmitted semantic symbols. In practical applications of text semantic communication, Xu et al. [25] proposed a collaborative semantic-aware architecture to propagate basic semantics from collaborating users to servers, aiming to reduce data traffic. To evaluate the advantages of the proposed architecture, they presented a case study of vehicle image retrieval tasks in an intelligent transportation system (ITS).

In the field of the image modality, researchers have focused on transforming images into compact representations using image encoding and feature extraction techniques to facilitate the transmission of semantic information. Common approaches include using convolutional neural networks (CNNs) to extract image features or mapping images to low-dimensional representations in a latent space. Huang et al. [26] defined semantic concepts of image data, including categories, spatial arrangements, and visual features, as representation units. They proposed a convolutional neural network (CNN) for extracting semantic concepts and a semantic encoder based on a generative adversarial network (GAN) that incorporates attention modules to fuse local and global features. Bourtsoulatze et al. [7] introduced a joint source-channel coding (JSCC) technique for wireless image transmission. They designed an autoencoder that parameterizes the encoder and decoder functions through two CNNs. Zhang et al. [27] proposed a novel neural network-based semantic communication system for image transmission, employing a dynamic training process for transfer learning.

For practical applications, Sun et al. [28] presented a new paradigm for image transmission in the context of aerial scene classification. Their approach focused on semantic block transmission for images and channel-condition-aware perception on the frontend unmanned aerial vehicle (UAV). They employed deep reinforcement learning (DRL) to explore the contributions of the optimal semantic blocks to the backend classifier under various channel conditions. These advancements in image semantic communication showcase various techniques and architectures that aim to efficiently transmit and utilize semantic information in the context of image data.

In the field of audio research, semantic communication has become a key technology for achieving more intelligent human-machine interaction and speech communication. In their studies, Tong et al. [29] focused on wireless network semantic communication based on audio. To extract semantic information from audio signals, they proposed a CNN-based autoencoder built on the wav2vec architecture, which enabled high-precision audio transmission with a small amount of data. Following this, Han et al. [30] proposed a novel DL-based end-to-end transceiver that can extract and encode semantic information from the input speech spectrogram at the transmitter and output the corresponding transcriptions of the decoded semantic information at the receiver. For speech-to-speech transmission, they further introduced a connectionist temporal classification (CTC) alignment module to extract a small amount of additional speech-related but semantically irrelevant information to better reconstruct the speech signal at the receiver from the original text. Overall, research on semantic communication in the audio modality primarily focuses on speech emotion recognition, semantic understanding and generation, and speech synthesis. Accurately extracting semantic information and achieving more natural and realistic speech generation remain important areas for further investigation.

In addition to studying the semantic context of information in different modalities, improving the efficiency of semantic transmission in practical communication settings is also an important research direction. In [31], Wang et al. proposed an adaptive semantic-bit communication structure based on resource efficiency enhancement for extended reality (XR), in which part of the XR users employ semantic communication, while others employ the conventional way. By employing a paradigm selection scheme based on the Signal-to-Interference-plus-Noise Ratio (SINR) and a power allocation algorithm based on genetic algorithms, they utilized adaptive communication and power allocation to maximize the achievable system-level performance indicated by the effective semantic rate.

Furthermore, Wang et al. in [32] built a bidirectional caching task model to achieve enhanced computation through caching. They proposed a content popularity-based Deep Q-Network (CP-DQN) algorithm to make caching decisions. Subsequently, CP-DQN was extended to the cache-computation coordination optimization algorithm (CCCA) to achieve the tradeoff of computing, caching, and communication (3C) resources. This work is highly beneficial for the practical communication process of semantic transmission as it reduces computational costs while improving the utilization of edge resources. Although our work does not focus on resource optimization in real transmission scenarios, the model architecture proposed by the authors provides valuable insights for optimizing our approach.

    2.2 Multi-Modal Information Processing

Research in the field of multi-modal information processing aims to investigate and enhance methods and techniques for effectively handling diverse types of information, such as text, images, and speech, captured from multiple sensors or data sources, and to extract and integrate meaningful information from these different modalities. In recent years, numerous scholars and research teams have extensively studied multi-modal information processing and proposed various methods and techniques to address its challenges.

Deep learning-based methods have shown great potential in multi-modal information processing. For instance, Xin et al. [33] presented a multi-modal fusion network that combines visual and textual information for image classification tasks, achieving improved accuracy compared to using individual modalities alone. In addition, they proposed an interpretable deep multi-modal fusion (DMFusion) framework based on deep canonical correlation analysis (CCA). The DMFusion framework employs CCA loss to leverage inter-modal correlations and optimizes multi-modal information in a low-dimensional latent fusion space through reconstruction loss and cross-entropy loss, effectively incorporating both within-modal structure and discriminative information. Similarly, Wu et al. [34] proposed a multi-modal attention network that captures the interactions between visual and textual modalities for image captioning, leading to more accurate and descriptive captions. Graph-based methods have also been widely explored in multi-modal information processing. In [35], the authors introduced a graph convolutional network for joint image-text representation learning, leveraging the graph structure to model the interactions between different modalities. This approach has demonstrated promising results in tasks such as image retrieval and cross-modal matching.

In addition, research on data fusion, feature extraction, modal alignment and cross-modal inference plays a crucial role in multi-modal information processing. These techniques aim to better utilize information from multiple modalities and achieve effective integration and interaction between different modalities.

In the context of multi-modal information processing, feature extraction involves extracting the most representative and meaningful feature representations from multi-modal data. Traditional methods include handcrafted feature extractors such as the scale-invariant feature transform (SIFT) and the histogram of oriented gradients (HOG). However, with the rise of deep learning, feature extraction methods based on deep neural networks have become mainstream. For example, Zuo et al. [36] proposed a deep multi-modal fusion network for medical image classification tasks. This network can extract discriminative feature representations simultaneously from different modalities of images, thereby improving classification performance.

Modal alignment tackles the challenge of mapping data from different modalities to a shared representation space, facilitating cross-modal information transfer and fusion. One challenge in this task is the heterogeneity between different modalities, such as structural differences between images and text. To address this issue, a common approach is to use GANs for modal alignment. For instance, Zhu et al. [37] proposed a CycleGAN-based method that achieves cross-modal translation from images to text. By learning the mapping relationship between two modalities, modal alignment and cross-modal information transfer can be achieved. In addition to CycleGAN, various other GAN networks can be used for modal alignment tasks. For example, the Unsupervised Image-to-Image Translation network enables unsupervised image translation, transforming images from one domain to another. The MUNIT (Multimodal Unsupervised Image-to-image Translation) network, on the other hand, is designed to handle multi-modal image translation tasks, allowing for mapping and conversion between multiple modalities. In [38], Esser et al. introduced the Vector Quantized Generative Adversarial Network (VQGAN), which combines the effectiveness of CNNs' positional biases with the expressiveness of transformers, enabling them to model and synthesize high-resolution images. Their approach is easily applicable to conditional synthesis tasks, where both non-spatial information (such as object class) and spatial information (such as segmentation) can control the generated images. Subsequently, the method of semantic-guided image generation using VQGAN+CLIP was proposed, which is also one of the important research foundations referenced in our proposed M3E-VSC.

Cross-modal inference, another important aspect of multi-modal information processing, aims to leverage multi-modal information for comprehensive and accurate reasoning and decision-making. It requires integrating information from different modalities and performing cross-modal associations and inferences. In [39], Wang et al. proposed a knowledge graph reasoning method based on multi-modal fusion, which combines information from images, text, and knowledge graphs to solve complex reasoning problems. This method improves the accuracy and robustness of inference by utilizing the complementarity between different modalities. In [40], Qin et al. proposed two transformer-based models named DeepSC-IR and DeepSC-MT, which perform image retrieval and machine translation tasks, respectively. Their groundbreaking contribution lies in the design of a novel hierarchical transformer that incorporates connections between each encoder layer and decoder layer to facilitate the fusion of multi-modal data, which greatly benefited our work.

Fine-grained image retrieval, as one of the tasks in both single-modal and cross-modal semantic communication, has received extensive attention in recent years. In [41], Ma et al. proposed a deep progressive asymmetric quantization (DPAQ) method based on causal intervention to learn compact and robust descriptions for fine-grained image retrieval tasks. Later, in [42], they further proposed a complementary part contrastive learning method for weakly supervised object co-localization in fine-grained images, representing similar parts of fine-grained objects with similar features in the feature space. They integrated local and contextual cues through self-supervised attention, channel attention, and spatial attention to improve the discriminability of image objects. Additionally, they introduced a cross-class object complementary part contrastive learning module, which pulls similar part features closer and pushes different part features apart to recognize extractive part regions with different semantic information, alleviating confusion biases caused by co-occurrence environments within specific categories.

For multi-modal scenarios, particularly in video transmission tasks, several noteworthy research efforts have been made. In [43], Zhou et al. proposed a clean and effective framework to generate controllable talking faces, focusing on the generation of speaking facial expressions. They manipulated unaligned original facial images using a single photograph as an identity reference. The key innovation was designing implicit low-dimensional pose codes to modularize audio-visual representations, enabling the accurate generation of synchronized talking faces with controllable poses that could be influenced by other videos.

In [44], Ji et al. introduced Emotion Video Portraits (EVP), a system for synthesizing high-quality video portraits with vivid emotional dynamics driven by audio. They introduced cross-reconstruction emotion disentanglement techniques, breaking down audio into two decoupled spaces: an emotion space independent of duration and a content space related to duration. Through these disentangled features, dynamic 2D emotional facial features could be inferred. Furthermore, in [45], Wang et al. designed a novel class of efficient deep joint source-channel encoding methods for end-to-end video transmission over wireless channels. This approach utilized nonlinear transformations and conditional coding architectures to adaptively extract semantic features across video frames and transmit them over wireless channels through deep joint source-channel encoding in the semantic feature domain. In [9], Jiang et al. proposed an Incremental Redundancy Hybrid Automatic Repeat reQuest (IR-HARQ) framework with a novel semantic error detector for video conferencing communication tasks. This framework aimed to enhance video communication quality by incorporating semantic error detection capabilities.

These recent studies exemplify advancements in data fusion, feature extraction, modal alignment, and cross-modal inference within the domain of multi-modal information processing. While many works are relevant to video transmission tasks in the multi-modal field, there remains a lack of research on video semantic transmission in natural scenes. This shortage primarily stems from the complexity of multi-modal information in natural scenes, presenting challenges in extracting meaningful semantics. By delving deeper into these techniques and innovating upon them, we aim to construct a semantic communication framework for video transmission tasks in natural scenes. This framework seeks to achieve accurate and comprehensive semantic information analysis and understanding of complex videos.

    3 System Model and Problem Formulation

    3.1 Semantic Communication Framework

As shown in Fig. 1, at the sender side, the video segments are first processed by extracting key frames. Using neural network recognition, which is represented by the “recognition” module in Fig. 1, the obtained image and audio modalities are analyzed, and a text modality is generated by incorporating them into a corpus. This generates feature vectors for the text, audio, and image modalities. Next, the system utilizes a multi-modal mutual enhancement network, referred to as the “M3E” module in Fig. 1, which utilizes VQGAN and set-difference processing to enhance the multi-modal semantic information. This module generates the textual representation of the complete video semantic information and the differential information between the audio and image modalities. Then, the three-modal feature vectors of each key frame undergo an attention mechanism network, which separately outputs multi-frame semantic feature vectors for the three modalities. These feature vectors, denoted as H_t for the audio modality, H_e for the text modality, and H_v for the image modality, exhibit high semantic continuity and modality correlation over time. Subsequently, the semantic encoding layer encodes these feature vectors using a BERT pre-training network. The encoded information is then transmitted through a physical channel, overcoming transmission obstacles such as distortion and noise. The channel decoding and semantic decoding modules subsequently decode the transmitted information, and the multi-modal information is restored using VQGAN networks. Finally, the multi-frame modal information is individually extracted and input into the audio frame processing module (AFP) and the visual frame processing module (VFP). These modules perform bit synchronization and inter-frame semantic error processing, resulting in the final reconstructed original audio and image frames and achieving video reconstruction.

Figure 1: The proposed multi-modal mutual enhancement video semantic communication system

Specifically, to facilitate the mutual enhancement of multi-modal information, it is essential to ensure the temporal consistency of modal information within a certain time domain. In this model, the range of selected video keyframes is determined by assessing the cross-correlation coefficients, which quantify the similarity between two images, represented as f_i and f_j, and can be calculated as

C(f_i, f_j) = Σ_{m,n} [f_i(m,n) − μ_i][f_j(m,n) − μ_j] / sqrt( Σ_{m,n} [f_i(m,n) − μ_i]^2 · Σ_{m,n} [f_j(m,n) − μ_j]^2 ),

where (m,n) represents the pixel coordinates, and μ_i and μ_j are respectively the mean values of the images f_i and f_j. The numerator calculates the sum of the pixel-wise products of the differences between each image's pixel values and its mean value. The denominator normalizes the result by the square root of the product of the sums of squared differences between the pixel values and their mean values for the two images.

To filter keyframes based on a similarity threshold, the selection rule can be written as

selected_frames = { f_i ∈ F : C(f_i, f_j) < threshold for all f_j ∈ F, j ≠ i },

where F represents the set of all frames in the video. The function C(f_i, f_j) measures the similarity between frames f_i and f_j as mentioned above. If the similarity between f_i and all other frames in F is below the specified threshold, f_i is treated as a selected keyframe and added to the set selected_frames. A code sketch of this rule is given below.
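
The following is a minimal sketch of this keyframe-selection rule, assuming grayscale frames stored as NumPy arrays and an illustrative threshold value; the paper does not specify its exact implementation, so the interface here is only for illustration.

```python
import numpy as np

def cross_correlation(fi: np.ndarray, fj: np.ndarray) -> float:
    """Normalized cross-correlation C(f_i, f_j) between two equally sized grayscale frames."""
    di = fi - fi.mean()
    dj = fj - fj.mean()
    denom = np.sqrt((di ** 2).sum() * (dj ** 2).sum())
    return float((di * dj).sum() / denom) if denom > 0 else 0.0

def select_keyframes(frames: list, threshold: float = 0.9) -> list:
    """Keep frame i as a keyframe when its similarity to every other frame
    stays below the threshold (the threshold value is illustrative)."""
    selected_frames = []
    for i, fi in enumerate(frames):
        if all(cross_correlation(fi, fj) < threshold
               for j, fj in enumerate(frames) if j != i):
            selected_frames.append(i)
    return selected_frames
```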

To achieve more precise modal conversion with text as the transmission backbone, the semantic information is extracted and processed from the image and audio data of each selected keyframe to be sent. Additionally, the key frame information is subjected to further processing to enable target recognition for the image and audio modalities. Because the target detection algorithms for the image and audio modalities rely on different databases, the detection results of the two modalities will inevitably exhibit slight discrepancies. Besides, the confidence levels for the same targets may also vary. Therefore, to comprehensively extract semantic information from video frames, hyperparameters are set to balance the textual semantics of the target objects and generate the final detection results. By incorporating text predicates using the BERT corpus, complete descriptive text representations are constructed as e_i = [o_1, o_2, ..., o_n], where o_i denotes the text semantic targets extracted from each frame modality. The key frame texts are combined to form a textual sequence represented as E = [e_1, e_2, ..., e_n].

Next, multi-modal information is extracted from the key frames of the video. The objective of this system is to simulate end-to-end transmission using text as the backbone. Therefore, how to efficiently leverage the interaction and error correction among different modalities, specifically image and audio, to improve the accuracy of information extraction is a problem that needs to be addressed. To accomplish this, a multi-modal mutual enhancement network is proposed to eliminate redundancy in the multi-modal information and enhance the complementarity among the modalities. The semantic information obtained from the multi-modal mutual enhancement network can be denoted as (T′, E′, V′),

where T′ represents the transformed output of the audio modality after undergoing the multi-modal mutual enhancement network. Similarly, E′ and V′ denote the transformed outputs for the text and image modalities, respectively.

However, extracting semantics from individual key frames of a video does not guarantee the quality of the final video reconstruction. Therefore, following the multi-modal mutual enhancement network, a multi-frame verification mechanism is employed to ensure the correlation between different frames. Specifically, a Recurrent Neural Network (RNN) is utilized as the foundational framework, integrated with attention networks and pooling operations. This mechanism verifies and normalizes the multi-modal information from multiple frames to achieve consistency in semantic information across video frames.

In this framework, the cross-entropy function is adopted as the loss function to measure the difference in multi-frame multi-modal information. Each frame has multiple modal features represented as X_i. For each modal feature, there are corresponding labels and reference information Y_i. The cross-entropy loss function can be defined as

L_CE = −(1/M) Σ_i Σ_j Y_ij log(P_ij),

where M is the number of modalities, Y_ij represents the label or reference information for the jth class of the ith modality, and P_ij is the predicted probability of the jth class for the ith modality.

The loss function calculates the difference between the predicted probabilities and the corresponding labels for each modality and averages the differences across all modalities. Minimizing the loss function encourages our model to learn more accurate representations and predictions for multi-frame multi-modal information.
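
As an illustration, a minimal PyTorch sketch of this per-modality cross-entropy average is given below; the tensor shapes and the use of torch.nn.functional.cross_entropy are assumptions for illustration rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def multimodal_cross_entropy(logits_per_modality, labels_per_modality):
    """Average the cross-entropy loss over the modalities of one frame.

    logits_per_modality: list of tensors, each of shape (num_classes,),
        one entry per modality (e.g., text, image, audio).
    labels_per_modality: list of integer class indices Y_i, one per modality.
    """
    losses = [
        F.cross_entropy(logits.unsqueeze(0), torch.tensor([label]))
        for logits, label in zip(logits_per_modality, labels_per_modality)
    ]
    # Averaging across modalities mirrors L_CE = -(1/M) sum_i sum_j Y_ij log(P_ij).
    return torch.stack(losses).mean()
```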

After verifying and correcting the multi-frame semantic information, the multi-modal semantic information of the obtained keyframes is processed into tensors, as shown in Fig. 1, and finally, a concatenated tensor is obtained, denoted as y. In the M3E-VSC framework, the transmission of semantic information is simulated using the established semantic communication framework. On the transmitting side, semantic encoders and channel encoders are designed to perform the transmission of semantic information in the physical channel, ensuring the integrity of semantics. In this model, the processed information x is input into the pre-trained BERT model, and the encoded semantic symbols X can be represented as

X = C_β(B_α(x)),

where B_α(·) represents the semantic encoder based on the pre-trained BERT model with parameter set α, and C_β(·) represents the channel encoder with parameter set β. For the collection of multiple frame texts and residual data x to be transmitted, the process of simulating transmission can be represented as

Y = h X + w,

where h represents the fading coefficient, a random variable whose probability density function (PDF) follows the Rayleigh distribution, and w represents the Additive White Gaussian Noise (AWGN). The PDF of the fading coefficient is given by

p(h) = (h/σ²) exp(−h²/(2σ²)), h ≥ 0,

where σ is the scale parameter of the fading channel, determining the fading severity. The fading coefficient h takes non-negative real values. This study simulates and controls the signal-to-noise ratio (SNR) by taking into account the dual effects of fading gain and fading loss in the fading channel. This allows us to examine the stability and performance of M3E-VSC in complex environments.
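
A minimal NumPy sketch of this channel simulation is shown below, assuming real-valued symbol vectors and a noise power derived from the target SNR; the paper does not state its exact simulation code, so the interface is illustrative.

```python
import numpy as np

def rayleigh_awgn_channel(x: np.ndarray, snr_db: float, sigma: float = 1.0) -> np.ndarray:
    """Simulate Y = h * X + w with a Rayleigh fading coefficient h and AWGN w."""
    rng = np.random.default_rng()
    h = rng.rayleigh(scale=sigma)                       # fading coefficient, h >= 0
    signal_power = np.mean((h * x) ** 2)                # power of the faded signal
    noise_power = signal_power / (10 ** (snr_db / 10))  # SNR = P_signal / P_noise
    w = rng.normal(0.0, np.sqrt(noise_power), size=x.shape)
    return h * x + w
```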

On the receiver side, a semantic decoder and a channel decoder are designed. The decoding process, which is the inverse of the encoding process, can be represented as

S = C_γ(B_δ(Y)),

where S represents the decoded information, C_γ(·) is the semantic decoder, B_δ(·) is the channel decoder, and γ and δ denote their respective parameters.

It is worth noting that our model replaces traditional bit symbols with semantic symbols as the main form of transmission. Therefore, semantic similarity is introduced to measure the quality of transmission. This measure can be expressed as

ξ = ( B(s) · B(ŝ) ) / ( ||B(s)|| · ||B(ŝ)|| ),

where ξ represents the semantic similarity, a continuous value ranging from 0 to 1, and B(s) and B(ŝ) denote the mapped representations of the original semantic symbol s and the decoded semantic symbol ŝ at the receiving end, respectively; ||B(s)|| and ||B(ŝ)|| represent the norms of B(s) and B(ŝ). For each frame of transmitted information, after decoding by the decoder, the minimum semantic similarity threshold ξ_min is set to 0.6. Frames with a similarity below ξ_min are flagged by the system and retransmitted to ensure the quality of semantic transmission.
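
A short sketch of this similarity check is given below, assuming the sentence embeddings B(s) and B(ŝ) are already available as vectors; the 0.6 threshold follows the text, while the embedding function itself is left abstract.

```python
import numpy as np

def semantic_similarity(emb_sent: np.ndarray, emb_recv: np.ndarray) -> float:
    """Cosine similarity xi between the embeddings of the transmitted and decoded sentences."""
    denom = np.linalg.norm(emb_sent) * np.linalg.norm(emb_recv)
    return float(emb_sent @ emb_recv / denom) if denom > 0 else 0.0

def needs_retransmission(emb_sent: np.ndarray, emb_recv: np.ndarray,
                         xi_min: float = 0.6) -> bool:
    """Flag a frame for retransmission when its similarity falls below xi_min."""
    return semantic_similarity(emb_sent, emb_recv) < xi_min
```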

When our model successfully decodes the received semantic symbols at the receiving end, generating high-quality videos accurately becomes our primary concern. By combining the multi-frame semantic detection mechanism and applying smoothing, the restoration of the video from frames of images and audio is successfully achieved. Throughout the entire process, the spatial and temporal characteristics of videos, along with the continuity and consistency of semantic information, play a vital role in supervising the generation of videos.

Accurately generating high-quality videos based on the decoded semantic symbols presents a significant challenge. To overcome this challenge, a multi-frame semantic detection mechanism is employed, which incorporates object detection and tracking algorithms to identify semantic objects in the video and extract their position and motion information. Specifically, assume that the images and audio corresponding to the key-frame semantic text have been generated using a VQGAN network, denoted as I_t for a total of T frames, where t represents the frame index. By utilizing object detection and tracking algorithms, the target positions and bounding box information for each frame are obtained and represented as B_t.

To achieve smoother transitions between frames and enhance the temporal continuity of the video, a smoothing technique is introduced. By leveraging the obtained smoothed semantic object positions and motion information, the information can be utilized for image restoration. A memory-based approach is employed in the multi-frame image semantic detection to ensure consistency of semantic representations within a specific range, resulting in a final generated video closely resembling a real video. Additionally, optical flow estimation is utilized to estimate motion vectors between frames, which are then used to generate intermediate frames through interpolation techniques. Assuming the goal is to restore the image I_t for the t-th frame, the interpolated frame Î_t can be calculated using the motion vector V_t obtained from optical flow estimation. This process can be expressed as follows:

Î_t = Interpolate(I_{t−1}, I_{t+1}, V_t),

where Interpolate(·) represents the interpolation function, which interpolates based on the motion vector and the images of adjacent frames to generate intermediate frames.
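
The sketch below illustrates one way to realize such interpolation with dense optical flow from OpenCV, warping the previous frame half a step along the estimated motion vectors; this is a coarse approximation that ignores occlusions and is not necessarily the interpolation method used in the paper.

```python
import cv2
import numpy as np

def interpolate_frame(prev_gray: np.ndarray, next_gray: np.ndarray) -> np.ndarray:
    """Generate an intermediate frame between two adjacent grayscale frames."""
    # Dense optical flow V_t from the previous frame to the next frame.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # Sample the previous frame half a motion step back: I_hat(p) ~ I_prev(p - 0.5 * V_t(p)).
    map_x = (grid_x - 0.5 * flow[..., 0]).astype(np.float32)
    map_y = (grid_y - 0.5 * flow[..., 1]).astype(np.float32)
    return cv2.remap(prev_gray, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```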

In the case of audio processing, a similar approach is employed as that of image frames. Filters are utilized to enhance audio quality, and interpolation techniques are employed to ensure temporal continuity. By applying filters, the audio signal can be improved by reducing noise or adjusting the frequency response, thereby enhancing its overall quality. Furthermore, interpolation techniques are used to maintain the temporal coherence of the audio frames. As a result, the complete video segment, encompassing both the visual and audio components, is successfully restored.

    3.2 Multi-Modal Mutual Enhancement Network

As shown in Fig. 2, the aim is to achieve precise information extraction from the determined keyframes by utilizing the interaction and error correction among different modalities. To achieve this, a multi-modal mutual enhancement network is proposed, which eliminates redundancy in multi-modal information and enhances the complementarity among modalities. Specifically, the network consists of three modules: an encoder, a Vector Quantized Generative Adversarial Network (VQGAN), and an information difference processing module. The audio, text description, and image modalities serve as inputs, and each is processed by a corresponding network (VGG, BERT, or Transformer) to obtain its discrete representation. These representations then pass through the encoder layer to generate the three-modal information. The network introduces an information difference algorithm to obtain the output, which includes the main text information as well as the audio difference information and the image difference information of key frame i.

During the network input processing, encoding is applied to both the image and audio modalities to reduce the dimensionality of their features while preserving their semantic information. This dimensionality reduction simplifies the subsequent information processing. Within our multi-modal semantic transmission framework, the interaction between the text, image, and audio modalities is intricate. Through the dimensionality reduction operation of the encoder, the semantic information of each modality can be extracted using pre-trained models. Specifically, the VGG network is employed for image processing, and the Transformer network is used for audio processing. Utilizing operations such as convolution, pooling, and sequential layers, self-encoding operations are performed on both the image and audio modalities.

Figure 2: The proposed multi-modal information processing network

In the mutual enhancement network, the primary approach for modality conversion is the application of VQGAN, enabling the transfer of semantic information during the conversion process. In the proposed overall framework, the model obtains descriptive textual information (E) by recognizing the audio and image modalities of video keyframes. This information is a comprehensive text generated by the pre-trained BERT model under the unified condition of multi-modal semantic correction. To map the descriptive text to specific feature subspaces, the text is embedded into a VQGAN network, leveraging pre-trained models to accomplish the mapping from text to the audio and image modalities.

Specifically, two separate VQGAN networks are independently trained: one for mapping text to audio and another for mapping text to image. Firstly, a VQGAN network is trained for the text-to-audio modality conversion task. The generator takes an input text description and generates the corresponding audio waveform, while the discriminator aims to distinguish between the generated audio waveform and the real audio waveform. Secondly, for the text-to-image modality conversion task, another independent VQGAN network is trained. The generator in this network receives a text description as input and generates the corresponding image, while the discriminator differentiates between the generated images and real images. Through adversarial training of the generator and discriminator, the generator gradually learns to generate realistic audio and images that closely resemble the real ones. The optimization problem for each of the two VQGAN networks can be represented as follows:

min_G max_D L(D, G),

where D refers to the discriminator and G represents the generator. The loss function L(D, G) can be expressed as

L(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))],

where x represents the real samples drawn from the true data distribution p_data(x), z represents the noise samples drawn from a prior distribution p_z(z), G(z) denotes the generated samples from the generator, and D(x) represents the output of the discriminator for real samples.
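
For illustration, a minimal PyTorch sketch of one adversarial update under this min-max objective is given below; the full VQGAN objective also includes reconstruction, perceptual, and codebook terms, which are omitted here, and the module interfaces are assumptions.

```python
import torch
import torch.nn.functional as F

def adversarial_step(generator, discriminator, real_batch, cond, opt_g, opt_d):
    """One update of D and G for the conditional text-to-audio/image GAN loss above.

    The discriminator is assumed to output probabilities in (0, 1), and `cond`
    stands for the encoded text description that conditions the generator.
    """
    # Discriminator: maximize log D(x) + log(1 - D(G(z))).
    opt_d.zero_grad()
    fake = generator(cond).detach()
    d_real, d_fake = discriminator(real_batch), discriminator(fake)
    loss_d = (F.binary_cross_entropy(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
    loss_d.backward()
    opt_d.step()

    # Generator: fool the discriminator (non-saturating form of the same objective).
    opt_g.zero_grad()
    d_fake = discriminator(generator(cond))
    loss_g = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```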

In our proposed approach, modality conversion relies on the information description provided by the text, which may result in information loss or redundancy during processing. Therefore, in the third part of our multi-modal mutual enhancement network, an information difference processing module is included, shown in Fig. 2 as the “diff process” module, to automatically learn and process the difference between the vectors generated by the VQGAN network and the original modalities. Following the principle of information processing, a dynamic network is also designed to perform the differential processing of cross-modal features and real modality features, aiming to obtain information that cannot be reflected in modality conversion or to remove redundant information.

As illustrated in Fig. 2, the true modality information of the audio and image is denoted as T and V, respectively, while E represents the descriptive text. The difference set information obtained after the information difference processing is denoted as T′ and V′. The processing procedure can be summarized as

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V,

where Q represents the semantic information of the images generated by VQGAN, K represents the semantic information of the original audio and images, and V represents the value vector. The dimension of the key vector is denoted by d_k. The equation performs a dot product between the query vector and the key vector and scales the result by the scaling factor. It then applies the softmax function to compute the attention weights. Finally, the attention weights are multiplied by the value vector to obtain the attention-weighted result.
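
A generic PyTorch sketch of this scaled dot-product attention is shown below; it implements the formula above and leaves how Q, K, and V are produced from the VQGAN outputs and the original modalities unspecified.

```python
import math
import torch

def scaled_dot_product_attention(q: torch.Tensor, k: torch.Tensor,
                                 v: torch.Tensor) -> torch.Tensor:
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # scaled similarity scores
    weights = torch.softmax(scores, dim=-1)            # attention weights
    return weights @ v                                 # attention-weighted values
```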

Through the application of semantic difference processing and the attention mechanism, our multi-modal mutual enhancement network can extract and modulate semantic differences in information across the text, image, and audio modalities, enabling more effective modality conversion and semantic processing. Ultimately, the network outputs a sentence vector combining the foundational textual semantics with the supplemented semantic information obtained through differential processing. This output is used to complete the intended task.

    3.3 Semantic Encoder and Decoder Design

The semantic encoding and decoding module is of paramount importance when designing a semantic communication system that uses text as the transmission basis. The descriptive text obtained after processing with the multi-modal mutual enhancement network first undergoes tokenization. Tokenization splits the text into individual sub-word units, such as WordPiece or SentencePiece tokens, which serve as the basic units for the BERT model. After tokenization, special tokens are added at the beginning of and between sentences in the tokenized sequence. The special token “[CLS]” (standing for classification) is added at the beginning of the sequence, while the special token “[SEP]” is inserted between each pair of sentences. Assuming the descriptive text is represented as X = x_1, x_2, ..., x_n, the tokenized sequence can be denoted as

X_token = [[CLS], x_1, x_2, ..., x_n, [SEP]].

This tokenization process breaks down the text into smaller meaningful units, enabling the BERT model to process and understand the semantics at a more granular level. Then, position embeddings are added to each word vector to retain the positional information of words within the sentence. After tokenization and position embedding, the processed text is input into the BERT pre-trained model.
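
As a concrete example, the sketch below tokenizes a descriptive sentence with the HuggingFace transformers library and the bert-base-uncased checkpoint; the paper does not name a specific library or checkpoint, so both are assumptions for illustration.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "a plane is flying in the sky"          # descriptive text for one keyframe
encoding = tokenizer(text)                     # adds [CLS] and [SEP] automatically
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# ['[CLS]', 'a', 'plane', 'is', 'flying', 'in', 'the', 'sky', '[SEP]']
```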

In the encoding process, the self-attention mechanism is utilized in each encoder layer to capture the relationships and dependencies among different words in the input sequence. This mechanism allows the model to focus on relevant words and their interactions, enabling it to build contextual representations. To achieve this, the self-attention mechanism calculates attention weights for each token in the input sequence. These attention weights determine the importance of each word with respect to the others in the sequence. The mechanism is called “self-attention” because it computes the weights based on the similarity of each word to all other words in the sequence.

The self-attention mechanism in BERT is specifically referred to as “multi-head attention,” which means that BERT utilizes multiple parallel self-attention heads, each capturing different aspects of contextual information and relationships. The multiple heads allow the model to attend to various patterns and dependencies simultaneously, enhancing its ability to understand complex contexts. For a given input sequence X = x_1, x_2, ..., x_n, the output of the self-attention mechanism can be represented as H = h_1, h_2, ..., h_n, where h_i represents the representation of the ith token after incorporating information from all other tokens in the sequence. Each h_i carries contextual information from the entire input sequence, thanks to the ability of the self-attention mechanism to effectively gather information from all words in the sequence based on their relevance to each other.

After the self-attention mechanism, each word vector is further transformed through a feed-forward neural network to introduce non-linearity. Specifically, for the input sequence X, the output of the feed-forward neural network can be represented as

FFN(x) = W_2 ReLU(W_1 x + b_1) + b_2,

where ReLU is the rectified linear unit activation function, W_1 and W_2 are weight matrices, and b_1 and b_2 are bias vectors. This feed-forward transformation enables the model to capture more complex relationships and dependencies between words in the sequence.
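
A minimal PyTorch module implementing this position-wise feed-forward transformation is shown below; the 768/3072 dimensions correspond to BERT-base and are used here only as illustrative defaults.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """FFN(x) = W_2 ReLU(W_1 x + b_1) + b_2, applied to each token position."""
    def __init__(self, d_model: int = 768, d_hidden: int = 3072):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),  # W_1 x + b_1
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),  # W_2 (.) + b_2
        )

    def forward(self, x):
        return self.net(x)
```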

Afterward, a pooling operation is performed on the text data, using the output of the “[CLS]” token as the representation of the entire text. This representation is considered a summarized and compressed semantic vector representation of the input sequence. Mathematically, it can be expressed as

z = Pool(BERT Output[[CLS]]),

where BERT Output[[CLS]] refers to the output of the BERT model corresponding to the “[CLS]” token, and Pool(·) denotes the pooling operation, which aggregates the information from the entire sequence into a single vector, capturing the overall semantic meaning of the text to be transmitted.
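
The sketch below shows how such a sentence-level vector can be taken from a pre-trained BERT model, again assuming the HuggingFace transformers library and the bert-base-uncased checkpoint; in practice the pooling may be more elaborate than simply taking the [CLS] hidden state.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("a plane is flying in the sky", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The hidden state of the "[CLS]" token serves as the pooled semantic vector z.
z = outputs.last_hidden_state[:, 0, :]   # shape: (1, 768) for bert-base
```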

The semantic decoding process involves designing a reverse Transformer decoder to convert the semantic vector into a text sequence. By generating the next word iteratively, the complete text sequence is gradually constructed. In each step of word generation, the decoder model utilizes its previous output and the historical context to predict the next word. This can be represented by the function

y_i = Decoder(y_1, y_2, ..., y_{i−1}, z),

where y_i represents the ith generated word, y_1, ..., y_{i−1} denotes the previously generated prefix sequence, and z represents the semantic vector.
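
A greedy-decoding sketch of this iterative generation is given below; the decoder interface (returning next-token logits from the prefix and the semantic vector z) and the special-token IDs are hypothetical placeholders.

```python
import torch

def greedy_decode(decoder, z, bos_id: int, eos_id: int, max_len: int = 32):
    """Generate y_1, ..., y_m one token at a time from the semantic vector z."""
    prefix = [bos_id]
    for _ in range(max_len):
        logits = decoder(torch.tensor([prefix]), z)    # hypothetical interface: next-token logits
        next_id = int(logits[0, -1].argmax())          # greedy choice of y_i
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix
```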

    4 Experimental Results

    4.1 Simulation Settings

Due to the utilization of multiple neural networks in our communication framework, each neural network is specialized for specific tasks and operates on distinct datasets. For the audio and image recognition modalities at the receiver end, we train them on the FSD50K and MSCOCO datasets, respectively. Audio segments are uniformly sampled at a rate of 16 kHz. Textual descriptions undergo processing using a pre-trained BERT model, incorporating predicate connection labels to create complete sentences. For instance, the label “plane, fly, sky” is transformed into the sentence “a plane is flying in the sky.” In the evaluation of audio modality tasks, separate simulation experiments are conducted to compare the generative model against traditional compression methods such as AAC, FLAC, PCM, and MP3, especially in complex channel environments. For the proposed multi-modal mutual enhancement network, we perform classification filtering on the VGG-Sound audio-visual dataset. It consists of 34 different categories of objects, and the test set and experimental set are split in a 2:8 ratio. Each video sample is defined with a resolution of 720×480, a length of 10 seconds, and a target frame rate of 30 fps. The learning rate is set to 0.01.

    4.2 Simulation Results

In the simulation experiments, various evaluation metrics are set to assess the performance of the model from multiple aspects. These metrics primarily include the transmitted bit rate, text recovery effectiveness (semantic similarity metric), image recovery structural similarity index (SSIM) metric, image object detection accuracy, and audio recovery timbre similarity.

Fig. 3 showcases the videos restored by M3E-VSC at the receiver end for different video tasks, with detection categories being zebra, snare drum, cat, and train. Each transmission task is achieved through the joint processing of audio and image information using the multi-modal mutual enhancement network.

Figure 3: Presentation of the multi-modal mutual enhancement video semantic communication system

In the following sections, a comparison is made between our model and some traditional video and image transmission methods. In particular, the primary comparison scheme uses an H.265 video compression encoder [46] for source coding, employs low-density parity-check (LDPC) codes for channel coding, and utilizes quadrature amplitude modulation (QAM). Additionally, we also take into account different LDPC encoding modes at varying bit rates, providing a comprehensive comparison of metrics.

In Fig. 4, the graph displays the bit sizes required for each GoP (Group of Pictures) in various video transmission schemes under AWGN channel conditions. The Signal-to-Noise Ratio (SNR) is employed to assess changes in channel transmission conditions. This comparison specifies that each scheme operates at a maximum efficient frame rate of 30 fps with a fixed-quality video, resulting in a noticeable decline in the required bit size as the SNR increases. Notably, the model presented here, which utilizes multi-modal mutual enhancement for semantic processing rather than pixel-level compression, substantially reduces the necessary bit size during transmission compared to traditional approaches. This underscores that our communication model can allocate more substantial transmission bandwidth for video communication, thereby significantly enhancing communication efficiency.

Figure 4: Comparison of bit rates for single-frame image compression and transmission between M3E-VSC and other compression methods in AWGN channel

Fig. 5 examines the transmission performance of our proposed semantic model, comparing it with several traditional text compression methods as baseline models, including Huffman, 5-bit, and Brotli coding. The experimental results show that, compared to these traditional compression models, the semantic transmission model exhibits approximately 30% to 40% performance improvement in restoring both the text and its additional information, especially under low signal-to-noise ratio conditions. This highlights the robust anti-interference capability and stability of the semantic encoding and decoding approach.

Fig. 6 investigates the restoration performance of our proposed model on single-frame video images, focusing on the requirements imposed on our generation model in complex transmission environments. The experiment includes several baseline schemes, including H.265+LDPC 3/4 QPSK, H.265+LDPC 3/4 16 QAM, and JPEG+16 QAM. The experimental results demonstrate that our generation model exhibits significant improvement in the restoration of image frames compared to other schemes, particularly in low signal-to-noise ratio transmission environments. This improvement can be attributed to the reception of relatively complete textual semantic information by our model. As the signal-to-noise ratio increases, the advantage of our proposed model gradually diminishes. However, the SSIM metric remains relatively stable, indicating the consistent performance of our approach.

Figure 5: Comparison of similarity in recovering transmitted text at the receiver between M3E-VSC and other conventional text compression methods in AWGN channel

Figure 6: Comparison of SSIM index between M3E-VSC and other image compression and transmission methods for recovered images at the receiver in AWGN channel

Fig. 7 evaluates how well semantic information is preserved in the generated video frames, measured by object detection accuracy. The baseline schemes H.265 + LDPC 3/4 QPSK and H.265 + LDPC 3/4 16-QAM represent traditional image compression and transmission methods, while DeepSC:text and DeepSC:image represent cases where only textual semantic information or only image semantic information is transmitted, respectively, using the classical semantic transmission model DeepSC. The simulation results demonstrate that our communication model outperforms the other schemes in image object detection within the examined signal-to-noise ratio range. This advantage is primarily attributed to the error-correcting capability of the multi-modal mutual enhancement network in our proposed model, which preserves semantic information effectively.

Figure 7: Comparison of object detection accuracy scores between M3E-VSC and traditional image compression methods for recovered images at the receiver in the AWGN channel. Here, DeepSC:text and DeepSC:image represent conventional semantic transmission methods that solely transmit text information and image information, respectively

Table 1 presents the results of the ablation experiment. We evaluate the models based on the object detection scores of the transmitted video image frames. The YOLOv6 model is used for the detection task, and multiple ablation models are compared with the traditional approach. The evaluated models are as follows:

• Ours: our proposed multi-modal mutual enhancement semantic communication model, which incorporates multiple modalities to improve video semantic communication performance.

• Ours (without M3E): our proposed semantic communication model with the M3E component removed, serving as a comparison to assess the impact of the M3E scheme.

• Ours (only text): our proposed semantic communication model without the semantic-based disjoint network. In this model, only the text feature vectors are semantically encoded at the transmitter, with neither the multi-modal semantic processing tasks nor the M3E component involved. It serves as a comparison to assess the impact of the semantic-based disjoint network on our overall approach.

• Traditional communications: conventional, non-semantic communication methods, specifically the results of H.265 video compression and transmission techniques.

The table reports evaluation metrics for the classes Car, Dog, Zebra, Airplane, Train, Umbrella, and Chair, as well as the overall class average. For each class, we run simulations on 30 different video clips at a signal-to-noise ratio of 8 dB. The goal is to study the accuracy of object detection in key-frame images for the video transmission task and to assess the importance of the proposed semantic-based disjoint network and multi-modal mutual enhancement network in the overall system.
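The exact scoring rule behind these detection scores is not spelled out above, so the sketch below assumes a simple per-clip interpretation: the score is the fraction of recovered key frames in which the detector (e.g., YOLOv6) still reports the target class. The helper function and the toy labels are hypothetical.

```python
from typing import List, Set

def detection_accuracy(target: str, frame_detections: List[Set[str]]) -> float:
    """Fraction of recovered key frames whose detector output still contains the target class."""
    if not frame_detections:
        return 0.0
    hits = sum(1 for labels in frame_detections if target in labels)
    return hits / len(frame_detections)

# Toy example: four recovered key frames of a "zebra" clip, one of which loses the object
frames = [{"zebra", "grass"}, {"zebra"}, {"grass"}, {"zebra", "tree"}]
print(detection_accuracy("zebra", frames))  # 0.75
```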

The results indicate that our complete model outperforms the two ablation models and significantly surpasses the traditional H.265 scheme in terms of image semantic transmission, particularly under low signal-to-noise ratio conditions. The comparison among the three ablation scenarios highlights the substantial contribution of the semantic-based disjoint network and the multi-modal mutual enhancement network to preserving image reconstruction quality and enriching semantic information. It is worth noting that, for certain classes such as Car, Dog, and Zebra, there is a noticeable gap between the "Ours (without M3E)" model and the complete model. This divergence arises because these classes have pronounced audio-modality features and therefore benefit greatly from the semantic optimization performed by the multi-modal mutual enhancement network. Furthermore, the relatively poorer image reconstruction of the "Ours (only text)" model, compared with the first two scenarios, can be attributed to the absence of the image semantic representation module at the transmitter, which causes the loss of important image feature information.

Fig. 8 investigates the audio restoration performance of our approach. The model primarily focuses on transmitting audio semantics and does not specifically emphasize fine-grained audio restoration; the main goal is to reproduce similar semantic content in the audio. To assess this, a timbre similarity metric is used to determine whether the recovered audio retains similar semantic information. Several baseline schemes are considered, including traditional audio compression methods such as AAC, FLAC, PCM, and MP3. The simulation results indicate that our approach exhibits significant performance advantages over the baselines in low signal-to-noise ratio transmission environments, yielding better audio restoration.

Figure 8: Comparison of timbre accuracy scores between M3E-VSC and traditional audio compression methods for recovered audio at the receiver in the AWGN channel
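The precise timbre-similarity measure is not specified here, so the sketch below uses a common proxy, the cosine similarity between mean MFCC vectors extracted with librosa, to compare the source audio with the audio regenerated at the receiver. The file names are placeholders.

```python
import numpy as np
import librosa  # assumed available; the exact metric used in the paper may differ

def timbre_similarity(path_a: str, path_b: str, sr: int = 16000) -> float:
    """Cosine similarity of the mean MFCC vectors of two audio files (rough timbre proxy)."""
    feats = []
    for path in (path_a, path_b):
        y, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        feats.append(mfcc.mean(axis=1))
    a, b = feats
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Hypothetical file names for the transmitted and recovered audio
# print(timbre_similarity("source_clip.wav", "recovered_clip.wav"))
```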

    5 Conclusion

In this paper, we proposed a multi-modal mutual enhancement video semantic communication system called M3E-VSC, which leveraged computer vision and natural language processing techniques to achieve more effective and efficient transmission and understanding of semantic information in videos. Our framework also incorporated advanced machine learning algorithms, such as deep neural networks, to facilitate the fusion and integration of multi-modal information. By jointly optimizing the encoding, decoding, and video restoration processes, we enhanced the overall efficiency and effectiveness of video semantic communication. Simulation results demonstrated that our multi-modal mutual enhancement approach offered a significant advantage in transmission volume while maintaining performance in semantic content transmission and comprehension. By leveraging the complementary nature of the different modalities, our framework achieved higher semantic accuracy and robustness in conveying video content.

Acknowledgement: We would like to thank the anonymous reviewers for their constructive criticism and recommendations. We also thank the journal, CMES, for supporting the publication of this Special Issue. The authors are grateful for the support of the National Key Research and Development Project and the National Natural Science Foundation of China.

Funding Statement: This work was supported by the National Key Research and Development Project under Grant 2020YFB1807602, the Key Program of the Marine Economy Development Special Foundation of the Department of Natural Resources of Guangdong Province (GDNRC[2023]24), and the National Natural Science Foundation of China under Grant 62271267.

Author Contributions: The authors confirm their contributions to the paper as follows: writing the main manuscript text and performing the experiments: Yuanle Chen; conducting the research and designing the basic system framework: Haobo Wang, Chunyu Liu; collecting data and assisting in the deep learning training process: Linyi Wang; assisting in the analysis and interpretation of results: Jiaxin Liu; providing guidance and suggestions on the method of this article, as well as reviewing and editing the manuscript: Wei Wu. All authors reviewed the results and approved the final version of the manuscript.

Availability of Data and Materials: All the research literature reviewed and data used in this paper consist of publicly available scholarly articles, conference proceedings, books, and reports. The references and citations are contained in the reference list of this manuscript and can be accessed through online databases, academic libraries, or by contacting the respective publishers.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
