
    HLR-Net: A Hybrid Lip-Reading Model Based on Deep Convolutional Neural Networks

    Computers, Materials & Continua, 2021, Issue 8

    Amany M. Sarhan1, Nada M. Elshennawy1 and Dina M. Ibrahim1,2,*

    1Department of Computers and Control Engineering, Faculty of Engineering, Tanta University, Tanta, 37133, Egypt

    2Department of Information Technology, College of Computer, Qassim University, Buraydah, 51452, Saudi Arabia

    Abstract: Lip reading is typically regarded as visually interpreting the speaker's lip movements during speech; it is the task of decoding text from the speaker's mouth movements. This paper proposes a lip-reading model that helps deaf people and persons with hearing problems to understand a speaker by capturing a video of the speaker and feeding it into the proposed model to obtain the corresponding subtitles. Deep learning technologies make it easier to extract a large number of different features, which can then be converted into probabilities of letters to obtain accurate results. Recently proposed methods for lip reading are based on sequence-to-sequence architectures that were designed for natural machine translation and audio speech recognition. In this paper, a deep convolutional neural network model called the hybrid lip-reading (HLR-Net) model is developed for lip reading from video. The proposed model includes three stages, namely, the pre-processing, encoder, and decoder stages, which produce the output subtitle. The inception, gradient, and bidirectional GRU layers are used to build the encoder, and the attention, fully-connected, and activation function layers are used to build the decoder, which performs connectionist temporal classification (CTC). In comparison with three recent models, namely, the LipNet model, the lip-reading model with cascaded attention (LCANet), and the attention-CTC (A-CTC) model, on the GRID corpus dataset, the proposed HLR-Net model achieves significant improvements: a CER of 4.9%, a WER of 9.7%, and a Bleu score of 92% in the case of unseen speakers, and a CER of 1.4%, a WER of 3.3%, and a Bleu score of 99% in the case of overlapped speakers.

    Keywords: lip-reading; visual speech recognition; deep neural network; connectionist temporal classification

    1 Introduction

    Lip reading can be defined as the ability to understand what people are saying from their visible lip movements. Lip reading is a difficult task for humans because the lip movements corresponding to different letters are visually very similar (e.g., b and p, or d and t) [1,2]. Automatic lip reading is currently used in many applications, either as a standalone application or as a supplementary one. It has been considered a human-computer interaction approach that has recently been trending in the literature. One of the well-known applications of lip reading is as an assistive tool for deaf persons, transforming speech in the form of a video into subtitles. It can also be used as a supplementary means in noisy or virtual reality (VR) environments by completing sentences left unheard because of noise. In addition, it can greatly enhance the realism of immersive VR [3]. Automatic lip reading has also been used in many other applications, including security, speech recognition, and assisted driving systems. Automatic lip reading involves several tasks from different fields, including image processing, pattern recognition, computer vision, and natural language understanding.

    Machine learning (ML) is a class of models that allows software applications to become more accurate in predicting outcomes without being explicitly programmed [4,5]. The basic premise of machine learning is to build algorithms that can predict output data from input data using statistical analysis and update the outputs when new input data become available. Deep learning is a machine learning method that uses an input X to predict an output Y. For instance, given the stock prices of the past week, a deep learning algorithm can predict the stock price of the next day [6]. Given a large number of input-output data pairs, a deep learning algorithm aims to minimize the difference between the predicted and expected outputs by learning the relationship between the input and output data, which enables the deep learning model to generalize and produce accurate outputs for previously unseen inputs.

    Lip reading is considered a difficult task for both humans and machines because of the high similarity of the lip movements corresponding to uttered letters (e.g., the letters b and p, or d and t). In addition, lip size, orientation, wrinkles around the mouth, and brightness also affect the quality of the detected words. These problems can be addressed by extracting spatio-temporal features from a video and then mapping them to the corresponding language symbols, which is a nontrivial learning task [7]. Because of this difficulty, machine learning approaches have been proposed.

    Recently, deep learning approaches have been applied to lip reading [7-9], which has resulted in the fast development of the speech recognition field. In general, these approaches first extract the mouth area from a video of interest and then feed the extraction result to the input of a deep learning-based model. This model is commonly trained such that both feature extraction and word identification are performed automatically.

    Currently, most existing lip-reading methods are based on sequence-to-sequence architectures designed for applications such as natural language translation and audio speech recognition. The most common lip-reading methods are the LipNet model [7], the lip-reading model with cascaded attention-CTC (LCANet) [10], which is also known as the attention high network-CTC (AH-CTC), and the attention-CTC (A-CTC) model [6]. However, lip-reading performance requires further improvement, which can be achieved by using more robust deep learning models and utilizing available datasets. The ambiguity of the translation from videos to words makes lip reading a challenging problem that has not yet been solved. To address these problems, in this paper, a video-based sentence-to-sentence lip-reading model is developed using a deep convolutional neural network, denoted as the hybrid lip-reading network (HLR-Net). The proposed model consists of three stages: the pre-processing, encoder, and decoder stages. The pre-processing stage is responsible for extracting the mouth movements from the video frames, frame normalization, and sentence preparation. The encoder is built of inception, gradient, and bidirectional GRU layers, while the decoder consists of attention, fully-connected, and activation function layers, and connectionist temporal classification (CTC). The output of the proposed model is a subtitle of the input video provided in the form of a sentence.

    The rest of this paper is organized as follows. Section 2 summarizes recent lip-reading related work categorized from different perspectives. Section 3 describes the proposed model. Section 4 presents and discusses the experimental results. Finally, Section 5 draws conclusions and introduces ongoing future work.

    2 Related Work

    Introducing artificial intelligence to lip reading can greatly help deaf people by providing an automated method to understand video data presented to them. Millions of people around the world suffer from some degree of hearing deficiency, and developing a suitable lip-reading model can help them to understand other people's speech and thus allow them to participate in conversations, keeping them connected to the real world. However, developing such a model is challenging for both designers and researchers. These models should be well designed, refined, and integrated into smart devices so that they are widely available to all people in need of speech-understanding assistance.

    Generally speaking, speech recognition can be conducted at the letter, word, sentence, digit, or phrase level. Also, it can be based on a video with or without voice. Some of the recent studies on word-level lip reading have focused on speaker-independent lip reading by adapting a system using the Speaker Adaptive Training (SAT) technique, which was originally used in the speech recognition field [3]. The feature dimension was reduced to 40 using linear discriminant analysis (LDA), and then the features were decorrelated using the maximum likelihood linear transform (MLLT). Namely, the 40-dimensional speaker-adapted features were first spliced across a window of nine frames, and then the LDA was applied to decorrelate the concatenated features and reduce the dimensionality to 25. Next, the obtained features were fed to the input of a context-dependent deep neural network (CD-DNN). In this way, the error rate of speaker-independent lip reading was significantly reduced. Furthermore, it has been shown that the error can be reduced even further by using additional deep neural networks (DNNs). It has also been shown that there is no need to transform phonemes to visemes to apply context-dependent visual speech transcription.

    In [4], a method for automatically collecting and processing very large-scale visual speech recognition data using British television broadcasts was proposed. The method could process thousands of hours of spoken text, converting it into data with an extensive vocabulary of thousands of different words. To validate the method, the VGG-M, 3D convolution with early fusion, 3D convolution with multiple towers, multiple towers, and early fusion models were used. The input image size was chosen to be 112 × 112 pixels. The multiple towers and early fusion models achieved the best accuracy among those models when tested on 500 and 333 classes. A learning architecture for word-level visual speech recognition was presented in [8]. This model combined spatiotemporal convolutional, residual (ResNet), and bidirectional LSTM networks. The ResNet building blocks were composed of two convolutional layers with BN and ReLU activation functions, while the skip connections facilitated information propagation in the max-pooling layers. This model ignored irrelevant parts of an utterance and could detect target words without knowledge of word boundaries. The database entries were built fully automatically, and the words in the subtitles were identified using the optical character recognition (OCR) technique and synchronized with the audio data. The model incorporated data augmentation, such as random cropping and horizontal flips, during training. The proposed model achieved a word recognition accuracy of 83%.

    In [10], three types of visual features were studied from the image-based and model-based aspects for a professional lip-reading task. These features included the lip ROI, a lip-shape geometrical representation, and deep bottleneck features (DBNFs). A six-layer deep auto-encoder neural network (DANN) was used to extract the three mentioned features. These features were then used in two lip-reading systems: the conventional GMM-HMM system and the DNN-HMM hybrid system. Based on the reported results, the DBNF system achieved an average relative improvement of 15.4% compared to the shape features system, while the shape features system achieved an average relative improvement of 20.4% compared to the ROI features system when applied to test data.

    In [11], the authors targeted the lip-reading problem using only video data and considered variable-length sequences of frames of words or phrases. They designed a twelve-layer convolutional neural network (CNN) with two batch-normalization layers to train the model and to extract the visual features end-to-end. The aim of using batch normalization was to decrease the internal and external variances in the features that could affect speech-recognition performance, such as the speaker's accent, the lighting and quality of the image frames, the pace of the speaker, and the speaking posture. To avoid the problem of variable speaking speed among different speakers, a concatenated lip image was created by extending the sequence to a fixed length. The MIRACL-VC1 dataset was used to evaluate the system, and a 96% training accuracy and a 52.9% validation accuracy were achieved.

    The performances of speaker-dependent and speaker-independent lip-reading models based on CNNs, such as AlexNet, VGG, HANN, and Inception V3, have been studied in [12-15]. The main ideas and findings of previous research on lip reading supported by AI methods, the type of dataset used, and the achieved accuracy values are given in Tab. 1. As shown in Tab. 1, speech-recognition performance was evaluated using only one metric, namely the accuracy.

    Table 1: Comparison of earlier work based on accuracy as a performance metric


    Various datasets have been used in the development of lip-reading methods, including LRS, LRS2, MV-LRS, BBC, and CMLR, as summarized in Tab. 2. In [9], an aligned training corpus containing profile faces, called MV-LRS, was constructed by applying a multi-stage strategy to the LRS dataset. In [17], lip reading was considered an open-world problem containing unrestrained natural-language sentences. Their model was trained on the LRS dataset, and it achieved better performance than other compared methods on several datasets. In [18], another public dataset called LRS2-BBC was introduced, containing thousands of natural sentences acquired from British television. Some of the transformer models [12] obtained the best result of 50% when decoded with a language model, an improvement of over 20% compared to the previous result of 70.4% obtained by the state-of-the-art models. A lip-by-speech (LIBS) model was proposed in [19], which supported lip reading with speech recognition. The extracted features could provide complementary and discriminative features that could be assigned to the lip movements. In [20,21], the authors simplified the training procedure, which allowed training the model in a single stage. A variable-length augmentation approach was used to generalize the models to the variations found in sequence length.

    One of the most commonly used datasets in word-level lip-reading models is the GRID corpus dataset [16]. In [6], the authors presented the LipNet model, which mapped a variable-length sequence of video frames to text at the sentence level and used spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss. The input, which was a sequence of frames, was first passed to three layers of the spatiotemporal CNN (STCNN) and then to a spatial max-pooling layer. The extracted features were up-sampled and processed by a bidirectional long short-term memory (Bi-LSTM). The LSTM output was passed to a two-layer feedforward network and a SoftMax layer. The model training was performed using the CTC on the GRID corpus dataset, from which the videos for speaker 21 were omitted because they were missing; also, all empty and corrupted videos were removed. LipNet achieved a CER, WER, and accuracy of 2.4%, 6.6%, and 93.4%, respectively, when trained and tested on the GRID corpus dataset, which is a sentence-level dataset. This model has the advantage that it does not require alignment.

    Furthermore, in [8], a DNN model was built using feedforward and LSTM networks. The training of the model was performed using the error gradient backpropagation algorithm.

    The GRID corpus dataset was used for training and testing the model, and 19 speakers with 51 different words were chosen for classification. Frame-level alignments provided word-level segmentation of a video, producing a training dataset that consisted of 6×1000=6000 single words per speaker. A 40×40-pixel window containing the mouth area was detected in each video frame using a Gaussian function, a threshold value for determining the center, and a scaling parameter. The experiments were, however, speaker-dependent, and the experimental data were randomly divided into training, validation, and test sets, while the classifiers were always trained and tested on the same speaker. The results were averaged over the 19 speakers, and a word-recognition accuracy of 79.6% was achieved.

    Table 2: Performance comparison of different AI methods on different lip-reading datasets

    In [16], the authors presented a lip-reading deep neural network that utilized the asynchronous spiking outputs of a dynamic vision sensor (DVS) and a dynamic audio sensor (DAS). The event-based features produced from the spikes of the DVS and DAS were used for the classification process. The GRID visual-audio lip-reading dataset was used for model testing. Networks were trained both separately and jointly on the two modalities. It was concluded that the single-modality networks trained separately on DAS and DVS spike frames achieved lower accuracy than the single-modality networks trained on the audio MFCC features and the video frames. The recurrent neural network (RNN) using the MFCC features achieved an accuracy of approximately 98.41%. In addition, the audio inputs yielded better performance than the corresponding video inputs, achieving an accuracy of 84.27%, which was expected because the audio was more informative than the lip movement for this task. When the jointly trained network was tested on only the DVS spike frames, an accuracy of 61.64% was achieved, which represented a significant increase in accuracy compared to the single video modality network, whose accuracy was 38.26%.

    In [22], the authors conducted a comprehensive study analyzing the performance of some recent visual speech recognition (VSR) models for different regions of the speaker's face. The considered regions included the mouth, the whole face, the upper face, and the cheeks. Their study targeted word-level and sentence-level benchmarks with different characteristics. Two models were used: the 3D-ResNet18 model as the word-level model and the LipNet model as the sentence-level model. The experiments were conducted using two fixed-size mouth cropping methods: fixed bounding box coordinates and mouth-centered crops. Different methods of determining the ROI regions were also considered. In addition, the LRW, LRW-1000, and GRID datasets were used. They concluded that using information from the extraoral facial regions could enhance VSR performance compared to the case when only the lip region was used as the model input. Accuracy at the word level was 85.02% on the LRW dataset and 45.24% on the LRW-1000 dataset, while at the sentence level, the WER was 2.9% and the CER was 1.2%. The state-of-the-art research that used the GRID corpus dataset is given in Tab. 3. Based on the recent studies, the GRID corpus dataset is chosen for use in this study.

    Table 3: State-of-the-art research based on the GRID corpus dataset

    3 Proposed Model

    In this paper, a deep convolutional neural network model for lip reading from a video is developed and denoted as the HLR-Net model. The proposed model is built using a CNN followed by an attention layer and a CTC layer. The CTC is a type of neural network layer with an associated scoring function used as an activation function. The structure of the proposed HLR-Net model is presented in Fig. 1. The proposed model consists of three stages. The first stage is the pre-processing stage, which processes the input video by executing certain operations on it, including mouth movement extraction from the frames, frame normalization, and finally, sentence preparation to obtain the input for the deep learning model. The second and third stages are the encoder and decoder stages, which produce the output subtitle.

    Figure 1: Abstract view of the proposed HLR-Net model architecture

    The second stage is composed of inception layers, a gradient preservation layer, and bidirectional GRU layers, while the third stage consists of the attention layer, fully-connected layer, activation function, and CTC layer. The proposed model is designed based on the attention deep learning model. It takes a video of lip movement as input, converts the video to frames using the OpenCV library, and extracts the mouth part using the dlib library. The resulting frames are normalized to obtain the final frames. The final frames are passed to the deep learning model to produce the final encoded sentence. In the following subsections, each stage is explained in more detail.

    3.1 Stage 1: Data Pre-Processing

    3.1.1 Mouth Movement Extraction

    To perform mouth extraction, the dlib and OpenCV libraries are used to detect facial landmarks. Detection of facial landmarks can be considered a shape prediction problem. An input image is fed to a shape predictor that attempts to localize points of interest on the shape, which are, in this case, the face parts such as the eyes, nose, and mouth. The facial landmark detection includes two steps: the first step is responsible for localizing the face in the image and detecting the key facial points, and the second step is responsible for detecting the mouth from the face, as shown in Fig. 2. As illustrated in Fig. 2, the mouth facial landmarks are located at points 49 to 68 and are represented by blue rectangles. These points are therefore extracted as the mouth in a frame of size 50 × 100, which is the final step of the mouth extraction.
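    The following is a minimal sketch of this mouth-extraction step, assuming dlib's publicly distributed 68-point shape predictor file (shape_predictor_68_face_landmarks.dat) is available locally; the crop-centering logic is an illustrative choice, not necessarily the authors' exact implementation.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_mouth(frame, out_size=(100, 50)):
    """Locate the face, take landmarks 49-68 (the mouth), and crop a 50x100 patch."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # dlib points are 0-indexed, so mouth points 49-68 are indices 48-67.
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    cx, cy = pts.mean(axis=0).astype(int)
    w, h = out_size
    crop = frame[cy - h // 2:cy + h // 2, cx - w // 2:cx + w // 2]
    if crop.size == 0:                   # mouth too close to the image border
        return None
    return cv2.resize(crop, out_size)    # width = 100, height = 50
```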

    Figure 2: Face part localization in the extracted images

    3.1.2 Frame Normalization

    The frame normalization process distributes the mouth pixel values over the frame size. The video is divided into a number of frames for each mouth movement, and for each frame, mouth localization and frame normalization are performed, as shown in Fig. 3.

    Figure 3: Mouth movement extraction and frame normalization from the input video

    3.1.3 Sentence Preparation

    To store the video sentences, the ".align" format is used. Each word in the sentence is represented by a tuple, which contains the frame duration (start time and end time) in addition to the word itself. To separate adjacent sentences, the word "sil" is used. An example of a saved file is presented in Tab. 4. When a file is loaded, a sentence is constructed by appending the last column and ignoring the "sil" and "sp" words. The sentence is then passed to a function that converts it to a list of labels, where the numbers refer to the characters' order. In training, the sentence is passed together with the processed video to the video augmenter function, in which a label is assigned to each frame.

    Table 4: Example of frame tuples for preparing a sentence
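    As a rough illustration of this preparation step, the sketch below parses an ".align" file and maps a sentence to integer labels; the vocabulary string and helper names are assumptions for illustration only.

```python
# Hypothetical vocabulary: index 0 = space, 1-26 = letters (the CTC blank is handled separately).
VOCAB = " abcdefghijklmnopqrstuvwxyz"

def load_align(path):
    """Read (start-time, end-time, word) tuples and drop the 'sil'/'sp' markers."""
    words = []
    with open(path) as f:
        for line in f:
            start, end, word = line.split()
            if word not in ("sil", "sp"):
                words.append(word)
    return " ".join(words)

def sentence_to_labels(sentence):
    """Convert a sentence into a list of label indices over the assumed vocabulary."""
    return [VOCAB.index(ch) for ch in sentence.lower() if ch in VOCAB]
```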

    As previously mentioned, the input of the proposed model is a video, and its output is a mathematical representation of the frames of the input video in the form of a NumPy array. Thus, this stage can be summarized as follows: the input video is processed using the OpenCV library, and each frame of the video is captured. The frame rate is 25 frames per second (fps), and the resulting frames are normalized and converted into a NumPy array. The NumPy array is fed directly to the proposed model.
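    A minimal sketch of this capture-and-normalize pipeline is shown below, reusing the extract_mouth helper from the earlier sketch; scaling pixel values to [0, 1] is an assumed normalization choice.

```python
import cv2
import numpy as np

def video_to_array(path):
    """Read a video with OpenCV, crop and normalize each frame, and stack the result."""
    frames = []
    cap = cv2.VideoCapture(path)            # GRID videos are recorded at 25 fps
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mouth = extract_mouth(frame)        # 50x100 mouth crop (see the earlier sketch)
        if mouth is not None:
            frames.append(mouth.astype(np.float32) / 255.0)   # assumed [0, 1] normalization
    cap.release()
    return np.stack(frames)                 # shape (T, 50, 100, 3), fed to the model
```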

    3.2 Stage 2: Encoder Part

    3.2.1 Three Inception Layers

    Three layers of inception modules are used as CNNs to realize more efficient computation and deeper networks through dimensionality reduction with stacked 1 × 1 × 1 convolutions. The modules are designed to address the problem of computational cost, as well as overfitting, among other issues. The STCNNs are characterized by processing the video frames in both the spatial and temporal domains.

    The modules use multiple kernel filter sizes in the STCNN, but instead of stacking them sequentially, they are arranged to operate at the same level. By structuring the STCNN to perform the convolutions at the same level, the network becomes progressively wider rather than deeper. To reduce the computational cost even further, the neural network is designed such that an extra 1 × 1 × 1 convolution is added before the 3 × 3 × 3 and 3 × 5 × 5 layers. In this way, the number of input channels is limited, and 1 × 1 × 1 convolutions are much cheaper than 3 × 5 × 5 convolutions. A 1 × 1 × 1 convolution layer is also placed after the max-pooling layer. The most simplified version of an inception module performs convolutions on the input using three different filter sizes (1 × 1 × 1, 3 × 3 × 3, 3 × 5 × 5) rather than only one; likewise, max pooling is performed. Then, the resulting outputs are concatenated and sent to the next layer. The details of the three stages of the proposed HLR-Net model are illustrated in Fig. 4.

    Figure 4: The proposed HLR-Net model architecture
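    As an illustration of the block just described, the following Keras sketch builds one inception-style 3D module with the three kernel sizes above, a 1 × 1 × 1 bottleneck before the larger convolutions, and a pooled branch followed by a 1 × 1 × 1 convolution; the filter counts are assumptions rather than the authors' values.

```python
from tensorflow.keras import layers

def inception_3d(x, filters=32, bottleneck=16):
    """Parallel 1x1x1, 3x3x3, and 3x5x5 branches plus a pooling branch, concatenated."""
    b1 = layers.Conv3D(filters, (1, 1, 1), padding="same", activation="relu")(x)

    b3 = layers.Conv3D(bottleneck, (1, 1, 1), padding="same", activation="relu")(x)  # cheap bottleneck
    b3 = layers.Conv3D(filters, (3, 3, 3), padding="same", activation="relu")(b3)

    b5 = layers.Conv3D(bottleneck, (1, 1, 1), padding="same", activation="relu")(x)
    b5 = layers.Conv3D(filters, (3, 5, 5), padding="same", activation="relu")(b5)

    bp = layers.MaxPooling3D((1, 3, 3), strides=(1, 1, 1), padding="same")(x)
    bp = layers.Conv3D(filters, (1, 1, 1), padding="same", activation="relu")(bp)     # 1x1x1 after pooling

    return layers.Concatenate()([b1, b3, b5, bp])   # the network grows wider, not deeper
```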

    3.2.2 Gradient Preservation Layer

    Recurrent neural networks suffer from the gradient vanishing problem during backpropagation. Gradients are the values used to update the neural network weights. The problem is that the gradient shrinks as it backpropagates through time, and if the gradient value becomes extremely small, it contributes only slightly to the learning process. For this reason, the gradient preservation layer is introduced. In this layer, two identity residual network blocks are used to mitigate the gradient vanishing problem. This layer is followed by a max-pooling layer to reduce the high dimensionality.
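    A hedged sketch of one such identity residual block is given below; the kernel size and filter count are assumptions, and the input is assumed to already have the same number of channels so the skip connection can be added directly.

```python
from tensorflow.keras import layers

def identity_block_3d(x, filters=64):
    """Two 3D convolutions with an identity skip connection that preserves the gradient."""
    shortcut = x                                   # assumes x already has `filters` channels
    y = layers.Conv3D(filters, (3, 3, 3), padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv3D(filters, (3, 3, 3), padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])                # gradient can flow unchanged through the skip
    return layers.Activation("relu")(y)
```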

    3.2.3 Bidirectional GRU Layers

    The output of the max-pooling layer is first flattened while preserving the time dimension and then passed to two bidirectional GRU (Bi-GRU) neural networks. This sequence model is used to add temporal information to the spatial features. The bidirectional GRU can also be used with the attention model or a combination of the attention model and CTC. The Bi-GRUs are improved versions of a standard RNN. To solve the gradient vanishing problem of a standard RNN, the GRU adopts the so-called update gate and reset gate. The reset gate performs the mixing of the current input and the previous memory state, and the update gate specifies the portion taken from the previous memory state into the current hidden state. The Bi-GRU is used to capture both forward and backward information flows to attain the past and future states.

    Basically, these are two vectors that decide what information should be passed to the output. It should be noted that they can be trained to keep information from far back in time without it vanishing, and to remove information that is irrelevant to the prediction.
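    The flatten-then-Bi-GRU step can be sketched in Keras as follows; the unit count of 256 is an assumption.

```python
from tensorflow.keras import layers

def bigru_encoder(x):
    """x: (batch, time, H, W, C) feature maps from the pooling layer."""
    x = layers.TimeDistributed(layers.Flatten())(x)                     # flatten, keep the time axis
    x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)
    x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)
    return x                                                            # (batch, time, 512)
```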

    3.3 Stage 3: Decoder Part

    3.3.1 Attention Layers

    The output of the Bi-GRU in the attention layer is passed to a dense layer with 28 outputs representing the output characters. The attention layer is an example of sequence-to-sequence sentence translation using a bidirectional RNN with attention. It computes the attention weights of the output vectors at each time step. There are several methods to compute the attention weights, for instance, using the dot product or a neural network with a single hidden layer. These weights are multiplied by each of the words in the source, and this product is fed to the language model along with the output from the previous layer to obtain the output for the current time step. The weights determine how much importance should be given to each word in the source when determining the output sentence. The final output is calculated using the SoftMax function in the dense layer.
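    One way to realize this step, sketched here with Keras' built-in dot-product attention rather than the authors' exact formulation, is to attend over the Bi-GRU outputs and then project each time step onto the 28 output classes (assumed to be 27 characters plus the CTC blank):

```python
from tensorflow.keras import layers

def attention_decoder(seq, num_classes=28):
    """seq: Bi-GRU outputs of shape (batch, time, features)."""
    context = layers.Attention()([seq, seq])                            # dot-product attention over time
    return layers.Dense(num_classes, activation="softmax")(context)     # per-frame character probabilities
```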

    3.3.2 CTC Layer

    The CTC is an alignment-free, scalable, non-autoregressive method used for sequence transduction in applications such as handwritten text recognition and speech recognition. The CTC is an independent component that is used to specify the output and its scoring. It takes a sequence of samples as input and produces a label for each of them; it also generates blank outputs. When the number of observations is larger than the number of labels, training is difficult, for instance, when multiple time slices could correspond to a single phoneme. A probability distribution is used at each time step to predict the label, as the alignment of the observed sequence is unknown [7,10]. The output of the CTC layer is continuous (for instance, obtained from a SoftMax layer), and it is adjusted and determined during the model training phase. The CTC collapses any repeated characters that resolve to one character in the spelling to form the correct word. The CTC scores are processed by the backpropagation algorithm to update the network weights. This approach makes the training process faster compared to that of an RNN.
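    A hedged sketch of wiring the CTC loss and greedy decoding with the Keras backend helpers is shown below; the tensor names and shapes follow common practice and are not taken from the authors' code.

```python
from tensorflow.keras import backend as K

def ctc_loss(y_true, y_pred, input_length, label_length):
    """y_pred: per-frame softmax outputs (batch, T, num_classes);
    y_true: padded label sequences (batch, max_label_len)."""
    return K.ctc_batch_cost(y_true, y_pred, input_length, label_length)

def greedy_decode(y_pred, input_length):
    """Collapse repeated characters and drop blanks to recover the predicted sentence."""
    decoded, _ = K.ctc_decode(y_pred, input_length, greedy=True)
    return decoded[0]
```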

    4 Experimental Results

    4.1 Experimental Parameters and Datasets

    Google Colab Pro was used to test the proposed model; the environment provided 2 TB of storage on a server with 26 GB of RAM and a P100 GPU. Many lip-reading datasets are available, including AVICar, AVLetters, AVLetters2, BBC TV, CUAVE, OuluVS1, and OuluVS2 [15,23], but they either contain only single words, even when plentiful, or are too small. In this study, sentence-level lip reading was considered, so the GRID dataset [14] was used. This dataset includes audio and video recordings of 34 speakers, with 1000 sentences per speaker. In total, it consists of 28 hours of data comprising 34,000 sentences.

    For the HLR-Net model, the GRID corpus sentence-level dataset was used. The following simple sentence structure was assumed: color (4) + command (4) + preposition (4) + digit (10) + letter (25) + adverb (4). The number next to each part indicates the number of possible word choices for each of the six word groups. These groups consist of the following data: {bin, lay, place, set}, {blue, green, red, white}, {at, by, in, with}, {A,...,Z} excluding W, {zero,...,nine}, and {again, now, please, soon}, giving a total of 64000 possible sentences. The code of the proposed HLR-Net has been uploaded to the GitHub website [24], and the parameters of the proposed HLR-Net are given in Tab. 5, where T refers to time, C refers to channels, F refers to the feature dimension, H and W refer to height and width, respectively, and V refers to the number of words in the vocabulary, including the CTC blank symbol.

    Table 5: The proposed HLR-Net architecture hyperparameters

    Table 6: Mapping of 32 phonemes to 12 visemes

    To train the HLR-Net model, phoneme and viseme units were used. Visemes represent visually distinguishable speech units that have a one-to-many mapping to phonemes. The phoneme-to-viseme mapping of [14] was used and is shown in Tab. 6, where it can be seen that the 32 phonemes are grouped into 12 visemes denoted by V1, V2, ..., and V12.

    A confusion matrix was used to test the visemes of the proposed model. The matrix shows that the proposed HLR-Net model can recognize the visemes with little confusion. Namely, V1, V2, and V3, which map the phoneme groups {/aa/ /ah/ /ay/ /eh/ /r/ /iy/}, {/ae/ /ih/ /y/ /v/}, and {/aw/ /u/ /au/ /uw/}, respectively, were frequently misclassified during the text decoding process because they are similarly pronounced. However, the misclassification of visemes was not significant, as shown in the confusion matrix in Fig. 5, which verifies the effectiveness of the proposed HLR-Net model.

    Figure 5: Confusion matrix for the proposed HLR-Net model

    4.2 Performance Metrics

    To evaluate the performance of the proposed HLR-Net model and compare it with the baselines, the word error rate (WER), the character error rate (CER), and the Bleu score [25,26] were used; they are calculated by Eqs. (1)-(3), respectively.

    In Eq. (1), S denotes the number of substitutions, D is the number of deletions, I is the number of insertions, C represents the number of correct words, and N is the number of words in the reference, expressed as N = S + D + C. In Eq. (2), n is the total number of characters, and i denotes the minimal number of character insertions; s and d denote the numbers of substitutions and deletions required to transform the reference text into the output.

    The Bleu score adopts a modified form of precision to compare a candidate translation against multiple reference translations. In Eq. (3), m denotes the number of words from the candidate that are found in the reference, and wt is the total number of words in the candidate.
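    Consistent with the variable definitions above, Eqs. (1)-(3) can be written in their standard forms as follows (a reconstruction based on those definitions):

```latex
\begin{align}
\mathrm{WER} &= \frac{S + D + I}{N}, \qquad N = S + D + C \tag{1}\\[4pt]
\mathrm{CER} &= \frac{i + s + d}{n} \tag{2}\\[4pt]
\mathrm{Bleu} &= \frac{m}{w_t} \tag{3}
\end{align}
```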

    Table 7: Performance comparison between the proposed HLR-Net model and other recent work

    Figure 6: CER values for the proposed HLR-Net model compared with the other three models in the cases of unseen and overlapped speakers

    4.3 Experimental Comparisons and Discussions

    The CER/WER was defined as the least number of character (or word) substitutions, insertions, and deletions required to convert the prediction into the ground truth, divided by the number of characters (or words) in the reference. The Bleu score indicates how similar the candidate text is to the reference texts, where values closer to 100% represent higher similarity. Smaller WER/CER values mean higher prediction accuracy, while a larger Bleu score is preferred. The results are given in Tab. 7. The proposed HLR-Net was compared with three recent models: the LipNet, LCANet (AH-CTC), and A-CTC models. All models were trained on the GRID dataset. The performances of the models were tested for two different types of speakers: unseen speakers and overlapped speakers. The comparisons of the models in terms of the CER, WER, and Bleu score values are presented in Figs. 6-8, respectively.

    In the case of unseen speakers, the proposed HLR-Net model achieved a CER of 4.9%, WER of 9.7%, and Bleu score of 92%; the LipNet model achieved a CER of 6.4%, WER of 11.4%, and Bleu score of 88.2%; the LCANet model achieved a CER of 5.3%, WER of 10.0%, and Bleu score of 90.4%; lastly, the A-CTC model achieved a CER of 5.6%, WER of 10.8%, and Bleu score of 90.7%.

    Figure 7: WER values for the proposed HLR-Net model compared with the other three models in the cases of unseen and overlapped speakers

    Figure 8: Bleu score values for the proposed HLR-Net model compared with the other three models in the cases of unseen and overlapped speakers

    In the case of overlapped speakers, the proposed model also achieved better performance than the other models, with a CER of 1.4%, WER of 3.3%, and Bleu score of 99%. However, it should be noted that the CER of the LCANet, which was 1.3%, was slightly better than that of the proposed model. Based on the overall results, the proposed HLR-Net model outperformed the other models.

    5 Conclusions and Future Work

    In this paper, a hybrid video-based lip-reading model, denoted as the HLR-Net model, is developed using deep convolutional neural networks. The proposed model consists of three stages: the pre-processing stage, the encoder stage, and the decoder stage. The encoder stage is composed of inception layers, a gradient preservation layer, and bidirectional GRU layers, while the decoder consists of the attention layer, fully-connected layer, activation function, and CTC layer. The proposed model is designed based on the attention deep learning model. It uses a video of lip movement as input, converts this video to frames using the OpenCV library, and extracts the mouth part using the dlib library. The resulting frames are normalized to obtain the final frames. The final frames are passed to the deep learning model to produce the encoded sentence.

    The performance of the proposed HLR-Net model is verified through experiments and compared with those of three state-of-the-art models: the LipNet, LCANet (AH-CTC), and A-CTC models. The experimental results show that the proposed HLR-Net model outperforms the other three models, achieving a CER of 4.9%, WER of 9.7%, and Bleu score of 92% in the case of unseen speakers, and a CER of 1.4%, WER of 3.3%, and Bleu score of 99% in the case of overlapped speakers. However, the CER of the LCANet is 1.3%, which is slightly better than that of the proposed model.

    As future work, the proposed HLR-Net model will be applied and tested on the Arabic language for further verification.

    Acknowledgement: The authors would like to acknowledge the student group: Abdelrhman Khater, Ahmed Sheha, Mohamed Abu Gabal, Mohamed Abdelsalam, Mohamed Farghaly, Mohamed Gnana, and Hagar Metwalli from the Computer and Control Engineering Department at the Faculty of Engineering, Tanta University, who worked on this research as part of their graduation project. We also thank LetPub (www.letpub.com) for its linguistic assistance during the preparation of this manuscript.

    Funding Statement: The authors received no specific funding for this study.

    Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
