
    Using Speaker-Specific Emotion Representations in Wav2vec 2.0-Based Modules for Speech Emotion Recognition

    Computers, Materials & Continua, 2023, Issue 10

    Somin Park, Mpabulungi Mark, Bogyung Park and Hyunki Hong

    1 College of Software, Chung-Ang University, Seoul, 06973, Korea

    2 Department of AI, Chung-Ang University, Seoul, 06973, Korea

    ABSTRACT Speech emotion recognition is essential for frictionless human-machine interaction, where machines respond to human instructions with context-aware actions. The properties of individuals' voices vary with culture, language, gender, and personality. These variations in speaker-specific properties may hamper the performance of standard representations in downstream tasks such as speech emotion recognition (SER). This study demonstrates the significance of speaker-specific speech characteristics and how they can be leveraged to improve the performance of SER models. In the proposed approach, two wav2vec-based modules (a speaker-identification network and an emotion classification network) are trained with the Arcface loss. The speaker-identification network has a single attention block to encode an input audio waveform into a speaker-specific representation. The emotion classification network uses a wav2vec 2.0 backbone as well as four attention blocks to encode the same input audio waveform into an emotion representation. These two representations are then fused into a single vector representation containing emotion and speaker-specific information. Experimental results showed that the use of speaker-specific characteristics improves SER performance. Additionally, combining these with an angular margin loss such as the Arcface loss improves intra-class compactness while increasing inter-class separability, as demonstrated by plots of t-distributed stochastic neighbor embeddings (t-SNE). The proposed approach outperforms previous methods using similar training strategies, with a weighted accuracy (WA) of 72.14% and an unweighted accuracy (UA) of 72.97% on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. This demonstrates its effectiveness and potential to enhance human-machine interaction through more accurate emotion recognition in speech.

    KEYWORDS Attention block; IEMOCAP dataset; speaker-specific representation; speech emotion recognition; wav2vec 2.0

    1 Introduction

    The recent rapid growth of computer technology has made human-computer interaction an integral part of the human experience. Advances in automatic speech recognition (ASR) [1] and text-to-speech (TTS) synthesis [2] have made smart devices capable of searching for and responding to verbal requests. However, this only supports limited interactions and is not sufficient for interactive conversations. Most ASR methods generally focus on the content of speech (words) without regard for the intonation, nuance, and emotion conveyed through audio speech. Speech emotion recognition (SER) is one of the most active research areas in computer science because the friction in every human-computer interaction could be significantly reduced if machines could perceive and understand the emotions of their users and perform context-aware actions.

    Previous studies used low-level descriptors (LLDs) generated from frequency, amplitude, and spectral properties (spectrogram, Mel-spectrogram, etc.) to recognize emotions in audio speech. Although the potential of hand-crafted features has been demonstrated in previous works, such features and their representations must be tailored and optimized for specific tasks. Deep learning-based representations generated from actual waveforms or LLDs have shown better performance in SER.

    Studies in psychology have shown that individuals have different vocal attributes depending on their culture, language, gender, and personality [3]. This implies that two speakers saying the same thing with the same emotion are likely to express different acoustic properties in their voices. The merits of considering speaker-specific properties in audio speech-related tasks have been demonstrated in several studies [4,5].

    In this paper, a novel approach in which a speaker-specific emotion representation is leveraged to improve speech emotion recognition performance is introduced. The proposed model consists of a speaker-identification network and an emotion classifier. The wav2vec 2.0 [6] base model is used as a backbone for both of the proposed networks, where it extracts emotion-related and speaker-specific features from input audio waveforms. A novel tensor fusion approach is used to combine these representations into a speaker-specific emotion representation. In this tensor fusion operation, the representation vectors are element-wise multiplied by a trainable fusion matrix, and the resultant vectors are then summed. The main contributions of this paper are summarized as follows:

    • Two wav2vec 2.0-based modules (a speaker-identification network and an emotion classification network) that generate a speaker-specific emotion representation from an input audio segment are proposed. The two modules are trained and evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset [7]. Training networks on the IEMOCAP dataset is prone to over-fitting because it has only ten speakers. The representations generated by the speaker-identification network pre-trained on the VoxCeleb1 dataset [8] facilitate better generalization to unseen speakers.

    • A novel tensor fusion approach is used to combine the generated emotion and speaker-specific representations into a single vector representation suitable for SER. The use of the Arcface [9] and cross-entropy loss terms in the speaker-identification network was also explored, and detailed evaluations have been provided.

    2 Related Work

    2.1 Hand-Crafted Audio Representations

    A vast array of representations and models have been explored to improve audio speech-based emotion recognition. LLDs such as pitch and energy contours have been employed in conjunction with hidden Markov models [10] to recognize a speaker's emotion from audio speech. Reference [11] used the delta and delta-delta of a log Mel-spectrogram to reduce the impact of emotionally irrelevant factors on speech emotion recognition. In this approach, an attention layer automatically drove focus to emotionally relevant frames and generated discriminative utterance-level features. Global-Aware Multi-Scale (GLAM) [12] used Mel-frequency cepstral coefficient (MFCC) inputs and a global-aware fusion module to learn a multi-scale feature representation that is rich in emotional information.

    Time-frequency representations such as the Mel-spectrogram and MFCCs merge the frequency and time domains into a single representation using the Fast Fourier Transform (FFT). Reference [13] addressed the challenges associated with the tradeoff between accuracy in the frequency and time domains by employing a wavelet transform-based representation. Here, Morlet wavelets generated from an input audio sample are decomposed into child wavelets by applying a continuous wavelet transform (CWT) to the input signal with varying scale and translation parameters. These CWT features are considered as a representation that can be employed in downstream tasks.

    2.2 Learning Audio Representation Using Supervised Learning

    In more recent approaches, models learn a representation directly from raw waveforms instead of hand-crafted representations like the human-perception-emulating Mel-filter banks used to generate the Mel-spectrogram. Time-Domain (TD) filter banks [14] use complex convolutional weights initialized with Gabor wavelets to learn filter banks from raw speech for end-to-end phone recognition. The proposed architecture has a convolutional layer followed by an l2 feature pooling-based modulus operation and a low-pass filter. It can be used as a learnable replacement for Mel-filter banks in existing deep learning models. In order to approximate the Mel-filter banks, the square of the Hanning window was used, and the biases of the convolutional layers were set to zero. Due to the absence of positivity constraints, a 1 was added to the output before applying log compression. A key limitation of this approach is that the log-scale compression and normalization that were used reduce the scale of spectrograms regardless of their contents.

    Wang et al. [15] also proposed a learned drop-in alternative to the Mel-filter banks but replaced static log compression with dynamic compression and addressed the channel distortion problems in the Mel-spectrogram log transformation using Per-Channel Energy Normalization (PCEN). This was calculated using a smoothed version of the filter bank energy function, computed with a first-order infinite impulse response (IIR) filter. A smoothing coefficient was used to combine the smoothed filter bank energy function with the current spectrogram energy function. In order to address the compression function's fixed non-linearity, PCEN was modified to learn channel-dependent smoothing coefficients alongside the other hyper-parameters [16] in a version of the model referred to as sPCEN.
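    To make the trainable-compression idea concrete, the sketch below implements a PCEN-style front end in PyTorch. It is a minimal sketch, assuming filter bank energies shaped (batch, channels, time); the default values of the smoothing coefficient and of the exponents are illustrative choices, not the settings used in [15,16].

```python
import torch
import torch.nn as nn

class PCEN(nn.Module):
    """Per-Channel Energy Normalization with learnable per-channel parameters (sketch).

    The smoother M is a first-order IIR filter over time, and the energy E is
    normalized by M and compressed with a learned root exponent.
    """
    def __init__(self, num_channels, s=0.04, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
        super().__init__()
        # Channel-dependent parameters; as in sPCEN, the smoothing coefficient s
        # is also made learnable per channel.
        self.s = nn.Parameter(torch.full((num_channels,), s))
        self.alpha = nn.Parameter(torch.full((num_channels,), alpha))
        self.delta = nn.Parameter(torch.full((num_channels,), delta))
        self.r = nn.Parameter(torch.full((num_channels,), r))
        self.eps = eps

    def forward(self, energy):
        # energy: (batch, channels, time) filter bank energies
        s = self.s.clamp(0.0, 1.0).view(1, -1, 1)
        alpha, delta, r = (p.view(1, -1, 1) for p in (self.alpha, self.delta, self.r))
        # First-order IIR smoother: M(t) = (1 - s) * M(t - 1) + s * E(t)
        m, smoothed = energy[:, :, :1], []
        for t in range(energy.shape[-1]):
            m = (1.0 - s) * m + s * energy[:, :, t:t + 1]
            smoothed.append(m)
        m = torch.cat(smoothed, dim=-1)
        # Dynamic compression replacing the static log transform
        return (energy / (self.eps + m) ** alpha + delta) ** r - delta ** r

pcen = PCEN(num_channels=40)
out = pcen(torch.rand(2, 40, 100).abs())  # (2, 40, 100) compressed energies
```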

    2.3 Learned Audio Representation Using Self-Supervised Learning

    In supervised learning, class labels are used to design convolution filters and generate task-specific representations. Due to the vast amounts of unlabeled audio data available, self-supervised learning (SSL) methods have been proposed for obtaining generalized representations of input audio waveforms for downstream tasks. These audio SSL methods can be categorized into auto-encoding, siamese, clustering, and contrastive techniques [17].

    Audio2vec [18] was inspired by word2vec [19] and learned general-purpose audio representations using an auto-encoder-like architecture to reconstruct a Mel-spectrogram slice from past and future slices. Continuous Bag of Words (CBoW) and skip-gram variants were also implemented and evaluated. In the Mockingjay [20] network, bidirectional Transformer encoders trained to predict the current frame from past and future contexts were used to generate general-purpose audio representations. Bootstrap your own latent for audio (BYOL-A) [21] is a Siamese model-based architecture that assumes no relationships exist between time segments of audio samples. In this architecture, two neural networks were trained by maximizing the agreement in their outputs given the same input. Normalization and augmentation techniques were also used to differentiate between augmented versions of the same audio segment, thereby learning a general-purpose audio representation. Hidden-unit bidirectional encoder representations from Transformers (HuBERT) [22] addressed the challenges associated with multiple sound units in each utterance, the absence of a lexicon of input sounds, and the variable length of sound units by using an offline clustering step to provide aligned target labels for a prediction loss similar to that in BERT [23]. This prediction loss was only applied over masked regions, forcing the model to learn a combined acoustic and language model over continuous inputs. The model was based on the wav2vec 2.0 architecture, which consists of a convolutional waveform encoder, a projection layer, and a code embedding layer, but has no quantization layer. The HuBERT and wav2vec 2.0 models have similar architectures but differ in the self-supervised training techniques they employ. More specifically, wav2vec 2.0 masks a speech sequence in the latent space and solves a contrastive task defined over a quantization of the latent representation. On the other hand, the HuBERT model learns combined acoustic and language properties over continuous input by using an offline clustering step to provide aligned target labels for a BERT-like prediction loss applied over only the masked regions. Pseudo labels for the encoded vectors were generated by applying K-means clustering to the MFCCs of the input waveforms.

    Contrastive methods generate an output representation using a loss function that encourages the separation of positive from negative samples. For instance, Contrastive Learning of Auditory Representations (CLAR) [24] encoded both the waveform and the spectrogram into audio representations. Here, the encoded representations of the positive and negative pairs are used contrastively.

    2.4 Using Speaker Attributes in SER

    The Individual Standardization Network (ISNet) [4] showed that considering speaker-specific attributes can improve emotion classification accuracy. Reference [4] used an aggregation of individuals' neutral speech to standardize emotional speech and improve the robustness of individual-agnostic emotion representations. A key limitation of this approach is that it only applies to cases where labeled neutral training data for each speaker is available. The Self-Speaker Attentive Convolutional Recurrent Neural Net (SSA-CRNN) [5] uses two classifiers that interact through a self-attention mechanism to focus on emotional information and ignore speaker-specific information. This approach is limited by its inability to generalize to unseen speakers.

    2.5 Wav2vec 2.0

    Wav2vec 2.0 converts an input speech waveform into spectrogram-like features by predicting the masked quantized representation over an entire speech sequence [6]. The first wav2vec [25] architecture attempted to predict future samples from a given signal context. It consists of an encoder network that embeds the audio signal into a latent space and a context network that combines multiple time steps of the encoder to obtain contextualized representations. VQ-wav2vec [26], a vector quantized (VQ) version of the wav2vec model, learned discrete representations of audio segments using a future time step prediction task in line with previous methods but replaced the original representation with a Gumbel-Softmax-based quantization module. Wav2vec 2.0 adopted both the contrastive and diversity losses in the VQ-wav2vec framework. In other words, wav2vec 2.0 compares positive and negative samples without predicting future samples.

    Wav2vec 2.0 comprises a feature encoder, a contextual encoder, and a quantization module. First, the feature encoder converts the normalized waveform into a two-dimensional (2-d) latent representation. The feature encoder is implemented using seven one-dimensional (1-d) convolution layers with different kernel sizes and strides. A Hanning window of the same size as the kernel and a short-time Fourier transform (STFT) with a hop length equal to the stride were used. The encoding that the convolutional layers generate from an input waveform is normalized and passed as input to two separate branches (the contextual encoder and the quantization module). The contextual encoder consists of a linear projection layer, a relative positional encoding 1-d convolution layer followed by a Gaussian error linear unit (GeLU), and a Transformer model. More specifically, each input is projected to a higher-dimensional feature space and then encoded based on its relative position in the speech sequence. The projected input and its relative positional encoding are summed and normalized. The resultant speech features are randomly masked and fed into the Transformer, which aggregates the local features into a context representation (C). The quantization module discretizes the feature encoder's output into a finite set of speech representations. This is achieved by choosing V quantized representations (codebook entries) from multiple codebooks using a Gumbel softmax operation, concatenating them, and applying a linear transformation to the final output. A diversity loss encourages the model to use codebook entries equally often.
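    The sketch below illustrates the convolutional feature encoder described above; the kernel sizes and strides follow the commonly cited wav2vec 2.0 base configuration and should be treated as assumptions rather than values verified against the released model.

```python
import torch
import torch.nn as nn

class Wav2Vec2FeatureEncoder(nn.Module):
    """Sketch of the wav2vec 2.0 convolutional feature encoder.

    Seven 1-d convolutions turn a raw 16 kHz waveform into a sequence of
    512-dimensional latent frames (roughly one frame per 20 ms of audio).
    """
    def __init__(self, dim=512):
        super().__init__()
        kernels = [10, 3, 3, 3, 3, 2, 2]   # assumed base-model kernel sizes
        strides = [5, 2, 2, 2, 2, 2, 2]    # assumed base-model strides
        layers, in_ch = [], 1
        for k, s in zip(kernels, strides):
            layers += [nn.Conv1d(in_ch, dim, kernel_size=k, stride=s, bias=False),
                       nn.GroupNorm(dim, dim) if in_ch == 1 else nn.Identity(),
                       nn.GELU()]
            in_ch = dim
        self.conv = nn.Sequential(*layers)

    def forward(self, waveform):
        # waveform: (batch, samples) normalized raw audio
        z = self.conv(waveform.unsqueeze(1))   # (batch, 512, T)
        return z.transpose(1, 2)               # (batch, T, 512) latent frames

# A 3 s clip at 16 kHz yields on the order of 150 latent frames.
frames = Wav2Vec2FeatureEncoder()(torch.randn(1, 48000))
print(frames.shape)  # e.g., torch.Size([1, 149, 512])
```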

    The contextual representation c_t of a masked time step t is compared with the quantized latent representation q_t at the same time step. The contrastive loss makes c_t similar to q_t and dissimilar to K sampled quantized representations (distractors) drawn from other masked time steps, where the candidate set Q_t contains q_t and the K distractors (q̃ ∈ Q_t). The contrastive task's loss term is defined as

    L_m = −log [ exp(sim(c_t, q_t)/κ) / Σ_{q̃∈Q_t} exp(sim(c_t, q̃)/κ) ]

    where sim(·,·) denotes cosine similarity and κ is the temperature of the contrastive loss. The diversity loss and the contrastive loss are balanced using a hyper-parameter. A more detailed description is available in the wav2vec 2.0 paper [6].
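    For concreteness, the following is a minimal sketch of this contrastive objective in PyTorch for a single masked time step; the way distractors are pre-sampled and batched here is a simplification, not the exact fairseq implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_t, distractors, kappa=0.1):
    """Contrastive objective for one masked time step (sketch).

    c_t:         (d,) contextual representation at the masked step
    q_t:         (d,) quantized target at the same step
    distractors: (K, d) quantized representations sampled from other masked steps
    kappa:       temperature of the contrastive loss
    """
    candidates = torch.cat([q_t.unsqueeze(0), distractors], dim=0)    # (K + 1, d)
    sims = F.cosine_similarity(c_t.unsqueeze(0), candidates, dim=-1)  # (K + 1,)
    logits = sims / kappa
    # The true quantized target sits at index 0 of the candidate set, so the
    # cross-entropy with target 0 equals -log softmax of the positive pair.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

loss = contrastive_loss(torch.randn(256), torch.randn(256), torch.randn(100, 256))
```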

    Several variations of the wav2vec 2.0 model have been proposed in recent studies [27–29]. The wav2vec 2.0-robust model [27] was trained in more general setups where the domain of the unlabeled data used for pre-training differs from that of the labeled data used for fine-tuning. This study demonstrated that pre-training on various domains improves the performance of fine-tuned models on downstream tasks. In order to make speech technology accessible for other languages, several studies pre-trained the wav2vec 2.0 model on a wide range of tasks, domains, data regimes, and languages to achieve cross-lingual representations [28,29]. More specifically, in the “xlsr” and “xls-r” variations of the wav2vec 2.0 model, such as wav2vec 2.0-large-xlsr-53, wav2vec 2.0-large-xlsr-53-extended, wav2vec 2.0-xls-r-300m, and wav2vec 2.0-xls-r-1b, “xlsr” indicates that a single wav2vec 2.0 model was pre-trained to generate cross-lingual speech representations for multiple languages. Here, the “xlsr-53” model is large and was pre-trained on datasets containing 53 languages. Unlike the “xlsr” variations, the “xls-r” model variations are large-scale and were pre-trained on several large datasets with up to 128 languages. The “300m” and “1b” refer to the number of model parameters used, and the difference between the “300m” and “1b” variations lies mainly in the number of Transformer model parameters.

    The wav2vec 2.0 representation has been employed in various SER studies because of its outstanding ability to create generalized representations that can be used to improve acoustic model training. SUPERB [30] evaluated how well pre-trained audio SSL approaches perform on ten speech tasks. Pre-trained SSL networks with high performance can be frozen and employed on downstream tasks. SUPERB's wav2vec 2.0 models are variations of wav2vec 2.0 with the original weights frozen and an extra fully connected layer added. For the SER task, the IEMOCAP dataset was used. Since the outputs of SSL networks effectively represent the frequency features in the speech sequence, the length of the representations varies with the length of the utterances. In order to obtain a fixed-size representation for utterances, average time pooling is performed before the fully connected layer. In [31], the feasibility of partly or entirely fine-tuning these weights was examined. Reference [32] proposed a transfer learning approach in which the outputs of several layers of the pre-trained wav2vec 2.0 model were combined using trainable weights that were learned jointly with a downstream model. In order to improve SER performance, reference [33] employed various fine-tuning strategies on the wav2vec 2.0 model, including task adaptive pre-training (TAPT) and pseudo-label task adaptive pre-training (P-TAPT). TAPT addressed the mismatch between the pre-training and target domains by continuing to pre-train on the target dataset. P-TAPT achieves better performance than TAPT by altering the training objective to predicting the cluster assignment of emotion-specific features in masked frames. The emotion-specific features act as pseudo labels and are generated by applying k-means clustering to representations generated using the wav2vec model.

    2.6 Additive Angular Margin Loss

    Despite their popularity, earlier losses like the cross-entropy did not encourage intra-class compactness and inter-class separability [34] in classification tasks. In order to address this limitation, the contrastive, triplet [35], center [36], and Sphereface [37] losses encouraged separability between learned representations. The Additive Angular Margin Loss (Arcface) [9] and Cosface [38] achieved better separability by encouraging stronger boundaries between representations. In Arcface, the representations are distributed around feature centers on a hypersphere with a fixed radius. An additive angular penalty is employed to simultaneously enhance the intra-class compactness and inter-class discrepancy. Here, the angular differences between an input feature vector (x ∈ R^d) and the center representation vectors of the classes (W ∈ R^(N×d)) are calculated. A margin is added to the angular difference between features in the same class to make the learned features separable by a larger angular distance. Reference [39] used the Arcface loss to train a bimodal audio-text network for SER and reported improved performance. A similar loss term is used in the proposed method.

    Eq. (2) is equivalent to calculating the softmax with a bias of 0. After applying a logit transformation, Eq. (2) can be rewritten as Eq. (3),

    where ‖·‖ is the l2 normalization and θ_j is the angle between W_j and x_i. In Eq. (4), the additive margin penalty (m) is added only to the angle (θ_{y_i}) between the target weight (W_{y_i}) and the features (x_i). The features are re-scaled using the scaling factor (s), which yields the final Arcface loss.

    Reference [39] demonstrated the Arcface loss term's ability to improve the performance of SER models. It is therefore employed in training the modules proposed in this study.
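    The following is a minimal sketch of this additive angular margin computation in PyTorch; the embedding size, class count, and the use of cross-entropy over the re-scaled logits follow the standard Arcface formulation and are assumptions rather than the exact implementation used in this study (which sets s = 30 and m = 0.3, per Section 4.2).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcfaceHead(nn.Module):
    """Additive angular margin (Arcface) classification head (sketch)."""
    def __init__(self, embed_dim, num_classes, s=30.0, m=0.3):
        super().__init__()
        # Center representation vectors W ∈ R^(num_classes × embed_dim)
        self.centers = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.s, self.m = s, m

    def forward(self, x, labels):
        # l2-normalize features and class centers, then take their cosine similarity
        cos_theta = F.linear(F.normalize(x), F.normalize(self.centers))   # (B, C)
        theta = torch.acos(cos_theta.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin m only to the target-class angle θ_{y_i}
        one_hot = F.one_hot(labels, cos_theta.size(1)).bool()
        logits = torch.where(one_hot, torch.cos(theta + self.m), cos_theta)
        # Re-scale by s and apply the usual softmax cross-entropy
        return F.cross_entropy(self.s * logits, labels)

head = ArcfaceHead(embed_dim=768, num_classes=4)
loss = head(torch.randn(8, 768), torch.randint(0, 4, (8,)))
```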

    3 Methodology

    In order to leverage speaker-specific speech characteristics to improve the performance of SER models, two wav2vec 2.0-based modules (a speaker-identification network and an emotion classification network) trained with the Arcface loss are proposed. The speaker-identification network extends the wav2vec 2.0 model with a single attention block and encodes an input audio waveform into a speaker-specific representation. The emotion classification network uses a wav2vec 2.0 backbone as well as four attention blocks to encode the same input audio waveform into an emotion representation. These two representations are then fused into a single vector representation that contains both emotion and speaker-specific information.

    3.1 Speaker-Identification and Emotion Classification Networks

    The speaker-identification network (Fig. 1) encodes the vocal properties of a speaker into a fixed-dimension vector (d). The wav2vec 2.0 model encodes input utterances into a latent 2-d representation of shape R^(768×T), where T is the number of frames generated from the input waveform. This latent representation is passed to a single attention block prior to performing a max-pooling operation that results in a 1-d vector of length 768. Only a single attention block was used in the speaker-identification network because it is assumed that the core properties of a speaker's voice are unaffected by his or her emotional state. In other words, a speaker can be identified by his/her voice regardless of his/her emotional state. In order to achieve a more robust distinction between speakers, the R^d-shaped speaker-identification representation (H_id) and the R^(#ID×d)-shaped Arcface center representation vectors (W_id) for the speaker classes are l2-normalized, and their cosine similarity is computed. Configurations of the speaker-identification network using the cross-entropy loss were also explored. In experiments using the cross-entropy loss, the Arcface center representation vectors for the speaker classes were replaced with a fully connected (FC) layer. The FC outputs were then fed into a softmax function to obtain the probability of each speaker class. In Fig. 1, "#ID" represents the index of each speaker class. For example, in the VoxCeleb1 dataset with 1,251 speakers, the final #ID is #1,251.
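    As a rough illustration of this pooling path, the sketch below collapses the wav2vec 2.0 frame sequence into a single 768-d speaker vector with one self-attention block followed by max-pooling over time. The four attention heads and dropout of 0.1 follow Section 4.2, while the residual connection and layer normalization inside the block are assumptions about details not spelled out in the text.

```python
import torch
import torch.nn as nn

class AttentionPoolingBlock(nn.Module):
    """One self-attention block followed by max-pooling over time (sketch).

    Turns a (batch, T, 768) wav2vec 2.0 frame sequence into a single
    768-d utterance-level vector.
    """
    def __init__(self, dim=768, num_heads=4, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frames):
        attended, _ = self.attn(frames, frames, frames)   # self-attention over time
        attended = self.norm(frames + attended)           # residual + layer norm (assumed)
        return attended.max(dim=1).values                 # max-pool over the time axis

speaker_vector = AttentionPoolingBlock()(torch.randn(2, 149, 768))  # (2, 768)
```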

    Figure 1: Architecture of the speaker-identification network with the extended wav2vec 2.0 model (left) and l2 normalization, cosine similarity, and cross-entropy loss computation (right), with a single output for each speaker class

    In the emotion classification network (Fig. 2), the wav2vec 2.0 model encodes input utterances into an R^(768×T)-shaped representation. The generated encoding is passed to a ReLU activation layer before being fed into an FC layer and eventually passed to four attention blocks. The four attention blocks identify the parts of the generated emotion representation that are most relevant to SER. Experiments were also conducted for configurations with one, two, and three attention blocks. Max-pooling is applied across the time axis to the outputs of each attention block. The max-pooled outputs of the attention blocks h_i are concatenated before the tensor fusion operation. During tensor fusion, an element-wise multiplication between H_emo = {h_1, h_2, ..., h_k} and a trainable fusion matrix (W_fusion ∈ R^(k×d)) is performed. As shown in Eq. (5), all the k vectors are summed to generate the final embedding.

    Figure 2: Architecture of the emotion classification network. Extended wav2vec 2.0 model (left) with four attention blocks and a tensor fusion operation. l2 normalization, cosine similarity, and cross-entropy loss computation (right) for the emotion classes, with a single output for each emotion class

    E = Σ_{i=1}^{k} (h_i ⊙ W_fusion,i) = Σ_{i=1}^{k} e_i    (5)

    where e_i ∈ R^d and W_fusion,i ∈ R^d. The final embedding (E) is l2-normalized prior to computing its cosine similarity with the Arcface center representation vectors (W_emo ∈ R^(#EMO×d)). In Fig. 2, "#EMO" represents the emotion class indices defined in the IEMOCAP dataset. Here, 1_EMO, 2_EMO, 3_EMO, and 4_EMO represent the angry, happy, sad, and neutral emotion classes, respectively.
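    A minimal sketch of this tensor fusion step is given below; the block count k = 5 anticipates the concatenation with the speaker representation described in Section 3.2, and the initialization of the fusion matrix is an assumption.

```python
import torch
import torch.nn as nn

class TensorFusion(nn.Module):
    """Fuse k pooled attention-block outputs into one d-dimensional embedding (sketch).

    Each of the k vectors is element-wise multiplied with its own row of a trainable
    fusion matrix W_fusion ∈ R^(k×d) and the results are summed, as in Eq. (5).
    """
    def __init__(self, k=5, dim=768):
        super().__init__()
        self.fusion = nn.Parameter(torch.ones(k, dim))  # initialization is an assumption

    def forward(self, h):
        # h: (batch, k, dim) — concatenated max-pooled attention-block outputs
        return (h * self.fusion).sum(dim=1)             # (batch, dim) final embedding E

fused = TensorFusion()(torch.randn(2, 5, 768))  # speaker-specific emotion representation
```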

    3.2 Speaker-Specific Emotion Representation Network

    Fig. 3 shows the architecture of the proposed SER approach. The same waveform is passed to the speaker-identification network as well as the emotion classification network. The speaker representation generated by the pre-trained speaker-identification network is passed to the emotion classification network. More specifically, the output vector of the attention block from the speaker-identification network is concatenated with the outputs of the emotion classification network's four attention blocks, resulting in a total of five attention block outputs (H ∈ R^(5×d)). The fusion operation shown in Eq. (5) combines these representations into a single speaker-specific emotion representation (E). The angular distance between the normalized tensor-fused output vector and the normalized center representation vectors of the four emotion classes is calculated using Eq. (4). The emotion class predicted for an input waveform is determined by how close its representation vector is to an emotion class's center vector.

    Figure 3: Architecture of the speaker-specific emotion representation model with the speaker-identification network (top), which generates a speaker representation, and the emotion classification network (bottom), which generates a speaker-specific emotion representation from the emotion and speaker-identification representations

    4 Experiment Details

    4.1 Dataset

    The IEMOCAP [7] dataset is a multimodal, multi-speaker emotion database recorded across five sessions, with five pairs of male and female speakers performing improvisations and scripted scenarios. It comprises approximately 12 h of audio-visual data, including facial images, speech, and text transcripts. The audio speech data provided is used to train and evaluate models for emotion recognition. Categorical (angry, happy, sad, and neutral) as well as dimensional (valence, activation, and dominance) labels are provided. Due to imbalances in the number of samples available for each label category, only the neutral, happy (combined with excited), sad, and angry classes have been used, in line with previous studies [4,30–33,39,40]. The 16 kHz audio sampling rate used in the original dataset is retained. The average length of the audio files is 4.56 s, with a standard deviation of 3.06 s. The minimum and maximum lengths are 0.58 and 34.14 s, respectively. Audio files longer than 15 s are truncated to 15 s because almost all of the audio samples in the dataset are less than 15 s long. For audio files shorter than 3 s, a copy of the original waveform is recursively appended to the end of the file until it is at least 3 s long. Fig. 4 shows how often the various emotions are expressed by male and female speakers over the five sessions of the IEMOCAP dataset. As shown in Fig. 4, the dataset is unevenly distributed across emotion classes, with significantly more neutral and happy samples in most sessions.
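    The length normalization described above can be expressed as a short preprocessing routine; the function name and the use of torch tensors are illustrative assumptions.

```python
import torch

def normalize_length(waveform, sample_rate=16000, min_sec=3.0, max_sec=15.0):
    """Truncate long clips to 15 s and repeat short clips until they reach 3 s (sketch).

    waveform: 1-d tensor of raw audio samples at 16 kHz.
    """
    max_len = int(max_sec * sample_rate)
    min_len = int(min_sec * sample_rate)
    # Clips longer than 15 s are truncated.
    waveform = waveform[:max_len]
    # Clips shorter than 3 s are recursively extended with copies of themselves.
    while waveform.shape[0] < min_len:
        waveform = torch.cat([waveform, waveform])
    return waveform

clip = normalize_length(torch.randn(16000))  # a 1 s clip becomes 4 s (16000 -> 64000 samples)
```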

    Figure 4: Distribution of male and female speakers across emotion classes in the IEMOCAP dataset

    In order to generate an evenly distributed random set of samples at each epoch, emotion classes with more samples are under-sampled. This means that the training samples are evenly distributed across all the emotion classes. Leave-one-session-out five-fold cross-validation is used.

    In this study, VoxCeleb1's [8] large variation and diversity allow the speaker-identification module to be trained for better generalization to unseen speakers. VoxCeleb1 is an audio-visual dataset comprising 22,496 short interview clips extracted from YouTube videos. It features 1,251 speakers from diverse backgrounds and is commonly used for speaker identification and verification tasks. Its audio files have a sampling rate of 16 kHz, an average length of 8.2 s, and minimum and maximum lengths of 4 and 145 s, respectively. The audio clips in VoxCeleb1 are also limited to a maximum length of 15 s for consistency in the experiments.

    4.2 Implementation Details

    In recent studies [31,32], pre-training the wav2vec model on the Librispeech dataset [41] (with no fine-tuning for ASR tasks) has been shown to deliver better performance on SER tasks. In this study, the wav2vec 2.0 base model was selected because the wav2vec 2.0 large model does not offer any significant improvement in performance despite an increase in computational cost [31,32]. The key difference between "wav2vec2-large" and the base model is that the large model contains an additional 12 Transformer layers intended to improve its generalization capacity. Using other versions of the wav2vec 2.0 model or weights may improve performance depending on the target dataset and the pre-training strategy [27–29,33]. This study proposes two networks based on the wav2vec 2.0 representation (Subsection 2.5). In addition, reference [31] showed that either partially or entirely fine-tuning the wav2vec 2.0 segments results in the same boost in model performance on SER tasks despite the differences in computational costs. Therefore, the wav2vec 2.0 modules (the contextual encoder) used in this study were only partially fine-tuned. The model and weights are provided by Facebook research under the Fairseq sequence modeling toolkit [42].

    A two-step training process ensures that the proposed network learns the appropriate attributes. First, the speaker-identification network and the emotion classification network are trained separately. Then, the pre-trained networks are integrated and fine-tuned, with the tensor fusion matrix extended to match the size of the concatenated speaker-identification and emotion representations. In order to prevent over-fitting and exploding gradients, gradient values are clipped at 100, with n-step gradient accumulation. A weight decay of 10^-8 is applied, and the Adam [43] optimizer with beta values set to (0.9, 0.98) is used. The LambdaLR scheduler reduces the learning rate by multiplying it by 0.98 after every epoch. An early stopping criterion is added to prevent over-fitting. Each attention block consists of four attention heads with a dropout rate of 0.1. In the Arcface loss calculation, the feature re-scaling factor (s) is set to 30 and the additive margin penalty (m) to 0.3. Experiments were conducted using PyTorch in an Ubuntu 20.04 training environment running on a single GeForce RTX 3090 GPU. The specific hyper-parameters used in the experiments are shown in Table 1.
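    The optimization setup described above maps onto PyTorch roughly as in the sketch below; the learning rate, the accumulation step count n, and the stand-in model are placeholders, since the concrete values belong to Table 1.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 4)  # stand-in for the speaker-specific emotion representation network
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-4,            # placeholder; the actual value is in Table 1
                             betas=(0.9, 0.98),
                             weight_decay=1e-8)
# LambdaLR: multiply the learning rate by 0.98 after every epoch.
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: 0.98 ** epoch)

accum_steps = 4  # n-step gradient accumulation (n is a placeholder here)
for step in range(8):  # dummy training steps standing in for one epoch
    features, labels = torch.randn(16, 768), torch.randint(0, 4, (16,))
    loss = nn.functional.cross_entropy(model(features), labels) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_value_(model.parameters(), 100)  # clip gradient values at 100
        optimizer.step()
        optimizer.zero_grad()
scheduler.step()  # called once per epoch
```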

    Table 1: Hyper-parameters used during model evaluation

    4.3 Evaluation Metrics

    In this paper, weighted and unweighted accuracy metrics were used to evaluate the performance of the proposed model. Weighted accuracy (WA) is an evaluation index that intuitively represents model prediction performance as the ratio of correct predictions to the overall number of predictions. WA can be computed from a confusion matrix containing prediction scores as WA = (TP + TN)/(TP + TN + FP + FN), where the numbers of true positive, true negative, false positive, and false negative cases are TP, TN, FP, and FN, respectively. In order to mitigate the biases associated with the weighted accuracy on imbalanced datasets such as the IEMOCAP dataset, the unweighted accuracy (UA), also called average recall, is widely employed and can be computed as UA = (1/C) Σ_{c=1}^{C} TP_c/(TP_c + FN_c), where C is the total number of emotion classes and is set to four for all the results presented in this study.
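    A small sketch of these two metrics computed from a confusion matrix is shown below; it assumes rows index the true classes and columns the predicted classes, which is a convention rather than something specified in the text.

```python
import numpy as np

def wa_ua(confusion):
    """Weighted and unweighted accuracy from a C x C confusion matrix (sketch).

    Rows are assumed to hold true classes and columns predicted classes.
    """
    confusion = np.asarray(confusion, dtype=float)
    # WA: correct predictions over all predictions.
    wa = np.trace(confusion) / confusion.sum()
    # UA: mean per-class recall, i.e., TP_c / (TP_c + FN_c) averaged over classes.
    ua = np.mean(np.diag(confusion) / confusion.sum(axis=1))
    return wa, ua

# Toy 4-class example (angry, happy, sad, neutral).
wa, ua = wa_ua([[50, 5, 3, 2],
                [4, 80, 6, 10],
                [2, 3, 40, 5],
                [6, 12, 4, 78]])
print(f"WA={wa:.4f}, UA={ua:.4f}")
```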

    5 Experimental Results

    5.1 Performance of Speaker-Identification Network and Emotion Classification Network

    Table 2 shows the performance of the speaker-identification network on the VoxCeleb1 identification test set. Training the speaker-identification network with the Arcface loss resulted in significantly better speaker classification than training with the cross-entropy loss. This indicates that the angular margin in the Arcface loss improves the network's discriminative abilities for speaker identification. Fig. 5 shows a t-distributed stochastic neighbor embedding (t-SNE) plot of speaker-specific representations generated from the IEMOCAP dataset using two configurations of the speaker-identification network. As shown in Fig. 5, training with the Arcface loss results in more distinct separations between speaker representations than training with the cross-entropy loss. As shown in Fig. 6, the speaker-identification network may be unable to generate accurate representations for audio samples that are too short. Representations of audio clips that are less than 3 s long are particularly likely to be misclassified. In order to ensure that input audio waveforms have the information necessary to generate a speaker-specific emotion representation, a 3-s minimum length is imposed. In cases where the audio waveform is shorter than 3 s, a copy of the original waveform is recursively appended to the end of the waveform until it is at least 3 s long.

    Table 2: Overall performance of the proposed method when either the cross-entropy or the Arcface loss was used in the speaker-identification network

    Figure 5: t-SNE plot of speaker-specific representations generated by the speaker-identification network when trained with different loss functions: (a) cross-entropy, (b) Arcface

    Figure 6: t-SNE plot of speaker-specific representations generated by the speaker-identification network when trained with audio segments of varying minimum lengths: (a) 1 s, (b) 2 s, (c) 3 s, (d) 4 s

    Table 3 compares the proposed methods' performance against that of previous studies. The first four methods employed the wav2vec 2.0 representation and used the cross-entropy loss [30–32]. Tang et al. [39] employed hand-crafted features and used the Arcface loss. Here, the individual vocal properties provided by the speaker-identification network are not used. Table 3 shows that the method proposed by Tang et al. [39] has a higher WA than UA. This implies that emotion classes with more samples, particularly in the imbalanced IEMOCAP dataset, are better recognized. The wav2vec 2.0-based methods [30–32] used average time pooling to combine features across the time axis. Reference [32] also included a long short-term memory (LSTM) layer to better model the temporal features. In the proposed method, the Arcface loss is used instead of the cross-entropy loss, and attention blocks are used to model temporal features. Table 3 shows that the proposed attention-based method outperforms previous methods with similar training paradigms. It also demonstrates that using four attention blocks results in significantly better performance than using one, two, or three attention blocks. This is because four attention blocks can more effectively identify the segments of the combined emotion representation that are most relevant to SER. Reference [33]'s outstanding performance can be attributed to its use of the pseudo-label task adaptive pre-training (P-TAPT) strategy described in Subsection 2.5.

    Table 3: Comparing the performance of the proposed emotion classification approach (with a varying number of attention blocks) against that of previous methods also trained with the Arcface loss

    5.2 Partially and Entirely Fine-Tuning Networks

    The proposed speaker-identification network was fine-tuned under three different configurations: fine-tuning with the entire pre-trained network frozen (All Frozen), fine-tuning with the wav2vec 2.0 segment frozen and the Arcface center representation vectors unfrozen (Arcface Fine-tuned), and fine-tuning with both the wav2vec 2.0 weights and the Arcface center representation vectors unfrozen (All Fine-tuned). The wav2vec 2.0 feature encoder (convolutional layers) is frozen in all cases [31]. The IEMOCAP dataset has only ten individuals. Therefore, the Arcface center representation vectors are reduced from 1,251 (in the VoxCeleb1 dataset) to 8 while jointly fine-tuning both the speaker-identification network and the emotion classification network. While fine-tuning with both the wav2vec 2.0 weights and the Arcface vectors unfrozen, the loss is computed as a combination of the emotion and identification loss terms, as shown in Eq. (6):

    L_total = α · L_emotion + β · L_identification    (6)

    where α and β control the extent to which the emotion and identification losses, respectively, affect the emotion recognition results. Since training the emotion classification network with four attention blocks showed the best performance in prior experiments, fine-tuning performance was evaluated under this configuration. Fig. 7 shows that freezing the speaker-identification network provides the best overall performance. Due to the small number of speakers in the IEMOCAP dataset, the model quickly converged on a representation that could distinguish the speakers it was trained on but was unable to generalize to unseen speakers. More specifically, the frozen version of the speaker-identification module was trained on the VoxCeleb1 dataset and frozen because that dataset contains 1,251 speakers' utterances. These utterances provide significantly larger variation and diversity than the utterances of the 8 speakers (training dataset) in the IEMOCAP dataset. This implies that the frozen version can generalize to unseen speakers better than versions fine-tuned on the 8 speakers of the IEMOCAP dataset, as shown in Figs. 7b and 7c.

    Figure 7: Performance of the proposed method with the speaker-identification network fine-tuned to various levels: (a) All Frozen, (b) Arcface Fine-tuned, (c) All Fine-tuned

    Fig. 7b shows that increasing β, which controls the significance of the identification loss, improves emotion classification accuracy when the Arcface center representation vectors are frozen. Conversely, Fig. 7c shows that increasing β causes the emotion classification accuracy to deteriorate when the entire model is fine-tuned. This implies that partly or entirely freezing the weights of the speaker-identification network preserves the representation learned from the 1,251 speakers of the VoxCeleb1 dataset, resulting in better emotion classification performance. On the other hand, fine-tuning the entire model on the IEMOCAP dataset's eight speakers degrades the speaker-identification network's generalization ability. More specifically, in the partly frozen version, only the attention-pooling and speaker classification layers are fine-tuned, leaving the pre-trained weights of the speaker-identification network intact.

    Figs. 8 and 9 show t-SNE plots of emotion representations generated by the emotion classification network under various configurations. In Figs. 8a and 8b, the left column contains representations generated from the training set, and the right column contains those generated from the test set. In the top row of Figs. 8a and 8b, a representation's color indicates its predicted emotion class, and in the bottom row, it indicates its predicted speaker class. The same descriptors apply to Figs. 9a and 9b. More specifically, Fig. 8 illustrates the effect of employing the speaker-specific representations generated by the frozen speaker-identification network in the emotion classification network. As shown in Fig. 8, using the speaker-specific representations improves intra-class compactness and increases inter-class separability between emotion classes compared to training without the speaker-specific representation. The emotion representations generated when speaker-specific information was utilized show a clear distinction between the eight speakers of the IEMOCAP dataset and their corresponding emotion classes.

    Figure 8: t-SNE plot of emotion representations generated by the emotion classification network under two configurations: (a) without the speaker-specific representation, (b) with the speaker-specific representation

    Figure 9: t-SNE plot of emotion representations generated by the emotion classification network under two configurations: (a) only the Arcface vector weights fine-tuned, (b) all fine-tuned

    In contrast, Figs. 9a and 9b show that fine-tuning both the speaker-identification network and the emotion classification network increases inter-class separability between the emotion representations of speakers while retaining speaker-specific information. This results in a slight improvement in the overall SER performance, which is in line with the findings shown in Figs. 7b and 7c.

    5.3 Comparing the Proposed Method against Previous Methods

    In Table 3, the proposed method is compared against previous SER methods that are based on the wav2vec 2.0 model or employ the Arcface loss. In Table 4, the performance of the proposed method under various configurations is compared against that of existing approaches on the IEMOCAP dataset. In Table 4, "EF" and "PF" stand for "entirely fine-tuned" and "partially fine-tuned," respectively. Experiments showed that the configuration using four attention blocks in the emotion network and fine-tuning with the speaker-identification network frozen (Fig. 7a) provided the best performance. Therefore, this configuration was used when comparing the proposed method against previous methods. The proposed method significantly improves the performance of SER models, even allowing smaller models to achieve performance close to that of much larger models. As shown in Table 4, reference [33] achieved better performance than the proposed method because it uses the pseudo-label task adaptive pre-training (P-TAPT) strategy described in Subsection 2.5.

    Table 4: Comparing the performance of the proposed method against previous SER methods

    Reference [44] is a HuBERT-large-based model that employs label-adaptive mixup as a data augmentation approach. It achieved the best performance among the approaches listed in Table 4 because its label-adaptive mixup method applies linear interpolation in the feature space. Reference [45] employed balanced augmentation sampling on triple-channel log Mel-spectrograms before using a CNN and an attention-based bidirectional LSTM. Although this method was trained for several tasks, such as gender, valence/arousal, and emotion classification, it did not perform as well as the proposed method. This is because the proposed method uses speaker-specific properties while generating emotion representations from speaker utterances.

    5.4 Ablation Study

    Since the audio segments in the IEMOCAP dataset are unevenly distributed across emotion classes, emotion classes with more samples were under-sampled. In order to examine the effects of an imbalanced dataset, additional experiments were conducted with varying amounts of training data. More specifically, the model was trained on the entire dataset with and without under-sampling. The best-performing configuration of the proposed model (the speaker-specific emotion representation network with four attention blocks and the speaker-identification network frozen) was used in these experiments. Table 5 shows the results of the experiments conducted under four configurations. In these results, both the pre-trained and fine-tuned model variations showed their best performance when trained using the under-sampled version of the IEMOCAP dataset. This is because under-sampling adequately addresses the dataset's imbalance problem.

    Table 5: Performance of the speaker-specific emotion representation network trained under four training configurations (with under-sampled and complete versions of the IEMOCAP dataset)

    In order to investigate the effects of using the speaker-specific representation, experiments were conducted first using just the emotion classification network and then using the speaker-specific emotion representation network. More specifically, the cross-entropy and Arcface losses, as well as configurations of the networks with 1, 2, 3, and 4 attention blocks, were used to investigate the effects of using the speaker-specific representation. As shown in Table 6, the intra-class compactness and inter-class separability facilitated by the Arcface loss result in better performance than the cross-entropy loss in almost all cases. Using the speaker-specific emotion representation outperformed the bare emotion representation under almost all configurations.

    Table 6: Performance (accuracy) of the speaker-specific emotion representation network under 1, 2, 3, and 4 attention block configurations when trained with the cross-entropy and Arcface losses

    The computation time of the proposed method under various configurations was also examined. The length of the input audio segments (3, 5, 10, and 15 s) and the number of attention blocks (1, 2, 3, and 4) were varied. The proposed model (the speaker-specific emotion representation network) consists of two networks (speaker-identification and emotion classification). Table 7 shows the two networks' separate and combined computation times under the abovementioned configurations. As shown in Table 7, computation time increases as the length of the input audio segments and the number of attention blocks increase. Experiments show that the proposed model's best-performing configuration is that in which the speaker-specific emotion representation network has four attention blocks. Under this configuration, the model can process an audio segment in 27 ms.

    Table 7: Computation time (ms) of the proposed networks (speaker-identification, emotion classification, and speaker-specific emotion representation networks) for input audio segments of varying lengths (3, 5, 10, and 15 s)

    6 Conclusion

    This study proposes two modules for generating a speaker-specific emotion representation for SER. The proposed emotion classification and speaker-identification networks are based on the wav2vec 2.0 model. The networks are trained to generate emotion and speaker representations, respectively, from an input audio waveform using the Arcface loss. A novel tensor fusion approach was used to combine these representations into a speaker-specific emotion representation. Employing attention blocks and max-pooling layers improved the performance of the emotion classification network. This was attributed to the attention blocks' ability to identify which segments of the generated representation were most relevant to SER. Training the speaker-identification network on the VoxCeleb1 dataset (1,251 speakers) and entirely freezing it while using four attention blocks in the emotion network provided the best overall performance. This is because of the proposed method's robust generalization capabilities, which extend to unseen speakers in the IEMOCAP dataset. The experimental results showed that the proposed approach outperforms previous methods with similar training strategies. In future work, various wav2vec 2.0 and HuBERT model variations are to be employed to improve the proposed method's performance. Novel pre-training and fine-tuning strategies, such as TAPT and P-TAPT, are also to be explored.

    Acknowledgement: The authors extend their appreciation to the University of Oxford's Visual Geometry Group as well as the University of Southern California's Speech Analysis & Interpretation Laboratory for their excellent work on the VoxCeleb1 and IEMOCAP datasets. The authors are also grateful to the authors of wav2vec 2.0 (Facebook AI) for making their source code and corresponding model weights available.

    Funding Statement: This research was supported by the Chung-Ang University Graduate Research Scholarship in 2021.

    Author Contributions: Study conception and model design: S. Park, M. Mpabulungi, B. Park; analysis and interpretation of results: S. Park, M. Mpabulungi, H. Hong; draft manuscript preparation: S. Park, M. Mpabulungi, H. Hong. All authors reviewed the results and approved the final version of the manuscript.

    Availability of Data and Materials: The implementation of the proposed method is available at https://github.com/ParkSomin23/2023-speaker_specific_emotion_SER.

    Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
