
    Joint On-Demand Pruning and Online Distillation in Automatic Speech Recognition Language Model Optimization

Computers, Materials & Continua, December 2023

    Soonshin Seo and Ji-Hwan Kim

1 Clova Speech, Naver Corporation, Seongnam, 13561, Korea

2 Department of Computer Science and Engineering, Sogang University, Seoul, 04107, Korea

ABSTRACT Automatic speech recognition (ASR) systems have emerged as indispensable tools across a wide spectrum of applications, ranging from transcription services to voice-activated assistants. To enhance the performance of these systems, it is important to deploy efficient models capable of adapting to diverse deployment conditions. In recent years, on-demand pruning methods have attracted significant attention within the ASR domain due to their adaptability to various deployment scenarios. However, these methods often confront substantial trade-offs, particularly in terms of unstable accuracy when reducing the model size. To address these challenges, this study introduces two crucial empirical findings. Firstly, it proposes the incorporation of an online distillation mechanism during on-demand pruning training, which holds the promise of maintaining more consistent accuracy levels. Secondly, it proposes the utilization of the Mogrifier long short-term memory (LSTM) language model (LM), an advanced iteration of the conventional LSTM LM, as an effective alternative for pruning targets within the ASR framework. Through rigorous experimentation on the ASR system, employing the Mogrifier LSTM LM and training it using the suggested joint on-demand pruning and online distillation method, this study provides compelling evidence. The results exhibit that the proposed methods significantly outperform a benchmark model trained solely with on-demand pruning methods. Impressively, the proposed strategic configuration successfully reduces the parameter count by approximately 39%, all the while minimizing trade-offs.

KEYWORDS Automatic speech recognition; neural language model; Mogrifier; long short-term memory; pruning; distillation; efficient deployment; optimization; joint training

    1 Introduction

    1.1 Advancements in ASR

Recent advancements in end-to-end automatic speech recognition (ASR), which transforms spoken language into written text, have led to the development of cutting-edge models that are suitable for industrial applications [1–5].

Acoustic models (AMs) play a crucial role in ASR systems, translating acoustic signals into phonetic units. Their advancements have been foundational in constructing more robust end-to-end ASR systems. Main techniques in this field include connectionist temporal classification (CTC) [6], a training principle tailored for sequence-to-sequence tasks. It does not require a precise match between the input and output, which is extremely valuable in ASR, where it can often be challenging to establish a clear alignment between audio segments and their corresponding text representations. The recurrent neural network transducer [7] offers an improved approach to predicting output sequences of various lengths, capitalizing on the strengths of recurrent neural networks (RNNs). Another method is the attention-based encoder-decoder [8,9], where an encoder digests the input sequence, and a decoder produces the output. An attention mechanism pinpoints parts of the input sequence most pertinent to each output step.

Furthermore, there have been advancements in decoding methods for end-to-end ASR models, which are responsible for generating the final transcription based on model predictions. These methods vary from simple greedy decoding, where the model selects the most likely word or phoneme at each step, to more complex approaches like frame-synchronous beam search [10–12]. The latter simultaneously considers multiple potential transcriptions to determine the most probable one. One noteworthy development is the incorporation of shallow fusion with autoregressive neural language models (LMs). In this approach, the ASR model's predictions are combined with a separate LM to refine the transcription. The use of beam search in this fusion process has proven to be crucial in enhancing ASR performance [1,4,5], as depicted in Fig. 1.

Figure 1: Overview of the end-to-end automatic speech recognition system in this study

Fig. 1 provides a detailed illustration of the end-to-end ASR system used in this study. This system incorporates the shallow fusion method, combining the AMs and LMs, and employs the beam search method for decoding. One of the central objectives of this system is its flexibility, making it compatible with a wide range of devices, from powerful computers to less resource-intensive gadgets. This adaptable approach is designed to address the challenges arising from varying computational capabilities and device specifications, making it an attractive solution for a multitude of real-world applications.

In the domain of LMs, RNNs, such as long short-term memory (LSTM) [13–15], have played a crucial role in enhancing the learning of long-term dependencies and mitigating challenges related to vanishing or exploding gradients. Nevertheless, Transformers [16] have taken these strengths further by leveraging self-attention mechanisms to facilitate inter-word interactions, thereby enabling more intricate dependencies and enhancing contextual understanding. However, conventional Transformers have grappled with the problem of context fragmentation. To tackle this issue, Transformer-XL [17] not only resolves the context fragmentation problem but also captures longer-range dependencies through a recurrent mechanism and a unique positional encoding scheme, resulting in substantial performance enhancements.

    1.2 Challenges in ASR Deployment

Notwithstanding these remarkable strides, the development and effective deployment of ASR systems for resource-constrained environments, such as mobile devices, persist as formidable challenges [18–20]. ASR systems often operate in scenarios where high-demand computational resources may not be readily accessible, compelling the utilization of low-complexity models. Additionally, there exists a compelling need for these systems to be adaptable across a spectrum of devices concurrently. This necessity underscores the requirement for a transferable ASR model capable of deployment across a spectrum ranging from high-end devices to entry-level ones, each characterized by distinct specifications and varying maximum available model sizes. In theory, it might be conceivable to train a distinct model for each device under the assumption of limitless resources; in practice, however, such an approach would be highly inefficient.

Several prior investigations have delved into the realm of efficient deployments where a solitary model operates across diverse conditions. One widely adopted approach to address this challenge is the use of slimmable networks [21–23], along with on-demand pruning [24–27]. These methods are part of the broader spectrum of pruning strategies [28–32]. Specialized training methods, such as the "random" [24] or "sandwich rule" [22], enable on-demand pruning to be trained in a single pass. Subsequently, the trained model can be dynamically applied during inference, streamlining the deployment of models with varying sizes based on a single adaptable base model. Despite its adaptability and efficiency, inherent challenges persist, including a trade-off between the number of model parameters and corresponding performance [21–23]. Consequently, when reducing the model parameters, ensuring consistent performance becomes a non-trivial task.

To further enhance the adaptability of ASR systems in resource-limited environments, LSTM-based models have demonstrated promise due to their recurrent architecture [33], enabling efficient processing of variable-length sequences with fewer parameters and reduced computational demands compared to Transformers. Notably, the Mogrifier LSTM [34], an advanced iteration of the conventional LSTM, augments the model's expressive capabilities without introducing substantial complexity. Its performance has exhibited great potential, frequently surpassing that of vanilla LSTMs and Transformer-XLs, all while preserving a comparable computational complexity [33,34].

    1.3 Contributions

This study proposes the incorporation of an online distillation mechanism during the on-demand pruning training phase. This is devised to ensure the stability of model accuracy, primarily by minimizing the disparity between the pruned model's predictions and the original model's output probabilities, consequently yielding more robust results. Additionally, this study introduces the Mogrifier LSTM LM as the target for pruning within the ASR framework. Given the characteristics of the Mogrifier LSTM, it emerges as an ideal candidate for an on-demand pruning strategy tailored for deployment in resource-constrained environments.

To substantiate the efficacy of the proposed methods, this study conducts a series of experiments on the ASR system employing the Mogrifier LSTM LM, which is trained using the suggested joint on-demand pruning and online distillation approach. The experimental outcomes, as observed on the LibriSpeech test-other subset, clearly demonstrate that our proposed methods surpass a baseline model trained solely through on-demand pruning techniques. Moreover, our research verifies that the proposed configuration can reduce the model parameters by approximately 39%. This reduction is accomplished with minimal trade-offs, underscoring the potential efficiency of our approach in terms of computational resources, all the while maintaining the model's performance at a satisfactory level.

    2 Related Works

In this section, this study delves into prior research studies that bear relevance to our proposed approach of joint on-demand pruning and online distillation. Section 2.1 offers an overview of the evolution of LMs in the context of ASR. Sections 2.2 and 2.3 expound upon the various pruning strategies and introduce the underlying distillation mechanisms, respectively. Finally, in Section 2.4, we scrutinize previous works that have sought to integrate these two methods.

    2.1 LMs in End-to-End ASR

LMs have been employed in tandem with AMs within end-to-end ASR systems, with the overarching goal of enhancing transcription accuracy. LMs play a crucial role in aiding ASR systems in generating the most likely linguistic hypotheses for sentences. In practical application, LMs are constructed as probabilistic models, which may encompass n-gram models or neural network LMs. Various types of LMs can be seamlessly integrated into end-to-end ASR configurations. The most basic form of integration is commonly referred to as LM shallow fusion.

Conventional n-gram LMs operate under the Markov assumption but face challenges, including difficulties in recognizing unfamiliar n-grams and limitations associated with small n values [35]. To overcome these issues, deep neural networks (DNNs) were introduced, utilizing complex vector representations to represent words [36,37]. However, as the length of sequences increased, DNNs exhibited constraints. This led to the emergence of RNNs, capable of handling longer sequences [38]. The bidirectional variant of RNNs, which considers both preceding and succeeding contexts, demonstrated superior performance, albeit encountering issues such as gradient vanishing or explosion [39–41]. The introduction of LSTM cells addressed these gradient problems [13–15,42]. While LSTM models outperformed traditional RNNs, they did require lengthier training times [43].

The introduction of attention mechanisms marked a major development in the field, leading to the emergence of the Transformer model. This model incorporates positional encoding, further establishing its importance [16,44–46]. Within the family of Transformer-based models, BERT stands out for its excellence in natural language processing, thanks to its bidirectional encoding. On the other hand, GPT-2, known for its multi-layer decoding, excels in language modeling tasks [47,48]. Transformer-XL has its unique advantages, particularly in specific applications, due to its ability to retain past states [17].

    2.2 Pruning Strategies

    2.2.1 Conventional Pruning

Pruning strategies play a crucial role in the realm of neural network optimization, offering a suite of techniques aimed at removing redundant connections through iterative fine-tuning processes [28–32]. These refined models effectively reduce the number of parameters while preserving the accuracy of the original model, making them indispensable in resource-constrained scenarios.

Han et al. [28] introduced a method to tackle the demanding computational and memory requirements frequently associated with neural networks. Their achievement lies in the identification and subsequent pruning of redundant network connections. This approach results in a significant reduction in parameters while upholding the model's accuracy, effectively streamlining neural network architectures and mitigating the resource-intensive nature of deep learning.

Wu et al. [29] contributed to this field by introducing BlockDrop, an ingenious strategy characterized by dynamic layer selection during inference. This approach effectively reduces computational overhead while concurrently preserving high accuracy levels. The substantial acceleration of inference, up to 36%, attained through their approach underscores its importance in the context of neural network optimization.

Liu et al. [30] challenged traditional notions surrounding network pruning. Their proposition shifts the focus from preserving specific important weights to emphasizing the significance of the pruned architecture itself in achieving an efficient final model. This fresh perspective redefines the approach to network pruning, illuminating alternative avenues for attaining computational efficiency.

Collectively, these previous studies underscore the dynamic landscape of pruning strategies in neural network optimization, providing solutions to the challenges posed by burgeoning computational demands across diverse environments. Nonetheless, the adoption of these approaches in specific settings necessitates thoughtful consideration of associated expenses, particularly in the context of the expanding array of devices and applications.

    2.2.2 On-Demand Pruning

Conversely, on-demand pruning can be dynamically applied during inference, obviating the necessity for further fine-tuning. Through the utilization of on-demand pruning, the creation of models that are simultaneously large yet lightweight becomes feasible. Furthermore, it offers versatile deployment possibilities, adaptable to diverse environments.

Several training methods have been proposed to effectively implement on-demand pruning. One such approach is commonly referred to as "random" [24–27], which targets specific layers or modules within a single neural network for pruning. These targets are selected randomly during the training phase. For instance, Huang et al. [24] introduced stochastic depth to train deep convolutional networks, involving the random omission of layers during training. This enables the training of shorter networks while using deeper networks during testing. Fan et al. [25] introduced LayerDrop, a form of structured dropout, to regularize over-parameterized Transformer networks. It allows efficient pruning during inference by selecting sub-networks from a larger network without requiring additional fine-tuning. Lee et al. [26] presented a training and pruning method for ASR models that employs intermediate CTC and stochastic depth to reduce model depth at runtime without the need for extra fine-tuning. Vyas et al. [27] proposed stochastic compression for compute-efficient Wav2vec 2.0 models, which incorporates a variable squeeze factor and query and key-value pooling mechanisms for compression. The stochastically pre-trained model can be fine-tuned for specific configurations, resulting in substantial computational savings.

While random sampling is relatively simple to implement, its intrinsic randomness can be challenging to control, and it does not consistently guarantee accuracy comparable to the non-pruned model. Another method, known as the "sandwich rule" [21–23], provides an alternative approach. Yu et al. [21] devised a technique for training a single neural network at various widths, which they termed slimmable neural networks. This approach dynamically adjusts the network width during runtime, striking a balance between efficiency and accuracy. Building upon the concept of the slimmable network, they introduced universally slimmable networks [22] based on the sandwich rule. This allows these networks to operate at any chosen width. The sandwich rule is devised to train sub-networks with varying configurations within a single neural network, and the losses from these sub-networks are aggregated and back-propagated simultaneously. This method proves particularly effective in optimizing neural networks when working within defined lower and upper bounds.

Moreover, Li et al. [23] introduced the dynamic slimmable network, a dynamic network slimming technique designed to enhance hardware efficiency. This network shows considerable performance improvements over other model compression techniques, achieving reduced computation and real-world acceleration, while only minimally impacting accuracy in image classification.

    2.3 Distillation Mechanisms

    2.3.1 Teacher-Student Distillation

Distillation mechanisms have found extensive application in enhancing the performance and efficiency of neural networks [49–52]. In the conventional framework, a larger "teacher" network imparts its structured knowledge to a smaller "student" network, enabling the student network to leverage this guidance and achieve better performance compared to training from scratch.

In a more specific context, Hinton et al. [49] introduced an approach to consolidate the knowledge from a variety of models into a single model, greatly streamlining the deployment process. They introduce a type of ensemble consisting of one or more comprehensive models alongside several specialist models designed to distinguish fine-grained classes, which are often challenging for the full models to differentiate.

Sun et al. [50] presented a patient knowledge distillation approach that efficiently compresses a large-scale teacher model into a more compact student model without compromising effectiveness. The student model patiently learns from multiple layers of the teacher model, employing two strategies: learning from the last few layers and learning from every few layers. This approach leads to improved training efficiency and enhances performance across various natural language processing tasks by leveraging the rich information present in the hidden layers of the teacher model.

In a related study, Jiao et al. [51] introduced a Transformer distillation approach explicitly tailored for the knowledge distillation of Transformer-based models. They introduced a two-stage learning framework for TinyBERT, where Transformer distillation is performed during both the pre-training and task-specific learning phases. Nevertheless, these approaches necessitate a fixed network of pre-trained teachers, which can be costly and restrict the flow of knowledge from the teacher network to the student network in a one-way manner.

    2.3.2 Online Distillation

Conversely, online distillation mechanisms streamline the training process by treating all networks as student networks, allowing them to exchange knowledge within a single stage [53–55]. Lan et al. [53] introduced an online distillation technique known as an on-the-fly native ensemble. This method eliminates the need for a complex two-phase training procedure associated with offline distillation methods. Instead, it simultaneously trains a single multi-branch network while dynamically constructing a potent teacher model on the fly.

Furthermore, in research by Zhang et al. [54], a deep mutual learning strategy was proposed. Instead of the traditional one-way knowledge transfer from a teacher to a student, this approach promotes a collaborative learning environment where an ensemble of students teaches each other throughout the training process. This eliminates the requirement for a powerful pre-trained teacher network. Notably, this collaborative learning approach enhances the performance of various network architectures and surpasses distillation methods reliant on a more capable yet static teacher.

Expanding upon the principles of the preceding two approaches, Guo et al. [55] introduced an effective online knowledge distillation method that leverages collaborative learning. In contrast to traditional two-stage knowledge distillation methods that adhere to a unidirectional learning paradigm, this method fosters a single-stage training environment where knowledge is shared among students during collaborative training. Experimental results underscore the consistent improvement in the performance of all student models across diverse datasets, and it proves effective in knowledge transfer to multiple tasks, including object detection and semantic segmentation.

    2.4 Integration of Pruning and Distillation Mechanisms

Pruning strategies and distillation mechanisms have traditionally been employed as separate techniques, but contemporary research has shifted towards their integration. The objective is to optimize the benefits of both methods, streamlining their combination to reduce computational overhead and enhance overall efficiency [56]. Distillation has emerged as a valuable approach to address the inherent loss of precision associated with pruning. However, it is essential to note that applying distillation after the pruning process, rather than concurrently with pruning during training, might potentially lead to suboptimal model performance [57].

A recent advancement in the field of ASR is the integration of both pruning and distillation methods. Initially, various methods have been explored for applying pruning to ASR based on self-supervised learning (SSL) [58–60]. Building on these foundational studies, Peng et al. [60] highlighted the limitations of using knowledge distillation in SSL. A significant drawback they pointed out is the necessity for manual design and maintaining a fixed architecture for the student model during training. Such an approach is dependent on prior knowledge and might not always achieve optimal performance. Addressing these challenges, they introduced task-agnostic compression methods for speech SSL, drawing inspiration from task-specific structural pruning. By merging distillation and pruning techniques, their methods showcased superiority over methods based solely on distillation across multiple tasks.

The preceding studies indicate that an integrated approach involving pruning and distillation reveals the complementary nature of these two strategies when training neural networks for diverse tasks. This serves as a basis for extending the same concept to the online mechanism and LM task domains addressed in this study.

    3 Joint On-Demand Pruning and Online Distillation for ASR LM

As discussed in Section 2, on-demand pruning offers efficient deployment and adaptability to various conditions, including diverse devices. However, it can encounter accuracy fluctuations when employed to compress models into smaller sizes. In such cases, distillation mechanisms can serve as a valuable complement. In Section 3, this study delves into the proposed joint on-demand pruning and online distillation approach, outlining its application in an online fashion for ASR LMs. Furthermore, this study analyzes and validates the efficacy of the Mogrifier LSTM LM as a potential target for pruning within the ASR task.

In Section 3.1, this study introduces the Mogrifier LSTM as the selected LM for on-demand pruning, and Section 3.2 details the baseline, delving into the utilization of the random and sandwich rule as methods for on-demand pruning. Finally, in Section 3.3, this study describes the proposed method that merges the sandwich rule with online distillation mechanisms.

    3.1 Target LM for On-Demand Pruning

    3.1.1 LSTM

LSTM is a type of RNN architecture. RNNs are neural networks specifically designed to process sequences, like time series data or natural language. The LSTM was introduced to address the limitations of the basic RNN, primarily the problem of long-term dependencies, where the RNN struggles to remember information from earlier in the sequence. The LSTM introduces the concept of a cell state, a kind of memory that flows through the network and can be added to, removed from, or read using three distinct gates.

Specifically, for a given input vector sequence $x_1, \dots, x_T \in \mathbb{R}^m$, a vanilla LSTM [13–15] processes each input vector $x_t$ sequentially to produce an output vector $h_t \in \mathbb{R}^n$ and a cell state $c_t \in \mathbb{R}^n$. These two outputs are then used to produce the next output $h_{t+1}$ for the next input $x_{t+1}$. The function of the LSTM layer is denoted as:

$$h_t, c_t = \mathrm{LSTM}(x_t, h_{t-1}, c_{t-1}) \quad (1)$$

The LSTM layer has three gates: the forget gate (f), the input gate (i), and the output gate (o). The outputs of the gates are used to produce $c_t$ and $h_t$. The forget gate determines which portions of the cell state should be discarded. The input gate integrates new information into the cell state. Subsequently, the output gate ascertains the segments of the cell state to be read and emitted. The gates employ sigmoid activation functions that yield values in the range of 0 to 1, facilitating their decision-making processes. The operation proceeds as follows:

$$f_t = \sigma(W_f [x_t; h_{t-1}] + b_f)$$
$$i_t = \sigma(W_i [x_t; h_{t-1}] + b_i)$$
$$o_t = \sigma(W_o [x_t; h_{t-1}] + b_o)$$
$$\tilde{c}_t = \tanh(W_c [x_t; h_{t-1}] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma$ is the sigmoid non-linearity function, $\odot$ is the element-wise product operation, and $W_\ast$ and $b_\ast$ are the weight matrices and bias vectors.
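To make these equations concrete, the following is a minimal NumPy sketch of a single LSTM step. The fused weight matrix `W` (stacking the four gate projections) and bias `b` are an implementation convenience of this sketch, not the paper's notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps the concatenated [x_t; h_prev]
    (size m + n) to the four gate pre-activations (size 4n)."""
    z = W @ np.concatenate([x_t, h_prev]) + b
    n = h_prev.shape[0]
    f = sigmoid(z[:n])            # forget gate: what to discard from c
    i = sigmoid(z[n:2 * n])       # input gate: what new information to add
    o = sigmoid(z[2 * n:3 * n])   # output gate: what to emit from c
    g = np.tanh(z[3 * n:])        # candidate cell state
    c_t = f * c_prev + i * g      # update the cell state
    h_t = o * np.tanh(c_t)        # produce the output vector
    return h_t, c_t
```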

    3.1.2 Mogrifier LSTM

With the advancement of deep learning, there have been endeavors to enhance the capabilities of the conventional LSTM. One notable development in this regard is the Mogrifier LSTM [34]. This variant introduces additional gates that are designed to iteratively refine the input to the LSTM. These refinements aim to optimize the LSTM's interaction with its inputs, ultimately improving its learning effectiveness.


During each iteration within these gates, there is a selective update of either the input sequence or the preceding hidden state, depending on the value of the other. This mechanism aligns with the fundamental principles of LSTM's gating mechanisms. The iterative rounds facilitate a nuanced interplay between the input and the hidden state, thereby enhancing the model's proficiency in recognizing intricate patterns within the data.

Subsequent studies on the Mogrifier LSTM [33,34,61] consistently demonstrate that various Mogrifier variants outperform both the traditional LSTM and Transformer models in language modeling tasks.

Specifically, in the Mogrifier LSTM, the inputs of the vanilla LSTM, $x_t$ and $h_{t-1}$, are replaced with the outputs of a special layer, referred to as the "MogrifierGate". Accordingly, Eq. (1) is changed to:

$$h_t, c_t = \mathrm{LSTM}(\mathrm{MogrifierGate}(x_t, h_{t-1}), c_{t-1})$$

The structure of the MogrifierGate is based on a series of "rounds", denoted as $r$, where $r \geq 1$. For each round, a gate vector is obtained using either $x_t$ or $h_{t-1}$. Then, the other component ($x_t$, if $h_{t-1}$ is used to obtain the gate vector, and vice versa) is updated using its previous value and the gate vector. Formally, with $x^{-1} = x_t$ and $h^{0} = h_{t-1}$, for the $i$-th round, if $i$ is odd:

$$x^{i} = 2\sigma(W^{i} h^{i-1}) \odot x^{i-2}$$

and if $i$ is even:

$$h^{i} = 2\sigma(W^{i} x^{i-1}) \odot h^{i-2}$$

where $\sigma$ is a sigmoid non-linearity, $\odot$ is the element-wise product, and $W^{i}$ refers to the weight matrix for the $i$-th round.
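As an illustration, here is a small sketch of the MogrifierGate rounds under the formulation above; the list `Ws` of per-round weight matrices is an assumed container for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mogrifier_gate(x_t, h_prev, Ws):
    """Mutually gate x_t and h_prev for r = len(Ws) rounds.

    Odd rounds rescale the input x by a gate computed from h;
    even rounds rescale h by a gate computed from x. Ws[i-1] is
    the weight matrix W^i of round i (odd rounds map n -> m,
    even rounds map m -> n)."""
    x, h = x_t, h_prev
    for i, W in enumerate(Ws, start=1):
        if i % 2 == 1:                     # odd round: update the input
            x = 2.0 * sigmoid(W @ h) * x
        else:                              # even round: update the hidden state
            h = 2.0 * sigmoid(W @ x) * h
    return x, h                            # fed into the vanilla LSTM step
```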

    3.1.3 The Impact of Pruning Mogrifier Rounds

Furthermore, to explain the impact of pruning on the components of the Mogrifier LSTM, this study analyzes their inference times and presents the results in Fig. 2. Specifically, this study compares the inference time required for Eqs. (8) and (9), which differ in the number of rounds they involve. This analysis reveals that for a Mogrifier LSTM layer with 8 rounds, calculating these rounds (as per Eq. (8)) takes 2.8α times longer than the computation of the vanilla LSTM (as per Eq. (9)), where α represents the inference time of the latter. This suggests that when conducting a feed-forward operation with a Mogrifier LSTM comprising eight rounds, the round computation occupies 74% of the total inference time. However, the inference time needed for Eq. (8) can be significantly reduced to 0.76α when the number of rounds is decreased to 1. This implies that effective pruning of rounds with minimal or no performance degradation can reduce the necessary inference time by more than half (from α + 2.8α to α + 0.76α).

    3.2 Baseline:Sandwich Rule for On-Demand Pruning

    3.2.1 Random

The random strategy involves generating a random value $s$, where $1 \leq s \leq r$, for each iteration, and then using only the $s$-th MogrifierGate to feed-forward the input for that iteration. This strategy is straightforward to implement and has proven effective for on-demand pruning in prior research [24–27]. Formally, the random strategy can be described as follows:

$$s_b \sim \mathcal{U}\{1, \dots, r\}, \qquad h_t, c_t = \mathrm{LSTM}(\mathrm{MogrifierGate}_{s_b}(x_t, h_{t-1}), c_{t-1})$$

Here, $\mathrm{MogrifierGate}_s$ represents the sub-gate of the MogrifierGate for the $s$-th round, and $b$ represents the mini-batch. It should be noted that $s$ is shared among all timesteps ($1 \leq t \leq T$) and all examples within a mini-batch. After training, this study evaluates the model using a fixed round.
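A minimal training-loop sketch of this strategy, assuming a PyTorch-style model that accepts a hypothetical `num_rounds` argument for truncating the MogrifierGate:

```python
import random
import torch.nn.functional as F

def train_step_random(model, optimizer, tokens, targets, r=8):
    """One iteration of the 'random' on-demand pruning strategy: a
    single round count s is drawn per mini-batch and shared by all
    timesteps and examples."""
    s = random.randint(1, r)               # 1 <= s <= r
    logits = model(tokens, num_rounds=s)   # hypothetical interface
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```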

Figure 2: An analysis of the inference time of the two components of a Mogrifier long short-term memory (LSTM) layer using one CPU (Intel Xeon Gold 5120): MogrifierGate and LSTM layer

    3.2.2 Sandwich Rule

In practice, it has been reported that with random selection, the smallest sub-model ($s = 1$) and the largest sub-model ($s = r$) tend to have reduced accuracy compared to a Mogrifier LSTM trained with a fixed number of rounds $s$. To address this limitation, these two sub-models need to be trained more cautiously. Yu et al. [22] proposed the sandwich rule employed in universally slimmable networks, and this study adopts it for the training of MogrifierGate pruning. The sandwich rule aims to mitigate the limitation on the accuracy of the smallest and largest sub-models by additionally weighting them during training.

This study considers the use of multiple objective functions $\mathcal{L}_s$, where $\mathcal{L}_s$ is the objective function of the model using the $s$-round Mogrifier sub-gate, $\mathrm{MogrifierGate}_s$, obtained from the $r$-round MogrifierGate. This study considers multiple values of $\mathcal{L}_s$ concurrently using the sandwich rule. Formally, the loss of the sandwich rule can be expressed as:

$$\mathcal{L} = \mathcal{L}_{1} + \mathcal{L}_{r} + \sum_{i=1}^{k-2} \mathcal{L}_{s_i}, \qquad s_i \sim \mathcal{U}\{2, \dots, r-1\}$$

where $k$ is the hyperparameter of the sandwich rule, representing the number of losses per training iteration (here, $k \geq 2$), and $\mathcal{L}$ is used for training the model.
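A sketch of one sandwich-rule iteration under these definitions, again with the hypothetical `num_rounds` interface: the smallest and largest sub-models are always included, and $k - 2$ intermediate round counts are sampled at random.

```python
import random
import torch.nn.functional as F

def sandwich_rule_loss(model, tokens, targets, r=8, k=3):
    """Aggregated sandwich-rule loss L for one mini-batch:
    L_1 (smallest) + L_r (largest) + (k - 2) randomly sampled rounds."""
    rounds = [1, r] + random.sample(range(2, r), k - 2)
    total = 0.0
    for s in rounds:
        logits = model(tokens, num_rounds=s)
        total = total + F.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1))
    return total   # a single backward pass propagates all k losses
```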

One drawback of the sandwich rule is the increase in computational cost for each mini-batch, as it requires separate computations for each $\mathcal{L}_{s_i}$. Nevertheless, through experimental observations, this study notices that the sandwich rule significantly enhances the final accuracy of both the smallest and largest models, surpassing the performance of the random method.

    3.3 Proposed Method:Sandwich Rule with Online Distillation Mechanism

In this section, this study introduces an approach that combines on-demand pruning and an online distillation mechanism customized for ASR LMs. Specifically, the proposed method builds upon the foundational concept of the sandwich rule, which is explained in Section 3.2. However, this study focuses on the fact that the utility of the sandwich rule may exhibit instability when the demand for more compressed models becomes paramount. To confront this challenge, this study proposes the integration of an online distillation mechanism in tandem with on-demand pruning during the training phase. This distillation mechanism takes inspiration from a previous method, namely collaborative learning-based online distillation, which is discussed in detail in Section 2.3. The primary advantage of this approach resides in its capacity to consistently accrue additional information through the ensemble of soft targets generated by all student networks throughout the training process [55].

The proposed method can be conceptualized as an extension of the sandwich rule, as illustrated in Fig. 3. Assuming the existence of $k$ sub-models predicated on the sandwich rule, we denote the logits of the $i$-th sub-model as $z_i$, while the ensemble logits generated from all sub-models are designated as $z_{\mathrm{ensemble}}$. This can be formally expressed as follows:

$$z_{\mathrm{ensemble}} = \frac{1}{k} \sum_{i=1}^{k} z_i$$

Figure 3: The process of the proposed sandwich rule with an online distillation mechanism. For different sub-models generated by the sandwich rule, respective knowledge distillation (KD) losses are generated. These KD losses are joined with the respective cross-entropy losses

Subsequently, for each sub-model, a knowledge distillation (KD) loss is computed using the Kullback-Leibler divergence (KLD) between the soft targets ($p$) and the soft predictions ($q$). The soft targets are derived from $z_{\mathrm{ensemble}}$, whereas the soft predictions are extracted from one of the sub-models. Formally, this process is articulated as follows:

$$p = \mathrm{softmax}\!\left(\frac{z_{\mathrm{ensemble}}}{T}\right), \qquad q_i = \mathrm{softmax}\!\left(\frac{z_i}{T}\right)$$

$$\mathcal{L}_{\mathrm{KD},i} = T^2 \cdot \mathrm{KLD}(p \parallel q_i)$$

where $T$ represents the temperature parameter for the KLD. The incorporation of a temperature parameter within the softmax function of the KD loss aids in the effective management of the probability distribution. Additionally, to circumvent potential underflow issues, it is advisable to work with soft predictions in the logarithmic space.

Finally, the KD loss is added to the cross-entropy (CE) loss, which is a product of the sandwich rule, as delineated in Eq. (18). Formally, the KD loss, integrated with a trade-off weight denoted as $\lambda$, is calculated as follows:

$$\mathcal{L} = \sum_{i=1}^{k} \left( \mathcal{L}_{\mathrm{CE},i} + \lambda \, \mathcal{L}_{\mathrm{KD},i} \right)$$
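The following PyTorch-style sketch puts the pieces together under the formulation above. Detaching the ensemble logits when forming the soft targets is an assumption of this sketch, not something the paper specifies.

```python
import torch
import torch.nn.functional as F

def joint_pruning_distillation_loss(logits_list, targets, T=2.0, lam=0.5):
    """Sandwich-rule CE losses combined with online KD losses against
    the ensemble of all k sub-model logits (the soft targets)."""
    z_ensemble = torch.stack(logits_list).mean(dim=0)
    p = F.softmax(z_ensemble.detach() / T, dim=-1)       # soft targets
    total = 0.0
    for z_i in logits_list:
        ce = F.cross_entropy(z_i.view(-1, z_i.size(-1)), targets.view(-1))
        log_q = F.log_softmax(z_i / T, dim=-1)           # log-space soft predictions
        kd = F.kl_div(log_q, p, reduction="batchmean") * (T ** 2)
        total = total + ce + lam * kd
    return total
```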

This integrated approach holds the promise of enhancing the efficacy of on-demand pruning while harnessing the power of distillation, presenting a compelling avenue for optimizing ASR LMs.

    4 Experiments

In this section, this study assesses the effectiveness of the proposed joint pruning and online distillation method for the Mogrifier LSTM LM using the LibriSpeech dataset [62], a commonly used dataset in ASR tasks. Sections 4.1 and 4.2 outline the evaluation metrics and datasets employed in this study, respectively. Sections 4.3 and 4.4 provide comprehensive details about the proposed model specifications and the training process. Finally, Section 4.5 presents the experimental results.

    4.1 Evaluation Metrics

This study employs two key evaluation metrics: perplexity (PPL) and word error rate (WER). Perplexity is a widely utilized metric in the field of natural language processing, particularly for assessing the performance of LMs. It quantifies how effectively the probability model predicts the sample and is typically defined for discrete probability distributions. The formula for perplexity is as follows:

$$\mathrm{PPL}(W) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}}$$

where $W$ is the test set composed of $N$ words. A lower perplexity score indicates better generalization performance.
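Equivalently, perplexity is the exponential of the negative mean log-likelihood, which is how it is usually computed in practice; a minimal sketch:

```python
import math

def perplexity(token_log_probs):
    """PPL from per-token natural-log probabilities of the test set:
    exp(-(1/N) * sum(log P(w_i))) == P(w_1..w_N)^(-1/N)."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Example: a 4-token test set where each token has probability 0.25
# yields PPL = 4.0.
assert abs(perplexity([math.log(0.25)] * 4) - 4.0) < 1e-9
```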

WER is a standard metric for assessing the performance of a speech recognition system. It measures the minimum number of operations (substitutions (S), deletions (D), and insertions (I)) required to transform a system output into the reference output. The formula for WER is:

$$\mathrm{WER} = \frac{S + D + I}{N}$$

where $N$ is the number of words in the reference. A lower WER signifies better system performance.
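The minimum edit count is the word-level Levenshtein distance, computable with standard dynamic programming; a compact sketch:

```python
def wer(reference, hypothesis):
    """WER via Levenshtein distance over words: the minimum number of
    substitutions, deletions, and insertions, divided by the
    reference length N."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```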

    4.2 Dataset Setup

This study utilizes the LibriSpeech ASR and LM datasets to train the English ASR and LM models, as outlined in Table 1. The LibriSpeech dataset is well-known in the ASR field and is publicly available. It comprises a wide range of audiobooks narrated by various native English speakers, earning acclaim for its high quality and diverse content. In particular, the text-only LM dataset consists of approximately 40 million normalized sentences sourced from 14,500 public domain books.

Table 1: Statistics for the LibriSpeech automatic speech recognition (ASR) and language model (LM) datasets used in this study

Table 2: Structure of the vanilla long short-term memory (LSTM) used in this study

Table 3: Structure of the Mogrifier long short-term memory (LSTM) used in this study

In this study, both the ASR and text-only LM datasets are tokenized into 1,024 subwords using SentencePiece [63]. To evaluate the PPL of the LMs, approximately 6K sentences are randomly selected from this dataset, while the rest are allocated for LM training. When assessing the WER of the ASR, the study utilizes four partitions of the LibriSpeech dataset. Each subset comprises approximately 3K utterances that exhibit diverse characteristics, including background noise and non-native speakers. These subsets have been deliberately designed to evaluate ASR models in real-world, less controlled scenarios, effectively simulating the challenging conditions commonly encountered in casual conversational speech.
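For reference, training and applying a 1,024-subword SentencePiece model looks roughly like the following; the file paths and the default model type are illustrative assumptions.

```python
import sentencepiece as spm

# Train a 1,024-piece subword model on the normalized LM text.
spm.SentencePieceTrainer.train(
    input="librispeech_lm_text.txt",   # assumed path to the text corpus
    model_prefix="libri_subword_1024",
    vocab_size=1024,
)

# Tokenize text into subword pieces (or integer ids for training).
sp = spm.SentencePieceProcessor(model_file="libri_subword_1024.model")
pieces = sp.encode("nothing was to be done but to wait", out_type=str)
ids = sp.encode("nothing was to be done but to wait", out_type=int)
```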

    4.3 Model Setup

All LSTM-based LMs utilized in this study have 512 input and hidden dimensions, as specified in Tables 2 and 3. Initially, eight distinct Mogrifier LSTM LMs are trained from the ground up, each featuring a different number of rounds, spanning from one to eight. Furthermore, an eight-round Mogrifier LSTM is trained using on-demand pruning, as described in Section 3.2. Additionally, the proposed eight-round Mogrifier LSTMs are trained using a combined mechanism of the sandwich rule and online distillation, as explained in Section 3.3.

    4.4 Training Details

For the training of these LMs, a batch size of 32 is employed, with each training example containing 512 tokens. All models are trained for a maximum of 300K iterations, and the checkpoint with the lowest loss value is selected. It is confirmed that all models converge within 300K steps. The Adam optimizer [64] is utilized, and the learning rate is gradually warmed up from $10^{-7}$ to a peak of $10^{-3}$. Subsequently, it is decayed using a cosine scheduler with a weight decay of $10^{-2}$. The temperature for the distillation loss is set at 2, and the trade-off weight between the KD loss and the CE loss is fixed at 0.5. All models are trained using the Fairseq [65] framework with a single NVIDIA V100 GPU.
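A sketch of the learning-rate schedule described above; the warmup length is an assumed value, since the paper does not state it.

```python
import math

def learning_rate(step, total_steps=300_000, warmup_steps=10_000,
                  lr_init=1e-7, lr_peak=1e-3):
    """Linear warmup from 1e-7 to a peak of 1e-3, then cosine decay
    toward zero. warmup_steps=10_000 is an assumption for illustration."""
    if step < warmup_steps:
        return lr_init + (lr_peak - lr_init) * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_peak * 0.5 * (1.0 + math.cos(math.pi * progress))
```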

For the training of AMs, this study employs a 16-layer Conformer-M encoder [3] trained with the CTC loss [6]. In all experiments, beam search decoding is utilized in combination with LMs. The decoding algorithm follows a frame-synchronous approach [10], with the beam size set to 5, and the LM weight is fixed at 0.4, a value determined to be optimal through preliminary experiments.
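In shallow fusion, each hypothesis extension during beam search is scored by the AM log-probability plus the weighted LM log-probability. A simplified sketch of one beam-extension step with these settings; the `am_logp` lookup and `lm_logp` callable are assumed stand-ins for the real decoder, not an actual API.

```python
import heapq

def extend_beams(beams, am_logp, lm_logp, beam_size=5, lm_weight=0.4):
    """One shallow-fusion beam search step: extend every hypothesis by
    every candidate token, scoring with log P_AM + 0.4 * log P_LM,
    and keep the top-5 hypotheses."""
    candidates = []
    for score, hyp in beams:                      # hyp: list of tokens so far
        for token, am_score in am_logp.items():   # AM log-probs for this frame
            fused = score + am_score + lm_weight * lm_logp(hyp, token)
            candidates.append((fused, hyp + [token]))
    return heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
```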

    4.5 Experimental Results

This study conducts three distinct sets of experiments to evaluate the performance of the proposed approach. In the first set, a comparison is made between the performance of the vanilla LSTM LM and the Mogrifier LSTM LM in the ASR task. The second set of experiments demonstrates the efficiency of the proposed joint on-demand pruning and online distillation method by comparing it to previous on-demand pruning strategies that serve as a baseline. As the third experiment, an ablation study of the proposed method is conducted.

    4.5.1 Comparison of LSTM LMs

In the initial step, this study compares the performance of shallow fusion using the vanilla LSTM LM and the Mogrifier LSTM LM. To ensure a fair comparison, both LSTM LMs have the same number of parameters. The vanilla LSTM consists of four layers, while the Mogrifier LSTM comprises two layers with eight rounds each.

As elaborated in Section 3.1, the Mogrifier LSTM LM incorporates a distinct module called rounds, which facilitates mutual gating between the current inputs and previous hidden states. In this study, the rounds of the Mogrifier LSTM LM serve as targets for on-demand pruning. Table 4 illustrates the effectiveness of pruning Mogrifier rounds and provides a comparative analysis with the conventional method of pruning layers from a vanilla LSTM LM. Notably, even without the implementation of specific training techniques for pruning, the Mogrifier LSTM LM slightly outperforms its vanilla counterpart. When a layer is pruned from the vanilla LSTM LM, performance dramatically declines, occasionally yielding results even worse than ASR performance without any LMs. In contrast, while performance does decrease when rounds are pruned from the Mogrifier LSTM LM, the LM maintains its efficiency.

Table 4: Comparison of shallow fusion with language models (LMs) which comprise vanilla and Mogrifier long short-term memory (LSTM) layers, respectively, in the test-other partition of LibriSpeech

Furthermore, a comparative analysis has been conducted, comparing the Mogrifier LSTM LM against a variety of alternative architectures, as presented in Table 5. This comprehensive evaluation effectively highlights the capabilities of the Mogrifier LSTM LM in the context of the ASR task. The results not only showcase its ability to compete with established architectures but also position it as a viable and promising choice for enhancing ASR performance.

Table 5: Comparison of shallow fusion with language models (LMs) which comprise several architectures and the Mogrifier long short-term memory (LSTM) LM, respectively, in the test-other partition of LibriSpeech

Table 6: Comparison of shallow fusion with Mogrifier long short-term memory language models (LMs) with on-demand pruning strategies and the proposed joint online distillation on the dev-clean partition of LibriSpeech under various run-time scenarios. The parameter k in the sandwich rule represents the number of losses computed per training iteration. Here, this study determines that setting k = 2 or k = 3 is adequate to achieve comparable accuracy to the baseline

Table 7: Comparison of shallow fusion with Mogrifier long short-term memory language models (LMs) with on-demand pruning strategies and the proposed joint online distillation on the dev-other partition of LibriSpeech under various run-time scenarios. The parameter k in the sandwich rule represents the number of losses computed per training iteration. Here, this study determines that setting k = 2 or k = 3 is adequate to achieve comparable accuracy to the baseline

Table 8: Comparison of shallow fusion with Mogrifier long short-term memory language models (LMs) with on-demand pruning strategies and the proposed joint online distillation on the test-clean partition of LibriSpeech under various run-time scenarios. The parameter k in the sandwich rule represents the number of losses computed per training iteration. Here, this study determines that setting k = 2 or k = 3 is adequate to achieve comparable accuracy to the baseline

    4.5.2 Comparison of On-Demand Pruning Strategies

The following series of experiments demonstrates the effectiveness of the suggested joint on-demand pruning and online distillation method by comparing it to previous on-demand pruning strategies, as shown in Tables 6–9. As elaborated in Section 4.3, these experiments employ the Mogrifier LSTM LM with two layers and eight rounds.

In the second row of these tables, individual Mogrifier LSTMs are presented, with each one being independently trained for rounds 1 through 8, without employing on-demand pruning. The number of parameters for these models varies from 5.8 million to 9.5 million, and during inference, it remains fixed to match the round used during training. This approach operates under the assumption that there are adequate resources available to train each model individually. While this approach may yield reliable performance, it is not practical due to its inefficiency.

Starting from the third row, this study introduces baseline models trained using on-demand pruning strategies. For the random method, this study randomly selects one of the eight rounds for each iteration during training. The overall performance of the random method is less consistent compared to the individual models, especially at the lower and upper bounds (with rounds of 1 or 8). Conversely, when employing the sandwich rule, this study observes a more stable WER at lower rounds compared to the random method. This stability is particularly evident at the lower and upper bounds, where the performance gap relative to individual models is significantly reduced compared to the random method. This study uses k = 3 and k = 2 for the sandwich rule, where k = 3 includes one round that is randomly selected at each iteration during training, as explained in Section 3.2. It is important to note that finding the right balance between the value of k and accuracy is crucial, as increasing k results in higher computational costs.

Nevertheless, there are trade-offs to consider, especially regarding the potential for unstable accuracy when using sandwich rule-based pruning strategies to achieve a smaller model size. To address this concern, this study incorporates joint online distillation alongside the sandwich rule, as illustrated in these tables and Fig. 4 (Fig. 4 is included to provide an intuitive comparison between the sandwich rule and joint online distillation methods, using the data from Table 9). The experimental results of the proposed joint online distillation indicate an overall improvement in performance compared to using the sandwich rule alone. The proposed setup not only reduces the number of parameters by approximately 39% but also boosts performance, outperforming the models trained individually. Moreover, this approach showcases its robustness, sustaining its performance even when pruned down to models with fewer parameters.

Table 9: Comparison of shallow fusion with Mogrifier long short-term memory language models (LMs) with on-demand pruning strategies and the proposed joint online distillation on the test-other partition of LibriSpeech under various run-time scenarios. The parameter k in the sandwich rule represents the number of losses computed per training iteration. Here, this study determines that setting k = 2 or k = 3 is adequate to achieve comparable accuracy to the baseline

Figure 4: A comparison between the sandwich rule and the proposed joint online distillation methods on the test-other partition of LibriSpeech under varying language model (LM) parameters

Upon a detailed analysis of these experimental results, several insights emerge that confirm the effectiveness of integrating the online distillation mechanism. The data clearly shows that knowledge transfer, facilitated by online distillation, takes place seamlessly during the training phase. This manifests in the smaller model's enhanced ability to absorb and reflect valuable information from its larger counterpart, leading to improved model accuracy and stability.

Moreover, the inherent flexibility of the model, achieved through on-demand construction, is evident in the experiments. The model showcases its adaptability, demonstrating its capability to dynamically adjust based on situational needs. This flexibility ensures that the model remains versatile and capable of catering to different devices or scenarios without requiring separate training for each context.

Additionally, this study analyzes several samples of ASR recognition results from both the sandwich rule method (with k = 3) and the proposed joint online distillation method, as presented in Table 10. This study focuses on cases where the online distillation mechanism has been jointly applied and observes that, in certain instances, this approach addresses issues such as deletion, insertion, and replacement, leading to outcomes that align with the correct answers.

Table 10: Qualitative analysis of automatic speech recognition results for the sandwich rule and the proposed joint online distillation method

Table 11: Comparison of ensemble and minimum when applying the proposed joint online distillation methods on the dev-clean partition of LibriSpeech under various run-time scenarios. The parameter k in the sandwich rule represents the number of losses computed per training iteration

Table 12: Comparison of ensemble and minimum when applying the proposed joint online distillation methods on the dev-other partition of LibriSpeech under various run-time scenarios. The parameter k in the sandwich rule represents the number of losses computed per training iteration

Table 13: Comparison of ensemble and minimum when applying the proposed joint online distillation methods on the test-clean partition of LibriSpeech under various run-time scenarios. The parameter k in the sandwich rule represents the number of losses computed per training iteration

    4.5.3 Ablation Study of the Joint On-Demand Pruning and Online Distillation

This study conducts an ablation study on the proposed method, which combines on-demand pruning and online distillation. As explained in Section 3.3, this study utilizes the ensemble logits of all sub-models to obtain soft targets in joint online distillation. To assess the effectiveness of this collaborative learning strategy, this study replaces the ensemble logits with the logits from the sub-model that exhibits the smallest loss when obtaining the soft targets.
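The swap amounts to one line in the soft-target computation of the earlier loss sketch; here `losses` is assumed to hold each sub-model's CE loss for the current mini-batch.

```python
import torch

def soft_target_logits(logits_list, losses, use_ensemble=True):
    """Proposed: soft targets from the mean of all sub-model logits.
    Ablation: soft targets from the single sub-model with the
    smallest loss instead."""
    if use_ensemble:
        return torch.stack(logits_list).mean(dim=0)
    best = min(range(len(losses)), key=lambda i: losses[i])
    return logits_list[best]
```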

The experimental results demonstrate that employing ensembles for joint online distillation generally yields comparable or superior performance to using the model with the lowest loss, as indicated in Tables 11–14 and Fig. 5 (Fig. 5 offers an intuitive comparison between the ensemble logit and the minimum logit when applying joint online distillation methods, utilizing the data from Table 14). This implies that the logits generated by the sub-models constructed based on the k value of the sandwich rule closely influence one another.

Table 14: Comparison of ensemble and minimum when applying the proposed joint online distillation methods on the test-other partition of LibriSpeech under various run-time scenarios. The parameter k in the sandwich rule represents the number of losses computed per training iteration

In summary, the fusion of on-demand pruning strategies with the online distillation mechanism empowers us to dynamically adjust the number of gates in the Mogrifier LSTM at runtime while maintaining high levels of accuracy. This adaptability streamlines the deployment of ASR models across a diverse array of devices, spanning from high-end to entry-level, all without the need for training multiple models of varying sizes tailored to each specific scenario. The effectiveness of our proposed method shines through in our evaluations on the LibriSpeech dataset. Nevertheless, it is imperative to undertake further validation across a spectrum of ASR and LM models. This crucial step will serve as the focal point of our forthcoming research endeavors, ensuring the robustness and versatility of our approach in diverse linguistic and acoustic environments.

Figure 5: An ablation study for the proposed joint online distillation using an ensemble mechanism on the test-other partition of LibriSpeech under varying language model (LM) parameters. Here, k = 3 for the sandwich rule

    5 Conclusions

The approach proposed in this study offers an efficient and robust solution to the challenge of deploying ASR models across various conditions, encompassing a wide spectrum of computational resources. By utilizing the Mogrifier LSTM LM in conjunction with the proposed joint on-demand pruning and online distillation, this study effectively adjusts the number of gates in the model while maintaining high accuracy. The experimental results demonstrate that the proposed approach surpasses the performance of the vanilla LSTM LM and maintains stable performance compared to the sandwich rule-based on-demand pruning strategies. Moreover, the optimal configuration proposed in this study reduces the number of parameters by approximately 39%, facilitating efficient deployment on both high-end and entry-level devices.

Acknowledgement: The authors would like to express their gratitude to the editors and reviewers for their thorough review and valuable recommendations.

Funding Statement: This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00377, Development of Intelligent Analysis and Classification Based Contents Class Categorization Technique to Prevent Imprudent Harmful Media Distribution).

Author Contributions: The authors confirm contribution to the paper as follows: study conception and design: S. Seo, J-H. Kim; data collection: S. Seo; analysis and interpretation of results: S. Seo, J-H. Kim; draft manuscript preparation: S. Seo. All authors reviewed the results and approved the final version of the manuscript.

Availability of Data and Materials: All the data used in this study are publicly available, and readers can access them freely.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
