
PAL-BERT: An Improved Question Answering Model


Wenfeng Zheng, Siyu Lu, Zhuohang Cai, Ruiyang Wang, Lei Wang and Lirong Yin

1 School of Automation, University of Electronic Science and Technology of China, Chengdu, 610054, China

2 Department of Geography and Anthropology, Louisiana State University, Baton Rouge, LA, 70803, USA

ABSTRACT In the field of natural language processing (NLP), various pre-training language models have emerged in recent years, and question answering systems have gained significant attention. However, as algorithms, data, and computing power advance, the issue of increasingly larger models and a growing number of parameters has surfaced. Consequently, model training has become more costly and less efficient. To enhance the efficiency and accuracy of the training process while reducing the model volume, this paper proposes PAL-BERT, a first-order pruning model based on the ALBERT model, designed according to the characteristics of question-answering (QA) systems and language models. Firstly, a first-order network pruning method based on the ALBERT model is designed, forming the PAL-BERT model. Then, a parameter optimization strategy for PAL-BERT is formulated, and the Mish function is used as the activation function instead of ReLU to improve performance. Finally, comparison experiments with the traditional deep learning models TextCNN and BiLSTM confirm that PAL-BERT is a pruning-based model compression method that can significantly reduce training time and optimize training efficiency. Compared with traditional models, PAL-BERT significantly improves performance on NLP tasks.

KEYWORDS PAL-BERT; question answering model; pretraining language models; ALBERT; pruning model; network pruning; TextCNN; BiLSTM

    1 Introduction

With the rapid growth of big data, the volume of complex network information is increasing exponentially. As a result, people increasingly rely on search engines to retrieve relevant information efficiently. However, traditional search engine algorithms can no longer adequately meet user demands in the face of this overwhelming data volume, so there is a growing need for more efficient and effective information retrieval methods. The question-answering (QA) system [1,2] is better suited to users' habits: users only need to input a natural language question to get the relevant answer returned by the system, which greatly reduces the time cost of obtaining information. The successful application of QA will free people from a large amount of repetitive work, change modes of production, and have an inestimable impact on the progress of human society.

For a long time, the performance of deep learning in NLP lagged far behind its performance in image processing. While image recognition algorithms [3,4] have caught up with and even surpassed human performance, progress in combining NLP with deep learning had been unsatisfactory. The emergence of BERT [5–7] broke this situation. It demonstrated remarkable results in high-level evaluations of machine reading comprehension; on SQuAD 1.1 [8] it outperformed humans on both measurement indicators, and it achieved outstanding results on 11 distinct NLP tasks, including pushing the GLUE benchmark to 80.4% (an increase of 7.6%) and reaching 86.7% accuracy (an increase of 5.6%) on MultiNLI. The emergence of BERT is like a watershed, revealing the great development potential of neural network models in NLP. Despite the great success and public attention of ChatGPT, which is known for answering questions quickly and fluently, the authenticity of its answers has also been widely questioned [9]. In contrast, BERT has its own advantages: it aims to understand the context of the words in a sentence, thus providing a more accurate answer to a question. BERT is an upstream (pre-training) model, so to achieve good results in specific applications such as QA systems, a corresponding downstream model must also be built on top of it. In addition, the huge network structure and large number of parameters of the BERT model are major obstacles to completing practical tasks. Therefore, it is meaningful to optimize the BERT model for better results.

Pruning [10–12] is a model compression method. In short, by deleting unimportant neurons, the number of weights and the amount of computation are greatly reduced, improving the operating efficiency of the model. In practical applications, a large neural network contains some redundancy relative to the constraints of its application scenario, and pruning is the technique for reducing this redundancy. Its core idea is to minimize the number of neuron connections while ensuring no great loss of accuracy.

The pruning process is shown in Fig. 1. First, train a neural network on a dataset, and then use a similar but different dataset for pruning. For the trained network, identify the neurons to prune according to certain criteria, remove the neurons with the least impact, and then fine-tune [13] on the new dataset. Repeat this process until the number of neurons has been reduced to the required target while the accuracy remains within the acceptable loss range.
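To make this iterative procedure concrete, the sketch below outlines a generic prune-then-fine-tune loop in PyTorch. It is a minimal illustration only, not the pipeline used in this paper: the magnitude-based criterion (torch.nn.utils.prune.l1_unstructured), the data loaders, and the target sparsity are placeholder assumptions.

```python
# Generic iterative pruning loop: train, prune a little, fine-tune, repeat.
# The criterion, loaders, and target sparsity below are illustrative placeholders.
import torch
import torch.nn.utils.prune as prune


def train(model, loader, epochs, lr=2e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for inputs, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()
            optimizer.step()


def iterative_pruning(model, train_loader, finetune_loader,
                      target_sparsity=0.6, step=0.1):
    train(model, train_loader, epochs=3)                  # initial training
    pruned = 0.0
    while pruned < target_sparsity:
        for module in model.modules():                    # prune the least important weights
            if isinstance(module, torch.nn.Linear):
                prune.l1_unstructured(module, name="weight", amount=step)
        train(model, finetune_loader, epochs=1)           # fine-tune to recover accuracy
        pruned += step                                    # approximate bookkeeping
    return model
```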

Many researchers have studied various pruning methods. EvoPruneDeepTL [14] is an evolutionary pruning model based on transfer learning, which replaces the last fully connected layer with a sparse layer optimized by a genetic algorithm, reducing the number of active neurons in the last layer while improving accuracy. A method called progressive local filter pruning (PLFP) was proposed by Wang et al. [15], which retains the capacity of the original model during pruning and achieves excellent performance. To address the problem that existing residual networks usually have a large model volume, a hierarchical pruning multi-scale feature fusion residual network (HSPMFFRN) for IFD was proposed [16], which uses multiple multi-scale feature extraction and fusion modules to extract, fuse, and compress multi-scale features without changing the convolution filter size; redundant channels are also removed by pruning. The results show that pruning can effectively compress parameters and model volume and achieve better diagnostic performance.

Traditional zero-order network pruning (setting a threshold on the absolute values of the model parameters, retaining those above it and zeroing those below it) is not suitable for the transfer learning scenario, because in this scenario the model parameters are mainly influenced by the original model but still need to be fine-tuned and tested on the target task. Therefore, pruning directly according to the parameter magnitudes may discard knowledge from either the source or the target task.

The model used in this paper is based on ALBERT [17]. ALBERT, short for "A Lite BERT," is a more efficient and compact variant of the original BERT model. Developed by Google Research, ALBERT maintains high performance on natural language processing tasks while addressing BERT's computational and memory requirements. Key features of ALBERT include cross-layer parameter sharing, factorized embedding parameterization, a Sentence Order Prediction (SOP) pretraining task, and improved training strategies. These innovations reduce the model's size and resource demands, making it a practical choice for various NLP applications.

This study optimizes the ALBERT model through pruning technology to improve its performance on question answering tasks. The innovations are mainly reflected in the following aspects:

1) ALBERT is introduced as the research basis. It is a more efficient and compact variant of the BERT model that addresses BERT's computing and memory requirements through features such as cross-layer parameter sharing, factorized embedding parameterization, and Sentence Order Prediction.

2) For low-computing-power environments, a dedicated training optimization strategy is designed, including gradient accumulation and 16-bit quantization, to improve training efficiency.

3) PAL-BERT is introduced as a first-order pruning model that reduces computational requirements and improves model efficiency by retaining parameters that move away from zero during fine-tuning.

Through these innovations, this paper attempts to address some shortcomings of deep learning applications in natural language processing, proposes a more efficient and compact model, and further improves performance through pruning. This series of innovations and improvements makes the work valuable for optimizing question answering systems.

In our experimental settings, we follow ALBERT's original configuration and use the SQuAD 1.1 and SQuAD 2.0 datasets as pre-training experimental datasets. We select TextCNN and BiLSTM as comparison models. Through experimental comparison and analysis on the CMRC 2018 dataset, PAL-BERT achieves an accuracy of 81.5% on question-answering tasks, demonstrating superior performance to the baseline models. The experimental results verify the effectiveness of the proposed model.

    2 Dataset

    2.1 SQuAD 1.1 and SQuAD 2.0

In this study, SQuAD 1.1 and SQuAD 2.0 [8] are used as the English pre-training experimental datasets, because they contain a sufficient number of samples and the samples are scientifically selected. The datasets can be obtained from https://rajpurkar.github.io/SQuAD-explorer/. Table 1 shows the size and distribution of SQuAD 1.1 and SQuAD 2.0.

Table 1: Number and distribution of positive and negative samples in the SQuAD dataset

Various studies based on the old version of the dataset (SQuAD 1.1) have already performed well. Therefore, the new version (SQuAD 2.0) introduces manually labeled "unanswerable" questions, which increases the difficulty of the whole task.

    2.2 CMRC 2018

For the other experimental dataset, we use the Chinese machine reading comprehension dataset CMRC 2018 [18], released by the Joint Laboratory of Harbin Institute of Technology and iFLYTEK. The dataset can be obtained from https://ymcui.com/cmrc2018/. The content of the dataset comes from Chinese Wikipedia, the questions are written manually, and the data format is similar to SQuAD. The training set contains about 10,000 samples. Table 2 shows the specific information of the dataset.

Table 2: CMRC 2018 sample quantity

However, there are still some differences between Chinese and English datasets, so CMRC 2018 is also used as a supplementary experimental dataset to explore how the model and preprocessing differ in the non-English case. Each article is accompanied by several relevant questions, and each question has several manually labeled reference answers; each reference answer is considered correct during evaluation. To ensure the diversity of questions, the dataset includes six common question types as well as an "other" category. The statistics of question types are shown in Table 3.

Table 3: CMRC 2018 question type statistics

    3 Method

    3.1 Basic Framework of ALBERT Model

ALBERT is a more efficient version of BERT, and it makes improvements in three main areas:

1. Embedding Layer Factorization:

In BERT, the size of the word embedding layer matches that of the hidden layers. ALBERT, however, argues that the embedding layer primarily holds context-independent information while the hidden layers add context, so the hidden layers should have a higher dimension. To avoid a large increase in embedding parameters when the hidden dimension grows, ALBERT factorizes the embedding matrix, introducing an extra, smaller embedding layer; a short worked example of the resulting parameter counts follows below.
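To illustrate the effect of this factorization, the short calculation below uses example sizes that are commonly quoted for ALBERT-style models (a vocabulary of 30,000, a hidden size of 1024, and an intermediate embedding size of 128); these values are illustrative assumptions, not the exact configuration used later in this paper.

```python
# Embedding parameter count with and without factorization (illustrative sizes).
V, H, E = 30_000, 1024, 128   # vocabulary, hidden size, intermediate embedding size

direct = V * H                # BERT-style: one V x H embedding matrix
factorized = V * E + E * H    # ALBERT-style: V x E lookup followed by E x H projection

print(f"direct:     {direct:,} parameters")      # 30,720,000
print(f"factorized: {factorized:,} parameters")  # 3,971,072
```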

2. Cross-Layer Parameter Sharing:

ALBERT uses a mechanism in which all layers share parameters to enhance efficiency. While other methods share parameters only within specific parts of the model, ALBERT shares all parameters across all layers. This results in smoother transitions between layers, suggesting that parameter sharing improves the stability of the model.

3. Sentence Order Prediction (SOP) Task:

BERT introduced Next Sentence Prediction (NSP) to improve downstream tasks that use sentence pairs. ALBERT replaces it with an alternative task called Sentence Order Prediction (SOP). In SOP, two consecutive segments from a single document form a positive sample, and a negative sample is created by swapping their order. ALBERT argues that NSP is less effective because it conflates topic prediction with coherence prediction: topic prediction is comparatively easy, so a model can score well on NSP without truly learning inter-sentence coherence. By keeping both segments from the same document, SOP removes the topic cue and forces the model to learn coherence.
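As a concrete illustration of how SOP training pairs can be built, the sketch below forms positive and negative samples from consecutive segments of a single document. The segmentation and sampling here are simplified assumptions for illustration, not ALBERT's exact preprocessing.

```python
# Sentence Order Prediction samples: consecutive segments of one document, either in
# their original order (label 1) or swapped (label 0).
import random

def make_sop_samples(segments):
    samples = []
    for first, second in zip(segments, segments[1:]):
        if random.random() < 0.5:
            samples.append(((first, second), 1))   # positive: original order
        else:
            samples.append(((second, first), 0))   # negative: swapped order
    return samples

doc = ["The model is pruned.", "Fine-tuning then restores accuracy.", "Finally it is evaluated."]
print(make_sop_samples(doc))
```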

    3.2 Design of QA Model Based on ALBERT Improvement

Based on the ALBERT model, the model built in this section is optimized and pruned to obtain a model suitable for QA tasks.

    3.2.1 Optimization Strategy of Model Training

In this study, all experiments are run on NVIDIA RTX 2060 GPUs. While these cards offer decent computing power, they may fall short when handling the ALBERT model. Consequently, this section introduces two optimization strategies aimed at reducing the hardware demands of model training. To attain satisfactory training results on platforms with limited computing power, we employ gradient accumulation and 16-bit precision training.

The first optimization strategy is gradient accumulation. Take PyTorch as an example. In traditional neural network training, PyTorch calculates the gradient after each 'backward()' call, and the gradient is not cleared automatically. If it is not cleared manually, gradients keep accumulating until a "CUDA out of memory" error is reported. Generally, the process is shown in Fig. 2.

From Fig. 2a, we can clearly see the following five steps:

1) Reset the gradient to 0 after the previous batch calculation.

2) Forward propagation: feed the input data through the network to obtain the predicted values.

3) Calculate the loss value from the predicted values and the labels.

4) Calculate the parameter gradients by backpropagating the loss.

5) Update the network parameters using the gradients calculated in the previous step.

The gradient accumulation method obtains one batch at a time, calculates the gradient once, and accumulates it continuously without clearing. As shown in Fig. 2b, the training steps with gradient accumulation are as follows:

1) Forward propagation: feed the input data through the network to obtain the predicted values.

2) Calculate the loss value from the predicted values and the labels.

3) Normalize the loss function (divide it by the number of accumulation steps).

4) Calculate the parameter gradients by backpropagating the loss.

5) Repeat steps 1 to 4, accumulating the gradients instead of resetting them.

6) After the gradients have been accumulated a fixed number of times, update the parameters and reset the gradient to 0.

Figure 2: Comparison of the traditional neural network training process (a) and the gradient accumulation training process (b)

Gradient accumulation expands the GPU memory in a disguised way. For example, if the number of accumulation steps is set to 6, it can be roughly regarded as expanding the memory by a factor of 6. Although the effect is somewhat worse than actually having 6 times the memory, gradient accumulation offers a very good cost-performance ratio when experimental resources are limited.
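The steps above can be written as a short training loop. The sketch below is a generic PyTorch illustration of gradient accumulation, not the exact training code used in this paper; the model, data loader, loss function, and accumulation count are placeholders.

```python
# Gradient accumulation in PyTorch: divide each loss by the accumulation count,
# call backward() every batch, and only step/zero the optimizer every `accum_steps` batches.
import torch

def train_with_accumulation(model, loader, optimizer, loss_fn, accum_steps=2):
    model.train()
    optimizer.zero_grad()
    for step, (inputs, labels) in enumerate(loader, start=1):
        outputs = model(inputs)                        # forward pass
        loss = loss_fn(outputs, labels) / accum_steps  # normalize the loss
        loss.backward()                                # gradients accumulate across batches
        if step % accum_steps == 0:
            optimizer.step()                           # update after accum_steps batches
            optimizer.zero_grad()                      # then reset the gradient
```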

The second optimization strategy is 16-bit precision training [19,20].

Conventionally, models have been trained with 32-bit precision numbers. However, recent research has shown that 16-bit precision can yield comparable training results. The foremost rationale for adopting 16-bit precision is that it reduces the memory demands of model parameters and intermediate activations. The computational efficiency of 16-bit arithmetic, particularly on GPU hardware, is another key factor: 16-bit operations parallelize well, resulting in faster computation. Training deep learning models with 16-bit precision can therefore significantly speed up training, mainly by reducing the time spent on memory transfers and arithmetic operations.

The reduction in precision not only speeds up computation but also reduces the memory footprint: storing model parameters and intermediate activations in 16-bit format consumes only half the memory of 32-bit precision. In many cases, models trained with 16-bit precision have matched the performance of models trained with 32-bit precision. The advantages of 16-bit precision also extend to model deployment, particularly in resource-constrained contexts such as edge devices or mobile platforms.

In practice, 16-bit precision training uses the Apex library, a PyTorch extension developed by NVIDIA. The transition to 16-bit precision requires changing the data types of model weights, activations, and gradients from 'float32' to 'float16'. To address potential numerical instability, Automatic Mixed Precision (AMP) is employed to dynamically manage precision during the forward and backward passes.
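The paper relies on NVIDIA's Apex library; the sketch below shows the equivalent behavior with PyTorch's built-in automatic mixed precision (torch.cuda.amp), where autocast runs the forward pass in 16-bit where safe and GradScaler rescales the loss to avoid gradient underflow. It is an illustration under these assumptions, not the authors' training script.

```python
# Mixed-precision training with PyTorch's native AMP: autocast runs the forward pass
# in float16 where safe, and GradScaler rescales the loss to prevent gradient underflow.
import torch

def train_step_amp(model, inputs, labels, optimizer, loss_fn, scaler):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # forward pass in mixed precision
        loss = loss_fn(model(inputs), labels)
    scaler.scale(loss).backward()            # scaled backward pass
    scaler.step(optimizer)                   # unscales gradients, then optimizer step
    scaler.update()                          # adjust the scale factor for the next step
    return loss.item()

# Usage: create scaler = torch.cuda.amp.GradScaler(), then call train_step_amp per batch.
```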

    3.2.2 First-Order Network Pruning Design Based on ALBERT

Traditional zeroth-order network pruning, which involves setting a threshold for the absolute values of model parameters and retaining those above it while zeroing out those below it, is not suitable for transfer learning scenarios. In transfer learning, model parameters are primarily influenced by the original model but require fine-tuning and testing on the target task. Therefore, pruning directly based on the parameter values themselves may result in the loss of knowledge from either the source task or the target task.
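For reference, the zeroth-order (magnitude) criterion described above amounts to a simple threshold mask over the absolute values of the weights. The short sketch below illustrates this; the threshold value is an arbitrary example.

```python
# Zeroth-order (magnitude) pruning: keep weights whose absolute value exceeds a
# threshold and zero out the rest. The threshold here is arbitrary.
import torch

def magnitude_prune(weight: torch.Tensor, threshold: float = 1e-2) -> torch.Tensor:
    mask = (weight.abs() >= threshold).float()
    return weight * mask

w = torch.randn(4, 4) * 0.02
print(magnitude_prune(w, threshold=0.02))
```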

However, first-order pruning relies on gradients calculated during training to identify and remove less important model parameters. This approach tends to preserve task-specific information better than zeroth-order pruning. By targeting specific parameters, first-order pruning can produce smaller, more efficient models suitable for deployment in resource-constrained environments.

For the reasons mentioned above, this paper proposes a first-order pruning model for ALBERT, named PAL-BERT (where 'P' stands for Pruning and 'AL' for ALBERT). This model aims to retain parameters that move further away from zero during the fine-tuning process, mitigating the loss of knowledge in transfer learning scenarios.

Specifically, for the model parameters W, we assign importance scores S of the same size. The pruning mask can then be written as Eq. (1):

M = top_v(S), where M_{i,j} = 1 if S_{i,j} is among the top v fraction of scores and 0 otherwise.    (1)

In the forward propagation process, the neural network uses the masked parameters to calculate the output components a_i, as shown in Eq. (2):

a_i = Σ_k W_{i,k} M_{i,k} x_k    (2)

In back propagation, following the idea of the Straight-Through Estimator [21], the top_v(·) operator is omitted to approximate the gradient of the loss function L with respect to the importance score S, as shown in Eq. (3):

∂L/∂S_{i,j} = (∂L/∂a_i)(∂a_i/∂S_{i,j}) ≈ (∂L/∂a_i) W_{i,j} x_j    (3)

For the model parameters, the gradient is shown in Eq. (4):

∂L/∂W_{i,j} = (∂L/∂a_i) M_{i,j} x_j    (4)

Combining the two equations above and omitting the 0/1 mask matrix, we obtain Eq. (5):

∂L/∂S_{i,j} ≈ (∂L/∂W_{i,j}) W_{i,j}    (5)

According to gradient descent, the importance S_{i,j} increases when ∂L/∂S_{i,j} < 0, that is, when ∂L/∂W_{i,j} and W_{i,j} have opposite signs. This means that only when a positive parameter becomes larger or a negative parameter becomes smaller during back propagation, i.e., when the parameter moves away from zero, does it receive a higher importance score and avoid being pruned.
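A minimal sketch of this mechanism is given below: a masked linear layer that learns importance scores S alongside the weights W, keeps the top-v fraction of scores (Eq. (1)), and lets gradients flow to S through a straight-through estimator (Eq. (3)). The layer and class names are illustrative, not the authors' implementation.

```python
# First-order pruning sketch: a linear layer masked by the top-v importance scores,
# with a straight-through estimator so the scores receive gradients.
import torch
import torch.nn as nn


class TopVMask(torch.autograd.Function):
    """Binary mask from the top-v fraction of scores; backward is the identity (STE)."""

    @staticmethod
    def forward(ctx, scores, keep_fraction):
        k = max(1, int(keep_fraction * scores.numel()))
        threshold = torch.topk(scores.flatten(), k).values.min()
        return (scores >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        # Gradient passes straight through to the scores; none for keep_fraction.
        return grad_output, None


class MaskedLinear(nn.Module):
    def __init__(self, in_features, out_features, keep_fraction=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # Importance scores S, one per weight, learned jointly with W.
        self.scores = nn.Parameter(torch.zeros(out_features, in_features))
        self.keep_fraction = keep_fraction

    def forward(self, x):
        mask = TopVMask.apply(self.scores, self.keep_fraction)          # Eq. (1)
        return nn.functional.linear(x, self.weight * mask, self.bias)   # Eq. (2)


# Toy usage: the score gradients have the same shape as the weight matrix.
layer = MaskedLinear(16, 4, keep_fraction=0.5)
loss = layer(torch.randn(8, 16)).pow(2).mean()
loss.backward()
print(layer.scores.grad.shape)
```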

    3.2.3 Parameter Optimization Strategy of PAL-BERT Model

To make the established PAL-BERT achieve the best effect on question answering tasks, its internal parameters must also be selected and tested to find the optimal scheme. This involves two aspects.

1. Selection of different layers. The official ALBERT_large model contains 24 encoder layers, but this structure is like a black box. Each layer certainly contains different semantic information, but it is not known exactly what semantic information each layer holds. Therefore, in order to find the most effective layer or combination of layers, we need to analyze the different layers of PAL-BERT.

2. Avoiding overfitting during training. It is necessary to select an optimizer with an appropriate learning rate to train the PAL-BERT model.

The lower layers of ALBERT generally carry more information, while the upper layers may contain less. To adapt to this situation, different learning rates should be selected for different layers.

The iterative update of the parameters of each layer can be written as Eq. (6):

θ_t^l = θ_{t-1}^l − η^l · ∇_{θ^l} J(θ), with η^{l-1} = η^l / ε    (6)

where θ^l and η^l denote the parameters and learning rate of the l-th layer, and ε represents the decay coefficient. When ε > 1, the learning rate decays layer by layer; ε < 1 indicates that the learning rate expands layer by layer; and when ε = 1, all layers use the same learning rate, which is similar to ordinary stochastic gradient descent (SGD).
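As an illustration of this layer-wise scheme, the sketch below builds per-layer AdamW parameter groups whose learning rates follow Eq. (6). The encoder attribute name, the base learning rate, and the value of ε are assumptions for the example, not the authors' exact configuration; embeddings and the task head are omitted for brevity.

```python
# Layer-wise learning rates: the top encoder layer uses base_lr, and each layer below it
# divides the rate by epsilon once more (epsilon > 1 decays the rate layer by layer,
# epsilon < 1 expands it, epsilon = 1 gives a uniform rate).
import torch

def layerwise_lr_groups(model, base_lr=2e-5, epsilon=1.05):
    layers = list(model.encoder.layer)        # assumed attribute for the encoder stack
    n = len(layers)
    groups = []
    for depth, layer in enumerate(layers):    # depth 0 = lowest layer
        lr = base_lr / (epsilon ** (n - 1 - depth))
        groups.append({"params": layer.parameters(), "lr": lr})
    return groups

# Usage (assuming `model` is an ALBERT-style encoder):
# optimizer = torch.optim.AdamW(layerwise_lr_groups(model, base_lr=2e-5, epsilon=1.05))
```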

    3.2.4 Further Optimization of PAL-BERT Model Performance

ReLU is one of the most widely used activation functions. It is computationally efficient and helps alleviate the vanishing gradient problem, making it suitable for training deep neural networks. It is defined as Eq. (8):

ReLU(x) = max(0, x)    (8)

However, the ReLU function still has several drawbacks. Firstly, its output is not zero-centered. Secondly, it suffers from the dead ReLU problem: ReLU neurons become inactive in the negative input region, reducing their responsiveness during training. When x < 0, the gradient remains permanently 0, so the affected neuron and the neurons that depend on it stop responding, and the corresponding parameters are never updated.

In this section, we adopt the Mish activation function in place of ReLU within the fully connected layers to enhance the model's performance. Mish offers several advantages over ReLU. It has no upper bound on positive values, which avoids saturation, and it maintains a smoother gradient. These characteristics improve gradient flow, mitigate dead-neuron problems, and ultimately yield better accuracy and generalization in deep neural networks compared to ReLU.

The curve of the Mish function is shown in Fig. 3, and its expression is given in Eq. (9):

Mish(x) = x · tanh(ln(1 + e^x))    (9)

Figure 3: Mish function

Compared to ReLU, Mish allows positive values to grow without an upper bound, avoiding the saturation caused by capping [22–25]. Unlike ReLU's hard zero boundary, Mish permits slightly negative function values, which facilitates more favorable gradient flow. The most significant advantage of Mish over ReLU is its inherent smoothness: a smooth activation lets information propagate more easily through the network, leading to notable improvements in accuracy and generalization.
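The difference is easy to see numerically. The sketch below compares ReLU and Mish (Eq. (9)) on a few sample inputs, using PyTorch's softplus for numerical stability; it is a simple illustration, not part of the model code.

```python
# Mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + exp(x))); softplus avoids overflow.
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    return x * torch.tanh(F.softplus(x))

x = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0])
print(F.relu(x))   # hard zero for all negative inputs
print(mish(x))     # smooth, slightly negative for small negative inputs
```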

    4 Experiment and Results

    4.1 Experimental Environment and Evaluation Criteria

The details of the experimental environment are shown in Table 4.

Table 4: Details of the experimental environment

To evaluate the classification performance, the commonly used accuracy, recall, and F1 score are adopted.

Accuracy is the proportion of all predictions, positive and negative, that are correct. Recall is the proportion of actual positives that are correctly predicted as positive. The F1 score is the harmonic mean of precision and recall, F1 = 2PR/(P + R), where precision P is the proportion of predicted positives that are actually positive.
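To make these definitions concrete, a minimal sketch of the metric calculations from raw confusion-matrix counts is given below; it is illustrative only, not the evaluation script used in the experiments.

```python
# Accuracy, recall, and F1 from true/false positive and negative counts.
def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, recall, f1

print(classification_metrics(tp=80, tn=90, fp=10, fn=20))
```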

The ALBERT model used in the experiments in this section is ALBERT_large_zh [17], with an initial model size of 64 M. The specific parameter settings are shown in Table 5. During training, the batch size is set to 32, the number of gradient accumulation steps is 2, and the learning rate is 2.0e-5. The number of training epochs is 5, and the model is optimized with AdamW.

Table 5: Parameters of ALBERT_large_zh

4.2 Effects of Different Batch Sizes and Mixed Precision on Training Results

Although layer normalization is used inside ALBERT, the batch size still affects the model's accuracy. Table 6 shows the accuracy of the PAL-BERT model under different batch sizes, with and without half-precision training.

Table 6: Effects of different batch sizes and half-precision training on accuracy

Here, 'FP16' indicates that mixed-precision computing is used, while 'FP32' indicates that it is not. As shown in the table, batch sizes of 8, 16, and 32 lead to different final accuracies, with the best performance achieved at a batch size of 32. One possible explanation lies in the gradient update stage: with a batch size of 32, the average loss over 32 samples serves as the loss function, and the gradient of this average loss is used for the parameter update. If the batch size is too small, training may get stuck in local optima, causing results to oscillate without converging to the global optimum. Therefore, for PAL-BERT training, a larger batch size is preferable, although it demands more capable hardware.

In the table, when the batch size is set to 32 without mixed-precision computing, GPU memory overflows and the computation cannot proceed; mixed-precision computing resolves this issue. The experimental results indicate that mixed-precision training significantly increases the batch size the GPU can handle. It is worth noting that, although NVIDIA officially states that its Apex mixed-precision training method does not affect model performance, during PAL-BERT training there is a slight reduction in final accuracy when mixed-precision training is used. Nevertheless, the increase in trainable batch size more than compensates for this minor adverse effect, so mixed-precision training remains essential.

    4.3 Effects of Gradient Accumulation Times on Results

The actual effect of gradient accumulation can be seen as a disguised increase in batch size. The results above were obtained with 2 gradient accumulation steps. With a batch size of 32, the accuracy for different numbers of gradient accumulation steps is shown in Table 7.

Table 7: Effect of gradient accumulation times on results

As shown in Fig. 4, when the number of gradient accumulation steps is 1, i.e., without accumulation, the accuracy is slightly higher than that of batch size 16, and accumulating twice is slightly higher still. This shows that although gradient accumulation can be regarded as an increase in batch size, its performance is slightly worse than actually using the larger batch. The results show that the best effect occurs when the gradient is accumulated 2 or 3 times. However, considering the large increase in computation time when accumulating 3 times, accumulating 2 times is preferable.

Figure 4: Effect of gradient accumulation times on accuracy

4.4 Effects of Different Learning Rates on the PAL-BERT Model

    Table 8 shows the effects of different learning rates.

Table 8: Effects of different learning rates on results

When the initial learning rate is high, the decay coefficient should be relatively low, because the deeper parts of the model learn less and need a relatively low learning rate for fitting. The comparison shows that the trained model reaches the highest accuracy when the decay coefficient is 0.95 and the learning rate is 2.0e-5.

4.5 Performance Comparison of PAL-BERT with Other Models

To compare PAL-BERT's performance with other models, this section introduces two additional commonly used models: TextCNN [21] and bidirectional long short-term memory networks (BiLSTM) [26,27]. In this experiment, the Jieba word segmentation tool is employed, and open-source 300-dimensional GloVe and Word2Vec word vectors are used for vectorization.

Table 9 shows the performance of the ALBERT-based question-answering model and the traditional deep learning models on the CMRC 2018 dataset.

Table 9: Performance of each model on the CMRC 2018 dataset

TextCNN-G and BiLSTM-G denote the use of GloVe word vectors for vectorization, while TextCNN-W and BiLSTM-W indicate the use of Word2Vec.

Among the four models that combine a traditional neural network with word vectors, BiLSTM-G performs best, but its result is still about 4% worse than PAL-BERT. This shows that PAL-BERT handles QA tasks better than the traditional models.

    5 Discussion

The NLP models based on pre-training technology, represented by BERT, have developed in three directions: 1) larger models with more training data and parameters, such as RoBERTa [28], GPT-3 [29], Megatron [30], and Turing-NLG [31]; 2) models that incorporate more and richer external knowledge, such as ERNIE [32] and K-BERT [33]; 3) models pursuing faster speed and smaller memory footprints, such as DistilBERT [34], TinyBERT [35], and ALBERT. BERT-scale models routinely have more than 100 M parameters and consume a great deal of computing power and time. However, the ultimate objective of AI should be to develop simpler models capable of delivering superior results with minimal data. Human learning, for instance, does not require an abundance of samples; it relies on the ability to make inferences, a hallmark of genuine intelligence. Therefore, the pursuit of faster speed and smaller memory will be the way forward, allowing models to run not only on mobile platforms but even on IoT devices with low-power chips.

This study designs a first-order pruning model, PAL-BERT, based on ALBERT, and explores the impact of different parameter tuning strategies on model performance through experiments. The original ALBERT_large_zh model had a size of 64 M, but after gradient accumulation, 16-bit training, and first-order pruning, the model size was reduced to only 19 M. While sacrificing some accuracy, with the F1 score dropping from 0.878 to 0.796, this reduction in model size enables the model to be deployed and trained effectively on low-computing-power platforms.

A comparative experiment with the traditional deep learning models TextCNN and BiLSTM was designed. The results demonstrate the feasibility of the method proposed in this article and provide a solution for knowledge question answering models on low-computing-power platforms. Although PAL-BERT has shown superior performance in the experiments, it may still have some potential shortcomings or limitations compared to traditional models.

Pruning methods involve the choice of several hyperparameters. For PAL-BERT, these hyperparameters may require complex tuning depending on the task and the dataset. PAL-BERT's pruning process aims to reduce the size of the model, but this is often accompanied by some information loss, so compared with traditional models PAL-BERT may require a trade-off between model efficiency and information retention. In addition, PAL-BERT's performance improvement may depend more on the nature of the specific task and the quality of the fine-tuning dataset, which may make its generalization weaker than that of traditional models.

However, the methods proposed in this paper have certain limitations. In our training optimization approach, we opted for gradient accumulation and 16-bit precision training to address the issue of insufficient computing power. Nevertheless, in actual training, performance is highly sensitive to the choice of the learning rate: setting it too high may lead to divergence, while setting it too low may result in slow convergence or getting trapped in local minima. Determining the optimal learning rate for a given dataset therefore remains an unresolved challenge. Furthermore, for 16-bit precision training, the increase in computational efficiency comes at the cost of reduced precision. The impact of this precision reduction may become evident in specific training tasks, and its practicality in models with more complex word embeddings still requires further investigation.

    6 Conclusion

This paper studies question answering technology based on pre-training models and proposes a pruning-based model compression method, which successfully shortens training time and improves training efficiency. We improved and optimized the ALBERT structure and, when adapting it to specific QA tasks, proposed a new model that can give full play to its performance on QA tasks.

Firstly, this paper introduces ALBERT, an improved model based on BERT. Its improvements over BERT mainly include three aspects: embedding layer factorization, cross-layer parameter sharing, and the SOP sentence order prediction task.

Secondly, two optimization strategies for model training are proposed, namely gradient accumulation and 16-bit precision training.

Thirdly, this paper proposes a first-order pruning model, PAL-BERT, for ALBERT. Through comparative experiments with the traditional deep learning models TextCNN and BiLSTM, this paper explores the impact of different batch sizes and mixed precision on the model's performance. The experimental results show that the effect is best when the gradient is accumulated 2 or 3 times, with little difference between them; given the significant increase in computation time when accumulating three times, it is advisable to limit the accumulation to two. In addition, by comparing the effects of different learning rates on the PAL-BERT model, it is found that the trained model has the highest accuracy when the decay coefficient is 0.95 and the learning rate is 2.0e-5. Among the four models combining a traditional neural network with word vectors, BiLSTM-G performs best, but its result is still about 4% worse than PAL-BERT, showing that PAL-BERT significantly improves the performance of natural language processing tasks compared with traditional deep learning models.

Acknowledgement: Not applicable.

Funding Statement: Supported by Sichuan Science and Technology Program (2021YFQ0003, 2023YFSY0026, 2023YFH0004).

Author Contributions: The authors confirm contribution to the paper as follows: study conception and design: Wenfeng Zheng; data collection: Siyu Lu, Ruiyang Wang; software: Zhuohang Cai; analysis and interpretation of results: Wenfeng Zheng, Lei Wang; draft manuscript preparation: Wenfeng Zheng, Siyu Lu, Lirong Yin. All authors reviewed the results and approved the final version of the manuscript.

Availability of Data and Materials: SQuAD can be obtained from https://rajpurkar.github.io/SQuAD-explorer/. CMRC can be obtained from https://ymcui.com/cmrc2018/.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
