
    Progress in Neural NLP: Modeling, Learning, and Reasoning


    Ming Zhou, Nan Duan, Shujie Liu, Heung-Yeung Shum*

    Microsoft Research Asia, Beijing 100080, China

Keywords: Natural language processing; Deep learning; Modeling, learning, and reasoning

A B S T R A C T

Natural language processing (NLP) is a subfield of artificial intelligence that focuses on enabling computers to understand and process human languages. In the last five years, we have witnessed the rapid development of NLP in tasks such as machine translation, question-answering, and machine reading comprehension based on deep learning and an enormous volume of annotated and unannotated data. In this paper, we will review the latest progress in the neural network-based NLP framework (neural NLP) from three perspectives: modeling, learning, and reasoning. In the modeling section, we will describe several fundamental neural network-based modeling paradigms, such as word embedding, sentence embedding, and sequence-to-sequence modeling, which are widely used in modern NLP engines. In the learning section, we will introduce widely used learning methods for NLP models, including supervised, semi-supervised, and unsupervised learning; multitask learning; transfer learning; and active learning. We view reasoning as a new and exciting direction for neural NLP, but it has yet to be well addressed. In the reasoning section, we will review reasoning mechanisms, including the knowledge, existing non-neural inference methods, and new neural inference methods. We emphasize the importance of reasoning in this paper because it is important for building interpretable and knowledge-driven neural NLP models to handle complex tasks. At the end of this paper, we will briefly outline our thoughts on the future directions of neural NLP.

    1. Introduction

As an important branch of artificial intelligence (AI), natural language processing (NLP) studies the interactions between humans and computers via natural language. It studies fundamental technologies for the meaning expression of words, phrases, sentences, and documents, and for syntactic and semantic processing such as word breaking, syntactic parsing, and semantic parsing, and it develops applications such as machine translation (MT), question-answering (QA), information retrieval, dialog, text generation, and recommendation systems. NLP is vital to search engines, customer support systems, business intelligence, and spoken assistants.

The history of NLP dates back to the 1950s. In the beginning of NLP research, rule-based methods were used to build NLP systems, including word/sentence analysis, QA, and MT. Such rules, edited by experts, were utilized in algorithms for various NLP tasks, starting from MT. Normally, designing rules required significant human effort. Furthermore, it is difficult to organize and manage rules when the number of rules is large. In the 1990s, along with the rapid development of the internet, large amounts of data became available, which enabled statistical learning methods to work on NLP tasks. With human-designed features, statistical learning models were learned using labeled/mined data. The statistical learning method brought significant improvements to many NLP tasks, typically in MT and search engine technology. In 2012, deep learning approaches were introduced to NLP following deep learning's success in object recognition with ImageNet [1] and in speech recognition with Switchboard [2]. Deep learning approaches quickly outperformed statistical learning methods with surprisingly better results. At present, the neural network-based NLP (referred to as "neural NLP" hereafter) framework has achieved new levels of quality and has become the dominant approach for NLP tasks, such as MT, machine reading comprehension (MRC), chatbots, and so forth. For example, the MT system from Microsoft achieved a human parity result on the Chinese-to-English news translation task of the workshop on MT (WMT) in 2017. R-NET and NLNet from Microsoft Research Asia (MSRA) achieved human-quality results on the Stanford Question Answering Dataset (SQuAD) evaluation task on both the exact match (EM) score and the fuzzy-match score (F1 score). Recently, pre-trained models such as generative pre-training (GPT) [3], bidirectional encoder representations from transformers (BERT) [4], and XLNet [5] have demonstrated strong capabilities in multiple NLP tasks. The neural NLP framework works well for supervised tasks in which there is abundant labeled data for learning neural models, but still performs poorly for low-resource tasks where there is limited or no labeled data.

This paper reviews the notable progress of the neural NLP framework in three categories of efforts: ① neural NLP modeling, aimed at designing appropriate network structures for different tasks; ② neural NLP learning, aimed at optimizing the model parameters; and ③ reasoning, aimed at generating answers to unseen questions by manipulating existing knowledge with inference techniques. Based on a deep analysis of current technologies and the challenges of each of these aspects, we seek to identify and sort out future directions that are critical to advancing NLP technology.

    2. Modeling

An NLP system consumes natural language sentences and generates a class type (for classification tasks), a sequence of labels (for sequence-labeling tasks), or another sentence (for QA, dialog, natural language generation, and MT). To apply neural NLP approaches, it is necessary to solve the following two key issues:

(1) Encode the natural language sentence (a sequence of words) in the neural network.

(2) Generate a sequence of labels or another natural language sentence.

From these two aspects, in this section, we will introduce several widely used neural NLP models, including word embedding, sentence embedding, and sequence-to-sequence modeling. Word embedding maps words in the input sentences into continuous space vectors. Based on the word embedding, complex networks such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and self-attention networks can be used for feature extraction, considering the context information of the whole sentence to build context-aware word embedding, or integrating all the information of the sentence to construct the sentence embedding. Context-aware word embedding can be used for sequential labeling tasks such as part-of-speech (POS) tagging and named-entity recognition (NER), and sentence embedding can be used for sentence-level tasks, such as sentiment analysis and paraphrase classification. Sentence embedding can also be used as the input to another RNN or self-attention network to generate another sequence, which forms the encoder-decoder framework for sequence-to-sequence modeling. Given an input sentence, sequence-to-sequence modeling can be used to generate an answer to a question (i.e., a QA task), or to produce a translation into another language (i.e., an MT task).

    2.1. Word embedding and sentence embedding

    Word/sentence embedding attempts to map words and sentences from a discrete space into a semantic space, in which the semantically similar words/sentences have similar embedding vectors.

    2.1.1. Context-independent word embedding

To map a word into a continuous semantic vector, Ref. [6] proposed the continuous bag-of-words (CBOW) and skip-gram models, based on which the implementation tool word2vec is used to learn word-embedding vectors with a large monolingual corpus. As shown in Fig. 1 [6], the CBOW model predicts the central word using its surrounding words in a window, while the skip-gram model predicts the surrounding words of the given word. These two models are designed based on the principle of "knowing a word by the company it keeps" [7]. In addition, to utilize the benefits of global co-occurrence statistics and meaningful linear substructures, Ref. [8] proposed a global log-bilinear regression model (GloVe) to learn word embedding.
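To make the skip-gram idea concrete, the minimal PyTorch sketch below learns embeddings on a toy corpus by predicting the context words around each center word with a full-softmax objective; the corpus, window size, and dimensions are illustrative choices, not settings from Ref. [6], and real word2vec training uses tricks such as negative sampling on far larger corpora.

```python
import torch
import torch.nn as nn

# Toy corpus; in practice word2vec/GloVe are trained on large monolingual corpora.
corpus = "an ant went to the river bank to build up a bank account".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}
ids = [word2id[w] for w in corpus]

window, dim = 2, 16
pairs = [(c, ids[j])                                  # (center word, context word)
         for i, c in enumerate(ids)
         for j in range(max(0, i - window), min(len(ids), i + window + 1))
         if j != i]

center_emb = nn.Embedding(len(vocab), dim)            # embeddings we want to learn
context_emb = nn.Embedding(len(vocab), dim)           # output ("context") embeddings
optimizer = torch.optim.Adam(list(center_emb.parameters()) +
                             list(context_emb.parameters()), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([o for _, o in pairs])
for _ in range(200):
    # Skip-gram: predict each context word from the center word.
    logits = center_emb(centers) @ context_emb.weight.t()  # (pairs, |V|)
    loss = loss_fn(logits, contexts)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```

After training, rows of center_emb.weight serve as the context-independent word vectors; CBOW simply reverses the prediction direction (context window predicts the center word).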

Word2vec and GloVe learn a constant embedding vector for a word; the embedding is the same for a word in different sentences. For example, we can learn an embedding vector for the word "bank," and the embedding remains the same, regardless of whether the word "bank" is used in the sentence "an ant went to the river bank" or in the sentence "that is a good way to build up a bank account." Intuitively, the embedding of the word "bank" in the first sentence should be different from that of "bank" in the second sentence. To deal with this issue, the context information of the sentence is used to predict a dynamic word embedding.

    2.1.2. RNN-based context-aware word embedding

ELMo [9] leverages a bidirectional recurrent neural network (in particular, the long short-term memory (LSTM) network) to model the context information, in which the word embedding is the concatenation of the hidden states of a forward RNN and a backward one, modeling the context on the left side and the right side, respectively. For example, given the input sentence "an ant went to the river bank," as shown in Fig. 2, the forward RNN first takes the first word "an" as the input and generates the first hidden state, which contains the information of the first word. When the second word "ant" is inputted, the RNN combines the information of the first hidden state and the second word to generate the second hidden state, which should contain the information of the first two words. When the word "bank" is inputted, the previous hidden state should contain all the previous information of "an ant went to the river." Taking it as context information, the new hidden state of "bank" contains dynamic information from the given sentence.
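The sketch below illustrates the idea only (it is not the actual ELMo implementation, which stacks character convolutions and multiple LSTM layers): a bidirectional LSTM runs over the token embeddings, and the forward and backward hidden states at each position are concatenated into a context-aware vector for that word. Sizes and inputs are made up.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim = 1000, 64, 64          # toy sizes, not ELMo's
embed = nn.Embedding(vocab_size, emb_dim)
bilstm = nn.LSTM(emb_dim, hid_dim, batch_first=True, bidirectional=True)

token_ids = torch.randint(0, vocab_size, (1, 7))     # e.g. "an ant went to the river bank"
x = embed(token_ids)                                 # (1, 7, emb_dim)
hidden, _ = bilstm(x)                                # (1, 7, 2 * hid_dim)

# hidden[:, i, :hid_dim] summarizes the left context of word i (forward LSTM),
# hidden[:, i, hid_dim:] summarizes its right context (backward LSTM);
# their concatenation is the context-aware embedding of word i.
bank_vector = hidden[0, 6]                           # dynamic embedding of "bank"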

    2.1.3. Self-attention-based context-aware word embedding

Ref. [3] proposed GPT, which leverages the self-attention network to train a multi-layer left-to-right language model. Compared with the RNN used in ELMo, which is also a left-to-right language model, the self-attention network used in GPT allows direct interaction between the current word and each previous word, which leads to better context representations. Ref. [4] proposed BERT, which leverages the self-attention network to jointly consider both the left and right context information in the sentence. Whereas the RNN processes the input sentence in a sequential order, from left to right or from right to left, as shown in Fig. 3, the self-attention network takes all the remaining words around the word "bank" as the context to build the context-aware word embedding, which is a weighted sum of the representations of all the words in the sentence, where the weights are computed by normalizing the similarity between the current word "bank" and all the words in the sentence. To consider the ordering information, a position index is also used to enrich the input by summing the word embedding and position embedding. To alleviate BERT's pretrain-fine-tune discrepancy issue, which means that artificial symbols such as [MASK] used by BERT during pre-training are absent from real data at fine-tuning time, Ref. [5] proposed XLNet, a pre-training method that enables the learning of bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order.
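A single-head sketch of this mechanism, with toy dimensions: every word attends to all words in the sentence, the weights come from normalized dot-product similarities, and position embeddings are summed into the input. Real GPT/BERT/XLNet models stack many multi-head layers with residual connections and layer normalization, so treat this as an illustration rather than their implementation.

```python
import math
import torch
import torch.nn as nn

vocab_size, max_len, d = 1000, 128, 64               # toy sizes
tok_emb = nn.Embedding(vocab_size, d)
pos_emb = nn.Embedding(max_len, d)                   # position index enriches the input
Wq, Wk, Wv = (nn.Linear(d, d, bias=False) for _ in range(3))

token_ids = torch.randint(0, vocab_size, (1, 7))
positions = torch.arange(7).unsqueeze(0)
x = tok_emb(token_ids) + pos_emb(positions)          # word embedding + position embedding

q, k, v = Wq(x), Wk(x), Wv(x)
scores = q @ k.transpose(1, 2) / math.sqrt(d)        # similarity of each word to all words
weights = torch.softmax(scores, dim=-1)              # normalized attention weights
context_aware = weights @ v                          # each row: weighted sum over the sentence
```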

    2.1.4. CNN-based context-aware word embedding

Fig. 1. Context-independent word-embedding methods [6]. CBOW: using the context words in a window to predict the central word. Skip-gram: using the central word to predict the context words in a window. W_t is the tth word in the sentence.

    Fig. 2. RNN-based context-aware word embedding.

Both ELMo and BERT can consider all the context information in the input sentence to generate the dynamic embedding for a given word. In contrast to using all the words as context, a CNN can be used to generate the dynamic embedding with only the surrounding words as context [10]. As shown in Fig. 4, the CNN uses a window to slide along the sequence of the input sentence. Using the embeddings of the words in the window, linear mappings (i.e., filters) are used to generate a representation vector to integrate the information of the input words. For example, to generate the dynamic embedding for the word "bank," a window with size 3 can be used to cover the span of "river bank," and the word "river" can be used to generate a disambiguating dynamic embedding for the word "bank."
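A minimal sketch of the windowed convolution idea with a kernel of size 3, so each output vector mixes a word with its immediate neighbors (e.g., "river" can disambiguate "bank"); the sizes below are illustrative.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, n_filters, seq_len = 1000, 64, 64, 7       # toy sizes
embed = nn.Embedding(vocab_size, emb_dim)
conv = nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)  # window of 3 words

token_ids = torch.randint(0, vocab_size, (1, seq_len))          # e.g. "... to the river bank"
x = embed(token_ids).transpose(1, 2)                 # (1, emb_dim, seq_len): channels first
local_context = torch.relu(conv(x)).transpose(1, 2)  # (1, seq_len, n_filters)
# local_context[0, i] mixes word i with its immediate neighbors, so "river" can
# disambiguate the embedding produced for "bank".
```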

    2.1.5. Sentence embedding

Based on the representation of each word in the input sentence, the sentence representation can be obtained via an RNN, a self-attention network, or a CNN. For the RNN, the last hidden state (the blue one) should contain all the information in the sentence by consuming the input words one by one. For the self-attention network, a sentence-ending symbol, <EOS>, can be added, and its hidden state (the blue one) can be used as the representation of the whole sentence. For the CNN, a max pooling layer can be used to select the maximum value for each dimension and generate one semantic vector (with the same size as the convolution layer output) to summarize the whole sentence, which is then processed by a feed-forward network (FFN) to generate the final sentence representation. The generated sentence embedding can be used in other tasks, such as predicting the sentiment class (i.e., positive or negative) or predicting another sequence (e.g., MT).
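These pooling options can be written in a few lines; the sketch below assumes the context-aware word vectors have already been produced by one of the encoders above and uses toy dimensions.

```python
import torch
import torch.nn as nn

d, ffn_dim, seq_len = 64, 128, 7                     # toy sizes
ffn = nn.Sequential(nn.Linear(d, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, d))

word_states = torch.randn(1, seq_len, d)             # context-aware word vectors from any encoder
rnn_sentence = word_states[:, -1]                    # RNN: last hidden state summarizes the sentence
max_pooled, _ = word_states.max(dim=1)               # CNN: max pooling over positions, per dimension
cnn_sentence = ffn(max_pooled)                       # followed by a feed-forward network
```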

    2.2. Sequence-to-sequence modeling

    2.2.1. Task of sequence-to-sequence modeling

Sequence-to-sequence modeling attempts to generate one sequence with another sequence as input. Many NLP tasks can be formulated as a sequence-to-sequence task, such as MT (i.e., given the source language word sequence, generate the target language word sequence), QA (i.e., given the word sequence of a question, generate the word sequence of an answer), and dialog (i.e., given the word sequence of user input, generate the word sequence of the response).

2.2.2. Encoder-decoder framework

Fig. 3. Self-attention-based context-aware word embedding. <EOS>: sentence-ending symbol.

    Fig. 4. CNN-based context-aware word embedding.

    Fig. 5. Encoder-decoder framework for MT from English to Chinese.

Ref. [11] proposed an encoder-decoder framework for sequence-to-sequence modeling. As shown in Fig. 5, the encoder-decoder framework contains two parts: an encoder and a decoder. The encoder is an RNN that encodes the input sentence into a semantic representation by consuming the words from left to right, one by one. The final hidden state should contain the information of all the words in the sentence, and is used as the context vector (i.e., representation) of the input sentence. Based on the context vector, another decoder RNN is used to generate the target sequence one word after another until the sentence-ending symbol (<EOS>) is generated. The decoder RNN takes the previous word, the previous hidden state, and the source sentence context vector as input to generate the current hidden state, which is used to predict the next target word.
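A compact GRU-based sketch of this framework, in which the final encoder hidden state serves as the single context vector that initializes the decoder and greedy decoding runs until an end-of-sentence id (assumed here to be 1) is produced; all sizes and token ids are illustrative.

```python
import torch
import torch.nn as nn

SRC_V, TGT_V, d = 1000, 1200, 128                    # toy vocabulary and hidden sizes
BOS, EOS = 0, 1                                      # assumed special token ids
src_emb, tgt_emb = nn.Embedding(SRC_V, d), nn.Embedding(TGT_V, d)
encoder = nn.GRU(d, d, batch_first=True)
decoder = nn.GRU(d, d, batch_first=True)
proj = nn.Linear(d, TGT_V)                           # predicts the next target word

src = torch.randint(0, SRC_V, (1, 9))                # a source sentence of 9 word ids
_, context = encoder(src_emb(src))                   # (1, 1, d): final hidden state = context vector

y, state, output = torch.tensor([[BOS]]), context, []
for _ in range(20):                                  # greedy decoding, one word at a time
    h, state = decoder(tgt_emb(y), state)            # previous word + previous hidden state
    y = proj(h[:, -1]).argmax(dim=-1, keepdim=True)
    if y.item() == EOS:                              # stop at the sentence-ending symbol
        break
    output.append(y.item())
```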

The original encoder-decoder framework has several drawbacks: ① Only the last hidden state is used to model the source sentence. With such a fixed-size vector, it is difficult to model any sentence in the source language. ② The information of previous words consumed by the encoder is difficult for the RNN cell to maintain long enough to influence the prediction of the target words. ③ It is difficult to use only one context vector to predict all the words in the target sentence.

    2.2.3. Attention-based encoder-decoder framework

In order to deal with these problems, the attention-based encoder-decoder framework [12] makes it possible for the neural network to pay attention to different parts of the input and to directly align the input sequence and the output result. As shown in Fig. 6, the attention mechanism leverages all the hidden states of the encoder and the previous decoder hidden state to compute a context vector, which is a weighted sum of the hidden states of the encoder, where the weights are computed and normalized by the similarity between the hidden states of the decoder and encoder. Together with the previous decoder hidden state and the previous target word, this context vector is used as the input to the decoder RNN to generate the next decoder hidden state. In this way, not only are all the encoder hidden states leveraged, but the decoder hidden state is also directly related to the corresponding encoder hidden states.
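The core computation at one decoder step can be sketched as follows; dot-product scoring is used for brevity, whereas Ref. [12] uses an additive (feed-forward) score, so treat this as an illustration of the weighting-and-summing idea rather than the exact model.

```python
import torch

def attention_context(decoder_state, encoder_states):
    """decoder_state: (d,); encoder_states: (src_len, d)."""
    scores = encoder_states @ decoder_state          # similarity to every source position
    weights = torch.softmax(scores, dim=0)           # normalized attention weights
    return weights @ encoder_states, weights         # weighted sum = context vector

d, src_len = 128, 9
context, weights = attention_context(torch.randn(d), torch.randn(src_len, d))
# `context` is fed, together with the previous target word and decoder hidden state,
# into the decoder RNN to produce the next hidden state.
```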

    2.2.4. All-attention-based encoder-decoder framework

To use the strong modeling capacity of self-attention, the Transformer [13] (as shown in Fig. 7) uses multi-head self-attention to replace the original attention mechanism and the RNN cells in the encoder and decoder. Multi-head attention is a combination of attention networks. By projecting each query, key, and value into N vectors with linear layers, N attention networks are used to generate N context vectors, which are concatenated into one context vector. For the decoder self-attention, only the previous hidden states are used to compute the decoder context vector, because future words cannot be used during inference.
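A sketch of masked multi-head self-attention on the decoder side using PyTorch's built-in module (assuming a recent PyTorch version that supports batch_first); the boolean upper-triangular mask enforces that position i cannot attend to future positions.

```python
import torch
import torch.nn as nn

d_model, n_heads, tgt_len = 64, 8, 5                 # toy sizes
self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(1, tgt_len, d_model)                 # decoder input representations
# Boolean upper-triangular mask: position i may not attend to positions > i (future words).
causal_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)
out, attn_weights = self_attn(x, x, x, attn_mask=causal_mask)
# Internally, the module projects queries/keys/values into n_heads sub-spaces, runs one
# attention per head, and concatenates the per-head context vectors into `out`.
```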

    Fig. 6. Attention-based encoder-decoder framework for MT from English to Chinese.

    Fig. 7. All-attention-based encoder-decoder framework for MT from English to Chinese.

    2.3. Summary

In this section, we introduced the network structures used to learn word embedding, sentence embedding, and sequence generation. In order to improve modeling for various NLP tasks, several directions still require further exploration:

· Prior knowledge modeling. Even though word-embedding methods that are trained with a huge amount of data can model certain kinds of commonsense knowledge [6], the problem of how to integrate linguistic prior knowledge, such as WordNet and HowNet, with specific NLP tasks should receive more attention [14].

· Document/multi-turn modeling. Leveraging sentence context by means of word embedding effectively improves the performance of various tasks, but the problem of how to model long-distance context, such as other sentences in the same document (or even in another document), is still an ongoing research topic [15]. For example, given the English sentence "the mouse is on the table," it is impossible to tell whether the word "mouse" should be translated into the Chinese "鼠标" (a computer device) or the Chinese "老鼠" (an animal) based on the information in the given sentence alone. To perform semantic disambiguation, the context of the document should be leveraged. Similarly, the problem of how to better model the context information in multi-turn tasks such as chatbots and dialog systems [16] is still challenging.

· Non-autoregressive generation. The current sequence-to-sequence model generates the output sentence one word at a time in an autoregressive way, meaning that the word generated in the previous time-step is used as the input to generate the next word. Such an autoregressive generation process leads to the exposure bias problem, in which a mistake made in previous steps is amplified in subsequent steps. To deal with this problem, non-autoregressive structures have been proposed, although they currently show a performance drop compared with the autoregressive structure [17]. More attention should be paid to developing better non-autoregressive structures in the future.

    3. Learning

New and efficient training algorithms have been proposed to optimize the large number of parameters in deep learning models. To train a neural network, stochastic gradient descent (SGD) [18] is often used, which is usually based on back-propagation methods [19]. Momentum-based SGD has been proposed in order to introduce momentum to speed up the training process. The AdaGrad [20], AdaDelta [21], Adam [22], and RMSProp methods use different learning rates for different parameters, which further improves the efficiency and stabilizes the training process. When the model is very complex, parallel training methods are used to leverage many computing devices, even hundreds or thousands of them (central processing units, graphics processing units, or field-programmable gate arrays). Depending on whether the parameters are updated synchronously or not, distributed training methods can be grouped into synchronous SGD and asynchronous SGD.

In addition to the progress that has been achieved in general optimization methods, better training methods have been proposed for specific NLP tasks. When large amounts of training data are available for rich-resource tasks, deep learning models achieve very good performance using supervised learning methods. For some specific tasks, such as MT for language pairs with a large volume of parallel data such as English-Chinese, neural models do a good job, sometimes achieving human parity results in the shared tasks. For many NLP tasks, however, it is difficult to acquire large amounts of labeled data. Such tasks are often referred to as low-resource tasks, including MT and sentiment analysis for rare languages. By using unlabeled data to enhance the models trained with a small amount of labeled data, semi-supervised learning methods can be used. Without any labeled data, unsupervised learning methods can be leveraged to learn NLP models. Another way to leverage unlabeled data is to pre-train models, which are then transferred to specific tasks with transfer learning. Instead of leveraging in-task labeled data, labeled data from other tasks can also be used with the help of multitask learning. If there is no data that can be used, human annotators can be introduced to create training data using active learning, in order to maximize the model performance within a given budget.

    3.1. Supervised learning

Supervised training of the sequence-to-sequence model is typically based on maximum likelihood estimation. Given a source sentence X and its target sequence Y = (y_1, y_2, ..., y_T), the likelihood of the target sequence is

P(Y|X) = ∏_{i=1}^{T} p_θ(y_i | y_{i-1}, ..., y_1, X)

where p_θ(y_i | y_{i-1}, ..., y_1, X) is the softmax layer output of the decoder in the sequence-to-sequence model. Based on the likelihood, the log-likelihood loss function is defined as follows:

L(θ) = -∑_{i=1}^{T} log p_θ(y_i | y_{i-1}, ..., y_1, X)

Based on this loss function, training algorithms (such as Adam or AdaDelta) can be used to optimize the parameters. Instead of only maximizing the generation probability of the golden target, in order to consider the task-specific error in the training process, Ref. [23] proposed minimum-risk (maximum bilingual evaluation understudy (BLEU) [24]) training for the sequence-to-sequence generation model. The model is first optimized using the cross-entropy loss with the bilingual training corpus, and then fine-tuned by maximizing the expected BLEU of the generated translation candidates (where BLEU is the evaluation metric for MT, which measures the n-gram accuracy of the generated candidates against a human-made reference). To deal with the exposure bias problem caused by the autoregressive decoding of the sequence-to-sequence model from left to right, Ref. [25] introduced two Kullback-Leibler (KL) divergences into the training objective to maximize the agreement between the candidates generated by left-to-right and right-to-left models. A deliberation network [26] is a method to refine the translation candidate based on two-pass decoding that simulates the human translation process; the first pass generates the initial translation and the second pass refines it. In order to deal with the exposure bias problem mentioned in Section 2.3, Ref. [27] proposed sampling the context words not only from the ground truth sequence, but also from the predicted sequence, in order to bridge the gap between the training and inference of MT models.
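As a reference point for the objectives discussed above, the sketch below shows one plain maximum-likelihood (cross-entropy) update with teacher forcing; `model` is a hypothetical sequence-to-sequence network that returns next-word logits, and the minimum-risk, agreement-based, and sampling-based objectives modify or replace this basic loss.

```python
import torch
import torch.nn.functional as F

def supervised_step(model, optimizer, src, tgt, pad_id=0):
    """One maximum-likelihood update with teacher forcing. `model(src, tgt_in)` is a
    hypothetical sequence-to-sequence network returning logits of shape
    (batch, tgt_len, vocab) over the next target word."""
    tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]        # feed the gold prefix, predict the next word
    logits = model(src, tgt_in)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tgt_out.reshape(-1),
                           ignore_index=pad_id)      # = -sum_i log p(y_i | y_<i, X)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```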

    3.2. Semi-supervised and unsupervised learning

Semi-supervised and unsupervised learning use unlabeled data to improve the model performance. For semi-supervised learning, the model is usually first trained with labeled data; next, it is fine-tuned with the help of unlabeled data. There are many semi-supervised learning methods, such as self-learning, generative methods, and graph-based methods [28]. In these methods, pseudo data generated by models are usually used to fine-tune the model itself. In neural machine translation (NMT), to control the errors or noise generated in semi-supervised learning, weights/rewards are usually leveraged to filter out bad translation candidates; examples of such weights/rewards include the expected BLEU method [29] and the dual-learning approach [30]. To utilize unlabeled data to improve the performance of the sequence-to-sequence model, back-translation [31] uses a reverse translation model to translate the target monolingual data in order to build a pseudo bilingual corpus, which is used to fine-tune the source-to-target (S2T) model. The joint training method (as shown in Fig. 8) extends the back-translation method to iteratively boost the S2T and target-to-source (T2S) translation models in a unified generalized expectation-maximization framework by leveraging both the source and target monolingual corpora [32]. The bilingual corpus is used to train the NMT models first, including the S2T and T2S models. With the source monolingual data, the S2T model is used to generate pseudo data to fine-tune the T2S model, and the target monolingual data is used to fine-tune the S2T model. This training process is iterated until the performance on hold-out data no longer improves.
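The joint training loop can be summarized schematically as below; `fine_tune`, `translate`, and `bleu_on_holdout` are hypothetical callables standing in for supervised training, beam-search decoding, and hold-out evaluation, so this is only an outline of the procedure in Fig. 8, not a specific toolkit's API.

```python
def joint_training(s2t, t2s, bitext, mono_src, mono_tgt,
                   fine_tune, translate, bleu_on_holdout, max_epochs=10):
    """Iteratively boost the S2T and T2S NMT models with monolingual data.
    `fine_tune(model, pairs)`, `translate(model, sentence)`, and `bleu_on_holdout(model)`
    are hypothetical helpers supplied by the surrounding training code."""
    def swap(pairs):
        return [(y, x) for x, y in pairs]

    s2t, t2s = fine_tune(s2t, bitext), fine_tune(t2s, swap(bitext))  # warm-up on real bitext
    best = bleu_on_holdout(s2t)
    for _ in range(max_epochs):
        # S2T translates source monolingual data -> pseudo (Y', X) pairs to train T2S;
        # T2S translates target monolingual data -> pseudo (X', Y) pairs to train S2T.
        pseudo_for_t2s = [(translate(s2t, x), x) for x in mono_src]
        pseudo_for_s2t = [(translate(t2s, y), y) for y in mono_tgt]
        t2s = fine_tune(t2s, swap(bitext) + pseudo_for_t2s)
        s2t = fine_tune(s2t, bitext + pseudo_for_s2t)
        score = bleu_on_holdout(s2t)
        if score <= best:                            # stop when hold-out performance stops improving
            break
        best = score
    return s2t, t2s
```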

    Fig. 8. Joint training of S2T and T2S NMT models. m means the mth epoch; Y′ is the translation of the source monolingual data X; X′ is the translation of the target monolingual data Y.

Leveraging deep learning techniques, deep generative models have been proposed for unsupervised learning, such as the variational auto-encoder (VAE) [33] and generative adversarial networks (GANs) [34]. The VAE follows the auto-encoder framework, in which there is an encoder to map the input to a semantic vector, and a decoder to reconstruct the input. Unlike the original auto-encoder methods, a VAE assumes that the distribution of the generated semantic vectors should be as close as possible to a standard normal distribution. Similar to VAEs, GANs also have two parts: a generator, which uses a given semantic vector to generate the output, and a discriminator, which tries to distinguish between the generated samples and real samples. With an adversarial training loss function, the generator tries to output samples that are similar to real ones in order to fool the discriminator, while the discriminator tries to distinguish between the real and fake samples. A great deal of research work has attempted to apply VAEs and GANs to natural language generation tasks [35,36].

Without using a bilingual corpus, but only a small dictionary and a large monolingual corpus, unsupervised learning methods can be used for MT [37]. This method uses the joint training approach to boost the S2T and T2S translation models jointly with the source and target monolingual corpora by generating a pseudo bilingual corpus. As there is no real bilingual data, the generated pseudo data may contain errors and noise, which will be reinforced in the subsequent iterative training process when such examples are used for model training. To deal with this problem, Ref. [38] introduced statistical machine translation (SMT) models as posterior regularization (PR) to filter this noise. The SMT and NMT models are optimized jointly and boost each other incrementally in a unified expectation-maximization framework. The whole procedure of this method consists of two parts (as shown in Fig. 9 [38]): model initialization and unsupervised NMT with SMT as PR. Given a language pair X-Y, two initial SMT models are first built with language models pre-trained using monolingual data and word translation tables inferred from cross-lingual embeddings. Then the initial SMT models generate pseudo data to warm up two NMT models. The NMT models are trained using not only the pseudo data generated by the SMT models, but also the data generated by the reverse NMT models with the joint training method. After that, the NMT-generated pseudo data is fed to the SMT models. As PR, the SMT models filter out noise and infrequent errors by constructing strong phrase tables with good and frequent translation patterns, and then generate denoised pseudo data to guide the subsequent NMT training. Benefiting from this process, NMT then produces better pseudo data for SMT to extract phrases of higher quality, while compensating for the deficiency in smoothness inherent in SMT via back-translation. The NMT and SMT models boost each other until final convergence. Compared with the work presented in Ref. [37], this method significantly improves the translation results on four translation tasks, with gains of 1.4 BLEU points from French to English, 3.5 BLEU points from English to French, 3.1 BLEU points from German to English, and 2.2 BLEU points from English to German. Ref. [39] further introduced GPT methods to the unsupervised NMT task and proposed cross-lingual language model pre-training to achieve new state-of-the-art performance.

    3.3. Multitask learning

Multitask learning attempts to leverage information from other related tasks to improve the performance on the desired task. When a large amount of training data for the desired task is not available, a training corpus of related tasks can be introduced using a multitask learning approach. Ref. [10] proposed a unified neural network architecture for various NLP tasks, including POS tagging, chunking, named-entity recognition, and semantic role labeling (SRL). This method learns internal representations based on vast amounts of mostly unlabeled data. This work was a milestone in learning features automatically with a neural network for NLP, and it inspired the subsequent deep learning trend in the field of NLP.

Ref. [40] proposed treating ten NLP tasks, including QA, MT, summarization, natural language inference, and so forth, as QA tasks, and built a multitask question-answering network (MQAN) to model them in a unified framework, as shown in Fig. 10 [40]. The inputs of the MQAN are a question and a context document. This input format is natural for the original QA task. For the MT task, the question is roughly "What is the translation from language X to language Y?" and the context is the source language sentence. For summarization, the question is "What is the summary?" and the context is the document to be summarized. The question and the context are encoded with BiLSTMs, followed by dual co-attention to build conditional representations for both sequences. These conditional representations are processed with another two BiLSTMs, followed by two self-attention networks and two BiLSTMs, to obtain the final encoding representations of the question and context. To generate the output, an attention mechanism is leveraged to focus on the necessary encoding hidden states, and a multi-pointer generator is used to decide whether to copy from the question and context or to generate a new word. This model achieved a state-of-the-art result on WikiSQL with a 72.4% EM and an 80.4% execution accuracy. With multitask learning, the MQAN can lead to better generalization for zero-shot learning on the zero-shot relation extraction task on the QA-ZRE dataset, with a gain of 11 F1 points over the best single-task models. Ref. [41] proposed MT-DNN, a multitask learning-based deep neural network (DNN) built on BERT. By adding more specific tasks into the pre-training, MT-DNN obtains very good results on ten natural language understanding (NLU) tasks, including SNLI, SciTail, and eight out of nine GLUE [42] tasks. The effectiveness of MT-DNN also shows that different but related tasks can boost each other via multitask learning.

Fig. 9. Illustration of the unsupervised NMT training [38].

Fig. 10. Network structure of the MQAN [40]. α is the attention weights; γ and λ are the scalars to switch the output distributions.

    3.4. Pre-trained models and transfer learning

Task-unspecific models are pre-trained first, and can then be transferred to specific tasks with a fine-tuning process. With pre-trained word embedding or sentence embedding, transfer learning can be used on top of them to fine-tune task-specific models [43]. In recent years, many pre-trained models have been proposed, such as word2vec [6,8], ELMo [9], GPT [3], BERT [4], and XLNet [5], as introduced in Section 2.1, which are commonly used in NLP tasks such as MRC and QA.

Zero-shot and one-shot transfer learning have been explored with no or only a few labeled samples. Ref. [44] proposed a zero-shot transfer learning method for text classification, in which the model is trained to learn the relationships between sentences and categories with a large dataset, and is then transferred to new categories with no training data. Ref. [45] used semantic parsing to map natural language explanations of classifying concepts to formal constraints relating features grounded in the observed attributes of unlabeled data. Such constraints are combined using PR to yield a classifier for the zero-shot classification task. For some tasks in which there is a large amount of training data in a rich language (e.g., English), but little or no data in other languages (e.g., Romanian), cross-lingual transfer learning can be used to transfer the model trained in the rich language to a model for the rare language. Multitask learning over different MT language pairs can be transferred to enable zero-shot translation for language pairs with no bilingual corpus [46].

Another direction for transferring a pre-trained model to a wide variety of new and unseen situations is "learning to learn," also known as meta learning. First proposed by Ref. [47], meta learning has recently become a hot topic, and is used for pre-trained model transfer, hyper-parameter tuning, and neural network optimization. By directly optimizing an initial model that can be effectively transferred with the help of a few examples, model-agnostic meta learning (MAML) [48] has been proposed to learn a task-agnostic model. The learned model can be quickly adapted (with a small number of update steps) to several related tasks. Ref. [49] leveraged MAML to optimize NMT models using 18 European languages. By treating each example as a unique pseudo-task, the original structured query-generation task is reduced to a few-shot problem for which meta learning is used, bringing significant performance improvement.

Ref. [50] introduced an effective multitask learning framework to train a transferable sentence representation by combining a set of objectives, including multi-lingual NMT, natural language inference, constituency parsing, and skip-thought vectors (i.e., predicting the previous or following sentence); the learned representation can be transferred to a low-complexity classifier on a new task. By switching between different tasks, the gated recurrent unit (GRU)-based RNN network can learn different aspects of the sentence. With the NMT and parsing tasks, syntactic properties can be better encoded. Sentence characteristics such as length and word order have been found to be better encoded by parsing. The learned representations are visualized using dimensionality reduction and nearest-neighbor exploration on three different datasets (movie reviews, question type classification, and Wikipedia classification), and sentences are found to be clustered reasonably well according to their labels. The learned representations are used as features (without parameter updating), and transferred to sentiment-classification tasks (movie reviews, product reviews, subjectivity/objectivity classification) with gains of 1.1%-2.0%, question type classification with gains of 6%, and paraphrase identification (Microsoft Research Paraphrase Corpus) with gains of 2.3%.

    3.5. Active learning

Active learning interactively queries the user in order to selectively label data, with the aim of maximizing the performance gain while minimizing the labeling cost. To deal with the low-resource problem, one straightforward method is to label more data, which presents the challenge of identifying which data should be labeled in order to improve the model performance most. To deal with this problem, active learning methods [51] can be used to automatically and iteratively select useful instances to be labeled so as to maximize the model performance. With the labeled data, the learning model can be trained and applied to the unlabeled data. Based on the labeled results, the active selection model can predict the best selection for the editors to label, based on signals such as uncertainty. The newly labeled data can be used to obtain a better learning model, and this process can be iterated until an acceptable performance is achieved.
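A schematic uncertainty-sampling loop for this procedure is sketched below; `train`, `predict_proba`, and `human_label` are hypothetical callables standing in for model training, class-probability prediction, and annotation by editors, so only the selection logic is concrete.

```python
import math

def entropy(probs):
    """Predictive entropy as an uncertainty signal."""
    return -sum(p * math.log(p + 1e-12) for p in probs)

def active_learning(labeled, unlabeled, budget, train, predict_proba, human_label,
                    batch_size=100):
    """Iteratively select the most uncertain unlabeled examples for annotation.
    `train(pairs)`, `predict_proba(model, x)`, and `human_label(x)` are hypothetical
    helpers for model training, probability prediction, and editor annotation."""
    model = train(labeled)
    while budget > 0 and unlabeled:
        # Rank the unlabeled pool by uncertainty under the current model.
        ranked = sorted(unlabeled, key=lambda x: entropy(predict_proba(model, x)),
                        reverse=True)
        batch = ranked[:min(batch_size, budget)]
        labeled = labeled + [(x, human_label(x)) for x in batch]   # editors label the batch
        unlabeled = [x for x in unlabeled if x not in batch]
        budget -= len(batch)
        model = train(labeled)                                     # retrain on the enlarged set
    return model
```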

    3.6. Summary

In this section, we reviewed several typical training methods, including supervised, semi-supervised, and unsupervised learning; multitask learning; transfer learning; and active learning. When enough labeled data is available, supervised methods can achieve very good performance on many NLP tasks, such as MT and MRC. For other tasks, where there is insufficient labeled data, we have introduced several learning methods to enhance the model performance, including semi-supervised and unsupervised learning with unlabeled data, transfer learning with pre-trained models, multitask learning with labeled data from other tasks, and active learning to annotate the most valuable samples. To improve the performance of NLP models, we think that more research should be conducted on the following topics:

    · The topological training process. Although multitask learning and transfer learning can leverage the data from related tasks to enhance models for a desired task, the relationship between various NLP tasks should be further studied. For instance, a topological training process should be developed in the future, based on which the pre-trained models on fundamental tasks (e.g., language models, POS, and parsing) can be better transferred to higher-level tasks (e.g., MT and QA).

· Reinforcement learning (RL). RL has also been explored in many NLP model training cases, but it seems that it has not achieved satisfactory results. For example, in a dialog system for customer service, the error or loss is not available in each turn of the dialog session. The only information is whether the tickets are booked successfully or not, or how many turns are used. For such scenarios, where only a long-term reward is available, RL can be used to learn a policy network to maximize the expected reward. However, the RL process still suffers from the exponential search space with respect to the length of the natural language sentence [52].

· GANs. Even though many research efforts have tried to apply GANs to NLP tasks, such as MT [53] and natural language generation [54], there are several challenges. It is difficult for the discriminator to pass the gradient signal to the generator directly, as it does in image and speech processing, due to the discrete output of the generator. In addition, GANs are very sensitive to random initialization and small deviations from the best hyper-parameter choice [36]. Such challenges amplify the difficulty of GAN training.

    4. Reasoning

Neural approaches have achieved good progress on many NLP tasks, such as MT and MRC. However, they still have some unsolved problems. For example, most neural network models behave like a black box, which never tells how and why a system has solved a problem in the way it did. Besides, for tasks such as QA and dialog systems, only knowing the literal meanings of input utterances is often not enough. To generate the right responses, external and/or context knowledge may also be needed. To build such interpretable and knowledge-driven systems, reasoning is necessary.

    In this paper, we define reasoning as a mechanism that can generate answers to unseen questions by manipulating existing knowledge with inference techniques. Based on this definition, a reasoning system (Fig. 11) should have two components:

· Knowledge, such as a knowledge graph, common sense, rules, assertions extracted from raw texts, etc.;

· An inference engine, to generate answers to questions by manipulating existing knowledge.

    Next, we use two examples to illustrate why reasoning is important to NLP tasks.

The first example is a knowledge-based QA task. Given the question "When was Bill Gates' wife born?", the QA model must parse it into the logical form for answer generation: λxλy.DateOfBirth(y, x) ∧ Spouse(Bill Gates, y), where knowledge graph-based reasoning is needed. Starting with this question, new questions can be further appended to this context, such as "What's his/her job?" To answer such context-aware questions, co-reference resolution determines which person is meant by "his/her." This is also a reasoning procedure, which needs the commonsense knowledge that "his" can only refer to men and "her" can only refer to women.

The second example is a dialog task. For example, if a user says "I am very hungry now," it would be more suitable to reply "Let me recommend some good restaurants to you" instead of "Let me recommend some good movies to you." This also requires reasoning, as the dialog system should know that being hungry leads to actions such as looking for restaurants instead of watching films.

In the remainder of this section, we will first introduce two types of knowledge: the knowledge graph and common sense. Next, we will describe typical inference approaches, which have been or are being studied in the NLP area.

    Fig. 11. Overview of a reasoning system. ILP: integer linear programming; MLN: Markov logic network.

    4.1. Knowledge

Knowledge plays an important role in reasoning-driven NLP tasks. It may refer to any information (e.g., dictionaries, rules, knowledge graphs, annotations of specific tasks) that can guide NLP systems to complete specific tasks. Here, we focus on two types of knowledge for reasoning: the knowledge graph and common sense.

    4.1.1. The knowledge graph

A knowledge graph is a directed graph {V, E}, which consists of nodes V and edges E. Each node v ∈ V denotes an entity. Each edge e ∈ E denotes a predicate between the two entities it connects. Each triple <v1, e, v2> denotes a fact, where v1, v2 ∈ V and e ∈ E.

For example, <Microsoft, Founder, Bill Gates> is a knowledge triple from a knowledge graph, where Microsoft is the subject entity, Bill Gates is the object entity, and Founder is the predicate indicating that Bill Gates is the founder of Microsoft. A knowledge graph is essential to those NLP tasks in which parsing natural language into machine-executable structured queries is indispensable, such as semantic search, knowledge-based QA, and task-oriented dialog.

    The approaches of knowledge graph construction can be grouped into three categories:

    · Handcrafted methods. In these methods, knowledge graphs are manually constructed by human experts. Knowledge graphs built in this way, such as WordNet [55], are usually of high quality, but cover limited facts. In addition, the cost of maintenance is usually very high.

· Crowdsourcing methods. In these methods, community members construct knowledge graphs in a collaborative way. Compared with handcrafted methods, knowledge graphs built in this way, such as DBpedia [56], Freebase [57], and WikiData [58], are large scale and high quality.

· Information-extraction methods. These methods extract structured knowledge from web documents. KnowItAll [59], YAGO [60], and NELL [61] are representatives of this approach. Compared with the first two approaches, these methods can extract more knowledge from free texts; however, they also bring a lot of noise into the resulting knowledge graphs.

    4.1.2. Common sense

Common sense refers to common knowledge of things in the world, such as their properties, relationships, and interactions, which all humans are expected to know. Such knowledge is mostly location-, language-, and culture-independent, and is rarely explicitly expressed in texts.

For example, "a father is a male" is property common sense, "the sun is in the sky" is spatial common sense, and "plants grow from seeds" is procedural common sense.

Building a commonsense knowledge base (CKB) is difficult. At present, there are three major methods (similar to those used for the knowledge graph), but all suffer from significant problems.

· Handcrafted methods. CYC [62] is a CKB built by human experts in this way, which focuses on things that are rarely written down or said, such as "every human has exactly one father and exactly one mother." The goal of CYC is to enable AI systems to perform human-like reasoning and to be less brittle when confronting unseen situations. The sizes of CKBs such as CYC are limited, as human labeling is expensive.

· Crowdsourcing methods. ConceptNet [63] is a CKB built by absorbing knowledge from knowledge graphs such as WordNet, Wiktionary, Wikipedia, DBpedia, and Freebase. Compared with CYC, ConceptNet covers many more assertions. However, a large portion of the ConceptNet assertions is actually not common sense, as it comes from existing knowledge bases and is about "named entities," such as Bill Gates and the White House, instead of "common nouns," such as human and building.

· Information-extraction methods. WebChild [64] is a CKB containing commonsense assertions that connect nouns with adjectives via fine-grained relations such as hasShape and hasTaste. Such assertions are extracted from web documents based on seed assertions from WordNet. Due to extraction errors, WebChild contains a great deal of noise. It now covers more than four million commonsense assertions.

    4.2. Inference engine

Although this paper focuses on neural NLP methods, we will still start from two typical non-neural inference methods: integer linear programming (ILP) and Markov logic networks (MLNs), as they have been studied a great deal and have already been used in some NLP tasks. After that, we will introduce memory networks as a typical neural inference method. We will also describe their applications in two reasoning tasks: semantic parsing and response generation, which are based on the knowledge graph and common sense, respectively.

    4.2.1. Non-neural inference methods: ILP and MLN

ILP is an optimization framework that maximizes a linear objective function over a finite set of variables x, subject to a set of linear inequality constraints:

maximize w^T x
subject to Ax ≤ b, x ∈ {0, 1}^n

where w contains the weights of the objective; A and b are pre-defined parameters in the constraints; and n is the number of variables.

The constraints used in ILP can be considered as prior knowledge, and the optimization procedure reasons out a global prediction by incorporating the outputs of learned models under such prior knowledge. This method can be used in the information-extraction task [65], whose objective is to recognize named entities and their relationships from texts. Fig. 12 gives an example.

Usually, an NER model and a relation extractor (RE) are trained separately and used in this task in a cascaded manner. First, the NER model detects entity mentions in the input and assigns possible types to each mention. Then, the RE assigns possible relations to each entity pair. Five prediction tables are generated for the three named entities and two relations, as shown in Fig. 13.

If the local predictions with the highest probabilities are always chosen, errors will occur. For example, the type of Brooklyn will be recognized as Person, and the relation between Adam and Anne will be classified as PlaceOfBirth. However, if it is known that the object type of PlaceOfBirth should be Location instead of Person, such errors can be avoided. Inference with such prior knowledge is a reasoning procedure, which can be done by ILP.

In order to obtain correct predictions based on the above five local prediction tables, we formulate the task as an ILP problem and manually define the following four constraints: ① Each entity mention can be assigned an entity type only once. ② Each entity pair can be assigned a relation only once. ③ The type assignment to each entity mention should be consistent with the assignments to its neighboring relations. For example, when Anne is tagged as Person, if the type of Brooklyn is also recognized as Person, then the relation assignment between Anne and Brooklyn cannot be PlaceOfBirth, as its object entity type should be Location, which is inconsistent with Person. ④ The relation assignment to each entity pair should be consistent with the assignments to these two entities. Based on the local NER and RE outputs and these four constraints, ILP can obtain the globally optimized output (Fig. 14).

This time, the type of Brooklyn is recognized as Location, as it is consistent with the relation (i.e., PlaceOfBirth) assigned to Anne and Brooklyn. Similarly, the relation between Adam and Anne can be correctly recognized as SpouseOf.
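For a problem of this toy size, the same global inference can be shown with brute-force search instead of an ILP solver: the sketch below scores every joint assignment, discards those violating the type constraints, and recovers the corrected output. The local scores are made up for illustration and would come from the NER and RE models in practice.

```python
from itertools import product

# Made-up local model scores (log-probabilities would be used in practice).
entity_scores = {
    "Adam":     {"Person": 0.9, "Location": 0.1},
    "Anne":     {"Person": 0.8, "Location": 0.2},
    "Brooklyn": {"Person": 0.6, "Location": 0.4},   # locally wrong: Person scores higher
}
relation_scores = {
    ("Adam", "Anne"):     {"SpouseOf": 0.4, "PlaceOfBirth": 0.5, "None": 0.1},
    ("Anne", "Brooklyn"): {"SpouseOf": 0.1, "PlaceOfBirth": 0.7, "None": 0.2},
}
# Prior knowledge used as constraints: required argument types for each relation.
relation_types = {"SpouseOf": ("Person", "Person"), "PlaceOfBirth": ("Person", "Location")}

def consistent(types, relations):
    """Constraints ③ and ④: relation assignments must agree with entity types."""
    for (subj, obj), rel in relations.items():
        if rel != "None" and relation_types[rel] != (types[subj], types[obj]):
            return False
    return True

best, best_score = None, float("-inf")
entities, pairs = list(entity_scores), list(relation_scores)
for etypes in product(*[entity_scores[e] for e in entities]):       # one type per entity (①)
    for rels in product(*[relation_scores[p] for p in pairs]):      # one relation per pair (②)
        types = dict(zip(entities, etypes))
        relations = dict(zip(pairs, rels))
        if not consistent(types, relations):
            continue
        score = (sum(entity_scores[e][types[e]] for e in entities)
                 + sum(relation_scores[p][relations[p]] for p in pairs))
        if score > best_score:
            best, best_score = (types, relations), score

print(best)  # Brooklyn -> Location; (Adam, Anne) -> SpouseOf; (Anne, Brooklyn) -> PlaceOfBirth
```

ILP performs the same constrained maximization, but scales to the much larger assignment spaces found in real extraction tasks.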

ILP can be used in many NLP tasks, such as QA [66-68] and semantic role labeling [69,70]. It is especially useful when the size of the training data is small, as prior knowledge (used as constraints in ILP) can play a more important role in such scenarios. However, this framework is still independent of existing neural network methods.

An MLN [71] L = {(F_l, w_l)} is defined as a set of pairs, where each F_l is a first-order logic formula and w_l is the weight of F_l. It combines probability and first-order logic in a unified model and can generate a Markov network M_{L,C} by grounding all variables in the formulas to constants C = {c_1, ..., c_|C|}. The basic idea of an MLN is to soften the hard constraints of first-order logic: when a world violates one formula in the knowledge base, it becomes less probable, but not impossible.

    Fig. 12. An example of an information-extraction task.

Fig. 13. Sub-optimal results without using ILP. E: entities occurring in the sentence; R: relations between entities.

    Fig. 14. Optimal results when using ILP.

Ideally, an MLN can be applied to reasoning-required NLP tasks via the following steps. Here, we use the famous "friends and smokers" case as an example:

(1) Given a world described in natural language, parse it into a set of first-order logic formulas F = {F_1, ..., F_|F|}. For example, given two sentences:

    Smoking causes cancer

    Friends have similar smoking habits

The sentences are parsed into two first-order logic formulas:

∀x Smokes(x) ⇒ Cancer(x)

∀x ∀y Friends(x, y) ⇒ (Smokes(x) ⇔ Smokes(y))

In practice, each formula has a weight, which should be learned from existing relational databases by algorithms such as the pseudo-likelihood algorithm [72].

(2) Given L and a set of constants C, a Markov network M_{L,C} is defined as follows: ① M_{L,C} contains one binary node for each possible grounding of each predicate in L. The value of the node is 1 if the ground atom is true, and 0 otherwise. ② M_{L,C} contains one feature for each possible grounding of each formula F_i in L. The value of this feature is 1 if the ground formula is true, and 0 otherwise. The weight of the feature is the w_i associated with F_i in L. For example, given two constants Anna (A) and Bob (B), the ground M_{L,C} is described in Fig. 15.

(3) Given a set of evidences, such as Friends(A, B) = 1 and Cancer(A) = 1, infer the most likely state of the world, such as Smokes(A) = ?, Smokes(B) = ?, and Cancer(B) = ?. This inference procedure can answer questions such as "What is the probability that Anna smokes, given that Anna has cancer?" In practice, algorithms such as MC-SAT [66] can be used to solve such inference tasks.

As a statistical relational learning approach, the MLN has already been applied to some NLP tasks, such as semantic parsing [73], information extraction [74], and entity disambiguation [75]. However, it is still difficult to apply it to real tasks, as the size of the ground Markov network is usually very large, which raises computational issues. For example, based on the analysis in Ref. [76], the size of the corresponding ground Markov network is exponential in |C|, the number of constants. If the constants come from a practical knowledge base (such as Freebase, which contains 39 million entities), the inference complexity in the resulting ground Markov network will be enormous. In addition, similar to ILP, the MLN is independent of neural models. The necessity and design of neural versions of the MLN are still worthy of discussion and study.

    4.2.2. Neural inference methods: MemNN and its variants

    A memory network (MemNN) [77-79] is a neural network architecture in which reasoning can be supported by processing knowledge stored in a long-term memory component multiple times before outputting a result.

We take the key-value memory network (KV-MemNN) proposed by Ref. [79] as an example to show how this framework can support reasoning. Fig. 16 [79] gives an overview of the KV-MemNN.

    Fig. 15. An example of a Markov network.

In the KV-MemNN, knowledge is stored in the memory as key-value pairs (k_h, v_h). In each hop, the question vector is used to address the keys by computing normalized similarities p_h = softmax(q · AΦ_K(k_h)), the values are read as a weighted sum o = Σ_h p_h AΦ_V(v_h), and the question representation is updated as q_2 = R_1(q + o), where q = AΦ_X(x) is the question vector and R_1 is a d×d matrix. This procedure is repeated H times using different R_j, and the final vector q_{H+1} is used to predict outputs.
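A sketch of this hop computation with toy dimensions and random memories: keys are addressed with a softmax over similarities to the question vector, values are read as a weighted sum, and the question representation is updated with the hop matrix R_j. The feature maps Φ are stand-ins (random features here), not the actual featurization used in Ref. [79].

```python
import torch
import torch.nn as nn

d, feat_dim, n_memories, H = 64, 1000, 50, 2         # toy sizes; H = number of hops
A = nn.Linear(feat_dim, d, bias=False)               # plays the role of the matrix A
R = nn.ModuleList(nn.Linear(d, d, bias=False) for _ in range(H))  # one R_j per hop

def hop(q, keys, values, R_j):
    """q: (d,); keys/values: (n_memories, d), already embedded with A."""
    weights = torch.softmax(keys @ q, dim=0)          # key addressing
    o = weights @ values                              # value reading: weighted sum
    return R_j(q + o)                                 # update the question representation

x_feat = torch.randn(feat_dim)                        # stands in for Phi_X(x)
q = A(x_feat)                                         # q = A * Phi_X(x)
keys = A(torch.randn(n_memories, feat_dim))           # A * Phi_K(k_h)
values = A(torch.randn(n_memories, feat_dim))         # A * Phi_V(v_h)
for R_j in R:                                         # repeat for H hops
    q = hop(q, keys, values, R_j)
# The final q is compared (e.g., by dot product) with candidate answer embeddings.
```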

MemNN is widely used in many reasoning-required NLP tasks. For example, Ref. [77] first proposed the MemNN framework and applied it to bAbI, which is a reasoning-required QA task. Ref. [79] proposed the KV-MemNN and evaluated it on two QA datasets: WikiMovies and WikiQA [80]. Ref. [81] described an end-to-end goal-oriented dialog method based on MemNN.

In a broader sense, MemNN is a special case of memory-augmented neural networks (MANNs), which extend the capabilities of neural networks by coupling them to external memory resources. In such methods, neural models interact with memory by attentional processes. As MANNs have many variants, we use two examples to illustrate the applications of a MANN in two reasoning-required tasks (i.e., semantic parsing and response generation), which are based on the knowledge graph and on commonsense knowledge, respectively.

Ref. [82] proposed a unified semantic parsing approach to handle a knowledge-graph-based conversational QA task, in which a dialog memory motivated by MemNN is introduced to cope with co-reference and ellipsis phenomena in multi-turn interactions. Two examples are given below.

Fig. 16. Overview of the KV-MemNN [79]. a means the answer of the question; B means a d×D matrix, which can be constrained to be identical to A. R_j means a d×d matrix, which is used to update the representation of the input question in the jth hop.

    In order to answer the first question:

    (1) Where was Donald Trump born?

The semantic parser first generates its logical form, λx.PlaceOfBirth(Donald Trump, x), and then executes it against a knowledge base to get the answer New York. This is called context-independent semantic parsing, as only the current question is needed.

    In order to answer the second question:

    (2) Where did he graduate from?

    Both questions (1) and (2) are needed, as ‘‘he” in question (2)refers to Donald Trump in question (1). This is called contextdependent semantic parsing.

    Given a context-independent question, (e.g., question (1)),semantic parsing is done as an action sequence prediction task.Starting from a root symbol ‘‘Start,” a grammar-guided decoder recursively rewrites the leftmost nonterminal (i.e., the semantic category)in the logical form by applying a legitimate action.It terminates when no nonterminal is left. Fig. 17 illustrates this procedure.

    A list of actions is defined in Table 1,where each action consists of three parts: a semantic category; a function symbol, which might be omitted; and a list of arguments, each of which can be a semantic category,a constant,or an action subsequence.The first 15 actions (A1-A15) are designed to cover typical operations in semantic parsing.We take A5 as an example:"num"is the semantic category, which denotes the returned type of count(set);‘‘count” is the function symbol, which returns the number of elements in set; and ‘‘set” is the only argument of count. A16-A18 are designed to instantiate entity e, predicate r, and number num, respectively. A19-A21 are designed to reuse the previously predicated action subsequences.

    Given a context-dependent question(e.g.,question(2)),a dialog memory is used to maintain all entities,predicates,and action subsequences that come from either the generated logical forms of previous questions or the predicted answers. Such contents are considered to be the conversation history and will be used to restore the missing information (co-reference and entity/relation ellipsis) of the current question. For the example in Fig. 18, as‘‘he” in current question actually refers to Donald Trump in the previous question, the semantic parser can copy the action subsequence(i.e.,A15 eDT)from the dialog memory to complete the logical form generation.

    Experiments are conducted on the CSQA dataset [83], which consists of 2 × 10⁵ dialogs with 1.6 × 10⁶ turns over 1.28 × 10⁷ entities from WikiData, and this method achieves state-of-the-art results on both context-independent and context-dependent questions.

    Ref. [84] proposed a commonsense-aware encoder-decoder framework for the response-generation task in open-domain dialog systems. ConceptNet is used in this work to understand the background information of a given user utterance and then facilitate response generation.

    Table 1 List of actions, each consisting of a semantic category, function symbol, and list of arguments.

    In the decoder part, a knowledge-aware generator is designed to generate a response by making full use of the retrieved graphs. This part plays two roles: ① It attentively reads the retrieved graphs to obtain a graph-aware context vector, and uses the vector to update the decoder’s state. This procedure is similar to MemNN. ② It adaptively chooses a generic word or an entity from the retrieved graphs for response generation.
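    A minimal sketch of one decoding step is given below (Python/NumPy, with hypothetical dimensions and parameter names): an attentive read over the entity embeddings of the retrieved graphs yields a graph-aware context vector, and a scalar gate mixes a distribution over generic vocabulary words with a distribution over graph entities. This is a simplification of the generator in Ref. [84], not its exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def knowledge_aware_step(state, entity_vecs, W_vocab, w_gate):
    """One decoding step that adaptively mixes generic words and graph entities.

    state:       (d,)    current decoder state
    entity_vecs: (k, d)  embeddings of entities in the retrieved graphs
    W_vocab:     (V, d)  output projection over the generic vocabulary
    w_gate:      (2d,)   parameters of the scalar entity/word gate
    """
    attn = softmax(entity_vecs @ state)                     # attention over entities
    context = attn @ entity_vecs                            # graph-aware context vector
    g = sigmoid(w_gate @ np.concatenate([state, context]))  # entity-vs-word gate
    p_vocab = softmax(W_vocab @ (state + context))          # distribution over V words
    p_entity = attn                                         # distribution over k entities
    # Final distribution over the joint space [vocabulary ; entities].
    return np.concatenate([(1 - g) * p_vocab, g * p_entity])

# Toy usage: the mixed distribution sums to one by construction.
d, k, V = 8, 4, 20
rng = np.random.default_rng(1)
dist = knowledge_aware_step(rng.normal(size=d), rng.normal(size=(k, d)),
                            rng.normal(size=(V, d)), rng.normal(size=2 * d))
assert np.isclose(dist.sum(), 1.0)
```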

    Experiments are conducted on the Commonsense Conversation dataset, and this method achieves state-of-the-art results, outperforming the traditional sequence-to-sequence approach with a copy mechanism.

    With deep learning, different types of knowledge can be represented and used in neural inference methods for various reasoning tasks. However, such approaches still lack sufficient reasoning flexibility. For example, the number of reasoning hops in MemNN is fixed regardless of the input content, which can be further improved in the future.

    4.3. Reasoning-aware shared tasks

    Based on the different types of knowledge used, we classify recently proposed reasoning tasks into four categories:

    Fig. 17. An example of context-independent semantic parsing. DT: Donald Trump.

    Fig. 18. An example of context-dependent semantic parsing.

    · Tasks based on knowledge graphs. These tasks include WikiSQL [85], LC-QuAD [86], CSQA [83], and ComplexWebQuestions [87], in which various types of questions, such as multi-hop, multi-constraint, superlative, and multi-turn questions, must be answered based on knowledge graphs. In these tasks, reasoning is required and is done by performing semantic parsing explicitly.

    · Tasks based on commonsense knowledge. These tasks include the Winograd Schema Challenge [88], ARC [89], CommonsenseQA [90], and ATOMIC [91], in which questions can be answered based on different types of commonsense knowledge, such as temporal, spatial, and causal common sense. In these tasks, reasoning can either be performed explicitly by designing specific inference models, or performed implicitly by end-to-end training.

    · Tasks based on texts. These tasks include HotpotQA [92], NarrativeQA [93], MultiRC [94], and CoQA [95], in which answers can be obtained by reasoning across paragraphs, documents, or conversation turns. Currently, state-of-the-art results on these tasks are achieved by end-to-end neural models rather than by explicit knowledge-based reasoning; one reason is that existing knowledge bases still suffer from low coverage of open-domain natural language texts.

    · Tasks based on both texts and visual content. These tasks include GQA [96] and VCR [97], in which the goal is to answer a natural language question based on a given image. As the questions in these two datasets are either multi-hop questions or require commonsense knowledge, a model needs strong reasoning ability to achieve good performance.

    As data annotation is expensive, these datasets are often small in size, or large but generated automatically from templates. Models learned from such datasets therefore lack strong generalization ability. Recently, pre-trained models such as ELMo, GPT, BERT, and XLNet have shown good generalization across different NLP tasks. As a next step, it is practical to fine-tune existing pre-trained models on reasoning-aware datasets with the aim of building more generalizable reasoning systems.
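    As an illustration of this direction, the following is a minimal fine-tuning sketch that assumes the Hugging Face transformers library and PyTorch; the learning rate and the two Winograd-style training pairs are hypothetical placeholders for a real reasoning-aware dataset.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Hypothetical training pairs: (context, candidate conclusion, plausibility label).
train_data = [
    ("The trophy does not fit in the suitcase because it is too big.",
     "The trophy is too big.", 1),
    ("The trophy does not fit in the suitcase because it is too big.",
     "The suitcase is too big.", 0),
]

model.train()
for context, candidate, label in train_data:
    inputs = tokenizer(context, candidate, return_tensors="pt",
                       truncation=True, padding=True)
    outputs = model(**inputs, labels=torch.tensor([label]))
    loss = outputs[0]          # classification loss for this pair
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

    In practice, the same recipe applies to any of the reasoning-aware datasets listed above, with the pre-trained encoder providing general language knowledge and the fine-tuning data supplying the task-specific reasoning signal.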

    4.4. Summary

    This section briefly reviewed the progress of reasoning in NLP. Typical (non-neural and neural) inference methods were introduced, including ILP, MLN, and MemNN, all of which have been successfully used in various NLP tasks, such as QA, dialog systems, and information extraction.

    In order to build more powerful reasoning systems, some challenges remain:

    · Knowledge extraction. Due to the limited coverage of existing knowledge sources, only a small portion of queries issued to current human-computer interaction systems (such as search, QA, and dialog engines) can be fully understood and addressed. Therefore, knowledge extraction is still a long-term research task in order to acquire high-quality knowledge to support reasoning.

    · Reasoning with explicit knowledge and pre-trained models. Typical reasoning approaches are built based on explicit knowledge bases. Recently, pre-trained models such as GPT, BERT, and XLNet have shown strong abilities on reasoning-required NLP tasks, such as the Winograd Schema Challenge [88] and SWAG [98]. Therefore, the question of how to combine explicit knowledge and pre-trained models into a unified reasoning framework is worth exploring.

    · Datasets and metrics. Although the latest approaches can perform well on many NLP tasks, it is still difficult to tell how much reasoning capability an NLP model has. In order to measure such abilities, large-scale datasets and specific metrics should be built for the community.

    In general, reasoning is critical to advancing NLP, as it can bring existing knowledge into all NLP tasks for better natural language understanding and generation.

    5. Conclusion

    This paper reviewed the latest progress of neural NLP from three perspectives: modeling, learning, and reasoning. In general, NLP is entering a new era, in which neural network-based models dominate in research and application systems. For rich-resource tasks with enough training data (e.g., NMT and SQuAD), supervised learning performs very well. For low-resource tasks with little training data, semi-supervised and unsupervised learning, multitask learning, transfer learning, and active learning can be used; these methods can either generate more pseudo training data for model training, or leverage the knowledge learned from existing models and tasks. Different reasoning mechanisms have also been explored and used for tasks in which reasoning is required, such as commonsense QA, chitchat, and dialog systems.

    Looking ahead, we can see many exciting directions. For example, pre-trained models such as GPT, BERT, and XLNet have already shown strong performance on various NLP tasks, and there is no doubt that they will continue to evolve for both natural language understanding and generation tasks. It is also worth exploring how to integrate such pre-trained models into various NLP tasks with transfer learning, multitask learning, meta learning, and so on. Research on memory-augmented neural networks and their variants will further advance reasoning approaches, with the help of new knowledge-extraction techniques. Furthermore, by combining NLP with multi-modal tasks such as speech recognition, image/video captioning, and QA, new research and application scenarios will emerge. It is an incredibly exciting time to work on NLP, from research to applications.

    Compliance with ethics guidelines

    Ming Zhou, Nan Duan, Shujie Liu, and Heung-Yeung Shum declare that they have no conflict of interest or financial conflicts to disclose.
