The latest Chinese NLP open source toolbox is here! Supports 6 major tasks and is aimed at industrial applications | Resources
Tong Ling from Aofei Temple
Produced by Quantum Bit | Public Account QbitAI
For developers, there is a new NLP toolbox available: PaddleNLP, which is now open source.
This toolbox, developed based on Baidu's deep learning platform PaddlePaddle, contains a large number of industrial-grade Chinese NLP open source tools and pre-trained models.
The toolbox is comprehensive, supporting 6 major categories of NLP tasks: text classification, text matching, sequence labeling, text generation, semantic representation and language models, and complex tasks.
In addition to being comprehensive, PaddleNLP also delivers strong results. For example, a semantic matching model trained with PaddleNLP on Baidu's massive search data improves AUC by more than 5% over literal-similarity methods in real FAQ question-answering scenarios.
Whether you are working on chatbots, intelligent customer service, news recommendation, information retrieval, reading comprehension, or machine translation, PaddleNLP can meet your needs in one place.
According to PaddlePaddle, PaddleNLP lets multiple NLP models share one set of skeleton code, reducing duplicated work during development. Developers can flexibly swap in and experiment with different network structures, bringing applications to industrial-grade quality as quickly as possible.
Let’s take a look at the usage in detail.
1. Text Classification
Text Sentiment Analysis
Emotion is a high-level form of intelligent human behavior. Identifying the sentiment polarity of text requires deep semantic modeling; moreover, different domains (such as catering and sports) express sentiment differently, so model training requires large-scale data covering many domains. Semantic models based on deep learning, combined with large-scale data mining, address both problems.
Baidu's self-developed Chinese sentiment analysis model (Sentiment Classification, or Senta) automatically determines the sentiment polarity of subjective Chinese text and gives a corresponding confidence level.
Sentiment is classified as positive or negative. Sentiment analysis helps companies understand users' consumption habits, analyze hot topics, and monitor public opinion during crises, providing useful decision support.
The evaluation results on the open source sentiment classification dataset ChnSentiCorp are shown in the table below. In addition, PaddleNLP has open-sourced a Baidu model trained on massive data; fine-tuning it on ChnSentiCorp (see GitHub for how to fine-tune from the open source model) achieves even better results.
- BOW (Bag Of Words): a non-sequential model that uses a basic fully connected structure (a minimal code sketch of this idea follows the list).
- CNN (Convolutional Neural Networks): a basic sequence model that handles variable-length sequence input and extracts features within local windows.
- GRU (Gated Recurrent Unit): a sequence model that better handles long-distance dependencies in sequential text.
- LSTM (Long Short Term Memory): a sequence model that better handles long-distance dependencies in sequential text.
- BI-LSTM (Bidirectional Long Short Term Memory): a sequence model that uses a bidirectional LSTM structure to better capture the semantic features of a sentence.
- ERNIE (Enhanced Representation through kNowledge IntEgration): a general text semantic representation model developed by Baidu and trained on massive data and prior knowledge, fine-tuned here on the sentiment classification dataset.
- ERNIE+BI-LSTM: a BI-LSTM stacked on top of the ERNIE semantic representation, fine-tuned on the sentiment classification dataset.
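To make the BOW model above concrete, here is a minimal sketch of a bag-of-words classifier's forward pass: average the word vectors, apply a fully connected layer and a softmax, and report the predicted polarity with its confidence. This is an illustration with assumed toy sizes and random weights, not PaddleNLP's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMB, CLASSES = 1000, 64, 2          # hypothetical sizes
embedding = rng.normal(size=(VOCAB, EMB))  # word embedding table
W = rng.normal(size=(EMB, CLASSES))        # fully connected layer
b = np.zeros(CLASSES)

def bow_predict(token_ids):
    """Bag-of-words: an order-insensitive average of word vectors."""
    sent_vec = embedding[token_ids].mean(axis=0)   # (EMB,)
    logits = sent_vec @ W + b                      # (CLASSES,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                           # softmax
    label = int(probs.argmax())                    # e.g. 0 = negative, 1 = positive
    return label, float(probs[label])              # class plus confidence

label, conf = bow_predict(np.array([5, 42, 7, 42]))
print(label, round(conf, 3))
```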
Project address:
https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/sentiment_classification
Conversational Emotion Recognition
Conversation emotion recognition is applicable to multiple scenarios such as chat and customer service. It can help companies better grasp the quality of conversations and improve the user interaction experience of products. It can also analyze customer service quality and reduce manual quality inspection costs.
Emotion Detection (EmoTect) focuses on identifying users' emotions in intelligent dialogue scenarios. It automatically determines the emotion category of user text in intelligent dialogue scenarios and gives the corresponding confidence level. Emotion types are divided into positive, negative, and neutral.
The evaluation results on Baidu's self-built test sets (covering small talk and customer service) and the NLPCC 2014 Weibo sentiment dataset are shown in the table below. In addition, PaddleNLP has open-sourced a Baidu model trained on massive data; fine-tuning it on chat dialogue corpora achieves even better results.
- BOW (Bag Of Words): a non-sequential model that uses a basic fully connected structure.
- CNN: a shallow CNN model that handles variable-length sequence input and extracts features within local windows.
- TextCNN: a multi-kernel CNN model that better captures local correlations in sentences.
- LSTM: a single-layer LSTM model that better handles long-distance dependencies in sequential text.
- BI-LSTM: a bidirectional single-layer LSTM model that better captures the semantic features of a sentence.
- ERNIE: a general text semantic representation model developed by Baidu and trained on massive data and prior knowledge, fine-tuned here on the conversational emotion dataset.
Project address:
https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/emotion_detection
2. Text Matching
Short text semantic matching
Baidu's self-developed short text semantic matching framework (SimilarityNet, or SimNet) computes the similarity of short texts: given two texts entered by the user, it outputs a similarity score.
Like other semantic representation work, SimNet represents meaning as implicit continuous vectors, but it models the semantic matching problem end to end within a deep learning framework, unifying point-wise and pair-wise supervised learning in a single framework.
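As a rough illustration of the two supervision modes SimNet unifies, the sketch below contrasts a point-wise loss (each text pair carries a 0/1 match label) with a pair-wise loss (a matching text should outscore a non-matching one by a margin). The scores and margin are illustrative assumptions, not SimNet's actual values.

```python
import math

def pointwise_loss(score, label):
    """Binary cross-entropy: each (text_a, text_b) pair has a 0/1 match label."""
    eps = 1e-9
    return -(label * math.log(score + eps) + (1 - label) * math.log(1 - score + eps))

def pairwise_loss(pos_score, neg_score, margin=0.1):
    """Hinge loss: the matching text should outscore the non-matching one by a margin."""
    return max(0.0, margin - (pos_score - neg_score))

print(round(pointwise_loss(0.9, 1), 4))   # small loss: confident and correct
print(pairwise_loss(0.8, 0.3))            # 0.0: the ranking already satisfies the margin
print(pairwise_loss(0.4, 0.5))            # ~0.2: the ranking is violated, loss > 0
```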
In real applications, massive user click behavior data is converted into large-scale weakly labeled data. First applied to web search, this approach proved very powerful and brought a clear gain in relevance.
The SimNet framework is widely used in Baidu products. It includes core network structures such as BOW, CNN, RNN, and MMDNN, and provides a training and prediction framework for semantic similarity computation, suitable for scenarios such as information retrieval, news recommendation, and intelligent customer service, helping companies solve semantic matching problems.
Based on Baidu's massive search data, PaddleNLP trained a SimNet-BOW-Pairwise semantic matching model. In some real FAQ question-answering scenarios, the model's AUC is more than 5% higher than that of the literal similarity method.
The evaluation was conducted based on Baidu's self-built test set (including chat, customer service, and other data sets) and the semantic matching data set (LCQMC). The results are shown in the following table.
The LCQMC dataset uses accuracy as its evaluation metric; since the pairwise model outputs a similarity score, 0.958 is used as the classification threshold. Compared with the baseline CBOW model of similar network complexity (accuracy 0.737), BOW_Pairwise improves accuracy to 0.7532.
Project address:
https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/similarity_net
3. Sequence Labeling
Lexical Analysis
LAC (Lexical Analysis of Chinese) is Baidu's self-developed lexical analysis model for Chinese: the input is a string, and the output is the word boundaries, parts of speech, and entity categories in the sentence.
Sequence labeling is the classic way to model lexical analysis. LAC uses a GRU-based network structure to learn features and feeds them into a CRF decoding layer to complete the labeling (a decoding sketch follows below).
The CRF decoding layer essentially replaces the linear model of a traditional CRF with a nonlinear neural network; it is trained on sentence-level likelihood and thus better avoids the label bias problem. LAC handles Chinese word segmentation, part-of-speech tagging, and named entity recognition in one integrated model.
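The decoding step described above can be sketched as Viterbi search over per-token emission scores (from the GRU) plus tag transition scores (from the CRF). The toy numpy version below illustrates the idea; the shapes are illustrative and this is not LAC's actual code.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """emissions: (seq_len, n_tags); transitions[i, j]: score of moving tag i -> tag j."""
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()                 # best score ending in each tag so far
    backptr = np.zeros((seq_len, n_tags), dtype=int)
    for t in range(1, seq_len):
        # candidate[i, j] = best path up to t-1 ending in tag i, then i -> j at step t
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # follow back-pointers from the best final tag
    best = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]

rng = np.random.default_rng(0)
tags = viterbi_decode(rng.normal(size=(5, 4)), rng.normal(size=(4, 4)))
print(tags)  # one tag index (e.g. B/I/E/O-style) per input token
```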
The overall evaluation results for word segmentation, part-of-speech tagging, and named entity recognition on a self-built dataset are shown in the table below. In addition, fine-tuning on PaddlePaddle's open semantic representation model ERNIE and comparing the baseline, BERT-finetuned, and ERNIE-finetuned models shows a significant improvement.
Project address:
https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/lexical_analysis
4. Text Generation
Machine Translation
Machine translation (MT) is the process of using a computer to convert one natural language (source language) into another natural language (target language). The input is a sentence in the source language and the output is a sentence in the corresponding target language.
Transformer is the network structure proposed in the paper "Attention Is All You Need" for sequence-to-sequence (Seq2Seq) learning tasks such as machine translation.
Like earlier Seq2Seq work, it uses the typical encoder-decoder framework, but unlike the previously dominant recurrent neural networks (RNNs), it relies entirely on the attention mechanism for sequence-to-sequence modeling.
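The core of that attention mechanism is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. Below is a minimal single-head numpy sketch with illustrative shapes and no masking or linear projections, which the full Transformer adds on top.

```python
import numpy as np

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (len_q, len_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8): one context vector per query position
```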
After training the Transformer models of Base and Big configurations based on the public WMT'16 EN-DE dataset, they were evaluated on the corresponding test sets. The results are shown in the following table.
Project address:
https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/neural_machine_translation/transformer
5. Semantic Representation and Language Model
Language Representation Toolbox
BERT is a general semantic representation model with strong transferability. It uses Transformer as the basic network component and bidirectional Masked Language Model and Next Sentence Prediction as training objectives. It obtains general semantic representation through pre-training, and then combines it with a simple output layer and applies it to downstream NLP tasks, achieving SOTA results in multiple tasks.
ELMo (Embeddings from Language Models) is another important general semantic representation model. It uses a bidirectional LSTM as its basic network component and a language model as its training objective: general semantic representations are obtained through pre-training and then transferred as features to downstream NLP tasks, significantly improving their performance.
PaddleNLP has released a pre-trained model based on encyclopedia data.
Baidu's self-developed semantic representation model ERNIE learns real-world semantic knowledge by modeling words, entities, and entity relationships in massive data. Compared with BERT, which learns raw language signals, ERNIE directly models prior semantic knowledge units, enhancing the model's semantic representation capabilities.
Here is an example (the [mask] tokens show what each model hides during training):
Learnt by BERT: 哈 [mask] 滨 is the capital of Heilongjiang Province and a famous international ice-and-snow culture city.
Learnt by ERNIE: [mask] [mask] [mask] is the capital of Heilongjiang Province and a famous international ice-and-snow culture city.
With BERT-style masking, the model can predict the character "尔" from the local co-occurrence of "哈" and "滨" without learning anything about the entity "哈尔滨" itself. ERNIE, by modeling the representations of words and entities, can relate "哈尔滨" to "黑龙江" and learn that "哈尔滨" is the capital of "黑龙江" and a city of ice and snow.
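The contrast can be sketched in a few lines: BERT-style masking hides individual characters independently, while ERNIE-style masking hides a whole entity span at once. The character-level tokenization and the entity list below are illustrative assumptions, not either model's actual preprocessing.

```python
import random

tokens = list("哈尔滨是黑龙江的省会")   # character-level tokens
entities = [(0, 3)]                      # "哈尔滨" spans positions 0..2

def bert_style_mask(tokens, p=0.15, seed=0):
    """Mask each character independently with probability p."""
    random.seed(seed)
    return [t if random.random() > p else "[MASK]" for t in tokens]

def ernie_style_mask(tokens, entities):
    """Mask every character of an entity span together."""
    out = list(tokens)
    for start, end in entities:
        out[start:end] = ["[MASK]"] * (end - start)
    return out

print("".join(bert_style_mask(tokens)))              # some characters masked individually
print("".join(ernie_style_mask(tokens, entities)))   # the whole entity 哈尔滨 masked
```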
In terms of training data, in addition to Chinese encyclopedia and news corpora, ERNIE also introduces forum dialogue data. It uses a DLM (Dialogue Language Model) to model the query-response dialogue structure: dialogue pairs are taken as input, dialogue embeddings are introduced to identify the roles in a dialogue, and a dialogue response loss is used to learn the implicit relationships within a dialogue, further improving the model's semantic representation ability.
ERNIE has leading results in multiple NLP Chinese tasks such as natural language inference, semantic similarity, named entity recognition, sentiment analysis, and question-answer matching.
Project addresses:
https://github.com/PaddlePaddle/LARK/tree/develop/ERNIE
https://github.com/PaddlePaddle/LARK/tree/develop/BERT
https://github.com/PaddlePaddle/LARK/tree/develop/ELMo
Language Model
The task of the LSTM-based language model is to compute the perplexity (PPL) of a given input word sequence (word-segmented for Chinese, tokenized for English); PPL measures how fluently a sentence is expressed. For an introduction to recurrent neural network language models, see the paper "Recurrent Neural Network Regularization".
Compared with traditional methods, recurrent-neural-network approaches better handle the sparsity of rare words. This task implements a two-layer LSTM network of the kind commonly used in sequence tasks, and uses the LSTM output to predict the probability of the next word.
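Given the per-token probabilities a trained model assigns, PPL itself is a one-line formula: the exponential of the average negative log probability. A minimal sketch with made-up probabilities:

```python
import math

def perplexity(token_probs):
    """PPL = exp(-1/N * sum(log p(w_i | w_<i))); lower means more fluent."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

fluent    = [0.4, 0.5, 0.6, 0.3]      # the model is fairly confident at each step
disfluent = [0.05, 0.02, 0.1, 0.01]   # the model is surprised at each step
print(perplexity(fluent))     # ~2.3
print(perplexity(disfluent))  # ~31.6: far less fluent
```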
The following table compares the PPL of the small, medium, and large configurations.
Project address:
https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/language_model
6. Complex Tasks
Dialogue Model Toolbox
Auto Dialogue Evaluation
The automatic dialogue evaluation module is mainly used to evaluate the response quality of open-domain dialogue systems. It can help enterprises or individuals quickly evaluate the response quality of dialogue systems and reduce manual evaluation costs.
1) When no labeled data is available, a matching model trained with negative sampling serves as the evaluation tool and can rank the response quality of multiple dialogue systems (see the sketch after this list);
2) With a small amount of labeled data (human scores for a specific dialogue system or scenario), fine-tuning the matching model significantly improves evaluation quality for that system or scenario.
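The negative-sampling idea in 1) can be sketched as follows: pair each dialogue context with its real response (label 1) and with a randomly drawn response from another dialogue (label 0), yielding matching-model training data without any human scores. The tiny corpus below is illustrative.

```python
import random

dialogues = [
    ("how are you", "fine, thanks"),
    ("what time is it", "about nine"),
    ("any plans today", "going hiking"),
]

def build_training_pairs(dialogues, seed=0):
    """Turn unlabeled dialogues into (context, response, label) training pairs."""
    random.seed(seed)
    pairs = []
    for i, (context, response) in enumerate(dialogues):
        pairs.append((context, response, 1))               # true pair: positive
        others = [r for j, (_, r) in enumerate(dialogues) if j != i]
        pairs.append((context, random.choice(others), 0))  # sampled negative
    return pairs

for pair in build_training_pairs(dialogues):
    print(pair)
```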
Taking four different dialogue systems (seq2seq_naive / seq2seq_att / keywords / human) as examples, the automatic dialogue evaluation tool is used to evaluate them.
1) In the absence of labeled data, the pre-trained evaluation tool is used directly for evaluation. The Spearman correlation coefficients of automatic evaluation scores and manual evaluation scores on the four dialogue systems are shown in the following table.
2) Sort the average scores of the four systems:
3) After fine-tuning using a small amount of labeled data, the Spearman correlation coefficient between the automatic evaluation score and the manual score is shown in the following table.
Project address:
https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/dialogue_model_toolkit/auto_dialogue_evaluation
Deep Attention Matching Network
The deep attention matching model (DAM) is an open-domain multi-turn dialogue matching model: given the multi-turn dialogue history and a set of candidate responses, it ranks the candidates and selects the most appropriate one.
The input of the multi-turn dialogue matching task is the dialogue history and the candidate responses; the output is a matching score for each response, by which the candidates are ranked. For details, see the paper "Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network".
The evaluation results on two public datasets are shown in the following table.
Project address:
https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/dialogue_model_toolkit/deep_attention_matching
Dialogue General Understanding Model (DGU)
In dialogue-related tasks, a dialogue system often needs to handle a variety of tasks as scenarios change. The diversity of these tasks (intent recognition, slot parsing, dialogue act recognition, dialogue state tracking, etc.) and the scarcity of in-domain training data make dialogue system research and applications difficult, and motivate a general dialogue understanding model. Experiments show that the BERT-based Dialogue General Understanding (DGU) module, combining the base model (BERT) with common learning paradigms, achieves results comparable or even superior to the best domain-specific models on almost all dialogue understanding tasks, demonstrating the great potential of learning one general dialogue understanding model.
DGU provides model training pipelines for the related datasets, supporting tasks such as classification, multi-label classification, and sequence labeling, and users can customize models for their own datasets.
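The "one encoder, many task heads" pattern this describes can be sketched as follows: a BERT-like encoder produces one vector per token; classification tasks read the first ([CLS]) vector, while sequence labeling reads every token's vector. The shapes and head weights below are illustrative, not DGU's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
SEQ_LEN, HIDDEN, N_INTENTS, N_TAGS = 16, 32, 5, 7
encoded = rng.normal(size=(SEQ_LEN, HIDDEN))   # stand-in for BERT encoder output

W_cls = rng.normal(size=(HIDDEN, N_INTENTS))   # intent classification head
W_tag = rng.normal(size=(HIDDEN, N_TAGS))      # slot / sequence labeling head

intent_logits = encoded[0] @ W_cls             # [CLS] vector -> one label per utterance
tag_logits = encoded @ W_tag                   # every token -> its own tag distribution

print(intent_logits.shape)  # (5,): one intent prediction for the whole utterance
print(tag_logits.shape)     # (16, 7): one tag distribution per token
```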
The evaluation was conducted based on a public industry dataset related to dialogue, and the results are shown in the following table.
Project address:
https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/dialogue_model_toolkit/dialogue_general_understanding
Knowledge-Driven Dialogue
Human-computer dialogue is one of the most important topics in artificial intelligence (AI) and has received wide attention from academia and industry in recent years. Dialogue systems are still in their infancy: they usually converse passively, uttering responses rather than taking the initiative, unlike human-to-human conversation.
Therefore, Baidu set up a competition on a new dialogue task, knowledge-driven dialogue, in which machines converse with people based on a constructed knowledge graph. It aims to test machines' ability to conduct human-like conversations.
The project provides retrieval-based and generation-based baseline systems. Both are implemented in PaddlePaddle (Baidu's deep learning framework) and PyTorch (Facebook's deep learning framework). The performance of the two systems is shown in the following table.
Project address:
https://github.com/baidu/knowledge-driven-dialogue/tree/master
Reading Comprehension
In the machine reading comprehension (MRC) task, given a question (Q) and one or more paragraphs (P) or documents (D), the machine must find the correct answer (A) in the given text, that is, Q + P (or D) => A. MRC is a key task in natural language processing (NLP) that requires deep language understanding to find the correct answer.
PaddlePaddle's reading comprehension implementation upgrades the classic BiDAF reading comprehension model: it removes char-level embeddings, uses a pointer network in the prediction layer, and borrows some network structures from R-NET, improving the results (see the table below for performance on the DuReader 2.0 validation and test sets; a span-search sketch follows).
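The prediction layer described above ultimately outputs start and end probabilities over tokens, and the answer is the highest-scoring valid span. Here is a minimal sketch of that span search with illustrative probabilities; the max_len constraint is an assumed heuristic, not necessarily the model's.

```python
import numpy as np

def best_span(start_probs, end_probs, max_len=10):
    """Maximize start_probs[i] * end_probs[j] subject to i <= j < i + max_len."""
    best, best_score = (0, 0), -1.0
    for i, ps in enumerate(start_probs):
        for j in range(i, min(i + max_len, len(end_probs))):
            score = ps * end_probs[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score

start = np.array([0.1, 0.6, 0.1, 0.1, 0.1])   # P(answer starts at token i)
end   = np.array([0.1, 0.1, 0.2, 0.5, 0.1])   # P(answer ends at token j)
print(best_span(start, end))   # ((1, 3), 0.3): the answer spans tokens 1..3
```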
DuReader is a large-scale, real-world, human-generated Chinese reading comprehension dataset. DuReader focuses on real-world question-answering tasks in a domain-independent manner. Compared with other reading comprehension datasets, DuReader has the following advantages:
- Questions come from real search logs
- Document content comes from real web pages
- Answers are human-generated
- Oriented to real application scenarios
- Richer, more detailed annotations
Project address:
https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/reading_comprehension