The magic of human-like learning in conversational robots
There are many chatbots, such as Siri, XiaoIce, DuerOS, and Allo, that can chat with you in your spare time. However, as manufacturers and users have realized that conjuring a highly general chatbot out of thin air is unrealistic, attitudes toward chatbots have shifted slightly: manufacturers now start from specific vertical domains and move beyond pure chitchat toward chatbots that can complete specific tasks for users. Suddenly, a new role has been found for chatbots that could only "talk but not act".
However, these newly upgraded conversational robots shift their focus to making decisions on behalf of users and helping them complete tasks.
How do they understand what the user wants to do? How do they achieve the human-like learning that chitchat alone cannot?
Then how do they know which tasks a third-party app can complete, and where to "click", so that this can connect with what the user wants to do?
When their focus becomes completing tasks for users, what are their core technical requirements?
…
For answers to these questions, which only someone with years of hands-on experience can give, we invited Dai Shuaixiang, founder of Muran Cognition. His Xiaomu robot is designed to capture and "guess" intentions during conversations, ultimately assisting users in making decisions and executing tasks. Dai Shuaixiang, a former chief architect at Baidu, long headed Baidu's query-understanding direction. He is a technical expert in natural language understanding and won Baidu's first, and so far only, highest award centered on NLP technology.
The "Query Rewriting Model" proposed in 2010 brought a leap forward in Baidu's search engine technology, significantly increasing search relevance and advertising revenue. This model was proposed nearly a year earlier than the similar model "Query Rewriting Using Monolingual Statistical Machine Translation" in academia. This model is still widely used in all Baidu search product lines. More than 20 patent technologies in the fields of natural language processing, semantic search, and automatic problem solving have been applied for.
Could you first introduce the human-like learning your products emphasize?
One-shot Learning and RL (reinforcement learning) in conversation models:
- The purpose of One-shot Learning is to train from a small number of samples, solving the "cold start" problem of a dialogue system.
- RL learns through trial and error, without explicit supervision. Once the dialogue model is past its cold start, RL lets the system continuously strengthen advantageous strategies and weaken harmful ones through actual interaction with users. In practice, users feel the system becoming more and more human, or more personalized.
These two learning methods are closer to the way biological organisms, or people, learn, which is why I prefer to call them "Human-like Learning". In a conversation, one sits at the front end of the process and the other at the back end: one lets the model cold-start, the other lets it optimize in real time, and the two complement each other. Of course, in natural language understanding, One-shot Learning can be applied in more places: semantic analysis, which is a "representation learning" task; task decision-making, a typical "multi-task learning" scenario; and the portability of conversation scenarios, which resembles the "transfer learning" people often hear about.
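To make the trial-and-error idea concrete, here is a minimal sketch that treats candidate dialogue strategies as arms of a bandit and reinforces whichever ones users respond well to. The strategy names and the simulated feedback are invented for illustration; this is the general RL intuition, not Muran Cognition's actual model.

```python
import random

class StrategyBandit:
    """Epsilon-greedy choice over candidate dialogue strategies."""
    def __init__(self, strategies, epsilon=0.1):
        self.epsilon = epsilon                      # exploration rate
        self.counts = {s: 0 for s in strategies}    # times each strategy was tried
        self.values = {s: 0.0 for s in strategies}  # running mean reward

    def choose(self):
        # Explore occasionally; otherwise exploit the best-known strategy.
        if random.random() < self.epsilon:
            return random.choice(list(self.values))
        return max(self.values, key=self.values.get)

    def update(self, strategy, reward):
        # Incremental mean: advantageous strategies rise, weak ones fade.
        self.counts[strategy] += 1
        self.values[strategy] += (reward - self.values[strategy]) / self.counts[strategy]

bandit = StrategyBandit(["ask_clarifying_question", "offer_best_guess", "list_options"])
for _ in range(1000):
    s = bandit.choose()
    # Stand-in for implicit user feedback, e.g. "did the dialogue move forward?"
    p_success = 0.7 if s == "offer_best_guess" else 0.3
    bandit.update(s, 1.0 if random.random() < p_success else 0.0)
print(bandit.values)  # "offer_best_guess" should dominate after enough dialogues
```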
When users ask a question through the voice engine, they get the correct answer directly rather than a page of ten search-result links. What is the key technology?
In fact, this involves many complex technologies at different levels and across different disciplines, such as linguistics, cognitive science, logic, natural language processing, and machine learning. It is a process of integrating and intersecting multiple key technologies, which for now we can simply call natural language understanding.
If the human brain is regarded as a machine, then natural language is a semantic representation suited to that machine's recognition and computation. In natural language, characters form words, words form sentences, and sentences form complex discourse. It embodies a process of combining the simple into the complex, the finite into the infinite.
Compositionality provides the ability to generate new things, but it does not guarantee that everything generated is good; causality provides that guarantee.
Therefore, the core of natural language understanding is to model compositionality and causality at some level of language. In more popular terms, the former corresponds to the representation of semantics, while the latter refers to knowledge reasoning.
Voice robots make decisions on behalf of people and automate tasks. There are two difficulties here. The first is understanding what the user wants to do: when the user says "Kobe's highest-scoring game", the system must know this is about basketball, about video, and about the 81-point game, and only then can its invisible hands operate on the user's behalf.
How do you solve this problem using one-shot learning?
This is in fact the semantic understanding I mentioned above. I cannot disclose the specific details for now, because this part is one of our key innovations and certainly not something already public in industry or academia. But I can talk about the basic idea from another angle.
One-shot Learning is just a concept, an abstract idea, not even a general learning framework. Semantic understanding is definitely not a typical pattern-recognition problem like image or speech recognition, nor an end-to-end problem. It is a problem tied to reasoning; in layman's terms, it is closer to something like playing chess. Such problems obviously cannot be trained and learned directly in an end-to-end framework; you first need to model the problem itself, and only then look for a suitable learning method on that basis.
Here is an easy example. When we learn to write, we only need to learn a small number of characters; after that, when we see a new character, we can basically write it smoothly. The reason, arguably, is that we have formed an approximate abstraction and model of the writing process: we regard writing as a finite set of specific strokes plus their specific spatial arrangement. When we see a character we have never seen before, we use this abstraction to "construct" it, adjusting as we compare, until we have written something that most resembles the target. This also reflects the compositionality and causality I mentioned above, though here the causality may be more of a statistical relationship.
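As a toy illustration of this "strokes plus arrangement" intuition (and emphatically not his undisclosed method), the sketch below represents characters as compositions of known parts and matches a never-seen character by part overlap:

```python
# Parts "learned" from a handful of examples: (stroke, rough position) pairs.
# The inventory and characters here are invented for illustration.
KNOWN_CHARACTERS = {
    "A": {("horizontal", "top"), ("vertical", "left"), ("vertical", "right")},
    "B": {("horizontal", "top"), ("horizontal", "bottom"), ("vertical", "left")},
    "C": {("curve", "left"), ("horizontal", "top"), ("horizontal", "bottom")},
}

def construct_and_compare(target_parts):
    """'Construct' the target from known parts and return the closest match.

    Jaccard overlap of (stroke, position) compositions stands in for the
    compare-and-adjust loop described in the answer above.
    """
    def jaccard(a, b):
        return len(a & b) / len(a | b)
    return max(KNOWN_CHARACTERS,
               key=lambda c: jaccard(KNOWN_CHARACTERS[c], target_parts))

# A character never seen before, decomposed into the same primitive parts.
new_character = {("horizontal", "top"), ("horizontal", "bottom"),
                 ("vertical", "left"), ("dot", "center")}
print(construct_and_compare(new_character))  # -> "B" (largest part overlap)
```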
Therefore, the most critical step for this type of problem is to model the problem itself: abstract it and get close to its essence. No ready-made general framework can solve it.
Continuing the question above: the second difficulty is knowing which tasks third-party apps can complete and where to "click", and then connecting that with what the user wants to do (semantic understanding).
What in-app search technology did you use to solve this problem?
[This question misunderstands our work. We don't care how an app works internally, and we don't actually need to connect to a specific app.]
Let's go back to the essence of the problem. Today's apps are designed around the input methods of mouse, keyboard, and touch screen. No matter how friendly or simple an app is, it is constrained by these mechanical input methods; simply put, a current app is just a combination of them. Why should semantic understanding be mapped onto such a low-level mode of operation? There is no need to! Dialogue is a new mode of interaction. Dialogue is the closest thing to interaction between people, and of course it is also the most natural way for people to interact with machines.
Setting technical details aside: if we want to complete a task or make a decision, the process itself has nothing to do with the input method. It is a task flow. There may be some key nodes that everyone has to pass through, but there is no fixed rule; everyone has their own personalized process. Take "buying a plane ticket": some people buy online, some by phone, some at the counter; some are stubborn and only want a ticket that meets every one of their conditions; some are hesitant and keep comparing, asking, and weighing; most have a basic optimization goal, such as the lowest possible price or the earliest possible departure, and then choose what they consider best given the current flights.
What we need to do is assist people in making decisions, in the most natural way, at the level of abstraction of the task itself, and so push the task toward completion as quickly as possible. The most appropriate way is obviously dialogue, as between people. It targets the scenario of a person completing a specific task, uses dialogue to drive the whole task forward quickly, and calls third-party interfaces at the appropriate moments, for example to display specific information or place an order, so that the whole task is optimized toward some goal, such as the order most personalized to the current user. This is a typical AI approach, and the technology involved is again a fusion of the complex technologies mentioned above.
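A minimal sketch of this task-flow view, with a hypothetical flight-search stub standing in for a real third-party interface: the dialogue collects the task's key nodes, then calls the service and optimizes toward the user's goal (here, lowest price).

```python
REQUIRED_SLOTS = ["origin", "destination", "date"]

def search_flights(origin, destination, date):
    # Stand-in for a real third-party flight-search interface.
    return [{"flight": "CA123", "price": 980}, {"flight": "MU456", "price": 850}]

def next_action(slots):
    """Decide the dialogue's next step, independent of the input device."""
    missing = [s for s in REQUIRED_SLOTS if s not in slots]
    if missing:
        return f"ask user for: {missing[0]}"
    flights = search_flights(**slots)
    best = min(flights, key=lambda f: f["price"])   # the user's optimization goal
    return f"propose {best['flight']} at {best['price']} and confirm the order"

# The same task flow works whether slots arrive by voice, text, or touch.
slots = {}
for utterance_slots in [{"destination": "Beijing"}, {"origin": "Shanghai"},
                        {"date": "2016-11-20"}]:
    slots.update(utterance_slots)
    print(next_action(slots))
```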
When a voice robot's focus becomes helping users make decisions and mobilizing third-party applications to respond quickly, it becomes a matter that relies heavily on integrating technology and resources, and even on operational cooperation. How do you view this?
[Does this question mean that because we need to connect to many services, the operational work of connecting them becomes heavier?]
We do need to connect to many services, so that in a specific task scenario we can flexibly call the appropriate service to assist decision-making.
But the reality is the opposite of what the question assumes. We can automatically build semantic analysis and service-docking programs for different service interfaces on the Internet (setting aside specific business negotiations and considering it purely technically; after all, the more widely a service is used on the Internet, the more open it tends to be). This is another of our advantages: besides a semantic analysis method that migrates quickly from one scenario to another, we can automatically build the corresponding docking program for each service. To put it bluntly, given a specific service's interface, our system automatically "writes" a program that handles the docking between people and that service, that is, the dialogue process for that service interface. From a programming perspective, we have designed a program that generates specific programs, replacing work that would otherwise require programmers to write by hand.
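As a toy illustration of "a program that writes programs" (the actual system is not public), the sketch below generates a docking handler from a hypothetical service spec: the generated handler asks for whatever parameters are still missing, then calls the service.

```python
def make_docking_handler(spec, call_service):
    """Generate a dialogue handler for one service from its parameter spec."""
    def handler(filled_params):
        missing = [p for p in spec["params"] if p not in filled_params]
        if missing:
            # The generated dialogue step: ask for the next missing parameter.
            return {"action": "ask", "param": missing[0],
                    "prompt": spec["prompts"][missing[0]]}
        return {"action": "call", "result": call_service(**filled_params)}
    return handler

# A hypothetical interface spec, as might be derived from service documentation.
weather_spec = {
    "service": "weather_lookup",
    "params": ["city", "day"],
    "prompts": {"city": "Which city?", "day": "For which day?"},
}
handler = make_docking_handler(weather_spec,
                               lambda city, day: f"{city} {day}: sunny")
print(handler({}))                                    # asks for "city"
print(handler({"city": "Beijing", "day": "Sunday"}))  # calls the stub service
```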
Mobilizing third-party apps to respond to tasks covers a very broad range, yet a real advantage requires deep verticalization. How do you balance the two?
Given the generality of the model we designed, in theory we can achieve full-scenario and even multilingual dialogue as long as we keep going deeper. However, judging from current user acceptance, the maturity of related products, and the business model, we prefer to make breakthroughs first in car and home environments. In other words, pure voice dialogue is still a non-mainstream method; it only becomes useful when a person's hands are occupied. But with the rapid progress of language understanding and dialogue technology, such scenarios are becoming more and more common, and voice dialogue is expected to become a mainstream form of human-computer interaction within a few years.
● ● ●
Reader Questions
Can you introduce the current status of NLP applications in human-computer interaction? How far can they go in AR/VR scenarios? For example, can NPCs in games hold complex, free-form exchanges the way Microsoft XiaoIce does?
First of all, I personally think XiaoIce involves no especially complicated technology, even though it may use deep learning and sentence-generation methods. Its core is to use a large chat corpus (question-answer pairs) so that, for a new question, it finds the most "relevant" existing question in the current context. Technically it is closer to a retrieval method, though different models describe the context with different power and may give different results.
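A minimal sketch of the retrieval approach described here, using TF-IDF similarity over an invented corpus of question-answer pairs:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy chat corpus: (question, answer) pairs.
qa_pairs = [
    ("what is your name", "You can call me Bot."),
    ("how is the weather today", "Sunny with a light breeze."),
    ("do you like basketball", "I never miss a Lakers game."),
]
questions = [q for q, _ in qa_pairs]

vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(questions)   # index the known questions

def reply(new_question):
    # Answer with whatever was paired to the most "relevant" known question.
    sims = cosine_similarity(vectorizer.transform([new_question]), index)[0]
    return qa_pairs[sims.argmax()][1]

print(reply("what's the weather like"))  # -> "Sunny with a light breeze."
```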
In general, though, this kind of dialogue has nothing to do with "language understanding", which is to say nothing to do with reasoning; it is purely a statistical similarity calculation. So the question should really be: can an NPC in a game hold a natural dialogue with the player at the computer, the way a real fellow player would?
In fact, a specific game is a very narrow scenario, and I personally think that within such a limited scenario, human-computer dialogue can be made closer to conversation between people. But bear in mind that a game has many characters with different settings, so a truly universal, quickly personalizable dialogue model is needed, and its cold start has to differ from character to character.
What are the current uses of deep learning in NLP, and how does it differ from and integrate with traditional methods? How should practitioners choose?
It again comes down to the problem itself. If the problem can be treated as pattern recognition, the input and output are clear, and a large number of training samples can be obtained (including by manual annotation), then it generally suits an end-to-end solution, and deep learning is a good choice; sequence labeling in NLP and statistical machine translation can both use this approach. If, instead, the goal cannot be described by a specific label, and the task requires a series of clear actions conditioned on an environment, such as planning or task decision problems, then it does not yet suit deep learning alone and needs a combination of deep learning and logic.
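As a minimal example of the first kind of problem, here is a toy end-to-end sequence tagger in PyTorch: clear input (tokens), clear output (per-token tags), trained directly from labeled samples. The data and sizes are placeholders.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Embed tokens, run a bidirectional LSTM, predict one tag per token."""
    def __init__(self, vocab_size, tagset_size, embed_dim=32, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, tagset_size)

    def forward(self, token_ids):                 # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))   # (batch, seq_len, 2*hidden)
        return self.out(h)                        # (batch, seq_len, tagset)

# Toy corpus: token ids and per-token tag ids (e.g. 0 = O, 1 = ENTITY).
x = torch.tensor([[1, 2, 3, 4]])
y = torch.tensor([[0, 0, 1, 1]])

model = BiLSTMTagger(vocab_size=10, tagset_size=2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for _ in range(50):                               # end-to-end training loop
    optimizer.zero_grad()
    logits = model(x)                             # (1, 4, 2)
    loss = loss_fn(logits.view(-1, 2), y.view(-1))
    loss.backward()
    optimizer.step()

print(model(x).argmax(dim=-1))                    # predicted tag sequence
```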
I personally take a different view of the term "traditional NLP methods". NLP studies methods that approach the essence of language phenomena, that is, methods for modeling language. People first studied it from a logical perspective and later added statistics and machine learning. Language is a big enough problem, and these different schools merely reveal its different facets. From an application perspective, the right approach is of course to consider all facets and to fuse the power of logic, knowledge, and statistics to make real progress. Many so-called NLP products focus on single-point tasks such as text classification, vocabulary tagging, or even word segmentation; they are still far from realizing what NLP technology can do.
How likely is it that natural-language "interactive search" will replace traditional search? How is the industry responding to this trend, and what are the main difficulties?
With the diversification of terminals and the steady miniaturization of devices, this is an inevitable trend. I cannot predict a specific date, but I believe that day is not far off. The mobile era has already begun to dismantle search; today's search engines will soon recede into a less important service behind "interactive search", just as directory-based search was replaced by today's search engines.
The difficulty again lies in natural language understanding, or more concretely, in how machines can "understand" human language. A machine must at least grasp human intentions and support reasoning, and that understanding must itself be a form of computation; only then can we talk about dialogue and interaction. Otherwise it is water without a source, a tree without roots.
What are the noteworthy development trends of natural language processing technology in the field of education?
Education is a huge industry. I am not an expert in this area, so I can only talk about it casually.
From the perspective of our interaction and decision-making engine, machine-assisted education could be a very interesting direction: letting robots take over the parts of the education process that machines are better at, such as grading papers, patiently explaining basic calculation methods, or answering science questions.
Technically, I am personally more interested in automatic problem solving and automatic theorem proving, because they are closely related to our technology. We have not yet worked out in detail how to combine these AI methods with educational products, for example assisting with homework questions or taking exams, but it should be a very interesting direction.
For sentiment analysis, which generally gives more accurate word segmentation results: a professional domain dictionary or the Chinese Academy of Sciences segmentation system?
Sentiment analysis is, in the end, a classification problem. We should use higher-level features such as syntax and even semantics, rather than staying at low-level features like the vocabulary level. In other words, granularity-related basic features such as word segmentation should not directly affect high-level NLP applications; if they do, it is fatal for system scalability. If we want to unleash NLP's power at a higher level, we should not get hung up on steps like word segmentation: segmentation quality should affect the model's final result as little as possible, or the model should not rely on segmentation at all.
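One way to realize this in practice, sketched below on an invented toy corpus, is to classify sentiment from character n-grams, so that no word segmenter or domain dictionary is involved at all.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented corpus: 1 = positive, 0 = negative.
texts = ["这个产品非常好用", "质量太差了很失望", "物流很快服务很好", "完全是浪费钱"]
labels = [1, 0, 1, 0]

# Character 1-3 grams: segmentation quality cannot affect the result.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(),
)
clf.fit(texts, labels)
print(clf.predict(["服务非常好"]))  # expected: [1]
```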