Exclusive Interview | Dr. Sun Jian from Alibaba iDST: Seeing Hope in Intelligent Human-Computer Dialogue While Others Despaired

Last updated: 2017-04-17

Leifeng.com: In AI, some problems are called AI-hard, meaning that solving them would amount to achieving AI, or strong AI. Natural language understanding and human-computer dialogue are among them. In iDST, Alibaba's most mysterious department, a group of people has been exploring this direction since early 2014. By then, three years had passed since the launch of Siri, and the many voice assistant products that had followed Siri were slumping and dying off. Why did this group enter the field just as others were despairing? What did they see differently? How do they understand AI-hard problems? Three years on, what are they doing, what have they achieved, and what do they still want to do? With these questions in mind, Leifeng.com conducted an exclusive interview with Dr. Sun Jian, a senior expert in natural language understanding and human-computer dialogue at Alibaba iDST, to share iDST's exploration, thinking and progress in intelligent human-computer dialogue.

Why enter this field when others were despairing?

Leifeng.com AI Technology Review: When did you start working on human-computer dialogue? What was your reason at that time?

Sun Jian: We began exploring human-computer dialogue in early 2014, because we perceived two trends at the time.

The first trend was the rapid development of smart devices. Smartphones were already widespread, and other smart devices were developing very quickly: smart glasses, smart watches, smart TVs, smart speakers, Internet cars, robots and so on. These devices have different hardware forms and varied usage scenarios, and traditional interaction methods such as keyboard and mouse or touch control are increasingly unable to meet user needs. For example, typing a text message while walking is painful; when driving, both hands are occupied, so using touch controls to navigate or play music is both inconvenient and dangerous; and when interacting with robots, especially anthropomorphic ones, having to use a keyboard and mouse or touch a screen feels very unnatural. Problems like these created an urgent need for natural dialogue between humans and machines. At the beginning of 2014, however, human-computer dialogue interaction was still at a very early stage, and the experience was very poor.
The second trend was the growing richness and reach of Internet services, which had expanded from traditional information and communication services to shopping, food delivery, navigation, taxi-hailing and other services touching every aspect of people's lives. Letting users go beyond traditional interaction methods to obtain these services on all kinds of smart devices had become an important trend.

Based on the above two points, we believe that there is an urgent need to create a means for people and machines to interact more naturally and conveniently, so that users can easily obtain the information and services they want at any time and any place.

When we started this project, we faced many challenges internally. One of the main objections was that the volume of voice searches on Taobao was very small at the time, so why work on voice-based human-computer dialogue? Another argument from the skeptics was that, as of early 2014, the voice assistant products built by some Internet companies were essentially dead, so why start now? But we held to our judgment, because we sensed these trends and changes and believed that this would be the future.

Their troubles and choices

AI Technology Review: What major stages have you gone through in your practice over the past three years, and what pitfalls have you encountered?

Sun Jian: Over the past three years we have gone through two major stages. In the first stage, iDST worked closely with the YunOS department to build an intelligent assistant product with voice interaction on the YunOS mobile operating system. Along the way, however, we found a problem with the intelligent assistant product model itself. What was the problem?

An intelligent assistant product is positioned as the user's main entry point: it is expected to take on any user need and to satisfy it within the product, without sending the user to other applications wherever possible. This means every function has to be implemented inside the product. If users want to buy train tickets, shop, take a taxi, or order takeout, the assistant has to reimplement the functions of 12306, Taobao, Didi, Meituan and so on. The workload is enormous, and worse, with limited resources it is simply impossible to match the user experience of those apps. Moreover, even if you did implement everything, the assistant would be competing with those apps as a traffic entry point, which is not feasible.

Therefore, after more than a year of exploration and thinking, we concluded that an intelligent assistant cannot be carried by a single app. It should be a combination of device + operating system + app ecosystem. Based on this conclusion, we made a strategic choice at the end of 2015: our positioning is not to build an intelligent assistant product but to build an intelligent human-computer interaction platform that empowers every device and every app with intelligent dialogue capabilities. Because voice interaction is provided to each device or app, the platform collaborates with apps rather than competing with them.

AI Technology Review: Can you introduce your main work?

Sun Jian: We are the natural language human-computer dialogue team. Our work covers natural language understanding, dialogue management, intelligent question answering, intelligent chat and other technical directions.

Language understanding means making machines understand human language. Put simply, it can be divided into two subtasks. The first is to determine the user's intent: is the user ordering a meal, booking a taxi, or buying a train ticket? The second, once the intent is understood, is to extract the key information from what the user said. For buying a train ticket, for example, that means the departure place, the destination, and the time.
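The two subtasks can be illustrated with a minimal sketch. This is not iDST's engine: the keyword lists, the regular-expression patterns, and the names `INTENT_KEYWORDS`, `SLOT_PATTERNS` and `understand` are hypothetical, and a production system would use trained models rather than hand-written rules.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical keyword lists and patterns, for illustration only.
INTENT_KEYWORDS = {
    "buy_train_ticket": ["train ticket"],
    "book_taxi": ["taxi", "cab"],
    "order_meal": ["order a meal", "takeout"],
}
SLOT_PATTERNS = {
    "departure": re.compile(r"\bfrom (\w+)"),
    "destination": re.compile(r"\bto (\w+)"),
    "time": re.compile(r"\b(today|tomorrow)\b"),
}


@dataclass
class Understanding:
    intent: Optional[str]
    slots: Dict[str, str] = field(default_factory=dict)


def understand(utterance: str) -> Understanding:
    text = utterance.lower()
    # Subtask 1: decide the intent (the first keyword hit wins).
    intent = None
    for name, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            intent = name
            break
    # Subtask 2: extract the key information (slots) from the utterance.
    slots = {}
    for name, pattern in SLOT_PATTERNS.items():
        match = pattern.search(text)
        if match:
            slots[name] = match.group(1)
    return Understanding(intent, slots)


result = understand("A train ticket from Hangzhou to Beijing tomorrow")
print(result.intent, result.slots)
```

Rules like these break quickly on real spoken language, which is exactly the diversity and ambiguity problem discussed later in the interview.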

The dialogue interaction between people and machines cannot be completed with just one sentence, so a dialogue management module is needed to manage the dialogue process. Let's continue with the above example. If the user only says the departure place and destination but not the time, the machine will ask what time to leave. In this way, through multiple rounds of dialogue, the machine collects all the information needed to meet the user's needs, and then requests a specific data service (for example, train ticket service).

For most tasks, getting the service results is only half the job. Users also need to screen and filter the results according to their preferences, such as turning pages, viewing details, changing query conditions, etc. This entire interactive process requires dialogue management.
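The information-collection loop described above can be sketched as follows. This is a simplified illustration under stated assumptions: the class `TicketDialogue`, the slot names, and the prompt texts are hypothetical, and the call to a real train ticket service is replaced by a formatted string.

```python
from typing import Dict

# Slots the (hypothetical) train ticket task needs before calling the service.
REQUIRED_SLOTS = ["departure", "destination", "time"]
PROMPTS = {
    "departure": "Where are you leaving from?",
    "destination": "Where are you going?",
    "time": "What time would you like to leave?",
}


class TicketDialogue:
    """Keeps the slots collected so far across several dialogue turns."""

    def __init__(self) -> None:
        self.slots: Dict[str, str] = {}

    def step(self, new_slots: Dict[str, str]) -> str:
        # Merge what the user said in this turn into the dialogue state.
        self.slots.update(new_slots)
        # Ask for the first piece of information that is still missing.
        for name in REQUIRED_SLOTS:
            if name not in self.slots:
                return PROMPTS[name]
        # Everything is collected: a real system would call the service here.
        return "Searching trains from {departure} to {destination}, leaving {time}.".format(
            **self.slots
        )


dialogue = TicketDialogue()
print(dialogue.step({"departure": "Hangzhou", "destination": "Beijing"}))
print(dialogue.step({"time": "tomorrow"}))
```

The same state object could also track the follow-up steps mentioned above, such as paging through results or changing a query condition, which this sketch leaves out.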

As for intelligent question answering, human-machine dialogue is open-ended. It is not like an app, where the functions are designed in advance and the user can only tap a few buttons. In a dialogue the user can ask anything, which is a great challenge for the machine: it would need to know practically everything in the world to understand and answer users' questions.

Therefore, in the area of intelligent question answering, we have invested some resources in focusing on Internet FAQs, Internet encyclopedia knowledge, and in-depth mining of the endless stream of information that emerges every day, so as to better answer users' questions. This is also our focus in the future.

We have also done some work on the chat engine, but considering that it is not that valuable to users, we have not focused on investing in this direction.

The biggest challenge they see

AI Technology Review: What do you think is the biggest challenge in the field of intelligent human-computer dialogue? How to deal with such challenges?

Sun Jian: In intelligent human-machine dialogue, I think the biggest challenge is scalability. It has two dimensions: scalability across domains, and scalability across devices.

First, domain scalability. When we develop dialogue interaction for the music domain, for example, we need to define the ontology of the domain, process its semantic knowledge (song names, singer names, music styles, albums, etc.), define language understanding patterns or train language understanding models, develop the dialogue flow, call music services, and process the data. Reaching release quality takes a lot of work and a lot of polishing of details. When we then want to develop a new domain, such as maps, none of these steps gets shorter, and the time required grows almost linearly with the number of domains. The time cost of domain expansion is therefore very high.
Second, device scalability. After we develop music dialogue interaction for smart TVs, can we use it directly on speakers? The answer is no, because the two device types differ in ways that change the dialogue flow. On a smart TV, which has a large screen, the product definition is: when the user wants to listen to Andy Lau's songs, the system displays a list of Andy Lau's songs and the user picks one. That interaction does not work on a speaker, which has no screen. There the product definition is: when the user wants to listen to Andy Lau's songs, the system simply plays the Andy Lau song the user is most likely to enjoy, without asking the user to choose. So whether a device has a screen, how large the screen is, and other such factors mean that the dialogue flow differs even within the same domain.
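The TV-versus-speaker difference can be sketched in a few lines. This is only an illustration of the product definitions described above; the `Device` type, the `respond_to_music_request` function, and the action names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Device:
    name: str
    has_screen: bool


def respond_to_music_request(device: Device, songs_by_preference: List[str]) -> Dict:
    """Choose the next dialogue action for "play songs by <artist>".

    songs_by_preference is assumed to be sorted with the song the user is
    most likely to enjoy first.
    """
    if not songs_by_preference:
        return {"action": "apologize"}
    if device.has_screen:
        # With a screen: show the list and let the user choose.
        return {"action": "show_list", "songs": songs_by_preference}
    # Without a screen: play the top recommendation directly.
    return {"action": "play", "song": songs_by_preference[0]}


songs = ["Song A", "Song B", "Song C"]  # placeholder titles
print(respond_to_music_request(Device("smart TV", has_screen=True), songs))
print(respond_to_music_request(Device("speaker", has_screen=False), songs))
```

Even this tiny branch shows why a dialogue flow written for one device cannot simply be reused on another.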

Based on these two points, we realized that human-computer dialogue interaction is strongly tied to the business, the domain, the device type and so on. The owner of each business is the team best placed to develop dialogue for its own domain, yet developing human-computer dialogue interaction has a much higher barrier to entry for a business team than app development does. Our idea is therefore to split dialogue interaction into two layers: an engine layer and a business layer.

iDST provides the natural interaction platform, and we build up the engine capabilities, such as speech understanding and conversation capabilities, and then let the business team develop conversations suitable for their own business scenarios based on this platform.

AI Technology Review: We all know that it is difficult to make machines understand human language. Can you tell us more specifically what are the key challenges of language understanding in human-computer dialogue?

Sun Jian: I think there are the following points:

The first point is the diversity and ambiguity of spoken language. These qualities make language rich, but they make machine understanding much harder. Diversity means that the same meaning can be expressed in many ways. To turn up the volume, for example, you can say "turn up the volume", "louder", "increase the volume", "turn up the sound", "the sound is too low", or even "can't hear clearly". Ambiguity means that the same statement can carry several meanings. Does "I want to go to Lhasa" mean buying a plane ticket? Buying a train ticket? Looking up attractions? Looking up a travel guide? Or does the user want to listen to the song "I Want to Go to Lhasa"?
The second point is that language understanding depends on context. A conversation consisting of a single sentence is the exception; just as when people talk to each other, there is back and forth. In a conversation between a person and a machine, establishing, managing and using this context is very difficult, for example:

Q: Then marry me

A: My mother said I am too young to get married.

Q: I asked your mother and she agreed to you marrying me

A: The following results were found for you... (the system was unable to answer and turned to the search results)

The third point is what is commonly called fault tolerance, or robustness in professional terms. In intelligent human-computer dialogue, speech recognition results are imperfect for various reasons, which makes language understanding harder: a name such as "九月奇奇" may be confused with a similar-sounding one, and the title of Han Hong's song "九儿" may be recognized as "九二". In addition, people generally find it hard to remember and say longer entity names exactly; they often reconstruct them from the meaning, adding, dropping, or substituting words. For example, "The King Asked Me to Patrol the Mountain" may be said with slightly different wording, and "Dora the Explorer" may be said as "Dora the Adventurer".
The fourth point is mastery of common sense and the ability to reason. People can converse smoothly because they share a great deal of common sense and can reason from it, but both are very difficult for today's machines. For example, "I'm hungry" implies finding a restaurant, and "I have a stomachache" implies finding a hospital or buying medicine.
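To make the robustness problem concrete, here is a minimal sketch of matching an imperfectly spoken entity name against a known catalog using plain string similarity from Python's standard `difflib` module. It is an assumption-laden illustration, not the approach used by iDST: the function `resolve_entity`, the catalog, and the 0.6 cutoff are hypothetical, and a real Chinese-language system would also have to consider pronunciation-level similarity.

```python
import difflib
from typing import List, Optional


def resolve_entity(spoken: str, catalog: List[str], cutoff: float = 0.6) -> Optional[str]:
    """Return the catalog entry closest to what the user said, if any is close enough."""
    # Compare case-insensitively but return the catalog's original spelling.
    by_lower = {name.lower(): name for name in catalog}
    matches = difflib.get_close_matches(spoken.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[matches[0]] if matches else None


catalog = ["Dora the Explorer", "Peppa Pig", "Paw Patrol"]  # hypothetical catalog
print(resolve_entity("Dora the Adventurer", catalog))
print(resolve_entity("something unrelated", catalog))
```

The cutoff is a trade-off: set too low, unrelated requests get mapped to catalog entries; set too high, small slips of the tongue are no longer tolerated.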

Looking back on the past year

AI Technology Review: What achievements has your team made in the field of language understanding and dialogue in the past year?

Sun Jian: Our work and achievements over the past year mainly include the following aspects:

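First, we fully upgraded our language understanding engine from traditional machine learning methods to deep learning methods and achieved significant improvements on our effectiveness metrics, making the engine more robust to the rich variety of ways users express themselves in speech. For intent determination, after comparing a number of deep learning models, we chose a CNN model and made many improvements to it; for slot filling, the advantage of the Bi-LSTM-CRF model keeps growing as the amount of data increases and various kinds of knowledge are incorporated. For context understanding, we built an effective model to handle it. For robustness, extensive data augmentation has directly improved results, and we are also experimenting with letting the model itself learn to handle entity words with extra, missing, or wrong characters.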
Second, we proposed and designed a dialogue description language for describing task flows. It can not only express the slot-filling part of a dialogue but also fully describe the steps of the entire task and the conditions required for each step. Taking train ticket booking as an example, the language describes not only the dialogue in the information collection stage but also the subsequent task steps, such as choosing a train, choosing a seat, and paying.
Third, we developed a dialogue engine for the dialogue description language. The engine has two features: it supports a cross-domain attribute carry-over mechanism, and it supports a dialogue interruption and return mechanism. Let me expand on both. First, the dialogue can jump freely across domains and carry attribute information over during the jump. For example, while buying a train ticket, the user may want to check the weather at the destination, and if the weather is bad, the itinerary may need to change. The system we designed lets the user jump to a new task fairly naturally and smoothly in the middle of another task, without repeating information already given. Second, the system implements dialogue interruption and return. In human-computer dialogue, the machine often fails to understand something the user says, for various reasons, and the ongoing dialogue is interrupted. Without such a mechanism, the user has to start over from the beginning after an interruption. With it, the user's next turn can pick up where the dialogue left off before the interruption, with no need to repeat anything. This is the part I find most interesting.
Fourth, we proposed an Open Dialogue solution that enables business teams to develop and customize dialogues for their own business domains, and on this basis we built a complete natural interaction platform for human-computer dialogue (NUI).
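The two engine features, cross-domain attribute carry-over and interruption and return, can be sketched with a simple task stack. This is a hypothetical illustration only: the interview does not describe the engine's internals or the dialogue description language's syntax, so the `DialogueEngine` class, the `CARRY_OVER` table, and all slot names below are assumptions.

```python
from typing import Dict, List, Optional

# Hypothetical rules: which slots of the current task may seed a new task.
CARRY_OVER = {
    ("train_ticket", "weather"): {"destination": "city", "time": "date"},
}


class DialogueEngine:
    def __init__(self) -> None:
        # Active tasks, most recent last; earlier entries are suspended.
        self.stack: List[Dict] = []

    def current(self) -> Optional[Dict]:
        return self.stack[-1] if self.stack else None

    def start(self, domain: str, slots: Dict[str, str]) -> Dict:
        """Jump to a new task, inheriting matching attributes from the current one."""
        inherited: Dict[str, str] = {}
        previous = self.current()
        if previous is not None:
            mapping = CARRY_OVER.get((previous["domain"], domain), {})
            inherited = {
                target: previous["slots"][source]
                for source, target in mapping.items()
                if source in previous["slots"]
            }
        inherited.update(slots)  # what the user just said takes priority
        task = {"domain": domain, "slots": inherited}
        self.stack.append(task)
        return task

    def finish(self) -> Optional[Dict]:
        """Close the current task and return to the suspended one, if any."""
        self.stack.pop()
        return self.current()

    def not_understood(self) -> Optional[Dict]:
        """Leave the state untouched so the next turn resumes the same task."""
        return self.current()


engine = DialogueEngine()
engine.start("train_ticket", {"destination": "Beijing", "time": "tomorrow"})
weather = engine.start("weather", {})  # the user asks about the weather mid-task
print(weather["slots"])                # city and date are carried over
print(engine.finish()["domain"])       # back to the ticket task, state intact
```

Keeping suspended tasks on a stack is one simple way to preserve their state; whatever the actual design, the key point from the interview is that the user never has to repeat information already given.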

AI Technology Review: In what scenarios and products has the intelligent voice human-computer dialogue developed by Alibaba been applied?

Sun Jian: The intelligent voice human-computer dialogue we developed was built in deep collaboration with Alibaba's YunOS operating system. Any device with YunOS installed is therefore naturally equipped with intelligent voice dialogue interaction.

It is now available on Tmall Magic Box, YunOS mobile phones and some smart speakers. In addition, Alibaba is working with SAIC to build Internet cars, and we are working on dialogue interaction around Internet car scenarios.

The Internet car is a scenario where dialogue interaction is a hard requirement, and a very interesting one, because the YunOS operating system can receive many signals from the car's hardware (how much gas is left in the tank, whether the sunroof is open, and so on). If the car is running low on gas, the operating system can sense this and proactively tell the owner that there is a gas station two kilometers ahead where they can refuel. This is a great help to users.

They are preparing to make human-machine dialogue more intelligent

AI Technology Review: What are the important research topics that need to be explored in the field of intelligent human-computer dialogue?

Sun Jian: I think there are three main aspects.

First, current language understanding still targets specific, pre-defined domains. It scales poorly and does not cover enough user needs. Open-domain natural language understanding is therefore an important future direction, and also a big challenge.

Second, current human-computer dialogue interactions are basically single-round dialogues. Even multi-round dialogues only consider limited contexts. Modeling of language understanding based on the complete dialogue context is also a topic worthy of research and exploration.

Third, current human-computer dialogue is mostly hand-defined by people, with the drawback that its capability does not improve as more dialogues take place. Establishing a data-driven dialogue mechanism, so that the system keeps learning and improving during dialogue, is also a very important direction.

AI Technology Review: We have heard that Alibaba has few openings at the moment. Is iDST still hiring for intelligent voice interaction? What are your expectations and requirements for candidates?

Sun Jian: Many business units at Alibaba have no headcount for hiring, but intelligent voice interaction is a direction the group takes very seriously and keeps investing in, so headcount is not a problem. We have the following expectations for candidates:

The first is curiosity. You need to be curious about new things and have the willingness and initiative to actively explore them.

The second is the ability to learn. Science and technology are developing extremely fast, and new scientific advances and technological breakthroughs may appear every day, so we need to keep learning every day.

The third is thinking ability, which means being able to ask yourself more questions about the problems and phenomena you encounter, and being able to have your own thoughts and judgments. If you are interested in intelligent human-computer dialogue, we look forward to having more discussions, exchanges or cooperation with you.

The article is exclusively reported by Leifeng.com [AI Technology Review]
