Dialogue with Weng Jiaqi: Zhujian Intelligence's Emotional Computing Technology and Commercialization | CCF-GAIR 2018
Text | Li Shi
Report from Leiphone.com (leiphone-sz)
The 2018 Global Artificial Intelligence and Robotics Summit (CCF-GAIR) was held in Shenzhen. The summit was hosted by the China Computer Federation (CCF), co-organized by Leiphone.com and the Chinese University of Hong Kong, Shenzhen, with strong guidance from the Bao'an District Government. It is a top-level exchange event spanning China's artificial intelligence and robotics academia, industry, and investment communities, aiming to build the strongest cross-disciplinary platform for exchange and cooperation in China's AI field.
CCF-GAIR 2018 continued the top-tier lineup of the previous two editions, offering 1 main venue and 11 special sessions (bionic robots, robot industry applications, computer vision, intelligent security, financial technology, intelligent driving, NLP, AI+, AI chips, IoT, and investors), aiming to present more forward-looking and practical content and on-site experience to participants from industry, academia, and investment.
At the Natural Language Processing Special Session, Zhujian Intelligence CTO Weng Jiaqi was invited as a guest at the roundtable discussion on "Difficulties in the Implementation of Natural Language Processing and Future Applications" and shared his views. After the meeting, Leifeng.com conducted a one-on-one interview with Weng Jiaqi.
Zhujian Intelligence was founded in 2016 and focuses on text analysis, natural language understanding, and emotional computing. It has two main product lines: one is the brain-like conversational robot, including customer service robots, shopping guide robots, financial robots, marketing robots, personal assistants, brand IP robots, etc.; the other is the multimodal emotion recognition system, including an emotion recognition and analysis system, facial expression recognition system, impression analysis system, advertising effect analysis system, call center quality inspection system, classroom emotion analysis system, etc.
Currently, most companies working on conversational AI focus on text, while Zhujian Intelligence's multimodal emotion recognition covers text, voice, and facial expressions. Leifeng.com's interview with Weng Jiaqi accordingly focused on two topics: emotional computing technology and its commercialization.
In Weng Jiaqi's view, human-computer interaction has three levels: the bottom level is natural language processing, the second is intent understanding, and the third is understanding the meaning behind the words. The industry is still at the first two stages, and reaching the third stage inevitably requires emotional computing.
The difficulty of emotional computing is that it not only needs to understand the emotion of a single modality accurately, but also needs to determine which emotion is real when emotions from multiple modalities conflict. For example, if a person's voice sounds happy but his facial expression looks angry, is he happy or angry? What is even harder is that once the AI has read a person's emotions, how should it respond, and how should it comfort someone who is depressed?
Zhujian Intelligence modeled its vision on Samantha, the intelligent assistant in the movie "Her": it believes robots should be able to understand human facial expressions as well as human conversation, and it has been committed to multimodal emotional computing from the start. Founder Jian Renxian has, from the beginning, looked beyond smart voice assistants in phones and speakers to scenes such as stores and retail, where text and voice interaction alone are clearly not enough and vision is indispensable.
However, beyond retail, the application scenarios of emotional computing still need to be explored. After all, in many private settings, such as the home, it is hard for people to accept robots equipped with cameras.
Weng Jiaqi believes that today's language and text technology can already help people solve problems in specific domains, for example booking a hotel or a restaurant through natural conversation, without forcing people to speak according to the robot's logic. In the future, everyone will have their own intelligent assistant that understands their emotions and intentions and helps take care of daily life, and every company will have a customer service robot; it is likely that users' intelligent assistants will deal directly with companies' customer service robots. In these scenarios, both large companies and startups have opportunities, and no single company can cover all the technologies and scenarios.
The following is the original interview text, which has been edited and organized by Leifeng.com without changing the original meaning.
Leifeng.com: What are you currently responsible for at Zhujian Intelligence? What was your work experience like before joining Zhujian?
Weng Jiaqi: I started working with computers in 1982 and first came into contact with artificial intelligence 27 years ago. Of course, it was impossible to do AI at that time because the field was already in decline. Most people who had worked on AI turned to search engines, since search engines are closely related to text analysis. I worked in the search engine field for about 11 years, and now I am back in artificial intelligence. This time, AI should not be a bubble again; it can really enter people's lives.
I joined Zhujian Intelligence about two and a half years ago and am currently the company's CTO, responsible for the technical side, including designing Zhujian's entire dialogue architecture, how the modules relate to and interact with each other, and the delivery of external projects.
Leifeng.com: Can you tell us specifically what modules there are?
Weng Jiaqi: Chatbots can be roughly divided into three categories. The first is functional robots, such as Siri and the WeChat voice assistant, which can check the weather, check stocks, and set reminders. The second is knowledge-based: you can ask it where Lu Qi has gone to work (this was the biggest news yesterday; he joined Pinduoduo), or how much Baidu's stock price fell after Lu Qi left (about 18 points in three days). The third is small talk, which can hold emotional, situational conversations with humans. If you tell the robot you are heartbroken, we have to think about how the robot should respond and how to manage the context of the whole conversation.
I am responsible for the overall conversation-flow control. When a user says something, the robot needs to decide whether to execute a task, provide knowledge, or start small talk, because every module may be triggered. This is the same as a search engine: Baidu's search engine has more than 300 modules behind it, and Google has more than 500. When you ask a question, more than 300 modules actually produce answers, and then those answers have to be merged and ranked: which appear on the first page, which on the second?
The same is true for chatbots, only chatbots are more demanding, because they cannot answer with 100 sentences, they can only answer with one. Which sentence should be chosen so that the reply is lively rather than dull, but also not outrageous? That is what dialogue control does.
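To make this routing-and-ranking idea concrete, here is a minimal Python sketch. All module names, scores, and rules below are hypothetical illustrations for this article, not Zhujian Intelligence's actual architecture:

```python
# A minimal sketch of the "dialogue control" idea described above.
# All module names and scoring rules are hypothetical illustrations,
# not Zhujian Intelligence's actual implementation.

from typing import Callable, List, Tuple

# Each module takes the user utterance and returns (reply, score in [0, 1]).
AnswerModule = Callable[[str], Tuple[str, float]]

def task_module(utterance: str) -> Tuple[str, float]:
    # e.g. weather, stocks, reminders
    if "weather" in utterance:
        return "Today is sunny, 28°C.", 0.9
    return "", 0.0

def knowledge_module(utterance: str) -> Tuple[str, float]:
    # e.g. questions answered from a knowledge base
    if "who is" in utterance:
        return "Let me look that up for you.", 0.7
    return "", 0.0

def chitchat_module(utterance: str) -> Tuple[str, float]:
    # open-domain small talk; usually a weak fallback
    return "I see. Tell me more?", 0.3

def dialogue_controller(utterance: str, modules: List[AnswerModule]) -> str:
    """Collect one candidate reply per module, then answer with the single
    highest-scoring candidate, since a chatbot can only say one sentence."""
    candidates = [m(utterance) for m in modules]
    reply, score = max(candidates, key=lambda c: c[1])
    return reply if score > 0 else "Sorry, I didn't catch that."

print(dialogue_controller("what's the weather like",
                          [task_module, knowledge_module, chitchat_module]))
```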
Leifeng.com: There are conversational robots like Microsoft XiaoIce that develop both IQ and EQ. Some startups are more vertical, focusing on task-based robots or knowledge graphs. In which direction is Zhujian Intelligence developing?
Weng Jiaqi: Microsoft XiaoIce's concept of combining IQ and EQ is correct. Jian Renxian, the founder of Zhujian Intelligence, was also one of the founders of Microsoft XiaoIce. He left Microsoft in 2015 to found the company, whose English name, Emotibot, means "emotional robot". In fact, Zhujian started working on emotional robots a year and a half earlier than Microsoft XiaoIce.
Emotional intelligence and emotion are not the same thing. Emotional intelligence means that I really understand you, I will not offend you, and I will not give a cold answer. This of course includes emotional computing.
In terms of text emotion, some companies classify into three categories: positive, negative, and neutral. Microsoft XiaoIce may have made six categories, while Zhujian Intelligence classifies text emotion into 22 categories, which can more precisely detect emotions such as boredom and jealousy.
Text emotion alone is not enough; we also work on voice and facial expressions. For example, when someone says he scored 500 points in the college entrance examination, you don't know whether to congratulate or comfort him, so you need to hear the tone. Generally speaking, voice emotion is more important than text emotion; the signal is stronger. Facial-expression emotion is more complicated, because halfway through a sentence my face may be contorted and my mouth just open, and a frame captured at that moment does not mean I am surprised.
Combining text, voice, and facial expressions is even more complicated. An example I often cite: I smile at a colleague and say, "You're dead." The emotion in my facial expression conflicts with the emotion in the text. What should we do?
This is a concept of multimodal emotion, which means that your current emotion is a combination of text, voice, and facial expressions, all mixed together, and each has its own weight. Usually the weight of text is slightly lower, voice is the highest, and facial expressions are in the middle.
In the example just now, I said to him with a smile, "You're dead." Actually, it depends on the context. If the previous two sentences were us joking around and I suddenly said that with a smile, it is still a joke. If the previous two sentences were us arguing and I suddenly said, "You're dead," then I was definitely threatening him. So emotion is not just about the words.
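As a rough illustration of the weighted multimodal fusion Weng describes (voice weighted highest, facial expression in the middle, text lowest), here is a minimal Python sketch; the weights and probabilities are invented for illustration and are not Zhujian's actual values:

```python
# A minimal sketch of combining per-modality emotion scores with fixed weights,
# following the rough ordering Weng describes (voice > face > text).
# The weights and label probabilities are illustrative, not Zhujian's values.

MODALITY_WEIGHTS = {"voice": 0.5, "face": 0.3, "text": 0.2}

def fuse_emotions(per_modality: dict) -> str:
    """per_modality maps modality -> {emotion_label: probability}.
    Returns the emotion with the highest weighted combined score."""
    combined = {}
    for modality, dist in per_modality.items():
        w = MODALITY_WEIGHTS.get(modality, 0.0)
        for label, prob in dist.items():
            combined[label] = combined.get(label, 0.0) + w * prob
    return max(combined, key=combined.get)

# The "smiling while saying 'you're dead'" case: happy face and voice, angry text.
example = {
    "face":  {"happy": 0.8, "angry": 0.2},
    "voice": {"happy": 0.6, "angry": 0.4},
    "text":  {"happy": 0.1, "angry": 0.9},
}
print(fuse_emotions(example))  # -> "happy" under these illustrative weights
```

A static weighting like this ignores the context Weng points to; the sketch after the later discussion of conflicting confidences shows one way to go further.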
Leifeng.com: The concept of emotional computing has been talked about a lot recently. Can you tell us about your understanding of this concept?
Weng Jiaqi: Affective computing was proposed by MIT professor Rosalind Picard, who is the originator of the field. I generally divide human-computer interaction into three levels. The lowest level is natural language processing: for example, the syntactic analysis of the two sentences "I am hungry" and "I want to eat something later" is different. That is the lowest level.
The second level is intent understanding. Although those two sentences are different, their intent is the same: it may mean I want to order takeout, or I want to find a nearby restaurant.
The third level is the meaning behind the words, which no one has managed to do yet. For example, if I suddenly tell you I am hungry the first time we meet, I believe you will not feel very good; you might think I am begging for food. If I tell a female colleague I am hungry, she may think I am asking her out for dinner and wonder whether I have ulterior motives. In different scenes, with different people, and in different situations, the same sentence carries a different meaning behind it.
At present, everyone is still working on the first and second layers: how to classify sentences correctly, how to get the syntactic structure right. In this area there are many experts at Harbin Institute of Technology in China. Many people are also working on the second layer, intent understanding, and it is now roughly usable. If I say to the TV or the speaker, "Play a song by so-and-so," it knows I want to listen to music. If I then say, "So-and-so's songs are terrible," that does not mean I want to hear his songs; it means don't play them for me in the future. These intents can now be understood correctly.
The third layer is the meaning behind it. When I say I am hungry, what is the real meaning behind this sentence? To get to this point, emotional computing is inevitable, and you cannot avoid the entire scene and context.
Leifeng.com: Has Zhujian Intelligence integrated text, voice, and face to implement emotional computing scenarios?
Weng Jiaqi: Let me tell you an example of how we helped Sharp TV do new retail. Sharp has a newly opened flagship store in a shopping mall. There are a total of five TV stores in that mall. In the first three days of opening, Sharp's turnover was 900,000, while the other four stores' combined turnover was only over 400,000. Sharp alone exceeded the total of the other four stores. How did it do this?
We put a large screen at the entrance of the store that captures the face of every passerby and identifies gender, long or short hair, age, expression, appearance, and so on. Many people stop to watch. We then recommend different products and discounts based on the user portrait. This made foot traffic into the store more than five times that of the other stores.
After entering the store, there are unmanned smart shelves equipped with tablets and cameras. When the camera sees a girl with long hair walking over, the smart shelf greets her proactively: "Madam, your hair is very beautiful. I have shampoo, hair care, and conditioner products here. Are you interested?" If the camera recognizes dark spots on her face, it automatically recommends products such as concealer.
If she responds, the topic continues; if the camera detects that her expression is turning unhappy, it drops the topic. So the interaction in this case involves face, voice, and text.
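A toy sketch of that smart-shelf flow might look like the following; the attribute names and rules are hypothetical illustrations, not the deployed Sharp system:

```python
# A toy sketch of the smart-shelf flow described above: detect visual attributes,
# greet, recommend, and back off when the shopper's expression turns negative.
# Attribute names and rules are hypothetical, not the deployed system.

def recommend(attributes: dict) -> str:
    if attributes.get("hair") == "long":
        return "Your hair is beautiful. We have shampoo and conditioner on offer."
    if attributes.get("skin_blemish"):
        return "We have concealer products you might like."
    return "Welcome! Let me know if anything catches your eye."

def interact(frames: list) -> None:
    """frames is a time-ordered list of {attributes, expression} observations."""
    print(recommend(frames[0]["attributes"]))
    for frame in frames[1:]:
        if frame["expression"] in ("disgusted", "annoyed"):
            print("(stop the topic and stay quiet)")
            return
        print("(continue the conversation)")

interact([
    {"attributes": {"hair": "long"}, "expression": "neutral"},
    {"attributes": {}, "expression": "annoyed"},
])
```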
Leifeng.com: Today’s conversational AI focuses on voice. Why did Zhujian Intelligence focus on vision from the beginning?
Weng Jiaqi: The idea for our conversational AI mainly comes from the movie "Her", in which the intelligent assistant Samantha can fully perceive the user's state, see the user's expressions, and hear the user's words. Voice is important for any communication, but in many cases a single expression says enough without a word being spoken.
For example, if you walk into a store and see a certain product with a disgusted expression, it actually expresses that you don’t like this product at all.
So from the very beginning we had image processing, voice processing, and text processing. At that time the boss (Jian Renxian) had already thought about the scenarios he wanted in the future: not only human-computer interaction on phones and speakers, but also in stores, so vision is indispensable. The boss's ambition was quite big from the start.
Leifeng.com: What are the difficulties in multimodal emotion computing of text, voice and face?
Weng Jiaqi: The biggest difficulty is, of course, what to do when the emotions from different modalities conflict. If the text is happy, the voice is happy, and the facial expression is also happy, there is no problem; even a primary school student would know that is happiness.
What if the voice is angry but the text is happy? For example, if I say "I am very happy today" in an angry tone, what do you think it means? To solve this, the first step is to recognize the emotions of text, face, and voice accurately. The second is to decide, when the emotions conflict, which one to treat as primary.
Generally speaking, voice emotion carries more weight. But what if the voice emotion is anger with a confidence of only 3 or 4 points, while the text says you are happy with a confidence of 99 points? What do you do then?
Another important point is the overall situation. Even with judgments from three modalities, a single short segment is not accurate enough; I also need to look at the continuous situation, because people's emotions do not change too quickly. Of course, sometimes you are startled or become furious in an instant, but being angry does not mean you will suddenly turn happy the next second. So the continuous run of emotion must be considered, which is a fairly big difficulty.
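Two of the refinements mentioned here, arbitrating conflicts by each modality's confidence and smoothing emotion over time, can be sketched as follows; the numbers and smoothing factor are purely illustrative assumptions:

```python
# A sketch of two refinements Weng mentions: (1) arbitrate conflicts using each
# modality's confidence, and (2) smooth emotion over time, since a person's mood
# rarely flips instantly. The numbers and smoothing factor are illustrative only.

def fuse_with_confidence(signals: dict) -> dict:
    """signals maps modality -> (emotion_label, confidence in [0, 1]).
    Returns a confidence-weighted distribution over emotion labels."""
    dist = {}
    total = sum(conf for _, conf in signals.values()) or 1.0
    for label, conf in signals.values():
        dist[label] = dist.get(label, 0.0) + conf / total
    return dist

def smooth(history: dict, current: dict, alpha: float = 0.3) -> dict:
    """Exponential moving average over the session's emotion distribution."""
    labels = set(history) | set(current)
    return {l: (1 - alpha) * history.get(l, 0.0) + alpha * current.get(l, 0.0)
            for l in labels}

# The case above: angry voice at low confidence vs. happy text at high confidence.
turn = fuse_with_confidence({
    "voice": ("angry", 0.04),
    "text":  ("happy", 0.99),
    "face":  ("happy", 0.60),
})
session = smooth({"happy": 0.7, "angry": 0.3}, turn)
print(max(session, key=session.get))  # -> "happy"
```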
Finally, the most difficult part is, when the smart assistant finds that you are angry or sad, how should it soothe and comfort you? After judging your emotions, how should it respond?
Leifeng.com: What proportion does the multimodal emotion computing solution account for in your current business?
Weng Jiaqi: At present, most smart customer service systems have no vision, and smart TVs, refrigerators, and speakers also have no cameras. After all, installing a bunch of cameras in your home makes people feel very uneasy, and this will definitely violate your privacy.
In public places, such as shopping malls and banks, there are cameras and surveillance, and everyone accepts this.
For example, when I go to an interview, there is a camera facing me, and while I am speaking, it is doing facial analysis for me. This feels a bit weird, but it may be acceptable.
How willing people are to accept cameras depends on the situation and perhaps also on the generation; each generation accepts different things. For example, do people in their seventies really know how to use smartphones or computers? Their acceptance may not be high. Do they know how to use apps? They are still used to communicating by phone call rather than through apps or the Internet.
People in their 50s and 60s may not be used to using search engines because there were no search engines when they grew up. And the proportion of people in their 40s who use apps is definitely not as high as that of people in their 20s.
So this still depends on some changes in the future. Some scenarios may accept it, while others may not.
Leifeng.com: Some speakers now already have screens. Is it possible to add visual effects?
Weng Jiaqi: For now, adding a camera probably would not help sales. Usually, if you add a camera, you also have to add a cover that can be slid over to block the lens.
You must tell users that there is a camera, and give them a cover so they can block it when necessary; that is acceptable. Otherwise, if you suddenly add a camera, your cost goes up but the product does not sell, and people wonder what the point of the speaker is.
And with a speaker, at least the user can say: OK, I turn it off and the camera is gone, rather than having cameras all over the ceiling of the house, where there is really no privacy at all.
Leifeng.com: What if it is a robot? It has eyes similar to human eyes.
Weng Jiaqi: When we watch science fiction movies, everyone accepts that robots can walk around in your home. But if you really put such a robot with a camera in your home, you probably wouldn't feel very comfortable.
Leifeng.com: Zhujian Intelligence has a robot factory (Bot Factory) to help companies customize robots. Will you give each customized robot its own personality?
Weng Jiaqi: At present, we have only done the simplest things. The robot has its own attributes: What is its name? Is it a boy or a girl? How old is it? Does it sleep at night? What does it look like? Who are its father and mother? Where is it from? We set these up based on the questions people ask most frequently.
In addition, some robots are more serious while others are more playful. We have built switches for this: some robots can tell jokes, while others can only check the weather. A simple sketch of this kind of persona profile follows below.
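As an illustration of the kind of persona profile and skill switches described above, here is a minimal sketch; the field names and example values are hypothetical, not Bot Factory's actual schema:

```python
# A minimal sketch of a robot persona profile with skill switches.
# Field names and example values are hypothetical illustrations.

from dataclasses import dataclass, field

@dataclass
class RobotPersona:
    name: str
    gender: str
    age: int
    hometown: str
    parents: str = "the engineering team"
    skills: dict = field(default_factory=dict)  # e.g. {"tell_jokes": False}

    def answer_profile_question(self, question: str) -> str:
        # Answers the most frequently asked profile questions from its attributes.
        if "name" in question:
            return f"My name is {self.name}."
        if "old" in question:
            return f"I am {self.age} this year."
        return "That's my little secret."

serious_bot = RobotPersona("Xiaozhu", "girl", 3, "Shanghai",
                           skills={"tell_jokes": False, "check_weather": True})
print(serious_bot.answer_profile_question("What's your name?"))
```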
We are trying the next step, which is to see if we can have a robot of our own. I can use the data from your daily chats with your friends to train it and learn the way you speak. Then you will have your own robot that can chat in your style.
This step is technically feasible; it is just a matter of data volume. I need enough data so the robot can gradually approximate your behavior. But that raises the question of whether you are willing to share such private data: your conversations with your friends are all private.
Leifeng.com: Have you experimented with this?
Weng Jiaqi: We experimented with this two years ago, but we found that users don't have the patience, because it might take a lot of time. How many years does it take to teach a child? Ten or twenty, right? Do you have that much patience to teach a robot? You'll probably lose your patience after two days, so this is a question of patience.
Leifeng.com: Microsoft XiaoIce is now developing toward AI creation: she can write poems, sing songs, and write news. What do you think about this?
Weng Jiaqi: Actually, writing poems and couplets is relatively easy, because it is solving problems in a very limited direction. In terms of fun, these are very good, and everyone finds them fresh.
But from a practical point of view, how does this help solve problems or make money? It is still hard to put to use. Of course, XiaoIce is positioned as a companion, meant to make you less bored, and from that positioning these fancy features are actually good and helpful.
Leifeng.com: Did Zhujian Intelligence focus on commercialization from the beginning?
Weng Jiaqi: Yes, because we have to move toward commercialization. If I build a very interesting robot but cannot get paid for it, that does not work. Microsoft does not have to care: Microsoft has profitable businesses such as Windows and Office, so they can afford to make XiaoIce fun and interesting.
Leifeng.com: At this stage, what level do you expect conversational artificial intelligence to reach?
Weng Jiaqi: I think the current level of technology can help people solve problems in specific domains. A specific domain is, say, booking a hotel or a restaurant, where the system can understand my conversation. A restaurant-booking robot needs to understand specific wording: for example, "seven or eight people with two children" and "seven or eight people plus two children" mean different things.
When people talk, they usually do not express themselves directly but in all kinds of roundabout ways. For example, "My parents and I are going to celebrate my girlfriend's birthday." How many seats do we need? The speaker is not saying "four"; the number is only implied. So in this field we need robots to understand the way humans actually speak, rather than asking humans to adapt to robots and answer in a way robots can understand.
I think in the future everyone will have their own robot. That robot knows your preferences: you can ask it to order takeout, it knows what you like and don't like, and it knows what you ate yesterday and the day before so it won't order the same thing today. I can ask it to call my mom, and it knows who I mean.
It will also know your mother's phone number and what time is appropriate to call her; it may even remind you that it is too late and your mother has gone to bed.
In the future, every company may have its own robot. For example, McDonald's may have a food ordering robot to help you order food. If you have your own robot and McDonald's has a robot, robots may communicate with each other in the future.
I just need to tell my wristband to order a Big Mac. It knows the Big Mac is from McDonald's, so it goes to find the McDonald's robot. The two robots may not communicate in human language; they will exchange information in their own way, and then McDonald's will process it and complete the order.
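Purely as a speculative illustration of that machine-to-machine exchange, the structured message below is invented for this article; it is not an existing protocol or any company's actual interface:

```python
# A purely speculative sketch of the exchange imagined above: the personal
# assistant resolves "Big Mac" to a merchant bot and sends a structured order
# message instead of human language. The message format is invented here.

import json

MERCHANT_DIRECTORY = {"Big Mac": "mcdonalds-bot"}

def place_order(user_id: str, item: str) -> str:
    merchant = MERCHANT_DIRECTORY[item]  # resolve which merchant bot to contact
    order = {"type": "order", "from": user_id, "item": item, "quantity": 1}
    return send_to_bot(merchant, json.dumps(order))

def send_to_bot(merchant_bot: str, message: str) -> str:
    # Stand-in for the merchant bot's side of the exchange.
    order = json.loads(message)
    return f"{merchant_bot} confirmed {order['quantity']} x {order['item']} for {order['from']}"

print(place_order("alice-wristband", "Big Mac"))
```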
Leifeng.com: In this future scenario, large companies have the terminal advantage. Where are the opportunities for startups?
Weng Jiaqi: WeChat is a very natural entry point because everyone is used to opening WeChat. For example, I ask on WeChat, "How much did I spend on my credit cards this month?" The WeChat robot knows you have three cards, from China Merchants Bank, Bank of Communications, and Pudong Development Bank. It then goes to those three banks' robots, completes identity authentication for you, and tells you the information from all three directly, so you don't need to open three banking apps.
Of course, it is difficult for small companies to seize the entry point, but WeChat and Tencent alone cannot handle all the semantic understanding. Behind the scenes, every bank, whether China Merchants Bank, ICBC, or Shanghai Pudong Development Bank, needs its own robot to receive those instructions and understand natural language, and that is something every company actually needs help with.