Interview with Yu Dong: Multimodality is an important direction towards general artificial intelligence
When it comes to artificial intelligence, we may only be scratching the surface.
Text | Cong Mo
Leifeng.com AI Technology Review: As artificial intelligence technologies such as speech recognition, natural language processing, and computer vision mature and are gradually applied in real-world scenarios, how to deploy them at scale, and how to move toward general artificial intelligence, have increasingly become questions that researchers in these fields are exploring.
Under this kind of exploration and thinking, "multimodality" has become a research direction that leading experts and scholars in the field of artificial intelligence have focused on. For example, Professor Liu Qun, an expert in natural language processing, mentioned in a previous dialogue with AI Technology Review that one of the current key research directions of the Noah's Ark Speech and Semantics Laboratory is multimodality; Zhang Jianwei, an academician of the Hamburg Academy of Sciences in Germany, believes that the future of human-computer interaction is a multimodal sharing model; Professor Jia Jiaya, an expert in computer vision, has put forward the view that "multimodality is the future of artificial intelligence" in many speeches.
As one of the representatives in the industry paying attention to this research direction, Tencent has been focusing on multimodal research since February 2018, and announced in November 2018 that it would explore the next generation of human-computer interaction: multimodal intelligence.
On September 2, at the world's first "Nature Conference - AI and Robotics Conference" jointly organized by Tencent AI Lab, Nature Research and its two journals "Nature Machine Intelligence" and "Nature Biomedical Engineering", Dr. Yu Dong, one of the leaders in the field of speech recognition, deputy director of Tencent AI Lab, and head of the multimodal virtual human project, gave a speech report on "Multimodal Synthesis Technology in Virtual Humans" based on his research results in multimodal technology. Using the virtual human project as a carrier, he introduced the technical advantages of multimodality to everyone and shared Tencent AI Lab's research and application exploration in this direction.
After the meeting, AI Technology Review also conducted an exclusive interview with Dr. Yu Dong to dig deeper into the application and exploration of multimodality. Dr. Yu Dong regards multimodal research as a possible breakthrough toward general artificial intelligence, but he also pointed out, in a measured way, that while multimodality will be a very important direction for artificial intelligence in the future, it is not the whole story. Artificial intelligence is a very broad concept, and our current understanding of it may still be superficial. What exactly the road to general artificial intelligence looks like is something everyone is still exploring.
At the same time, AI Technology Review took this opportunity to talk with Dr. Yu Dong about how he came to pioneer the application of deep learning to speech recognition, his career move from Microsoft Research to Tencent AI Lab, and his views on the future development of the field of speech recognition.
Let’s first take a look at what Dr. Yu Dong shared at this conference.
Human-computer interaction has gone through several stages, such as keyboard interaction and touch interaction. Currently, many devices use voice interaction. The driving force behind each change in the interaction mode is the higher requirements for the convenience, naturalness and accuracy of the interaction between humans and machines.
In order to better meet the needs of human-computer interaction, Dr. Yu Dong pointed out a very important research direction or development trend, which is multimodal human-computer interaction. At the same time, Dr. Yu Dong also explained why multimodality is the development trend of human-computer interaction. There are four main reasons:
- First, multimodal interaction allows humans to choose different combinations of modalities in different scenarios, improving the overall naturalness of human-computer interaction;
- Second, the modalities complement one another's weaknesses, so fusing information from multiple modalities yields more accurate estimates of the user, emotion, scene, and speaker location;
- Third, multimodal interaction offers "mutual supervision": when the machine cannot obtain clear information from one modality, other modalities can provide weak supervision signals that allow the system to keep adapting;
- Fourth, multimodality gives people multi-dimensional feedback when interacting with machines, letting them experience the machine's emotions and semantics through vision, hearing, touch, and other senses.
In addition to these advantages, Dr. Yu Dong believes that multimodal interaction can also bring more imagination space to the industry. For example, human-computer interaction technology can be used to create virtual commentary, virtual front desk, virtual companionship, etc.
It is precisely because of these advantages of multimodal interaction and the imagination space it brings that he led the team to start the research project on virtual humans. Below, Dr. Yu Dong also used the research results of virtual humans as a carrier to give a detailed introduction to multimodal interaction technology.
Dr. Yu Dong first introduced the system framework of multimodal interaction, which consists of three main parts: multimodal input, an intermediate stage of cognition and decision/control, and the final output.
Furthermore, Dr. Yu Dong demonstrated an interim result of the multimodal technology: the pipeline for synthesizing a virtual human. The system first extracts various kinds of information from the text, including actions, expressions, emotions, stress positions, and excitement levels; this information is fed into an action-and-expression model to generate actions and expressions, and at the same time into the multimodal synthesis system DurIAN to synchronously generate speech, lip-shape, and expression parameters, from which a photorealistic or cartoon avatar is synthesized.
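To make the described pipeline concrete, here is a minimal Python sketch of the data flow. All names (extract_cues, gesture_model, durian, renderer) are illustrative placeholders for the stages described above, not Tencent's actual APIs.

```python
# Hypothetical sketch of the multimodal virtual-human pipeline described above.
from dataclasses import dataclass

@dataclass
class Cues:
    text: str
    actions: list            # e.g. ["nod", "wave"]
    emotion: str             # e.g. "happy"
    stress_positions: list   # indices of stressed syllables
    excitement: float        # 0.0 (calm) .. 1.0 (excited)

def extract_cues(text: str) -> Cues:
    """Step 1: pull actions, expressions, emotion, stress and excitement from the text."""
    ...

def synthesize_virtual_human(text: str, gesture_model, durian, renderer):
    cues = extract_cues(text)

    # Step 2a: drive body actions and facial expressions from the extracted cues.
    motion = gesture_model.generate(cues.actions, cues.emotion)

    # Step 2b: DurIAN jointly produces speech plus time-aligned lip-shape and
    # expression parameters, so audio and face stay in sync.
    audio, lip_params, face_params = durian.synthesize(
        cues.text, emotion=cues.emotion, excitement=cues.excitement
    )

    # Step 3: render a photorealistic or cartoon avatar from the synced streams.
    return renderer.render(audio, motion, lip_params, face_params)
```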
Among them, the DurIAN model for simultaneous synthesis of speech and images, as the core achievement of multimodal synthesis technology, is the focus of Dr. Yu Dong's introduction this time.
According to Dr. Yu Dong, compared with traditional speech synthesis methods and the latest end-to-end speech synthesis methods, the application of multimodal synthesis technology DurIAN model has achieved better results in terms of naturalness, robustness, controllability, generalization ability and real-time performance.
Traditional speech synthesis methods vs. end-to-end speech synthesis methods
Before formally introducing the DurIAN model, Dr. Yu Dong first introduced the traditional speech synthesis method, the end-to-end speech synthesis method, and the respective advantages and disadvantages of these two methods.
Traditional speech synthesis is mainly based on the BLSTM+WORLD model, which is stable and controllable, but the synthesized speech sounds mechanical. Nevertheless, because of its stability and controllability, this framework is still the one mainly used in practical industrial systems.
The advantage of end-to-end speech synthesis is that it sounds very natural, but its disadvantage is that it is relatively unstable and hard to control; the most common problems are dropped words and repetitions. According to results reported in the literature, such systems drop or repeat words at a rate of roughly 1%-5%, so the method has not been widely adopted in practical systems. Recently, however, it has made great progress, for example the Tacotron model combined with WaveNet proposed by Google in 2018.
Compared with traditional speech synthesis methods, the advantages of the end-to-end speech synthesis model Tacotron mainly include four improvements:
First, it uses a neural network-based encoder model to replace manually designed linguistic features;
Second, it directly predicts the frequency spectrum containing rich information rather than the source-filter acoustic features;
Third, it introduces an autoregressive model to solve the over-smoothing problem in the synthesis process;
Fourth, it adopts an end-to-end training method based on the attention mechanism.
However, this end-to-end attention mechanism also makes the system less stable. Through analysis, Dr. Yu Dong's team found that the attention mechanism is the main cause of problems such as dropped words and repetitions; he showed two synthesis examples in which the words marked in blue had been dropped.
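The sketch below, an assumed illustration rather than Tacotron's actual code, shows why attention-based alignment can misbehave: at every output frame the decoder re-computes a soft alignment over all encoder states, and nothing forces that alignment to move monotonically through the input, so it can skip a phoneme (a dropped word) or revisit one (a repetition).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Minimal soft-attention step of the kind DurIAN later removes."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, 1)

    def forward(self, query, encoder_states):
        # query: (1, dim) current decoder state; encoder_states: (T_in, dim)
        scores = self.v(torch.tanh(self.q(query) + self.k(encoder_states)))  # (T_in, 1)
        alignment = F.softmax(scores, dim=0)        # learned and unconstrained
        context = (alignment * encoder_states).sum(dim=0, keepdim=True)      # (1, dim)
        return context, alignment  # an erratic alignment drops or repeats words
```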
Speech synthesis system using multimodal technology: DurIAN model
Based on this diagnosis of why the end-to-end Tacotron model drops and repeats words, Dr. Yu Dong's team proposed a solution in the DurIAN model: keep the parts of Tacotron that contribute to the naturalness of the synthesized speech, namely the first three improvements listed above, and replace the end-to-end attention mechanism with a duration prediction model. The basic approach is to train a phoneme duration prediction model and then train the rest of the model end-to-end given the durations.
In this way, the DurIAN model retains the high naturalness of end-to-end speech synthesis while improving stability and controllability and ensuring that no words are dropped or repeated.
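A minimal sketch of the idea, under the assumption that the duration model outputs an integer frame count per phoneme (names here are illustrative, not the paper's code): instead of letting attention decide the alignment at synthesis time, the encoder outputs are simply repeated according to the predicted durations, so the alignment is monotonic by construction.

```python
import torch

def expand_by_duration(encoder_outputs, durations):
    """encoder_outputs: (num_phonemes, d); durations: (num_phonemes,) integer frame counts."""
    # Phoneme i is repeated exactly durations[i] times, so no phoneme can be
    # skipped (dropped word) or revisited (repetition).
    return torch.repeat_interleave(encoder_outputs, durations, dim=0)  # (total_frames, d)

# Example: 3 phonemes lasting 5, 2 and 7 frames -> 14 decoder input frames.
enc = torch.randn(3, 256)
frames = expand_by_duration(enc, torch.tensor([5, 2, 7]))
assert frames.shape == (14, 256)
```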
In terms of controllability, the DurIAN model goes further and achieves fine-grained control. The basic idea is to use supervised learning without finely annotating the training corpus: it is enough to label, for example, whether the voice is excited or whether the speaking rate is fast or slow. During training, each control variable learns a direction vector; during synthesis, scaling the corresponding vector by a continuous value yields fine-grained style control.
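A hedged sketch of this coarse-label style control (class and variable names are illustrative assumptions): each control variable gets a learned direction vector, and at synthesis time the vector is scaled by a continuous value and added to the hidden representation.

```python
import torch
import torch.nn as nn

class StyleControl(nn.Module):
    def __init__(self, dim, controls=("excited", "fast")):
        super().__init__()
        # one learnable direction per coarse label available in training
        self.directions = nn.ParameterDict(
            {name: nn.Parameter(torch.zeros(dim)) for name in controls}
        )

    def forward(self, hidden, scales):
        # hidden: (T, dim) encoder states; scales: dict like {"excited": 0.7}.
        # Training only sees 0/1 labels, but at synthesis time the scale can be
        # any float, giving continuous, fine-grained style control.
        for name, s in scales.items():
            hidden = hidden + s * self.directions[name]
        return hidden
```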
In addition to stability and controllability, the DurIAN model has made significant improvements in robustness, generalization, and real-time performance.
To address the weak robustness and generalization of previous end-to-end speech synthesis systems, the DurIAN model introduces linguistic information, especially punctuation and prosodic boundaries, making full use of the prosodic structure of Chinese speech to improve generalization. Concretely, DurIAN replaces the encoder structure of the Tacotron model with a Skip Encoder, which effectively injects the prosodic structure of Chinese sentences. The basic idea of the Skip Encoder is to express this linguistic information explicitly with extra frames at the input; however, since punctuation and prosodic boundaries are time points rather than time spans, these extra frames are skipped at the encoder output, so that each frame of the encoder output still corresponds to a frame of the spectrogram.
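A rough illustration of the Skip Encoder idea (assumed interfaces, not the paper's implementation): boundary and punctuation symbols are inserted as extra input positions so the encoder can condition on them, but their outputs are dropped, leaving one encoder frame per phoneme that actually occupies time in the spectrogram.

```python
import torch
import torch.nn as nn

def skip_encode(encoder: nn.Module, token_ids: torch.Tensor, is_boundary: torch.Tensor):
    """token_ids: (T_in,) phoneme + boundary/punctuation symbol ids;
    is_boundary: (T_in,) bool mask marking the boundary/punctuation positions."""
    hidden = encoder(token_ids.unsqueeze(0)).squeeze(0)   # (T_in, dim)
    # Boundary symbols have already influenced their neighbours inside the
    # encoder; now drop them so encoder frames align one-to-one with phonemes.
    return hidden[~is_boundary]                           # (num_phonemes, dim)
```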
On the issue of real-time performance, Google had previously proposed the WaveRNN model. Although it computes much faster than WaveNet, the neural vocoder commonly used early on, and can reach real time after careful engineering optimization, its real-time factor is still unsatisfactory and the cost of synthesis remains high. In response, Dr. Yu Dong's team proposed a multi-band synchronous WaveRNN technique. The basic approach is to split the speech signal into frequency bands and use the same vocoder model to predict the values of all bands simultaneously at each step. With four bands, four values are produced per step, so the number of sequential steps is a quarter of the original. During synthesis, after the vocoder predicts the values of the bands, upsampling and carefully designed filters ensure that the original signal is reconstructed without distortion.
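The conceptual loop below sketches this multi-band scheme; the vocoder object and its methods (init_state, step, synthesis_filterbank) are assumed interfaces used only to illustrate the 4x reduction in sequential steps, not the actual implementation.

```python
import numpy as np

NUM_BANDS = 4

def synthesize_multiband(vocoder, conditioning, num_fullband_samples):
    steps = num_fullband_samples // NUM_BANDS            # 4x fewer sequential steps
    subbands = np.zeros((NUM_BANDS, steps))
    state = vocoder.init_state()
    for t in range(steps):
        # a single step jointly predicts the next sample of all 4 subbands
        samples, state = vocoder.step(conditioning[t], state)
        subbands[:, t] = samples
    # Upsample each band by 4 and combine through the synthesis filter bank;
    # with properly designed filters the full-band signal is recovered without
    # audible distortion.
    return vocoder.synthesis_filterbank(subbands, up_factor=NUM_BANDS)
```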
In addition to speech synthesis, Dr. Yu Dong also demonstrated the advantage of the DurIAN model in synthesizing multimodal information synchronously: the duration prediction model allows the system to generate speech, mouth-shape, and facial-expression parameters in sync, ultimately producing a virtual human with a cartoon or photorealistic appearance.
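One way to picture this synchronization, under assumed interfaces (durian.encode, predict_durations, and the decode_* heads are illustrative): the same per-phoneme durations that expand the text encoder for speech also define the time axis for lip-shape and expression parameters, so every audio frame has a matching face frame.

```python
import torch

def synthesize_synced(durian, text_ids):
    enc = durian.encode(text_ids)                       # (num_phonemes, d)
    dur = durian.predict_durations(enc)                 # (num_phonemes,) integer frames
    frames = torch.repeat_interleave(enc, dur, dim=0)   # shared time axis for all outputs

    mel = durian.decode_mel(frames)                     # speech frames
    lips = durian.decode_lip_params(frames)             # lip-shape parameters per frame
    face = durian.decode_expression(frames)             # expression parameters per frame
    return mel, lips, face                              # frame-aligned by construction
```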
Although the DurIAN model has done very well in terms of naturalness and robustness, style controllability, real-time, and synchronous synthesis of speech, mouth shape, and facial expressions, Dr. Yu Dong also pointed out that there is still a lot of room for exploration of this technology, and his team still has a lot of work to do in the future, mainly in four directions:
- First, in terms of model optimization, explore end-to-end training methods based on the DurIAN structure to better support end-to-end optimization;
- Second, in terms of control capabilities, the model needs comprehensive control, i.e., it should be able to synthesize appropriate speech for different scenarios, emotions, timbres, and tones;
- Third, in terms of training corpora, the system should be able to learn prosody from low-quality corpora and sound quality from high-quality corpora;
- Fourth, further explore model customization so that new timbres can be trained with only a small amount of speech data (<15 minutes).
AI Technology Review interview with Dr. Yu Dong:
Q: The topic of your report this time is "Multimodal Synthesis in Virtual Humans". In your speech, you focused on the latest achievements in virtual humans and the application of multimodal technology in virtual humans. What was the opportunity that led you to start researching this project?
Yu Dong: First, we are increasingly aware that a single technology can do very little, so we need to combine many technologies to produce more influential results.
Second, Tencent AI Lab set up all the research directions needed for virtual humans at the beginning, including speech, natural language processing, machine learning, computer vision, etc. Therefore, the conditions we currently have for the virtual human project are already relatively mature.
Third, multimodal interaction is an inevitable trend of historical development, and we estimate that this technology will become more and more important in the next few years.
Q: What is the current progress of the virtual human project?
Yu Dong: We started planning this project in the second half of last year, and we really started to organize this project at the beginning of this year. After eight months of research, the project has made some progress. (Related progress can be found in the above report)
This project is roughly divided into three core parts: the first is the virtual human's output; the second is its input, i.e., perception such as seeing, hearing, and touching; the third is the cognition and dialogue module, which is the least mature but also a very important one. The industry has been studying the cognition module for a long time, yet it is still unclear what the correct approach is. We are not sure how far we can get on this part, but we still need to commit resources and keep moving in this direction.
Q: Now in the field of artificial intelligence, researchers including Professor Jia Jiaya, head of Tencent Youtu Lab, are studying multimodal technology. He also put forward the view that "multimodality is the future of artificial intelligence development" in a recent speech. What do you think of this view?
Yu Dong: I think multimodality is an important direction for the future. Artificial intelligence is a very broad concept. In fact, we have only learned a little bit about it so far, including basic questions such as what cognitive reasoning and causal reasoning are, and why the generalization ability of machines is so weak. We have not figured it out yet.
We are still exploring what the road to general artificial intelligence will look like. Therefore, reinforcement learning, multimodal interaction, and so on are all important attempts toward general artificial intelligence, but none of them is the whole answer.
In a few years, perhaps we will find that some other technology is the technology that can truly realize general artificial intelligence.
Q: In academic terms alone, your résumé is very rich. You were one of the first researchers to apply deep learning to speech recognition, you have worked closely with Geoffrey Hinton, Deng Li, and others, and your record of papers, monographs, and research results is outstanding. What led you to choose speech recognition as your research direction in the first place?
Yu Dong: When I was in primary school, I read an extracurricular book called "Strange Robot Dog". Many of the things discussed in the book have been realized now, including machines that can understand what people say, interact with children, help them solve learning problems, and take children out to play, etc. So actually, I became interested in these intelligent robots when I was a child.
I really came into contact with speech recognition during my undergraduate studies. I majored in automatic control at Zhejiang University, in a special class the university had set up called the "mixed class", which took in the top 100 freshmen of that year. The teachers trained us as future researchers, so from the moment we entered school we were already paying attention to the idea of the "national team of science and technology".
In my third year of university, I joined a research group. It was a minor boom period in the development of artificial intelligence (1989-1991), and there were two main popular directions. One was expert systems; my senior at the time, Wu Zhaohui (now the president of Zhejiang University), did a lot of work in that direction. The other was neural networks, which had just become popular, and neural networks were one of my own directions at the time.
After graduating from college, I planned to go to the Chinese Academy of Sciences, because at that time, everyone knew that the Chinese Academy of Sciences was the national team of science and technology. Since my undergraduate major was automatic control, I went to the Institute of Automation to find a mentor. In the process, I found Professor Huang Taiyi, whose research direction was consistent with my interests. He was studying speech recognition. Coincidentally, my senior in the "mixed class" Xu Bo (now the director of the Institute of Automation) was also studying for a master's degree with Professor Huang Taiyi at the time. So I finally went to Professor Huang Taiyi for graduate studies and began to enter the research field of speech recognition.
Q: In fact, in the early stages of deep learning, this method was not favored. In what context did you start studying deep learning?
Yu Dong: As I mentioned earlier, when I first came into contact with neural networks, neural networks were one of the hot research directions in artificial intelligence at that time.
Later, when I was studying for a master's degree with Professor Huang Taiyi, Professor Huang Taiyi and other teachers in his laboratory also used neural network methods to do speech recognition, so my master's thesis in the Institute of Automation used neural network methods to do speech recognition. This laid the foundation for my subsequent work on introducing deep learning into speech recognition tasks.
Q: This year, Hinton and other three deep learning giants won the 2018 Turing Award. The revolutionary impact of deep learning on the field of artificial intelligence has already occurred a few years ago. Do you think this is a belated honor for deep learning? In addition, how do you evaluate the work of these three researchers?
Yu Dong: I think it is basically timely. When many scientific advances first appear, it is difficult for people in the field to see how large their impact will be, so recognition usually comes with a delay, sometimes of only a few years, and sometimes not until after the inventor has died. So I think it is quite timely for them to receive this honor.
First, they started studying deep learning very early and had already done a great deal of groundwork before I first encountered this line of work in college; second, they persisted in this direction for a long time, even through its low periods. These are qualities we researchers should learn from.
Q: Is applying deep learning technology to speech recognition your most representative work? What are your main research directions in the field of speech recognition?
Yu Dong: I think this is a relatively representative work. Of course, we have done a series of work in this research direction, and therefore it has played a relatively large role in promoting this field. If it were just a single work, the driving force would not be so great.
One of the research directions we are currently focusing on is multimodality, a technology that covers information such as vision, sound, symbolic language, smell, and touch. Speech-related technologies, such as speech recognition, speech synthesis, speech enhancement, speech separation, and voiceprint recognition, are all used in multimodality.
Q: In addition to academia, you also have extensive experience in the industry. In May 2017, you left Microsoft Research and joined Tencent AI Lab. What changes have you experienced in terms of work content and the role you assumed?
Yu Dong: When I was working at Microsoft Research, I was relatively more focused on my own research direction and technical aspects. After joining Tencent AI Lab, my role is no longer purely technical research-oriented. In addition to technical research, I also need to play the role of a manager.
Relatively speaking, there were two things that were hard to adapt to at first: first, I had to spend a lot of time on management and correspondingly less on technology, which required me to find a better balance; second, because the team I lead is in Seattle, the time difference with headquarters meant I often had to hold meetings with colleagues in China at night, leaving much less free time in the evenings than I had at MSR. To reduce communication friction, I have increased the amount of time I spend at the lab in China.
Q: At present, domestic technology giants have actually established artificial intelligence-related laboratories one after another. What do you think of the position of Tencent AI Lab among them?
Yu Dong: Now these companies have established artificial intelligence laboratories and recruited many strong scientists. I think this is a good trend and will have a great driving effect on the development of AI as a whole.
In comparison, Tencent AI Lab is slightly different in that our research may not be as closely tied to products as other labs. Other companies’ labs are more like an engineering academy, preferring to replicate technologies in papers and then implement them in products. We, on the other hand, focus more on whether we can develop cutting-edge technologies, which is different from the focus of other companies’ labs.
Q: How much attention does your team pay to the academic community's progress in the field of speech recognition? In addition to speech, what other research directions does your team focus on?
Yu Dong: We pay close attention to cutting-edge technologies. I personally attend at least one speech-related conference and one natural language processing conference every year, and other members of my team also attend related conferences. Therefore, our colleagues basically attend major academic conferences.
In addition to speech, we are also focusing on natural language processing, computer vision, graphics and imaging, and the basic theories of machine learning and artificial intelligence technologies.
Q: In terms of industrial implementation, compared with other fields of artificial intelligence, speech recognition is ahead, but there are also many problems exposed. Which problems do you think are more serious?
Yu Dong: In fact, the problem is still a robustness problem. Now the deep learning-based method has made the system much more robust than before, but it still cannot achieve the effect we expect.
Our main approach now is to increase the training corpus, but the training corpus is currently difficult to collect. Even if a lot of corpus is collected, once the machine is in a completely new mismatch environment that it has never seen before, it will not be able to achieve very good results.
A typical example: mainstream speech recognition systems now perform quite well, even in relatively noisy environments, but if two people speak at the same time, the error rate can reach 50-60 percent. Likewise, if the speaker has a heavy accent, the recognizer does not work well.
We have tried many solutions before, including improving the generalization ability of the model and making the model adaptable. At present, these solutions still have a lot of room for improvement.
Q: In your opinion, what stages has the development of the field of speech recognition gone through, what stage is it in now, and what should the ideal state be?
Yu Dong: In terms of difficulty, speech recognition and other fields of artificial intelligence have gone through very similar stages: at the beginning, some very simple tasks were done, such as phoneme recognition and single word recognition; then it was the stage of continuous speech recognition. After the hidden Markov model came out, continuous speech recognition became feasible, and later it came to large-vocabulary continuous speech recognition; after that, it was the stage of real-time speech recognition, which required the machine to understand people chatting freely.
Now we are at the stage of speech recognition in completely real scenarios. For example, many researchers are trying to study speech recognition in cocktail party scenarios. This is also the direction we will break through in the next stage. Speech recognition in real scenarios also includes speech recognition in very noisy environments or scenarios where the speaker has a heavy accent.
I think the ideal state is for machines to achieve higher recognition accuracy than humans. One day, computers should be able to recognize speech more accurately than humans in all scenarios.
Q: In the next three to five years, what directions or technologies can we seek breakthroughs in the field of speech recognition?
Yu Dong: I think that in the next three to five years there are three main directions for breakthroughs in speech recognition: the first is multimodality; the second is models with stronger and faster adaptive capabilities; and the third is speech recognition in cocktail-party-like scenarios, which is also a direction worth exploring.
Finally, the original paper download link of the DurIAN model is attached:
https://www.yanxishe.com/resourceDetail/999
At this conference, Tencent AI Lab also officially released "42 Big Questions about AI and Robots", which you can view and download for free at https://www.yanxishe.com/resourceDetail/988.