Sogou and Tsinghua University Achieve a Breakthrough in AI Music-Driven Dance: What Will the Future of Human-Computer Interaction Look Like?

Publisher: 悠闲自在 | Last updated: 2020-08-24 | Source: eefocus

What kind of chemical reaction occurs when sound and AI behavior are combined? This question is becoming a new focus of AI research in China and abroad.

 

For example, Carnegie Mellon University's Robotics Institute is studying the interaction between sound and robot motion. In China, the work has started with digital humans: Sogou's avatar technology team and Professor Jia Jia's team at Tsinghua University's Tiangong Intelligent Computing Institute have taken the lead in researching audio-driven body movement.

 

Recently, the two teams' jointly authored digital human paper, "ChoreoNet: A Music-Dance Synthesis Framework Based on Dance Movement Units", was accepted as a long paper at ACM Multimedia 2020, a top international conference.

 

ACM (Association for Computing Machinery), the body that confers the Turing Award, often called the "Nobel Prize of computing", holds a prominent position in the industry. Its ACM Multimedia conference is regarded as a top-tier venue in the multimedia field, with a very low paper acceptance rate.

 

So what is so groundbreaking about this new technology that it earned recognition from a top conference?

 

Dancing to the music: how is music-to-dance synthesis achieved?

There are already many mature applications in which digital humans produce facial expressions and body movements from the semantics of text, such as AI-synthesized news anchors. If they could also react with synchronized, natural body movements to audio, it would open up compelling possibilities across many scenarios.

 

However, making a body move with sound requires solving a number of technical problems first. For example:

 

The traditional approach to music-dance synthesis is a baseline method that maps music directly onto human skeleton keypoints. Many keypoints are hard to capture and predict, however, which introduces heavy redundancy and noise, making the synthesized results unstable and the movements inconsistent with those of real people.
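To make that concrete, such a baseline can be caricatured as a per-frame regression from audio features straight to keypoint coordinates (a hypothetical sketch, not any specific published system); with no structural prior over movements, noise in the features translates directly into jitter in the pose:

# Hypothetical caricature of the baseline: regress keypoints frame by frame
# directly from audio features, with no notion of dance units.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 17 * 3)) * 0.01  # stand-in for "learned" weights

def audio_to_pose(audio_feat):
    """Map one frame of 128-dim audio features to 17 joints x (x, y, z)."""
    return audio_feat @ W  # direct per-frame regression, no dance prior

# Every joint is predicted independently per frame, so feature noise
# shows up directly as jitter across the pose sequence.
feats = rng.normal(size=(100, 128))  # 100 frames of audio features
poses = np.stack([audio_to_pose(f) for f in feats])
print(poses.shape)  # (100, 51)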

 

Later, researchers such as Yalta proposed tackling these problems with weakly supervised learning. But because such models lack knowledge of human dance experience, the synthesized motion still looks unnatural and expresses emotion poorly.

 

In addition, music clips are relatively long and are accompanied by thousands of movements, so remembering and mapping such ultra-long sequences is itself a major challenge for the model.

 

 

The breakthrough made by the Sogou and Tsinghua Tiangong Institute research team is to build human expert knowledge into the algorithm, proposing ChoreoNet, a framework that imitates human choreography to generate dynamic, graceful, coherent, non-linear, and highly realistic dance from music.

 

Simply put, ChoreoNet captures and digitizes the movement units of professional dancers together with musical melodies, then lets the AI find patterns in them: which dance moves suit which beats and melodic styles, and how to string them into a coherent motion trajectory.
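As a rough, hypothetical sketch of this two-stage idea (not the authors' implementation; the module names, features, and CAU labels below are invented for illustration), the pipeline first predicts a sequence of choreographic action units from coarse music features, then expands each unit into skeleton motion:

# Hypothetical sketch of a ChoreoNet-style two-stage pipeline.
# Stage 1: music features -> choreographic action units (CAUs).
# Stage 2: CAU sequence -> skeleton keypoint motion.
from dataclasses import dataclass

@dataclass
class MusicSegment:
    bpm: float     # tempo of this segment
    style: str     # e.g. "cha-cha", "waltz", "rumba", "tango"

# Toy stand-in for the learned music-to-CAU model: a lookup table.
CAU_TABLE = {
    "cha-cha": ["basic_step", "new_york", "spot_turn"],
    "waltz":   ["box_step", "natural_turn", "whisk"],
}

def predict_caus(segments):
    """Stage 1: map each music segment to a CAU label."""
    caus = []
    for seg in segments:
        options = CAU_TABLE.get(seg.style, ["idle"])
        # use tempo as a crude proxy for beat/melody features
        caus.append(options[int(seg.bpm) % len(options)])
    return caus

def cau_to_motion(cau, n_frames=30):
    """Stage 2: expand a CAU into per-frame skeleton keypoints (placeholder zeros)."""
    return [[0.0] * 17 * 3 for _ in range(n_frames)]  # 17 joints x (x, y, z)

music = [MusicSegment(124, "cha-cha"), MusicSegment(90, "waltz")]
cau_seq = predict_caus(music)
motion = [frame for cau in cau_seq for frame in cau_to_motion(cau)]
print(cau_seq, len(motion), "frames")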

 

Specifically, the researchers made breakthroughs in two areas:

1. Dance knowledge. Motion capture was used to record how professional dancers choreograph movements to the rhythm and melody of music. The researchers collected dance data of four types (cha-cha, waltz, rumba, and tango), segmented the recordings into choreographic action units (CAUs), each spanning several music beats, and built a mapping sequence from music to actions.

 

 

2. Motion generation. The dance data collected above consists only of skeleton keypoints, so how can the transitions between consecutive units be made natural? Borrowing from the kind of sequence modeling used in NLP semantic understanding, the researchers let the AI respond in real time based on its accumulated knowledge, and designed a GAN-based motion generation model that fills in the missing frames between dance movements, achieving smooth transitions and a natural effect (a simplified sketch of this filling step follows below).
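ChoreoNet's actual filler is a learned GAN generator; purely to illustrate the idea of synthesizing the missing transition frames between two action units, here is a minimal stand-in that blends the last pose of one unit into the first pose of the next with linear interpolation (all names and shapes are hypothetical):

# ChoreoNet's filler is a learned GAN-based generator; plain linear
# interpolation is used here only to illustrate "filling in the missing
# transition frames" between two action units (names are hypothetical).
import numpy as np

def fill_transition(last_pose, next_pose, n_frames=10):
    """Generate n_frames poses blending last_pose into next_pose."""
    t = np.linspace(0.0, 1.0, n_frames)[:, None]  # (n_frames, 1)
    return (1 - t) * last_pose + t * next_pose    # (n_frames, n_joints * 3)

pose_a = np.zeros(17 * 3)        # final pose of the previous unit
pose_b = np.ones(17 * 3) * 0.5   # first pose of the next unit
transition = fill_transition(pose_a, pose_b, n_frames=8)
print(transition.shape)          # (8, 51)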

 

Experimental results show that ChoreoNet outperforms the baseline method: it generates structured control over longer durations, producing movements that match the music, connect naturally, and flow smoothly in emotional expression.

 

In this breakthrough, Sogou's early attention to audio-driven body movement, combined with its avatar technology for generating body motion and posture, reflects both leading technical capability and a knack for innovation.

 

Sogou keeps leading the pack: its deep ties to avatar technology

It is clear that ChoreoNet both improves human-computer interaction capabilities and brings knowledge elements into machine learning. This can be seen as an advance of Sogou's "avatar" technology, and it indirectly confirms that Sogou's AI strategy centered on "natural interaction + knowledge computing" keeps gathering pace and has built up the potential to keep steering the direction of the technology.

 

Since unveiling its first avatar technology in 2018, Sogou has never paused its R&D, focusing on driving digital humans' facial expressions and lip movements ever better from text and audio. In the 2D/3D digital human field, it has successively built capabilities for audio-video-synchronized generation and for realistic facial expressions and lip movements.

 

 

Making digital humans more natural and expressive is also a key research direction for Sogou's avatar work, and the expression of body movement and posture is crucial to it. Having reached a high standard in facial driving, Sogou is shifting its research focus from facial driving to facial-plus-motion driving, concentrating on making body movements more natural and expressive. For example, the 3D AI-synthesized anchor launched in May this year not only has facial expressions that withstand high-definition close-ups but also walks freely, driven by text semantics.

 

Now ChoreoNet goes a step further, driving AI digital humans in real time from audio. This industry-first attempt and its breakthrough results change the status quo in which AI avatars' faces and movements could only be driven by text and semantics, bringing more room for innovation to the industry and clearly demonstrating Sogou's ambitions and strengths in avatar technology.

 

So what exactly is Sogou after in continuously building visual, naturally interactive AI digital humans?

 

The future of human-computer interaction and Sogou's technological vision

Returning to the level of corporate strategy, Sogou's AI philosophy is to empower people with AI: through human-machine collaboration, people are freed from repetitive work and social productivity is unlocked. AI anchors, for instance, can free human hosts from reading scripted content so they can take on more creative work. All of this starts with more natural human-machine interaction, one exchange at a time.

 

This time, ChoreoNet lets digital humans dance to music, a creative breakthrough that is not only technically impressive but also rich in application potential.

 

Not surprisingly, Sogou is likely to combine this technology with 3D digital humans: compared with 2D digital humans, 3D digital humans have more flexible, malleable bodies and therefore a wider range of applications. Audio-driven technology can enrich the scenarios of Sogou's 3D digital humans in news broadcasting and field interviews, and can also help them break out of converged media into entertainment, film, and television.

Vision-based human-computer interaction thus looks set to become increasingly mainstream. Today's popular intelligent customer service agents and virtual idols, for example, often require large amounts of text and semantic input for reasoning and interaction, and a virtual idol's movements must be captured and produced manually frame by frame. Switching to audio-driven technology enables more direct voice communication, saving production and computation steps and costs.

 

 

In addition, combining human knowledge systems with machine learning greatly strengthens AI capabilities. Trained on knowledge data from vertical domains, such systems can deliver more accurate and reliable services, greatly improving the acceptance of AI customer service.

 

Audio driving can also power more personable digital assistants that reduce workload and improve efficiency. By recognizing and responding to audio in real time, and with richer expressiveness, it would let smart home devices, service robots, and the like blend better into daily life and play a more active role in scenarios such as elder care, personal assistance, and child companionship.

 

It is a consensus in the industry that only research with strong potential to affect daily life and to push the technology forward tends to be accepted by ACM Multimedia. Seen in that light, the work by Sogou and the Tsinghua Tiangong Institute is far more than an academic breakthrough: while global technology giants explore how multimodal interaction can create new features and experiences, Sogou has taken an eye-catching step ahead.

 

Making digital humans more human-like will let them cooperate and collaborate closely with humans sooner, which matters for humans and AI alike; that is why the world's top venues recognize and encourage such work. What capabilities will Sogou assemble for digital humans next? We will have to wait and see.
