A magical feat of "filling in the blanks"! One sentence of speech is enough to reveal your speaking gestures | UC Berkeley
Qian Ming, from Aofei Temple
Reported by QbitAI | WeChat official account QbitAI
Hard to guard against! Now, AI can understand your speaking gestures just by listening to your voice.
This imagination-stretching new research comes from UC Berkeley and other institutions, and was accepted to this year's top academic conference, CVPR 2019.
In their research, a short clip of speech is all it takes to predict the speaker's gestures, and the result looks almost completely natural.
Don't believe it? Just look at John Oliver, host of the well-known American talk show Last Week Tonight. His gestures have been thoroughly studied by the AI: when he speaks, the angle of his shoulders and the way his fingers move can be predicted clearly.
And it's not just seated talk show hosts; their research covers a variety of other scenarios, too:
A talk show host who stands while speaking and uses bold gestures:
Or a teacher lecturing with gestures like these:
After seeing this research, some netizens wondered whether it could predict Trump's magical gestures.
Others said it is a good thing this is only research. What if it were applied in real life?
Next time you are on the phone telling someone you love them while secretly doing something naughty, you would be exposed.
How is it achieved?
Gestures are spontaneous behaviors that people make while speaking. They are used to supplement voice information and help better convey the speaker's ideas.
Gestures are usually related to the accompanying speech, but obtaining gesture information from speech requires learning the mapping between audio and gestures, and in practice this raises several problems:
- First, gestures and speech are asynchronous: a gesture can appear before, after, or during the corresponding speech.
- Second, this is a multimodal task: speakers may use different gestures when saying the same thing on different occasions.
- Moreover, each person's gestures are highly individual, and different speakers tend to adopt different speaking gestures.
To address these problems, the researchers proposed a method for temporal cross-modal translation, which converts speech into gestures in an end-to-end manner and uses a very large temporal context for prediction to overcome the asynchrony problem.
They built a large dataset of 144 hours of video covering 10 speakers. To show the model's range of applicability, the speakers have different backgrounds, including television hosts, university lecturers, and televangelists.
The topics they discuss also span a wide range, from the philosophy of death and chemistry to the history of rock music, current affairs commentary, and readings of the Bible and the Quran.
Now, this dataset has been made available to the public.
How do we predict gestures from speech? See the following figure:
Given speech audio, the translation model (G) predicts the speaker's gestures (hand and arm movements) that match the speech.
An L1 regression is used to extract a training signal from the data, and an adversarial discriminator ensures that the predicted motion is temporally plausible and consistent with the speaker's style.
An existing video synthesis method is then used to render what the speaker would look like making those gestures.
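In equation form, the training setup described above amounts to an objective of roughly this shape (a plausible formulation inferred from the description; G is the translation model, D the discriminator, and lambda an assumed weighting hyperparameter):

\min_G \max_D \; \mathcal{L}_{\mathrm{GAN}}(G, D) \;+\; \lambda \, \mathcal{L}_{L1}(G)

Here L_{L1} is the regression loss on the predicted pose keypoints, and L_{GAN} is the adversarial term supplied by the motion discriminator.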
The entire convolutional network consists of an audio encoder and a 1D UNet translation architecture. The audio encoder takes a 2D log-Mel spectrogram as input and downsamples it through a series of convolutions to produce a 1D signal with the same sampling rate as the video (15 Hz).
The UNet translation architecture then learns to map this signal to a temporal stack of gesture vectors via an L1 regression loss.
The UNet architecture is used for translation because its bottleneck provides the network with past and future temporal context, allowing high-frequency temporal information to flow through, enabling the prediction of rapid gesture movements.
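To make this architecture concrete, here is a minimal PyTorch sketch of an audio encoder plus 1D U-Net translator. The channel sizes, the keypoint count of 49, and the downsampling schedule are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Downsamples a 2D log-Mel spectrogram into a 1D feature sequence over time."""
    def __init__(self, n_mels=64, hidden=256):
        super().__init__()
        self.conv2d = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Collapse the frequency axis into channels, keeping only the time axis.
        # In the real model the total temporal downsampling is chosen so the output
        # rate matches the video frame rate (15 Hz); here it is only schematic.
        self.to_1d = nn.Conv1d(128 * (n_mels // 4), hidden, kernel_size=1)

    def forward(self, spec):            # spec: (B, 1, n_mels, T_audio)
        h = self.conv2d(spec)           # (B, 128, n_mels/4, T_audio/4)
        b, c, f, t = h.shape
        h = h.reshape(b, c * f, t)      # flatten frequency into the channel dim
        return self.to_1d(h)            # (B, hidden, T): a 1D signal over time

class UNet1D(nn.Module):
    """1D U-Net that maps audio features to a temporal stack of pose vectors."""
    def __init__(self, hidden=256, n_keypoints=49):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv1d(hidden, hidden, 4, 2, 1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv1d(hidden, hidden, 4, 2, 1), nn.ReLU())
        self.up1 = nn.Sequential(nn.ConvTranspose1d(hidden, hidden, 4, 2, 1), nn.ReLU())
        self.up2 = nn.Sequential(nn.ConvTranspose1d(hidden * 2, hidden, 4, 2, 1), nn.ReLU())
        self.head = nn.Conv1d(hidden, n_keypoints * 2, kernel_size=1)  # (x, y) per keypoint

    def forward(self, x):               # x: (B, hidden, T)
        d1 = self.down1(x)              # T/2
        d2 = self.down2(d1)             # T/4: the bottleneck sees a wide temporal context
        u1 = self.up1(d2)               # back to T/2
        u2 = self.up2(torch.cat([u1, d1], dim=1))  # skip connection keeps high-frequency detail
        return self.head(u2)            # (B, 2 * n_keypoints, T): predicted pose stack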
While L1 regression provides the basic training signal from the data, it suffers from the well-known regression-to-the-mean problem, which produces overly smoothed motion. To address this, an adversarial discriminator conditioned on the temporal differences of the predicted pose sequence is added, as sketched below.
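Here is a sketch, under the same caveats, of how the L1 regression and this motion-difference discriminator could be combined. The loss weighting, the discriminator architecture, and the GAN formulation (a standard binary cross-entropy GAN here) are assumptions rather than the paper's exact recipe.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionDiscriminator(nn.Module):
    """Scores real vs. predicted motion, looking only at frame-to-frame pose differences."""
    def __init__(self, pose_dim=98, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(pose_dim, hidden, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, hidden, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, 1, kernel_size=1),
        )

    def forward(self, poses):                        # poses: (B, pose_dim, T)
        motion = poses[:, :, 1:] - poses[:, :, :-1]  # temporal differences, not raw poses
        return self.net(motion)                      # patch-wise real/fake logits

def generator_loss(pred, real, disc, lambda_reg=1.0):
    """L1 regression to the real poses plus an adversarial term for the translator G."""
    reg = F.l1_loss(pred, real)
    fake_logits = disc(pred)
    adv = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
    return adv + lambda_reg * reg                    # lambda_reg balances accuracy vs. lively motion

def discriminator_loss(pred, real, disc):
    """Standard real/fake loss, computed on motion rather than static poses."""
    real_logits, fake_logits = disc(real), disc(pred.detach())
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

Because the discriminator never sees absolute poses, a generator that collapses to the average (nearly motionless) pose is easily caught, which is exactly the failure mode the L1 loss alone cannot prevent.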
Research team
Most of the authors of this study are from UC Berkeley.
Shiry Ginosar is a PhD student in the Department of Computer Science at UC Berkeley. She was previously a researcher in the field of human-computer interaction and a visiting scholar at the Department of Computer Science at CMU.
Amir Bar is a machine learning engineer living in Berkeley. Currently, he works at Zebra Medical Vision to improve efficiency in the healthcare field.
They say in their paper that this research is a step toward the computational analysis of conversational gestures, which could then be used, for example, to drive the behavior of virtual agents.
Finally, here are the links:
Paper address:
http://people.eecs.berkeley.edu/~shiry/speech2gesture/
Source code will be available soon:
https://github.com/amirbar/speech2gesture
-over-
QbitAI · contracted author on Toutiao
Tracking new trends in AI technology and products