The winner of an international AI competition turns out to be an education company
Yang Jing and Yuyang, reporting from Aofei Temple
Quantum Bit Report | Public Account QbitAI
Recently, I heard many friends in the voice circle talking about a competition.
Something about very few samples, something about an outsider breaking in, something about education...
Hey, what on earth does "education" have to do with it?
After asking around, I found out that they meant M2VoC, the international TTS (speech synthesis) competition that had just ended.
M2VoC, the Multi-Speaker Multi-Style Voice Cloning Challenge, gives you very few voice samples (as few as 5) and asks you to synthesize speech in the same style.
Isn't this just a regular competition for technical players?
My friend said the surprise was that there was an "outlier" among this year's sub-track champions:
Yuanfudao, which provides live online courses.
Can online education produce an AI champion?
It is said that the winning team had other urgent tasks at the time, so it casually sent two people, who squeezed out five days to take part.
That sounds suspiciously like a humblebrag, and there is evidence!
But wait a minute: isn't Yuanfudao definitely an online education company?!
A win on its M2VoC debut
Let’s first take a look at the competition itself.
The competition that Yuanfudao "casually" entered is one of the tasks in the Signal Processing Challenge of the International Conference on Acoustics, Speech and Signal Processing (ICASSP): the Multi-Speaker Multi-Style Voice Cloning Challenge (M2VoC).
ICASSP, an annual conference organized by the IEEE Signal Processing Society, is one of the most authoritative conferences in signal processing and its applications.
It is said that this is also the world's first low-resource voice cloning challenge.
The competition is divided into two tracks: a few-sample track, and a very-few-sample track with even fewer samples than the "few-sample" one.
In the very-few-sample track, contestants must adapt to and be tested on different speaking styles using only 5 available voice samples.
Each track is divided into an open set and a closed set. Open set means any public data may be used; closed set means only the official data may be used.
In the end, more than 150 teams participated in the competition, and Yuanfudao won first place in the very-few-sample open set track.
In addition, in the few-sample open set and very few-sample closed set tracks, it ranked 4th and 5th respectively.
In fact, what Yuanfudao presented in this competition was not laboratory technology.
Instead, it is a technology that has already been used in products such as Xiaoyuan Oral Arithmetic and Yuanfudao Online Classes, and is used in scenarios such as English pronunciation and reading questions.
For example, when a math problem is presented, some young children cannot yet recognize all the words, so the problem needs to be read aloud for them to understand it. In addition, teachers can write a question and synthesize audio from its text.
Especially in the application of English listening, the pronunciation requirements are more stringent.
Even so, frontline teachers have reported that it is easier to use than publicly available services.
In the past, teachers would write the questions and then find qualified British or American teachers to record the audio. It would usually take a week for the outsourcing company to return the voice package.
Any modification would take at least 2 weeks, and turnaround was even less predictable during holidays.
Now, through speech synthesis, a 10-second sentence can be converted into speech in less than 1 second, which greatly improves efficiency.
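The speed figure above can be expressed as a real-time factor (RTF), a common way to describe synthesis speed: the time spent synthesizing divided by the duration of the audio produced. The following is a minimal sketch of that arithmetic using only the article's numbers (10 seconds of audio in under 1 second). The function name is hypothetical and is not taken from Yuanfudao's system.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Figures from the article: a 10-second sentence synthesized in under 1 second.
# 1.0 s is used as an upper bound, so the actual RTF would be below this value.
rtf_upper_bound = real_time_factor(synthesis_seconds=1.0, audio_seconds=10.0)
min_speedup = 1.0 / rtf_upper_bound

print(f"RTF < {rtf_upper_bound:.2f}, i.e. more than {min_speedup:.0f}x faster than real time")
```

An RTF below 1 means audio is produced faster than it plays back, which is what makes on-demand synthesis practical compared with a recording turnaround measured in weeks.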
Seen this way, it is not such an exaggeration for two people to prepare for the competition in five days.
What they didn’t expect was that they won first place in the sub-track in their first international competition.
They expressed some surprise at the result.
"We were a little surprised to get the first place in the (very few samples open set) sub-track. There are many experts in the technology field, and we will continue to work hard!"
Their basic approach follows the usual training process: pre-training on large-scale data, followed by fine-tuning on a small number of samples.
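To illustrate why pre-training followed by fine-tuning helps when only a handful of samples are available, here is a toy sketch. It is not a speech model and not the team's method: it assumes a simple linear model, synthetic data, and a "target speaker" whose parameters lie close to those learned from the large dataset. All names and numbers are hypothetical.

```python
import numpy as np


def train(w, X, y, steps, lr):
    """Gradient descent on mean squared error for a linear model y ~ X @ w."""
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w


def mse(w, X, y):
    """Mean squared error of the linear model on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))


rng = np.random.default_rng(0)
dim = 20

# Assumption: the target "speaker" is a small perturbation of the base one.
w_base = rng.normal(size=dim)
w_target = w_base + 0.1 * rng.normal(size=dim)

# Large "multi-speaker" set, tiny target set (5 samples), held-out target test set.
X_big = rng.normal(size=(2000, dim))
y_big = X_big @ w_base
X_few = rng.normal(size=(5, dim))
y_few = X_few @ w_target
X_test = rng.normal(size=(500, dim))
y_test = X_test @ w_target

# Pre-train on the large set, then fine-tune on the 5 target samples.
w_pre = train(np.zeros(dim), X_big, y_big, steps=200, lr=0.05)
w_finetuned = train(w_pre, X_few, y_few, steps=20, lr=0.01)

# Baseline: train on the same 5 samples from scratch.
w_scratch = train(np.zeros(dim), X_few, y_few, steps=20, lr=0.01)

err_finetuned = mse(w_finetuned, X_test, y_test)
err_scratch = mse(w_scratch, X_test, y_test)
print(f"fine-tuned: {err_finetuned:.3f}  from scratch: {err_scratch:.3f}")
```

With only 5 samples in a 20-dimensional space, training from scratch cannot recover most of the target parameters, while the fine-tuned model starts close to them. That gap is the intuition behind the two-stage process.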
As for the reason for winning the award, the team's internal analysis showed that in addition to better selection of training data, the pause and rhythm models they used in the front end of speech synthesis made the synthesized speech more natural.
Generally speaking, most common speech synthesis technologies focus on getting the words right, while issues such as accurate pronunciation, prosody and emotion, and appropriate pauses usually receive little attention.
The result is just an emotionless reading machine.
But in the field of education, these easily overlooked pain points have become the focus of the technical team.
The system must give the correct pronunciation when it encounters cases such as polyphonic characters, and in teaching scenarios for young children, the reading should sound natural and rhythmic rather than stiff.
Children's studies can't be held back just because the voice has no feelings!
It is precisely because of this that Yuanfudao received recognition from the organizers and the judging committee.
What is it like to work in technology at Yuanfudao?
So, as an online education company, why did Yuanfudao appear in the international arena of speech synthesis technology?
In fact, it came about by chance.
At the time, Yang Mingqi, a member of the competition team, forwarded the competition notice he had just seen to the speech group. During the R&D colleagues' daily chat, it occurred to them that Yuanfudao had already built up technology in this area, so why not use the competition to exchange ideas with other teams and see what different approaches to the same task they could learn from?
This practice of keeping an eye on cutting-edge technologies and actively learning is not a sudden idea, but a normal daily routine for the entire technical department.
This can also be seen from a long-standing habit: paper reading.
Paper reading is an activity that Yuanfudao AI Lab has carried out since its establishment in 2014.
For seven years, the team has arranged for one engineer each week to present a cutting-edge technical paper at the group meeting and discuss it in depth with colleagues.
At first, the entire lab attended these reading sessions together. Later, as the team kept growing, the sessions were held separately in each of the five laboratories. Members of the presenting laboratory are of course required to attend, and people from other laboratories can join according to their interests.
This collision of ideas across different technologies has also become Yuanfudao's distinctive technical methodology.
Yang Mingqi from the Speech Synthesis Group of the Speech Laboratory shared an experience.
When the noise reduction team shared the latest techniques for improving the signal-to-noise ratio, the speech synthesis team wondered whether they could be applied to TTS. Because the training samples collected day to day are recorded in different environments, their sound quality cannot be guaranteed. Introducing such techniques can improve the quality of the final synthesized speech at the data level.
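For reference, the signal-to-noise ratio mentioned here is usually reported in decibels as 10 · log10(signal power / noise power). The sketch below computes it for a synthetic tone with added interference. It assumes the clean signal is known, which holds for a toy example but generally not for real recordings, and the function name is hypothetical.

```python
import numpy as np


def snr_db(clean: np.ndarray, noisy: np.ndarray) -> float:
    """SNR in decibels: 10 * log10(signal power / noise power)."""
    noise = noisy - clean
    signal_power = float(np.mean(clean ** 2))
    noise_power = float(np.mean(noise ** 2))
    if noise_power == 0.0:
        return float("inf")
    return float(10.0 * np.log10(signal_power / noise_power))


# One second of a 440 Hz tone at 16 kHz, plus a weaker 1 kHz interfering tone.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.1 * np.cos(2 * np.pi * 1000 * t)

# The interference has 1/10 the amplitude, so 1/100 the power of the signal.
example_snr = snr_db(clean, noisy)
print(f"SNR: {example_snr:.1f} dB")
```

A measure like this could be used to flag low-quality recordings before training, though how the team actually applied noise reduction to its data is not described in the article.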
In addition, as an online education company, Yuanfudao has richer and more specific application scenarios, so it pays more attention to putting technology into practice than general technology companies do.
Low-latency live streaming is one such example.
Common live-streaming technology on the market may have a delay of 1-3 seconds or even longer, which matters little for live-stream e-commerce. In education, however, students and teachers need to interact closely, and a delay of a few seconds harms the teaching experience.
For example, when the teacher asks a question in class, the teacher may already have given the answer and moved on to the next question while the students are still thinking about it.
Therefore, the delay needs to be reduced from 3 seconds to 0.3 seconds to preserve the classroom experience.
Ma Nan of the speech group also said:
"Many times, when front-line teachers put forward demands, R&D personnel have to find ways to combine the most advanced technologies to meet them.
Therefore, when I see new papers and technical solutions, I also think about which specific scenarios they could be applied to and whether they could help front-line teachers in their teaching."
Because they value real-world deployment, they always stay ready.
Whenever new demands arise, they find ways to use technology to meet them, and as a result team members have developed the ability to adapt quickly to changes in the business.
Ma Nan said that most of the people in their speech synthesis group were not originally professionals in this field. For example, some members used to work in search engines. It can be said that they all learned gradually through exploration.
Now it only takes them half a month to one month to go from an idea to a demo.
Therefore, unlike other technical teams, their sense of accomplishment comes more from the feedback of front-line teachers.
"Easier to use than publicly available services" is the best reward they have received.
The technological power behind education
In fact, although the outside world's perception of Yuanfudao is more focused on "education", Yuanfudao has been a company that regards "technology" as its core competitiveness since its inception.
In 2014, Yuanfudao established its AI Lab, becoming the first company in the online education industry to establish an AI research institute.
From a business perspective, Yuanfudao has: Xiaoyuan Search, which can give you a solution to a problem in seconds by just taking a picture; Xiaoyuan Oral Arithmetic, which uses AI to help teachers and parents correct homework; and Zebra AI Class, which uses AI to create an intelligent learning model...
The technical support behind these products is not limited to speech, but also covers vision, natural language understanding, audio and video, and other fields.
Let’s use the Zebra AI class as an example.
As the largest online course learning platform for preschool children in China, the most attractive thing about Zebra AI Class is that it can "teach students in accordance with their aptitude."
In other words, it does not simply divide learning stages by the children's age, but lets children learn adaptively.
Wait, adaptation: isn't that how AI models are trained?
In fact, that is exactly what is going on. Based on Yuanfudao's 10-billion-scale big data on children's language behavior, the system analyzes a child's language behavior to understand the child's current learning situation in more detail, and then intelligently adjusts the course difficulty and the game-like "monster-slaying" progression path. After the online class, it can also generate a personalized learning report based on AI big data analysis and give timely feedback on learning results.
In addition, as mentioned before, general models often perform poorly when applied to more vertical scenarios. For example, recognizing children's speech is harder than conventional adult speech recognition, so data must be collected and the model optimized specifically for it.
Based on this background, Yuanfudao has now developed five major laboratories: speech laboratory, vision laboratory, natural language understanding laboratory, audio and video laboratory, and basic support laboratory.
Ma Nan, an engineer, also emphasized:
"For our products, the support of cutting-edge technology is not optional; it is the core reason the products are viable.
Taking question search as an example, if the OCR accuracy is not high enough and the search is not accurate enough, users will simply not use it."
At this point in time, from the perspective of user choice, Yuanfudao has become the online education company with the largest number of online course users in China.
From the perspective of technical validation, Yuanfudao has taken first place in two world-class NLP competitions, the MS MARCO machine reading comprehension benchmark and the Stanford Question Answering Dataset (SQuAD). Now it has also made its mark in international competitions in more fields, such as speech...
Capital's recognition of its business and technology has also been reflected in specific numbers: with a valuation of US$15.5 billion, it is the world's highest valued online education unicorn company.
Amid such rapid development, Yuanfudao, which is growing in size, is not only paying attention to implementation, but also increasingly attaching importance to "long-termism."
Ma Nan revealed that within the technical team, a considerable amount of manpower is now invested in the exploration and tackling of cutting-edge technologies. This type of R&D work will not be used immediately in the short term, but the company believes that from the perspective of long-term development, these accumulations will become a key component of Yuanfudao's technological moat.
Finally, let’s summarize, what kind of technology company is Yuanfudao?
Under the banner of online education, it starts from specific scenarios and lets the power of AI show through in its various products.
A technology that is developed very well but never put to use is unlikely to fly at Yuanfudao.
Therefore, it can be said that Yuanfudao’s AI is more pragmatic.
At one time, the landscape of China's education sector was considered settled. However, the third generation of educational technology companies, represented by Yuanfudao, has emerged as a dark horse and opened up a new path forward.
The core keywords behind this are technology and AI.
The potential of education driven by new-generation technologies such as AI may have just begun.
This article is original content from QbitAI, a signed account of the NetEase News • NetEase special content incentive plan. Reproduction without the account's authorization is prohibited.