Microsoft is now a dark horse in the voice evaluation field
Author | Li Jingying
In recent years, as artificial intelligence technology has matured and enterprises have accelerated their digital transformation, AI has gradually made its way into a wide range of scenarios, making production and daily life more intelligent. On the audio side, intelligent voice technology has become a key area that major technology companies are working to master.
Microsoft, a long-established technology giant, has worked on speech synthesis and speech recognition for many years. It has opened its technical capabilities to partners worldwide and provides a variety of intelligent voice solutions.
In mid-May, at the 2020 Microsoft Build Developer Conference, Microsoft launched a voice evaluation feature based on the Speech-to-text capability of Azure Speech Service. Users upload the reference text to be read aloud together with the audio recording, and the feature evaluates the accuracy, fluency, and completeness of the speaker's pronunciation. In education, especially spoken-language learning, its high recognition accuracy and its close agreement with expert scoring can make teaching and practicing speaking more efficient and convenient.
Recently, the intelligent voice team of Microsoft Asia Pacific R&D Group accepted an online interview with Leifeng.com and other media, and gave a detailed introduction to the advantages and application scenarios of Microsoft's intelligent voice evaluation technology.
Ding Binggong, Product Director of the Cloud Computing and Artificial Intelligence Division of Microsoft Asia Pacific R&D Group, and Ma Lisa, Senior Product Manager of the Cloud Computing and Artificial Intelligence Division of Microsoft Asia Pacific R&D Group, participated in the interview.
1
Four Dimensions of Voice Evaluation
Ma Lisa, senior product manager of the Cloud Computing and Artificial Intelligence Division of Microsoft Asia Pacific Research and Development Group, said that the current market demand for voice evaluation mainly has four dimensions:
- Professionalism
- Real-time performance
- Stability
- Customizability
In terms of professionalism, for each language it supports, Microsoft Speech Assessment learns authentic native pronunciation from more than 100,000 hours of native-speaker data and gives precise, multi-dimensional scores for each age group. It evaluates every level, from passage to sentence to word to phoneme, and its scores are highly consistent with those given by expert panels.
Ma Lisa mentioned that professionalism in voice evaluation is judged by how consistent the scores are with evaluations by native speakers. The industry generally uses the Pearson correlation coefficient, which reflects the degree of linear correlation between two sequences. It ranges from -1 to 1: 1 means perfect positive correlation, -1 means perfect negative correlation, and 0 means no linear correlation. The closer the value is to 1, the stronger the agreement. Microsoft's voice evaluation has reached a consistency of 0.75, which is close to the level of native speakers.
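To make the metric concrete, here is a minimal Python sketch of the Pearson correlation coefficient applied to two score sequences. The function name `pearson` and the sample scores are illustrative assumptions; they are not Microsoft's data or code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five recordings (not real evaluation data).
expert_scores = [4.0, 3.5, 2.0, 5.0, 3.0]
machine_scores = [3.8, 3.9, 2.5, 4.6, 2.7]
r = pearson(expert_scores, machine_scores)
print(f"consistency with expert scores: r = {r:.2f}")
```

A value of `r` near 1 would mean the automatic scores rise and fall together with the expert scores, which is the sense in which the 0.75 figure is reported.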
In terms of real-time performance, Microsoft Speech Assessment supports streaming processing of audio uploads, which means processing while reading aloud, and providing immediate feedback of evaluation results after reading.
In terms of stability, Microsoft Speech Evaluation performs fuzzy text matching based on an NLP model and tolerates scenario-specific content in vertical fields well. Omissions, misreadings, and repetitions do not affect the validity or accuracy of the scoring.
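One simple way to picture how a reading can be compared with its reference text while tolerating such errors is word-level sequence alignment. The sketch below uses Python's standard `difflib.SequenceMatcher`; it is an illustrative assumption and not the NLP-based method Microsoft describes. The names `align_reading` and `completeness` and the sample sentences are hypothetical, and the text is assumed to be already free of punctuation.

```python
from difflib import SequenceMatcher


def align_reading(reference, recognized):
    """Label each word as correct, omitted, misread, or inserted.

    `reference` is the text the speaker was asked to read and `recognized`
    is a transcript of what was actually said.
    """
    ref = reference.lower().split()
    hyp = recognized.lower().split()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    labels = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            labels += [(w, "correct") for w in ref[i1:i2]]
        elif tag == "delete":      # in the reference, missing from the reading
            labels += [(w, "omitted") for w in ref[i1:i2]]
        elif tag == "insert":      # extra words, e.g. a repeated word
            labels += [(w, "inserted") for w in hyp[j1:j2]]
        else:                      # "replace": reference words read differently
            labels += [(w, "misread") for w in ref[i1:i2]]
    return labels


def completeness(labels):
    """Fraction of reference words that were read correctly."""
    ref_words = [label for _, label in labels if label != "inserted"]
    return ref_words.count("correct") / len(ref_words) if ref_words else 0.0


labels = align_reading("the quick brown fox jumps",
                       "the the quick fox jumped")
for word, label in labels:
    print(f"{word:8s} {label}")
print("completeness:", completeness(labels))
```

Because the alignment keeps going after an omitted or repeated word, a single slip does not throw off the labels for the rest of the sentence, which is the kind of tolerance the paragraph above describes.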
In terms of customizability, the base ASR model, combined with Microsoft's broad technical capabilities in speech, makes it possible to set personalized scoring standards that adapt to accents, noisy environments, age groups, and so on.
It is understood that in addition to English, Microsoft Speech Assessment can be extended to support language assessment for more than 40 countries and regions around the world. It is suited to education solution partners and app developers, as well as language schools, training centers, educational institutions, and examination centers, for developing language learning, speaking practice, and examination scenarios.
2
The biggest technical difficulty lies in multi-point balance
In the field of education, voice assessment capabilities are mainly used by teachers and students, and are widely used in teacher evaluation, homework practice, and language learning scenarios. So, what are the pain points of educational users for oral learning? What is the biggest technical difficulty of voice assessment?
In this regard, Ma Lisa said that for students, the pain point is learning to speak a non-native language. When learning a new language, giving students timely and accurate feedback on their pronunciation, and letting them practice conveniently anytime and anywhere, is crucial to making spoken-language learning effective.
For schools and educational institutions, the pain point is limited teaching resources: how can existing high-quality teachers be scaled into a stable teaching system? Teachers therefore need a capability that can not only simulate the scoring of native-speaking experts but also learn the teachers' own evaluation methods, so that they can use the evaluation to guide and help students online and offline in an efficient one-to-many manner.
These pain points place higher demands on speech evaluation technology. Ma Lisa believes that the biggest technical difficulty of speech evaluation lies in speech recognition itself. First, recognition needs further optimization across multiple languages, across different scoring scenarios (including noisy environments), and for the pronunciation of students of different ages.
Second, a balance must be struck between inclusiveness and robustness (note: robustness is a technical term meaning that a system keeps working well when its inputs or conditions change). The model must be well built and recognize speech well, yet it must also be constructed dynamically in real time, and large models must be invoked with very low latency. Taken together, these requirements make the problem extremely difficult.
"We have a very solid foundation in voice evaluation, which allows us to do a good job in all aspects and present a comprehensive performance to users. So we are not struggling on just one point, but on a balance of multiple points."
It is reported that TAL, which is actively exploring digital transformation and new education models, is also one of the customers of Microsoft's voice evaluation function. Hu Xiangyu, an AI scientist at TAL, said:
"How to quickly and effectively conduct oral assessment for different students is a major challenge we encounter in online and offline English teaching. Microsoft Intelligent Speech Service provides powerful real-time speech assessment capabilities for TAL and our partners. According to our tests, the pronunciation assessment function of Microsoft Speech Service better adapts to our listening environment and has higher consistency, which is closer to the evaluation results of experts."
3
A capability built on the Azure cloud: algorithms, data, and computing power
Ding Binggong, product director of the cloud computing and artificial intelligence division of Microsoft Asia Pacific R&D Group, said that AI technology is generally judged on three aspects: algorithms, data, and computing power. On each of the three, Microsoft's voice evaluation function has its own advantages.
- At the algorithm level, Microsoft has worked on intelligent voice for many years, and its speech recognition has reached human level, with an error rate of about 5.1%. This algorithmic foundation is what gives Microsoft confidence in speech evaluation, an extended application of speech recognition.
- At the data level, drawing on Microsoft's years of accumulated voice data, the system was trained on nearly 100,000 hours of native-speaker data and learned a relatively authentic native accent.
- In terms of computing power, all of the voice evaluation technology is built on the Microsoft Azure cloud, which has the most data centers and covers the most regions in the world and can support users' large-scale computing needs. Azure also complies with the EU GDPR (General Data Protection Regulation) to protect users' data.
In fact, Microsoft Speech Evaluation is not a standalone product but a capability built on the Azure cloud. With Microsoft Azure as the platform, Microsoft's 30 years of research results in artificial intelligence are exposed as APIs to partners, independent software developers, and system integrators, providing them with capabilities beyond cognitive services so that they can develop solutions suited to their own fields.
"To give an analogy, if Microsoft Cloud is a platform, cognitive services are the part of this platform that provides intelligence to users. Just like a person has eyes, ears, and a brain, cognitive services empower users who want to obtain these capabilities on the Microsoft Azure Cloud and provide them with expansion capabilities," said Ding Binggong.
"It is better to teach someone how to fish than to give him fish. After we provide such capabilities or tools, it will be easier for partners to customize or develop corresponding solutions and products based on a variety of scenarios in vertical fields. They can directly access such capabilities without having to do any AI research from scratch."
It is understood that Microsoft's speech evaluation API currently offers a rich set of interfaces and parameters and supports low-latency, highly concurrent calls. For third parties, calling the API is free during the early evaluation stage. Once they move to the integration and development stage, usage is billed at the standard Speech-To-Text service price according to the length of the evaluated audio.
In addition, Ding Binggong also mentioned that Microsoft Azure has an independent Microsoft Education team that provides different solutions specifically for the education field. In addition to voice assessment, Azure cloud has many applications in the education field.
- First, the "classes suspended, but learning continues" policy during the epidemic has led students to use Microsoft's remote collaboration platform, Microsoft Teams, as a remote learning tool, and it has been widely adopted in education.
- Second, in personalized education, voice assessment, as a service on Azure, provides personalized scoring, giving each user of the function a service tailored to them.
- Third, Azure's speech technology can help synthesize AI teachers and help educational institutions generate courseware, easing the shortage of educational resources.
- In addition, voice technology can help visually impaired and hearing-impaired students learn better.
Ma Lisa believes that the global education industry is undergoing digital transformation and that the outbreak has accelerated its digitization and move online. Further empowering the industry with AI and cloud computing makes it possible to offer students diversified, personalized services and to build a more intelligent ecosystem.
Microsoft has been deeply involved in intelligent voice technology for many years and has launched many well-known voice products, such as Microsoft Cortana, Microsoft XiaoIce, and Skype. Launching the voice evaluation function is a natural next step. In the voice evaluation market, technology giants such as BAT and iFlytek are already competing for position. The entry of Microsoft, a "dark horse", is bound to make that competition more intense.