Abstract: We are in the wave of the fourth industrial revolution, led by intelligent manufacturing. With the rise of emerging technologies such as artificial intelligence, information technology, and biotechnology, manufacturing has entered a period of comprehensive intelligent transformation, and how to make machines and equipment more intelligent so that they serve people more comfortably and conveniently has become a question under constant exploration. Spoken language is the basis of communication between people, so intelligent voice has naturally become an important carrier for interaction between people and machines. In the smart TV field, as the technology keeps upgrading and application scenarios keep expanding, intelligent voice has become one of the core capabilities of smart TVs and occupies an increasingly important position in human-computer interaction. To improve overall voice performance and business capability, major manufacturers are no longer satisfied with the turnkey services offered by third-party voice technology vendors; they have begun to increase R&D investment across the entire intelligent-voice link, gaining more room to optimize and select voice functions. Within that link, cloud-side control and decision-making form a crucial stage, so building a private central control platform has become the preferred solution for major manufacturers.
1 Introduction
With the iterative upgrading of smart TV technology and the continuous expansion of application scenarios, intelligent voice has become one of the core capabilities of smart TVs. Because intelligent voice makes interaction between people and the TV more convenient, the strength of a TV's voice capability has become an important criterion for judging how intelligent it is. As voice technology advances and the market develops, simple control instructions issued by voice can no longer satisfy users: people hope to accomplish more through voice, yet adding or changing services by upgrading terminal-device software is relatively difficult. Against this background, major manufacturers have established private voice central control platforms, hoping to continuously optimize voice skills and flexibly configure voice services through platforms of their own. The following sections walk through the full-link processing of intelligent voice, briefly introduce how to build a voice central control platform and the basic functions of each module, and finally discuss the basic architecture of the voice control software.
2 Full-link Voice Analysis
The full link of intelligent voice comprises two parts: end-side capability and cloud-side capability. The end side refers to the smart TV terminal, which is mainly responsible for collecting sound and processing the sound signal. It sends audio signals and text information to the cloud through cloud protocols for processing, then executes the instructions returned by the cloud or broadcasts the generated results. The cloud side comprises several modules, including speech recognition, semantic understanding, dialogue management, resource calling, reply generation, and speech synthesis. Together they convert an utterance into text, understand its intention, complete the corresponding instructions, and return the corresponding results. How intelligent the speech analysis is, and whether it can accurately understand the user's intention, depends on the cloud's capabilities. The full-link structure of voice is shown in Figure 1. After the voice signal is captured by the sound collection module, it is amplitude-limited and denoised by the signal processing module and then sent to the voice wake-up module for wake-word matching. Once the match succeeds, the voice is sent to the speech recognition module, which converts the sound signal into text. The semantic understanding module then parses the keywords, and the dialogue management module infers the user's intention from the context, calls external resources through an application programming interface (API), and generates the reply content. When the result returns to the terminal, the terminal executes the relevant instructions and broadcasts the voice reply synthesized by the speech synthesis module. At this point one complete voice processing chain is finished, and the process repeats for each new voice input.
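To make the chain above concrete, here is a minimal runnable sketch of the end-to-end flow. Every module is a trivial placeholder standing in for a real service; all names and behaviors are illustrative assumptions, not any vendor's API.

```python
def denoise_and_limit(audio: str) -> str:
    """Signal processing on the end side (placeholder)."""
    return audio.strip()

def matches_wake_word(audio: str) -> bool:
    """Voice wake-up module: match the wake word (placeholder)."""
    return audio.startswith("hi tv")

def speech_to_text(audio: str) -> str:
    """Speech recognition module: sound signal -> text (placeholder)."""
    return audio.removeprefix("hi tv").strip()

def parse_semantics(text: str) -> dict:
    """Semantic understanding: text -> intent and slots (placeholder)."""
    return {"intent": "play_media", "slots": {"query": text}}

def decide_and_fulfill(semantics: dict) -> str:
    """Dialogue management plus resource call via API (placeholder)."""
    return f"OK: {semantics['slots']['query']}."

def text_to_speech(reply: str) -> bytes:
    """Speech synthesis: reply text -> audio for broadcast (placeholder)."""
    return reply.encode("utf-8")

def process_utterance(raw_audio: str) -> bytes:
    """One pass through the full voice link of Figure 1."""
    signal = denoise_and_limit(raw_audio)
    if not matches_wake_word(signal):
        return b""                      # stay idle until the wake word is heard
    text = speech_to_text(signal)
    reply = decide_and_fulfill(parse_semantics(text))
    return text_to_speech(reply)

print(process_utterance("hi tv play the morning news"))
# b'OK: play the morning news.'
```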
2.1 Speech Recognition
Automatic speech recognition (ASR) is the service that converts voice signals into text. Based on the actual application scenario, the recognizer selects appropriate acoustic and language models, extracts features from the received voice signal, performs multi-path decoding, scores the hypotheses against the models, and compares their weights to output the text with the highest confidence. By analyzing the sound signal, the user's voiceprint, emotional state, age group, and other information can also be obtained; with this data, user groups can be segmented and refined, providing personalized services while also improving operational quality.
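The "compare weights, keep the high-confidence output" step can be pictured as follows. This is only a hedged sketch: the candidate hypotheses and scores are mocked; in a real recognizer they come from the acoustic and language models during decoding.

```python
# Mocked outputs of several decoding paths; a real ASR engine
# produces these hypotheses and confidence scores itself.
candidates = [
    {"text": "play the news", "confidence": 0.91},
    {"text": "play the niche", "confidence": 0.34},
    {"text": "pay the news",  "confidence": 0.22},
]

# Keep the hypothesis with the highest confidence as the text output.
best = max(candidates, key=lambda c: c["confidence"])
print(best["text"])  # -> play the news
```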
2.2 Semantic Understanding
In voice interaction, simply converting sound into text is far from enough; the system must understand what the user means. Semantic understanding is therefore a very important part of the voice interaction link. For a given application scenario, we first define the scenario's semantic space and the user intentions to recognize, then collect data for intent recognition and parameter extraction, feed the input text into a model, and output the key information it contains. This step converts human language into a machine-understandable, structured, and complete semantic representation.
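A toy sketch of that structured output is shown below: text goes in, an intent plus extracted parameters (slots) comes out. The rule table is an assumption for illustration; production systems use trained models rather than handwritten patterns.

```python
import re

# Hypothetical rules: (pattern, intent name, slot name)
RULES = [
    (re.compile(r"turn the volume (up|down)"), "adjust_volume", "direction"),
    (re.compile(r"play (.+)"), "play_media", "title"),
]

def understand(text: str) -> dict:
    """Map recognized text to a structured semantic representation."""
    for pattern, intent, slot in RULES:
        m = pattern.search(text)
        if m:
            return {"intent": intent, "slots": {slot: m.group(1)}}
    return {"intent": "unknown", "slots": {}}

print(understand("play the latest episode of news"))
# {'intent': 'play_media', 'slots': {'title': 'the latest episode of news'}}
```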
2.3 Dialogue Management
Dialogue management controls the flow of human-computer dialogue. Based on the dialogue history and the current user input, it determines the system's response, which is also the foundation of multi-turn dialogue. When completing a complex task, if the user's input is not specific or clear enough, the system uses dialogue management to inquire, clarify, or confirm the user's needs, pinning down the user's true intention and completing the request. Dialogue management covers dialogue state tracking, response decision-making, semantic slot filling, context management, reference resolution, and other functions.
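The inquire-and-confirm loop can be sketched as below: the manager tracks which required slots are still missing across turns and asks for them until the request is complete. The intent, slot names, and prompts are illustrative assumptions.

```python
# Hypothetical task definition: which slots an intent must fill.
REQUIRED_SLOTS = {"book_ticket": ["movie", "time"]}

def next_action(intent: str, state: dict) -> str:
    """Return a clarifying question, or confirm once all slots are filled."""
    for slot in REQUIRED_SLOTS.get(intent, []):
        if slot not in state:
            return f"Which {slot} would you like?"   # ask to fill the slot
    return f"Confirmed: {state}"                     # all slots filled

state = {"movie": "Frozen"}                # accumulated from earlier turns
print(next_action("book_ticket", state))   # asks for the missing time
state["time"] = "8 pm"                     # user supplies it next turn
print(next_action("book_ticket", state))   # confirms the completed request
```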
2.4 Reply Generation
Based on the context and the user's actual usage scenario, the system produces feedback text or actions for the result of the user's input. Reply generation covers local commands, control definitions, dialogue responses, default broadcasts, error broadcasts, dialogue control, and other functions.
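One common way to realize dialogue responses with default and error broadcasts is a template table, sketched below under assumed template and fallback strings.

```python
# Hypothetical reply templates keyed by intent.
TEMPLATES = {
    "play_media": "Now playing {title}.",
    "adjust_volume": "Volume turned {direction}.",
}
DEFAULT_REPLY = "Sorry, I didn't catch that."        # default broadcast
ERROR_REPLY = "Something went wrong, please retry."  # error broadcast

def generate_reply(intent: str, slots: dict) -> str:
    """Fill the intent's template, falling back to default/error replies."""
    template = TEMPLATES.get(intent)
    if template is None:
        return DEFAULT_REPLY
    try:
        return template.format(**slots)
    except KeyError:            # a required slot value is missing
        return ERROR_REPLY

print(generate_reply("play_media", {"title": "Frozen"}))  # Now playing Frozen.
```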
2.5 Speech Synthesis
Speech synthesis is the process of converting text into standard speech output, which is equivalent to equipping the device with a "mouth". This module determines whether the synthesized speech is fluent and the tone pleasant. With suitable training data and model training, the voice of a specific person can even be synthesized, making interaction between people and devices more natural.
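For a quick local feel of text-to-speech, the open-source pyttsx3 package can be used as below; this offline engine is only a stand-in for the cloud TTS module described above, and the rate value is an arbitrary example.

```python
import pyttsx3

engine = pyttsx3.init()                 # use the platform's default voice
engine.setProperty("rate", 160)         # speaking speed, words per minute
engine.say("Hello, I am your TV assistant.")
engine.runAndWait()                     # block until playback finishes
```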
3 Construction of the Voice Control Platform
3.1 Construction of the Enterprise Central Control Platform
Before discussing the voice central control platform, we should first look at the basic architecture of the enterprise central control platform, because voice control is one part of that enterprise-level cloud platform. An enterprise cloud control platform is generally compatible with multiple business needs: besides voice, most also have to serve other intelligent businesses such as image recognition and AIoT (Artificial Intelligence of Things), and the platform can be flexibly customized according to business needs. Figure 2 shows the basic architecture of a cloud central control platform and its relationship with external modules. The enterprise control platform includes control modules such as the authentication gateway, control engine, and decision engine, plus unit modules that each serve a specific business. For example, the automatic speech recognition and semantic processing platform, the image recognition platform, and the AIoT platform in Figure 2 serve the voice, image recognition, and AIoT businesses, respectively. Overall control through the enterprise's own cloud platform not only makes it easy to configure each business unit flexibly, but also promotes the integration and reuse of technologies and improves terminal product performance and user experience.
Figure 2 Relationship between internal and external modules of the enterprise central control platform
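The gateway-then-dispatch flow implied by Figure 2 can be sketched as follows. The token check and the business-unit handlers are toy assumptions standing in for the real authentication gateway, control engine, and per-business platforms.

```python
# Hypothetical business units behind the control engine.
BUSINESS_UNITS = {
    "voice": lambda req: f"ASR/semantic platform handles {req!r}",
    "image": lambda req: f"image recognition platform handles {req!r}",
    "aiot":  lambda req: f"AIoT platform handles {req!r}",
}

def authenticate(token: str) -> bool:
    """Authentication gateway (toy check)."""
    return token == "valid-device-token"

def dispatch(token: str, business: str, request: str) -> str:
    """Control engine: admit the request, then route it to its unit."""
    if not authenticate(token):
        return "rejected by gateway"
    handler = BUSINESS_UNITS.get(business)
    return handler(request) if handler else "unknown business unit"

print(dispatch("valid-device-token", "voice", "play music"))
```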
3.2 Voice Central Control Platform Architecture
In the whole voice-processing flow, the cloud's capability is critical: it determines how intelligent the processing results are, which is why the cloud is also called the brain of intelligent voice. Building an enterprise's own voice central control platform therefore mainly means building the cloud-side voice processing and control platform. By function, the voice central control platform can be divided into four major modules: speech recognition, semantic understanding, intention decision, and skill distribution/decision. The relationship between the modules is shown in Figure 3. After the far-field pickup module captures the voice, the signal processing module processes the voice signal and passes it to the speech recognition module, which converts the sound signal into text. Far-field and near-field processing differ slightly here: near-field voice is output directly to the speech recognition module after pickup. The converted text is parsed by the semantic understanding module, processed by the intention decision and skill distribution modules, and the result is returned to the terminal device to present specific information or perform the related action.
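The skill-distribution step can be pictured as routing a parsed intent to whichever registered skill declares it, as in the sketch below; the skill registry and intent names are illustrative assumptions, not a fixed design.

```python
# Hypothetical skill registry: which intents each skill can serve.
SKILLS = {
    "media_player": {"play_media", "pause_media"},
    "tv_settings":  {"adjust_volume", "switch_input"},
}

def distribute(intent: str) -> str:
    """Route an intent to the skill that declares it."""
    for skill, intents in SKILLS.items():
        if intent in intents:
            return skill
    return "fallback_skill"           # default when no skill claims it

print(distribute("adjust_volume"))    # -> tv_settings
```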