A brief analysis of the smart TV voice control solution

Publisher: Yuexin888 | Last updated: 2024-07-11 | Source: elecfans | Keywords: Smart TV

Abstract: We are in the midst of a fourth industrial revolution led by intelligent manufacturing. With the rise of emerging technologies such as artificial intelligence, information technology, and biotechnology, manufacturing has entered a period of comprehensive intelligent transformation, and making machines and equipment more intelligent so that they can serve people more comfortably and conveniently has become an enduring topic of exploration. Since spoken language is the foundation of communication between people, intelligent voice has naturally become an important carrier for interaction between people and machines. In the smart TV field, as technology has advanced and application scenarios have expanded, intelligent voice has become one of the core capabilities of smart TVs and occupies an increasingly important position in human-computer interaction. To improve overall voice performance and business capability, major manufacturers are no longer satisfied with turnkey services from third-party voice technology vendors; they have begun investing in R&D across the entire intelligent voice pipeline, giving themselves more room to optimize and select voice functions. Within that pipeline, cloud-side control and decision-making are a crucial link, so building a private central control platform has become the preferred solution for major manufacturers.


1 Introduction

With the iterative upgrading of smart TV technology and the continuous expansion of application scenarios, intelligent voice has become one of the core capabilities of smart TVs. Because voice makes interaction between people and the TV more convenient, the maturity of a TV's voice capability has become an important criterion for judging how intelligent it is. As voice technology and the market have developed, simple voice control commands no longer satisfy users: people want to accomplish more through voice, yet adding or changing services by upgrading terminal-device software is relatively difficult. Against this background, major manufacturers have built private voice central control platforms, hoping to continuously optimize voice skills and flexibly configure voice services through platforms they control themselves. This article first walks through the full-link processing flow of intelligent voice, then briefly introduces how to build a voice central control platform and the basic functions of each module, and finally discusses the basic architecture of the voice central control software.


2 Full-link Voice Analysis

The full intelligent voice link consists of two parts: device-side capability and cloud-side capability. The device side is the smart TV terminal itself. It is mainly responsible for collecting sound and processing the audio signal; it sends audio and text to the cloud over the cloud protocol for processing, then executes the instructions the cloud returns or plays back the generated results. The cloud side comprises modules such as speech recognition, semantic understanding, dialogue management, resource invocation, reply generation, and speech synthesis. Together they convert an utterance into text, understand its intent, carry out the corresponding instructions, and return the results. Whether the voice analysis is intelligent, and whether it accurately understands the user's intention, depends on the strength of these cloud capabilities.

The full-link structure of voice is shown in Figure 1. After the voice signal is captured by the sound collection module, it is amplitude-limited and denoised by the signal processing module, then sent to the voice wake-up module for wake-word matching. On a successful match, the audio is passed to the speech recognition module, which converts the sound signal into text. The semantic understanding module then parses the keywords, the dialogue management module infers the user's intent from the dialogue context, and external resources are called through an application programming interface (API) to generate the reply content. The result is returned to the terminal, which executes the relevant instructions and plays back the voice reply synthesized by the speech synthesis module. At this point one complete voice processing chain is finished, and the process repeats for each new voice input.

Figure 1 Full-link structure of intelligent voice
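The chain described above can be sketched in code. This is a minimal illustrative skeleton, not a real vendor API: the wake word, the function names, and the toy intent rules are all assumptions made for the example.

```python
# Minimal sketch of the full voice link: signal processing -> wake-word match
# -> ASR -> semantic understanding. All names here are illustrative.

from dataclasses import dataclass, field

WAKE_WORD = b"hi-tv"   # hypothetical wake word

def signal_processing(audio: bytes) -> bytes:
    """Placeholder for amplitude limiting and denoising."""
    return audio

def wake_word_matched(audio: bytes) -> bool:
    return audio.startswith(WAKE_WORD)

def asr(audio: bytes) -> str:
    """Placeholder: a real system runs acoustic and language models here."""
    return audio[len(WAKE_WORD) + 1:].decode()

def nlu(text: str) -> dict:
    """Toy intent parser; real platforms use trained models."""
    if "volume" in text:
        return {"intent": "set_volume",
                "slots": {"direction": "up" if "up" in text else "down"}}
    return {"intent": "unknown", "slots": {}}

def handle(audio: bytes) -> dict:
    audio = signal_processing(audio)
    if not wake_word_matched(audio):
        return {"handled": False}          # stay idle until the wake word
    text = asr(audio)
    return {"handled": True, "text": text, "intent": nlu(text)}

print(handle(b"hi-tv turn the volume up"))
```

Each placeholder corresponds to one module in Figure 1; in a real deployment, each would be a separate service on the device or in the cloud.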

2.1 Speech Recognition

Automatic speech recognition (ASR) is a service that converts voice signals into text. Based on the actual application scenario, the ASR service selects suitable acoustic and language models, extracts features from the received voice signal, performs multi-pass decoding and model scoring, and compares weighted hypotheses to produce a high-confidence text output. Analysis of the sound signal can also yield the user's voiceprint, emotional state, age group, and other information; with this data, user groups can be segmented and refined, providing personalized services while also improving operational quality.
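The final "compare weighted hypotheses" step can be illustrated with a small rescoring sketch. The candidate transcripts, log scores, and interpolation weight below are made-up values chosen only to show the mechanism.

```python
# Rescoring candidate transcripts by interpolating acoustic-model and
# language-model log scores, then picking the highest combined score.

def rescore(hypotheses, lm_weight=0.6):
    """Each hypothesis is (text, acoustic_log_prob, lm_log_prob).
    Returns the text with the highest combined log score."""
    def combined(h):
        _, am, lm = h
        return (1.0 - lm_weight) * am + lm_weight * lm
    return max(hypotheses, key=combined)[0]

candidates = [
    ("play the news",  -4.1, -2.0),
    ("play the noose", -3.9, -7.5),  # acoustically close, linguistically unlikely
]
print(rescore(candidates))  # → play the news
```

Even though the second hypothesis scores slightly better acoustically, the language model makes the first one win, which is the point of combining the two models.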

2.2 Semantic Understanding

For voice interaction, simply converting sound into text is far from enough; the system must understand what the user means. Semantic understanding is therefore a critical part of the voice interaction link. For a given application scenario, we first define the scenario's semantic space and identify the user intents it contains, then collect data for intent recognition and parameter extraction, feed the input text through a model, and output the key information in the text. This step converts human language into a machine-understandable, structured, and complete semantic representation.
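As a toy illustration of "structured semantic representation", the sketch below maps free text to an intent plus slots with simple rules. The intent names and patterns are illustrative assumptions; production platforms train classifiers and sequence models instead.

```python
# Rule-based intent recognition and slot extraction for a TV scenario,
# producing a semantic frame {intent, slots}. Patterns are illustrative.

import re

PATTERNS = [
    ("set_volume",     re.compile(r"volume to (?P<level>\d+)")),
    ("switch_channel", re.compile(r"switch to channel (?P<channel>\d+)")),
    ("play_video",     re.compile(r"play (?P<title>.+)")),
]

def understand(text: str) -> dict:
    """Map free text to a structured semantic frame."""
    for intent, pattern in PATTERNS:
        m = pattern.search(text.lower())
        if m:
            return {"intent": intent, "slots": m.groupdict()}
    return {"intent": "unknown", "slots": {}}

print(understand("Set the volume to 30"))
# → {'intent': 'set_volume', 'slots': {'level': '30'}}
```

The output frame is exactly the machine-readable representation the downstream dialogue management module consumes.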

2.3 Dialogue Management

Dialogue management controls the flow of human-computer dialogue. It decides the system's response based on the dialogue history and the current user input, and it is also the foundation of multi-turn dialogue. When completing a complex task, if the user's input is not specific or clear enough, dialogue management lets the system ask, clarify, or confirm in order to pin down the user's true intention and fulfill the request. Dialogue management covers dialogue state tracking, response decision-making, semantic slot filling, context management, reference disambiguation, and other functions.
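The ask-to-clarify loop can be sketched as slot filling over a tracked state. The required-slot table and intent names are illustrative assumptions for the example, not a real platform's schema.

```python
# Dialogue state tracking with slot filling: if a required slot is missing,
# ask a clarifying question and fill the slot from the next turn.

REQUIRED_SLOTS = {"play_video": ["title"], "set_volume": ["level"]}

class DialogueManager:
    def __init__(self):
        self.state = {"intent": None, "slots": {}}

    def step(self, frame: dict) -> str:
        # Merge the new turn into the tracked state (context management).
        if frame["intent"] != "unknown":
            self.state["intent"] = frame["intent"]
        self.state["slots"].update(frame["slots"])

        intent = self.state["intent"]
        missing = [s for s in REQUIRED_SLOTS.get(intent, [])
                   if s not in self.state["slots"]]
        if missing:
            return f"ask:{missing[0]}"      # clarify before acting
        return f"execute:{intent}:{self.state['slots']}"

dm = DialogueManager()
print(dm.step({"intent": "play_video", "slots": {}}))               # → ask:title
print(dm.step({"intent": "unknown", "slots": {"title": "the news"}}))
```

The second turn carries no recognizable intent of its own, yet the manager completes the original "play_video" request from context, which is the essence of multi-turn dialogue.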

2.4 Reply Generation

Based on the context and the user's actual usage scenario, the system produces feedback text or actions for the result of the user's input. Reply generation covers local commands, control definitions, dialogue responses, default broadcasts, error broadcasts, dialogue control, and other functions.
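A minimal template-based sketch of this step, pairing a spoken reply (for speech synthesis) with a device command (for the terminal). The templates and action names are illustrative assumptions, not a vendor protocol.

```python
# Template-based reply generation: map an intent and its slots to a broadcast
# text plus a terminal command. Templates/actions are illustrative.

TEMPLATES = {
    "set_volume": ("Volume set to {level}.",      {"action": "tv.set_volume"}),
    "play_video": ("Now playing {title}.",        {"action": "player.play"}),
    "unknown":    ("Sorry, I didn't catch that.", {"action": "none"}),
}

def generate_reply(intent: str, slots: dict) -> dict:
    text, action = TEMPLATES.get(intent, TEMPLATES["unknown"])
    return {"tts_text": text.format(**slots),   # sent on to speech synthesis
            "command": {**action, **slots}}     # executed on the terminal

print(generate_reply("set_volume", {"level": "30"}))
# → {'tts_text': 'Volume set to 30.', 'command': {'action': 'tv.set_volume', 'level': '30'}}
```

The "unknown" entry plays the role of the default/error broadcast mentioned above.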

2.5 Speech Synthesis

Speech synthesis is the process of converting text into standard speech output, which is equivalent to giving the device a "mouth". Whether the speech sounds smooth and the tone pleasant is determined by this module. With suitable training data and model training, the voice of a specific person can be synthesized, making interaction between people and devices feel more natural.

3 Construction of the Voice Control Platform

3.1 Construction of enterprise central control platform

Before discussing the voice central control platform, we should first look at the basic architecture of the enterprise central control platform, because voice control belongs to that enterprise platform and is part of the cloud platform. An enterprise cloud control platform is generally built to accommodate multiple business needs: besides voice, most must also serve other intelligent businesses such as image recognition and AIoT (Artificial Intelligence of Things), and the platform can be customized flexibly according to those needs. Figure 2 shows the basic architecture of a cloud control platform and the relationships among its external modules. The enterprise central control platform includes control modules such as the authentication gateway, control engine, and decision engine, plus unit modules that serve only a specific business; for example, the automatic speech recognition and semantic processing platform, image recognition platform, and AIoT platform in Figure 2 serve the voice, image recognition, and AIoT businesses respectively. Overall control through the enterprise's own cloud platform not only makes it easy to configure each business unit flexibly, but also promotes the integration and reuse of technologies, improving terminal product performance and user experience.


Figure 2 Relationship between internal and external modules of the enterprise central control platform
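The gateway-plus-dispatch structure can be sketched as follows. The token table, response codes, and unit-module names are hypothetical stand-ins chosen only to show the control flow.

```python
# Central control platform routing layer: the authentication gateway checks the
# request, then the control engine dispatches it to the unit module for its
# business type. All names are hypothetical.

VALID_TOKENS = {"device-123": "token-abc"}   # stand-in for a real auth service

def voice_unit(payload):  return f"voice:{payload}"
def image_unit(payload):  return f"image:{payload}"
def aiot_unit(payload):   return f"aiot:{payload}"

UNIT_MODULES = {"voice": voice_unit, "image": image_unit, "aiot": aiot_unit}

def gateway(device_id, token, business, payload):
    if VALID_TOKENS.get(device_id) != token:   # authentication gateway
        return {"code": 401, "result": None}
    unit = UNIT_MODULES.get(business)          # control engine dispatch
    if unit is None:
        return {"code": 404, "result": None}
    return {"code": 200, "result": unit(payload)}

print(gateway("device-123", "token-abc", "voice", "turn up the volume"))
```

Keeping authentication and dispatch in shared control modules is what lets new unit modules (a new business) be added without touching the others.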

3.2 Voice Central Control Platform Architecture

Across the whole voice processing flow, the cloud's capability largely determines whether the processing result is intelligent, which is why the cloud is also called the brain of intelligent voice. Building an enterprise's own voice central control platform therefore mainly means building the cloud-side voice processing and control platform. By function, the platform can be divided into four major modules: speech recognition, semantic understanding, intention decision, and skill distribution/decision. The relationships among the modules are shown in Figure 3. After the far-field pickup module captures the voice, the signal processing module processes the audio signal and sends it to the speech recognition module, which converts the sound into text. Far-field and near-field handling differ slightly here: near-field voice is output directly to the speech recognition module after pickup. The resulting text is parsed by the semantic understanding module, processed by the intention decision and skill distribution modules, and the result is returned to the terminal device to present specific information or perform the related action.
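The skill distribution/decision step can be sketched as choosing, among the registered skills that claim an intent, the one with the highest priority. The skill names and priorities below are illustrative assumptions.

```python
# Skill distribution: route a recognized intent to the registered skill that
# claims it with the highest priority. Names/priorities are illustrative.

SKILLS = [
    # (skill name, intents it handles, priority: higher wins)
    ("local_control", {"set_volume", "switch_channel"}, 10),
    ("video_service", {"play_video"},                    5),
    ("chat_fallback", {"unknown", "play_video"},         1),
]

def dispatch(intent: str) -> str:
    candidates = [(name, prio) for name, intents, prio in SKILLS
                  if intent in intents]
    if not candidates:
        return "chat_fallback"             # nothing claimed it: small talk
    return max(candidates, key=lambda c: c[1])[0]

print(dispatch("play_video"))   # → video_service (outranks the fallback)
print(dispatch("set_volume"))   # → local_control
```

Because the skill table is just configuration, a central control platform can add, remove, or reprioritize skills without upgrading terminal-device software, which is the flexibility motivating the platform in the first place.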
