The speech recognition capability of voice chips is increasingly used in robots that replace manual services or respond to spoken commands, enabling richer human-computer interaction and bringing more convenience to daily life. The classification and structure of such a recognition system also differ from those of an OTP voice chip system.
Classification and structure of voice chip recognition system
1. Classification of speech recognition systems
Speech recognition systems can be classified in many ways, but the most common criterion is the recognition object. By this criterion, recognition tasks fall roughly into three categories: isolated word recognition, keyword recognition (keyword spotting), and continuous speech recognition.
2. Structure of speech recognition system
1. A speech recognition system consists of four parts: speech-signal sampling and preprocessing, feature parameter extraction, the recognition core, and recognition post-processing.
2. Speech recognition is, at heart, pattern recognition and matching. First, a speech model is built from the characteristics of human speech; the input speech signal is then analyzed and the required features are extracted. On this basis, the patterns needed for recognition are established.
3. During recognition, the features of the input speech are compared with the stored patterns under the overall recognition model, and a sequence of patterns that best matches the input is found using suitable search and matching strategies.
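The four stages above can be sketched end to end as a toy pipeline. This is a minimal illustration only: the function names are hypothetical, and per-frame energy stands in for real feature parameters such as MFCCs.

```python
# Toy sketch of the four-stage pipeline: preprocessing -> feature
# extraction -> recognition core -> post-processing. Illustrative only.

def preprocess(samples):
    # Sampling/preprocessing stand-in: remove DC offset, normalize amplitude.
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]
    peak = max(abs(s) for s in centered) or 1.0
    return [s / peak for s in centered]

def extract_features(samples, frame_len=4):
    # Feature extraction stand-in: per-frame energy instead of real features.
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    return [sum(s * s for s in f) / len(f) for f in frames]

def recognize(features, templates):
    # Recognition core: nearest template by summed squared distance.
    def dist(a, b):
        n = min(len(a), len(b))
        return sum((a[i] - b[i]) ** 2 for i in range(n))
    return min(templates, key=lambda w: dist(features, templates[w]))

def postprocess(word):
    # Post-processing: map the recognized word to a device command.
    return {"on": "TURN_ON", "off": "TURN_OFF"}.get(word, "UNKNOWN")

templates = {"on": [0.8, 0.7, 0.6], "off": [0.1, 0.1, 0.2]}
signal = [0.9, -0.8, 0.85, -0.9, 0.8, -0.7, 0.75, -0.8, 0.7, -0.6, 0.65, -0.7]
command = postprocess(recognize(extract_features(preprocess(signal)), templates))
print(command)  # high-energy input matches the "on" template -> TURN_ON
```

A real voice chip performs the same four steps, but with acoustic models and search strategies far beyond this nearest-template comparison.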
Speech recognition technology, also known as automatic speech recognition (ASR), converts the vocabulary content of human speech into computer-readable input such as keystrokes, binary codes, or character sequences. It enables machines to turn speech signals into corresponding text or commands through a process of recognition and understanding, and it rests on three main elements: feature extraction, pattern-matching criteria, and model training. ASR differs from speaker identification and speaker verification, which aim to identify or confirm who is speaking rather than what is being said. The technology has also been widely adopted in the Internet of Vehicles: in Yika's platform, for example, a driver can press a one-touch customer-service button, state the destination, and navigate directly, which is both safe and convenient.
Main categories
According to the recognition object, speech recognition tasks fall roughly into three categories: isolated word recognition, keyword recognition (keyword spotting), and continuous speech recognition. Isolated word recognition identifies single words known in advance, such as "turn on" and "turn off". Continuous speech recognition transcribes arbitrary continuous speech, such as a sentence or a paragraph. Keyword spotting also operates on continuous speech, but instead of transcribing everything, it only detects where certain known keywords occur, for example finding the words "computer" and "world" in a passage.
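In its simplest form, the keyword-spotting output described above amounts to locating known keywords and their positions. A real system spots keywords acoustically, before any full transcription; the text-level sketch below (with a made-up example sentence) only illustrates what the result looks like.

```python
def spot_keywords(transcript, keywords):
    # Return each keyword with the word positions where it appears.
    words = transcript.lower().split()
    hits = {k: [] for k in keywords}
    for i, w in enumerate(words):
        if w in hits:
            hits[w].append(i)
    return hits

text = "the computer changed the world and the world changed the computer"
print(spot_keywords(text, ["computer", "world"]))
# {'computer': [1, 10], 'world': [4, 7]}
```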
According to the speaker, speech recognition can be divided into speaker-dependent and speaker-independent recognition. The former recognizes only one or a few enrolled speakers, while the latter can be used by anyone. Speaker-independent systems clearly better match practical needs, but they are much harder to build than speaker-dependent ones.
In addition, by device and channel, systems can be divided into desktop (PC), telephone, and embedded-device (mobile phone, PDA, etc.) speech recognition. Each acquisition channel distorts the acoustic characteristics of speech in its own way, so each requires its own recognition system.
Recognition method
The main method of speech recognition is pattern matching. In the training stage, the user speaks each word in the vocabulary in turn, and its feature vector is stored as a template in a template library. In the recognition stage, the feature vector of the input speech is compared with every template in the library, and the template with the highest similarity is output as the recognition result.
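A toy version of this train-then-match scheme can be written in a few lines, assuming fixed-length feature vectors and cosine similarity (both are simplifying assumptions; a real system uses techniques such as DTW or HMMs to handle variable-length utterances):

```python
import math

class TemplateMatcher:
    """Toy template-matching recognizer: one stored feature vector per word."""

    def __init__(self):
        self.templates = {}  # word -> feature vector

    def train(self, word, features):
        # Training stage: store the word's feature vector as a template.
        self.templates[word] = features

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def recognize(self, features):
        # Recognition stage: return the template with the highest similarity.
        return max(self.templates,
                   key=lambda w: self._cosine(features, self.templates[w]))

m = TemplateMatcher()
m.train("turn on",  [0.9, 0.1, 0.4])
m.train("turn off", [0.1, 0.8, 0.3])
print(m.recognize([0.85, 0.2, 0.35]))  # closest to the "turn on" template
```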
Problems
1. Accent and noise
One of the most conspicuous weaknesses of speech recognition is its handling of accents and background noise.
2. Semantic Error
The word error rate is usually not the real goal of a speech recognition system. What matters more is the semantic error rate: the fraction of the speech whose meaning is misunderstood.
3. Single channel and multi-person conversation
A good conversational speech recognizer must be able to segment the audio based on who is speaking and should also be able to sort out overlapping conversations (source separation).
4. Changes in other areas
For example: reverberation from changes in the acoustic environment, artifacts caused by hardware, audio codecs and compression artifacts, changes in sampling rates, and different ages of the speakers.
5. Context-related judgment and recognition
It is easy for humans to make judgments based on context in conversation, but it is currently difficult for machines to do so.
Differences from natural language processing
Speech recognition can be viewed as one direction within natural language processing in the broad sense.
In a broad sense, "natural language processing" includes "speech", or "speech" is also a type of "natural language". In a narrow sense, "natural language processing" refers to processing and understanding text. In simple terms, the result of speech recognition becomes one of the raw materials for natural language processing, and the result of natural language processing becomes the raw material for speech generation.
Natural speech recognition is so named to distinguish it from command speech recognition, although the basic principles are the same. Its highlight is natural language understanding: users can state the task to be recognized in their own habitual tone and wording. The main differences between the two lie in vocabulary size and processing method: command speech is processed entirely locally, whereas natural speech recognition is currently handled mostly in the cloud, so its vocabulary and processing capability far exceed those of command speech.
A fundamental problem in speech recognition is the reasonable selection of features. Feature parameter extraction analyzes and processes the speech signal to remove redundant information irrelevant to recognition, retain the information that affects recognition, and compress the signal at the same time; in practical applications the compression ratio lies between 10 and 100. A speech signal carries many kinds of information, and deciding what to extract and how requires weighing factors such as cost, performance, response time, and computational complexity. Speaker-independent systems generally extract feature parameters that reflect semantics while removing the speaker's personal traits; speaker-dependent systems, by contrast, try to preserve as much of the speaker's personal information as possible alongside the semantic features.
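The compression aspect is easy to see in the first steps of a typical front end: pre-emphasis followed by framing turns many raw samples into a handful of per-frame values. The sketch below (frame sizes and the sine-wave input are illustrative assumptions) reduces 1600 samples to 19 frame energies, a ratio inside the 10-100 range mentioned above.

```python
import math

def pre_emphasis(samples, alpha=0.97):
    # Boost high frequencies: y[n] = x[n] - alpha * x[n-1].
    return [samples[0]] + [samples[n] - alpha * samples[n - 1]
                           for n in range(1, len(samples))]

def frame_signal(samples, frame_len, hop):
    # Split into overlapping frames; drop any incomplete final frame.
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def frame_energies(samples, frame_len=160, hop=80):
    # 10 ms frames with 5 ms hop at a 16 kHz sampling rate.
    frames = frame_signal(pre_emphasis(samples), frame_len, hop)
    return [sum(s * s for s in f) / frame_len for f in frames]

# 0.1 s of a 440 Hz tone at 16 kHz: 1600 samples -> 19 energy values (~84x).
signal = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
feats = frame_energies(signal)
print(len(signal), "->", len(feats))  # 1600 -> 19
```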
Linear prediction (LP) analysis is currently a widely used feature-extraction technique, and many successful systems use cepstral parameters derived from LP. However, the linear prediction model is a purely mathematical one and does not account for how the human auditory system processes speech.
Mel-frequency parameters, and the cepstral parameters extracted by perceptual linear prediction (PLP) analysis, simulate to some extent the human ear's processing of speech, drawing on research results in auditory perception. Experiments have shown that these techniques improve the performance of speech recognition systems. In current practice, Mel-scale cepstral parameters have gradually replaced the cepstral parameters derived from conventional linear predictive coding, because they account for the characteristics of human sound production and perception and offer better robustness.
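The Mel scale underlying these parameters maps physical frequency (Hz) onto perceived pitch: roughly linear below 1 kHz and logarithmic above it. The standard conversion formulas (common across toolkits, not specific to any one implementation) are:

```python
import math

def hz_to_mel(f_hz):
    # Standard Mel-scale mapping: mel = 2595 * log10(1 + f / 700).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse mapping, used to place Mel filter-bank edges back on the Hz axis.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Equal steps on the Mel axis correspond to ever-wider steps in Hz,
# mirroring the ear's coarser frequency resolution at high frequencies.
top = hz_to_mel(8000.0)
edges_hz = [mel_to_hz(i * top / 4) for i in range(5)]
print([round(f, 1) for f in edges_hz])
```

These two functions are the starting point for building the Mel filter bank whose log-energies are decorrelated into Mel-scale cepstral coefficients (MFCCs).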
Some researchers have also tried applying wavelet analysis to feature extraction, but so far its performance does not match that of the techniques above, and it needs further research.