Speech recognition systems can be divided into several categories

The speech recognition system in a voice chip is used mainly in robots that replace manual service or respond to spoken commands, enabling richer human-computer interaction and bringing more convenience to daily life. Its classification and structure also differ from those of an OTP voice chip system.

Classification and structure of the voice chip speech recognition system

1. Classification of speech recognition systems

Speech recognition systems can be classified in many ways, but the most common is by recognition object. By this criterion, recognition tasks fall roughly into three categories: isolated word recognition, keyword recognition, and continuous speech recognition.

2. Structure of speech recognition system

1. A speech recognition system consists of four parts: speech signal sampling and preprocessing, feature parameter extraction, the recognition core, and recognition post-processing.

2. The speech recognition process is essentially pattern recognition and matching. First, a speech model is built from the characteristics of human speech; the input speech signal is then analyzed and the required features are extracted. On this basis, the patterns needed for recognition are established.

3. During recognition, the features of the input speech signal are compared with the stored speech patterns under the overall recognition model, and a search and matching strategy is used to find the sequence of patterns that best matches the input. A toy sketch of the whole pipeline follows this list.
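To make the four stages concrete, here is a toy end-to-end sketch in Python. All names are illustrative placeholders rather than any chip vendor's API, and the "features" are deliberately simplified to short-time log energy:

```python
import numpy as np

def recognize(samples: np.ndarray, sample_rate: int, templates: dict) -> str:
    """Toy pipeline mirroring the four stages described above."""
    # 1. Sampling and preprocessing: remove DC offset, normalize amplitude.
    x = samples.astype(np.float64)
    x = (x - x.mean()) / (np.abs(x).max() + 1e-9)

    # 2. Feature parameter extraction: short-time log energy over
    #    25 ms frames with a 10 ms hop (a deliberately crude feature).
    frame, hop = int(0.025 * sample_rate), int(0.010 * sample_rate)
    feats = np.array([np.log(np.sum(x[i:i + frame] ** 2) + 1e-9)
                      for i in range(0, len(x) - frame, hop)])

    # 3. Recognition core: nearest-template matching over stored patterns.
    def dist(a, b):
        n = min(len(a), len(b))
        return float(np.mean((a[:n] - b[:n]) ** 2))
    best = min(templates, key=lambda word: dist(feats, templates[word]))

    # 4. Post-processing: map the matched word to an action, reject weak
    #    matches, apply grammar constraints, etc. (omitted here).
    return best
```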

Speech recognition technology, also known as automatic speech recognition (ASR), aims to convert the vocabulary content of human speech into computer-readable input. It lets machines turn speech signals into the corresponding text or commands through recognition and understanding, and it rests mainly on three elements: feature extraction, pattern matching criteria, and model training. ASR has also been widely applied in the Internet of Vehicles. In Yika's Internet of Vehicles service, for example, pressing the one-touch customer service button lets a staff member set the destination and start navigation directly, which is both safe and convenient.

The computer-readable input that ASR produces can take forms such as keystrokes, binary codes, or character sequences. ASR differs from speaker recognition and speaker verification, which attempt to identify or confirm who is speaking rather than what is being said.

Main categories

According to the object to be recognized, speech recognition tasks fall roughly into three categories: isolated word recognition, keyword recognition (keyword spotting), and continuous speech recognition. Isolated word recognition identifies isolated words known in advance, such as "turn on" and "turn off". Continuous speech recognition transcribes arbitrary continuous speech, such as a sentence or a paragraph. Keyword spotting targets continuous speech but does not transcribe all of it; it only detects where certain known keywords appear, for example the words "computer" and "world" in a passage.

According to the speaker, speech recognition can be divided into speaker-dependent (specific-person) and speaker-independent (non-specific-person) recognition. The former recognizes only one or a few designated speakers, while the latter can be used by anyone. A speaker-independent system clearly matches practical needs better, but it is much harder to build than a speaker-dependent one.

In addition, according to the device and channel, systems can be divided into desktop (PC) speech recognition, telephone speech recognition, and embedded-device (mobile phone, PDA, etc.) speech recognition. Different acquisition channels distort the acoustic characteristics of speech in different ways, so each requires its own recognition system.

Recognition method

The main method of speech recognition is pattern (template) matching. In the training stage, the user speaks each word in the vocabulary in turn, and its feature-vector sequence is stored as a template in a template library. In the recognition stage, the feature sequence of the input speech is compared for similarity with each template in the library, and the template with the highest similarity is output as the recognition result. A minimal sketch follows.
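This sketch assumes per-frame feature vectors (e.g., MFCCs) and uses dynamic time warping (DTW) as the similarity measure, the classic choice for isolated-word template matching:

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Dynamic time warping distance between two feature sequences
    (shape: frames x coefficients). Lower means more similar."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-level distance
            D[i, j] = cost + min(D[i - 1, j],      # skip a frame of `a`
                                 D[i, j - 1],      # skip a frame of `b`
                                 D[i - 1, j - 1])  # advance both
    return float(D[n, m])

# Training stage: one feature sequence stored per vocabulary word,
# e.g. templates = {"turn on": feats_on, "turn off": feats_off}.
templates = {}

def recognize(input_feats: np.ndarray) -> str:
    # Recognition stage: compare against every template, pick the closest.
    return min(templates, key=lambda w: dtw_distance(input_feats, templates[w]))
```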

Problems

1. Accent and noise

One of the most visible weaknesses of speech recognition is its handling of accents and background noise.

2. Semantic Error

The practical goal of a speech recognition system is usually not the word error rate (WER) itself. What matters more is the semantic error rate, i.e., the fraction of utterances whose meaning is misunderstood; a transcript can contain word errors that leave the meaning intact.
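For contrast, the word error rate mentioned above is computed from the word-level edit distance between a reference transcript and the recognizer's hypothesis; a minimal sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# A semantically harmless error still counts against WER:
print(word_error_rate("turn on the light", "turn on light"))  # 0.25
```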

3. Single-channel, multi-speaker conversation

A good conversational speech recognizer must be able to segment the audio by speaker (diarization) and should also be able to untangle overlapping speech (source separation).

4. Other sources of variation

For example: reverberation from a changing acoustic environment, artifacts introduced by hardware, audio codec and compression artifacts, changes in sampling rate, and speakers of different ages.

5. Context-related judgment and recognition

It is easy for humans to make judgments based on context in conversation, but it is currently difficult for machines to do so.

Differences from natural language processing

Speech recognition is one branch of natural language processing.

In a broad sense, natural language processing includes speech; that is, speech is itself a form of natural language. In a narrow sense, natural language processing refers to processing and understanding text. Put simply, the output of speech recognition becomes raw material for natural language processing, and the output of natural language processing becomes raw material for speech generation.

"Natural speech recognition" is so named to distinguish it from command speech recognition, although the basic principles are the same. Its highlight is natural language understanding: users can phrase the task to be recognized in their own habitual words and tone. The main differences from command speech recognition are vocabulary size and processing location. Command speech is processed entirely on the device, while natural speech recognition is currently handled almost entirely in the cloud, so its vocabulary and processing capacity far exceed those of command speech. An illustrative cloud call is sketched below.
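As an illustration of such cloud processing, the open-source Python package SpeechRecognition can forward audio to a hosted recognizer. This is a generic example, not the specific service described in this article, and "command.wav" is a placeholder file name:

```python
import speech_recognition as sr  # pip install SpeechRecognition

recognizer = sr.Recognizer()
with sr.AudioFile("command.wav") as source:  # placeholder audio file
    audio = recognizer.record(source)        # read the whole file

try:
    # Sends the audio to Google's free web API; recognition runs in the cloud.
    text = recognizer.recognize_google(audio, language="en-US")
    print("Recognized:", text)
except sr.UnknownValueError:
    print("Speech was unintelligible")
except sr.RequestError as e:
    print("Cloud service unavailable:", e)
```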

A fundamental problem in speech recognition is the sensible choice of features. The purpose of feature parameter extraction is to analyze the speech signal, discard redundant information irrelevant to recognition, retain the information that matters, and compress the signal at the same time; in practice the compression ratio is between 10 and 100. A speech signal carries a great deal of information, and deciding what to extract and how requires weighing factors such as cost, performance, response time, and computational complexity. Speaker-independent systems generally try to extract features that reflect the semantics while removing the speaker's personal information, whereas speaker-dependent systems want features that reflect the semantics while preserving as much of the speaker's personal information as possible.

Linear prediction (LP) analysis is currently a widely used feature extraction technique, and many successful systems use cepstral parameters derived from LP. However, the linear prediction model is a purely mathematical model and does not account for how the human auditory system processes speech.

Mel-frequency parameters and the cepstra produced by perceptual linear prediction (PLP) analysis simulate, to some extent, how the human ear processes speech, drawing on results from auditory perception research. Experiments show that this improves recognition performance. In current practice, Mel-scale cepstrum parameters (MFCCs) have gradually replaced the cepstral parameters derived from linear predictive coding, because they account for how humans produce and perceive sound and are more robust. A minimal extraction sketch follows.
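This sketch extracts Mel-scale cepstrum (MFCC) parameters with the librosa library, one common tool; the file path and frame settings are illustrative:

```python
import librosa

# Load audio (placeholder path), resampled to 16 kHz mono.
y, sr = librosa.load("speech.wav", sr=16000)

# 13 Mel-frequency cepstral coefficients per 25 ms frame, 10 ms hop.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)
print(mfcc.shape)  # (13, number_of_frames)
```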

Some researchers have also tried applying wavelet analysis to feature extraction, but its performance so far does not match the techniques above, and it needs further study.

