The speech recognition capability of voice chips is increasingly used in robots that replace manual services or respond to spoken commands, enabling richer human-computer interaction and bringing more convenience to daily life. The classification and structure of such a recognition system also differ from those of an OTP voice chip system.
Classification and structure of voice chip recognition system
1. Classification of speech recognition systems
Speech recognition systems can be classified in many ways, but the most common criterion is the recognition object. By this criterion, recognition tasks fall roughly into three categories: isolated word recognition, keyword recognition (keyword spotting), and continuous speech recognition.
2. Structure of speech recognition system
1. A speech recognition system consists of four parts: speech-signal sampling and preprocessing, feature parameter extraction, the recognition core, and recognition post-processing.
2. Speech recognition is, at heart, pattern recognition and matching. First, a speech model is built from the characteristics of human speech; the input speech signal is then analyzed and the required features are extracted. On this basis, the patterns needed for recognition are established.
3. During recognition, the features of the input speech are compared with the stored patterns under the overall recognition model, and a sequence of patterns that best matches the input is found using suitable search and matching strategies.
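The four stages above can be sketched end to end as a toy pipeline. This is a minimal illustration only: the function names are hypothetical, and per-frame energy stands in for real feature parameters such as MFCCs.

```python
# Toy sketch of the four-stage pipeline: preprocessing -> feature
# extraction -> recognition core -> post-processing. Illustrative only.

def preprocess(samples):
    # Sampling/preprocessing stand-in: remove DC offset, normalize amplitude.
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]
    peak = max(abs(s) for s in centered) or 1.0
    return [s / peak for s in centered]

def extract_features(samples, frame_len=4):
    # Feature extraction stand-in: per-frame energy instead of real features.
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    return [sum(s * s for s in f) / len(f) for f in frames]

def recognize(features, templates):
    # Recognition core: nearest template by summed squared distance.
    def dist(a, b):
        n = min(len(a), len(b))
        return sum((a[i] - b[i]) ** 2 for i in range(n))
    return min(templates, key=lambda w: dist(features, templates[w]))

def postprocess(word):
    # Post-processing: map the recognized word to a device command.
    return {"on": "TURN_ON", "off": "TURN_OFF"}.get(word, "UNKNOWN")

templates = {"on": [0.8, 0.7, 0.6], "off": [0.1, 0.1, 0.2]}
signal = [0.9, -0.8, 0.85, -0.9, 0.8, -0.7, 0.75, -0.8, 0.7, -0.6, 0.65, -0.7]
command = postprocess(recognize(extract_features(preprocess(signal)), templates))
print(command)  # high-energy input matches the "on" template -> TURN_ON
```

A real voice chip performs the same four steps, but with acoustic models and search strategies far beyond this nearest-template comparison.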
Speech recognition technology, also known as automatic speech recognition (ASR), converts the vocabulary content of human speech into computer-readable input such as keystrokes, binary codes, or character sequences. It enables machines to turn speech signals into corresponding text or commands through a process of recognition and understanding, and it rests on three main elements: feature extraction, pattern-matching criteria, and model training. ASR differs from speaker identification and speaker verification, which aim to identify or confirm who is speaking rather than what is being said. The technology has also been widely adopted in the Internet of Vehicles: in Yika's platform, for example, a driver can press a one-touch customer-service button, state the destination, and navigate directly, which is both safe and convenient.
Main categories
According to the recognition object, speech recognition tasks fall roughly into three categories: isolated word recognition, keyword recognition (keyword spotting), and continuous speech recognition. Isolated word recognition identifies single words known in advance, such as "turn on" and "turn off". Continuous speech recognition transcribes arbitrary continuous speech, such as a sentence or a paragraph. Keyword spotting also operates on continuous speech, but instead of transcribing everything, it only detects where certain known keywords occur, for example finding the words "computer" and "world" in a passage.
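In its simplest form, the keyword-spotting output described above amounts to locating known keywords and their positions. A real system spots keywords acoustically, before any full transcription; the text-level sketch below (with a made-up example sentence) only illustrates what the result looks like.

```python
def spot_keywords(transcript, keywords):
    # Return each keyword with the word positions where it appears.
    words = transcript.lower().split()
    hits = {k: [] for k in keywords}
    for i, w in enumerate(words):
        if w in hits:
            hits[w].append(i)
    return hits

text = "the computer changed the world and the world changed the computer"
print(spot_keywords(text, ["computer", "world"]))
# {'computer': [1, 10], 'world': [4, 7]}
```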
According to the speaker, speech recognition can be divided into speaker-dependent and speaker-independent recognition. The former recognizes only one or a few enrolled speakers, while the latter can be used by anyone. Speaker-independent systems clearly better match practical needs, but they are much harder to build than speaker-dependent ones.
In addition, by device and channel, systems can be divided into desktop (PC), telephone, and embedded-device (mobile phone, PDA, etc.) speech recognition. Each acquisition channel distorts the acoustic characteristics of speech in its own way, so each requires its own recognition system.
Recognition method
The main method of speech recognition is pattern matching. In the training stage, the user speaks each word in the vocabulary in turn, and its feature vector is stored as a template in a template library. In the recognition stage, the feature vector of the input speech is compared with every template in the library, and the template with the highest similarity is output as the recognition result.
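A toy version of this train-then-match scheme can be written in a few lines, assuming fixed-length feature vectors and cosine similarity (both are simplifying assumptions; a real system uses techniques such as DTW or HMMs to handle variable-length utterances):

```python
import math

class TemplateMatcher:
    """Toy template-matching recognizer: one stored feature vector per word."""

    def __init__(self):
        self.templates = {}  # word -> feature vector

    def train(self, word, features):
        # Training stage: store the word's feature vector as a template.
        self.templates[word] = features

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def recognize(self, features):
        # Recognition stage: return the template with the highest similarity.
        return max(self.templates,
                   key=lambda w: self._cosine(features, self.templates[w]))

m = TemplateMatcher()
m.train("turn on",  [0.9, 0.1, 0.4])
m.train("turn off", [0.1, 0.8, 0.3])
print(m.recognize([0.85, 0.2, 0.35]))  # closest to the "turn on" template
```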
Problems
1. Accent and noise
One of the most conspicuous weaknesses of speech recognition is its handling of accents and background noise.
2. Semantic Error
The word error rate is usually not the real goal of a speech recognition system. What matters more is the semantic error rate: the fraction of the speech whose meaning is misunderstood.
3. Single channel and multi-person conversation
A good conversational speech recognizer must be able to segment the audio based on who is speaking and should also be able to sort out overlapping conversations (source separation).
4. Changes in other areas
For example: reverberation from changes in the acoustic environment, artifacts caused by hardware, audio codecs and compression artifacts, changes in sampling rates, and different ages of the speakers.
5. Context-related judgment and recognition
It is easy for humans to make judgments based on context in conversation, but it is currently difficult for machines to do so.
Differences from natural language processing
Speech recognition can be viewed as one direction within natural language processing in the broad sense.
In a broad sense, "natural language processing" includes "speech", or "speech" is also a type of "natural language". In a narrow sense, "natural language processing" refers to processing and understanding text. In simple terms, the result of speech recognition becomes one of the raw materials for natural language processing, and the result of natural language processing becomes the raw material for speech generation.
Natural speech recognition is so named to distinguish it from command speech recognition, although the basic principles are the same. Its highlight is natural language understanding: users can state the task to be recognized in their own habitual tone and wording. The main differences between the two lie in vocabulary size and processing method: command speech is processed entirely locally, whereas natural speech recognition is currently handled mostly in the cloud, so its vocabulary and processing capability far exceed those of command speech.
A fundamental problem in speech recognition is the reasonable selection of features. Feature parameter extraction analyzes and processes the speech signal to remove redundant information irrelevant to recognition, retain the information that affects recognition, and compress the signal at the same time; in practical applications the compression ratio lies between 10 and 100. A speech signal carries many kinds of information, and deciding what to extract and how requires weighing factors such as cost, performance, response time, and computational complexity. Speaker-independent systems generally extract feature parameters that reflect semantics while removing the speaker's personal traits; speaker-dependent systems, by contrast, try to preserve as much of the speaker's personal information as possible alongside the semantic features.
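The compression aspect is easy to see in the first steps of a typical front end: pre-emphasis followed by framing turns many raw samples into a handful of per-frame values. The sketch below (frame sizes and the sine-wave input are illustrative assumptions) reduces 1600 samples to 19 frame energies, a ratio inside the 10-100 range mentioned above.

```python
import math

def pre_emphasis(samples, alpha=0.97):
    # Boost high frequencies: y[n] = x[n] - alpha * x[n-1].
    return [samples[0]] + [samples[n] - alpha * samples[n - 1]
                           for n in range(1, len(samples))]

def frame_signal(samples, frame_len, hop):
    # Split into overlapping frames; drop any incomplete final frame.
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def frame_energies(samples, frame_len=160, hop=80):
    # 10 ms frames with 5 ms hop at a 16 kHz sampling rate.
    frames = frame_signal(pre_emphasis(samples), frame_len, hop)
    return [sum(s * s for s in f) / frame_len for f in frames]

# 0.1 s of a 440 Hz tone at 16 kHz: 1600 samples -> 19 energy values (~84x).
signal = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
feats = frame_energies(signal)
print(len(signal), "->", len(feats))  # 1600 -> 19
```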
Linear prediction (LP) analysis is currently a widely used feature-extraction technique, and many successful systems use cepstral parameters derived from LP. However, the linear prediction model is a purely mathematical one and does not account for how the human auditory system processes speech.
Mel-frequency parameters, and the cepstral parameters extracted by perceptual linear prediction (PLP) analysis, simulate to some extent the human ear's processing of speech, drawing on research results in auditory perception. Experiments have shown that these techniques improve the performance of speech recognition systems. In current practice, Mel-scale cepstral parameters have gradually replaced the cepstral parameters derived from conventional linear predictive coding, because they account for the characteristics of human sound production and perception and offer better robustness.
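The Mel scale underlying these parameters maps physical frequency (Hz) onto perceived pitch: roughly linear below 1 kHz and logarithmic above it. The standard conversion formulas (common across toolkits, not specific to any one implementation) are:

```python
import math

def hz_to_mel(f_hz):
    # Standard Mel-scale mapping: mel = 2595 * log10(1 + f / 700).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse mapping, used to place Mel filter-bank edges back on the Hz axis.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Equal steps on the Mel axis correspond to ever-wider steps in Hz,
# mirroring the ear's coarser frequency resolution at high frequencies.
top = hz_to_mel(8000.0)
edges_hz = [mel_to_hz(i * top / 4) for i in range(5)]
print([round(f, 1) for f in edges_hz])
```

These two functions are the starting point for building the Mel filter bank whose log-energies are decorrelated into Mel-scale cepstral coefficients (MFCCs).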
Some researchers have also tried applying wavelet analysis to feature extraction, but so far its performance does not match that of the techniques above, and it needs further research.