Brief analysis of the working principle of speech recognition technology-EEWORLD


Speech recognition technology lets a machine convert a speech signal into text through recognition and then, through understanding, turn that text into instructions. The goal is to give machines a human-like sense of hearing, so that they can understand what people say and act on it. A speech recognition system usually consists of two parts, an acoustic recognition model and a language understanding model, which handle the speech-to-syllable and syllable-to-word computations respectively. A continuous speech recognition system generally has four main components: feature extraction, an acoustic model, a language model and a decoder. The processing stages, including input preprocessing, are described in turn below.

(1) Speech input preprocessing. This module processes the raw speech signal, filters out irrelevant information and background noise, and performs endpoint detection (finding where the speech begins and ends) and framing. Framing can be understood by analogy with video: a speech segment consists of many ordered frames, so the signal can be cut into individual "frames" for analysis.
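The framing and endpoint detection described in step (1) can be sketched as follows. This is a minimal illustration, not the method of any particular system: it assumes a simple short-time energy threshold, and the function names, the 25 ms / 10 ms frame settings and the 10% threshold are illustrative choices that do not come from the article.

```python
import numpy as np

def split_into_frames(signal, frame_len, hop_len):
    """Cut a 1-D signal into overlapping frames (one frame per row)."""
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # idx[i, k] is the sample index of position k in frame i
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx]

def detect_endpoints(frames, threshold_ratio=0.1):
    """Return (first, last) indices of frames whose energy exceeds
    threshold_ratio * (maximum frame energy), or None if there are none."""
    energy = np.sum(frames ** 2, axis=1)
    if energy.size == 0 or energy.max() == 0:
        return None
    active = np.flatnonzero(energy > threshold_ratio * energy.max())
    return int(active[0]), int(active[-1])

# Synthetic input (assumed 16 kHz sampling): silence, a 440 Hz tone, silence
sr = 16000
t = np.arange(sr // 2) / sr
signal = np.concatenate([np.zeros(4000), np.sin(2 * np.pi * 440 * t), np.zeros(4000)])

frames = split_into_frames(signal, frame_len=400, hop_len=160)  # 25 ms frames, 10 ms hop
start, end = detect_endpoints(frames)
print(frames.shape, start, end)
```

Practical systems usually add noise filtering and more robust decision rules than a single energy threshold, which tends to fail on noisy recordings.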

(2) Feature extraction. Information in the speech signal that is redundant for recognition is removed, and the information that reflects the essential characteristics of the speech is retained and expressed in a fixed form. In other words, the key feature parameters of the speech signal are extracted to form a sequence of feature vectors for subsequent processing.
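As a rough sketch of step (2), each frame can be mapped to one feature vector. The article does not name a specific feature type, so the log power spectrum below is only a stand-in chosen for brevity (real systems often use mel-scale features such as MFCCs); the function name, the Hamming window and the FFT size are assumptions.

```python
import numpy as np

def log_power_spectrum(frames, n_fft=512):
    """Map each frame (row) to a log-power-spectrum feature vector."""
    window = np.hamming(frames.shape[1])          # taper frame edges
    spectrum = np.fft.rfft(frames * window, n=n_fft, axis=1)
    power = np.abs(spectrum) ** 2
    return np.log(power + 1e-10)                  # small offset avoids log(0)

# Two synthetic 25 ms frames at an assumed 16 kHz rate: 1 kHz and 3 kHz tones
sr = 16000
t = np.arange(400) / sr
frames = np.stack([np.sin(2 * np.pi * 1000 * t), np.sin(2 * np.pi * 3000 * t)])

features = log_power_spectrum(frames)             # one feature vector per frame
peak_bins = features.argmax(axis=1)               # strongest frequency bin per frame
print(features.shape, peak_bins * sr / 512)       # approximate peak frequencies in Hz
```

The result is the "feature vector sequence" the text refers to: one row per frame, passed on to the acoustic model.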


(3) Acoustic model training. The acoustic model is a model of sound: it converts the speech input into an acoustic output or, more precisely, gives the probability that a piece of speech belongs to a particular acoustic symbol. Its parameters are trained on the feature parameters of a training speech corpus. During recognition, the feature parameters of the speech to be recognized are matched against the acoustic model to obtain the recognition result. Most current mainstream speech recognition systems use the hidden Markov model (HMM) for acoustic modeling.
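To make "matching features against the acoustic model" concrete, the sketch below scores an observation sequence against two small HMMs with the forward algorithm and compares the likelihoods. It is a toy under stated assumptions: the features are taken to be already quantized into discrete symbols, and every probability is invented for illustration rather than trained.

```python
import numpy as np

def forward_likelihood(init, trans, emit, observations):
    """P(observations | HMM) computed with the forward algorithm.

    init[i]     : probability of starting in state i
    trans[i, j] : probability of moving from state i to state j
    emit[i, k]  : probability that state i emits symbol k
    """
    alpha = init * emit[:, observations[0]]
    for obs in observations[1:]:
        alpha = (alpha @ trans) * emit[:, obs]
    return float(alpha.sum())

# Two hypothetical left-to-right models sharing the same topology
init = np.array([1.0, 0.0])
trans = np.array([[0.6, 0.4],
                  [0.0, 1.0]])
emit_a = np.array([[0.9, 0.1],    # model A: symbol 0 first, then symbol 1
                   [0.2, 0.8]])
emit_b = np.array([[0.1, 0.9],    # model B: symbol 1 first, then symbol 0
                   [0.8, 0.2]])

obs = [0, 0, 1, 1]                # quantized feature sequence to recognize
score_a = forward_likelihood(init, trans, emit_a, obs)
score_b = forward_likelihood(init, trans, emit_b, obs)
print(score_a, score_b)           # model A explains this sequence better
```

For long utterances the same recursion is normally carried out in the log domain to avoid numerical underflow.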

(4) Language model training. A language model calculates the probability of a sentence occurring. Put simply, it estimates how likely a sentence is to be well formed. Because sentence structure is usually regular, the words that come first often predict the words that may follow. The model is mainly used to decide which of several word sequences is more likely, or to predict the next word given the words seen so far. It defines which words can follow the last recognized word (matching proceeds sequentially), so that impossible words can be excluded from the matching process.

Language modeling can effectively combine knowledge of Chinese grammar and semantics to describe the relationships between words, which improves the recognition rate and narrows the search space. The training text database undergoes grammatical and semantic analysis, and the language model is then obtained by training a statistical model on it.
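A statistical language model of the kind described above can be sketched as a bigram model. The article does not say which statistical model is used, so the bigram form, the add-one smoothing and the three-sentence corpus are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count word histories and word pairs, with sentence boundary markers."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return histories, bigrams, vocab

def sentence_log_prob(sentence, model):
    """Log-probability of a sentence under add-one smoothed bigram estimates."""
    histories, bigrams, vocab = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (histories[prev] + len(vocab)))
    return total

corpus = ["turn on the light", "turn off the light", "turn on the radio"]
model = train_bigram(corpus)

likely = sentence_log_prob("turn on the light", model)
unlikely = sentence_log_prob("light the on turn", model)
print(likely, unlikely)   # the well-ordered sentence scores higher
```

The same counts also answer the "which word comes next" question: after "turn", the words "on" and "off" receive far more probability than "light", which is how unlikely candidates get pruned during matching.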

(5) Speech decoding and search algorithm. Decoding is the recognition step itself. For an input speech signal, a recognition network is built from the trained HMM acoustic model, the language model and the dictionary, and a search algorithm finds the best path through that network. This path corresponds to the word string most likely to have produced the speech signal, which determines the text contained in the speech sample. The decoding operation is therefore the search algorithm: the method used at the decoding end to find the optimal word string.
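The "best path" search can be illustrated with the Viterbi algorithm, the standard dynamic-programming method for HMMs. The sketch below is deliberately simplified: it searches over two states rather than a full recognition network built from an acoustic model, language model and dictionary, and all probabilities are made up for the example.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state path and its log-probability.

    log_emit[t, i] is the log-probability of frame t under state i.
    """
    n_frames, n_states = log_emit.shape
    score = log_init + log_emit[0]
    backptr = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + log_trans       # cand[i, j]: arrive in j from i
        backptr[t] = cand.argmax(axis=0)        # best predecessor for each state
        score = cand.max(axis=0) + log_emit[t]
    # Trace the best path backwards from the best final state
    path = [int(score.argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(score.max())

log_init = np.log(np.array([0.8, 0.2]))
log_trans = np.log(np.array([[0.7, 0.3],
                             [0.1, 0.9]]))
# Per-frame state likelihoods, as an acoustic model might supply them
log_emit = np.log(np.array([[0.9, 0.1],
                            [0.8, 0.2],
                            [0.2, 0.8],
                            [0.1, 0.9]]))

path, log_prob = viterbi(log_init, log_trans, log_emit)
print(path, log_prob)
```

Large-vocabulary decoders typically add pruning (beam search) on top of this recursion, because the full network is too large to search exhaustively.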

The search in continuous speech recognition looks for a sequence of word models that describes the input speech signal, and thereby yields the decoded word sequence. The search is driven by the acoustic model score and the language model score. In practice, it is often necessary to give the language model a high weight, chosen from experience, and to set a long-word penalty score.
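One common way to combine these scores is a weighted sum of log-probabilities plus a penalty term. The sketch below shows that form only as an assumption: the article's own formula is not reproduced here, the penalty is modeled as a fixed cost per word (often called a word insertion penalty), and the weights, hypotheses and scores are invented.

```python
def combined_score(acoustic_logp, lm_logp, n_words, lm_weight=10.0, word_penalty=-0.5):
    """Acoustic log-score + weighted language model log-score + per-word penalty."""
    return acoustic_logp + lm_weight * lm_logp + word_penalty * n_words

# (text, acoustic log-score, language model log-score): illustrative values only
hypotheses = [
    ("recognize speech", -120.0, -4.0),
    ("wreck a nice beach", -118.0, -9.0),
]

def pick_best(lm_weight, word_penalty):
    return max(
        hypotheses,
        key=lambda h: combined_score(h[1], h[2], len(h[0].split()), lm_weight, word_penalty),
    )[0]

print(pick_best(lm_weight=0.0, word_penalty=0.0))    # acoustic score alone decides
print(pick_best(lm_weight=10.0, word_penalty=-0.5))  # language model weight changes the winner
```

With the language model ignored, the acoustically closer but less plausible hypothesis wins; with a high language model weight the more plausible word sequence is chosen, which is the effect the empirical tuning is after.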

Speech recognition is essentially a pattern recognition process: the pattern of the unknown speech is compared one by one with reference patterns of known speech, and the best-matching reference pattern is taken as the recognition result. The mainstream algorithms today include dynamic time warping (DTW), vector quantization (VQ) based on non-parametric models, hidden Markov models (HMM) based on parametric models and, in recent years, methods based on deep learning and support vector machines.
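The template-matching view is easiest to see with DTW, the first algorithm listed. The following is a minimal sketch that assumes one-dimensional features and absolute difference as the local distance; the templates and the unknown input are toy sequences, not real speech features.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a, step in b, step in both
            cost[i, j] = local + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

# Reference patterns of "known speech" and a slower rendition of one of them
templates = {"up": [1, 2, 3, 4], "down": [4, 3, 2, 1]}
unknown = [1, 1, 2, 3, 3, 4]

best = min(templates, key=lambda name: dtw_distance(unknown, templates[name]))
print(best, dtw_distance(unknown, templates[best]))
```

Because DTW stretches and compresses the time axis, the unknown pattern matches its template even though it is two samples longer, which is why the method suits speech spoken at different speeds.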
