Introduction to Speech Recognition Technology
For beginners, the main things to learn are the general steps of speech recognition and the current mainstream methods. A speech recognition system works in three main steps: 1) Preprocessing. The input speech signal is pre-emphasized, split into frames, and windowed to suppress unimportant information and background noise, and endpoint detection is performed to locate the valid speech segment; 2) Feature extraction. Common feature parameters include time-domain features such as amplitude, zero-crossing rate, and energy, and frequency-domain features such as linear prediction cepstral coefficients (LPCC) and Mel-frequency cepstral coefficients (MFCC); 3) Pattern matching. The extracted features are compared against reference templates or trained models, and the closest match is output as the recognition result.
There are several mainstream speech recognition technologies: 1) Dynamic Time Warping (DTW). DTW uses dynamic programming to warp the time axis of one utterance onto another, which gives a distance between two sequences of feature vectors even when they differ in length. It is a classic algorithm in speech recognition. DTW is relatively easy to implement, but it cannot fully exploit the temporal and dynamic characteristics of speech signals, so it is best suited to relatively simple Chinese speech recognition systems, such as isolated-word and small-vocabulary systems. 2) Hidden Markov Model (HMM). An HMM uses the states of a Markov chain to represent the pronunciation process of speech. While a single word is being produced, the system moves from one state to another and generates an output in each state until the whole word has been output. The Markov chain models how the signal changes, and that change is observed only indirectly through the output sequence, so an HMM is a doubly stochastic process that can describe both the overall non-stationarity and the short-term stationarity of speech signals well.
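The HMM description above can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: it uses a discrete-output HMM and the standard forward algorithm to compute how likely an observation sequence is under a model. The function name `forward_likelihood`, the two toy models `word_a` and `word_b`, and all probability values are hypothetical and chosen for readability, not taken from any real recognizer.

```python
def forward_likelihood(observations, start_prob, trans_prob, emit_prob):
    """Return P(observations | model) for a discrete HMM (forward algorithm)."""
    n_states = len(start_prob)
    # alpha[j] = probability of the observations so far AND being in state j now
    alpha = [start_prob[j] * emit_prob[j][observations[0]] for j in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            # hidden process: move from any state i to state j ...
            sum(alpha[i] * trans_prob[i][j] for i in range(n_states))
            # ... observable process: state j emits the current symbol
            * emit_prob[j][obs]
            for j in range(n_states)
        ]
    return sum(alpha)


# Two hypothetical left-to-right word models over the symbols {0, 1}.
# Each entry is (start_prob, trans_prob, emit_prob); all numbers are made up.
word_models = {
    "word_a": ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]]),
    "word_b": ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]], [[0.1, 0.9], [0.8, 0.2]]),
}

observed = [0, 0, 1, 1]  # a quantized feature sequence (assumed input)
scores = {
    name: forward_likelihood(observed, *model) for name, model in word_models.items()
}
best_word = max(scores, key=scores.get)
print(best_word, scores)
```

In an isolated-word setup, one such model is typically trained per word, and the word whose model gives the highest likelihood is taken as the result. Real systems usually work with log probabilities and continuous feature vectors, which this sketch leaves out.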
HMM also has drawbacks: it requires a priori assumptions about the distribution of the state sequence; its modeling ability for high-level acoustic phonemes is weak, so acoustically similar words are easily confused; and HMM-based speech recognition systems are difficult to implement in hardware. 3) Artificial neural network (ANN). ANN models take a long time to train.
Current difficulties in speech recognition: 1) Recognition performance depends on the surrounding environment. When the training environment and the test environment differ, accuracy deteriorates; 2) Noise. How to remove noise from the signal is still a problem; 3) The ambiguity of speech information: how to distinguish words with similar pronunciations, and words with the same pronunciation but different meanings.
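The preprocessing and time-domain feature steps listed at the start of this post can be sketched in a few lines. This is a minimal illustration, not a production front end: the pre-emphasis coefficient 0.97, the frame length of 100 samples, the hop of 50 samples, the Hamming window, and the energy threshold of 10% of the peak are all assumptions chosen for the example, and every function name is hypothetical. LPCC and MFCC extraction are not covered here.

```python
import math


def pre_emphasize(signal, coeff=0.97):
    """Boost high frequencies: y[n] = x[n] - coeff * x[n-1]."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))]


def frame_signal(signal, frame_len, hop_len):
    """Split the signal into overlapping frames of equal length."""
    return [signal[s:s + frame_len] for s in range(0, len(signal) - frame_len + 1, hop_len)]


def hamming_window(frame):
    """Taper the frame edges with a Hamming window."""
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1))) for i, x in enumerate(frame)]


def short_time_energy(frame):
    return sum(x * x for x in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def detect_endpoints(energies, threshold):
    """Return (first, last) frame index whose energy exceeds the threshold."""
    active = [i for i, e in enumerate(energies) if e > threshold]
    return (active[0], active[-1]) if active else None


# Synthetic test signal: silence, a sine burst standing in for speech, silence.
burst = [math.sin(2 * math.pi * 0.05 * n) for n in range(400)]
signal = [0.0] * 200 + burst + [0.0] * 200

frames = [hamming_window(f) for f in frame_signal(pre_emphasize(signal), 100, 50)]
energies = [short_time_energy(f) for f in frames]
zcrs = [zero_crossing_rate(f) for f in frames]
endpoints = detect_endpoints(energies, threshold=0.1 * max(energies))
print(endpoints)
```

A fixed energy threshold like this only works on clean signals. With background noise, which the post names as one of the open difficulties, endpoint detection usually needs something more robust, for example combining energy with the zero-crossing rate.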
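The DTW method described earlier in this post can also be shown as a short sketch. It assumes each utterance has already been turned into a sequence of feature vectors (for example one MFCC vector per frame), uses Euclidean distance between frames, and applies the basic unconstrained step pattern. The names `dtw_distance` and `templates`, and the tiny one-dimensional "features", are hypothetical and only serve the illustration.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Accumulated distance of the best time alignment between two sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best accumulated cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])  # frame-to-frame distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # seq_a advances, seq_b frame is repeated
                cost[i][j - 1],      # seq_b advances, seq_a frame is repeated
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# Hypothetical reference templates, one per isolated word.
templates = {
    "up": [(0.0,), (1.0,), (2.0,)],
    "down": [(2.0,), (1.0,), (0.0,)],
}

# The same "up" pattern spoken more slowly: every frame lasts twice as long.
spoken = [(0.0,), (0.0,), (1.0,), (1.0,), (2.0,), (2.0,)]

distances = {word: dtw_distance(spoken, ref) for word, ref in templates.items()}
recognized = min(distances, key=distances.get)
print(recognized, distances)
```

The table has one cell per pair of frames, so time and memory grow with the product of the two sequence lengths. That is one reason this template-matching approach fits isolated-word, small-vocabulary tasks better than large ones.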