The basic process of speech recognition

This post was last edited by fish001 on 2018-8-27 22:26

The process of computer speech recognition is essentially the same as that of human speech recognition. Current mainstream speech recognition technology is built on the theory of statistical pattern recognition. A complete speech recognition system can be roughly divided into three parts:

1. Speech feature extraction: its purpose is to extract, from the speech waveform, a feature sequence that varies over time.
2. Acoustic model and pattern matching: the acoustic model is usually trained from the acquired speech features by a learning algorithm. During recognition, the input speech features are matched against the acoustic model (the pattern) to obtain the best recognition result.
3. Language model and language processing: the language model is either a grammar network built from the speech commands to be recognized or a statistical language model; language processing then performs grammatical and semantic analysis.

The acoustic model is the underlying model of the recognition system and its most critical part. Its purpose is to provide an effective way to compute the distance between the sequence of speech feature vectors and each pronunciation template. The size of the acoustic modeling unit (word model, semi-syllable model, or phoneme model) has a strong influence on the amount of training data needed, the recognition rate, and the flexibility of the system, so the unit must be chosen according to the characteristics of the language and the vocabulary size of the recognizer.

The language model is particularly important for systems with medium or large vocabularies. When classification errors occur, they can be corrected using linguistic knowledge, grammatical structure, and semantics; in particular, homophones can only be disambiguated from the surrounding context.

Generally speaking, there are three approaches to speech recognition: methods based on vocal tract models and phonetic knowledge, template matching methods, and artificial neural networks [1].

(1) Methods based on phonetics and acoustics
Work on this approach started early; studies of this kind already existed when speech recognition was first proposed. However, because the models and the required phonetic knowledge are complex, it has not yet reached a practical stage. The underlying assumption is that a language contains a finite number of distinct phonetic primitives, which can be distinguished by the frequency-domain or time-domain characteristics of the speech signal. The method is therefore carried out in two steps:

Step 1: segmentation and labeling. The speech signal is divided into discrete segments along the time axis, each corresponding to the acoustic characteristics of one or several phonetic primitives; each segment is then given the label of the closest primitive according to those characteristics.

Step 2: obtaining a word sequence. From the label sequence produced in the first step, a lattice of phonetic primitives is built, and valid word sequences are looked up in the dictionary, possibly in combination with sentence-level grammar and semantics.
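To make the feature-extraction front end described above concrete, here is a minimal Python sketch (not from the original post; the function name frame_features and its parameters are illustrative assumptions). It frames the waveform, applies a Hamming window, and keeps a few low-order log-power-spectrum values as a crude stand-in for real features such as MFCCs. The per-frame vectors it produces are exactly the kind of time-varying sequence that the template matching methods in the next section compare.

import numpy as np

def frame_features(signal, sample_rate, frame_ms=25, hop_ms=10, n_coeffs=13):
    # Turn a waveform (1-D float array) into a sequence of feature vectors:
    # frame the signal, apply a Hamming window, and keep the low-order
    # log power spectrum of each frame as a rough spectral envelope.
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hamming(frame_len)

    features = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        features.append(np.log(power + 1e-10)[:n_coeffs])
    return np.array(features)          # shape: (num_frames, n_coeffs)

# Usage (hypothetical 16 kHz mono recording loaded as a float array):
# feats = frame_features(waveform, 16000)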
(2) Template matching method
The template matching method has developed to a relatively mature state and has reached the practical stage. It involves four steps: feature extraction, template training, template classification, and decision. Three techniques are commonly used: dynamic time warping (DTW), hidden Markov model (HMM) theory, and vector quantization (VQ).

Dynamic time warping (DTW)
Endpoint detection of the speech signal is a basic step in speech recognition and the basis for both feature training and recognition. Endpoint detection means locating the start and end points of the various segments in the signal (phonemes, syllables, morphemes) and excluding the silent portions. Early endpoint detection relied mainly on energy, amplitude, and zero-crossing rate, but the results were often unsatisfactory. In the 1960s, the Japanese scholar Itakura proposed the dynamic time warping algorithm (DTW: Dynamic Time Warping). Its idea is to uniformly stretch or compress the unknown utterance until its length matches that of the reference pattern; in the process, the time axis of the unknown word is warped non-uniformly so that its features align with those of the template.
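As an illustration of the DTW idea just described, here is a small Python sketch (an example under my own assumptions, not code from the post). It computes the classic accumulated-cost alignment between an unknown utterance and a reference template, using per-frame feature vectors such as those produced by the front end sketched earlier. The template dictionary in the usage comment is hypothetical.

import numpy as np

def dtw_distance(ref, test):
    # ref, test: 2-D arrays of shape (num_frames, num_features).
    # Returns the accumulated distance after optimal non-uniform time alignment.
    n, m = len(ref), len(test)
    # Local frame-to-frame Euclidean distances
    local = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=2)

    # Accumulated cost with the usual (i-1,j), (i,j-1), (i-1,j-1) predecessors
    acc = np.full((n, m), np.inf)
    acc[0, 0] = local[0, 0]
    for i in range(1, n):
        acc[i, 0] = acc[i - 1, 0] + local[i, 0]
    for j in range(1, m):
        acc[0, j] = acc[0, j - 1] + local[0, j]
    for i in range(1, n):
        for j in range(1, m):
            acc[i, j] = local[i, j] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[-1, -1]

# Usage: recognize the unknown word as the template with the smallest warped distance.
# templates = {"yes": feats_yes, "no": feats_no}   # hypothetical reference features
# best = min(templates, key=lambda w: dtw_distance(templates[w], unknown_feats))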


