What are the common speech recognition algorithms? Speech recognition feature extraction methods


What are the common speech recognition algorithms?

This article lists several different speech recognition algorithms and then surveys common feature extraction methods.

The first: the algorithm based on dynamic time warping (DTW)

DTW remains a mainstream method for small-vocabulary, isolated-word recognition.

The method is computationally expensive, but it is technically simple and achieves high recognition accuracy.

Many improved DTW algorithms have been proposed for such systems, for example, an isolated-word recognition method based on a frequency-scaled DTW algorithm.
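To make the idea concrete, here is a minimal DTW sketch in Python; the function name, the Euclidean local distance, and the assumption that inputs are 2-D arrays of feature frames are illustrative choices, not prescribed by the article.

```python
import numpy as np

def dtw_distance(x, y):
    """Cumulative cost of the best alignment between two feature
    sequences x and y (each a 2-D array of frame vectors)."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])  # local frame distance
            D[i, j] = cost + min(D[i - 1, j],        # skip a frame of x
                                 D[i, j - 1],        # skip a frame of y
                                 D[i - 1, j - 1])    # align the two frames
    return D[n, m]

# Recognition: compare the utterance against every stored reference
# template and pick the word whose template gives the smallest distance.
```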

The second: the hidden Markov model (HMM), based on a parametric model

This algorithm is mainly used in large-vocabulary speech recognition systems. It requires more training data, longer training and recognition times, and more memory.

In general, continuous HMMs require more computation than discrete HMMs but achieve higher recognition rates.
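As a concrete illustration, here is a minimal sketch of the forward algorithm, the recursion a discrete HMM uses to score an observation sequence against a trained model; the two-state toy parameters are invented for this example.

```python
import numpy as np

# Toy discrete HMM: 2 hidden states, 3 observation symbols.
A = np.array([[0.7, 0.3],        # state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],   # per-state emission probabilities
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])        # initial state distribution

def forward(obs):
    """P(observation sequence | model), computed by the forward recursion."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# In recognition, each word has its own HMM; the word whose model assigns
# the highest likelihood to the observed feature sequence is chosen.
print(forward([0, 1, 2]))
```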

The third: vector quantization (VQ), based on a non-parametric model

The method requires very little training data, little training and recognition time, and little working storage.

However, for large-vocabulary speech recognition, VQ does not perform as well as HMM.

It has been applied successfully in isolated-word (character) speech recognition systems.
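A minimal VQ sketch, assuming the training features are a 2-D NumPy array of frame vectors; plain k-means stands in here for the LBG algorithm commonly used to train codebooks.

```python
import numpy as np

def train_codebook(features, k=64, iters=20, seed=0):
    """Train a VQ codebook with k-means (a simplified stand-in for LBG)."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign every feature vector to its nearest codeword...
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # ...then move each codeword to the centroid of its cluster.
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = features[labels == j].mean(axis=0)
    return codebook

def quantize(vec, codebook):
    """Encode one feature vector as the index of its nearest codeword."""
    return int(np.linalg.norm(codebook - vec, axis=1).argmin())
```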

In addition, there are algorithms based on artificial neural networks (ANNs) and hybrid algorithms, such as the ANN/HMM and FSVQ/HMM methods.

Other speech recognition algorithms and related techniques include the following:

Convolutional Neural Networks

Deep Learning Neural Networks

BP Neural Network

RBF Neural Network

Fuzzy Clustering Neural Network

Improved T-S fuzzy neural network

Recurrent Neural Networks

Wavelet Neural Network

Chaotic Neural Network

Wavelet Chaotic Neural Network

Neural Networks and Genetic Algorithms

Dynamically Optimizing Neural Networks

K-means and neural network ensembles

Combination of HMM and Self-organizing Neural Network

Orthogonal Basis Function Counter-propagation Process Neural Network

HMM and a new feed-forward neural network

Random mapping of feature space

SVM multi-class classification algorithm

Normalization of feature parameters

Multiband spectral subtraction

Independent Perception Theory

Segmented Fuzzy Clustering Algorithm VQ-HMM

Optimized competition algorithm

Double Gaussian GMM feature parameters

MFCC and GMM

MFCCs and PNN

SBC and SMM

Mel cepstral coefficients and vector quantization

DTW

LPCC and MFCC

Hidden Markov Model (HMM)

Speech Recognition Feature Extraction Methods

Speech recognition places the following requirements on feature parameters:

1. They must convert the speech signal into feature vectors that a computer can process.

2. They should conform to, or at least approximate, the auditory perception characteristics of the human ear.

3. They should, to some extent, enhance the speech signal and suppress non-speech signals.

The commonly used feature extraction methods are as follows:

(1) Linear Prediction Coefficients (LPC)

LPC is derived from a model of human sound production that treats the vocal tract as a cascade of short tubes. Assuming the system's transfer function is that of an all-pole digital filter, 12 to 16 poles are usually enough to describe the characteristics of the speech signal. The speech sample at time n can then be approximated by a linear combination of the samples at preceding times; the LPC coefficients are obtained by minimizing the mean square error (MSE) between the actual speech samples and the linearly predicted ones.
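This procedure can be sketched with the autocorrelation method and the Levinson-Durbin recursion, which solves the MSE minimization described above; the 25 ms frame, Hamming window, and order 12 are typical choices, not requirements.

```python
import numpy as np

def lpc(frame, order=12):
    """LPC of one frame via the autocorrelation method and the
    Levinson-Durbin recursion. Returns [1, a1, ..., ap], the coefficients
    of A(z) = 1 + a1*z^-1 + ..., plus the final prediction error."""
    frame = frame * np.hamming(len(frame))
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]  # lags 0..p
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coeff.
        a[1:i] += k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

# Example: a 25 ms frame (400 samples at 16 kHz) of a synthetic
# two-formant signal with a little noise.
fs = 16000
t = np.arange(400) / fs
x = (np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)
     + 0.01 * np.random.randn(len(t)))
a, e = lpc(x, order=12)
print(np.round(a, 3))
```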

(2) Perceptual Linear Prediction (PLP)

PLP is a characteristic parameter based on an auditory model. Formally it is equivalent to LPC: a set of coefficients of an all-pole model's prediction polynomial. The difference is that PLP is grounded in human hearing: the input speech signal is passed through a model of human auditory perception before spectral analysis, replacing the raw time-domain signal that LPC uses. This favors the extraction of noise-robust speech features.

(3) Tandem and Bottleneck features

These are two types of features extracted with neural networks. Tandem features are obtained by reducing the dimensionality of the vector of class posterior probabilities at the network's output layer and concatenating it with features such as MFCC or PLP. Bottleneck features are extracted with a neural network of special structure: one hidden layer has far fewer nodes than the other hidden layers, so it is called the bottleneck layer, and its output is the bottleneck feature.
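A sketch of the bottleneck idea in PyTorch; all layer sizes and the training targets are hypothetical, chosen only to show the narrow hidden layer.

```python
import torch
import torch.nn as nn

class BottleneckNet(nn.Module):
    """Classifier whose narrow hidden layer yields bottleneck features.
    Sizes are illustrative: 440 could be 11 spliced frames of 40-dim
    Fbank; 2000 could be the number of tied-state targets."""
    def __init__(self, n_in=440, n_hidden=1024, n_bneck=40, n_out=2000):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_bneck),          # the bottleneck layer
        )
        self.back = nn.Sequential(
            nn.Sigmoid(),
            nn.Linear(n_bneck, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_out),            # class scores for training
        )

    def forward(self, x):
        z = self.front(x)          # z is the bottleneck feature vector
        return self.back(z), z

net = BottleneckNet()
scores, bneck = net(torch.randn(8, 440))   # a batch of 8 spliced frames
print(bneck.shape)                         # torch.Size([8, 40])
```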

(4) Fbank features based on filter banks (Filterbank)

Also known as MFSC, Fbank extraction is equivalent to MFCC without the final discrete cosine transform step. Compared with MFCC, Fbank features retain more of the original speech information.
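A short Fbank extraction sketch using librosa; the file name is a placeholder, and the parameters are typical values (25 ms frames, 10 ms hop, 40 mel filters), not values fixed by the article.

```python
import librosa

# "speech.wav" is a placeholder path; 16 kHz is a common sample rate.
y, sr = librosa.load("speech.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=40)
fbank = librosa.power_to_db(mel)   # log compression gives the Fbank/MFSC feature
# Applying a DCT to each column of `fbank` would yield MFCCs (see section 6).
```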

(5) Linear Predictive Cepstral Coefficients (LPCC)

LPCC are important characteristic parameters based on the vocal tract model. They discard the excitation information in the signal generation process, so a dozen or so cepstral coefficients are enough to describe the formant characteristics, and LPCC therefore achieve good performance in speech recognition.
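A sketch of the standard LPC-to-cepstrum recursion used to obtain LPCC; it assumes predictor coefficients a = [a1, ..., ap] in the convention s[n] ≈ Σ a_k·s[n−k], so flip the signs if your polynomial is A(z) = 1 + Σ a_k·z^−k.

```python
import numpy as np

def lpc_to_lpcc(a, n_ceps=16):
    """LPCC from LPC predictor coefficients a = [a1, ..., ap]."""
    p = len(a)
    c = np.zeros(n_ceps)
    for m in range(1, n_ceps + 1):
        if m <= p:
            c[m - 1] = a[m - 1] + sum(
                (k / m) * c[k - 1] * a[m - k - 1] for k in range(1, m))
        else:
            # Beyond the model order there is no a_m term, but the
            # recursion over earlier cepstral coefficients continues.
            c[m - 1] = sum(
                (k / m) * c[k - 1] * a[m - k - 1] for k in range(m - p, m))
    return c
```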

(6) Mel Frequency Cepstral Coefficients (MFCC)

MFCC are based on the hearing characteristics of the human ear: the cepstral frequency bands are divided at equal intervals on the Mel scale, and the logarithmic relationship between the Mel scale and actual frequency matches human auditory perception, so the feature represents the speech signal well. MFCC was introduced by Davis and Mermelstein in 1980 and has been a mainstay of speech recognition ever since.
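A minimal MFCC extraction sketch with librosa, which computes the log mel spectrogram and applies a discrete cosine transform, keeping the first n_mfcc coefficients; the file name is a placeholder and 13 coefficients is a conventional choice.

```python
import librosa

y, sr = librosa.load("speech.wav", sr=16000)   # placeholder path
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)
print(mfcc.shape)   # (13, number_of_frames)
```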

Q: Why are MFCCs so popular?

People produce sound through the vocal tract, and the shape of the vocal tract (tongue, teeth, and so on) determines what sound comes out. If we could determine this shape accurately, we could describe the resulting phoneme accurately. The shape of the vocal tract shows itself in the envelope of the short-time power spectrum of speech, and MFCC is a feature that accurately describes this envelope.

Spectrogram

When processing speech signals, the choice of representation matters, because different representations reveal different information; of these, the spectrogram is the most conducive to observation and understanding.

[Figure: a segment of speech divided into frames, each with its short-time spectrum]

As the figure above shows, the speech is divided into many frames, and each frame corresponds to a spectrum (computed by short-time FFT) that describes the relationship between frequency and energy. In practice three types of spectrum diagram are used: the linear amplitude spectrum, the logarithmic amplitude spectrum, and the auto-power spectrum. In the logarithmic amplitude spectrum, the amplitude of each spectral line is transformed logarithmically, so the vertical axis is in dB (decibels); this transformation raises the low-amplitude components relative to the high-amplitude ones, making it possible to observe periodic signals hidden in low-amplitude noise.

[Figure: (a) the spectrum of one frame; (b) the same spectrum rotated 90 degrees; (c) the amplitudes mapped to grayscale]

First, the spectrum of one frame of speech is plotted, as in figure (a). Rotating it 90 degrees gives figure (b), and mapping the amplitudes to grayscale gives figure (c). The point of these operations is to add a time dimension and obtain a spectrum that varies over time: the spectrogram that describes the speech signal. In this way the spectrum of a whole segment of speech, rather than of a single frame, can be displayed, and both static and dynamic information can be seen at a glance.
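In code, this construction is a short-time FFT of overlapping frames followed by log compression; here is a minimal SciPy sketch, with white noise standing in for real speech samples.

```python
import numpy as np
from scipy import signal

fs = 16000
x = np.random.randn(fs)   # stand-in for one second of speech samples
# 25 ms frames with 15 ms overlap; each column of Z is one frame's spectrum.
f, t, Z = signal.stft(x, fs=fs, nperseg=400, noverlap=240)
spectrogram_db = 20 * np.log10(np.abs(Z) + 1e-10)   # log amplitude in dB
print(spectrogram_db.shape)   # (frequency bins, time frames)
```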

[Figure: the resulting spectrogram of the speech segment]

Cepstrum Analysis

Below is the spectrum of a segment of speech. The peaks represent its dominant frequency components; we call them formants. Formants carry the identifying properties of a sound and can be used to distinguish different sounds, so we need to extract them: not only the formant positions but also how they change over time. We therefore extract the spectral envelope, the smooth curve connecting the formant peaks.

[Figure: a speech spectrum with formant peaks and its spectral envelope]

As the figure above shows, the original spectrum consists of two parts, the envelope and the spectral details, which must be separated to obtain the envelope. Decomposing as in the figure below, given log X[k] we find log H[k] and log E[k] such that log X[k] = log H[k] + log E[k].

[Figure: decomposition of the log spectrum log X[k] into the envelope log H[k] and the details log E[k]]

The figure shows that the envelope consists mainly of low-frequency components, while the high-frequency part carries the spectral details; superimposing the two reproduces the original spectrum. In other words, h[k] is the low-frequency part of x[k], so passing x[k] through a low-pass filter (in the cepstral domain, a lifter) yields h[k], the spectral envelope.

The professional term for the unwrapping process above is homomorphic signal processing (another approach is based on linear transformation). Speech itself can be regarded as the glottal excitation convolved with the impulse response of the vocal tract (which carries the speaker's individual and semantic information and appears as the low-frequency components of the spectrum); in the time domain this takes the form of a convolution. To separate the two and obtain the vocal-tract resonance characteristics and the fundamental period, this nonlinear problem must be converted into a linear one. The first step converts the convolution into a product by an FFT (convolution in the time domain corresponds to multiplication in the frequency domain); the second step converts the product into a sum by taking the logarithm; the third step applies an inverse transform to return to a time-like sequence. Although both the input and the output are discrete sequences, the domains they live in are clearly different, so the latter is called the cepstral (quefrency) domain.
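A minimal sketch of these three steps plus a low-quefrency lifter that keeps the envelope part h[k]; the Hamming window and the cutoff of 30 coefficients are illustrative choices.

```python
import numpy as np

def spectral_envelope(frame, n_keep=30):
    """Real cepstrum of one frame and the log-spectral envelope obtained
    by keeping only the low-quefrency coefficients (a low-pass lifter)."""
    spec = np.fft.rfft(frame * np.hamming(len(frame)))   # step 1: FFT
    log_mag = np.log(np.abs(spec) + 1e-10)               # step 2: logarithm
    cep = np.fft.irfft(log_mag)                          # step 3: inverse FFT
    lifter = np.zeros_like(cep)
    lifter[:n_keep] = 1.0
    lifter[-(n_keep - 1):] = 1.0    # keep the symmetric low-quefrency part
    envelope = np.fft.rfft(cep * lifter).real            # back to log spectrum
    return cep, log_mag, envelope
```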
