FPGA-based voice endpoint detection-EEWORLD


Speech endpoint detection locates the start and end points of speech within an input signal that also contains background noise. Its goal is to separate the speech signal from other signals (such as background noise) and to determine the endpoints of the speech accurately. Studies have shown that even in a quiet environment, more than half of the recognition errors of speech recognition systems come from endpoint detection. The importance of endpoint detection therefore cannot be ignored, especially in noisy environments, because its accuracy directly affects whether subsequent processing can be carried out effectively [1].

Most current speech recognition systems are built around ARM processors or DSPs. They are expensive, inflexible, slow to develop, and struggle to meet high-speed system requirements. Research on speech endpoint detection has produced a variety of algorithms, including methods based on energy, zero-crossing rate, and LPC prediction residuals [2]. However, most of these methods are designed for computer software and are not well suited to hardware implementation [3].

FPGAs offer low power consumption, small size, and high speed, and can meet the real-time requirements of speech recognition systems. This paper implements speech endpoint detection on an FPGA: it improves the commonly used Lawrence Rabiner endpoint detection method and realizes the detection entirely in hardware. Words and phrases such as "Changsha" are used as examples to verify the accuracy and feasibility of the approach.

1 Basic Principles of Voice Endpoint Detection Using FPGA

Speech endpoint detection consists mainly of four steps: pre-emphasis, framing, windowing, and endpoint judgment. The FPGA implementation follows the same four steps.

1.1 Pre-emphasis

The average power spectrum of the speech signal is affected by glottal excitation and by oral and nasal radiation, so above about 800 Hz the high-frequency end falls off at 6 dB/oct (octave). As a result, the higher the frequency, the smaller the corresponding component in the speech spectrum, and the spectrum of the high-frequency part is harder to obtain than that of the low-frequency part. Before analysis, the high-frequency part of the speech signal must therefore be boosted so that the short-time spectrum becomes flatter, which facilitates spectrum analysis and vocal tract parameter analysis. The boost can be implemented with an analog circuit or with a digital circuit; this design uses the digital method. The usual digital method is a first-order digital filter:

[Figure: formulas (1) and (2) for the first-order pre-emphasis filter]

Formula (2) contains only shift, addition, and subtraction operations; that is, a simple shift replaces the costly fractional multiplication, so the filter is easy to implement on an FPGA.
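The shift-and-add idea can be illustrated with a minimal Python sketch of a first-order pre-emphasis filter, y(n) = x(n) - a·x(n-1). The coefficient is not stated in the text; the value a = 15/16 (that is, 1 - 2^-4) is an assumption, chosen here because it lies in the usual 0.9 to 1.0 range and reduces the multiplication to a 4-bit right shift. The function name is hypothetical.

```python
def pre_emphasis_shift(samples, shift=4):
    """First-order pre-emphasis using only integer shift, add, and subtract.

    Computes y[n] = x[n] - (1 - 2**-shift) * x[n-1], rewritten as
    y[n] = x[n] - x[n-1] + (x[n-1] >> shift).
    shift=4 (coefficient 15/16) is an assumed value, not taken from the article.
    """
    out = []
    prev = 0  # x[-1] is taken as 0
    for x in samples:
        # Python's >> on negative ints is an arithmetic (flooring) shift,
        # which mirrors a signed arithmetic shift in hardware.
        out.append(x - prev + (prev >> shift))
        prev = x
    return out


if __name__ == "__main__":
    print(pre_emphasis_shift([16, 32, 32, 0]))
```

A constant input of 160 settles at 160 - 160 + (160 >> 4) = 10, which shows how strongly the low-frequency content is attenuated relative to rapid changes.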

1.2 Framing and windowing

Framing divides the pre-emphasized speech signal into multiple segments for analysis; that is, it derives a new time-dependent sequence from the original speech sequence, which makes the characteristics of the speech signal easier to describe. The speech signal is time-varying, but its characteristics remain essentially unchanged over a sufficiently short interval, so it can be analyzed segment by segment. Assuming that the speech signal is stationary within 10 ms to 30 ms, it can be analyzed in segments of that duration; each segment is called a "frame", and the number of samples in a frame is called the frame length. To keep the transition between frames continuous and smooth, framing generally uses overlapping segments, and the offset between the start of one frame and the start of the next is called the frame shift. The ratio of the frame shift to the frame length is generally taken as 0 to 1/2. To simplify feature extraction in the speech recognition system, the frame length is chosen as a power of two, 2^n. In this paper the sampling frequency of the speech signal is 16 kHz, the frame length is 256 samples (16 ms), and the frame shift is 128 samples, so adjacent frames overlap by half a frame.

FPGA implementation of framing. The key problem is the overlap created by the frame shift. It can be handled with two FIFOs (F1 and F2). The procedure is: first write 128 samples to F1; read the samples in F1 to obtain the first 128 samples of the current frame, and at the same time write the samples read from F1 into F2; when F1 has been read out, F2 has been filled. Then read the samples in F2 to obtain the last 128 samples of the frame (a complete frame of the speech signal is now available), and write the data of the next frame into F1 while F2 is being read. Repeating this cycle completes the framing of the speech signal.
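The following behavioral sketch in Python models half-overlapped framing with two 128-deep queues. It is one plausible reading of the two-FIFO scheme, not the article's circuit: here F2 is assumed to hold the 128 samples shared with the next frame, and each new block is copied into F2 as it is read out of F1. The function name and the use of `collections.deque` as a FIFO model are illustrative assumptions.

```python
from collections import deque

HALF = 128  # frame shift; frame length is 2 * HALF = 256


def frames_two_fifo(samples, half=HALF):
    """Split samples into frames of 2*half samples overlapping by half."""
    f1 = deque()  # receives each incoming block of `half` samples
    f2 = deque()  # holds the previous block (the overlap)
    frames = []
    for start in range(0, len(samples) - half + 1, half):
        f1.extend(samples[start:start + half])
        if len(f2) < half:
            # Priming: the very first block only fills the overlap FIFO.
            while f1:
                f2.append(f1.popleft())
            continue
        # First half of the frame: the overlap stored in F2.
        first = [f2.popleft() for _ in range(half)]
        # Second half: drain F1 and copy each sample back into F2
        # so it becomes the overlap for the next frame.
        second = []
        while f1:
            s = f1.popleft()
            second.append(s)
            f2.append(s)
        frames.append(first + second)
    return frames


if __name__ == "__main__":
    out = frames_two_fifo(list(range(512)))
    print(len(out), out[1][0], out[1][-1])
```

With 512 input samples the sketch yields three frames starting at samples 0, 128, and 256, which matches a frame length of 256 and a frame shift of 128.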

After framing, the spectral characteristics of the speech signal at the frame boundaries differ from those of the original signal. Windowing is applied to bring the spectral characteristics at the frame boundaries closer to the original. The window functions commonly used in speech signal processing are the rectangular window and the Hamming window [5]. Their expressions are as follows (where N is the frame length):

Rectangular window:

w(n) = 1 for 0 ≤ n ≤ N-1, and w(n) = 0 otherwise

Hamming window:

w(n) = 0.54 - 0.46 cos(2πn/(N-1)) for 0 ≤ n ≤ N-1, and w(n) = 0 otherwise

The main lobe of the rectangular window is narrow, so it has higher frequency resolution; but its side-lobe peaks are high, so its spectral leakage is more severe. By comparison, although the main lobe of the Hamming window is twice as wide as that of the rectangular window, its side lobes are attenuated much more strongly, so it has a smoother low-pass characteristic and reflects the spectral characteristics of short-time speech signals more faithfully. This paper therefore uses the Hamming window.

FPGA implementation of windowing. Windowing multiplies the framed data by the window function. The difficulty of applying a Hamming window on an FPGA is the fractional cosine multiplication: computing the cosine on the fly would be slow. Because N is relatively small, a lookup table can be used instead: the values of the window function are stored in ROM and read out one by one. Here the window values are generated with DSP Builder, a tool from Altera with strong digital signal processing capabilities that handles the window function calculation well. The steps are: open Simulink in Matlab and open the Altera DSP Builder Blockset toolbox, create a new ".mdl" file, then find the required modules in the toolbox and connect them. Enter "0.54-0.46*cos([0:2*pi/255:2*pi])" in the "Matlab Array" field of the "hamming_table" module. After compilation and synthesis, the system automatically generates the ".hex" file used by the lookup table.
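As a rough software counterpart to the ROM lookup, the sketch below builds a 256-entry Hamming table with the standard coefficients 0.54 and 0.46 and applies it with an integer multiply and shift. The Q15 scaling (15 fractional bits) and the function names are assumptions for illustration; the article does not state the word width of its ROM.

```python
import math

N = 256          # frame length used in the article
FRAC_BITS = 15   # assumed fixed-point scaling (Q15), not from the article


def hamming_rom(n_points=N, frac_bits=FRAC_BITS):
    """Return the Hamming window as integers scaled by 2**frac_bits."""
    scale = 1 << frac_bits
    return [
        round((0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))) * scale)
        for n in range(n_points)
    ]


def apply_window(frame, rom, frac_bits=FRAC_BITS):
    """Multiply each sample by its table entry, then drop the fraction bits."""
    return [(x * w) >> frac_bits for x, w in zip(frame, rom)]


if __name__ == "__main__":
    rom = hamming_rom()
    print(rom[0], max(rom), apply_window([1000] * N, rom)[0])
```

The table starts and ends near 0.08 in fixed point and peaks just below 1.0 in the middle of the frame, so samples at the frame edges are attenuated the most.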

1.3 Endpoint determination

Endpoint judgment is the most important part of endpoint detection and also the most computationally expensive, so the choice of algorithm matters. The algorithm used in this article is an improvement on the Lawrence Rabiner endpoint detection method, which is introduced first. That method uses the zero-crossing rate ZCR and the energy E as features to detect the start and end points. The specific method is:

The algorithm builds on energy-based start and end point detection. From 10 consecutive frames known to be silent before the utterance begins, the energy thresholds T1 (low energy threshold) and T2 (high energy threshold) are calculated. The energy of each of the first 10 frames is computed; the maximum value is denoted MX, the minimum value MN, and the zero-crossing rate threshold ZCT, then:

[Figure: formulas for the energy thresholds T1 and T2 and the zero-crossing rate threshold ZCT]

Here, F is a fixed value, generally 25, and ZC and σ are the mean and standard deviation of the zero-crossing rate of the first 10 frames. First, the initial start point BN (start point frame number) is determined from T1 and T2. Starting from the 11th frame, the energy of each frame is compared in turn, and BN is the frame number of the first frame whose energy exceeds T1. However, if the energy of a subsequent frame drops below T1 before exceeding T2, that BN is discarded, the frame number of the next frame whose energy exceeds T1 is recorded as BN, and so on. The comparison stops at the first frame whose energy exceeds T2. Once BN is determined, the frames from BN back to (BN-25) are searched and the zero-crossing rate of each frame is compared in turn. If more than 3 frames have ZCR>ZCT, the start point is set to the frame number of the first frame that satisfies ZCR>ZCT; otherwise BN is used as the start point. This start point detection method is also called the double-threshold front-end detection algorithm.

The end point EN (end point frame number) is detected in the same way as the start point, but searching from back to front: find the first frame at which the energy crosses T1 and, continuing toward the front, does not drop below T1 again before exceeding T2, and record its frame number as EN. Then examine the zero-crossing rate of the frames from EN to (EN+25). If more than 3 frames have ZCR≥ZCT, the end point is set to the frame number of the last frame that satisfies ZCR≥ZCT; otherwise EN is used as the end point.
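The start-point search can be sketched in Python as follows, taking T1, T2, and ZCT as given inputs rather than computing them. Frame indices are 0-based here, so the 11th frame is index 10. Reading "the first frame that satisfies ZCR>ZCT" as the earliest such frame in time is an assumption, and both function names are hypothetical.

```python
def initial_start(energy, t1, t2, first_frame=10):
    """Double-threshold search for the initial start frame BN.

    Returns the index of the frame where the energy rose above t1 and
    stayed above it until exceeding t2, or None if no such frame exists.
    """
    bn = None
    for k in range(first_frame, len(energy)):
        e = energy[k]
        if e > t1:
            if bn is None:
                bn = k          # candidate start point
            if e > t2:
                return bn       # confirmed by the high threshold
        else:
            bn = None           # dropped below t1 first: discard candidate
    return None


def refine_start(zcr, bn, zct, lookback=25):
    """Move BN earlier if more than 3 of the preceding frames have ZCR > ZCT."""
    lo = max(0, bn - lookback)
    hits = [k for k in range(lo, bn + 1) if zcr[k] > zct]
    return hits[0] if len(hits) > 3 else bn


if __name__ == "__main__":
    energy = [0] * 10 + [1, 3, 1, 0, 3, 4, 9, 2]
    print(initial_start(energy, t1=2, t2=8))
```

In the example, the candidate at index 11 is discarded because the energy falls below T1 before reaching T2, and the search settles on index 14.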

This algorithm is complex and slow to implement in hardware, so it needs to be improved. The improved algorithm uses the high threshold to determine the beginning of speech and the low threshold to determine its end. Exceeding the high threshold does not necessarily mean that speech has begun: noise energy can sometimes be large enough to exceed the high threshold, but noise generally lasts only a short time, so the length of time the energy stays above the high threshold can be used to distinguish noise from speech. Once the high threshold has established the beginning of speech, the low threshold is used to find the end point. Falling below the low threshold does not necessarily mean that speech has ended: the energy of the speech signal can briefly drop below the low threshold, but it cannot stay there for long, so the length of time the energy stays below the low threshold can be used to determine the end point.

Checking the start and end points in this way eliminates the zero-crossing rate test and the calculation of the mean and standard deviation of the zero-crossing rate of the first 10 frames. As a result, the choice of thresholds has a great impact on the detection of speech endpoints. The thresholds in this design are based on the Lawrence Rabiner endpoint detection method and were obtained through a large number of experiments. They are calculated with formulas (10) and (11), where AE is the average energy of the first 14 frames, T1 is the low threshold, and T2 is the high threshold.

T1 = 1.5 AE    (10)

T2 = 2 T1    (11)
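Formulas (10) and (11) map naturally onto shifts and adds, as the short Python sketch below suggests. Taking the frame energy as the sum of squared samples and using integer division for the average are assumptions made here; the article does not spell out either detail, and the function names are hypothetical.

```python
def frame_energy(frame):
    """Short-time energy of one frame (assumed: sum of squared samples)."""
    return sum(x * x for x in frame)


def thresholds(energies, n_init=14):
    """Compute T1 = 1.5 * AE and T2 = 2 * T1 from the first n_init frames."""
    ae = sum(energies[:n_init]) // n_init  # average energy AE
    t1 = ae + (ae >> 1)                    # 1.5 * AE as add plus shift
    t2 = t1 << 1                           # 2 * T1 as a left shift
    return t1, t2


if __name__ == "__main__":
    print(thresholds([100] * 14 + [5000]))
```

Since T2 = 2 T1 = 3 AE, both thresholds scale directly with the background level measured over the first 14 frames.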

In FPGA design, the state machine is one of the most widely used design methods. The FSM (finite state machine) and its design techniques are an important part of practical digital system design and an important way to achieve efficient, highly reliable control logic. The improved algorithm divides the endpoint judgment process into three states, so a state machine can be used for the FPGA design. The state transition diagram is shown in Figure 1. S0, S1, and S2 are the three states; E is the frame energy; T1 and T2 are the low and high thresholds respectively; C1 is the number of frames in state S1 for which T1 ≤ E < T2; C2 is the number of frames in state S1 for which E ≥ T2; C3 is the number of frames in state S2 for which E < T1.

[Figure 1: State transition diagram of the endpoint judgment]
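A three-state detector of this kind might be sketched in Python as below. The exact transition conditions and frame-count limits are not available in the text, so the rules used here (enter S1 when E ≥ T1, confirm speech after `min_high` frames with E ≥ T2, give up after `max_mid` frames between the thresholds, end speech after `max_low` consecutive frames below T1) and all three limit values are assumptions for illustration only, not the article's design.

```python
S0, S1, S2 = 0, 1, 2  # silence, candidate speech, confirmed speech


def detect_endpoints(energy, t1, t2, min_high=3, max_mid=10, max_low=5):
    """Return a list of (start_frame, end_frame) pairs (0-based, inclusive)."""
    state = S0
    c1 = c2 = c3 = 0
    start = None
    segments = []
    for k, e in enumerate(energy):
        if state == S0:
            if e < t1:
                continue
            # Energy reached the low threshold: open a candidate segment
            # and evaluate this same frame under the S1 rules below.
            state, start, c1, c2 = S1, k, 0, 0
        if state == S1:
            if e < t1:
                state = S0                 # too short: treated as noise
            elif e >= t2:
                c2 += 1
                if c2 >= min_high:         # stayed loud long enough
                    state, c3 = S2, 0
            else:
                c1 += 1
                if c1 > max_mid:           # lingered between thresholds
                    state = S0
            continue
        # state == S2: speech confirmed, wait for sustained low energy
        if e < t1:
            c3 += 1
            if c3 >= max_low:
                segments.append((start, k - c3))
                state = S0
        else:
            c3 = 0                         # brief dip only: keep going
    if state == S2:                        # input ended during speech
        segments.append((start, len(energy) - 1 - c3))
    return segments


if __name__ == "__main__":
    e = ([100] * 5 + [400] * 2 + [100] * 5
         + [200, 400, 400, 400, 350, 100, 320, 310] + [100] * 6)
    print(detect_endpoints(e, t1=150, t2=300))
```

In the example, the two loud frames at indices 5 and 6 are rejected as a short noise burst, while the later run is accepted and the one-frame dip at index 17 does not end the segment.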

The specific judgment process is as follows: (1) In the S0 state, E

2 Experimental Results

The sound samples in the experiment are "wav" files recorded through a computer sound card (16 kHz, 8 bit), and common words were tested. Figure 2 shows the endpoint detection simulation result for the word "Changsha" in Matlab, where the horizontal axis is the frame number and the vertical axis is the frame energy. The speech segments of the two syllables span frames 64 to 82 and frames 95 to 120 respectively. Figure 3 shows the simulation result for the word "Changsha" in Quartus II, where num is the frame number of each frame, start is the frame number at which speech starts, and end is the frame number at which speech ends. Figures 2 and 3 show that the endpoint detection simulation results for "Changsha" in Quartus II and Matlab are consistent, and that the improved endpoint detection method detects the endpoints well.

[Figures 2 and 3: Endpoint detection simulation results for "Changsha" in Matlab and Quartus II]

This paper makes appropriate use of the DSP Builder tool for windowing, which simplifies the hardware design and speeds up processing; it is an FPGA windowing method worth adopting. For endpoint judgment, the improved Lawrence Rabiner endpoint detection method changes the calculation of the thresholds and the judgment of the start and end points, and the FPGA design is implemented as a finite state machine. Experiments show that the algorithm can accurately find the start and end points of the speech signal under low signal-to-noise ratio conditions. Compared with some other endpoint detection methods, this algorithm is simpler, more stable, and requires less storage. It is well suited to hardware endpoint detection and offers some reference value for the development and design of speech recognition systems.

Keywords: FPGA
