Speech endpoint detection locates the starting point and end point of speech in a signal that also contains background noise. Its goal is to separate the speech from the other components of the input signal (such as background noise) and to determine the speech endpoints accurately. Studies have shown that, even in a quiet environment, more than half of the recognition errors of a speech recognition system originate from endpoint detection. The importance of endpoint detection therefore cannot be ignored, especially in noisy environments, and its accuracy directly determines whether the subsequent processing can be carried out effectively [1].
Most current speech recognition systems are built around an ARM or DSP core. Such systems are expensive, lack flexibility, have long development cycles, and have difficulty meeting high-speed requirements. A variety of endpoint detection algorithms have been proposed, based for example on energy, zero-crossing rate, or LPC prediction residuals [2]. However, most of these methods are implemented in computer software and are not well suited to hardware development [3].
An FPGA offers low power consumption, small size, and high speed, and can meet the real-time requirements of a speech recognition system. This paper uses an FPGA to implement speech endpoint detection: it improves the commonly used Lawrence Rabiner endpoint detection method and realizes the detection purely in hardware. Words such as "Changsha" are used as examples to verify the accuracy and feasibility of the approach.
1 Basic Principles of Speech Endpoint Detection Using FPGA
Endpoint detection consists mainly of four parts: pre-emphasis, framing, windowing, and endpoint judgment. The FPGA implementation also follows these four steps.
1.1 Pre-emphasis
The average power spectrum of the speech signal is shaped by the glottal excitation and by radiation from the mouth and nose: above about 800 Hz the high-frequency end falls off at 6 dB per octave. As a result, the higher the frequency, the smaller the corresponding spectral components, and the high-frequency part of the spectrum is harder to recover than the low-frequency part. Before analyzing the speech signal it must therefore be pre-emphasized to flatten its short-time spectrum, which facilitates spectrum analysis and vocal tract parameter analysis. Pre-emphasis can be performed with an analog circuit or a digital circuit; this design uses the digital approach, which typically employs a first-order digital filter:
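A commonly used form of this filter, with the coefficient taken as 15/16 so that the fractional multiplication reduces to a shift, is:

H(z) = 1 - (15/16)z^(-1)   (1)

y(n) = x(n) - (15/16)x(n-1) = x(n) - x(n-1) + x(n-1)/16   (2)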
Formula (2) involves only shifts, additions, and subtractions; a simple shift replaces the fractional multiplication, so it can be implemented easily on an FPGA.
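As an illustration, the shift-based filter of formula (2) can be modeled behaviorally in C as in the sketch below; the 15/16 coefficient, the sample widths, and the function name are assumptions for the sketch, not values taken from the original design.

```c
#include <stdint.h>

/* Behavioral model of the first-order pre-emphasis of formula (2):
 * y(n) = x(n) - (15/16)x(n-1), with the 15/16 product realized as
 * prev - (prev >> 4), i.e. a shift instead of a fractional multiply.
 * 16-bit signed samples and an arithmetic right shift are assumed. */
void pre_emphasis(const int16_t *x, int32_t *y, int len)
{
    int32_t prev = 0;                      /* x(n-1); zero before the first sample */
    for (int n = 0; n < len; n++) {
        y[n] = (int32_t)x[n] - (prev - (prev >> 4));
        prev = x[n];
    }
}
```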
1.2 Framing and windowing
Framing divides the pre-emphasized speech signal into multiple segments for analysis; that is, it extracts from the original speech sequence a set of short time-dependent sequences that are convenient for describing the characteristics of the signal. The speech signal is time-varying, but its characteristics remain essentially unchanged over a sufficiently short interval, so it can be analyzed segment by segment. Assuming that the speech signal is stationary within 10 ms to 30 ms, it can be analyzed in segments of this length, where each segment is called a "frame" and the number of samples in each segment is the frame length. To keep the transition between frames continuous and smooth, framing generally uses overlapping segmentation, and the overlapping part of two adjacent frames is called the frame shift. The ratio of the frame shift to the frame length is generally taken between 0 and 1/2. To facilitate feature extraction in the speech recognition system, a power of two is chosen as the frame length. In this paper the sampling frequency of the speech signal is 16 kHz, the frame length is 256 samples (16 ms), and the frame shift is 128 samples.
FPGA implementation of framing. The key problem is handling the overlap introduced by the frame shift, and it can be solved with two FIFOs (F1 and F2). Each new block of 128 samples is written into F1. The 128 samples saved in F2 from the previous block are read out as the first half of the current frame; F1 is then read out as the second half of the frame, and its contents are written into F2 at the same time so that they can serve as the first half of the next frame, while the following 128 input samples are written into F1. Repeating this cycle completes the framing of the speech signal, as shown in the sketch below.
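A software analogue of this two-FIFO framing (a behavioral sketch with illustrative names, using the frame length of 256 and frame shift of 128 given above) could look like the following.

```c
#include <string.h>
#include <stdint.h>

#define FRAME_LEN   256   /* frame length, as in the text              */
#define FRAME_SHIFT 128   /* frame shift, i.e. 50% overlap             */

/* Software analogue of the two-FIFO scheme: fifo1 holds the newest
 * 128 samples, fifo2 holds the previous 128, and each output frame is
 * the contents of fifo2 followed by the contents of fifo1.            */
typedef struct {
    int16_t fifo1[FRAME_SHIFT];
    int16_t fifo2[FRAME_SHIFT];
    int filled;           /* set once the first 128 samples have arrived */
} framer_t;

/* Feed one block of 128 new samples; returns 1 and writes a 256-sample
 * frame to 'frame' once two blocks are available, otherwise returns 0. */
int framer_push(framer_t *f, const int16_t *block, int16_t *frame)
{
    if (!f->filled) {                                  /* very first block: fill F1 only  */
        memcpy(f->fifo1, block, sizeof(f->fifo1));
        f->filled = 1;
        return 0;
    }
    memcpy(f->fifo2, f->fifo1, sizeof(f->fifo2));      /* previous block becomes old half */
    memcpy(f->fifo1, block, sizeof(f->fifo1));         /* new block becomes new half      */
    memcpy(frame, f->fifo2, sizeof(f->fifo2));         /* frame = older 128 samples ...   */
    memcpy(frame + FRAME_SHIFT, f->fifo1, sizeof(f->fifo1)); /* ... followed by newest 128 */
    return 1;
}
```

A zero-initialized framer_t is fed 128 new samples per call and, once the first two blocks have arrived, emits one 256-sample frame per call, reproducing the 50% overlap of the hardware scheme.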
After framing, the spectral characteristics of the speech signal at the junctions between frames differ from those of the original signal. Windowing is applied to bring them closer to the original. The window functions commonly used in speech signal processing are the rectangular window and the Hamming window [5]; their expressions are as follows (where N is the frame length):
Rectangular Window:
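w(n) = 1, 0 ≤ n ≤ N-1; w(n) = 0 otherwise

Hamming Window:

w(n) = 0.54 - 0.46cos(2πn/(N-1)), 0 ≤ n ≤ N-1; w(n) = 0 otherwise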
The rectangular window has a narrow main lobe and therefore a higher frequency resolution, but its sidelobe peaks are large, so its spectral leakage is more serious. Although the main lobe of the Hamming window is twice as wide as that of the rectangular window, its sidelobe attenuation is much greater, giving it a smoother low-pass characteristic that better reflects the spectral characteristics of the short-time speech signal. This paper therefore uses the Hamming window.
FPGA implementation of windowing. Windowing multiplies the framed data by the window function. The difficulty in applying the Hamming window on an FPGA lies in the fractional cosine multiplication: computing the cosine directly with an iterative algorithm would be slow. Since N is relatively small, a table lookup method can be used instead, in which the window function values are stored in ROM and read out one by one. Here the DSP Builder tool is used to generate the window function values, because the DSP Builder tool developed by Altera has strong digital signal processing capabilities and handles the window function well. The specific steps are: open the Simulink tool in Matlab, open the Altera DSP Builder Blockset toolbox, create a new ".mdl" file, and find and connect the corresponding modules in the toolbox. Enter "0.54-0.46*cos([0:2*pi/255:2*pi])" in the "Matlab Array" field of the "hamming_table" module. After compilation and synthesis, the system automatically generates the ".hex" file used for table lookup.
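For reference, an equivalent table could also be generated offline in C and written out for ROM initialization; the Q15 scaling and the output file name in the sketch below are assumptions, since the design itself produces the ".hex" file with DSP Builder.

```c
#include <stdio.h>
#include <math.h>

#define N 256                 /* table depth = frame length */

/* Offline generation of a 256-entry Hamming window table in Q15 format,
 * one hex word per line -- a simple stand-in for the ROM initialization
 * file that DSP Builder generates in the design described above. */
int main(void)
{
    const double PI = 3.14159265358979323846;
    FILE *fp = fopen("hamming_table.txt", "w");
    if (fp == NULL)
        return 1;
    for (int n = 0; n < N; n++) {
        double w = 0.54 - 0.46 * cos(2.0 * PI * n / (N - 1)); /* Hamming window */
        unsigned q15 = (unsigned)lround(w * 32767.0);         /* scale to Q15   */
        fprintf(fp, "%04X\n", q15);
    }
    fclose(fp);
    return 0;
}
```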
1.3 Endpoint determination
Endpoint judgment is the most important part of endpoint detection and also the part with the largest amount of computation, so the choice of algorithm is very important. The algorithm used in this article is an improvement of the Lawrence Rabiner endpoint detection method, which is introduced first. This method uses the zero-crossing rate ZCR and the energy E as features to detect the start and end points. The specific method is:
The algorithm is based on an energy-based start- and end-point detector. Using 10 consecutive frames known to be "silent" before the utterance begins, the energy thresholds T1 (low energy threshold) and T2 (high energy threshold) are calculated. The energy of each of these first 10 frames is computed, with the maximum value denoted MX and the minimum value MN; the zero-crossing-rate threshold is denoted ZCT. Then:
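In the standard Rabiner–Sambur formulation (the exact constants used in this design may differ slightly), the thresholds take the form:

ZCT = min(F, ZC + 2c)

T1 = min(0.03(MX - MN) + MN, 4MN)

T2 = 5T1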
Here F is a fixed value, generally 25, and ZC and c are respectively the mean and standard deviation of the zero-crossing rate of the first 10 frames. First, an initial starting point BN (starting-point frame number) is found using T1 and T2. Starting from the 11th frame, the energy of each frame is compared in turn, and BN is set to the number of the first frame whose energy exceeds T1. If, however, the energy of a later frame drops below T1 before any frame exceeds T2, this BN is abandoned and the number of the next frame whose energy exceeds T1 becomes the new BN, and so on; the comparison stops as soon as a frame whose energy exceeds T2 is found. Once BN is determined, the frames from BN back to frame (BN-25) are searched and the zero-crossing rate of each frame is compared in turn: if more than 3 frames satisfy ZCR > ZCT, the starting point is moved to the first frame that satisfies ZCR > ZCT; otherwise BN is kept as the starting point. This starting-point detection method is also called the double-threshold front-end detection algorithm. The end point EN (end-point frame number) is detected in the same way as the starting point: searching from back to front, the first frame whose energy is lower than T1, and for which the energy does not drop below T1 again before exceeding T2 as the search continues, is recorded as EN. The frames from EN to frame (EN+25) are then searched by zero-crossing rate, and if more than 3 frames satisfy ZCR ≥ ZCT, the end point is moved to the last frame that satisfies ZCR ≥ ZCT; otherwise EN is kept as the end point.
This algorithm is complex and slow to implement in hardware, so it needs to be improved. In the improved algorithm, the high threshold is used to determine the beginning of speech and the low threshold is used to determine its end. Exceeding the high threshold does not necessarily mean that speech has begun: the noise energy may occasionally be large enough to exceed the high threshold, but noise generally lasts only a short time, so the duration for which the energy stays above the high threshold is used to distinguish noise from speech. Once the beginning of speech has been determined by the high threshold, the low threshold is used to find the end point. Likewise, falling below the low threshold does not necessarily mean that speech has ended: the energy of the speech signal may briefly drop below the low threshold, but it cannot stay there for long, so the time spent below the low threshold is used to determine the end point. In this way the detection of the start and end points no longer requires the zero-crossing-rate judgment or the calculation of the mean and standard deviation of the zero-crossing rate of the first 10 frames. The choice of thresholds therefore has a great influence on the detection result. The thresholds in this design are based on the Lawrence Rabiner endpoint detection method and were obtained through a large number of experiments; they are given by formulas (10) and (11), where AE is the average energy of the first 14 frames, T1 is the low threshold, and T2 is the high threshold.
T1 = 1.5AE   (10)

T2 = 2T1   (11)
In FPGA design, the finite state machine (FSM) is one of the most widely used design methods; FSMs and their design techniques are an important part of practical digital system design and an important means of achieving efficient and reliable logic control. With the improved algorithm, the entire endpoint judgment process can be divided into three states and the design completed with a state machine. The state transition diagram is shown in Figure 1. S0, S1, and S2 are the three states; E is the frame energy; T1 and T2 are the low and high thresholds respectively; C1 is the number of frames in state S1 for which T2 > E ≥ T1; C2 is the number of frames in state S1 for which E ≥ T2; and C3 is the number of frames in state S2 for which E < T1.
The specific judgment process is as follows: (1) In the S0 state, E
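One plausible C behavioral model of this three-state judgment, built only from the state and counter definitions above, is sketched below; the frame-count limits MIN_SPEECH_FRAMES, MAX_S1_FRAMES, and MAX_PAUSE_FRAMES are assumed parameters, not values from the original design.

```c
#include <stdint.h>

#define MIN_SPEECH_FRAMES 4   /* assumed: frames with E >= T2 needed to confirm speech    */
#define MAX_S1_FRAMES     8   /* assumed: frames allowed in S1 before reverting to S0     */
#define MAX_PAUSE_FRAMES  10  /* assumed: frames with E < T1 that mark the end of speech  */

enum state { S0_SILENCE, S1_MAYBE_SPEECH, S2_IN_SPEECH };

typedef struct {
    enum state st;            /* current state; initialize to S0_SILENCE               */
    int c1, c2, c3;           /* counters C1, C2, C3 as defined in the text            */
    int cand_start;           /* candidate starting frame recorded on entering S1      */
    int start, end;           /* detected endpoints (frame numbers); initialize to -1  */
} ep_t;

/* Process the energy E of frame 'num'; T1 and T2 are the low and high thresholds. */
void ep_step(ep_t *d, int num, uint32_t E, uint32_t T1, uint32_t T2)
{
    switch (d->st) {
    case S0_SILENCE:
        if (E >= T1) {                        /* energy rises: possible speech onset     */
            d->st = S1_MAYBE_SPEECH;
            d->cand_start = num;
            d->c2 = (E >= T2) ? 1 : 0;
            d->c1 = 1 - d->c2;
        }
        break;
    case S1_MAYBE_SPEECH:
        if (E >= T2)      d->c2++;            /* frame above the high threshold          */
        else if (E >= T1) d->c1++;            /* frame between the two thresholds        */
        if (d->c2 >= MIN_SPEECH_FRAMES) {     /* high energy lasted long enough: speech  */
            d->st = S2_IN_SPEECH;
            d->start = d->cand_start;
            d->c3 = 0;
        } else if (E < T1 || d->c1 + d->c2 > MAX_S1_FRAMES) {
            d->st = S0_SILENCE;               /* short or weak burst: treat it as noise  */
        }
        break;
    case S2_IN_SPEECH:
        if (E < T1) {
            d->c3++;                          /* count consecutive low-energy frames     */
            if (d->c3 >= MAX_PAUSE_FRAMES) {  /* low energy lasted long enough: finished */
                d->end = num - d->c3;
                d->st = S0_SILENCE;
            }
        } else {
            d->c3 = 0;                        /* energy returned: not the end after all  */
        }
        break;
    }
}
```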
2 Experimental Results
The sound samples used in the experiment were ".wav" files recorded with a computer sound card (16 kHz, 8 bit), and experiments were carried out on common words. Figure 2 shows the endpoint detection simulation result for the word "Changsha" in Matlab, where the horizontal axis is the frame number and the vertical axis is the frame energy; the speech segments of the two syllables lie in frames 64 to 82 and 95 to 120 respectively. Figure 3 shows the simulation result for "Changsha" in Quartus II, where num is the number of the current frame, start is the frame number at which speech starts, and end is the frame number at which it ends. Comparing Figures 2 and 3 shows that the endpoint detection results obtained in Quartus II and in Matlab are consistent, and that the improved endpoint detection method achieves a very good detection effect.
3 Conclusion
This paper makes good use of the DSP Builder tool in the windowing stage, simplifying the hardware design and speeding up processing; it is an FPGA windowing method worth drawing on. In the endpoint judgment algorithm, the Lawrence Rabiner endpoint detection method is improved in the calculation of the thresholds and in the judgment of the start and end points, and the FPGA design is implemented with a finite state machine. Experiments show that the algorithm can accurately find the start and end points of the speech signal even under low signal-to-noise-ratio conditions. Compared with some other endpoint detection methods, this algorithm is simpler and more stable and requires less storage space; it is a practical hardware endpoint detection method and provides a useful reference for the development and design of speech recognition systems.