Introduction
With the rapid development of electronic music, there is an urgent need for smarter and more convenient means of user operation. An automatic music speech recognition system can provide convenient human-computer interaction [1], making it easier for people to learn music on their own, and is therefore likely to become a major method and direction of development. At present, automatic speech recognition systems achieve good results in laboratory environments, but few have been applied to electronic music. When automatic speech recognition is applied to electronic music, the recognition method must be adapted to meet the requirements on computing speed, memory resources, and so on. To solve this problem, this paper designs and implements an embedded music speech recognition system based on the characteristics of music speech.
1. System Hardware Circuit Design
The block diagram of the hardware circuit design is shown in Figure 1. It mainly consists of a music voice acquisition part, a DSP part for music voice processing, a FLASH part for program and data storage, an SRAM part for data storage, a keyboard management part, a sound-source chip for voice output, and a power supply. Music voice acquisition is handled mainly by the GPL162001 MCU; the chip has a 12-bit ADC and 72 I/O ports, which also makes keyboard management convenient. The music voice processing part uses TI's widely used TMS320VC5402, a 16-bit DSP with low power consumption and a peak performance of 100 MIPS, making it an ideal processor for this task. Given the speed requirement, a 100 MHz crystal oscillator is used for the DSP. In addition, since the music output requires professional sound quality, the circuit uses the 64-polyphony MIDI audio processing chip provided by SMIC Micro. Because the TMS320VC5402 has no on-chip FLASH and only 16K of RAM, and voice data is large, the design adds an external 1M FLASH chip and a 64K SRAM chip. The DSP (TMS320VC5402) is the signal processing center of the entire hardware system: it performs the music voice recognition, manages and schedules data in the RAM and FLASH storage chips, and feeds results back to the main control MCU. The system operates from a 3.3 V supply.
Figure 1 System schematic diagram
2. Software Implementation of the System
Like most speech recognition systems, the music speech recognition system is essentially a pattern recognition system. Its basic flow chart is shown in Figure 2; it mainly comprises speech signal preprocessing, endpoint detection, feature parameter extraction, and speech recognition.
Figure 2 System identification algorithm flow chart
2.1 Speech Signal Preprocessing
Speech signal preprocessing performs a preliminary optimization of the speech signal to facilitate subsequent endpoint detection and speech recognition. It mainly includes framing, pre-emphasis, windowing, filtering, and burr (glitch) elimination.
2.1.1 Speech Signal Framing
The characteristics of speech signals change over time; only within a short interval, usually 5 to 50 ms, can a speech signal be treated as approximately stationary. In this program, each frame contains 200 sampling points, which corresponds to 25 ms at the 8 kHz sampling rate, and adjacent frames overlap by 100 sampling points (12.5 ms), as sketched below.
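The following is a minimal framing sketch in C under the parameters above (8 kHz sampling, 200-sample frames, 100-sample frame shift); the function name, buffer types, and error convention are illustrative rather than taken from the paper's implementation.

```c
#include <stddef.h>

#define FRAME_LEN   200   /* 25 ms at 8 kHz                    */
#define FRAME_SHIFT 100   /* adjacent frames overlap by 100     */

/* Copy frame number `idx` of `signal` into `frame`.
 * Returns 0 on success, -1 when the frame would run past the end. */
int get_frame(const short *signal, size_t n_samples,
              size_t idx, short *frame)
{
    size_t start = idx * FRAME_SHIFT;
    if (start + FRAME_LEN > n_samples)
        return -1;
    for (size_t i = 0; i < FRAME_LEN; i++)
        frame[i] = signal[start + i];
    return 0;
}
```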
2.1.2 Pre-emphasis
Since the average power spectrum of the speech signal is shaped by glottal excitation and by oral and nasal radiation, the high-frequency components above about 800 Hz fall off at 6 dB/octave. Therefore, the higher the frequency, the weaker the corresponding spectral components when the speech spectrum is computed, and the high-frequency part of the spectrum is harder to obtain than the low-frequency part, so pre-emphasis is required. In digital speech signal processing, the signal is usually passed through a low-order system, typically the first-order filter $H(z) = 1 - \mu z^{-1}$, i.e., $y(n) = x(n) - \mu x(n-1)$ in the time domain, where $\mu$ is the pre-emphasis coefficient, most commonly set to around 0.95. This system uses $\mu = 0.94$.
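A minimal sketch of this filter in C, using floating-point samples for clarity; on the TMS320VC5402 a fixed-point (Q15) version would be used instead, and the function name is illustrative.

```c
#include <stddef.h>

/* Pre-emphasis: y(n) = x(n) - mu * x(n-1); this system uses mu = 0.94. */
void pre_emphasis(const float *x, float *y, size_t n, float mu)
{
    if (n == 0)
        return;
    y[0] = x[0];                      /* no previous sample available */
    for (size_t i = 1; i < n; i++)
        y[i] = x[i] - mu * x[i - 1];  /* first-order high-pass        */
}
```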
2.1.3 Windowing
Windowing each frame of speech amounts to multiplying the speech waveform by a window function. To reduce the slope at both ends of the time window, let the window edges taper smoothly to zero, and reduce the truncation effect on each speech frame, the typical choice in speech recognition systems, adopted here, is the Hamming window.
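A minimal windowing sketch in C using the standard Hamming window $w(n) = 0.54 - 0.46\cos(2\pi n/(N-1))$; in a real embedded implementation the window table would be precomputed once and stored, which is omitted here for brevity.

```c
#include <math.h>

#define PI        3.14159265358979f
#define FRAME_LEN 200

/* Multiply one frame in place by the Hamming window. */
void hamming_window(float frame[FRAME_LEN])
{
    for (int n = 0; n < FRAME_LEN; n++) {
        float w = 0.54f - 0.46f * cosf(2.0f * PI * (float)n
                                       / (float)(FRAME_LEN - 1));
        frame[n] *= w;
    }
}
```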
2.1.4 Filtering and Burr Elimination
Speech signals contain considerable noise, which appears in the time domain as high-frequency random components and burrs (glitches) that can degrade recognition. Bandpass filtering and burr elimination therefore greatly improve recognition accuracy. Since the musical range of the human voice lies mainly within 60-1000 Hz, a 50-1000 Hz FIR bandpass filter is applied to the original signal, with good results. Burr elimination is based on peak and valley detection of the speech signal: an insignificant valley between two adjacent peaks, or an insignificant peak between two adjacent valleys, is removed, and smaller burrs in the speech curve are smoothed by curve shaping. A simplified sketch of this step follows.
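A simplified burr-removal sketch in C along the lines described above: any local extremum that barely differs from both neighbours is treated as a burr and smoothed away. The threshold value is illustrative, since the paper gives no concrete figure.

```c
#include <math.h>
#include <stddef.h>

#define GLITCH_THRESH 0.05f   /* illustrative; not specified in the paper */

/* Any local extremum that differs from both neighbours by less than
 * GLITCH_THRESH is treated as a burr and replaced by the average of
 * its neighbours (a one-point curve-shaping step). */
void remove_burrs(float *curve, size_t n)
{
    for (size_t i = 1; i + 1 < n; i++) {
        float dl = curve[i] - curve[i - 1];
        float dr = curve[i] - curve[i + 1];
        int is_peak   = (dl > 0.0f) && (dr > 0.0f);
        int is_valley = (dl < 0.0f) && (dr < 0.0f);
        if ((is_peak || is_valley) &&
            fabsf(dl) < GLITCH_THRESH && fabsf(dr) < GLITCH_THRESH)
            curve[i] = 0.5f * (curve[i - 1] + curve[i + 1]);
    }
}
```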
2.2 Endpoint Detection
Endpoint detection is a key difficulty in speech recognition; its quality directly affects the subsequent extraction of feature parameters and the recognition result. Its purpose is to detect the speaker's voice command in noisy speech and find the start and end points of each speech segment. This system performs endpoint detection using the energy curve of the speech signal combined with the zero-crossing rate [5]; the whole process is shown in Figure 3. Since the musical range of the human voice is 50-1000 Hz, the original speech signal is first filtered into six frequency bands, yielding six energy curves: E(1) covers the 50-1000 Hz band, E(2) 100-1000 Hz, E(3) 200-1000 Hz, E(4) 400-1000 Hz, E(5) 600-1000 Hz, and E(6) 800-1000 Hz. Segmentation is based on peak and valley detection: changes in the peaks and valleys of an energy curve are used to segment the speech, and the start and end of each segment are taken as the desired endpoints. However, because speech signals vary in complex ways, especially when utterances are closely connected, segmentation based on a single energy curve may fail, so this system adopts an improved energy-curve segmentation algorithm. Analysis of the energy curves shows that different frequency bands reflect different characteristics: the curves obtained by filtering in different bands expose different endpoint information, and some speech signals are segmented well only in the higher-frequency bands. The improved algorithm therefore combines all six band-filtered energy curves in the decision: E(1) is given a weight of 1, while for E(2) through E(6) more than two curves must agree before a point is accepted as an endpoint. In addition, every candidate segmentation point from all energy curves is checked against the zero-crossing-rate threshold. The goal of the improved algorithm is to segment the speech signal as accurately as possible while ensuring that no missegmentation occurs. The two per-frame measures involved are sketched after Figure 3.
Figure 3 Endpoint detection flow chart
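A minimal sketch in C of the two per-frame measures the detector relies on: short-time energy (applied to each band-filtered signal to build E(1) through E(6)) and the zero-crossing count used to confirm candidate endpoints. Names and types are illustrative.

```c
#include <stddef.h>

/* Short-time energy of one (band-filtered) frame. Applying this to
 * each of the six band-limited signals yields E(1)..E(6). */
float frame_energy(const float *frame, size_t n)
{
    float e = 0.0f;
    for (size_t i = 0; i < n; i++)
        e += frame[i] * frame[i];
    return e;
}

/* Zero-crossing count of one frame, checked against a threshold to
 * confirm candidate segmentation points. */
size_t zero_crossings(const float *frame, size_t n)
{
    size_t zc = 0;
    for (size_t i = 1; i < n; i++)
        if ((frame[i - 1] >= 0.0f) != (frame[i] >= 0.0f))
            zc++;
    return zc;
}
```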
2.3 Speech Feature Parameter Extraction
Many kinds of feature parameters can be extracted for speech recognition. Because of noise, the music speech recognition system places high demands on recognition accuracy, so this system uses the classic Mel-frequency cepstral coefficients (MFCC) [4]. MFCC parameters are based on Fourier spectrum analysis. The core idea is to exploit the perceptual characteristics of the human ear by placing a number of bandpass filters across the speech spectrum, each with a triangular or sinusoidal response; the signal energy in each filter of the bank is computed, and the corresponding cepstral coefficients are then obtained through a DCT. A sketch of this filterbank-and-DCT stage follows Figure 4.
Figure 4 MFCC parameter acquisition process
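A compact sketch in C of the filterbank-and-DCT stage, assuming the FFT power spectrum of a frame is already available and a triangular mel filterbank has been precomputed; all sizes and names are illustrative, not the paper's actual implementation.

```c
#include <math.h>

#define PI     3.14159265358979f
#define N_FFT  256
#define N_BINS (N_FFT / 2 + 1)
#define N_FILT 24              /* number of triangular mel filters */
#define N_CEPS 12              /* cepstral coefficients kept       */

/* Compute MFCCs from the power spectrum `power[]` of one frame,
 * given a precomputed triangular mel filterbank `fbank`. */
void mfcc_from_power(const float power[N_BINS],
                     const float fbank[N_FILT][N_BINS],
                     float ceps[N_CEPS])
{
    float loge[N_FILT];

    /* 1. energy through each triangular filter, log-compressed */
    for (int m = 0; m < N_FILT; m++) {
        float e = 1e-10f;                 /* floor avoids log(0) */
        for (int k = 0; k < N_BINS; k++)
            e += fbank[m][k] * power[k];
        loge[m] = logf(e);
    }

    /* 2. DCT-II of the log filterbank energies */
    for (int i = 0; i < N_CEPS; i++) {
        float c = 0.0f;
        for (int m = 0; m < N_FILT; m++)
            c += loge[m] * cosf(PI * (float)i * ((float)m + 0.5f)
                                / (float)N_FILT);
        ceps[i] = c;
    }
}
```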
2.4 Training and Recognition of Speech Signals
The music speech recognition system is a highly specialized speech recognition system with a small vocabulary. The recognition speed requirement is high, and the vocabulary of musical pitches is small: the range of pitches a human voice can sing usually amounts to only a few dozen (generally within four octaves). This system therefore uses the relatively simple and effective DTW algorithm for speech recognition. Based on dynamic programming, the algorithm converts the feature parameters extracted from each frame of the speech signal into a sequence of feature vectors; recognition then matches this sequence against the speech feature vectors (reference templates) stored in the template library and selects the template with the smallest distance (see the sketch below). Recognition requires a speech template library, i.e., training of the speech model. Referring to the musical pitch frequency table, we train only the human vocal range (60-1000 Hz, that is, note names from C-), a total of 32 pitches over four octaves. Since the pitch range within any one song is limited, the training samples are few, and the small vocabulary greatly improves the speed of musical pitch recognition.
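A minimal DTW sketch in C matching the description above: frames of feature vectors are compared with Euclidean distance and aligned by dynamic programming, and the template with the smallest accumulated distance is selected. The array bounds and names are illustrative.

```c
#include <math.h>

#define N_CEPS     12
#define MAX_FRAMES 64   /* assumed bound on utterance/template length */

/* Euclidean distance between two feature vectors. */
static float vec_dist(const float *a, const float *b)
{
    float d = 0.0f;
    for (int i = 0; i < N_CEPS; i++)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return sqrtf(d);
}

/* Accumulated DTW distance between an input of n frames and a
 * reference template of m frames (n, m <= MAX_FRAMES assumed).
 * The template yielding the smallest value is the recognition result. */
float dtw(const float in[][N_CEPS], int n,
          const float ref[][N_CEPS], int m)
{
    static float D[MAX_FRAMES][MAX_FRAMES];

    D[0][0] = vec_dist(in[0], ref[0]);
    for (int i = 1; i < n; i++)
        D[i][0] = D[i - 1][0] + vec_dist(in[i], ref[0]);
    for (int j = 1; j < m; j++)
        D[0][j] = D[0][j - 1] + vec_dist(in[0], ref[j]);

    for (int i = 1; i < n; i++)
        for (int j = 1; j < m; j++) {
            float best = D[i - 1][j - 1];               /* diagonal   */
            if (D[i - 1][j] < best) best = D[i - 1][j]; /* vertical   */
            if (D[i][j - 1] < best) best = D[i][j - 1]; /* horizontal */
            D[i][j] = vec_dist(in[i], ref[j]) + best;
        }
    return D[n - 1][m - 1];
}
```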
3. Experimental Results and Analysis
We tested the recognition performance of the system. Six testers (three male and three female music professionals) used a microphone with good directivity in a quiet indoor environment. Because male and female voices generally cover different pitch ranges, with male pitches typically lower than female ones, each tester first recorded and trained all the pitches he or she could sing according to the note-name table, and several songs were then selected at random for testing. The experimental results show that, in speaker-dependent recognition of musical pitches, with female pronunciation being clearer and male pronunciation more mellow, the correct recognition rate was above 95% for the male testers and above 97% for the female testers, for an average above 96%, which meets practical requirements.
4. Conclusion
This paper has presented the hardware and software of an embedded, DSP-based music speech recognition system. Starting from traditional speech recognition methods, several improvements were made to suit the characteristics of music speech, and the hardware structure and software flow of the system were described. A new endpoint detection method, based on multi-band energy-curve segmentation combined with the zero-crossing rate, simplifies the computation and further improves recognition performance. Speech recognition technology is thus applied effectively to electronic music, and embedded real-time music speech recognition is realized. The experimental results show that the system has high accuracy and can basically meet practical needs.
The authors' innovations:
(1) Speech recognition methods are applied to electronic music: a dedicated music speech recognition hardware and software system is designed and implemented, preprocessing steps such as filtering and burr removal are used judiciously, and professional music speech samples are trained, improving recognition accuracy.
(2) A new endpoint detection method based on multi-band energy-curve segmentation combined with the zero-crossing rate is established. It improves the accuracy of speech segmentation while ensuring no missegmentation, and the zero-crossing-rate threshold further improves the accuracy of endpoint detection.
References
[1] Cai Lianhong, Huang Dezhi, Cai Rui. Modern Speech Technology Foundation and Applications [M]. Beijing: Tsinghua University Press, 2003.
[2] Hu Guangrui. Speech Processing and Recognition [M]. Shanghai: Shanghai Science and Technology Literature Press, 1994.
[3] Wang Bingxi, Qu Dan, Peng Xuan. Fundamentals of Practical Speech Recognition [M]. National Defense Industry Press, 2005.
[4] Chen Feili. Research on Dynamic Characteristic Modeling Methods in Chinese Continuous Speech Recognition [D]. Shanghai Jiaotong University, 2002.
[5] Jiang Guanxing, Wang Jianying. An Improved Method for Detecting Speech Endpoints [J]. Microcomputer Information, 2006(13): 138-139.
About the Authors
1. Liang Wenbin (1982-), male (Han nationality), from Xinshao, Hunan; master's candidate in Control Theory and Control Engineering, School of Electrical and Information Engineering, Hunan University; research interests: embedded systems and their applications.
2. Zhang Fan (1967-), male (Han nationality), from Changsha, Hunan; associate professor, School of Electrical and Information Engineering, Hunan University; research interests: embedded systems and their applications.
3. Cheng Jing (1968-), female (Han nationality), from Changsha, Hunan; professor, School of Software, Hunan University; research interests: embedded systems and their applications.
4. Zhao Xinkuan (1982-), male (Han nationality), from Zhongxiang, Hubei; master's candidate in Control Theory and Control Engineering, School of Electrical and Information Engineering, Hunan University; research interests: embedded systems and their applications.