Using DSP core technology for voice compression development

Publisher: mancozc | Updated: 2006-05-07 | Source: 电子技术应用 (Application of Electronic Technique)

    Abstract: This paper describes a handheld voice device designed around a DSP-core chip. The device requires no dedicated development system, provides up to 200 minutes of voice playback at a compression ratio of 46:1, and stores all of its data in a single 32-megabit flash memory.

    Keywords: DSP core; voice compression

    As instrument human-machine interfaces become more user-friendly, demand for speech capability keeps growing; speech processing technology is developing rapidly, and new processing algorithms appear constantly, providing increasingly flexible technical means for large-capacity speech applications. However, the higher the compression ratio of the voice data, the more computing power the playback algorithm requires. At present, most high-compression-ratio voice data must be generated with special voice development tools and licensed algorithms, which poses real difficulties for small-scale domestic users. Even at the same 8 kHz sampling rate, with the same 4-megabit flash chip used for storage, the playback time provided by different speech algorithms varies greatly: the ADPCM (Adaptive Differential Pulse Code Modulation) algorithm provides only about 128 seconds of playback, while TI's LPC (Linear Predictive Coding) algorithm can reach 50 minutes. In a certain engineering project, we needed to develop a low-cost handheld voice device with a playback time of up to 200 minutes. Because a DSP-core chip was used, the development work was completed in a short time and met the performance targets set in advance.
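The playback-time figures quoted above follow from simple arithmetic: a fixed flash size divided by the coded bit rate. The short sketch below reproduces them, assuming the nominal rates of 32 kbit/s for ADPCM (8 kHz, 4 bits per sample) and roughly 1.4 kbit/s for the LPC coder; these rates are illustrative assumptions, not datasheet values.

```python
# Playback time offered by a 4-megabit flash at different coded bit rates.
FLASH_BITS = 4 * 1024 * 1024  # 4-megabit flash chip

def playback_seconds(bit_rate_bps: float) -> float:
    """Seconds of audio that fit in the flash at a given coded bit rate."""
    return FLASH_BITS / bit_rate_bps

# ADPCM at 8 kHz with 4 bits/sample -> 32 kbit/s: about the 128 s quoted
adpcm = playback_seconds(8000 * 4)
# An LPC coder near 1.4 kbit/s: about the 50 minutes quoted
lpc = playback_seconds(1400) / 60
```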

1 Speech algorithm and chip characteristics

    In this project, we chose DSP Group's latest speech compression algorithm, the Triple Rate Coder, which offers a compression ratio of 46:1 with good sound quality and a MOS score of 3.98.

    The basic idea of the algorithm is as follows. The speech is first divided into short segments; because the spectrum of a speech signal varies slowly, the signal within each short segment changes smoothly. A digital filter driven by an excitation function is then used to represent the discrete sample sequence of the time-domain waveform. The actual algorithm uses a tenth-order linear prediction filter. Each frame is divided into 4 subframes; the filter coefficient vector of each subframe is computed from the data of the previous frame and the current frame, and the coefficient vector of the last subframe is obtained by a vector-decomposition prediction method. The excitation is a pseudo-random multi-pulse function obtained by a maximum-likelihood search. Once the filter coefficient vector and excitation parameters of each frame have been computed, they are quantized and packed to produce the final compressed speech data. To decompress, the packed data is first unpacked, the linear prediction filter is rebuilt, and the regenerated pseudo-random multi-pulse excitation is fed into the filter, so that the speech sample sequence is restored at the filter output.
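The decoding side described above can be illustrated with a minimal sketch: an all-pole (IIR) synthesis filter driven by a sparse multi-pulse excitation reconstructs the waveform. This shows the general LPC synthesis idea only; the Triple Rate Coder's actual quantization, subframe interpolation, and multi-pulse search are DSP Group's proprietary details, and the coefficients and pulse pattern below are toy values.

```python
def lpc_synthesize(lpc_coeffs, excitation):
    """All-pole LPC synthesis: s[n] = e[n] + sum_k a[k] * s[n-1-k]."""
    order = len(lpc_coeffs)
    history = [0.0] * order           # past outputs s[n-1] .. s[n-order]
    out = []
    for e in excitation:
        s = e + sum(a * h for a, h in zip(lpc_coeffs, history))
        history = [s] + history[:-1]  # shift the filter memory
        out.append(s)
    return out

# Toy example: a mildly resonant 10th-order filter and a sparse
# multi-pulse excitation for one 30 ms frame at 8 kHz (240 samples).
coeffs = [0.5] + [0.0] * 9
pulses = [1.0 if n % 40 == 0 else 0.0 for n in range(240)]
frame = lpc_synthesize(coeffs, pulses)
```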

    Real-time operation of this algorithm requires a processing speed of more than 22 MIPS. DSP Group therefore integrated the DSP core and the algorithm code into the D6571 series of chips, to serve a wider range of applications beyond PC users. A schematic block diagram of the D6571 is shown in Figure 1. The chip can directly attach to and manage a 4-megabit flash, which at an 8 kHz sampling rate and a data rate of about 2.8 kbit/s provides 25 minutes of playback time. The chip has an industry-standard codec interface and connects directly to audio codec chips with a serial PCM interface, such as National Semiconductor's TP3054 or Samsung Semiconductor's KS8620. The D6571 can drive two external audio codec chips; after power-up, configuration commands set the working mode of each external codec, for example whether its clock is externally synchronized or self-synchronized, and whether the chip operates in output or input mode.
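The 2.8 kbit/s and 25-minute figures are mutually consistent with the 46:1 ratio, assuming the uncompressed reference is 16-bit linear PCM at 8 kHz (an assumption on our part; the article does not state the reference format):

```python
# Sanity-check the D6571 figures, assuming 16-bit PCM at 8 kHz as the
# uncompressed reference signal.
pcm_rate = 8000 * 16                           # 128 kbit/s uncompressed
coded_rate = pcm_rate / 46                     # ~2.8 kbit/s at 46:1
minutes = (4 * 1024 * 1024) / coded_rate / 60  # 4-megabit flash -> ~25 min
```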

    Based on our development experience, the biggest advantage of the D6571 over DSP-core chips from other vendors is that it requires almost no development tools or software. Because the chip's data path is bidirectional, the host can either send compressed voice data through it for decompression into speech, or have it compress input voice in real time and return the data to the host. This greatly helps users developing voice applications with long playback times. By contrast, many voice compression chips, such as those in some digital answering telephones, do not publicly document how their voice data is managed.

2 System composition

    The system structure related to speech processing is shown in Figure 2.

    The voice data is stored in a 32-megabit flash, Samsung's K29W3200, a flash memory with an 8-bit parallel interface. The parallel interface improves code efficiency and helps meet real-time requirements.

    During compression and playback, all data passing between the D6571 and the flash goes through the host processor, an 89C52. The system also includes a 64×64 dot-matrix LCD module (ACM6464) and other peripherals. All devices share a common 8-bit data bus on the CPU's P0 port; six lines of the P2 port handle the keyboard, and two lines of the P3 port serve as the serial interface. This leaves 16 port lines available for peripheral management, of which the system actually uses 14: 6 for flash management, 4 for the D6571, and 4 for the LCD.

    The audio codec interface chip is a TP3054. The synchronization pulses, sampling clocks, and data signals the TP3054 needs to operate are obtained simply by connecting it to four control lines of the D6571.

3 System development

    Voice development for this system consists of three steps: uploading, data synthesis, and downloading. Uploading means obtaining the compressed voice data; data synthesis means organizing all of the system's data into a single file with a defined structure; downloading means burning that file into the flash when the instrument is assembled for shipment from the factory. All three steps are carried out with a PC. Since the MCU's serial signals on the system board are at TTL levels, the only extra hardware needed for voice development is a MAX232 to perform level conversion to the PC.

    The D6571 has a 16-bit bus but also allows time-multiplexed use of an 8-bit bus, in which case the host must use the HL signal to indicate whether the upper or lower 8 bits are being sent on the bus; when the D6571 itself drives data onto the bus, it raises an ACK signal to tell the host to read. HRD and HWR are the read and write control lines. Because the Triple Rate Coder samples a frame every 30 milliseconds and then analyzes and compresses it, both reading compressed data from the chip and sending compressed data back to it must complete within one frame, otherwise the D6571 puts itself to sleep. Sending voice data to the D6571 proceeds as follows: first send the decompression control command, then receive a status word in return. The status word contains the number of bytes required for the current frame, and the host sends exactly that many bytes; once the frame has been processed, the D6571 issues the next status word, and in this way the voice plays back continuously. Compressing speech with the D6571 is the mirror image: the status word contains the number of bytes produced by compressing the current frame, and the host reads exactly that many bytes.
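The playback handshake above can be sketched as a host-side loop. The transport methods (`send_command`, `read_status_word`, `send_bytes`), the command value, and the status-word layout are hypothetical placeholders, since the real D6571 command set is defined only in its datasheet; the point is the handshake shape: each status word tells the host how many compressed bytes the current 30 ms frame needs, and the host must deliver them before the frame deadline.

```python
CMD_DECOMPRESS = 0x01   # hypothetical command code, for illustration only

def play_segment(dsp, compressed: bytes):
    """Feed one compressed speech segment to the codec, frame by frame."""
    dsp.send_command(CMD_DECOMPRESS)
    pos = 0
    while pos < len(compressed):
        status = dsp.read_status_word()  # chip requests the next frame
        need = status & 0xFF             # assumed: low byte = bytes required
        dsp.send_bytes(compressed[pos:pos + need])
        pos += need                      # wait for next status word on next loop
```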

    The first task when uploading is to obtain the compressed data for each speech segment. As preparation, the required voice is first recorded into WAV files using the PC's sound recorder. Two cooperating programs, one for the MCU and one for the PC, then complete the following: the PC plays the sound to the D6571 through its sound card; the 89C52 directs the D6571 to compress the voice and reads back the compressed data, returning it to the PC over the serial line; and the PC saves each segment's compressed data to disk.

    Because a certain number of Chinese dot-matrix characters must be displayed while each speech segment plays, the data synthesis step adds index and character dot-matrix data to each segment and merges everything into a single binary file of nearly 32 megabits. To simplify locating and reading data once the file is formed, data blocks are aligned to flash pages, one page being 528 bytes.
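The page-aligned synthesis step can be sketched as follows. The field layout (dot-matrix caption followed by compressed audio, with a page-number index) is our illustrative assumption; the project's actual file format is not published. Only the 528-byte page size comes from the text.

```python
PAGE_SIZE = 528   # one flash page, as stated in the article

def to_pages(payload: bytes) -> bytes:
    """Pad a data block with 0xFF (the erased-flash value) to a page boundary."""
    pad = (PAGE_SIZE - len(payload) % PAGE_SIZE) % PAGE_SIZE
    return payload + b"\xff" * pad

def synthesize(segments):
    """Concatenate (dot_matrix, compressed_audio) pairs into one flash image,
    recording each segment's starting page number for the index."""
    image, index = b"", []
    for dots, audio in segments:
        index.append(len(image) // PAGE_SIZE)   # starting page of this segment
        image += to_pages(dots) + to_pages(audio)
    return index, image
```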

    Downloading is comparatively simple: before the equipment leaves the factory, the file produced by data synthesis is downloaded into the flash memory directly through the serial port of the system's 89C52.

    The D6571's compression ratio is extremely high, and the chip is convenient to use and develop with. As a result, although the system's voice capacity reaches 200 minutes, the overall design is simple and economical and required almost no debugging. The main development work was writing MCU and PC programs in C51 and VB, and development proceeded quickly.

    Because the D6571's control commands are rich, the system's user-interface software is easy to write. For example, since the chip provides 30-level volume control commands, we added digital volume control to the device without any extra hardware. The D6571's command set also includes more advanced functions such as automatic gain control, variable-speed playback, and digital filtering, so it can be applied to almost any speech situation.

