Implementation of a low-rate speech codec in IP telephony

Abstract: This article describes how G.729.A is implemented on the TMS320C6201 DSP, presents the optimization methods and programming techniques used to increase the running speed of the G.729.A encoder, and reports the encoder's test results.

Keywords: ITU-T G.729.A IP telephony encoder

In recent years, IP telephony technology has advanced rapidly, from the original PC-to-PC connection method to the IP telephony gateway method. Through an IP telephony gateway, a PBX can be connected to the Internet, so that ordinary telephones can communicate over the Internet. The IP telephony gateway has therefore become a hot research topic in computing and communications. One of its most important performance indicators is processing density, that is, the number of voice channels that can be processed simultaneously. Processing density depends mainly on the time the voice codec needs to process one frame of data. Currently, the standard followed by IP telephony is H.323, and the preferred voice coder for H.323 is ITU-T G.729.A. ITU-T G.729.A is a recommendation for the compression coding of speech and other audio signals. It is a simplified version of G.729, with a coding rate of 8 kbit/s and high voice quality. However, the algorithm is complex and the processing delay for one frame of speech is large, which severely limits the processing density of the gateway.

To improve the processing density, this article implements the ITU-T G.729.A voice codec on the TMS320C6201, currently the best-performing DSP. Programming techniques for implementing the G.729.A codec on the TMS320C6201 were studied in depth with its parallelism and pipeline characteristics in mind, and a series of optimization methods for reducing codec processing delay were summarized. Using these optimization methods and programming techniques, the encoding time for each frame of ITU-T G.729.A can be reduced to 0.47 milliseconds (with the TMS320C6201 running at 200 MHz), so that a single TMS320C6201 can process 20 channels of voice at the same time. This figure has reached the most advanced level in the world; moreover, the codec has been successfully used in the IP telephony gateway developed by the author.

1 G.729.A codec algorithm

1.1 Encoding algorithm

The ITU-T G.729.A standard uses an algorithm called "Conjugate Structure Algebraic-Code-Excited Linear-Prediction (CS-ACELP)" to encode speech signals.

Before encoding, the input analog signal is filtered to telephone bandwidth, sampled at 8 kHz, and converted into 16-bit linear PCM as the input to the encoder.

The encoder processes speech in frames. One frame is 10 milliseconds of speech, that is, 80 samples at a sampling frequency of 8 kHz. The encoder analyzes each frame of the speech signal, extracts the parameters of the CELP model (linear prediction filter parameters, adaptive and fixed codebook indices, and gains), and encodes and transmits these parameters. The coding process is shown in Figure 1.

In the preprocessing stage, the input signal is high-pass filtered and multiplied by a scale factor. Linear prediction analysis is then performed on the preprocessed signal of each frame to calculate the linear prediction filter coefficients, that is, the coefficients $a_i$ ($i = 1, \dots, 10$) of the tenth-order synthesis filter $1/A(z) = 1/\left(1 + \sum_{i=1}^{10} a_i z^{-i}\right)$. These coefficients are converted into Line Spectrum Pairs (LSP) and quantized with 18 bits using predictive two-stage vector quantization. The excitation signal is selected by an analysis-by-synthesis search so that the error between the original signal and the reconstructed signal is minimized under a perceptually weighted distortion measure.

The excitation parameters (fixed and adaptive codebook parameters) are determined for each subframe (5 ms, 40 samples). The quantized and unquantized linear prediction filter coefficients are used for the second subframe, while interpolated coefficients are used for the first subframe. The open-loop pitch delay is estimated once per frame from the perceptually weighted speech signal. The encoder then performs the following operations for each subframe: the target signal x(n) is computed by passing the linear prediction residual through the weighted synthesis filter; the impulse response h(n) of the weighted synthesis filter is computed; and closed-loop pitch analysis is performed using x(n) and h(n), searching around the open-loop pitch delay, to obtain the adaptive codebook delay and gain. The pitch delay of the first subframe is encoded with 8 bits, and that of the second subframe is encoded differentially with 5 bits. The target signal x(n) is updated by subtracting the (filtered) adaptive codebook contribution, and the new target x'(n) is used in the fixed codebook search to find the optimal excitation. The fixed codebook excitation uses a 17-bit algebraic codebook. The gains of the adaptive and fixed codebook contributions are vector quantized with 7 bits (using moving-average prediction for the fixed codebook gain). Finally, the resulting excitation signal is used to update the filter states. All these parameters are packed into an 80-bit compressed data frame.
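The bit budget described above can be checked with a short sketch. The field names are illustrative, not taken from any reference code. The single parity bit on the first pitch delay is not mentioned in the text but is part of the G.729 bit allocation, and the 17 fixed codebook bits and 7 gain bits are counted once per subframe.

```c
#include <stddef.h>

/* Hypothetical field table; widths are bits per 10 ms frame. */
struct g729a_field {
    const char *name;
    int bits;
};

static const struct g729a_field g729a_fields[] = {
    { "LSP indices (two-stage VQ)",             18 },
    { "pitch delay, subframe 1",                 8 },
    { "pitch delay parity (per G.729)",          1 },
    { "pitch delay, subframe 2 (differential)",  5 },
    { "fixed codebook, subframe 1",             17 },
    { "fixed codebook, subframe 2",             17 },
    { "codebook gains, subframe 1",              7 },
    { "codebook gains, subframe 2",              7 },
};

/* Sum of all field widths for one frame. */
int g729a_frame_bits(void)
{
    int total = 0;
    size_t i;
    for (i = 0; i < sizeof g729a_fields / sizeof g729a_fields[0]; i++)
        total += g729a_fields[i].bits;
    return total;
}

/* Bit rate in bit/s for a given frame length in milliseconds. */
int g729a_bit_rate(int frame_ms)
{
    return g729a_frame_bits() * 1000 / frame_ms;
}
```

With a 10 ms frame, the 80 bits per frame correspond to the 8 kbit/s coding rate.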

1.2 Decoding algorithm

The decoder algorithm block diagram is shown in Figure 2.

First, the index of each parameter is extracted from the compressed bit stream. These indices are then decoded to obtain the coder parameters for one frame of speech: the LSP coefficients, 2 fractional pitch delays, 2 fixed codebook vectors, and 2 sets of adaptive and fixed codebook gains. These parameters are used to generate the excitation signal and the synthesis filter parameters. The LSP coefficients are interpolated to form the LP filter for each subframe. Then, each subframe is processed as follows:

·The adaptive and fixed codebook vectors are multiplied by their respective gain coefficients to obtain the excitation signal;

·The excitation signal passes through the linear prediction synthesis filter to obtain the reconstructed speech;

·The reconstructed speech signal then goes through a post-processing stage, which consists of an adaptive postfilter based on long-term and short-term synthesis filters, followed by a high-pass filter and multiplication by the corresponding scaling factor.
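The first two steps of the per-subframe processing can be sketched as follows. This is a minimal floating-point illustration for readability only; a real implementation on a fixed-point DSP would use fixed-point arithmetic, and the post-processing stage is omitted. The function and parameter names are assumptions made for this example.

```c
#define SUBFRAME_LEN 40   /* samples per 5 ms subframe */
#define LP_ORDER     10   /* order of the linear prediction filter */

/*
 * Build the excitation u(n) = gp * v(n) + gc * c(n) from the adaptive
 * codebook vector v and the fixed codebook vector c, then pass it through
 * the synthesis filter 1/A(z), A(z) = 1 + sum_{i=1..10} a[i] * z^-i.
 * mem[] holds the last LP_ORDER output samples, mem[0] being the newest.
 */
void synth_subframe(const float v[SUBFRAME_LEN], const float c[SUBFRAME_LEN],
                    float gp, float gc,
                    const float a[LP_ORDER + 1],
                    float mem[LP_ORDER], float out[SUBFRAME_LEN])
{
    int n, i;
    for (n = 0; n < SUBFRAME_LEN; n++) {
        float s = gp * v[n] + gc * c[n];      /* excitation sample */
        for (i = 1; i <= LP_ORDER; i++)
            s -= a[i] * mem[i - 1];           /* all-pole recursion */
        for (i = LP_ORDER - 1; i > 0; i--)    /* shift the filter memory */
            mem[i] = mem[i - 1];
        mem[0] = s;
        out[n] = s;                           /* reconstructed speech */
    }
}
```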

2 Key technologies for ITU-T G.729.A codec implementation

2.1 Hardware platform for ITU-T G.729.A codec implementation

The ITU-T G.729.A codec is implemented on an integrated IP telephony gateway developed by the author. The gateway is PC-based and integrates commercially available boards, such as the LSI/C6200 DSP resource card and the Dialogic voice card and gateway. With these boards as the hardware platform, a set of IP gateway software was developed according to the relevant protocols. The basic hardware structure of the integrated IP telephony gateway is shown in Figure 3. The G.729.A codec runs on the TMS320C6201 DSP on the LSI/C6200 resource card.

The TMS320C6201 DSP is currently the fastest fixed-point digital signal processor produced by Texas Instruments. It adopts a VLIW (Very Long Instruction Word) architecture, runs at up to 200 MHz, and delivers up to 1600 MIPS [4]. In addition, the TMS320C6201 provides 64 KB of internal program RAM and 64 KB of internal data RAM. The off-chip memory can be expanded to 4 GB and can consist of SDRAM, SBSRAM and Flash memory. The TMS320C6201 DSP also provides a rich set of peripheral interfaces, such as the SCbus voice bus, the MVIP voice bus, the HOST interface and the JTAG port.

2.2 Design of ITU-T G.729.A software module

The hardware platform on which the G.729.A codec runs is the TMS320C6201 DSP, which supports SPOX, a powerful real-time operating system. Under SPOX scheduling, multiple voice channels can be encoded and decoded in real time. The G.729.A codec consists of three main parts: the scheduling and command interpretation module, the G.729.A data compression and decompression module, and the interface module.

(1) Scheduling and command interpretation module

This module interprets the commands sent by the HOST, such as sending or receiving codec data, querying codec status, and starting and stopping codec operations. It does not communicate with the HOST directly; instead it uses the services provided by SPOX, so that data exchange with the HOST is carried out indirectly through the interface module. With the support of SPOX, it also schedules the multi-channel voice codecs in real time.

(2) G.729.A data compression and decompression module

This module is the core of the ITU-T G.729.A codec and largely determines its performance. It implements all functions of ITU-T G.729.A. It has been built as a separate TMS320C6201 function library and can be linked with any of the other parts.

(3) Interface module

This module implements data exchange between the TMS320C6201 and both the HOST and the voice card, so it is divided into two parts. The first part is responsible for data transmission between the TMS320C6201 DSP and the voice card. Using isochronous communication over the voice bus (such as SCbus), it continuously transfers the voice data collected by the voice card to the RAM of the LSI/PCI6200 resource card, and sends the data decoded by the codec back to the voice card. The second part is responsible for data exchange between the TMS320C6201 DSP and the HOST. On the one hand, the compressed voice signal is sent to the HOST through the PCI bus; on the other hand, the code stream unpacked by the HOST is sorted and read into the codec. The data exchange between the codec and the HOST is synchronized using interrupts.

2.3 Key techniques for implementing the ITU-T G.729.A standard on TMS320C6201

Processing density is an important indicator of the performance of an IP telephony gateway. Once the hardware platform of a gateway is fixed, its processing density depends mainly on the speech coding delay of the codec it uses, that is, on the execution speed of the code. How to increase the execution speed of G.729.A speech coding is therefore one of the key technical issues in implementing the G.729.A codec. This article summarizes a series of programming techniques and optimization methods that address this problem.

(1) The algorithms specified in the G.729.A standard are basic algorithms, so fast algorithms can be substituted during implementation. For example, the G.729.A standard calculates the correlation coefficients with the most basic direct method. If fast Fourier transform techniques or a factor decomposition method are used instead, the calculation can be accelerated.

(2) The algorithm contains many FIR and IIR operations, such as formant filters, perceptual weighting filters, joint filters, etc. When implementing these filters, use larger arrays to store the filter state. In this way there is no need to update and shift the state every time an output sample is calculated, which reduces the number of memory operations and therefore improves the execution speed of the code at the cost of memory space. For example, the formant filter is a tenth-order filter. The conventional implementation keeps a one-dimensional array of 10 elements holding the latest 10 formant sample points. This array has to be updated every time the filter outputs a sample point, so the 40 sample points of a subframe require 40 update operations. With an array of 70 elements, the update operations can be avoided, which greatly increases the speed of the code.
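The buffer layout in point (2) can be sketched as follows. This is a floating-point illustration, not the author's code. The buffer here is sized as filter order plus subframe length (50 elements), which is an assumption made for this example and differs from the 70-element array mentioned in the text. The function names are hypothetical.

```c
#include <string.h>

#define ORDER        10
#define SUBFRAME_LEN 40
#define BUF_LEN      (ORDER + SUBFRAME_LEN)

/*
 * All-pole filter y(n) = x(n) - sum_{i=1..ORDER} a[i] * y(n - i).
 * On entry buf[0..ORDER-1] holds the previous outputs, oldest first.
 * New outputs are written to buf[ORDER..BUF_LEN-1], so the history is
 * simply the samples just before the current write position and no
 * per-sample shifting is needed.
 */
void allpole_subframe(const float a[ORDER + 1],
                      const float x[SUBFRAME_LEN],
                      float buf[BUF_LEN])
{
    float *y = buf + ORDER;
    int n, i;
    for (n = 0; n < SUBFRAME_LEN; n++) {
        float acc = x[n];
        for (i = 1; i <= ORDER; i++)
            acc -= a[i] * y[n - i];
        y[n] = acc;
    }
}

/* One history update per subframe: move the last ORDER outputs to the front. */
void allpole_carry_history(float buf[BUF_LEN])
{
    memmove(buf, buf + SUBFRAME_LEN, ORDER * sizeof buf[0]);
}
```

In this layout the 40 per-sample updates of the conventional method are replaced by a single block move per subframe.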

(3) Use more pointers to minimize repeated copy operations between variables.
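As a small illustration of point (3), two frame buffers can exchange roles by swapping pointers rather than copying 80 samples. The function name and the use of current and previous frame buffers are assumptions made for this example.

```c
#define FRAME_LEN 80   /* samples per 10 ms frame */

/* Exchange the roles of two frame buffers without copying any samples. */
void swap_frames(short **cur, short **prev)
{
    short *tmp = *cur;
    *cur = *prev;
    *prev = tmp;
}
```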

(4) Use static table query methods instead of dynamic calculations to reduce calculation delays. For example, when designing the cos() function, the program generates a 512-item cos() function table during initialization. When it is necessary to calculate the cos() function value, the lookup table method can be used instead of dynamic calculation.
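A minimal sketch of the table lookup in point (4) is given below. The text does not state the range or indexing of the 512-entry table; this example assumes it covers [0, pi] with nearest-entry lookup. To keep the example free of library dependencies, the table is filled with the recurrence cos((k+1)t) = 2 cos(t) cos(kt) - cos((k-1)t); calling the standard cos() during initialization would serve equally well.

```c
#define COS_TABLE_SIZE 512
#define PI_D 3.14159265358979323846

static float cos_table[COS_TABLE_SIZE];

/* Fill the table once at start-up: entry k holds cos(k * pi / 511). */
void cos_table_init(void)
{
    const double theta = PI_D / (COS_TABLE_SIZE - 1);
    const double t2 = theta * theta;
    /* cos(theta) for the small step angle, from its Taylor series. */
    const double c1 = 1.0 - t2 / 2.0 + t2 * t2 / 24.0 - t2 * t2 * t2 / 720.0;
    double prev = 1.0;   /* cos(0) */
    double cur = c1;     /* cos(theta) */
    int k;

    cos_table[0] = 1.0f;
    cos_table[1] = (float)c1;
    for (k = 2; k < COS_TABLE_SIZE; k++) {
        double next = 2.0 * c1 * cur - prev;   /* cos(k * theta) */
        cos_table[k] = (float)next;
        prev = cur;
        cur = next;
    }
}

/* Nearest-entry lookup for x in [0, pi]; out-of-range inputs are clamped. */
float cos_lookup(float x)
{
    const float scale = (COS_TABLE_SIZE - 1) / 3.14159265f;
    int k = (int)(x * scale + 0.5f);
    if (k < 0)
        k = 0;
    if (k > COS_TABLE_SIZE - 1)
        k = COS_TABLE_SIZE - 1;
    return cos_table[k];
}
```

The lookup trades accuracy for speed: with 512 entries over [0, pi] the nearest-entry error is bounded by about half the step angle.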

(5) Allocate memory sensibly. The TMS320C6201 DSP has 64 KB of on-chip data memory. Since the TMS320C6201 reads a word from on-chip memory 14 times faster than from off-chip memory, frequently used data should be placed in on-chip memory wherever possible.

(6) When G.729.A is implemented at fixed point on TMS320C6201, data accuracy is also a key issue. When implementing certain floating-point algorithms on a fixed-point signal processing chip, fixed-point numbers can be used to represent floating-point numbers, which can speed up the operation, but may result in insufficient calculation accuracy. The solution is that in places where accuracy requirements are relatively high, the intermediate variables for calculation can be represented by 32 bits or even 40 bits.
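The wide-accumulator idea in point (6) can be sketched in portable C. Here int64_t stands in for the 40-bit intermediate mentioned in the text, the Q15 and Q30 formats are assumptions made for this example, and the function names are hypothetical. The code assumes that right-shifting a negative value is an arithmetic shift, as on common compilers.

```c
#include <stdint.h>

/* Saturate a wide accumulator to the 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX)
        return INT32_MAX;
    if (v < INT32_MIN)
        return INT32_MIN;
    return (int32_t)v;
}

/*
 * Dot product of two Q15 vectors. Each 16x16 product fits in 32 bits;
 * the running sum is kept in a wider accumulator and saturated only once
 * at the end, so intermediate overflow cannot corrupt the result.
 * The returned value is in Q30.
 */
int32_t dot_q15(const int16_t *x, const int16_t *y, int n)
{
    int64_t acc = 0;
    int i;
    for (i = 0; i < n; i++)
        acc += (int32_t)x[i] * (int32_t)y[i];
    return sat32(acc);
}

/* Round a Q30 value to Q15 and saturate to 16 bits. */
int16_t q30_to_q15(int32_t v)
{
    int64_t r = ((int64_t)v + (1 << 14)) >> 15;
    if (r > INT16_MAX)
        return INT16_MAX;
    if (r < INT16_MIN)
        return INT16_MIN;
    return (int16_t)r;
}
```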

(7) Make full use of the TMS320C6201 compiler and optimization tools to optimize C and linear assembly code, and choose the optimization options carefully. The options related to speed include -o3, -pm, -mt and -mi. Where possible, implement the G.729.A codec algorithm in TMS320C6201 linear assembly or assembly language.

(8) Make full use of the features of the TMS320C6201 when writing code, such as pipelining, parallel operation of the 8 functional units, 32-bit word reads and writes, and intrinsics. For example, in nested loops, if the innermost loop has a small trip count and a simple body, it can be unrolled so that the outer loop can be software pipelined. Likewise, merging simple loops that do not depend on each other also benefits software pipelining.
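The loop transformations in point (8) can be illustrated in plain C. The sketch below merges two independent reductions over one subframe into a single loop and unrolls it by two, using separate partial sums so that the iterations do not depend on each other. The function name and the choice of correlation and energy as the two reductions are assumptions made for this example; whether a compiler actually software-pipelines the loop depends on the tool and options used.

```c
#include <stdint.h>

#define SUBFRAME_LEN 40   /* even, so unrolling by two needs no remainder loop */

/*
 * Cross-correlation sum(x[n] * y[n]) and energy sum(y[n] * y[n]) computed
 * in one merged loop, unrolled by two with independent accumulators.
 */
void corr_and_energy(const int16_t *x, const int16_t *y,
                     int64_t *corr, int64_t *energy)
{
    int64_t c0 = 0, c1 = 0;   /* partial sums for even and odd samples */
    int64_t e0 = 0, e1 = 0;
    int n;
    for (n = 0; n < SUBFRAME_LEN; n += 2) {
        c0 += (int32_t)x[n]     * y[n];
        c1 += (int32_t)x[n + 1] * y[n + 1];
        e0 += (int32_t)y[n]     * y[n];
        e1 += (int32_t)y[n + 1] * y[n + 1];
    }
    *corr = c0 + c1;
    *energy = e0 + e1;
}
```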

3 Performance test

Two test tools were used to measure the processing delay of the G.729.A codec. The first is the C6x Simulator (TMS320C6201 simulation software). The test condition assumes that all code is loaded into the on-chip program memory of the TMS320C6201; this is therefore called non-cache mode. The second is TI's C6x EVM (evaluation module) card. The test condition uses the 64 KB on-chip RAM of the TMS320C6201 as a cache; this is therefore called cache mode. The results for the two test modes are shown in Table 1.

Table 1 G.729.A codec clock cycles per frame

Test item	C6x Simulator (non-cache mode)	C6x EVM card (cache mode)
Encoding (per frame)	86720 cycles	91650 cycles
Decoding (per frame)	34120 cycles	37310 cycles

As can be seen from Table 1, if the TMS320C6201 runs at 200 MHz, so that each cycle takes 5 nanoseconds, the time G.729.A needs to encode one frame (10 milliseconds of speech) is 0.43 to 0.46 milliseconds. Therefore, a single TMS320C6201 can process approximately 20 channels of G.729.A encoding at the same time (the current highest international level is 22 channels). Moreover, the encoding and decoding results strictly pass the test vectors provided with G.729.A, and the actual playback sound quality is also very good.
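The arithmetic behind these figures can be reproduced with a short sketch. The function names are illustrative. The channel count computed here is an upper bound for encoding alone; it ignores decoding, scheduling and data transfer overhead, which is consistent with the more conservative figure of about 20 channels quoted above.

```c
/* Processing time in whole microseconds for a cycle count at a given clock. */
long cycles_to_us(long cycles, long clock_hz)
{
    return cycles / (clock_hz / 1000000L);
}

/* Clock cycles available during one frame period. */
long cycles_per_frame_period(long clock_hz, int frame_ms)
{
    return clock_hz / 1000L * frame_ms;
}

/* Upper bound on channels whose per-frame work fits in one frame period. */
int max_channels(long clock_hz, int frame_ms, long cycles_per_channel)
{
    return (int)(cycles_per_frame_period(clock_hz, frame_ms) / cycles_per_channel);
}
```

At 200 MHz, 86720 cycles correspond to about 433 microseconds and 91650 cycles to about 458 microseconds, and a 10 ms frame period offers 2,000,000 cycles.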

ITU-T G.729.A voice compression combines many advantages of low-rate voice codecs and greatly improves the voice quality achievable at low bit rates, but the algorithm is relatively complex. The TMS320C6201 DSP is currently the fastest fixed-point digital signal processor. By making full use of the key techniques described above when writing programs, the capabilities of the TMS320C6201 can be fully exploited and the processing delay of the G.729.A codec greatly reduced, while good voice quality is maintained. Applying this codec in an IP telephony gateway considerably increases the gateway's processing density and improves its performance. The G.729.A codec implemented in this article therefore has great application value.
