Software optimization design of real-time MPEG-4 encoding based on DSP

Abstract: Taking into account the structure and characteristics of the TMS320C6201 EVM development board, this article describes the software optimization work carried out on the algorithm in order to achieve real-time MPEG-4 video encoding.

Keywords: TMS320C6201; MPEG-4; optimization; parallel processing

The TMS320C6201 is a recently launched parallel-processing digital signal processor from TI. Its peak processing power is 1600 MIPS, that is, 1.6 billion fixed-point instructions per second, which at the time of writing makes it the fastest and most powerful DSP on the market. Its application prospects are very broad. This article uses the C6201 development tool, the EVM (evaluation module) board, to implement real-time MPEG-4 encoding in software. The key modules of MPEG-4 video encoding are discussed in detail, and software optimization techniques tailored to the instruction structure of the TMS320C6000 are studied in depth.

1 Introduction to development tools

The evaluation tool used by the author is TI's C6XEVM[2]. Its structure is shown in Figure 1.

In addition to the core DSP, the C6XEVM provides the following resources: one 64K×32-bit, 133MHz synchronous burst static RAM (SBSRAM); two 1M×32-bit, 100MHz synchronous dynamic RAMs (SDRAM); JTAG emulation support through PCI or an external XDS510; a 16-bit stereo audio codec with sampling rates from 5.5kHz to 48kHz; on-board switching voltage regulators for 1.8V/2.5V DC and 3.3V DC; an on-board linear voltage regulator for the 5V DC analog supply; and 3 LED indicators (one for power and 2 user-defined).

SBSRAM is mapped to the CE0 storage space of the DSP and is used for program booting. Usually SBSRAM works at 133MHz. When the full-speed interface is used, the CPU clock is equal to the SBSRAM clock; when the half-speed interface is used, the SBSRAM speed is half the CPU clock speed.

The EVM provides two SDRAM regions of 1M×32 bits each. Each region consists of two 512K×2 banks×16-bit devices. The regions are mapped to the CE2 and CE3 memory spaces of the DSP, and each uses a 16Mbit address space. The SDRAM typically runs at half the CPU clock speed.

The asynchronous memory connector provided by the EVM allows memory or memory-mapped devices on a daughter board to be attached. The expansion memory interface is mapped to the lower 3M of the DSP's 4M asynchronous CE1 memory space. Within CE1, the expansion space occupies addresses 0x1000000 to 0x12FFFFF in MAP0 mode and 0x1400000 to 0x16FFFFF in MAP1 mode. The top 1M bytes of CE1 are allocated to on-board peripherals. This division of the CE1 memory space allows on-board devices and expansion devices to coexist.

2 MPEG-4 video encoding

MPEG-4 encoding is based on VOP encoding [3]. A VOP (video object plane) is the instance of a video object (VO) at a given moment in time. The block diagram of the VOP encoder is shown in Figure 2.

The encoder consists of two main parts: a shape encoder, and a conventional motion estimation/compensation and texture encoder. A VOP can be coded with intra-frame coding (Intra-VOP, or I-VOP) or with inter-frame predictive coding (Inter-VOP). Inter-frame predictive coding is further divided into forward causal prediction (P-VOP) and bidirectional non-causal prediction (B-VOP), and it removes the temporal redundancy of the video information. To encode a VOP, it is first divided from top to bottom into 16×16 macroblocks (MB). Shape, motion and texture are all coded per MB, so the information of one MB is the combination of its shape, motion and texture (Shape-Motion-Texture). For coding, each MB is split into four 8×8 blocks of the luminance (Y) component and two 8×8 blocks for the chrominance components (Cb and Cr). Each of the 6 blocks then undergoes an 8×8 two-dimensional DCT, quantization and Huffman coding.
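
As a rough illustration of this partitioning, the sketch below counts the macroblocks and 8×8 blocks of a rectangular VOP. The helper names are hypothetical, and a VOP whose size is not a multiple of 16 is assumed to be padded up to whole macroblocks.

```c
#define MB_SIZE       16   /* macroblock edge in luminance pixels */
#define BLOCKS_PER_MB 6    /* 4 Y blocks + 1 Cb block + 1 Cr block */

/* Hypothetical helper: macroblocks needed to cover one dimension. */
int mb_count(int pixels)
{
    return (pixels + MB_SIZE - 1) / MB_SIZE;   /* round up to whole MBs */
}

/* Hypothetical helper: total number of 8x8 blocks in a rectangular VOP. */
int blocks_per_vop(int width, int height)
{
    return mb_count(width) * mb_count(height) * BLOCKS_PER_MB;
}
```

For a VOP covering a full QCIF frame (176×144), this gives 11×9 = 99 macroblocks, or 594 blocks to transform, quantize and entropy-code per frame.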

The test images are in QCIF format (176×144 pixels), and the image data is read from the host. The share of clock cycles consumed by each module, as measured with CCS, is as follows:

The proportion of each part in the total amount of computation is:

Analysis shows that the motion estimation and compensation module and the texture coding module are the main bottlenecks of the MPEG-4 implementation. The program optimization work therefore concentrates on these two modules.

3 Program optimization considerations

To exploit the full computing power of the TMS320C6201, optimization must start from its hardware structure: make full use of the eight functional units, use software pipelining, and arrange the program so that as much of it as possible executes in parallel without conflicts. The advantage of parallel execution is that operations which do not depend on each other can be completed at the same time, as long as CPU resources allow. This advantage is lost, however, where operations depend on one another or where there are frequent tests and jumps. Loop bodies generally meet the conditions for parallel processing, and they usually account for most of the program's running time. Optimization should therefore focus on loop bodies.

3.1 Optimization of jump instructions

Most DSP instructions are single-cycle instructions, but branch instructions usually consume more clock cycles: each branch has 5 delay slots. Branching is therefore expensive from a performance perspective, and the number of branches in the program should be reduced as much as possible.

In fact, analysis of the program shows that many conditional branches can be replaced by a simple combination of conditions. Consider, for example, the following code fragment:

if(rcoeff[i]>(lim-1)) rcoeff[i]=(lim-1);

else if(rcoeff[i]<(-lim)) rcoeff[i]=(-lim);

It can be changed to:

rcoeff[i]=MIN(rcoeff[i],(lim-1));

rcoeff[i]=MAX(rcoeff[i],(-lim));
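
A self-contained version of this rewrite might look like the sketch below. The MIN and MAX macro definitions and the function name clamp_coeffs are assumptions, since the article does not show its own definitions. The point is that each macro expands to a conditional expression, which a compiler can typically map to conditionally executed instructions rather than to a branch.

```c
#include <assert.h>

/* Assumed macro definitions; the article does not list its own. */
#define MIN(a, b) (((a) < (b)) ? (a) : (b))
#define MAX(a, b) (((a) > (b)) ? (a) : (b))

/* Clamp every coefficient to [-lim, lim - 1] without an if/else chain. */
void clamp_coeffs(short *rcoeff, int n, int lim)
{
    int i;

    assert(lim > 0);                /* the range must not be empty */
    for (i = 0; i < n; i++) {
        int v = rcoeff[i];
        v = MIN(v, lim - 1);        /* apply the upper bound */
        v = MAX(v, -lim);           /* apply the lower bound */
        rcoeff[i] = (short)v;
    }
}
```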

Another common way to reduce conditional branches is to unroll loops. This applies especially to nested loops: when the outer loop has only a few iterations, the inner loop can be written out directly and the branch conditions merged, which reduces the coupling between the loop levels.
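
The sketch below illustrates manual unrolling on a sum of absolute differences over an 8×8 block. This measure is common in motion estimation but is chosen here only for illustration and is not taken from the article; the function names and the fixed block size are assumptions.

```c
#include <stdlib.h>

#define BLK 8   /* block edge, assumed fixed at 8 */

/* Rolled version: two nested loops, one loop test per pixel. */
int sad_rolled(const unsigned char *a, const unsigned char *b, int stride)
{
    int r, c, sum = 0;

    for (r = 0; r < BLK; r++)
        for (c = 0; c < BLK; c++)
            sum += abs(a[r * stride + c] - b[r * stride + c]);
    return sum;
}

/* Inner loop written out by hand: one loop test per row instead. */
int sad_unrolled(const unsigned char *a, const unsigned char *b, int stride)
{
    int r, sum = 0;

    for (r = 0; r < BLK; r++) {
        const unsigned char *pa = a + r * stride;
        const unsigned char *pb = b + r * stride;

        sum += abs(pa[0] - pb[0]) + abs(pa[1] - pb[1])
             + abs(pa[2] - pb[2]) + abs(pa[3] - pb[3])
             + abs(pa[4] - pb[4]) + abs(pa[5] - pb[5])
             + abs(pa[6] - pb[6]) + abs(pa[7] - pb[7]);
    }
    return sum;
}
```

Both functions return the same result; the unrolled form simply exposes eight independent operations per iteration, which gives the scheduler more work to place in parallel between branches.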

3.2 Using library functions

TI provides powerful IMAGE LIB[4] library support for TMS320C62xx users. The library contains many commonly used functions, covering DCT/IDCT transforms, wavelet transforms, DCT quantization, adaptive filtering and more. These functions are already optimized and make full use of software pipelining, so they are highly efficient.

3.3 Rewriting in linear assembly

Linear assembly is a programming language unique to the TMS320C6000 that sits between high-level and low-level languages. To improve performance, the speed-critical sections of the code can be rewritten in linear assembly. In linear assembly there is no need to specify the registers used, the delay cycles of each instruction, or the functional unit on which it executes; the assembly optimizer determines this information automatically from the code [5]. In many cases, however, efficiency improves if the functional unit is specified explicitly. When using linear assembly, note the following: when optimizing a loop body, do not use branch instructions that jump out of the loop; use a counter that counts down; and so on.

When optimizing, the loop count must be determined first: if the loop count is a variable, the optimizer cannot parallelize the loop. Second, double-word or word accesses should be used wherever possible. For example, consider a small program segment from motion estimation and compensation:

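void MC_case_a(uchar ref[NUM_ROWS][NUM_COLS],
               uchar curr[NUM_ROWS][NUM_COLS],
               const int r_x, const int c_x,
               const int r_y, const int c_y, const int size)
{
    int m, n;

    /* Copy a size x size block from the reference frame to the current frame. */
    for (m = 0; m < size; m++) {
        for (n = 0; n < size; n++) {
            curr[c_x+m][c_y+n] = ref[r_x+m][r_y+n];
        }
    }
}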

The corresponding linear assembly program is as follows:

        .def    _MC_case_a
        .sect   ".text"
_MC_case_a: .cproc ref,curr,r_x,c_x,r_y,c_y,num_cols
        .reg    r_temp1,r_temp2,c_temp1,c_temp2
        .reg    p_r,p_c,np_r
        .reg    lshift,rshift,count
        .reg    r_w1,r_w2,r_w3,r_w4
        .reg    temp
        SHL     r_x,0x05,r_temp1        ; r_temp1 = r_x * 32
        SHL     c_x,0x05,c_temp1        ; c_temp1 = c_x * 32
        ADD     r_y,ref,r_temp2
        ADD     c_y,curr,c_temp2
        ADD     r_temp1,r_temp2,p_r     ; p_r = source address
        ADD     c_temp1,c_temp2,p_c     ; p_c = destination address
        SUB     num_cols,2,num_cols
        MVK     8,count                 ; The number of loops is 8
        MVK     0xFFFC,temp
        AND     p_r,temp,np_r           ; np_r = p_r rounded down to a word boundary
        AND     p_r,0x0003,rshift       ; rshift = byte offset within the word
        SUB.L   0x04,rshift,lshift      ; lshift = 4 - byte offset
        SHL     rshift,0x03,rshift      ; convert byte counts to bit counts
        SHL     lshift,0x03,lshift
loop:   .trip   8
        LDW     *np_r++[1],r_w1         ; three aligned word loads per row
        LDW     *np_r++[1],r_w2
        LDW     *np_r++[num_cols],r_w3
        SHRU    r_w1,rshift,r_w1
        SHL     r_w3,lshift,r_w3
        SHL     r_w2,lshift,r_w4
        SHRU    r_w2,rshift,r_w2
        OR      r_w1,r_w4,r_w1          ; merge into two output words
        OR      r_w2,r_w3,r_w2
        STW     r_w1,*p_c++[1]
        STW     r_w2,*p_c++[num_cols]
        ADD     p_c,4,p_c
[count] SUB     count,1,count
[count] B       loop
        .endproc

Before optimization, the C program segment took 574 clock cycles as measured in CCS (Code Composer Studio); the optimized linear assembly takes 58 clock cycles, roughly a tenfold improvement.
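
The core trick in the linear assembly listing above is to read data that is not word-aligned using only aligned word loads, then merge neighbouring words with shifts and an OR. The following C sketch shows the same idea for a single 32-bit value. It assumes little-endian byte order, the function names are hypothetical, and the aligned load is modelled with byte reads so that the example stays portable; on the DSP it would be a single LDW.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of one aligned word load: builds a little-endian 32-bit word
   from the 4 bytes starting at a word-aligned index. */
static uint32_t load_word(const uint8_t *buf, size_t aligned_index)
{
    return (uint32_t)buf[aligned_index]
         | ((uint32_t)buf[aligned_index + 1] << 8)
         | ((uint32_t)buf[aligned_index + 2] << 16)
         | ((uint32_t)buf[aligned_index + 3] << 24);
}

/* Fetch the 32-bit value starting at any byte index, using only
   word-aligned loads plus shifts. */
uint32_t load_unaligned(const uint8_t *buf, size_t index)
{
    size_t   aligned = index & ~(size_t)3;          /* round down to a word boundary */
    unsigned rshift  = (unsigned)(index & 3) * 8;   /* bits to drop from the low word */
    uint32_t lo      = load_word(buf, aligned);

    if (rshift == 0)
        return lo;                                  /* already aligned */

    /* 32 - rshift is 8, 16 or 24 here, so both shifts are well defined. */
    return (lo >> rshift) | (load_word(buf, aligned + 4) << (32 - rshift));
}
```

Applied to a row of 8 pixels, two such merged words replace eight byte loads and eight byte stores, which is consistent with the large cycle reduction reported above.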

3.4 Storage space considerations

The configuration of the DSP's memory space is very important, because the DSP accesses different kinds of storage at different speeds: on-chip registers are the fastest, and on-chip RAM is faster than off-chip RAM. Sensible allocation and use of memory therefore has a great impact on the overall efficiency of the system. Frequently accessed constant tables and code segments should be placed in on-chip RAM as far as possible; if they are too large to fit, part of them has to be placed in off-chip memory.

Memory bank conflicts must also be considered. The C6201 DSP uses an interleaved memory scheme in which the memory is divided into 4 or 8 banks. Each bank is single-ported, so only one access per bank is allowed in each cycle. Two accesses to the same bank in one cycle cause a memory stall, which halts all pipeline operations for one cycle while the second item is read from memory. The remedy is to modify the code so that accesses issued in the same cycle fall in different banks.
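
A simple way to reason about such conflicts is to compute the bank that each address falls in. The sketch below assumes interleaved banks that are each 16 bits wide and considers only halfword accesses; the bank width and both function names are assumptions made for illustration and are not stated in the article.

```c
#include <stdint.h>

/* Assumed bank width for illustration: 16 bits (2 bytes). */
#define BANK_WIDTH_BYTES 2u

/* Bank index of a byte address, for num_banks interleaved banks. */
unsigned bank_of(uint32_t byte_addr, unsigned num_banks)
{
    return (byte_addr / BANK_WIDTH_BYTES) % num_banks;
}

/* Nonzero when two halfword accesses issued in the same cycle
   would hit the same single-ported bank. */
int accesses_conflict(uint32_t addr_a, uint32_t addr_b, unsigned num_banks)
{
    return bank_of(addr_a, num_banks) == bank_of(addr_b, num_banks);
}
```

Under these assumptions and with 4 banks, addresses 0x0 and 0x8 collide while 0x0 and 0x2 do not, so offsetting one of two arrays that are read in the same cycle by one bank width would be enough to remove the stall.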

3.5 Other optimization methods

In addition, there are some more basic methods, such as:

·To improve the efficiency of the algorithm and reduce its run-time cost, parameters that would otherwise be computed at run time should be turned into lookup tables or constants wherever possible, converting run-time computation into compile-time computation. This applies not only to fairly regular parameter tables but also to irregular run-time calculations; particularly time-consuming operations (such as floating-point division) should be tabulated wherever possible.

·Convert floating-point arithmetic to fixed-point. When an MPEG-4 algorithm is first written for simulation, the C code generally mixes integer and floating-point numbers for convenience. Because the chip used here is a fixed-point device, all floating-point operations must be converted to fixed-point operations.

·Use word accesses to read two 16-bit values at once, placing them in the upper and lower 16-bit halves of a 32-bit register. This doubles the rate at which the program reads data and greatly improves execution efficiency.

·Use shift instructions instead of multiplications and divisions. A shift takes only one clock cycle, which saves many clock cycles compared with multiplication and division.
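
Three of the points above can be sketched in a few lines of C. The Q15 format (16-bit values with 15 fractional bits) is one common fixed-point convention and is assumed here; the article does not say which format it used, and all function names are hypothetical. Saturation is omitted for brevity, and the code assumes that right-shifting a negative integer is an arithmetic shift, as it is on common compilers.

```c
#include <stdint.h>

#define Q15_ONE 32768   /* 1.0 in Q15: x in [-1, 1) is stored as x * 32768 */

/* Offline conversion, e.g. when building a constant table. */
int16_t q15_from_double(double x)
{
    double scaled = x * Q15_ONE;
    return (int16_t)(scaled >= 0 ? scaled + 0.5 : scaled - 0.5);
}

/* Fixed-point multiply: the Q30 product is rounded and shifted back to Q15. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;
    return (int16_t)((p + (1 << 14)) >> 15);
}

/* Pack two 16-bit values into one 32-bit word so a single word
   access moves both of them. */
uint32_t pack16(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

int16_t unpack_hi(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }
int16_t unpack_lo(uint32_t w) { return (int16_t)(uint16_t)(w & 0xFFFFu); }

/* Multiplication and division by powers of two as shifts. */
unsigned mul_by_32(unsigned x) { return x << 5; }
unsigned div_by_8(unsigned x)  { return x >> 3; }
```

The multiplication by 32 as a left shift by 5 is the same pattern the linear assembly in section 3.3 uses to compute row offsets.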

The original C code ran on the EVM board at only 0.8 frames/second. After the source program was optimized with the methods above, real-time MPEG-4 encoding was achieved on the C6201 EVM board at a processing speed of 30 frames/second.

DSP chips are being used in an ever wider range of applications. In mobile communications in particular, new technologies such as software radio and smart antennas require the support of powerful digital signal processing, and the TMS320C6000 series can meet this need. This article has described software optimization methods for the TMS320C6000, using its application to MPEG-4 encoding as an example. The work inevitably has shortcomings that remain to be explored further.
