Abstract: Drawing on the structure and characteristics of the TMS320C6201 EVM development board, this article describes the software optimization work carried out to implement real-time MPEG-4 video encoding.
Keywords: TMS320C6201; MPEG-4; optimization; parallel processing
The TMS320C6201 is a recently launched parallel-processing digital signal processor from TI. Its peak processing power is 1600 MIPS, that is, 1.6 billion instructions per second. At the time of writing it is the fastest and most powerful DSP chip on the market, and its application prospects are broad. This article uses the C6201 evaluation module (EVM) board to implement real-time MPEG-4 encoding in software. The key modules of MPEG-4 video encoding are discussed in detail, and software optimizations tailored to the instruction architecture of the TMS320C6000 are studied in depth.
1 Introduction to development tools
The evaluation tool used by the author is TI's C6XEVM[2]. Its structure is shown in Figure 1.
In addition to the DSP core, the C6XEVM provides the following resources: one 64K×32-bit, 133 MHz synchronous burst static RAM (SBSRAM); two banks of 1M×32-bit, 100 MHz synchronous dynamic RAM (SDRAM); embedded JTAG emulation support through PCI or an external XDS510; a stereo 16-bit audio codec with sampling rates from 5.5 kHz to 48 kHz; on-board switching voltage regulators supplying 1.8 V/2.5 V and 3.3 V DC; an on-board linear voltage regulator for the 5 V analog supply; and three LED indicators (one for power, two user-defined).
SBSRAM is mapped to the CE0 memory space of the DSP and is used for program booting. It normally runs at 133 MHz. With the full-speed interface the SBSRAM clock equals the CPU clock; with the half-speed interface the SBSRAM runs at half the CPU clock.
The EVM provides two banks of 1M×32-bit SDRAM. Each bank consists of two 512K×2 banks×16-bit devices. The two banks are mapped to the CE2 and CE3 memory spaces of the DSP, and each space uses a 16Mbit address space. The SDRAM typically runs at half the CPU clock speed.
The asynchronous expansion memory connector on the EVM allows memory or memory-mapped devices to be attached on a daughter board. The expansion memory interface is mapped to the lower 3M bytes of the DSP's 4M-byte asynchronous CE1 memory space. Within CE1 the expansion space occupies addresses 0x1000000 to 0x12FFFFF in MAP0 mode and 0x1400000 to 0x16FFFFF in MAP1 mode. The top 1M bytes of CE1 can be allocated to on-board peripherals. This division of the CE1 memory space allows on-board devices and expansion devices to coexist.
2 MPEG-4 video encoding
MPEG-4 encoding is based on VOP encoding [3]. A VOP (video object plane) is the instance of a video object (VO) at a given moment. The block diagram of the VOP encoder is shown in Figure 2.
The encoder consists of two main parts: a shape encoder, and a conventional motion estimation/compensation and texture encoder. A VOP can be coded with intra-frame coding (Intra-VOP, or I-VOP) or with inter-frame predictive coding (Inter-VOP). Inter-frame predictive coding is further divided into forward causal prediction (P-VOP) and bidirectional non-causal prediction (B-VOP); it removes the temporal redundancy in the video. To encode a VOP, it is first divided from top to bottom into 16×16 macroblocks (MB). Shape, motion and texture coding all operate on macroblocks, so the information for one MB is the combination of its shape, motion and texture (Shape-Motion-Texture). Each MB is encoded as four 8×8 blocks of the luminance (Y) component plus two 8×8 blocks, one each for the chrominance components Cr and Cb. Each of the six blocks then undergoes an 8×8 two-dimensional DCT, quantization and Huffman coding.
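The division of a macroblock's luminance data into four 8×8 blocks can be sketched in portable C as follows. This is only an illustration, not the article's encoder: the function name extract_luma_block, the flat row-major macroblock buffer, and the block order (0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right) are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define MB_SIZE    16  /* a macroblock holds 16x16 luminance samples */
#define BLOCK_SIZE 8   /* each block passed to the DCT is 8x8 */

/* Copy luminance block `blk` (0..3) out of a 16x16 macroblock stored
 * row-major in `mb` (256 bytes). Hypothetical helper; the block order
 * 0=top-left, 1=top-right, 2=bottom-left, 3=bottom-right is assumed. */
void extract_luma_block(const uint8_t *mb, int blk,
                        uint8_t out[BLOCK_SIZE][BLOCK_SIZE])
{
    assert(blk >= 0 && blk < 4);

    int row0 = (blk / 2) * BLOCK_SIZE;  /* first row of the block in the MB */
    int col0 = (blk % 2) * BLOCK_SIZE;  /* first column of the block in the MB */

    for (int r = 0; r < BLOCK_SIZE; r++)
        for (int c = 0; c < BLOCK_SIZE; c++)
            out[r][c] = mb[(row0 + r) * MB_SIZE + (col0 + c)];
}
```

Each extracted block would then be handed to the 8×8 DCT, quantization and entropy coding stages; the two chrominance blocks are already 8×8 and need no such split.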
The test images are in QCIF format (176×144 pixels), and the image data is read from the host. The share of clock cycles consumed by each module, as measured with CCS, is as follows:
The proportions of each part in the total operation volume are:
The analysis shows that the motion estimation and compensation modules and the texture coding module are the main bottlenecks in the MPEG-4 implementation. The optimization work therefore concentrates on these two parts.
3 Program optimization considerations
To exploit the full computing power of the TMS320C6201, optimization must start from its hardware architecture: make full use of the eight functional units, use software pipelining, and arrange for the program to execute in parallel without resource conflicts wherever possible. The advantage of parallel execution is that operations with no dependence on one another can be completed at the same time, as far as CPU resources allow. That advantage is lost where there are data dependencies or frequent tests and branches. Loop bodies generally meet the conditions for parallel processing, and they usually account for most of a program's running time, so optimization should focus on loop bodies.
3.1 Optimization of jump instructions
Most DSP instructions are single-cycle, but branch instructions cost more clock cycles: each branch has 5 delay slots. Branches are therefore expensive in performance terms, and the number of branches in the program should be reduced as far as possible.
In fact, analysis of the program shows that many conditional branches can be replaced by a simple combination of conditions. Take the following fragment:
if(rcoeff[i]>(lim-1)) rcoeff[i]=(lim-1);
else if(rcoeff[i]<(-lim)) rcoeff[i]=(-lim);
It can be changed to:
rcoeff[i]=MIN(rcoeff[i],(lim-1));
rcoeff[i]=MAX(rcoeff[i],(-lim));
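The equivalence of the two forms can be checked with a small sketch. The MIN and MAX macros are not defined in the article; the usual ternary definitions are assumed here, and the function names clamp_branch and clamp_minmax are hypothetical. The two forms agree only when -lim <= lim-1, that is, when lim >= 1, which the sketch asserts.

```c
#include <assert.h>

/* Assumed definitions; the article does not show its MIN/MAX macros. */
#define MIN(a, b) (((a) < (b)) ? (a) : (b))
#define MAX(a, b) (((a) > (b)) ? (a) : (b))

/* Branching form, as in the original fragment. */
int clamp_branch(int v, int lim)
{
    if (v > (lim - 1)) v = (lim - 1);
    else if (v < (-lim)) v = (-lim);
    return v;
}

/* MIN/MAX form: two selects and no if/else chain.
 * Equivalent to clamp_branch only for lim >= 1. */
int clamp_minmax(int v, int lim)
{
    assert(lim >= 1);
    v = MIN(v, (lim - 1));
    v = MAX(v, (-lim));
    return v;
}
```

Whether the ternary expressions actually compile to conditional (predicated) instructions rather than branches depends on the compiler and its options, so the gain should be confirmed by measurement.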
Another common way to reduce conditional branches is to unroll loops. In nested loops in particular, if the outer loop has few iterations, it can be unrolled so that the inner loops are written out directly and the branch conditions are merged, reducing the coupling between loop levels.
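As a minimal sketch of loop unrolling, the following compares a rolled sum-of-absolute-differences loop with a version unrolled by four, which executes one loop branch per four pixels instead of one per pixel. The function names and the unroll factor are assumptions chosen for illustration; they are not taken from the article's encoder.

```c
#include <assert.h>
#include <stdlib.h>

/* Rolled form: one loop test and branch per pixel. */
int sad_rolled(const unsigned char *a, const unsigned char *b, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += abs(a[i] - b[i]);
    return sum;
}

/* Unrolled by 4: one loop test and branch per four pixels.
 * Requires n to be a multiple of 4 (true for 8- and 16-pixel rows). */
int sad_unrolled4(const unsigned char *a, const unsigned char *b, int n)
{
    assert(n % 4 == 0);
    int sum = 0;
    for (int i = 0; i < n; i += 4) {
        sum += abs(a[i]     - b[i]);
        sum += abs(a[i + 1] - b[i + 1]);
        sum += abs(a[i + 2] - b[i + 2]);
        sum += abs(a[i + 3] - b[i + 3]);
    }
    return sum;
}
```

The four statements in the unrolled body do not depend on each other except through the accumulator, which gives the scheduler more independent work to place on the functional units.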
3.2 Using library functions
TI provides the IMAGE LIB [4] library for TMS320C62XX users. It contains many commonly used routines, covering DCT/IDCT, wavelet transforms, DCT quantization, adaptive filtering and more. These routines are already optimized and software-pipelined, so they run with high efficiency.
3.3 Rewriting in linear assembly
Linear assembly is a programming language specific to the TMS320C6000 that sits between high-level language and hand-written assembly. To improve performance, speed-critical sections of code can be rewritten in linear assembly. Linear assembly does not require the programmer to specify the registers used, the delay cycles of instructions, or the functional unit for each instruction; the C6201's assembly optimizer determines these automatically from the code [5]. In practice, however, it is often worth specifying the functional unit explicitly to obtain more efficient code. When using linear assembly, note the following: a loop body being optimized must not contain branches that jump out of the loop, and the loop counter should count down.
When optimizing, first determine the loop trip count: if the trip count is a variable, the optimizer cannot schedule the loop for parallel execution. Second, use word or double-word accesses wherever possible. For example, consider a small routine from motion estimation and compensation:
void MC_case_a(uchar ref[NUM_ROWS][NUM_COLS],
               uchar curr[NUM_ROWS][NUM_COLS],
               const int r_x, const int c_x,
               const int r_y, const int c_y, const int size)
{
    int m, n;
    /* copy a size x size block from the reference frame to the current frame */
    for (m = 0; m < size; m++) {
        for (n = 0; n < size; n++) {
            curr[c_x + m][c_y + n] = ref[r_x + m][r_y + n];
        }
    }
}
The corresponding linear assembly program is as follows:
            .def    _MC_case_a
            .sect   ".text"
_MC_case_a: .cproc  ref,curr,r_x,c_x,r_y,c_y,num_cols
            .reg    r_temp1,r_temp2,c_temp1,c_temp2
            .reg    p_r,p_c,np_r
            .reg    lshift,rshift,count
            .reg    r_w1,r_w2,r_w3,r_w4
            .reg    temp
            SHL     r_x,0x05,r_temp1
            SHL     c_x,0x05,c_temp1
            ADD     r_y,ref,r_temp2
            ADD     c_y,curr,c_temp2
            ADD     r_temp1,r_temp2,p_r
            ADD     c_temp1,c_temp2,p_c
            SUB     num_cols,2,num_cols
            MVK     8,count             ; The number of loops is 8
            MVK     0xFFFC,temp
            AND     p_r,temp,np_r       ; np_r = p_r rounded down to a word boundary
            AND     p_r,0x0003,rshift   ; rshift = byte offset of p_r within the word
            SUB.L   0x04,rshift,lshift
            SHL     rshift,0x03,rshift  ; convert byte offsets to bit counts
            SHL     lshift,0x03,lshift
loop:       .trip   8
            LDW     *np_r++[1],r_w1
            LDW     *np_r++[1],r_w2
            LDW     *np_r++[num_cols],r_w3
            SHRU    r_w1,rshift,r_w1
            SHL     r_w3,lshift,r_w3
            SHL     r_w2,lshift,r_w4
            SHRU    r_w2,rshift,r_w2
            OR      r_w1,r_w4,r_w1
            OR      r_w2,r_w3,r_w2
            STW     r_w1,*p_c++[1]
            STW     r_w2,*p_c++[num_cols]
            ADD     p_c,4,p_c
   [count]  SUB     count,1,count
   [count]  B       loop
            .endproc
Before optimization, the C routine took 574 clock cycles as measured in CCS (Code Composer Studio); the optimized linear assembly took 58 clock cycles, roughly a tenfold improvement.
3.4 Storage space considerations
The configuration of the DSP's storage space is very important, because the DSP accesses different kinds of storage at different speeds: on-chip registers are the fastest, and on-chip RAM is faster than off-chip RAM. Rational allocation and use of storage space therefore has a large effect on the overall efficiency of the system. Frequently accessed constant tables and code sections should be placed in on-chip RAM as far as possible; if they are too large, part of them should be placed in off-chip memory.
Memory bank conflicts must also be considered. The C6201 DSP uses an interleaved memory scheme in which the memory is divided into 4 or 8 banks. Each bank is a single-ported memory, so only one access per cycle is allowed. Two accesses to the same bank in one cycle cause a memory stall, which halts all pipeline operation for one cycle while the second value is read from memory. The solution is to modify the code segment.
3.5 Other optimization methods
In addition, there are some more basic methods, such as:
· To improve the efficiency of the algorithm and reduce its run-time cost, parameters that would otherwise be computed at run time should be turned into lookup tables or constants as far as possible, converting run-time calculation into compile-time calculation. This applies not only to fairly regular parameter tables but also to irregular run-time calculations; more time-consuming calculations in particular (such as floating-point division) should be tabulated wherever possible.
· Convert floating-point to fixed-point. For convenience, the C code of an MPEG-4 simulation algorithm generally mixes integer and floating-point numbers. Because a fixed-point chip is used, all floating-point operations must be changed to fixed-point operations.
· Use word accesses to fetch two 16-bit values at once, placing them in the upper and lower 16-bit fields of a 32-bit register. This doubles the rate at which the program reads data and greatly improves execution efficiency.
· Use shift instructions in place of multiplication and division. A shift takes only one clock cycle, which saves many clock cycles compared with multiplication and division.
The original C code ran on the EVM board at only 0.8 frames/second. After the source program was optimized with the methods above, real-time MPEG-4 encoding at 30 frames/second was achieved on the C6201 EVM board.
DSP chips are being used ever more widely, especially in mobile communications, where new technologies such as software radio and smart antennas require powerful digital signal processing. The TMS320C6000 series can meet this need. Using MPEG-4 encoding as an example, this article has described software optimization methods for the TMS320C6000. The work inevitably has shortcomings that remain to be explored further.
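Two of the basic methods listed in section 3.5, converting floating-point constants to fixed-point and replacing a division with a shift, can be sketched together as follows. The Q15 format, the helper names to_q15 and mul_q15, and the rounding choice are assumptions made for this illustration; the article does not say which fixed-point format its encoder used.

```c
#include <assert.h>
#include <stdint.h>

#define Q15_SHIFT 15                 /* Q15: 1 sign bit, 15 fraction bits */
#define Q15_ONE   (1 << Q15_SHIFT)   /* 32768 represents 1.0 */

/* Convert a constant in [-1, 1) to Q15, rounding to nearest. Intended
 * for building tables ahead of time, not for the DSP's inner loops. */
static int16_t to_q15(double x)
{
    assert(x >= -1.0 && x < 1.0);
    return (int16_t)(x * Q15_ONE + (x >= 0 ? 0.5 : -0.5));
}

/* Multiply a 16-bit sample by a Q15 coefficient using integer
 * arithmetic only. The right shift by 15 replaces a division by 32768;
 * adding half an LSB first rounds the result. For negative products
 * this relies on the compiler implementing >> on signed values as an
 * arithmetic shift, which is implementation-defined in C. */
static int16_t mul_q15(int16_t sample, int16_t coeff)
{
    int32_t prod = (int32_t)sample * coeff;
    return (int16_t)((prod + (1 << (Q15_SHIFT - 1))) >> Q15_SHIFT);
}
```

For example, scaling a sample by 0.5 becomes mul_q15(sample, to_q15(0.5)), where the coefficient can be computed once and stored in a constant table.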