Software Optimization Design of MPEG-4 ASP Video Encoder-EEWORLD


Introduction

The MPEG-4 SP (Simple Profile) encoder has attracted wide attention for its outstanding compression efficiency and image quality, and many PC-based codecs (such as DivX and Xvid) have emerged and are widely used in distance education and high-definition movies. The ASP (Advanced Simple Profile) encoder included in version 2.0 of the MPEG-4 standard, released in 2001, adds new tools on top of SP and further improves compression efficiency, which makes it better suited to embedded applications such as wireless video communication and digital cameras.

1 Introduction to the hardware platform TMS320C6416

The experimental hardware platform is the TMS320C6416 DSK (DSP Starter Kit). Its core processor is the C6416, a high-performance 32-bit fixed-point DSP from TI. It is based on the second-generation VelociTI.2 VLIW architecture, with 64 32-bit registers and 8 highly independent functional units (2 multipliers and 6 arithmetic logic units). At a clock frequency of 600 MHz its peak processing rate reaches 4800 MIPS. The C6416 has 1 MB of on-chip memory and a two-level cache structure. L1P and L1D, which connect directly to the CPU, run at the CPU clock rate, while the L2 cache has 5 configuration modes so that its size can be set according to actual needs. The C6416 also has 64 independent EDMA channels, which can move large amounts of data in the background without CPU involvement. The DSK integrates 16 MB of SDRAM, which can be configured as cacheable to improve access efficiency.

2 MPEG-4 ASP video coding

In 2001, the Moving Picture Experts Group (MPEG) added some new tools and frameworks to its newly released V2.0 version, including ASP. On the basis of SP, ASP adds support for B-VOP, 1/4 pixel precision motion vector, optional quantizer, global motion compensation GMC, etc., further improving compression efficiency.
(1) B-VOP uses bidirectional prediction to improve the efficiency of motion compensation, that is, each block or macroblock can be obtained by weighting forward and backward prediction.
(2) 1/4 pixel motion vector: Before motion estimation and compensation, the reference VOP is first interpolated at the 1/2 pixel position and then at the 1/4 position. Although this increases the complexity of motion estimation, motion compensation and image reconstruction, the coding efficiency is improved compared to the SP encoder.
(3) Optional quantizer: ASP provides an optional inverse quantization method in which the quantized coefficients FQ(u, v) are inverse quantized to coefficients F(u, v) as follows: if FQ(u, v) = 0 then F(u, v) = 0; otherwise F(u, v) = [(2 × FQ(u, v) + k) × WW(u, v) × QP] / 16, where WW is an 8 × 8 weighting matrix and QP is the quantizer parameter. Through WW, the encoder can vary the quantization step size according to the position of the coefficient within the block.
(4) Global Motion Compensation (GMC): Macroblocks in the same video object (VO) may experience similar motion, such as linear movement caused by zooming and rotating the camera lens, and some of the macroblocks may move in the same direction. The encoder with GMC only needs to send a small number of motion parameters to describe this "global" motion for the entire VOP. Therefore, when a considerable number of macroblocks in a VOP have the same motion characteristics, GMC can significantly improve compression efficiency.
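
The weighted inverse quantization in item (3) can be sketched in C as follows. This is only an illustration of the formula as written above, not the encoder's actual routine: the function name `dequant_weighted` and the `intra` flag are hypothetical, and because the text does not define k, the sketch assumes the usual convention of k = 0 for intra blocks and k = sign(FQ) for inter blocks. Saturation, mismatch control and special handling of the intra DC coefficient are left out.

```c
#include <stdint.h>

/* Sign of a quantized level: -1, 0 or +1. */
static int sign_of(int v)
{
    return (v > 0) - (v < 0);
}

/*
 * Inverse-quantize one 8x8 block with a weighting matrix.
 *   fq    : quantized coefficients FQ(u, v)
 *   ww    : weighting matrix WW(u, v)
 *   qp    : quantizer parameter QP
 *   intra : nonzero for intra blocks (ASSUMED k = 0),
 *           zero for inter blocks (ASSUMED k = sign(FQ))
 *   f     : output coefficients F(u, v)
 */
void dequant_weighted(const int16_t fq[64], const uint8_t ww[64],
                      int qp, int intra, int16_t f[64])
{
    int i;

    for (i = 0; i < 64; i++) {
        int level = fq[i];

        if (level == 0) {
            f[i] = 0;
        } else {
            int k = intra ? 0 : sign_of(level);
            /* Integer division in C truncates toward zero. */
            f[i] = (int16_t)(((2 * level + k) * ww[i] * qp) / 16);
        }
    }
}
```

With a flat matrix (every WW entry equal to 16) the step size is the same at every position; larger entries at high-frequency positions quantize those coefficients more coarsely.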

3 Software porting and optimization

Because a DSP differs from an ordinary PC environment, simply compiling the PC code for the DSP results in very low efficiency or even failure to run. The code must be ported, rewritten and optimized to suit the characteristics of the DSP before it can meet real-time requirements.

3.1 Software porting
To make the code suitable for the DSP platform, first delete the large amount of debug output (such as printf calls) in the program, and use puts for any output that is still necessary to reduce function overhead; redefine double data as long; delete unnecessary floating-point operations (such as the PSNR calculation), and implement the floating-point operations that remain with scaled fixed-point arithmetic.
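
As an illustration of replacing a floating-point operation with scaled fixed-point arithmetic, here is a minimal sketch using the common Q15 format (a value x is stored as the integer x × 32768). The format choice and the names `q15_mul` and `scale_pixel_q15` are assumptions made for the example; the article does not say which scaling the encoder uses.

```c
#include <stdint.h>

/*
 * Multiply two Q15 numbers with rounding; the result is Q15.
 * Note: the single case (-1.0) * (-1.0) overflows Q15 and is not handled.
 * The right shift of a negative product relies on arithmetic shift,
 * which is what mainstream compilers provide.
 */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;              /* Q30 product */
    return (int16_t)((p + (1 << 14)) >> 15); /* round, back to Q15 */
}

/* Scale an 8-bit pixel by a non-negative Q15 gain in [0, 1). */
uint8_t scale_pixel_q15(uint8_t pixel, int16_t gain_q15)
{
    int32_t p = (int32_t)pixel * gain_q15;
    return (uint8_t)((p + (1 << 14)) >> 15);
}
```

For example, 0.5 is stored as 16384, so `q15_mul(16384, 16384)` yields 8192, which is 0.25 in Q15, using only an integer multiply, add and shift.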

3.2 Memory optimization
The C6416 DSP has 1 MB of on-chip memory that can be accessed at the full CPU clock frequency. The DSK carries 16 MB of SDRAM, which is accessed at 100 MHz through EMIFA. Because of this difference in access speed, the pipeline stalls for several cycles whenever the CPU addresses external memory. Making good use of the on-chip memory and the two-level cache structure of the C6416 is therefore a critical factor. The 1 MB of on-chip memory is divided into 256 KB of L2 cache and 768 KB of L2 SRAM. The code segments, global data and so on are placed in the on-chip L2 SRAM, and the external SDRAM is set to cacheable to improve access efficiency. These settings can be made by calling CSL (Chip Support Library) functions:

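#include <csl.h>
#include <csl_cache.h>
CSL_init();                             /* initialize the Chip Support Library */
CACHE_enableCaching(CACHE_EMIFA_CE00);  /* make the external SDRAM space cacheable */
CACHE_setL2Mode(CACHE_256KCACHE);       /* 256 KB L2 cache, remaining 768 KB as L2 SRAM */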

3.3 Project-level optimization
TI's integrated development environment CCS provides a series of compiler optimization options that can be selected according to the performance required of the code. The best combination of options (-mw, -pm, -o3, -mt, etc.) is found by repeatedly trying different combinations, which can be done through the PBC option of CCS 2.20. In addition, arranging the order in which code sections are linked can reduce the cache misses caused by function calls during execution and improve the execution efficiency of the program.

3.4 Code optimization
Code optimization is an important part of developing the MPEG-4 ASP video encoder software. Unoptimized code runs very slowly on the DSK platform, taking about 25 seconds on average to encode one frame, whereas real-time operation requires 25 frames per second or more.

(1) Use TI library functions
TI provides an image processing function library, IMGLIB, whose functions can be called to perform the FDCT and IDCT transforms.

(2) Rewrite the C code
First, the loops in the program are split and unrolled. For loops that cannot be unrolled, the inner and outer loops are rearranged so that the software pipeline is used as efficiently as possible. The C6000 compiler also provides many intrinsics that map directly to the corresponding assembly instructions and improve the efficiency of the program. In addition, pragma directives can give the compiler prior knowledge that helps it generate better code. For example, #pragma MUST_ITERATE(minimum value, maximum value, factor) tells the compiler how many times a loop executes, so that it can apply data packing and other optimizations. Take the dev16 function, which calculates the mean deviation of the pixels in a macroblock, as an example. After rewriting it with the methods above, its execution time drops from 277 cycles to 130 cycles (both with -o3), a reduction of more than 50%.
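
A minimal sketch of how such a loop hint might look, using a dev16-style routine as the example. This is not the encoder's actual dev16 code: the body below is a plain reference version written for illustration, and it assumes the pragma in question is TI's MUST_ITERATE. The pragma is guarded by `_TMS320C6X`, the macro predefined by the C6000 compiler, so the same source also builds on a PC.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Sum of absolute deviations from the mean over a 16x16 macroblock.
 *   cur    : top-left pixel of the macroblock
 *   stride : row pitch of the image in bytes
 */
uint32_t dev16(const uint8_t *cur, uint32_t stride)
{
    uint32_t sum = 0;
    uint32_t dev = 0;
    int mean;
    int x, y;

    /* Pass 1: accumulate all 256 pixels to get the mean. */
    for (y = 0; y < 16; y++) {
        const uint8_t *row = cur + y * stride;
#ifdef _TMS320C6X
#pragma MUST_ITERATE(16, 16, 16)  /* exactly 16 trips, a multiple of 16 */
#endif
        for (x = 0; x < 16; x++)
            sum += row[x];
    }
    mean = (int)(sum / 256);

    /* Pass 2: accumulate |pixel - mean|. */
    for (y = 0; y < 16; y++) {
        const uint8_t *row = cur + y * stride;
#ifdef _TMS320C6X
#pragma MUST_ITERATE(16, 16, 16)
#endif
        for (x = 0; x < 16; x++)
            dev += (uint32_t)abs((int)row[x] - mean);
    }
    return dev;
}
```

Knowing the exact trip count lets the compiler drop the code that handles short or odd-length loops and process several pixels per iteration.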

(3) Rewrite the linear assembly language
Linear assembly is a programming language between C and assembly language that is tailored to the architecture of the C6000. The code it produces can reach more than 90% of the efficiency of hand-written assembly. In addition, the C64x series DSPs add many special instructions for image and video applications, which makes code for these applications more efficient. For example, in the ASP video encoder the avgu4, shrmb, unpklu4 and unpkhu4 instructions are used for half-pixel interpolation, the dotpu4 and subabs4 instructions are used for calculating the SAD, and the SPACK2 instruction is used for image reconstruction. Some instructions also make the code easier to write: the LDNDW instruction, used to read pixel values from the reference frame during ME (Motion Estimation), solves the problem that data in the reference image is not double-word aligned. The following is the function transfer_16to8copy() after rewriting in linear assembly. With the same -o3 option, the linear assembly code needs only 15.8% of the instruction cycles of the C code. Table 1 compares the performance of some routines before and after rewriting (with the same -o3 optimization option).

        .global _transfer_16to8copy
_transfer_16to8copy: .cproc dst, src, stride
        .reg    pdst, psrc, count
        .reg    ahi:alo, bhi:blo, chi:clo
        mvk     8, count
        mv      dst, pdst
        mv      src, psrc
loop:   .trip   8, 8
        lddw    *psrc, ahi:alo
        spacku4 ahi, alo, blo        ; keep the value in the range 0 - 255
        lddw    *+psrc(8), chi:clo
        spacku4 chi, clo, bhi
        stdw    bhi:blo, *pdst
        add     pdst, stride, pdst
        add     psrc, 16, psrc
[count] sub     count, 1, count
[count] b       loop
        .endproc
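
For comparison, a plain C version of the same operation might look like the sketch below. It is a reference written for illustration, not the original C code that was measured; the name `transfer_16to8copy_c` is hypothetical. It converts an 8 × 8 block of 16-bit values to 8-bit pixels, clamping each value to 0 - 255, which is the work the saturating pack instruction does four values at a time in the linear assembly.

```c
#include <stdint.h>

/*
 * Copy an 8x8 block of 16-bit values into an 8-bit image.
 *   dst    : top-left destination pixel
 *   src    : 64 contiguous 16-bit values, row by row
 *   stride : row pitch of the destination image in bytes
 */
void transfer_16to8copy_c(uint8_t *dst, const int16_t *src, uint32_t stride)
{
    int x, y;

    for (y = 0; y < 8; y++) {
        for (x = 0; x < 8; x++) {
            int v = src[y * 8 + x];

            /* Saturate to the unsigned 8-bit range. */
            if (v < 0)
                v = 0;
            else if (v > 255)
                v = 255;
            dst[y * stride + x] = (uint8_t)v;
        }
    }
}
```

The scalar version needs a load, two compares and a store per pixel, whereas the linear assembly handles a whole row of eight pixels with two double-word loads, two packs and one double-word store.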

3.5 Data transfer optimization
Due to the limited on-chip storage space, the reference image and reconstructed image data can only be placed in the external SDRAM, but this also leads to huge overhead when accessing the external memory. The EDMA and QDMA of C64x only need to spend a few clock cycles to initialize the parameters, and then they can perform high-speed data transfer operations in the CPU background, which improves the program execution efficiency. For simple data transfer, the DAT function provided by the CSL library can be used. Taking a simple 2D data transfer as an example, the implementation code after using QDMA is given:
DAT_open(DAT_CHAANY, DAT_PRI_LOW, DAT_OPEN_2D);
unsigned int transferID = DAT_copy2d(DAT_2D2D, con, ref, 16, 16, width);
DAT_wait(transferID);
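
To make the parameters of the 2D transfer concrete, the sketch below is a host-side C model of what a 16 × 16 block copy between two images does. It is not CSL code: `copy2d_model` is a hypothetical helper, and it assumes that in a 2D-to-2D transfer the same line pitch applies to both source and destination. On the DSP the QDMA performs this work in the background while the CPU continues with other processing.

```c
#include <stdint.h>
#include <string.h>

/*
 * Model of a 2D-to-2D block copy.
 *   dst, src   : top-left byte of the block in each image
 *   line_len   : bytes copied per line (16 for a macroblock)
 *   line_cnt   : number of lines (16 for a macroblock)
 *   line_pitch : distance in bytes between the starts of consecutive
 *                lines (the image width); ASSUMED equal for src and dst
 */
void copy2d_model(uint8_t *dst, const uint8_t *src,
                  unsigned line_len, unsigned line_cnt, unsigned line_pitch)
{
    unsigned y;

    for (y = 0; y < line_cnt; y++)
        memcpy(dst + y * line_pitch, src + y * line_pitch, line_len);
}
```

Only `line_len` bytes of each line are touched, so pixels outside the block in the destination image are left unchanged.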

For complex data transfer, multi-channel EDMA can be used to implement it. EDMA provides a linking and chaining mechanism. After part of the data is moved, the EDMA link or channel parameters are automatically updated and loaded without CPU intervention, which is particularly suitable for large-scale data movement. However, it should be noted that since the data to be moved in SDRAM has a copy in L2CACHE, before data movement, consistency operations (Coherence Operations) need to be performed on the data to be moved in L2CACHE and SDRAM, otherwise the correct result will not be obtained.

4 Experimental Results and Analysis
The MPEG-4 video encoder was run on the C6416 DSK using the software optimization methods described above. To obtain coding statistics such as the peak signal-to-noise ratio (PSNR), the calc_psnr() function was temporarily added back to the code so that the ASP encoder and the SP encoder could be compared. Taking the foreman sequence in CIF format (352 × 288) as an example, at an encoding bit rate of 256 kbit/s, the SP encoder was compared with ASP encoders that enable GMC, QPEL and B-VOP individually and with an ASP encoder that enables all three tools together (the SP coding pattern is "IPPPP...", and when ASP uses B-VOP it is "IBBPBBPBBP...").
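
The integer part of a PSNR measurement can be sketched as follows. This is not the encoder's calc_psnr(): the function `frame_sse` is a hypothetical helper that accumulates the sum of squared errors (SSE) between an original and a reconstructed luminance plane. For 8-bit video, PSNR then follows as 10 × log10(255² × N / SSE) for N pixels, and that single floating-point step is the part removed from the DSP build during porting.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sum of squared differences between two 8-bit planes.
 *   orig, recon : original and reconstructed pixels
 *   count       : number of pixels (352 * 288 for a CIF luminance plane)
 * A 64-bit accumulator is used so that large frames cannot overflow.
 */
uint64_t frame_sse(const uint8_t *orig, const uint8_t *recon, size_t count)
{
    uint64_t sse = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        int d = (int)orig[i] - (int)recon[i];
        sse += (uint64_t)(d * d);
    }
    return sse;
}
```

An SSE of zero means the two planes are identical, in which case PSNR is undefined (infinite) and should be handled separately.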

Table 2 gives the length of the encoded file obtained. It can be seen that the ASP encoder requires less storage space than the SP encoder, and the image quality does not change much, so it is more suitable for embedded applications such as digital cameras.

Figure 1 compares the ASP encoder (supporting B-VOP, GMC and QPEL) with the SP encoder. It can be seen that the former has a flatter PSNR performance than the latter, a smaller mean square error, and a more stable image quality.

Figure 1 PSNR performance comparison of foreman sequence ASP and SP video encoders

Although compression efficiency improves, the amount of computation increases. In addition, because B-VOP coding adds backward prediction, the encoding delay increases and the frame rate decreases.

5 Conclusion

The ASP video encoder achieves higher compression efficiency, and although its encoding speed is lower and its delay is longer, it can still encode in real time on the DSP. It is therefore well suited to applications with limited storage capacity (such as digital cameras and video surveillance networks).

Keywords: MPEG-4
