Introduction
The MPEG-4 SP (Simple Profile) encoder has attracted wide attention for its compression efficiency and image quality, and it has given rise to many PC-based codecs (such as DivX and Xvid) that are widely used in distance education and high-definition movies. The ASP (Advanced Simple Profile) encoder, included in version 2.0 of the MPEG-4 standard released in 2001, adds several new tools on top of SP and further improves compression efficiency, which makes it better suited to embedded applications such as wireless video communication and digital cameras.
1 Introduction to the hardware platform TMS320C6416
The experimental hardware platform is the TMS320C6416 DSK (DSP Starter Kit). Its core processor is the C6416, a high-performance 32-bit fixed-point DSP from Texas Instruments (TI). It is based on the second-generation VelociTI.2 VLIW architecture, with 64 32-bit registers and 8 highly independent functional units (2 multipliers and 6 arithmetic logic units). At a clock frequency of 600 MHz, its peak processing speed reaches 4800 MIPS (8 instructions per cycle). The C6416 has 1 MB of on-chip memory and uses a two-level cache structure. The L1P and L1D caches, connected directly to the CPU, run at CPU speed, while the L2 cache has 5 configuration modes so that its size can be set according to actual needs. The C6416 also has 64 independent EDMA channels, which can move large amounts of data in the background without CPU involvement. The DSK board carries 16 MB of SDRAM, which can be configured as cacheable to improve access efficiency.
2 MPEG-4 ASP video coding
In 2001, the Moving Picture Experts Group (MPEG) added new tools and profiles, including ASP, to version 2.0 of the MPEG-4 standard. On top of SP, ASP adds support for B-VOPs, quarter-pixel motion vectors, an optional quantizer, and global motion compensation (GMC), further improving compression efficiency.
(1) B-VOP: A B-VOP uses bidirectional prediction to improve the efficiency of motion compensation. Each block or macroblock can be predicted as a weighted combination of a forward and a backward prediction.
(2) Quarter-pixel motion vectors: Before motion estimation and compensation, the reference VOP is interpolated first to half-pixel positions and then to quarter-pixel positions. This increases the complexity of motion estimation, motion compensation and image reconstruction, but improves coding efficiency compared with the SP encoder.
(3) Optional quantizer: ASP provides an optional inverse quantization method. The quantized coefficients FQ(u, v) are inverse quantized to produce the coefficients F(u, v) as follows: if FQ(u, v) = 0, then F(u, v) = 0; otherwise F(u, v) = [(2 × FQ(u, v) + k) × WW(u, v) × QP] / 16, where WW is an 8 × 8 matrix of weighting factors and QP is the quantizer parameter. Through WW, the encoder can vary the quantization step size according to the position of the coefficient within the block.
(4) Global motion compensation (GMC): Macroblocks in the same video object (VO) often undergo similar motion, for example when the camera pans, zooms or rotates, so that many macroblocks move in a related way. An encoder with GMC only needs to send a small number of motion parameters that describe this "global" motion for the entire VOP. When a considerable number of macroblocks in a VOP share the same motion characteristics, GMC can therefore improve compression efficiency significantly.
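The inverse quantization rule in item (3) can be sketched in C as follows. This is a minimal illustration, not the encoder's actual code. The text does not define k; the sketch assumes k = 0 for intra blocks and k = sign(FQ) for inter blocks, as in the weighted quantizer of the MPEG-4 Visual standard. The function name dequant_weighted and its parameters are hypothetical, and separate intra DC handling, saturation and mismatch control are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Returns -1, 0 or +1 according to the sign of x. */
static int sign_of(int x)
{
    return (x > 0) - (x < 0);
}

/* Weighted inverse quantization of one 8x8 block (hypothetical helper).
 * FQ: quantized coefficients, WW: weighting matrix, QP: quantizer parameter.
 * Assumption: k = 0 for intra blocks, k = sign(FQ) for inter blocks. */
void dequant_weighted(int16_t F[64], const int16_t FQ[64],
                      const uint8_t WW[64], int QP, int intra)
{
    int i;

    assert(QP >= 1 && QP <= 31);   /* MPEG-4 quantizer parameter range */

    for (i = 0; i < 64; i++) {
        if (FQ[i] == 0) {
            F[i] = 0;
        } else {
            int k = intra ? 0 : sign_of(FQ[i]);
            /* C integer division truncates toward zero. */
            F[i] = (int16_t)(((2 * FQ[i] + k) * WW[i] * QP) / 16);
        }
    }
}
```

With a flat weighting matrix (all entries 16) the step size depends only on QP. Larger entries at high-frequency positions make the quantization coarser there.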
3 Software transplantation and optimization
Because a DSP differs from an ordinary PC environment, simply compiling the PC code for the DSP results in low efficiency or even a program that fails to run. The code must be ported, rewritten and optimized to suit the characteristics of the DSP before it can meet real-time requirements.
3.1 Software transplantation
To make the code suitable for the DSP platform, first remove the large amount of debug output (such as printf calls) from the program and use puts for the output that is still necessary, which reduces function-call overhead. Replace double variables with the long type, delete unnecessary floating-point operations (such as the PSNR calculation), and implement the remaining floating-point operations in fixed-point arithmetic through scaling.
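As an illustration of replacing floating-point arithmetic by scaled integers, the sketch below multiplies by a fractional constant in Q15 format. The Q15 choice and the names TO_Q15 and mul_q15 are assumptions made for this example. The article does not say which scaling the encoder uses.

```c
#include <assert.h>
#include <stdint.h>

#define Q15_ONE 32768   /* the value 1.0 in Q15 format */

/* Convert a non-negative real constant to Q15. With a literal argument the
 * compiler folds this at build time, so no floating-point code runs on
 * the DSP. */
#define TO_Q15(x) ((int32_t)((x) * Q15_ONE + 0.5))

/* Multiply a non-negative integer by a non-negative Q15 coefficient,
 * rounding to the nearest integer. */
static int32_t mul_q15(int32_t value, int32_t coeff_q15)
{
    assert(value >= 0 && coeff_q15 >= 0);
    /* 64-bit intermediate avoids overflow of the 32-bit product. */
    return (int32_t)(((int64_t)value * coeff_q15 + Q15_ONE / 2) >> 15);
}
```

For example, mul_q15(200, TO_Q15(0.299)) evaluates 200 × 0.299 = 59.8 with integer operations only and returns 60.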
3.2 Memory optimization
The C6416 DSP has 1 MB of on-chip memory, which can be accessed at the full CPU clock frequency. The DSK carries 16 MB of SDRAM, which is accessed at 100 MHz through EMIFA. Because of this speed difference, CPU accesses to external memory stall the pipeline for several cycles. Making good use of the on-chip memory and the two-level cache structure of the C6416 is therefore critical. The 1 MB of on-chip memory is divided into 256 KB of L2 cache and 768 KB of L2 SRAM. The code sections, global data, etc. are placed in the on-chip L2 SRAM, and the external SDRAM is set to cacheable to improve access efficiency. These settings can be made by calling CSL (Chip Support Library) functions:
#include <csl.h>
#include <csl_cache.h>
CSL_init();                             /* initialize the CSL before any other CSL call */
CACHE_enableCaching(CACHE_EMIFA_CE00);  /* mark the EMIFA CE0 region (SDRAM) as cacheable */
CACHE_setL2Mode(CACHE_256KCACHE);       /* 256 KB L2 cache, remaining 768 KB as L2 SRAM */
3.3 Project-level optimization
TI provides a set of compiler optimization options for its integrated development environment CCS, which can be selected according to the performance requirements of the code. A good combination of options (-mw, -pm, -o3, -mt, etc.) is found by repeatedly trying and comparing combinations, which can be done through the PBC option of CCS 2.20. In addition, arranging the link order of the code sections appropriately reduces the cache misses caused by function calls during execution and improves the efficiency of the program.
3.4 Code optimization
Code optimization is an important part of developing the MPEG-4 ASP video encoder software. Unoptimized code runs very inefficiently on the DSK platform, taking about 25 seconds on average to encode one frame, whereas real-time operation requires 25 frames per second or more.
(1) Use TI library functions
TI provides an image processing function library, IMGLIB, whose functions can be called to perform the FDCT and IDCT transforms.
(2) Rewrite the C code
First, the loops in the program are split and unrolled. For loops that cannot be unrolled, the inner and outer loops are arranged so that the software pipeline is used as efficiently as possible. The C6000 compiler also provides many intrinsics, which map directly to the corresponding assembly instructions and improve the efficiency of the program. In addition, pragma directives can give the compiler prior knowledge that helps it optimize. For example, #pragma MUST_ITERATE(minimum value, maximum value, factor) tells the compiler how many times a loop executes, so that it can apply data packing and other techniques. Take the dev16 function, which calculates the mean deviation of the pixels in a macroblock, as an example. After rewriting it with the methods above, its execution time drops from 277 cycles to 130 cycles (both with -o3), a reduction of more than 50%.
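The sketch below shows what such a rewrite might look like for a dev16-style function. It is written for illustration and is not the encoder's actual code: the name dev16_sketch, the unroll factor and the pragma arguments are assumptions. The TI compiler's MUST_ITERATE pragma is guarded so that the code also builds with other compilers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute deviations from the mean over a 16x16 macroblock. */
uint32_t dev16_sketch(const uint8_t *cur, uint32_t stride)
{
    uint32_t sum = 0, dev = 0;
    int32_t mean;
    int i, j;

    assert(stride >= 16);

    /* Pass 1: accumulate the pixel sum; inner loop unrolled by hand. */
#ifdef __TI_COMPILER_VERSION__
#pragma MUST_ITERATE(16, 16, 16)   /* the loop always runs exactly 16 times */
#endif
    for (j = 0; j < 16; j++) {
        const uint8_t *row = cur + j * stride;
        for (i = 0; i < 16; i += 4)
            sum += row[i] + row[i + 1] + row[i + 2] + row[i + 3];
    }
    mean = (int32_t)(sum / 256);

    /* Pass 2: accumulate the absolute deviation from the mean. */
#ifdef __TI_COMPILER_VERSION__
#pragma MUST_ITERATE(16, 16, 16)
#endif
    for (j = 0; j < 16; j++) {
        const uint8_t *row = cur + j * stride;
        for (i = 0; i < 16; i++)
            dev += (uint32_t)abs((int32_t)row[i] - mean);
    }
    return dev;
}
```

Fixed trip counts and the guarantee given by the pragma let the compiler software-pipeline the loops without generating code for the general case.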
(3) Rewrite in linear assembly
Linear assembly is a programming language between C and assembly language, designed around the architecture of the C6000. Code written in it can reach more than 90% of the efficiency of hand-written assembly. The C64x series DSPs also add many special instructions for image and video applications, which make such code more efficient. In the ASP video encoder, for example, the avgu4, shrmb, unpklu4 and unpkhu4 instructions are used for half-pixel interpolation, the dotpu4 and subabs4 instructions for calculating the SAD, and the SPACK2 instruction for image reconstruction. Some instructions also simplify coding: the LDNDW instruction, used to read pixel values from the reference frame during ME (motion estimation), solves the problem that data in the reference image is not double-word aligned. The following is the function transfer_16to8copy() rewritten in linear assembly. With the same -o3 option, the linear assembly version needs only 15.8% of the instruction cycles of the C version. Table 1 compares the performance of several functions before and after rewriting (with the same -o3 optimization option).
        .global _transfer_16to8copy
_transfer_16to8copy: .cproc dst, src, stride
        .reg pdst, psrc, count
        .reg ahi:alo, bhi:blo, chi:clo
        mvk 8, count                  ; 8 rows per block
        mv dst, pdst
        mv src, psrc
loop:   .trip 8, 8
        lddw *psrc, ahi:alo           ; load coefficients 0-3 of the row
        spacku4 ahi, alo, blo         ; keep the value in the range 0 - 255
        lddw *+psrc(8), chi:clo       ; load coefficients 4-7 of the row
        spacku4 chi, clo, bhi
        stdw bhi:blo, *pdst           ; store 8 packed pixels
        add pdst, stride, pdst        ; next destination row
        add psrc, 16, psrc            ; next source row (8 x 16 bits)
[count] sub count, 1, count
[count] b loop
        .endproc
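For comparison, the operation performed by the routine above can be written as a plain C reference: each signed 16-bit coefficient of an 8 × 8 block is clamped to 0 to 255 and stored as an 8-bit pixel. The name transfer_16to8copy_ref is hypothetical and the listing is an illustrative sketch, not the C code that was measured in the article.

```c
#include <assert.h>
#include <stdint.h>

/* Copy an 8x8 block of 16-bit values to 8-bit pixels with saturation.
 * src is a contiguous 8x8 block; dst rows are stride bytes apart. */
void transfer_16to8copy_ref(uint8_t *dst, const int16_t *src, uint32_t stride)
{
    uint32_t i, j;

    assert(stride >= 8);

    for (j = 0; j < 8; j++) {
        for (i = 0; i < 8; i++) {
            int16_t pixel = src[j * 8 + i];
            if (pixel < 0)
                pixel = 0;            /* clamp negative values to 0 */
            else if (pixel > 255)
                pixel = 255;          /* clamp overflow to 255 */
            dst[j * stride + i] = (uint8_t)pixel;
        }
    }
}
```

The linear assembly version replaces the per-pixel compare and branch with one saturating pack instruction per four pixels and one double-word store per row.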
3.5 Data transfer optimization
Because on-chip memory is limited, the reference and reconstructed image data must be placed in the external SDRAM, which makes every access to them expensive. The EDMA and QDMA of the C64x need only a few clock cycles to set up their parameters and can then transfer data at high speed in the background while the CPU continues to work, which improves the efficiency of the program. For simple transfers, the DAT functions provided by the CSL library can be used. Taking a simple 2D transfer as an example, the code using QDMA is:
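Uint32 transferID;
DAT_open(DAT_CHAANY, DAT_PRI_LOW, DAT_OPEN_2D);              /* returns a success flag, not a transfer ID */
transferID = DAT_copy2d(DAT_2D2D, con, ref, 16, 16, width);  /* 16 lines of 16 bytes, line pitch = width */
DAT_wait(transferID);                                        /* block until this transfer has completed */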
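To make the meaning of the transfer parameters concrete, the sketch below emulates a 2D-to-2D block copy on the CPU. It assumes that line length, line count and line pitch are given in bytes and that source and destination share the same pitch. The function copy2d_2d2d_sketch is hypothetical and only illustrates the addressing. It is not the CSL implementation, which hands the work to the DMA hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* CPU emulation of a 2D-to-2D block transfer.
 * line_len:   bytes copied per line
 * line_cnt:   number of lines
 * line_pitch: bytes between the starts of consecutive lines */
void copy2d_2d2d_sketch(const uint8_t *src, uint8_t *dst,
                        uint16_t line_len, uint16_t line_cnt,
                        uint16_t line_pitch)
{
    uint16_t y;

    assert(line_len <= line_pitch);

    for (y = 0; y < line_cnt; y++) {
        memcpy(dst + (size_t)y * line_pitch,
               src + (size_t)y * line_pitch,
               line_len);
    }
}
```

With line_len = 16, line_cnt = 16 and line_pitch equal to the frame width, this copies one 16 × 16 macroblock out of a frame and leaves the bytes between the lines untouched.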
For complex transfers, multi-channel EDMA can be used. EDMA provides linking and chaining mechanisms: after part of the data has been moved, the EDMA link or channel parameters are reloaded and updated automatically without CPU intervention, which is particularly suitable for moving large amounts of data. Note, however, that data in SDRAM may also have a copy in the L2 cache. Before such data is moved, cache coherence operations must be performed so that the L2 cache and SDRAM hold the same contents; otherwise the result will be incorrect.
4 Experimental Results and Analysis
The MPEG-4 video encoder was run on the C6416 DSK after applying the software optimization methods described above. To obtain coding statistics such as the peak signal-to-noise ratio (PSNR), the calc_psnr() function was temporarily added back to the code so that the ASP and SP encoders could be compared. Using the foreman sequence in CIF format (352 × 288) at an encoding bit rate of 256 kbit/s, the SP encoder was compared with ASP encoders that enable GMC, QPEL and B-VOP individually and with an ASP encoder that enables all three tools at once (the SP coding pattern is "IPPPP..."; when ASP uses B-VOP, the pattern is "IBBPBBPBBP...").
Table 2 lists the sizes of the encoded files. The ASP encoder requires less storage space than the SP encoder while the image quality changes little, so it is better suited to embedded applications such as digital cameras.
Figure 1 compares the ASP encoder (with B-VOP, GMC and QPEL enabled) with the SP encoder. The PSNR curve of the ASP encoder is flatter than that of the SP encoder, with a smaller mean square error, which indicates more stable image quality.
Figure 1 PSNR performance comparison of foreman sequence ASP and SP video encoders
Although compression efficiency improves, the amount of computation increases. Because B-VOPs add backward prediction to the encoding process, the encoding delay also increases and the frame rate decreases.
5 Conclusion
The ASP video encoder achieves higher compression efficiency at the cost of lower encoding speed and longer delay, yet it can still encode in real time on the DSP. It is therefore suitable for applications with limited storage capacity, such as digital cameras and video surveillance networks.