Several strategies for optimizing video algorithm systems based on DSP

Demand for digital video products has surged in recent years. Mainstream applications include video communications, video surveillance, and industrial automation. The most popular are entertainment applications such as DVD, HDTV, satellite TV, standard-definition (SD) and high-definition (HD) set-top boxes, digital cameras and HD camcorders, high-end displays (LCD, plasma, DLP), and personal video cameras. These applications place heavy demands on high-quality video codec algorithms and the standards behind them. The mainstream compression standards today are MPEG-2, MPEG-4, and H.264/AVC, and each has many implementations. This article discusses several factors to consider when optimizing a standards-based video codec system built on TI's C64 series DSPs.

TI's C64 series DSPs are widely used in video processing because of their processing power. However, developers differ in how well they understand the architecture, instruction set, and features of these DSPs, so implementations of the same algorithm vary widely in quality. The difference shows up in the CPU resources an implementation consumes, for example the CPU load of H.264 MP@D1 decoding, or in the subset of coding tools it supports, for example an H.264 MP@D1 decoder that uses CAVLC instead of CABAC. The main sources of these differences are:

Optimization of key algorithm modules
Memory management during algorithm system integration
EDMA resource allocation management during algorithm system integration
This article examines, for each of these three aspects in turn, the factors to consider when optimizing and integrating an algorithm.

Optimization of key algorithm modules

Generally speaking, the mainstream video compression standards share similar modules that consume a large share of DSP CPU resources. For example, motion vector search in H.264/AVC, MPEG-4, AVS, and other encoders is very compute-intensive, and such modules are called very frequently while the system runs. The first step is therefore to identify these modules. The profiling tool (Profile) in TI's CCS can quickly find the modules that consume the most DSP CPU cycles in the whole project; these are the modules to optimize.

The optimization of these key algorithm modules can be divided into three steps, as shown in Figure 2. First, analyze the C code carefully and adjust it. In particular, minimize conditional branches, especially inside for loops, where a branch can break software pipelining. Branches can often be replaced with a table lookup or with intrinsics such as _cmpgtu4 and _cmpeq4, which perform packed comparisons without a conditional jump. At the same time, use the #pragma directives provided in TI's CCS to give the compiler as much information as possible, including loop trip counts and data alignment. If this is still not enough to meet the system requirements, implement the module in linear assembly. Linear assembly sits between C and hand-written assembly: the programmer chooses the instructions but does not have to allocate registers or assign functional units (.S, .D, .M, .L). It generally executes more efficiently than C. If linear assembly still cannot meet the requirements, write the module in assembly. Writing highly parallel, deeply software-pipelined assembly involves drawing a dependency graph, building a scheduling table, and further steps; due to space limitations, they are not covered here.
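As an illustration of the first step, the following is a minimal sketch of replacing a per-pixel branch with a packed comparison plus a table lookup. It is written in portable C so that it runs on any host: `cmpgtu4_emul` is a hypothetical stand-in that imitates what `_cmpgtu4` is understood to do (compare four unsigned bytes at once and return a 4-bit result, with the bit order assumed here), and the function names and the threshold-counting task are invented for the example. On the DSP the stand-in would be replaced by the real intrinsic.

```c
#include <stdint.h>

/* Hypothetical portable stand-in for the C64x intrinsic _cmpgtu4:
 * bit i of the result is set when unsigned byte i of a is greater
 * than unsigned byte i of b (bit order is an assumption). */
static unsigned cmpgtu4_emul(uint32_t a, uint32_t b)
{
    unsigned mask = 0;
    int i;
    for (i = 0; i < 4; i++) {
        uint8_t ab = (uint8_t)(a >> (8 * i));
        uint8_t bb = (uint8_t)(b >> (8 * i));
        mask |= (unsigned)(ab > bb) << i;
    }
    return mask;
}

/* Table lookup: number of set bits in a 4-bit comparison mask. */
static const uint8_t bits_set[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Straightforward version: one conditional branch per pixel. */
unsigned count_above_branchy(const uint8_t *px, unsigned n, uint8_t thr)
{
    unsigned count = 0, i;
    for (i = 0; i < n; i++) {
        if (px[i] > thr) {
            count++;
        }
    }
    return count;
}

/* Branch-free loop body: four pixels per iteration.
 * Assumes n is a multiple of 4. */
unsigned count_above_packed(const uint8_t *px, unsigned n, uint8_t thr)
{
    const uint32_t thr4 = 0x01010101u * thr; /* threshold in all 4 bytes */
    unsigned count = 0, i;
    for (i = 0; i + 4 <= n; i += 4) {
        uint32_t w = (uint32_t)px[i]
                   | ((uint32_t)px[i + 1] << 8)
                   | ((uint32_t)px[i + 2] << 16)
                   | ((uint32_t)px[i + 3] << 24);
        count += bits_set[cmpgtu4_emul(w, thr4)];
    }
    return count;
}
```

Both functions return the same count for any buffer whose length is a multiple of four; the second one simply has no data-dependent branch in its loop body, which is the property the text asks for.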

Table 1

Optimization options: -pm, -o3; target is the C64x+ core. "C + Intrinsics" means C code that uses intrinsics.

Table 1 shows the number of DSP CPU cycles consumed by different implementations of the 16×16 macroblock SAD (sum of absolute differences) required in motion search. The assembly implementation consumes the fewest cycles, but only if the author fully understands the DSP CPU architecture, the instruction set, and the structure of the algorithm module, and can therefore write highly parallel, deeply software-pipelined assembly. Otherwise, the hand-written assembly may be less efficient than linear assembly or even C. An effective alternative is to make full use of the algorithm libraries provided by TI: their functions are fully optimized, most come with C, linear assembly, and assembly source code, and their APIs are documented.
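For reference, a plain C version of the 16×16 SAD might look like the sketch below. It is not the code measured in Table 1; the function name and the `stride` parameter (bytes between the starts of consecutive rows) are assumptions made for the example. The guarded `#pragma MUST_ITERATE` shows one way to pass a loop trip count to TI's compiler, and it is skipped by other compilers.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between a 16x16 block of the current
 * frame and a 16x16 candidate block of the reference frame.
 * stride: bytes between the starts of consecutive rows (assumed to be
 * the same for both buffers). */
unsigned sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sad = 0;
    int x, y;

#ifdef __TI_COMPILER_VERSION__
#pragma MUST_ITERATE(16, 16) /* the row loop always runs 16 times */
#endif
    for (y = 0; y < 16; y++) {
        for (x = 0; x < 16; x++) {
            sad += (unsigned)abs((int)cur[x] - (int)ref[x]);
        }
        cur += stride;
        ref += stride;
    }
    return sad;
}
```

A motion search would call such a function once per candidate position and keep the motion vector with the smallest result, which is why its cycle count dominates the profile.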

Memory management during algorithm system integration

In DSP-based embedded systems, memory is limited, especially fast on-chip memory. Memory management during algorithm system integration therefore has a large effect on overall system performance: it affects both the speed at which data is read and moved and the cache hit rate. The analysis below covers two areas: program and data.

Program area: The most important principle is to place frequently called algorithm modules in on-chip memory. For this purpose, TI's CCS provides #pragma CODE_SECTION, which moves selected functions out of the .text section into sections of their own, so that the .cmd file can map them to specific physical addresses. Code can also be loaded dynamically: the program copies the code it is about to run into on-chip memory first. For example, the CAVLC and CABAC algorithm modules in H.264/AVC are mutually exclusive, so both can be stored off-chip and share the same on-chip run region. Before one of them runs, it is copied on-chip, which makes full use of the limited fast on-chip memory. Program placement should also take the hit rate of the level-1 program cache (L1P) into account: functions that execute one after another are best placed at consecutive addresses in execution order.

Within the program space, large processing functions should also be split into smaller functions.
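A minimal sketch of the #pragma CODE_SECTION usage described above is shown here. The function `clip_u8`, the section name `.hot_code`, and the memory region name `ISRAM` are all hypothetical; only the pragma itself comes from the text. The pragma is guarded so that the file still builds with non-TI compilers.

```c
#include <stdint.h>

/*
 * ".hot_code" is a hypothetical section name. A matching linker
 * command (.cmd) file would then map it to on-chip memory, e.g.:
 *     SECTIONS { .hot_code > ISRAM }
 * where ISRAM is an assumed name for the on-chip RAM region.
 */
#ifdef __TI_COMPILER_VERSION__
#pragma CODE_SECTION(clip_u8, ".hot_code")
#endif
/* Example of a small, frequently called routine: clamp to 0..255. */
uint8_t clip_u8(int v)
{
    if (v < 0) {
        v = 0;
    }
    if (v > 255) {
        v = 255;
    }
    return (uint8_t)v;
}
```

Placing each hot function in its own named section is what lets the linker command file decide, function by function, which code lives on-chip and in what address order.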

Data area: Video codecs work on very large blocks of data. A single D1 4:2:0 frame occupies about 622 KB (720 × 576 × 1.5 bytes), and the codec needs three to five or more frame buffers, so this data basically cannot be kept on-chip. The system's memory configuration must therefore enable the L2 cache of the C64 series DSP (on the TMS320DM642, video codecs commonly configure 64 KB of L2 as cache). It is also best to align the cacheable off-chip video buffers to 128 bytes: each L2 cache line of the C64 series DSP is 128 bytes, so 128-byte alignment simplifies cache writeback and invalidation and the maintenance of cache coherence.
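The frame-size figure and the 128-byte alignment rule can be expressed as the small helpers below. This is a portable sketch with hypothetical function names; it assumes 8-bit samples and the 128-byte L2 line size stated above.

```c
#include <stddef.h>
#include <stdint.h>

#define L2_LINE_BYTES 128u /* L2 cache line size stated in the text */

/* Bytes in one 8-bit YUV 4:2:0 frame: a full-size luma plane plus two
 * quarter-size chroma planes, i.e. width * height * 1.5. */
size_t yuv420_frame_bytes(size_t width, size_t height)
{
    return width * height * 3u / 2u;
}

/* Round a buffer size up to a whole number of L2 cache lines. */
size_t round_up_to_line(size_t nbytes)
{
    return (nbytes + L2_LINE_BYTES - 1u) & ~(size_t)(L2_LINE_BYTES - 1u);
}

/* Advance a pointer to the next 128-byte boundary (no change if it is
 * already aligned). The caller must over-allocate by up to 127 bytes. */
uint8_t *align_to_line(uint8_t *p)
{
    uintptr_t addr = (uintptr_t)p;
    addr = (addr + L2_LINE_BYTES - 1u) & ~(uintptr_t)(L2_LINE_BYTES - 1u);
    return (uint8_t *)addr;
}
```

For a 720 × 576 frame, `yuv420_frame_bytes` gives 622,080 bytes, which is already a multiple of 128. With TI's compiler, a statically allocated buffer could instead be aligned with `#pragma DATA_ALIGN`, so the pointer arithmetic is mainly useful for dynamically allocated buffers.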

Resource allocation management of EDMA during algorithm system integration

Video processing moves blocks of data constantly, and the C64 series DSP provides an EDMA controller with 64 logical channels, so configuring and using the EDMA well is very important for optimizing the system. The following steps can be used to allocate the system's EDMA resources.

1. List every case in which the system needs EDMA and how long each transfer occupies the EDMA physical bus, as shown in Table 2:

Note: This table assumes video entering the DSP through the video port (720×480, 4:2:0, 30 frames/s) and audio through the McBSP (44 kHz sampling rate), a compressed data rate of about 2 Mbps, output over PCI as one 128-byte packet every 488 µs (PCI port operating frequency 33 MHz), and an external SDRAM clock frequency of 133 MHz. It is only a reference application example.

2. With this information collected, assign a priority to each EDMA channel in use, based on the real-time requirements of each stream and the size of its transfer blocks. Generally speaking, audio transfer blocks are small and occupy the EDMA bus only briefly, while video transfer blocks are larger and occupy it longer. Therefore, the EDMA channel for input audio is set to Q0 (urgent), the channel for video to Q2 (medium), the channel for the output bitstream to Q1 (high), and the QDMA transfers issued during audio and video algorithm processing to Q3 (low). These settings may need to be adjusted in a real system.
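The two steps above can be sketched in C as follows. The first two helpers reproduce figures from the note to Table 2 by simple arithmetic; the 488 µs packet interval comes out exactly when "about 2 Mbps" is read as 2 × 1024 × 1024 bit/s, which is an interpretation rather than something the source states. The enum and table names are hypothetical and are not TI CSL identifiers; they only record the priority assignment from step 2.

```c
#include <stdint.h>

/* Raw input video rate for 8-bit 4:2:0 frames, in bytes per second. */
uint32_t video_in_bytes_per_sec(uint32_t width, uint32_t height, uint32_t fps)
{
    return width * height * 3u / 2u * fps;
}

/* Time between output packets, in whole microseconds. */
uint32_t packet_interval_us(uint32_t packet_bytes, uint32_t bits_per_sec)
{
    return (uint32_t)((uint64_t)packet_bytes * 8u * 1000000u / bits_per_sec);
}

/* EDMA priority queues as named in the text (hypothetical identifiers). */
typedef enum {
    EDMA_Q0_URGENT = 0,
    EDMA_Q1_HIGH,
    EDMA_Q2_MEDIUM,
    EDMA_Q3_LOW
} edma_queue_t;

/* The transfer types considered in the example system. */
typedef enum {
    STREAM_AUDIO_IN = 0,
    STREAM_VIDEO_IN,
    STREAM_BITSTREAM_OUT,
    STREAM_ALGO_QDMA,
    STREAM_COUNT
} stream_t;

/* Priority assignment from step 2; a starting point to be tuned. */
static const edma_queue_t stream_queue[STREAM_COUNT] = {
    EDMA_Q0_URGENT, /* STREAM_AUDIO_IN: small blocks, short bus time    */
    EDMA_Q2_MEDIUM, /* STREAM_VIDEO_IN: large blocks, long bus time     */
    EDMA_Q1_HIGH,   /* STREAM_BITSTREAM_OUT: compressed output stream   */
    EDMA_Q3_LOW     /* STREAM_ALGO_QDMA: transfers inside the algorithm */
};

edma_queue_t queue_for_stream(stream_t s)
{
    return stream_queue[s];
}
```

With the parameters from the note, the input video rate is 15,552,000 bytes/s, and a 128-byte packet at 2 × 1024 × 1024 bit/s leaves the DSP every 488 µs. Keeping the assignment in one table makes it easy to adjust when measurements on the real system call for different priorities.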

In practice, optimizing and integrating a video algorithm on a TI DSP follows the steps shown in Figure 1. First, set up a preliminary memory configuration and select suitable compiler optimization options. If the result meets the real-time requirements, no further optimization is needed. Otherwise, optimize the memory and EDMA configuration to improve cache and internal bus utilization. If the requirements are still not met, profile the whole project to find the code segments or functions that consume the most CPU resources, and optimize these key modules, using linear assembly or even assembly, until the whole system meets the requirements.
