Demand for digital video products has surged in recent years. Mainstream applications include video communications, video surveillance, and industrial automation; the most popular are entertainment applications such as DVD, HDTV, satellite TV, standard-definition (SD) and high-definition (HD) set-top boxes, digital cameras and HD camcorders, high-end displays (LCD, plasma, DLP), and personal video cameras. These applications create strong demand for high-quality video codec algorithms and standards. The mainstream compression standards today are MPEG-2, MPEG-4, and H.264/AVC, and each has a variety of implementations. This article discusses several factors to consider when optimizing a standards-based video codec system on TI's C64 series DSPs.
TI's C64 series DSPs are widely used in video processing because of their processing power. However, developers differ in how well they understand the DSP's architecture, instructions, and features, so implementations of the same algorithm vary widely in quality. The difference shows up in the CPU resources an implementation consumes (for example, the CPU load of H.264 MP@D1 decoding) or in the subset of coding tools it supports (for example, using CAVLC instead of CABAC in an H.264 MP@D1 decoder). The main reasons for these differences are as follows:
Optimization of key algorithm modules
Memory management during algorithm system integration
EDMA resource allocation management during algorithm system integration
This article examines, from these three aspects in turn, the factors to consider when optimizing and integrating the algorithms.
Optimization of key algorithm modules
Generally speaking, the mainstream video compression standards all contain similar modules that consume a large share of DSP CPU resources. For example, motion vector search in H.264/AVC, MPEG-4, AVS, and other encoders is very resource-intensive, and such modules are called very frequently while the system runs. The first step is therefore to identify these modules. TI's CCS provides a profiling tool (Profile) that quickly finds the modules consuming the most DSP CPU resources in the whole project; these modules are then optimized.
The optimization of these key modules can be divided into three steps, as shown in Figure 2. First, analyze the code carefully and restructure it, for example by minimizing conditional branches, especially inside for loops, where a branch breaks the software pipeline. Branches can be replaced with table lookups or with intrinsics such as _cmpgtu4 and _cmpeq4, which perform the comparison without a jump. At the same time, use the #pragma directives supported by TI's CCS to give the compiler as much information as possible, including loop trip counts and data alignment. If this step does not meet the system requirements, implement the module in linear assembly. Linear assembly is a form between C and assembly: the programmer controls which instructions are used without having to manage the allocation of registers and functional units (.S, .D, .M, .L). Linear assembly generally executes more efficiently than C. If linear assembly still does not meet the requirements, use hand-written assembly. Writing highly parallel, deeply software-pipelined assembly requires steps such as drawing the relevant dependency graphs and building a scheduling table; due to space limitations, they are not covered here.
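As an illustration of replacing a conditional branch with a table lookup, the following is a minimal portable C sketch, not code from the article. It clamps the sum of a predicted pixel and a residual to the 8-bit range through a precomputed table, so the loop body contains no if/else. The names clip_table_init and add_residual_clip, the table offset, and the choice of pixel clamping as the example are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names and sizes; not taken from the article or a TI library. */
#define CLIP_OFFSET     512                       /* most negative input handled is -512 */
#define CLIP_TABLE_SIZE (CLIP_OFFSET + 256 + CLIP_OFFSET)

static uint8_t clip_table[CLIP_TABLE_SIZE];

/* Fill the table once at start-up: entry i holds clamp(i - CLIP_OFFSET, 0, 255). */
void clip_table_init(void)
{
    int i;
    for (i = 0; i < CLIP_TABLE_SIZE; i++) {
        int v = i - CLIP_OFFSET;
        clip_table[i] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }
}

/* Add a residual to a prediction and clamp the result to 8 bits.
 * The clamp is a table lookup, so the loop body has no conditional jump.
 * With TI's compiler, a loop-count pragma placed before the loop could
 * additionally tell the compiler the trip count (not shown here). */
void add_residual_clip(const uint8_t *pred, const int16_t *resid,
                       uint8_t *out, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        int v = pred[i] + resid[i];
        /* Debug-only range check; compiled out when NDEBUG is defined. */
        assert(v >= -CLIP_OFFSET && v < CLIP_TABLE_SIZE - CLIP_OFFSET);
        out[i] = clip_table[v + CLIP_OFFSET];
    }
}
```

The table costs 1280 bytes of memory in exchange for a loop that the compiler is free to software-pipeline; whether that trade is worthwhile depends on where the table is placed and on the cache behaviour of the target.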
Table 1
Optimization options: -pm, -o3; based on the C64x+ core. "C+Intrinsics" means C code that uses intrinsics.
Table 1 shows the DSP CPU cycles consumed by different implementations when computing the SAD (sum of absolute differences) of a 16×16 macroblock, as required in motion search. The assembly implementation consumes the fewest cycles, but only if the author understands the DSP CPU architecture, its instructions, and the structure of the algorithm module well enough to write highly parallel, deeply software-pipelined assembly; otherwise the hand-written assembly may be less efficient than linear assembly or C. An effective alternative is to make full use of the algorithm libraries provided by TI: their functions are fully optimized, most come with C, linear assembly, and assembly source code, and their APIs are documented.
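For reference, the computation being measured can be written in plain C as below. This is a minimal, unoptimized sketch, not the code behind Table 1; the function name sad_16x16 and the stride parameter are assumptions introduced for illustration. It is the kind of baseline one would profile first before moving to intrinsics, linear assembly, or a library routine.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Plain-C reference for the sum of absolute differences (SAD) between a
 * 16x16 block of the current frame and a 16x16 candidate block of the
 * reference frame. sad_16x16 is a hypothetical name.
 * stride is the distance in bytes between successive rows of both blocks. */
unsigned sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sad = 0;
    int x, y;

    assert(stride >= 16);
    for (y = 0; y < 16; y++) {
        for (x = 0; x < 16; x++) {
            sad += (unsigned)abs((int)cur[x] - (int)ref[x]);
        }
        cur += stride;   /* advance both blocks to the next row */
        ref += stride;
    }
    return sad;
}
```

The inner loop has a fixed trip count and no data-dependent branch, which is what makes it a good candidate for packed-data intrinsics and software pipelining on the DSP.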
Memory management during algorithm system integration
In DSP-based embedded systems, memory is limited, and on-chip high-speed memory especially so. Memory management during algorithm system integration is therefore very important to overall system performance: it affects both the speed of reading and moving data and the cache hit rate. The following analysis covers two areas: program and data.
Program area: The most important principle is to place frequently called algorithm modules on-chip. For this purpose, TI's CCS provides #pragma CODE_SECTION, which moves selected functions out of the .text section into separately named sections that can be mapped to specific physical addresses in the .cmd file. Code can also be loaded dynamically, copying a code section into on-chip memory just before it runs. For example, the CAVLC and CABAC modules in H.264/AVC are mutually exclusive, so both can be stored off-chip and share the same on-chip run area; before one of them runs, it is copied on-chip, which makes full use of the limited on-chip high-speed memory. Program placement should also take the hit rate of the level-1 program cache (L1P) into account: functions that execute in sequence are best placed at consecutive addresses in the same order.
In addition, large processing functions in the program space should be split into smaller functions.
Data area: In video codecs the data blocks are very large. A D1 4:2:0 frame, for example, occupies about 622 KB, and the codec needs 3 to 5 or more frame buffers, so the data generally cannot be kept on-chip. Memory optimization therefore requires enabling the L2 cache of the C64 series DSP (on the TMS320DM642, video codecs often configure 64 KB of L2 as cache). In addition, off-chip video buffers that are cached should be aligned to 128 bytes. Each L2 cache line on the C64 series DSP is 128 bytes, so 128-byte alignment simplifies cache flushing and coherency maintenance.
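The use of #pragma CODE_SECTION described above might look like the following sketch. The function avg_round, the section name .hot_code, and the memory region name ISRAM are hypothetical and chosen only for illustration. The pragma is guarded so that the file also builds with a non-TI compiler.

```c
#include <assert.h>

/* Hypothetical function, section, and memory region names. */
#ifdef __TI_COMPILER_VERSION__
/* Place this frequently called function in its own section instead of
 * the default .text section. */
#pragma CODE_SECTION(avg_round, ".hot_code")
#endif
int avg_round(int a, int b)
{
    assert(a >= 0 && b >= 0);
    return (a + b + 1) >> 1;   /* rounded average of two pixel values */
}

/* Matching entry in the linker command (.cmd) file, assuming a MEMORY
 * region named ISRAM has been defined for on-chip RAM:
 *
 *   SECTIONS
 *   {
 *       .hot_code > ISRAM
 *   }
 */
```

Grouping several hot functions into the same named section, in the order they execute, is one way to follow the advice about placing sequentially executed functions at consecutive addresses.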
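The frame size and alignment arithmetic can be checked with a short sketch. It assumes that the 622 KB figure refers to a 720×576 D1 frame, which is the resolution that yields that value; the article does not state the resolution. The helper names frame_bytes_420 and align_up_128 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define L2_LINE_BYTES 128u   /* L2 cache line size given in the text */

/* Bytes in one 8-bit 4:2:0 frame: a full-resolution luma plane plus two
 * chroma planes at a quarter of the luma resolution each (1.5 bytes/pixel). */
size_t frame_bytes_420(size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);  /* 4:2:0 needs even dimensions */
    return width * height * 3u / 2u;
}

/* Round an address or size up to the next multiple of the L2 line size. */
uintptr_t align_up_128(uintptr_t value)
{
    return (value + (L2_LINE_BYTES - 1u)) & ~(uintptr_t)(L2_LINE_BYTES - 1u);
}
```

Under that assumption, frame_bytes_420(720, 576) gives 622,080 bytes, which is already a multiple of 128, so frame buffers laid out back to back from a 128-byte-aligned base stay aligned. With TI's compiler, a statically allocated buffer is typically aligned with a data alignment pragma rather than by rounding addresses at run time.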
Resource allocation management of EDMA during algorithm system integration
Video processing frequently moves blocks of data, and the C64 series DSP provides an EDMA controller with 64 logical channels, so how EDMA is configured and used matters a great deal for system optimization. The following steps can be used to plan the system's EDMA resources.
1. List every use of EDMA in the system and how long each occupies the EDMA physical bus, as shown in Table 2:
Note: This table assumes video entering the DSP through the video port (720×480, 4:2:0, 30 frames/s) and audio entering through McBSP (44 kHz sampling rate). The compressed data rate is about 2 Mbps, output through PCI as one 128-byte packet every 488 µs (PCI operating frequency 33 MHz), and the external SDRAM is clocked at 133 MHz. This is only a reference application example.
2. With this information collected, assign a priority to each EDMA channel in use, based on the real-time requirements of each stream and the size of its transfer blocks. Generally, audio transfer blocks are small and occupy the EDMA bus only briefly, while video transfer blocks are larger and occupy it longer. Therefore, the EDMA channel for input audio is set to Q0 (urgent), the video channel to Q2 (medium), the channel for the output bitstream to Q1 (high), and the QDMA transfers issued during audio and video algorithm processing to Q3 (low). These settings may need to be adjusted in a real system.
The actual optimization and integration of a video algorithm on a TI DSP follows the steps shown in Figure 1. First, set up a preliminary memory configuration and select the appropriate compiler optimization options. If the result meets the real-time requirements, no further optimization is needed. Otherwise, optimize the memory and EDMA configuration to improve the utilization of the cache and the internal bus. If the requirements still cannot be met, profile the whole project to find the code segments or functions that consume the most CPU resources, and optimize these key modules, using linear assembly or even assembly, until the whole system meets the requirements.
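The two planning steps can be captured in a small bookkeeping sketch. It records the priority assignment from the text and checks the data rates quoted in the note: a 128-byte packet every 488 µs is about 2.1 Mbps, and raw 720×480 4:2:0 video at 30 frames/s is about 15.6 MB/s. The structure and function names are hypothetical; this is not TI's chip support library API and it does not program any EDMA registers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Priority queues as named in the text; a lower number is more urgent. */
enum edma_queue { Q0_URGENT = 0, Q1_HIGH = 1, Q2_MEDIUM = 2, Q3_LOW = 3 };

/* Hypothetical planning record, not a TI CSL structure. */
struct stream_plan {
    const char     *name;
    enum edma_queue queue;
};

static const struct stream_plan plan[] = {
    { "audio_in",      Q0_URGENT },  /* small blocks, short bus occupancy */
    { "bitstream_out", Q1_HIGH   },
    { "video",         Q2_MEDIUM },  /* large blocks, long bus occupancy  */
    { "codec_qdma",    Q3_LOW    },
};

/* Look up the planned queue for a stream; returns -1 if it is not listed. */
int queue_for(const char *name)
{
    size_t i;
    assert(name != NULL);
    for (i = 0; i < sizeof plan / sizeof plan[0]; i++) {
        if (strcmp(plan[i].name, name) == 0) {
            return (int)plan[i].queue;
        }
    }
    return -1;
}

/* Average rate in bits per second of fixed-size blocks sent every period_us. */
double block_rate_bps(unsigned block_bytes, double period_us)
{
    assert(period_us > 0.0);
    return block_bytes * 8.0 * 1e6 / period_us;
}

/* Raw 8-bit 4:2:0 video rate in bytes per second (1.5 bytes per pixel). */
unsigned long video_bytes_per_sec(unsigned width, unsigned height, unsigned fps)
{
    return (unsigned long)width * height * 3ul / 2ul * fps;
}
```

Keeping such a table next to the measured bus occupancy times makes it easier to revisit the priorities when a stream's block size or rate changes.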