Introduction
This paper presents the design of an MPEG-4 encoder based on a TMS320C6000-series DSP. Images acquired by the camera are compressed in real time according to the MPEG-4 standard and displayed in real time through VGA. At the same time, the compressed data is sent to the ARM controller over the PCI bus, and the video data is transmitted over the network as required.
MPEG-4 is an open standard: many parts of it are left unspecified, and new algorithms can be added. A general-purpose DSP therefore allows the algorithm to be updated and optimized at any time, which improves encoding efficiency. Because the MPEG-4 encoding algorithm is complex and a large amount of data has to be stored, memory allocation, data transfer and computing speed are all challenges for the DSP.
The C6000 series is TI's high-end DSP family. All devices in the series are VLIW DSPs based on the VelociTI™ architecture. They can execute eight 32-bit instructions per cycle, so a CPU clocked at 200 MHz reaches 8 × 200 = 1600 MIPS. At a clock frequency of 600 MHz, the 6416 can simultaneously handle one channel of MPEG-4 video encoding, one channel of MPEG-4 video decoding and one channel of MPEG-2 video encoding using only 50% of its computing power. It also has flexible external interfaces and complete development tools, and is adopted by most embedded real-time image compression systems. Therefore, this system uses TI's TMS320C6416 as its core processor.
1. Structure and features of TMS320C6416
The CPU structure of the DSP is shown in Figure 1. It has two channels (data paths). Each channel has 4 functional units (1 multiplier and 3 arithmetic logic units) and 16 32-bit general-purpose registers, and the functional units of a channel can freely access the registers of that channel. The CPU also has two cross paths, through which the functional units of one channel can access the registers of the other channel. In addition, the CPU has 256-bit wide data and program paths, which allow the program memory to supply 8 instructions for parallel execution in each clock cycle. This CPU structure is the basic prerequisite for the VLIW architecture of the DSP.
The memory space of the DSP is mapped into internal memory, internal peripherals and external memory. The internal memory consists of 64 KB of internal program memory and data memory. The internal program memory can be mapped into the CPU address space or operated as a cache. Both internal and external data memory can be accessed by the CPU, the DMA controller or the HPI (host port interface). The HPI allows a host computer to access the memory space of the DSP.
2. System hardware design
This system is mainly divided into three parts: video acquisition module, video MPEG-4 encoding module and video transmission module. Its structural block diagram is shown in Figure 2.
2.1 Video Capture
In this system, the acquisition of the input analog video signal is completed by the BT835 video decoder. The supported video input is standard analog video signal of PAL or NTSC format. The input video signal can be either composite video signal or S-Video signal, and the output is image data in 4:2:2 YUV format.
Figure 3 shows the block diagram of the DSP analog video input interface. After preprocessing, the standard analog video signal enters the A/D converter. In parallel, a clock generation circuit derives an A/D conversion clock that is in phase with the line synchronization signal, so that each line contains an integer number of sampling points. To ensure that the video data is collected into the DSP in its entirety, the line synchronization signal marks the point at which the FIFO starts to be read. The line synchronization, field synchronization and odd/even field flag signals also go directly to the DSP, so that it can determine the position of the incoming video data within a frame. To improve the real-time performance of the system, the background operation of the DMA (direct memory access) channels of the TMS320C6416 is used, so that data exchange between the DSP and the peripherals proceeds at the same time as high-speed computation on the CPU. The buffering provided by the FIFO also makes it easy for the DSP to exchange data with peripherals other than the A/D converter.
The ARM7 generates the clock, controls the video acquisition chip, converts the acquired data from 8 or 16 bits to 32 bits, and arranges the data so that Y, U and V are separated. This amounts to preprocessing the acquired data for the convenience of video encoding. The ARM7 writes the 32-bit data into a 32-bit FIFO. Using a 32-bit FIFO and converting the video data to 32 bits keeps the 32-bit data bus fully used when the DSP reads video data, which improves read efficiency. The FIFO also reduces the time the DSP spends reading data and smooths the speed mismatch between high-speed and low-speed devices. Each time the FIFO is half full, the ARM7 sends an interrupt to the DSP, and the interrupt handler uses DMA to read the video data. Without the ARM7, the DSP would be interrupted frequently and would spend a lot of time on saving and restoring context and on register setup.
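The Y/U/V separation and 32-bit packing described above can be illustrated with a short C sketch. It is not the authors' implementation: the packed byte order (U0 Y0 V0 Y1), the little-endian packing and the function names `split_uyvy_line` and `pack4` are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split one line of packed 4:2:2 video (assumed byte order U0 Y0 V0 Y1)
 * into separate Y, U and V arrays. `width` is the number of pixels in
 * the line and must be even, because each U/V pair is shared by 2 pixels. */
void split_uyvy_line(const uint8_t *packed, size_t width,
                     uint8_t *y, uint8_t *u, uint8_t *v)
{
    assert(width % 2 == 0);
    for (size_t i = 0; i < width / 2; i++) {
        u[i]         = packed[4 * i + 0];
        y[2 * i]     = packed[4 * i + 1];
        v[i]         = packed[4 * i + 2];
        y[2 * i + 1] = packed[4 * i + 3];
    }
}

/* Pack four 8-bit samples into one 32-bit word (first sample in the
 * least significant byte), so that one FIFO word carries four samples. */
uint32_t pack4(const uint8_t s[4])
{
    return (uint32_t)s[0] | ((uint32_t)s[1] << 8) |
           ((uint32_t)s[2] << 16) | ((uint32_t)s[3] << 24);
}
```

With the samples already separated and packed four to a word, every 32-bit read by the DSP delivers four usable samples of the same component.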
2.2 MPEG-4 video encoding module
After the DSP reads in the video data, it performs preliminary processing, such as converting the YUV format to the RGB format, and then performs MPEG-4 video encoding. In this process, data access typically takes about 50% of the time, arithmetic operations about 30%, and control about 20%. Because motion estimation and motion compensation are required, one I frame (intra-coded frame) and at least one P frame (predicted frame) are usually held in data memory. These images occupy a relatively large space, so they are placed in external SDRAM. The DCT coefficients, motion vectors, quantization matrices, variable-length coding tables, zigzag scan tables and similar data used during encoding also have to be stored. Since they are small and are used repeatedly, they are placed in on-chip memory.
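A quick calculation shows why the frame buffers cannot stay on chip. The sketch below assumes 8-bit samples, 4:2:2 storage (2 bytes per pixel) and the standard CIF (352×288) and QCIF (176×144) sizes; the 64 KB figure is the internal memory size quoted earlier. The function names are hypothetical, and an encoder that stores its reference frames in another sampling format would get different numbers.

```c
#include <assert.h>
#include <stddef.h>

#define ONCHIP_BYTES (64u * 1024u)  /* 64 KB internal memory, per the text */

/* Bytes for one frame in 4:2:2 YUV: every pixel has one Y sample, and
 * every pair of pixels shares one U and one V sample, i.e. 2 bytes/pixel. */
size_t frame_bytes_422(size_t width, size_t height)
{
    assert(width % 2 == 0);
    return width * height * 2;
}

/* Returns 1 if `frames` frames of the given size fit in on-chip memory. */
int fits_on_chip(size_t width, size_t height, size_t frames)
{
    return frame_bytes_422(width, height) * frames <= ONCHIP_BYTES;
}
```

Under these assumptions one CIF frame takes 202,752 bytes, roughly three times the on-chip memory, and even two QCIF frames (2 × 50,688 bytes) exceed 64 KB, so the current and reference frames have to live in SDRAM.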
2.3 Video Transmission
Unlike a PC, the DSP has a two-level on-chip/off-chip memory architecture, and the way data is allocated across it means that an encoder implementation involves a large amount of data transfer. These transfers must be managed effectively to reduce the time they take.
Data collection can be handled by the DMA controller of the DSP. Most TMS320C6000 DSPs have several independent DMA channels. A DMA transfer moves data from a source address to a destination address without CPU intervention.
However, DMA is only suited to moving whole blocks of data. The DMA controllers of earlier DSPs cannot handle transfers between different data structures. Therefore, the ARM7 can be used to control the DSP's DMA to carry out the more complex data transfers in video encoding.
The encoded video data is sent to the outside world through the ARM7, for example over the Internet or a CDMA or GSM network; the ARM7 only needs the corresponding transmission interface. Communication between the ARM7 and the encoding card can be implemented through a parallel port, serial port, USB port, PCI interface and so on. Of these, the PCI interface makes high-speed data transfer between the ARM7 and the encoder easy, so the PCI interface is used. The encoded data reaches the ARM7 through the HPI of the DSP, the PCI bridge chip and the PCI bus. The ARM7 accesses the memory space of the DSP directly through the HPI.
3. Software Design and Optimization
3.1 Video Capture
For video acquisition, this system uses a data structure that turns the spatially contiguous look-ahead buffer into a circular buffer. A simple schematic diagram is shown in Figure 4.
With this method, as long as enough space is allocated for the buffer to hold more than 3 image frames, new image data can be collected while earlier image data is being processed, without any data conflict. The ring buffer always keeps the oldest N frames until they are taken away by the system.
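A circular frame buffer of this kind can be sketched in C as follows. This is a minimal illustration, not the system's actual data structure: the slot count `NUM_SLOTS`, the type `frame_ring` and all function names are hypothetical. On the real system the capture side would run in the DMA interrupt handler, so updates to the shared counter would also need to be protected against concurrent access, which this sketch omits.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SLOTS 4   /* hypothetical: more than 3 frames, as the text requires */

typedef struct {
    unsigned char *slot[NUM_SLOTS]; /* pointers to frame-sized buffers     */
    size_t head;   /* next slot the capture side will fill                */
    size_t tail;   /* next slot the encoder will consume                  */
    size_t count;  /* number of filled slots not yet consumed             */
} frame_ring;

/* Carve NUM_SLOTS frame buffers out of one contiguous storage area. */
void ring_init(frame_ring *r, unsigned char *storage, size_t frame_bytes)
{
    assert(r != NULL && storage != NULL);
    for (size_t i = 0; i < NUM_SLOTS; i++)
        r->slot[i] = storage + i * frame_bytes;
    r->head = r->tail = r->count = 0;
}

/* Capture side: get a slot to fill, or NULL if the ring is full. */
unsigned char *ring_acquire_write(frame_ring *r)
{
    return (r->count == NUM_SLOTS) ? NULL : r->slot[r->head];
}

/* Capture side: mark the slot from ring_acquire_write as filled. */
void ring_commit_write(frame_ring *r)
{
    assert(r->count < NUM_SLOTS);
    r->head = (r->head + 1) % NUM_SLOTS;
    r->count++;
}

/* Encoder side: get the oldest filled slot, or NULL if the ring is empty. */
unsigned char *ring_acquire_read(frame_ring *r)
{
    return (r->count == 0) ? NULL : r->slot[r->tail];
}

/* Encoder side: hand the oldest slot back to the capture side. */
void ring_release_read(frame_ring *r)
{
    assert(r->count > 0);
    r->tail = (r->tail + 1) % NUM_SLOTS;
    r->count--;
}
```

Because the write and read positions wrap around independently, the capture side can fill one slot while the encoder works on another, and a full ring simply stalls capture instead of overwriting frames that have not been encoded yet.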
3.2 Video Coding
MPEG-4 video coding is object-based video coding, which still uses the traditional hybrid coding method consisting of predictive coding, motion compensation, and DCT transformation. The core algorithms of the encoder include motion estimation, DCT/IDCT, quantization, VLC, etc., among which motion estimation occupies nearly a quarter of the computational load of the entire encoder. Therefore, studying motion estimation algorithms that are suitable for DSP structures and have a good compromise between speed and coding quality is a key issue in achieving real-time coding.
Block matching motion estimation algorithm should be used in video encoding, but the traditional block matching algorithm cannot achieve satisfactory results in matching speed. Therefore, this system adopts a four-step search block matching algorithm improved on the basis of the three-step search algorithm.
The four-step search algorithm is described as follows:
(1) Search for matching points to form a diamond window, as shown in Figure 5. The initial 9 matching points are the 4 vertices of the diamond, the midpoints of the 4 edges and the center of the diamond, as shown in the solid point in Figure 5. Calculate the SAD value for each point and select the point with the smallest SAD. If the point is the center of the search window, jump to step 4, otherwise go to step 2.
Figure 5 Schematic diagram of the four-step search algorithm
(2) Take the point with the smallest SAD as the center of a new diamond window of matching points, and select the remaining matching points according to the following rules.
a) If the point with the smallest SAD is a corner point of the current search window, such as point A, select the 5 additional points that are not adjacent to point A (marked in Figure 5). Select the point with the smallest SAD and go to step 3;
b) If the point with the smallest SAD is on an edge of the current search window, such as point B, select the 3 additional points that are not adjacent to point B (marked in Figure 5). Select the point with the smallest SAD and go to step 3;
c) If the point with the smallest SAD is the center point C of the current search window, go to step 4;
(3) Repeat the search pattern of step 2; in every case, continue with step 4 afterwards.
(4) Select four surrounding points as matching points, and change the step size to 1, as shown in the hollow point in Figure 5. Select the point with the smallest SAD as the final target point.
The four-step search algorithm is less complex than the three-step search algorithm, yet its accuracy is no lower. In addition, its regular structure lends itself to software pipelining, which makes it well suited to implementation on a DSP.
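The search flow can be sketched in C as follows. This is an interpretation of the steps above, not the authors' code, and it rests on several assumptions: 16×16 blocks, a diamond with vertices at distance 2 and edge midpoints at (±1, ±1), and a final step that tests the four horizontal and vertical neighbours. For brevity each step re-evaluates all eight outer points instead of only the 5 or 3 new ones, which changes the operation count but not the result. All names (`vec2`, `sad_block`, `best_of`, `four_step_search`) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define BLK 16  /* macroblock size (assumed 16x16 luma) */

typedef struct { int x, y; } vec2;

/* Sum of absolute differences between the BLKxBLK block at (cx, cy) in
 * `cur` and the block at (rx, ry) in `ref`. Both images are w pixels wide. */
static unsigned sad_block(const unsigned char *cur, const unsigned char *ref,
                          int w, int cx, int cy, int rx, int ry)
{
    unsigned sad = 0;
    for (int j = 0; j < BLK; j++)
        for (int i = 0; i < BLK; i++)
            sad += (unsigned)abs(cur[(cy + j) * w + cx + i] -
                                 ref[(ry + j) * w + rx + i]);
    return sad;
}

/* Test n candidate offsets around centre c. Returns the best vector found,
 * or c itself if no candidate beats *best_sad (which is updated in place). */
static vec2 best_of(const unsigned char *cur, const unsigned char *ref,
                    int w, int h, int bx, int by, vec2 c,
                    const vec2 *pat, int n, unsigned *best_sad)
{
    vec2 best = c;
    for (int k = 0; k < n; k++) {
        int rx = bx + c.x + pat[k].x, ry = by + c.y + pat[k].y;
        if (rx < 0 || ry < 0 || rx + BLK > w || ry + BLK > h)
            continue;                       /* candidate leaves the frame */
        unsigned s = sad_block(cur, ref, w, bx, by, rx, ry);
        if (s < *best_sad) {
            *best_sad = s;
            best.x = c.x + pat[k].x;
            best.y = c.y + pat[k].y;
        }
    }
    return best;
}

/* Motion vector for the block at (bx, by) of `cur`, searched in `ref`. */
vec2 four_step_search(const unsigned char *cur, const unsigned char *ref,
                      int w, int h, int bx, int by)
{
    static const vec2 diamond[8] = {
        { 0, -2}, { 2, 0}, { 0, 2}, {-2, 0},   /* vertices       */
        { 1, -1}, { 1, 1}, {-1, 1}, {-1, -1}   /* edge midpoints */
    };
    static const vec2 cross[4] = { {0, -1}, {1, 0}, {0, 1}, {-1, 0} };

    assert(bx >= 0 && by >= 0 && bx + BLK <= w && by + BLK <= h);
    vec2 c = {0, 0};
    unsigned best = sad_block(cur, ref, w, bx, by, bx, by);

    for (int step = 0; step < 3; step++) {        /* steps 1 to 3 */
        vec2 n = best_of(cur, ref, w, h, bx, by, c, diamond, 8, &best);
        if (n.x == c.x && n.y == c.y)
            break;                 /* centre is the minimum: go to step 4 */
        c = n;
    }
    /* step 4: four neighbours at step size 1 */
    return best_of(cur, ref, w, h, bx, by, c, cross, 4, &best);
}
```

The inner SAD loop has a fixed trip count and no data-dependent control flow, which is the property that makes this kind of search a good candidate for software pipelining.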
3.3 Software Optimization
Since image processing involves large amounts of data, high correlation, and strict frame and field time constraints, how to optimize DSP programming based on the characteristics of image processing and give full play to its performance becomes the key to improving the performance of the entire system.
In order to give full play to the computing power of DSP, we must start from its hardware structure, make the best use of the eight functional units, use software pipelines, and try to make the program run in parallel without conflict. Generally, the loop body meets the conditions for parallel processing, and the loop body is often the longest time-consuming in the program. Therefore, the focus is on the loop body when optimizing.
1) Optimization of DSP jump instructions
Most DSP instructions are single-cycle instructions, but branch instructions usually consume more clock cycles. Each branch has 5 delay slots, which is very costly from a performance perspective. Therefore, branches in the program should be reduced as much as possible.
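One common way to reduce branches is to replace a short `if` inside a hot loop with arithmetic that selects the result. The sketch below clamps reconstructed pixel values to the range 0 to 255 without data-dependent jumps. It only illustrates the idea: the function names are hypothetical, and whether the generated code is actually free of branches depends on the compiler, which may also remove simple branches on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp to [0, 255] without data-dependent jumps. Each comparison yields
 * 0 or 1; negating it gives a mask of all zeros or all ones. */
static inline uint8_t clamp_u8(int32_t v)
{
    int32_t neg  = -(int32_t)(v < 0);     /* all ones if v < 0   */
    int32_t high = -(int32_t)(v > 255);   /* all ones if v > 255 */
    v &= ~neg;                            /* v < 0   -> 0        */
    v = (v & ~high) | (255 & high);       /* v > 255 -> 255      */
    return (uint8_t)v;
}

/* Add a residual block to a prediction block, saturating each pixel.
 * The loop body contains no if/else, so the only branch is the loop itself. */
void add_residual(uint8_t *dst, const uint8_t *pred,
                  const int16_t *res, size_t n)
{
    assert(dst != NULL && pred != NULL && res != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = clamp_u8((int32_t)pred[i] + res[i]);
}
```

A loop body without conditional jumps is also easier for the compiler to software-pipeline, which ties in with the goal of keeping all functional units busy.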
2) Using library functions
TI provides a powerful image processing library (IMGLIB) for TMS320C6000 users. This library contains many commonly used functions, covering DCT/IDCT, wavelet transforms, DCT quantization, adaptive filtering and more. These functions are already optimized and fully software-pipelined, so they run with high efficiency.
3) Storage space considerations
The configuration of DSP storage space is very important. Because DSP has different access speeds to different storage units, the access speed to on-chip registers is the fastest, and the access speed to on-chip RAM is faster than the access speed to off-chip RAM. Therefore, the reasonable configuration and use of storage space has a great impact on the overall efficiency of the system. Constant tables and code segments that are frequently accessed should be loaded into on-chip RAM as much as possible. If they are too large, part of them should be loaded into off-chip memory.
4) Hybrid Programming
Unlike traditional VLIW designs, VelociTI uses a variety of advanced techniques that make the DSP's C compiler highly efficient, so the device can be described as a DSP oriented toward the C language. Code produced by the compiler reaches on average 84% of the efficiency of hand-written assembly. This makes it possible to write most applications in C, to make full use of the large body of algorithms already described in C, and to obtain maintainability, portability and reusability far superior to traditional DSP programs, while shortening the development cycle.
Although the C compiler of C6000 has such high compilation efficiency, it is far from enough to use only C language for complex algorithms such as MPEG-4. Generally, a combination of C language and assembly language is used to complete the program design. The program design process is as follows: first write C code and optimize it. If it cannot achieve the expected operating efficiency, write assembly code to improve efficiency.
4. Conclusion
The system is very flexible and supports standard analog video signals of PAL or NTSC formats. The input video signals can be composite video signals or S-Video signals. It also supports multiple resolutions, including FULL, CIF and QCIF, to meet the needs of various applications. Tests have shown that the above optimization can achieve real-time compression of video images, while the system operates reliably and consumes low power.