Real-time Image Processing System Based on DSP TMS320C6416

Introduction

This paper presents the design of an MPEG-4 encoder built on a TMS320C6000-series DSP. Images captured by the camera are compressed in real time according to the MPEG-4 standard and displayed in real time over VGA. The compressed data is also passed to the ARM controller over the PCI bus, and the video data is then sent over the network as required.

MPEG-4 is an open standard: many parts of it are left unspecified, so new algorithms can be added. A general-purpose DSP therefore allows the algorithm to be updated and optimized at any time to improve coding efficiency. However, because the MPEG-4 encoding algorithm is complex and requires a large amount of data to be stored, memory allocation, data transfer and computing speed are all challenges for the DSP.

The C6000 series is TI's high-end DSP family. All devices in the series are VLIW DSPs based on the VelociTI™ architecture. They can execute eight 32-bit instructions per cycle, so a CPU clock of up to 200 MHz gives a peak computing power of 1600 MIPS. At a clock frequency of 600 MHz, the C6416 can simultaneously handle one channel of MPEG-4 video encoding, one channel of MPEG-4 video decoding and one channel of MPEG-2 video encoding using only 50% of its computing power. It also has flexible external interfaces and complete development tools, and is widely adopted in embedded real-time image compression systems. This system therefore uses TI's TMS320C6416 as its core processor.
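The 1600 MIPS figure quoted above is simply the issue width multiplied by the clock rate. A minimal Python sketch of that arithmetic follows; the function name `peak_mips` is illustrative only and does not belong to any TI tool.

```python
def peak_mips(clock_mhz, instructions_per_cycle=8):
    """Theoretical peak in MIPS: instructions issued per cycle times clock in MHz."""
    return clock_mhz * instructions_per_cycle


# 8 instructions per cycle at 200 MHz, as quoted in the text
print(peak_mips(200))
# the same issue width at the 600 MHz clock mentioned for the C6416
print(peak_mips(600))
```

This is a theoretical ceiling. Sustained throughput depends on how well the code keeps all eight functional units busy.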

1. Structure and features of TMS320C6416

The CPU structure of the DSP is shown in Figure 1. It has two data paths, each with four functional units (one multiplier and three arithmetic logic units) and sixteen 32-bit general-purpose registers. The functional units of each data path can freely access that path's registers. The CPU also has two cross paths, through which the functional units of one data path can access the registers of the other. In addition, the CPU has 256-bit-wide data and program buses, so the program memory can supply eight instructions for parallel execution in each clock cycle. This CPU structure is the basic prerequisite for a VLIW DSP.

The memory space of the DSP is mapped into internal memory, internal peripherals and external memory. The internal memory consists of 64 KB of internal program memory and data memory. The internal program memory can be mapped into the CPU address space or operated as a cache. Both internal and external data memory can be accessed by the CPU, the DMA controller or the HPI (host port interface). The HPI lets a host processor access the DSP's memory space.

2. System hardware design

The system consists of three main parts: a video acquisition module, an MPEG-4 video encoding module and a video transmission module. Its block diagram is shown in Figure 2.

2.1 Video Capture

In this system, the input analog video signal is captured by the BT835 video decoder. The supported input is a standard analog video signal in PAL or NTSC format, either composite video or S-Video, and the output is image data in 4:2:2 YUV format.

Figure 3 shows the block diagram of the DSP's analog video input interface. After preprocessing, the standard analog video signal enters the A/D converter. In parallel, it passes through a clock generation circuit that derives an A/D conversion clock in phase with line sync, so that each line contains an integer number of sampling points. To ensure that the video data is captured into the DSP in its entirety, the line sync signal is used as the starting point for the FIFO to read in data. The line sync, field sync and odd/even field flag signals also go directly to the DSP, so that it can determine where the incoming video data lies within a frame. To improve real-time performance, the system uses the background operation of the TMS320C6416's DMA (direct memory access) channels, so that data exchange between the DSP and the peripherals proceeds in parallel with the CPU's high-speed computation. The FIFO's buffering also makes it easy for the DSP to exchange data with peripherals other than the A/D converter.

The ARM7 generates the clock, controls the video acquisition chip, converts the acquired data from 8 or 16 bits to 32 bits, and arranges the data so that the Y, U and V components are separated. This amounts to preprocessing the acquired data to simplify video encoding. The ARM7 then writes the 32-bit data to a 32-bit FIFO. Using a 32-bit FIFO and widening the video data to 32 bits keeps the DSP's 32-bit data bus fully used when it reads video data, which improves read efficiency. The FIFO also shortens the time the DSP spends reading data and smooths the speed mismatch between fast and slow devices. Each time the FIFO is half full, the ARM7 sends an interrupt to the DSP, which reads the video data by DMA in its interrupt handler. Without the ARM7, the DSP would be interrupted frequently and would spend a lot of time on stack pushes and pops and on register setup.
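The preprocessing described above (separating Y, U and V and widening the samples to 32-bit words) can be illustrated with a short Python sketch. The interleaved byte order U0 Y0 V0 Y1 and the little-endian packing are assumptions made for the example, and the function names are hypothetical. The source does not specify either detail.

```python
def split_planes(uyvy):
    """Split interleaved 4:2:2 bytes (assumed order U0 Y0 V0 Y1) into Y, U, V planes."""
    if len(uyvy) % 4 != 0:
        raise ValueError("4:2:2 data must come in groups of four bytes")
    y = bytes(uyvy[1::2])  # every second byte is a luma sample
    u = bytes(uyvy[0::4])  # one U sample per pixel pair
    v = bytes(uyvy[2::4])  # one V sample per pixel pair
    return y, u, v


def pack_words(plane):
    """Pack four 8-bit samples into each 32-bit word (little-endian, zero padded)."""
    padded = plane + bytes(-len(plane) % 4)
    return [int.from_bytes(padded[i:i + 4], "little")
            for i in range(0, len(padded), 4)]
```

With four samples per word, each 32-bit read by the DSP carries four useful bytes instead of one, which is the bus-utilization gain the paragraph refers to.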

2.2 MPEG-4 video encoding module

After the DSP reads in the video data, it performs preliminary processing, such as converting YUV to RGB, and then performs MPEG-4 video encoding. In this process, data access typically takes about 50% of the time, arithmetic operations 30% and control 20%. Because motion estimation and motion compensation are required, one I frame (intra-coded frame) and at least one P frame (predicted frame) are usually held in data memory. These images occupy a relatively large space, so they are placed in external SDRAM. The encoding process also stores DCT coefficients, motion vectors, quantization matrices, variable-length coding tables, zigzag scan tables and so on. Because these occupy little space and are used repeatedly, they are placed in on-chip memory.
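A rough storage estimate shows why the frames go to external SDRAM. The Python sketch below assumes CIF resolution (352×288) and 4:2:0 chroma subsampling for the stored frames, neither of which the text states for this step, and compares the result with the 64 KB internal memory figure quoted in Section 1. The names `frame_bytes`, `fits_on_chip` and `ON_CHIP_BYTES` are illustrative.

```python
ON_CHIP_BYTES = 64 * 1024  # internal memory figure quoted in Section 1


def frame_bytes(width, height, chroma="4:2:0"):
    """Bytes needed for one frame with 8-bit samples."""
    luma = width * height
    if chroma == "4:2:0":
        return luma + 2 * (width // 2) * (height // 2)
    if chroma == "4:2:2":
        return luma + 2 * (width // 2) * height
    raise ValueError("unsupported chroma format: " + chroma)


def fits_on_chip(n_frames, width, height, chroma="4:2:0"):
    """True if n_frames frames fit in the assumed on-chip budget."""
    return n_frames * frame_bytes(width, height, chroma) <= ON_CHIP_BYTES


# one CIF frame in 4:2:0, then an I frame plus a P frame
print(frame_bytes(352, 288))
print(fits_on_chip(2, 352, 288))
```

Under these assumptions a single CIF frame already exceeds the on-chip budget, while the coding tables are only a few kilobytes, which matches the placement described above.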

2.3 Video Transmission

Unlike a PC, a DSP has a two-level on-chip/off-chip memory architecture, and its data allocation rules mean that an encoder implementation involves a large amount of data transfer. These transfers must be managed effectively to reduce the time they take.

Data acquisition can be handled by the DSP's DMA. Most TMS320C6000 DSPs have several independent DMA channels. The key feature of DMA is that it moves data from a source address to a destination address without CPU intervention.

However, DMA is only suited to moving whole blocks of data. For transfers between different data structures, the DMA controllers of earlier DSPs are of no help. The ARM7 can therefore be used to control the DSP's DMA to carry out the complex data transfers in video encoding.

The encoded video data is sent to the outside world through the ARM7, for example over the Internet or a CDMA or GSM network; the ARM7 only needs the corresponding transmission interface. Communication between the ARM7 and the encoder card can use a parallel port, serial port, USB port, PCI interface and so on. Of these, the PCI interface makes high-speed data transfer between the ARM7 and the encoder easy, so the PCI interface is used. The encoded data reaches the ARM7 through the DSP's HPI, the PCI bridge chip and the PCI bus. The ARM7 accesses the DSP's memory space directly through the HPI.

3. Software design and optimization

3.1 Video Capture

For video acquisition, the system uses a data structure that turns the contiguous look-ahead buffer into a circular buffer. A simple schematic is shown in Figure 4.

With this method, as long as the buffer is allocated enough space to hold more than three image frames, new image data can be captured while existing image data is being processed, without any data conflict. The ring buffer holds up to N frames, and each frame stays there until the system takes it.
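A minimal Python sketch of such a ring buffer is shown here for illustration. The class name `FrameRing`, its methods and the default capacity of four frames are assumptions. The source describes only the idea, not an implementation.

```python
class FrameRing:
    """Fixed-size ring of frame slots shared by a capture side and an encoder side."""

    def __init__(self, capacity=4):
        self.slots = [None] * capacity
        self.head = 0   # next slot the capture side writes
        self.tail = 0   # next slot the encoder reads
        self.count = 0  # frames currently waiting

    def put(self, frame):
        """Store a captured frame; refuse (return False) rather than overwrite."""
        if self.count == len(self.slots):
            return False
        self.slots[self.head] = frame
        self.head = (self.head + 1) % len(self.slots)
        self.count += 1
        return True

    def get(self):
        """Hand the oldest waiting frame to the encoder, or None if the ring is empty."""
        if self.count == 0:
            return None
        frame = self.slots[self.tail]
        self.slots[self.tail] = None
        self.tail = (self.tail + 1) % len(self.slots)
        self.count -= 1
        return frame
```

Because the write and read indices advance independently and wrap around, capture can fill one slot while the encoder works on another. Refusing to overwrite a full ring is one possible policy; dropping the oldest frame instead would be an equally valid choice.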

3.2 Video Coding

MPEG-4 video coding is object-based, but it still uses the traditional hybrid coding scheme of predictive coding, motion compensation and the DCT. The core algorithms of the encoder include motion estimation, DCT/IDCT, quantization and VLC, of which motion estimation accounts for nearly a quarter of the encoder's computational load. Finding a motion estimation algorithm that suits the DSP architecture and strikes a good balance between speed and coding quality is therefore key to real-time encoding.

Video encoding relies on block-matching motion estimation, but the traditional block-matching algorithm is not fast enough. This system therefore adopts a four-step search block-matching algorithm, an improvement on the three-step search algorithm.

The four-step search algorithm is described as follows:

(1) The matching points form a diamond-shaped window, as shown in Figure 5. The initial 9 matching points are the 4 vertices of the diamond, the midpoints of its 4 edges and its center, shown as solid points in Figure 5. Calculate the SAD (sum of absolute differences) at each point and select the point with the smallest SAD. If that point is the center of the search window, jump to step 4; otherwise go to step 2.

Figure 5 Schematic diagram of the four-step search algorithm

(2) Take the point with the smallest SAD as the center of a new diamond window, and select the remaining matching points according to the following rules.

a) If the point with the smallest SAD is a corner point of the current search window, such as point A, select the 5 new points that are not adjacent to point A, as marked in Figure 5. Select the point with the smallest SAD and go to step 3;

b) If the point with the smallest SAD is on an edge of the current search window, such as point B, select the 3 new points that are not adjacent to point B, as marked in Figure 5. Select the point with the smallest SAD and go to step 3;

c) If the point with the smallest SAD is the center point C of the current search window, go to step 4;

(3) The search pattern is the same as in step 2, but every case proceeds to step 4.

(4) Reduce the step size to 1 and select the four surrounding points as matching points, shown as hollow points in Figure 5. Select the point with the smallest SAD as the final target point.
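The steps above can be sketched in Python as follows. This is a minimal illustration, not the system's DSP code. Because Figure 5 is not reproduced here, the diamond offsets, the 8×8 block size and all names are assumptions. Points that were already evaluated are cached, so a move to a corner costs 5 new SAD calculations and a move to an edge costs 3, as in steps 2a and 2b.

```python
# Assumed offsets (dx, dy): centre, 4 vertices and 4 edge midpoints of the diamond
LARGE_DIAMOND = [(0, 0), (0, -2), (1, -1), (2, 0), (1, 1),
                 (0, 2), (-1, 1), (-2, 0), (-1, -1)]
# Step 4: centre plus its four neighbours at step size 1
SMALL_DIAMOND = [(0, 0), (0, -1), (1, 0), (0, 1), (-1, 0)]


def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the current block and a shifted reference block."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def four_step_search(cur, ref, bx, by, n=8):
    """Return ((dx, dy), sad) for the n x n block whose top-left corner is (bx, by)."""
    height, width = len(ref), len(ref[0])
    cache = {}

    def cost(dx, dy):
        # Candidates whose block would leave the reference frame are skipped
        if bx + dx < 0 or by + dy < 0 or bx + dx + n > width or by + dy + n > height:
            return None
        if (dx, dy) not in cache:
            cache[(dx, dy)] = sad(cur, ref, bx, by, dx, dy, n)
        return cache[(dx, dy)]

    def best_of(center, pattern):
        best, best_cost = center, cost(*center)
        for ox, oy in pattern:
            cand = (center[0] + ox, center[1] + oy)
            c = cost(*cand)
            if c is not None and c < best_cost:
                best, best_cost = cand, c
        return best

    center = (0, 0)
    for _ in range(3):              # steps 1 to 3 use the large diamond
        best = best_of(center, LARGE_DIAMOND)
        if best == center:          # minimum at the centre: jump to step 4
            break
        center = best
    best = best_of(center, SMALL_DIAMOND)  # step 4: step size 1
    return best, cost(*best)
```

With three large-diamond steps of at most 2 pixels each and a final step of 1 pixel, this sketch covers displacements of up to ±7 pixels. Ties are resolved in favor of the current center, which is a design choice of the sketch rather than something the source specifies.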

The four-step search algorithm is less complex than the three-step search algorithm, with no loss of accuracy. Its regular structure also lends itself to software pipelining, which makes it well suited to implementation on a DSP.

3.3 Software Optimization

Image processing involves large amounts of highly correlated data and strict frame and field timing constraints. Optimizing the DSP program around these characteristics so that the processor's performance is fully used is therefore key to improving the performance of the whole system.

To make full use of the DSP's computing power, optimization must start from its hardware structure: make the best use of the eight functional units, use software pipelining, and keep the program running in parallel without conflicts. Loop bodies generally meet the conditions for parallel processing and are often the most time-consuming part of a program, so optimization focuses on loop bodies.

1) Optimization of DSP jump instructions

Most DSP instructions are single-cycle instructions, but branch instructions usually consume more clock cycles. Each branch has five delay slots, which is very costly in terms of performance. Branches in the program should therefore be reduced as much as possible.

2) Using library functions

TI provides a powerful image library (IMGLIB) for TMS320C6000 users. The library contains many commonly used functions, covering DCT/IDCT, wavelet transforms, DCT quantization, adaptive filtering and more. These functions are already optimized and make full use of software pipelining, so they are highly efficient.
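For reference, the computation that an optimized forward DCT routine performs on each 8×8 block can be written out directly. The Python sketch below is a plain textbook 2-D DCT-II, given only to show what is being accelerated. It is not the library's API, and the function name `dct_8x8` is hypothetical.

```python
import math


def dct_8x8(block):
    """Textbook 2-D DCT-II of an 8x8 block (orthonormal scaling)."""
    n = 8

    def scale(k):
        return math.sqrt(0.5) if k == 0 else 1.0

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            acc = 0.0
            for x in range(n):
                for y in range(n):
                    acc += (block[x][y]
                            * math.cos((2 * x + 1) * u * math.pi / 16)
                            * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * scale(u) * scale(v) * acc
    return out
```

This direct form needs 64 multiply-accumulate pairs per coefficient, which is why a hand-tuned, software-pipelined library routine is preferable to writing the transform from scratch.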

3) Storage space considerations

The configuration of the DSP's memory space is very important, because the DSP accesses different kinds of storage at different speeds: on-chip registers are the fastest, and on-chip RAM is faster than off-chip RAM. Sensible configuration and use of memory therefore has a great impact on the overall efficiency of the system. Frequently accessed constant tables and code segments should be loaded into on-chip RAM as far as possible. If they are too large, part of them should be placed in off-chip memory.

4) Mixed C and assembly programming

Unlike traditional VLIW designs, VelociTI uses a range of advanced techniques that make the DSP's C compiler highly efficient, so the device can be regarded as a DSP oriented to the structure of the C language. The compiler's average efficiency can reach 84% of that of hand-written assembly. This makes it possible to write most applications in C, reuse the large body of algorithms already described in C, and obtain maintainability, portability and reusability far better than those of traditional DSP programs, which shortens the development cycle.

Although the C6000 C compiler is this efficient, C alone is far from enough for complex algorithms such as MPEG-4. Programs are generally written in a combination of C and assembly language. The process is as follows: first write the C code and optimize it; if it still cannot reach the expected efficiency, write assembly code for the critical parts.

4. Conclusion

The system is flexible and supports standard analog video signals in PAL or NTSC format, as either composite video or S-Video. It also supports multiple resolutions, including full resolution, CIF and QCIF, to meet the needs of various applications. Tests show that, with the optimizations described above, the system compresses video images in real time, operates reliably and consumes little power.

Keywords: TMS320C6416
