Full hardware video processing engine simplifies video system design

Publisher: 数字小巨人 | Latest update time: 2011-04-20
Digital video processing is a perennial hot topic in multimedia device applications. Digital video standards are numerous and still evolving, so a system design must support as wide a range of video formats as possible. The traditional choice is software encoding and decoding on a DSP, but as 1080i/p high-definition video becomes widespread, the computation required rises sharply and software-based processing faces growing challenges. This is where solutions based on hardware accelerators show their advantages: they greatly reduce the processor load and help meet the strict low-power requirements of mobile devices. More and more system solutions are therefore adopting designs based on full hardware video processing engines (VPUs).

The Freescale i.MX53 application processor is a typical example of an architecture based on hardware accelerators. Its embedded full hardware VPU supports a wide range of video formats, from H.264, MPEG-4 and DivX to RV10, which covers most video content, and it supports 1080i/p high-definition decoding and 720p encoding. The processor can also decode several video streams at the same time, or encode and decode several streams in full duplex, with each stream using a different format. This makes dual-display configurations and video conferencing applications possible.

Typical hardware video processing engine structure

Unlike a full hardware VPU in the usual sense, this VPU has the significant advantage of being programmable: its encoding and decoding flow can be updated to a certain extent. The reason is a small built-in 16-bit programmable DSP. This processor, called BIT, flexibly controls the encoding and decoding flow and the interface with the CPU by executing different firmware.

For the CPU, controlling the VPU requires no more than 1 MIPS of computation. This low requirement is also due to the BIT processor. It contains a dedicated hardware accelerator for bitstream processing and implements functions including frame rate control, FMO (flexible macroblock ordering), ASO (arbitrary slice ordering), video codec control, and error recovery. Most of the sub-modules in the VPU are also highly optimized and are shared across the supported video formats for both encoding and decoding, which reduces gate count and power consumption.

The VPU structure of the MX53 is shown in Figure 1. The VPU is connected to the ARM processor through standard AXI and APB interfaces, and it can access the on-chip cache to achieve high performance. It consists mainly of two components: the video codec processing IP and the VPU bus converter. The former is the core of the entire VPU and is composed mainly of the embedded BIT processor, the video codec and a bus arbiter; the latter converts the AMBA APB3 bus to the IP Sky Blue bus used inside the VPU.

Figure 1. MX53 VPU structure

Video decoding process flow

Because the BIT processor handles flow control so thoroughly, the VPU appears highly autonomous to the external CPU, which only needs to manage the processes related to the VPU. Note that "process" here does not mean an operating system process in the usual sense, but a dedicated process inside the VPU.

The VPU can handle up to four channels of video in different formats at the same time, and the processing flow is the same for each. It starts with creating a process (the system creates and configures a dedicated process), continues with running the process (the system must run it at a point when the decoder is idle and the bitstream is ready in memory), and ends with exiting the process.

When multiple processes are ready to run, each is assigned a unique process index number based on the order in which it was created. For example, when one MPEG-4, one H.264, one MPEG-2 and one VC-1 decoding channel are created in that order and run at the same time, the MPEG-4 decoding process is assigned index 0 and the VC-1 decoding process is assigned index 3.
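The index assignment can be illustrated with a small toy model. This is a sketch only: `VpuModel` and `create_process` are hypothetical names invented for illustration and are not part of any Freescale API; the limit of four slots is taken from the article.

```python
# Toy model of VPU process-index assignment (hypothetical, not a real driver API).
MAX_PROCESSES = 4  # the article states up to 4 simultaneous channels


class VpuModel:
    """Hypothetical stand-in for the VPU's process table."""

    def __init__(self):
        self.processes = []  # list position doubles as the process index

    def create_process(self, codec):
        if len(self.processes) >= MAX_PROCESSES:
            raise RuntimeError("all VPU process slots are in use")
        self.processes.append(codec)
        return len(self.processes) - 1  # index follows creation order


vpu = VpuModel()
indices = {codec: vpu.create_process(codec)
           for codec in ["MPEG-4", "H.264", "MPEG-2", "VC-1"]}
print(indices)  # MPEG-4 gets index 0, VC-1 gets index 3
```

Creating a fifth process in this model raises an error, which mirrors the four-channel limit described above.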

In a multi-process environment, processes have no execution priority. After all processes have been created, the CPU starts the BIT processor to execute them, and the BIT processor schedules them using a mechanism similar to time slicing.

Now let's step outside the VPU and look at its operation from the perspective of the entire system, taking the simultaneous decoding of one H.264 stream and one MPEG-4 stream as an example.

First, initialize the VPU. This includes loading the firmware required by the BIT processor into memory and setting initialization parameters such as the BIT processor configuration, the working buffer base address, the BIT code address, and stream buffer control.

Then create the H.264 and MPEG-4 decoding processes. This includes setting the base address and size of each bitstream buffer, the base address of the frame buffer, and so on.

The processes are then executed alternately. A busy flag (Wait BusyFlag) indicates whether a frame of the bitstream has been decoded. The decoded frames are sent to the image processing unit (IPU) for post-processing and display.

Finally, after decoding is complete, the associated memory resources are released and the processes are destroyed.
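The four steps above (initialize, create, run alternately, destroy) can be sketched as a toy simulation. Everything here is an assumption made for illustration: `ToyVpu`, its methods and the `busy` attribute are hypothetical and only model the control flow, not the real registers or driver calls. In real hardware the busy flag stays set while a frame is being decoded; in this model each frame completes immediately.

```python
from collections import deque


class ToyVpu:
    """Hypothetical model of the init/create/run/destroy flow; not a real driver API."""

    def __init__(self):
        self.initialized = False
        self.busy = False       # models the busy flag the CPU waits on
        self.codecs = {}        # process index -> codec name
        self.remaining = {}     # process index -> frames still to decode

    def init(self):
        # Stands in for loading the BIT firmware and setting base addresses.
        self.initialized = True

    def create(self, codec, frames):
        if not self.initialized:
            raise RuntimeError("VPU must be initialized first")
        index = len(self.codecs)
        self.codecs[index] = codec
        self.remaining[index] = frames
        return index

    def decode_one_frame(self, index):
        self.busy = True               # decoder is occupied
        self.remaining[index] -= 1     # one frame of this stream is decoded
        self.busy = False              # frame done, decoder idle again
        return self.codecs[index]

    def destroy(self, index):
        del self.codecs[index]
        del self.remaining[index]


def run_all(vpu, streams):
    """Decode (codec, frame_count) streams alternately; return the decode order."""
    vpu.init()
    ready = deque(vpu.create(codec, n) for codec, n in streams)
    order = []
    while ready:
        index = ready.popleft()
        if not vpu.busy:               # only start a frame while the decoder is idle
            order.append(vpu.decode_one_frame(index))
        if vpu.remaining[index] > 0:
            ready.append(index)        # more frames left: back of the queue
        else:
            vpu.destroy(index)         # stream finished: release its resources
    return order


order = run_all(ToyVpu(), [("H.264", 2), ("MPEG-4", 3)])
print(order)
```

With two H.264 frames and three MPEG-4 frames, the model alternates between the streams until the shorter one finishes, then drains the other.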


Memory control is a key issue when using the VPU

The VPU has full access to external memory, which it uses to load and store image frames, bitstreams, and the code and data of the BIT processor. The amount of memory used depends on the video format and the target application. For example, H.264 decoding can use up to 16 reference frames, whereas H.263 decoding requires only one. Different formats also need different amounts of temporary memory for de-blocking or overlap smoothing filtering.
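To get a feel for how strongly the reference-frame count drives memory use, here is a rough worked calculation. It is a sketch under stated assumptions, none of which come from the i.MX53 documentation: frames are stored as YUV 4:2:0 at 8 bits per sample, dimensions are padded to multiples of 16 pixels, and two extra buffers are reserved on top of the reference frames. The number of reference frames a real stream uses also depends on its profile and level.

```python
def frame_bytes(width, height, align=16):
    """Bytes for one YUV 4:2:0 frame, dimensions padded to a multiple of `align`."""
    padded_w = -(-width // align) * align   # ceiling division, then scale back up
    padded_h = -(-height // align) * align
    return padded_w * padded_h * 3 // 2     # luma plane plus two quarter-size chroma planes


def frame_pool_bytes(width, height, reference_frames, extra_frames=2):
    """Reference frames plus an assumed number of extra decode/display buffers."""
    return frame_bytes(width, height) * (reference_frames + extra_frames)


one_frame = frame_bytes(1920, 1080)            # 1920 x 1088 after padding
worst_case = frame_pool_bytes(1920, 1080, 16)  # 16 reference frames
single_ref = frame_pool_bytes(1920, 1080, 1)   # 1 reference frame

print(one_frame)    # 3133440
print(worst_case)   # 56401920
print(single_ref)   # 9400320
```

Under these assumptions, a 16-reference-frame pool at 1080p needs roughly 54 MiB, against about 9 MiB for a single reference frame.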

In general, the VPU uses six different memory areas: the frame buffer (stores decoded image frames), the BIT processor code memory, the working buffer (holds intermediate data for the BIT processor and the video decoding hardware), the bitstream buffer (holds the incoming bitstream), the parameter buffer (holds BIT processor command arguments and return data), and the search RAM (used by the motion estimation (ME) module to reduce the load on the external memory bus).

Among these, the handling of the bitstream buffer is critical. The system must allocate a separate bitstream buffer for each process. Each external bitstream buffer is organized as a ring buffer, and once the BIT processor has been given its start address, it wraps around the buffer automatically.

During decoding, the CPU writes the bitstream into the buffer and the BIT processor reads it. If the two are not coordinated, the bitstream will be overwritten or will underflow, and decoding will fail. To prevent this, the external CPU and the BIT processor inside the VPU must exchange the current read and write pointers of the bitstream buffer. Both the write pointer, maintained by the CPU, and the read pointer, maintained by the BIT processor, are written to internal registers. The BIT processor compares the two pointers to determine whether the buffer holds enough data. If it does not, decoding must stop, so that invalid data is not read, until the CPU has written enough bitstream data and updated the write pointer. Conversely, the CPU must check the read pointer before writing data into the ring buffer, to make sure it does not overwrite data that has not yet been read.
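The pointer handshake described above can be sketched with a generic ring buffer. This is an illustrative model rather than the VPU's actual register interface: the class and method names are hypothetical, and keeping one byte unused to tell a full buffer from an empty one is a common convention assumed here, not something the article specifies.

```python
class BitstreamRing:
    """Toy ring buffer with a writer-owned write pointer and a reader-owned read pointer."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.size = size
        self.rd = 0  # advanced by the reader (the BIT processor in the article)
        self.wr = 0  # advanced by the writer (the CPU in the article)

    def available(self):
        """Bytes written but not yet read."""
        return (self.wr - self.rd) % self.size

    def free(self):
        """Bytes the writer may add; one byte stays unused so full != empty."""
        return self.size - 1 - self.available()

    def write(self, data):
        """Writer side: never overwrite unread data; return the bytes accepted."""
        n = min(len(data), self.free())
        for i in range(n):
            self.buf[(self.wr + i) % self.size] = data[i]
        self.wr = (self.wr + n) % self.size
        return n

    def read(self, n):
        """Reader side: return None (stall) rather than read past the write pointer."""
        if self.available() < n:
            return None
        out = bytes(self.buf[(self.rd + i) % self.size] for i in range(n))
        self.rd = (self.rd + n) % self.size
        return out


ring = BitstreamRing(8)
print(ring.write(b"abcdef"))   # 6: everything fits
print(ring.read(4))            # b'abcd'
print(ring.write(b"ghijkl"))   # 5: only the free space is filled, wrapping around
print(ring.read(8))            # None: not enough data, the reader must wait
print(ring.read(7))            # b'efghijk'
```

The writer checks the read pointer (through `free`) before writing, and the reader checks the write pointer (through `available`) before reading, which is the same two-sided check the paragraph describes.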

In applications such as 1080i/p high-definition decoding, the VPU requires very high memory bandwidth. Because most current operating systems are multitasking, other tasks compete for the same memory, and bandwidth shortages are likely. These cause stuttering playback or even decoding errors, so the use of system bandwidth must be planned carefully.
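A back-of-the-envelope estimate shows why bandwidth needs planning. The traffic model below is an assumption made purely for illustration, not a figure from the i.MX53 documentation: each decoded YUV 4:2:0 frame is written once, an equivalent of one frame of reference data is read for motion compensation, and the frame is read once more for display. Real traffic depends on the stream, the memory layout and any caching, so treat the result as an order of magnitude only.

```python
def frame_bytes(width, height):
    """Bytes in one unpadded YUV 4:2:0 frame at 8 bits per sample."""
    return width * height * 3 // 2


def rough_bandwidth(width, height, fps, ref_reads_per_frame=1):
    """Estimated bytes per second under the assumed traffic model.

    Per decoded frame: 1 write, `ref_reads_per_frame` reference reads, 1 display read.
    """
    per_frame = frame_bytes(width, height) * (1 + ref_reads_per_frame + 1)
    return per_frame * fps


bps = rough_bandwidth(1920, 1080, 30)
print(bps)                  # 279936000
print(round(bps / 1e6))     # 280 (MB/s)
```

Even this simplified model puts one 1080p stream at 30 frames per second near 280 MB/s, before bitstream reads and the traffic of other tasks are counted.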

Conclusion

The analysis above shows that the i.MX53 VPU is simple to use. By encapsulating the encoding and decoding flow, the full hardware VPU hides most of its complexity and makes video processing an easy task overall. This is one of the significant advantages of a full hardware VPU. Competition in the multimedia device market is fierce, and system manufacturers have very little time for product development. For video solutions, application processor suppliers must therefore ensure that their reference designs offer simple, easy-to-use APIs together with fully verified reliability and real-time encoding and decoding performance. System design based on full hardware video processing is a very attractive solution in this market.
