Video acceleration engine technology based on Xtensa configurable processor technology

The growth of handheld multimedia devices has greatly changed the product requirements facing multimedia chip suppliers. The IC design target is no longer just one or two multimedia codecs: consumers want their mobile devices to play media produced on different devices, encode content using different standards, and download or receive media data from many sources. Video decoder and encoder engines must therefore support multiple standards while retaining area and power-consumption advantages.

1. The traditional RTL method for designing video acceleration engines

The previous generation of video ASICs focused mainly on MPEG-2 encoding and decoding, because MPEG-2 is the DVD standard; some video ASICs also supported MPEG-1 for VCD (Video CD) playback. In most cases the MPEG-2 encoder and decoder were designed in RTL. A typical MPEG-2 video ASIC architecture is shown in Figure 1; it includes a video subsystem composed of various RTL modules, a main controller, and on-chip memory.

Figure 1 MPEG-2 video ASIC architecture

The use of a hard-wired RTL architecture can support multiple video standards; however, each video standard then requires its own dedicated RTL module. Implementing a multi-standard video acceleration engine from hard-wired RTL modules therefore has clear limitations: whether the goal is to implement a new video standard, update an existing one, or fix a bug, the chip must be re-fabricated.

2. Advantages of using a processor as a video acceleration engine

Programmable processors can meet the flexibility requirements of multiple video standards. Compared with the RTL design approach, programmable processors have the following advantages: first, a codec is easy to interface to the processor; second, new video standards can be supported, existing codecs updated, and bugs fixed in software after the chip is in production; third, the performance of a video codec can be improved simply through software updates.
However, traditional 32-bit processors hit performance bottlenecks because they are designed for general-purpose code rather than for video acceleration. Embedded DSPs are likewise not tailored for video: their hardware functional units, instructions, and interfaces target general-purpose DSP work. As a result, implementing video codecs on traditional RISC and DSP processors requires very high clock frequencies and large amounts of memory, and therefore a great deal of power, which makes these processors unsuitable for portable applications.
A look at the computation required by a typical video kernel makes the problem clear. For example, the sum-of-absolute-differences (SAD) operation is used in the motion-estimation step of most video coding algorithms. SAD measures how well a macroblock in one frame matches a candidate macroblock in a neighboring frame by summing the absolute differences between each pair of corresponding pixel values in the two macroblocks.
The following C code gives a simple implementation of the SAD core algorithm:
for (row = 0; row < numrows; row++) {
    for (col = 0; col < numcols; col++) {
        accum += abs(macroblk1[row][col] - macroblk2[row][col]);
    } /* column loop */
} /* row loop */
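For reference, a self-contained version of this scalar kernel is sketched below; the 16x16 macroblock size and the unsigned 8-bit pixel type are assumptions made for illustration:
#include <stdlib.h>

/* Scalar SAD of two 16x16 macroblocks: 256 subtractions, 256 absolute
 * values, and 256 accumulations, i.e. 768 arithmetic operations in total. */
unsigned int sad_16x16(const unsigned char macroblk1[16][16],
                       const unsigned char macroblk2[16][16])
{
    unsigned int accum = 0;
    for (int row = 0; row < 16; row++) {
        for (int col = 0; col < 16; col++) {
            accum += (unsigned int)abs((int)macroblk1[row][col] -
                                       (int)macroblk2[row][col]);
        }
    }
    return accum;
}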
The basic computation of the SAD kernel is shown in Figure 2: a subtraction, then an absolute value, and finally accumulation of the result into a running sum.


Figure 2 Basic computation of the sum of absolute differences (SAD)

On a RISC processor, computing the SAD of two 16x16 macroblocks requires 256 subtractions, 256 absolute-value operations, and 256 accumulations, a total of 768 arithmetic operations, not counting the loads and stores needed to move the data. Because this must be repeated for the macroblocks of every frame, the computational cost grows rapidly as video resolution increases.
In fact, for a general-purpose RISC processor (even one with some DSP instructions such as multiply and multiply-accumulate), executing an H.264 Baseline decoder at CIF resolution requires about 250 MHz of performance, while an H.264 Baseline encoder at CIF resolution requires more than 1 GHz. At these rates the processor core alone consumes roughly 500 mW, not counting the power used by memory accesses and the other components of the video SOC.

3. Configurable Processor Method

A more efficient way to implement the SAD kernel on a processor is to create a dedicated "subtract-absolute value-accumulate" instruction. This greatly reduces the number of arithmetic instructions, from 768 to 256 for a 16x16 macroblock. Moreover, because a single functional unit combines the three simple arithmetic operations, each fused operation completes in one instruction cycle, so the whole macroblock takes roughly 256 cycles instead of 768. Users cannot add instructions to a standard 32-bit RISC processor, but they can add dedicated instructions to a configurable processor. Configurable processors let designers select options from a configuration menu to extend the processor's functionality, including adding dedicated instructions, register files, and interfaces.
The following configuration and extension options are provided by modern configurable processors (such as Tensilica's Xtensa processor) and are not available in traditional fixed-architecture processors.
(i) Configuration options. The option menu includes items such as:
a. Which optional instructions to include, for example 16x16 multiply or multiply-accumulate, shift, and floating-point instructions.
b. Zero-overhead loops, a five-stage or seven-stage pipeline, the number of local data load/store units, and so on.
c. Whether memory protection, memory address translation, or a memory management unit (MMU) is required.
d. Whether to include a system bus interface.
e. The system bus width and the local memory interface width.
f. The size and number of local (tightly coupled) memories.
g. The number of interrupts and their types and priorities.
(ii) Extension options. Designer-defined functional units can be added, including:
a. Registers and register files.
b. Multi-cycle, arbitrated complex-instruction functional units.
c. Single-instruction, multiple-data (SIMD) functional units.
d. Conversion of a single-issue processor into a multi-issue processor.
e. User-defined interfaces that read and write the data path directly, for example processor-core ports or pins similar to GPIO (general-purpose input/output) pins, or first-in-first-out (FIFO) queue interfaces that can connect to other logic or processor cores.
Configuration options let designers build a right-sized processor by selecting only the options relevant to their application. Extension options let designers customize the processor for the application by adding dedicated instructions, register files, functional units, and associated interfaces that accelerate the system's algorithms.

4. Automated software development tool kit support

The key to configurability and extensibility is that, in addition to automatically generating pre-verified RTL for the customized processor (including all of the designer's extensions), the flow also automatically generates the complete software tools: a development toolchain matched to and optimized for the processor, a cycle-accurate instruction set simulator, and a system model.
This automation means the compiler knows about the new instructions, registers, and register files added by the designer, so it can schedule the user-defined instructions and allocate the new registers. Similarly, when debugging, software developers can see the designer-defined registers and register files alongside the processor's base registers, and they can use the instruction set simulator to simulate the designer-defined instructions. The real-time operating system (RTOS) port and the system model for the processor can also be generated automatically. Tensilica's tools can regenerate this complete software environment within about an hour; this automation is central to the value of configurable processors, allowing operations such as SAD to be accelerated without an RTL implementation.

5. Using configurable processors to build video acceleration engines: establishing multi-operation functional units


Adding a fused operation such as the SAD step to a configurable processor is straightforward. A new instruction called "sub.abs.ac" performs the subtract, absolute-value, and accumulate steps together, turning the sequence of operations in Figure 2 into the single fused operation of Figure 3.

Figure 3 Using the new instruction to calculate the "subtraction-absolute value-accumulation" operation

After this instruction is added to the processor, the C compiler recognizes the new "sub.abs.ac" instruction and schedules it along with other instructions; the scheduler can display the internal signals used by the sub.abs.ac functional unit; the assembler can process the new instruction; and the instruction set simulator (ISS) can simulate it cycle-accurately.
The data path after the new dedicated video functional unit has been inserted into the processor is shown in Figure 4. Note that in addition to generating the functional-unit logic, the hardware generation tool automatically inserts the feed-forward paths, control logic, and bypass logic that interconnect the new functional unit with the rest of the data path.

Figure 4 Simplified data path diagram after inserting the sub.abs.ac video-specific functional unit

The SAD kernel described in C code using the new instruction is as follows:
for (row = 0; row < numrows; row++) {
    for (col = 0; col < numcols; col++) {
        sub.abs.ac( accum, macroblk1[row][col], macroblk2[row][col]);
    } /* column loop */
} /* row loop */
As mentioned earlier, for a 16x16 macroblock, the number of operations in the main loop of the program is reduced to 256 (that is, numrows = numcols = 16) after adding the new instructions.
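For clarity, the following plain-C model shows what a single fused sub.abs.ac operation computes; this is only an illustrative sketch, since on the real processor the operation is a hardware instruction rather than C code:
/* Plain-C model of one fused "subtract - absolute value - accumulate"
 * operation; on the extended processor this is a single instruction. */
static inline unsigned int sub_abs_ac(unsigned int accum,
                                      unsigned char a, unsigned char b)
{
    int diff = (int)a - (int)b;               /* subtract       */
    int absdiff = (diff < 0) ? -diff : diff;  /* absolute value */
    return accum + (unsigned int)absdiff;     /* accumulate     */
}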

6. Establish a single-instruction, multiple-data (SIMD) functional unit

The previous SAD program can be optimized further. The inner loop performs the same operation on each of the 16 columns of the macroblock, which makes it an ideal candidate for a SIMD (single-instruction, multiple-data) functional unit: a corresponding instruction, "sub.abs.ac16", performs the sub.abs.ac operation on 16 pixels simultaneously, as shown in Figure 5.

Figure 5 The SIMD sub.abs.ac16 instruction operating on 16 pixels simultaneously

The corresponding operation is exposed to C under the same name, sub.abs.ac16. The SAD kernel rewritten with it is as follows:
for (row = 0; row < numrows; row++) {
    sub.abs.ac16( accum, macroblk1[row], macroblk2[row]);
} /* row loop */
The rewritten SAD kernel issues only 16 of these instructions per macroblock, compared with the 768 arithmetic operations of the original code.
However, the C code above is not enough on its own. Because sub.abs.ac16 must read 128 bits of data from each of the two macroblocks, two further additions are needed: a 128-bit register file and a wide load/store interface, both of which a configurable processor can provide.
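A plain-C reference model may help make the 16-way operation concrete; this is only an illustrative sketch (the 16-pixel row view of the macroblock is an assumption), showing why each operand occupies 128 bits:
/* Plain-C model of one SIMD sub.abs.ac16 instruction: it consumes one
 * 16-pixel (128-bit) row from each macroblock and accumulates all 16
 * absolute differences in what, on the extended processor, is a single
 * instruction. */
static unsigned int sub_abs_ac16(unsigned int accum,
                                 const unsigned char row1[16],
                                 const unsigned char row2[16])
{
    for (int i = 0; i < 16; i++) {
        int diff = (int)row1[i] - (int)row2[i];
        accum += (unsigned int)((diff < 0) ? -diff : diff);
    }
    return accum;
}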

7. Create a user-defined register file

In the Xtensa configurable processor, specifying a custom register file of arbitrary width takes only a single line of description. For example, a one-line declaration named "myRegFile128" creates a register file 128 bits wide and 4 entries deep, together with a corresponding new C data type, myRegFile128, that can be used to declare variables in C/C++ code. The software tools also create "MOVE" operations for converting between standard C data types and the new custom type. The SAD kernel using the sub.abs.ac16 operation and the new register file is then as follows:
for (row = 0; row < numrows; row++) {
    myRegFile128 mblk1, mblk2;
    mblk1 = macroblk1[row];
    mblk2 = macroblk2[row];
    sub.abs.ac16( accum, mblk1, mblk2);
} /* row loop */
The C/C++ compiler will now generate a MOVE instruction to move the data from the generic C data type to the custom C data type "myRegFile128" and allocate registers for the new register file.

8. Create a high-data-bandwidth load/store interface

To feed data to the high-bandwidth custom register file (and the corresponding SIMD functional unit), the processor needs high-bandwidth load/store capability. On a configurable processor, designers can specify custom load and store instructions that operate directly on the custom register file, and the compiler automatically generates these instructions for the high-bandwidth load/store interface.
The updated processor data path is shown in Figure 6. The hardware generation tool generates the high-bandwidth custom register file, a load/store interface to the data memory, and the corresponding feed-forward, control, and bypass logic. It also generates the hardware needed to move data between the base register file and the user-defined register file.

Figure 6 Data path after inserting the custom register file and the high-bandwidth load/store interface

9. Load or store operations while updating the address

The Xtensa configurable processor allows another very useful extension: an instruction that performs an address update and a data load/store at the same time. Such a load/store instruction completes the following two actions concurrently:

Load A1 ← Memory(Addr1); Addr1 = Addr1 + IndexUpdate

This supports "back-to-back" load/store operations without separate instructions to update the address.

10. Establish first-in-first-out (FIFO) interfaces and general input/output ports

Video and audio are streaming media, which require fast movement of data into and out of the processor. Traditional processors are constrained by the system bus interface and must load and store all data before operating on it. To support streaming input and output, the Xtensa configurable processor lets designers define first-in-first-out (FIFO) interfaces and general-purpose input/output (GPIO) ports that read and write the processor data path directly. FIFO and GPIO ports can be of any width up to 1024 bits, and a processor can include up to 1024 of each. Because these high-bandwidth interfaces connect directly to the data path, data can be read, processed, and written by the processor core at high throughput, which is essential for multimedia and networking applications.

The data path with FIFO interfaces and GPIO ports is shown in Figure 7. The processor can, for example, pop data from two input FIFOs (checking that neither queue is empty), compute a complex operation such as a multiply-accumulate-round, and push the result into an output FIFO (checking that the queue is not full); a C sketch of this pattern follows Figure 7. The hardware generation tool generates the corresponding interface signals, control logic, and bypass logic, and produces complete RTL for the configured processor; the software generation tool produces the complete compiler toolchain and a cycle-accurate instruction set simulator (ISS) that models the new instructions. Note that this ability for designers to define FIFO interfaces and GPIO ports is unique to Xtensa configurable processors.

Figure 7 High-speed communication using customized FIFO interfaces and general-purpose input/output (I/O) ports
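The streaming pattern just described can be sketched in C as follows; the queue identifiers and the fifo_pop, fifo_push, and mul_acc_round functions are hypothetical placeholders standing in for designer-defined FIFO interfaces and instructions, not an actual Xtensa API:
/* Hypothetical placeholders for designer-defined queues and operations. */
extern int  fifo_pop(int queue_id);              /* stalls while the queue is empty */
extern void fifo_push(int queue_id, int value);  /* stalls while the queue is full  */
extern int  mul_acc_round(int a, int b);         /* designer-defined complex op     */
enum { IN_FIFO_A, IN_FIFO_B, OUT_FIFO };

void stream_process(void)
{
    for (;;) {
        int a = fifo_pop(IN_FIFO_A);   /* take data from the two input FIFOs   */
        int b = fifo_pop(IN_FIFO_B);
        int r = mul_acc_round(a, b);   /* e.g. a multiply-accumulate-round step */
        fifo_push(OUT_FIFO, r);        /* push the result into the output FIFO  */
    }
}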
11. Accelerate the execution of complex control-intensive code

The amount and complexity of control code in multimedia applications have grown significantly, to the point where it consumes roughly as much computing time as the data-intensive operations. For example, a key part of an H.264 Main-profile decoder is the CABAC (context-adaptive binary arithmetic coding) algorithm, which is almost entirely a control-flow decision tree interleaved with data calculations and comparisons. Because of its high complexity, most traditional designs use a dedicated RTL accelerator for CABAC. On a configurable processor, however, CABAC can be implemented more efficiently by adding a set of dedicated instructions. One benefit of this approach is that it avoids constantly exchanging data between the processor and an RTL accelerator; another benefit of instruction extension is that, because the dedicated hardware sits inside the processor, the hardware/software interface can be partitioned more cleanly.

12. Summary

Modern configurable and extensible processors are ideal for building custom video and audio engines. Tensilica provides corresponding video and audio IP as SOC modules, including the HiFi 2 audio engine and the Diamond Standard 38x VDO multi-standard, multi-resolution video engines, together with the matching software codecs, which are essential. The HiFi 2 audio engine with the relevant software supports most popular audio codecs, such as MP3, AAC, and WMA. Similarly, the Diamond 38x VDO video acceleration engine with the corresponding encoder and decoder software implements H.264 (including the Baseline and Main profiles), MPEG-4 (SP and ASP), MPEG-2, VC-1/WM9, and other standards. These video engines cover resolutions from QCIF to CIF and SD, with low power consumption and small area.