Leverage configurable processors to create multi-standard, multi-resolution video engines-EEWORLD

With the rapid development of consumer electronics, especially mobile phones, PDAs and portable media players (PMPs), the demands placed on the suppliers of the silicon inside these devices have grown considerably. It is no longer enough for these suppliers to design ICs that handle only one or two multimedia codecs or wireless standards. Consumers expect their devices to play media encoded with a variety of standards and downloaded over a variety of wireless standards. A new, more flexible approach is therefore needed to keep pace with new media standards. This article focuses on the challenges and opportunities facing video decoder and encoder engines.

Traditional video engine design method based on RTL

The previous generation of video ASICs was designed to decode and encode MPEG-2, since this is the standard used by DVDs. Some of them also supported MPEG-1 and could play VCDs. In most cases, the implementation strategy for such a single-purpose application was to design customized MPEG-2 decoders and encoders in RTL (register transfer level) logic. Figure 1 below shows a typical MPEG-2 video ASIC structure, with the RTL functional blocks that make up the video subsystem, the main controller, and the on-chip memory.

Figure 1: Typical MPEG-2 video ASIC architecture

As the market changes, today's video ASICs must support multiple video standards and multiple resolutions. The traditional RTL approach is no longer effective, for the following reasons:

As the number of standards increases, the number and complexity of RTL functional blocks also increase;

Implementing a new video standard, upgrading an existing one, or fixing a bug each requires a silicon re-spin;

Video codecs, especially encoders, typically see significant improvements (bit rate, performance) in the 4-5 years after the first silicon implementation. In an all-RTL design, adopting any of these improvements also requires a silicon re-spin.

Use processors in the video engine instead of fixed RTL

So, is there another solution? A programmable processor is the best answer because it addresses all the problems mentioned above: (1) a codec is easy to port to a processor; (2) adopting a new video standard, upgrading an existing codec, or fixing a bug can all be done in software; (3) improvements in the implementation of a video codec can be rolled out as software upgrades.

However, traditional processors run into a performance bottleneck: they are suited to general-purpose code, not to video engines. Embedded DSPs are not designed specifically for video either; they provide the hardware functional units, instructions, and interfaces required for general DSP applications. Performing video encoding and decoding on traditional RISC and DSP processors therefore means running them at very high clock frequencies, with a lot of memory and a lot of power, which is clearly not feasible in portable devices.

This is easily seen from a simple count of the calculations required in a video kernel. The Sum of Absolute Differences (SAD) is an important step in the motion estimation performed by most video encoding algorithms. The purpose of the SAD operation is to find how a macroblock has moved between two consecutive video frames. It does this by summing the absolute differences between each pair of corresponding pixel values in the two macroblocks.

The following C code shows a simple implementation of the SAD operation:
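A minimal sketch of such a kernel is given here, assuming 8-bit pixels stored row by row; the function name sad and the parameter names macroblk1, macroblk2, numrows and numcols are illustrative choices rather than names confirmed by the article.

```c
#include <stdlib.h>

/* Sum of absolute differences between two macroblocks.
 * Assumption: 8-bit pixels, row-major layout, numcols pixels per row. */
unsigned sad(const unsigned char *macroblk1, const unsigned char *macroblk2,
             int numrows, int numcols)
{
    unsigned sum = 0;
    for (int row = 0; row < numrows; row++) {
        for (int col = 0; col < numcols; col++) {
            int idx = row * numcols + col;
            int diff = macroblk1[idx] - macroblk2[idx]; /* subtraction */
            sum += (unsigned)abs(diff);                 /* absolute value, then accumulate */
        }
    }
    return sum;
}
```

Each pass through the inner loop performs the three basic operations discussed in this section: one subtraction, one absolute value and one addition.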

Figure 2 shows the basic calculation steps in the SAD operation. As shown in the figure, the main calculations performed are subtraction, absolute value calculation and result accumulation.

Figure 2: The main calculations performed in the Sum of Absolute Differences (SAD) kernel

Calculating the SAD of two 16x16 macroblocks on a RISC processor requires 256 subtractions, 256 absolute values, and 256 additions, a total of 768 operations, not counting the loads and stores needed to move the data. Since this operation must be performed for every macroblock in every frame, it is clearly computationally expensive, and the cost grows as the resolution of the video frame increases.

In fact, on a mid-range general-purpose RISC processor with instructions such as multiply and multiply-accumulate, H.264 Baseline decoding at CIF resolution requires 250MHz, while H.264 Baseline encoding requires more than 1GHz. At that speed the processor core alone consumes nearly 500mW, not to mention the power consumed by the memory and the rest of the video system-on-chip. Clearly, such a processor cannot serve as an embedded multimedia processor in portable devices.
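To put the per-frame cost in numbers, the short sketch below counts the basic operations needed to evaluate a single SAD for every 16x16 macroblock of a frame. The helper name sad_ops_per_frame is hypothetical, and the figure is only a lower bound, since a real motion search evaluates many candidate positions per macroblock.

```c
/* Basic operations (subtract, absolute value, add) needed to compute
 * one SAD for every 16x16 macroblock in a frame.
 * Assumption: width and height are multiples of 16. */
unsigned long sad_ops_per_frame(unsigned width, unsigned height)
{
    unsigned long macroblocks = (unsigned long)(width / 16) * (height / 16);
    unsigned long ops_per_macroblock = 16UL * 16UL * 3UL; /* 768 */
    return macroblocks * ops_per_macroblock;
}
```

For a CIF frame (352x288 pixels, or 396 macroblocks) this gives 304,128 operations for just one SAD per macroblock per frame.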

Configurable processors solve the problem

How can the SAD calculation be made cheaper on a processor? One way is to create an instruction that performs the "subtraction, absolute value, addition" sequence in a single step. This reduces the number of operations required for a 16x16 macroblock from 768 to 256. In addition, since a functional unit that fuses such simple operations can generally be optimized to execute in one cycle, the cycle count also drops to 256.

But how can this "subtraction, absolute value, addition" instruction be added to a processor?

That’s where configurable processors come in. Configurable processors are embedded processors whose designers can choose from a menu of configuration options and extend the processor's functionality by adding application-specific instructions, register files, and interfaces.

The following are some of the configurable and extensible features that current configurable processors have, which traditional fixed processors do not have:

Configurability, with a range of options to choose from:

Instructions that the designer does or does not want, including: 16x16 multiply or multiply-accumulate, funnel shift, floating-point instructions, etc.

Zero-overhead loops, a 5- or 7-stage pipeline, the number of local data load/store units, and other features;

Whether memory protection, memory translation, or a full memory management unit (MMU) is needed;

Whether a system bus interface is required;

The width of the system bus and local memory interface;

The amount and size of local memory;

The number, type and priority level of interrupts.

Extensibility, the following designer-defined components can be freely added:

Registers and register files;

Multi-cycle, arbitrarily complex functional units;

SIMD functional unit;

Convert a basic processor into a multi-issue processor;

Custom interfaces that can read and write directly from the data path, such as ports or pins similar to GPIO (general purpose IO) on the processor core, and external FIFOs that can be used to connect to other logic or processor cores.

The advantage of configurability is that you can build a right-sized processor by selecting only the options your application requires. The advantage of extensibility is that designers can tailor the processor to their video application by creating instructions, register files, functional units and interfaces that speed up that application. Note, however, that only today's advanced configurable processors offer designer-defined extensibility.

Building a video engine using configurable processors

Create functional units that can perform multiple operations

This step returns to the SAD calculation and shows how to accelerate it.

On a configurable processor, adding such a fused operation is straightforward. A new instruction called "sub.abs.acc" (subtract, absolute value, accumulate) can be added to perform the subtraction, absolute value and addition in one operation, as shown in Figure 3.

Figure 3: New instruction that subtracts, takes the absolute value, and accumulates

The software tools provided with modern configurable processors (such as Tensilica's Xtensa processor) automatically update the whole tool chain, including the C/C++ compiler, assembler, debugger, and instruction set simulator (ISS). The C compiler then recognizes the new C intrinsic "sub.abs.acc" and schedules the corresponding instruction, and the debugger displays the internal signals used in the sub.abs.acc functional unit. Likewise, the assembler handles it as a new instruction, and the ISS simulates it with cycle accuracy.

Figure 4 is a simplified diagram of the data path after this new video-specific functional unit has been added. Note that the hardware generation tool automatically generates not only the functional unit logic, but also the forwarding paths, control logic and bypass logic that connect the new functional unit to the rest of the data path.

Figure 4: Simplified diagram of the data path after embedding the sub.abs.acc video-specific functional unit

Now, the C code using C intrinsic instructions to perform SAD calculation becomes:
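A sketch of how the kernel might look with the intrinsic is given here. Because the dotted name sub.abs.acc is not a legal C identifier, the sketch spells it sub_abs_acc and emulates it with an ordinary function; that spelling, its argument order and the name sad_fused are all assumptions, and on a real configurable processor the generated compiler would map the intrinsic to a single instruction.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the sub.abs.acc intrinsic:
 * returns acc + |a - b|. Emulated in plain C for illustration only. */
static inline unsigned sub_abs_acc(unsigned acc, unsigned char a, unsigned char b)
{
    return acc + (unsigned)abs((int)a - (int)b);
}

/* SAD kernel with one fused operation per pixel.
 * Assumption: 8-bit pixels, row-major layout. */
unsigned sad_fused(const unsigned char *macroblk1, const unsigned char *macroblk2,
                   int numrows, int numcols)
{
    unsigned sum = 0;
    for (int row = 0; row < numrows; row++) {
        for (int col = 0; col < numcols; col++) {
            int idx = row * numcols + col;
            sum = sub_abs_acc(sum, macroblk1[idx], macroblk2[idx]);
        }
    }
    return sum;
}
```

The loop structure is unchanged; only the body shrinks from three operations to one.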

As mentioned above, this reduces the number of operations for a 16x16 macroblock (that is, numrows = numcols = 16) from 768 to 256.

Creating SIMD functional units

Further improvement is possible. In this kernel, the inner loop walks across the entire macroblock and performs the same calculation on every pixel. We can therefore create a SIMD (single instruction, multiple data) functional unit and a corresponding instruction, sub.abs.acc16, that performs the "subtraction, absolute value and addition" operations on 16 pixels at the same time, as shown in Figure 5.

Figure 5: SIMD performs "subtraction, absolute value and addition" operations on 16 pixels simultaneously

The corresponding C intrinsic instruction is sub.abs.acc16, which is used to rewrite the C code in the SAD operation:
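A sketch of the rewritten kernel is given here, under the assumption that each macroblock row holds exactly 16 8-bit pixels. The intrinsic is spelled sub_abs_acc16 (a hypothetical C-legal spelling) and emulated with a plain loop, and it takes pointers to the 16 pixels, which deliberately sidesteps the question of how those 128 bits reach the functional unit.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the sub.abs.acc16 intrinsic: adds the absolute
 * differences of 16 pixel pairs to acc. The loop only emulates what the
 * SIMD functional unit would do in one instruction. */
static inline unsigned sub_abs_acc16(unsigned acc, const unsigned char *a,
                                     const unsigned char *b)
{
    for (int i = 0; i < 16; i++)
        acc += (unsigned)abs((int)a[i] - (int)b[i]);
    return acc;
}

/* SAD kernel with one SIMD operation per 16-pixel row. */
unsigned sad_simd(const unsigned char *macroblk1, const unsigned char *macroblk2,
                  int numrows)
{
    unsigned sum = 0;
    for (int row = 0; row < numrows; row++)
        sum = sub_abs_acc16(sum, &macroblk1[row * 16], &macroblk2[row * 16]);
    return sum;
}
```

For a 16x16 macroblock the intrinsic is issued once per row, 16 times in total.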

With this instruction, the number of operations in the SAD calculation drops from 768 to only 16.

However, the C code above is not precise. It glosses over one detail: the sub.abs.acc16 instruction needs a 128-bit input from each of the two macroblocks. This requires support for two features, a 128-bit register file and a wide load/store interface, which are discussed in the following sections.

Creating a Custom Register File

It is very simple to create a custom register file of any size in a configurable processor. For example, defining a 128-bit register file named "myRegFile128" with 4 registers also creates a corresponding new C data type, which can be used to declare variables in C/C++ code. In addition, the software tools provide "move" operations that convert the ordinary C data types into this new custom data type.

Therefore, the correct C code for the SAD operation using the sub.abs.acc16 intrinsic and the new register file is:

Next, the C/C++ compiler generates move instructions to convert the data from the normal C data types to the custom C data type "myRegFile128" and performs register allocation for the new register file.
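A sketch of what this could look like is given here. The type name myRegFile128 comes from the text, but its emulation as a 16-byte struct, the helper load128, the C-legal spelling sub_abs_acc16 and the function name sad_regfile are assumptions made for illustration; on the real processor the type and the moves would be provided by the generated compiler.

```c
#include <stdlib.h>
#include <string.h>

/* Emulation of the custom 128-bit data type: 16 lanes of 8-bit pixels. */
typedef struct { unsigned char lane[16]; } myRegFile128;

/* Hypothetical helper standing in for a move/load into the custom register file. */
static inline myRegFile128 load128(const unsigned char *p)
{
    myRegFile128 r;
    memcpy(r.lane, p, sizeof r.lane);
    return r;
}

/* Hypothetical stand-in for sub.abs.acc16 taking two 128-bit operands. */
static inline unsigned sub_abs_acc16(unsigned acc, myRegFile128 a, myRegFile128 b)
{
    for (int i = 0; i < 16; i++)
        acc += (unsigned)abs((int)a.lane[i] - (int)b.lane[i]);
    return acc;
}

/* SAD kernel for macroblocks with 16 pixels per row. */
unsigned sad_regfile(const unsigned char *macroblk1, const unsigned char *macroblk2,
                     int numrows)
{
    unsigned sum = 0;
    for (int row = 0; row < numrows; row++) {
        myRegFile128 a = load128(macroblk1 + row * 16);
        myRegFile128 b = load128(macroblk2 + row * 16);
        sum = sub_abs_acc16(sum, a, b);
    }
    return sum;
}
```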

Create new load/store interface

Reading and writing such a wide register file (and feeding the corresponding SIMD functional unit) requires wide loads and stores. Here again, the designer of a configurable processor can define custom load and store instructions that move data directly into and out of the custom register file. The compiler then automatically generates the load/store instructions for this interface to bring data from memory into the register file.

Figure 6 is an updated diagram of the processor data path. As shown in the figure, the hardware generation tools automatically generate the wide custom register file, the load/store interface, and all related forwarding, control and bypass logic. Note that these tools also generate the hardware logic that transfers data from the base register file to the user-defined register file.

Figure 6

Update address when loading or storing

When creating instructions to do custom loads or stores, it is often useful to be able to update the address while loading or storing. This new load/store instruction can do both:

Load A1←Memory (Address 1); Address 1 = Address 1 + Index Update

This ability to simultaneously load/store data and update the address allows the processor to perform back-to-back loads/stores without the need for an intervening instruction to perform the address update.
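The behaviour of such a load-with-update instruction can be modelled in C as below. The names myRegFile128 (emulated as a 16-byte struct) and load128_update are illustrative assumptions; the point is only that the loaded value and the advanced address come out of the same operation.

```c
#include <stddef.h>
#include <string.h>

/* Emulation of a 128-bit custom register: 16 lanes of 8 bits. */
typedef struct { unsigned char lane[16]; } myRegFile128;

/* Hypothetical model of "load A1 <- Memory(Address1); Address1 += IndexUpdate":
 * loads 16 bytes from *addr and then advances *addr by index_update bytes. */
static inline myRegFile128 load128_update(const unsigned char **addr,
                                          ptrdiff_t index_update)
{
    myRegFile128 r;
    memcpy(r.lane, *addr, sizeof r.lane);
    *addr += index_update;
    return r;
}
```

Calling load128_update(&p, 16) twice in a row reads two consecutive 16-byte rows without any separate pointer arithmetic between the loads.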

Create FIFO interface and general IO port

Another important feature of configurable processors is the ability to define FIFO interfaces and general purpose IO (GPIO) ports that read and write data directly from the data path. These FIFO interfaces and GPIO ports can be made as wide as needed (1024 bits in this example, for both the FIFOs and the GPIO ports), and their number is not restricted. Such wide interfaces, attached directly to the data path, provide the high data throughput that multimedia and networking applications need to read, process, and write data through the processor core.

Figure 7 shows the data path with such a FIFO interface and GPIO port. With this approach, we can create an instruction that pops data from two input FIFOs (as long as they are not empty), performs a complex calculation (such as a multiply-add), and pushes the result to an output FIFO (as long as that FIFO is not full). The hardware generation tool again generates the appropriate interface signals, control logic and bypass logic, along with the complete RTL for the configured processor, while the software generation tool automatically generates a complete set of compiler tools and a cycle-accurate ISS that models the new instruction.
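The behaviour of such a FIFO-driven instruction can be modelled in software as follows. This is a simplified sketch under stated assumptions: fifo_t, FIFO_DEPTH, the helper functions and fifo_mul_add are hypothetical names, the FIFOs carry 32-bit values rather than wide data, and a false return value stands in for the stall that the hardware would perform.

```c
#include <stdbool.h>
#include <stdint.h>

#define FIFO_DEPTH 4 /* assumed depth, for illustration only */

typedef struct {
    int32_t  data[FIFO_DEPTH];
    unsigned head;  /* index of the oldest entry */
    unsigned count; /* number of valid entries */
} fifo_t;

static bool fifo_empty(const fifo_t *f) { return f->count == 0; }
static bool fifo_full(const fifo_t *f)  { return f->count == FIFO_DEPTH; }

/* Caller must check fifo_full() first. */
static void fifo_push(fifo_t *f, int32_t v)
{
    f->data[(f->head + f->count) % FIFO_DEPTH] = v;
    f->count++;
}

/* Caller must check fifo_empty() first. */
static int32_t fifo_pop(fifo_t *f)
{
    int32_t v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return v;
}

/* Model of one instruction: pop one value from each input FIFO, multiply
 * them, add a constant, and push the result to the output FIFO.
 * Returns false (the instruction would stall) if an input is empty
 * or the output is full; in that case no FIFO is modified. */
bool fifo_mul_add(fifo_t *in_a, fifo_t *in_b, fifo_t *out, int32_t addend)
{
    if (fifo_empty(in_a) || fifo_empty(in_b) || fifo_full(out))
        return false;
    int32_t a = fifo_pop(in_a);
    int32_t b = fifo_pop(in_b);
    fifo_push(out, a * b + addend);
    return true;
}
```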

Figure 7: High-speed communication via FIFO interface and GPIO port

Speeding up complex control code

The amount and complexity of control code in multimedia applications has grown to the point where it consumes almost as much computation time and effort as the data-intensive parts of the code. One example is CABAC (Context-Adaptive Binary Arithmetic Coding), a key part of the H.264 Main Profile decoder: the algorithm is essentially a control-flow decision tree interleaved with various complex data calculations and comparisons.

Because CABAC is so complex, many solutions based on traditional processors give up on running it in software and offload it to a dedicated RTL accelerator. On a configurable processor, however, CABAC can be implemented as a set of instruction extensions. Its performance is then comparable to that of an RTL solution, and it has a further advantage over an RTL accelerator: the data never has to leave or re-enter the processor. This illustrates another benefit of processor instruction extensions: because the application-specific hardware sits inside the processor, the work can be partitioned between hardware and software more cleanly.

Summary

Modern configurable and extensible processors are an excellent choice for building video and audio engines, and they have already been widely adopted by semiconductor and ASIC suppliers. Video and audio IP is also available as ready-made modules for embedding in SoCs. For example, Tensilica and its partners supply a complete set of video and audio IP products, including the Xtensa HiFi 2 audio engine and a series of multi-standard, multi-resolution video solutions, together with encoder and decoder software (codecs) for H. 26? (baseline, main and advanced profiles), MPEG-4 (SP and ASP), MPEG-2, VC-1/WM9 and other standards. These video solutions cover QCIF, CIF and SD resolutions, with HD resolution as the target, and are designed from the outset for low power consumption and small size.

As consumer demand expands the technical specifications of ASICs in consumer devices, more and more applications will be implemented using configurable processors. With the automated design flow brought by configurable processors, new feature support will be as simple as software upgrades, and design and verification time will be greatly reduced.
