With the rapid development of consumer electronics, especially mobile phones, PDAs and portable media players (PMPs), the demands placed on the silicon suppliers for these devices have grown considerably. It is no longer enough for these suppliers to design ICs that support only one or two multimedia codecs or wireless standards. Consumers want their devices to play media encoded in a variety of formats and downloaded over a variety of wireless standards. A new, more flexible approach is therefore needed to adapt to new media standards. This article focuses on the challenges and opportunities facing video decoder and encoder engines.
Traditional video engine design method based on RTL
The previous generation of video ASICs was designed to decode and encode MPEG-2, the standard used by DVDs. Some of them also support MPEG-1 and can play VCDs. In most cases, the implementation strategy for this single application was to design customized MPEG-2 decoders and encoders in RTL (register transfer level) logic. Figure 1 below shows a typical MPEG-2 video ASIC structure, with the RTL functional blocks that make up the video subsystem, the main controller, and the on-chip memory.
Figure 1: Typical MPEG-2 video ASIC architecture
As the market changes, today's video ASICs must support multiple video standards and multiple resolutions. The traditional RTL approach is no longer effective, for the following reasons:
As the number of standards increases, the number and complexity of RTL function blocks also increase;
Implementing a new video standard, upgrading an existing standard, or fixing a bug each requires a silicon re-spin;
Video codecs, especially encoders, typically see significant improvements (in bit rate and performance) during the 4-5 years after the first silicon implementation. With an all-RTL approach, delivering these improvements also requires a silicon re-spin.
Use processors in the video engine instead of fixed RTL
So is there another solution? Using a programmable processor is the best answer because it addresses all of the problems listed above: (1) codecs can readily be ported to the processor, so supporting more standards does not mean adding more RTL blocks; (2) adopting a new video standard, upgrading an existing codec, or fixing a bug can all be done in software; (3) improvements in video codec implementations can be applied through software upgrades.
However, because of their performance bottlenecks, traditional processors are suitable only for general-purpose code, not for video engines. Embedded DSPs are not designed specifically for video either; they have the hardware functional units, instructions, and interfaces required for general DSP applications. Performing video encoding and decoding on traditional RISC and DSP processors therefore means running them at very high clock frequencies, with a lot of memory and high power consumption, which is clearly not feasible in portable devices.
This is easily seen from a simple analysis of the number of calculations required in a video kernel. The Sum of Absolute Differences (SAD) is an important step in motion estimation for most video encoding algorithms. The purpose of the SAD operation is to find the motion of macroblocks between two consecutive video frames. It does this by calculating the sum of the absolute differences between each pair of corresponding pixel values in the two macroblocks.
The following C code shows a simple implementation of the SAD operation:
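A minimal sketch of such an implementation, assuming 8-bit pixels stored row by row. The function name sad, the stride parameter and the pointer names are illustrative choices, not something fixed by the article:

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between two macroblocks.
 * Assumptions: 8-bit pixels, row-major layout, and both blocks
 * use the same stride (distance in pixels between rows). */
unsigned sad(const uint8_t *cur, const uint8_t *ref,
             int numrows, int numcols, int stride)
{
    unsigned sum = 0;
    for (int row = 0; row < numrows; row++) {
        for (int col = 0; col < numcols; col++) {
            /* subtraction */
            int diff = cur[row * stride + col] - ref[row * stride + col];
            /* absolute value, then accumulation */
            sum += (unsigned)abs(diff);
        }
    }
    return sum;
}
```

For a 16x16 macroblock the inner statement pair runs 256 times, which is where the subtraction, absolute value and accumulation counts discussed next come from.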
Figure 2 shows the basic calculation steps in the SAD operation. As shown in the figure, the main calculations performed are subtraction, absolute value calculation and result accumulation.
Figure 2: The main calculations performed in the Sum of Absolute Differences (SAD) kernel
Calculating the SAD of two 16x16 macroblocks on a RISC processor requires 256 subtractions, 256 absolute values, and 256 additions, for a total of 768 calculations, not including the loads and stores required to move the data. Since this operation must be performed for all macroblocks in each frame, it is clearly computationally expensive, and it becomes more so as the resolution of the video frame increases.
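To make the scaling with resolution concrete, the count can be worked out per frame. The sketch below is only an illustration: the function name and the candidates parameter (the number of SAD evaluations per macroblock in the motion search) are assumptions introduced here, and loads and stores are ignored, as in the text.

```c
#include <stdint.h>

/* Arithmetic operations for one 16x16 SAD on a plain RISC:
 * 256 subtractions + 256 absolute values + 256 additions = 768. */
enum { OPS_PER_SAD = 3 * 16 * 16 };

/* Operations per frame, assuming `candidates` SAD evaluations per
 * macroblock (a hypothetical search-size parameter) and frame
 * dimensions that are multiples of 16. */
uint64_t sad_ops_per_frame(unsigned width, unsigned height,
                           unsigned candidates)
{
    uint64_t macroblocks = (uint64_t)(width / 16) * (height / 16);
    return macroblocks * candidates * OPS_PER_SAD;
}
```

A CIF frame (352x288) contains 396 macroblocks, so even a single SAD per macroblock costs 304,128 operations per frame. A real motion search evaluates many candidate positions per macroblock, so the actual load is a multiple of that figure.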
In fact, on a mid-range general-purpose RISC processor with instructions such as multiply and multiply-accumulate, H.264 Baseline decoding at CIF resolution requires 250 MHz, while H.264 Baseline encoding requires more than 1 GHz. This means that the processor core alone consumes nearly 500 mW, not counting the power consumed by the memory and the rest of the video system-on-chip. Clearly, such a processor cannot be used as an embedded multimedia processor in portable devices.
Configurable processors solve the problem
How can the SAD calculation be performed efficiently in a processor? One way is to create an instruction that performs the "subtraction-absolute value-addition" calculation in a single step. This reduces the number of calculations required for a 16x16 macroblock from 768 to 256. In addition, since a functional unit that performs such a simple combined operation can generally be optimized to execute in one cycle, the cycle count also drops to 256.
But how can such a "subtraction-absolute value-addition" instruction be implemented?
That's where configurable processors come in. Configurable processors are embedded processors that let designers choose from a menu of configuration options and extend processor functionality by adding application-specific instructions, register files, and interfaces.
The following are some of the configurable and extensible features that current configurable processors have, which traditional fixed processors do not have:
Configurability, with a range of options to choose from:
Instructions that the designer does or does not want, including 16x16 multiply or multiply-accumulate, funnel shift, floating-point instructions, etc.;
Zero-overhead loops, 5- or 7-stage pipelines, the number of local data load/store units, and other features;
Whether memory protection, memory translation, or a full memory management unit (MMU) is needed;
Whether a system bus interface is required;
The width of the system bus and local memory interface;
The amount and size of local memory;
The number, type and level of interrupts.
Extensibility, the following designer-defined components can be freely added:
Registers and register files;
Multi-cycle, arbitrarily complex functional units;
SIMD functional unit;
Multi-issue capability, turning a basic processor into a multi-issue processor;
Custom interfaces that can read and write directly from the data path, such as ports or pins similar to GPIO (general purpose IO) on the processor core, and external FIFOs that can be used to connect to other logic or processor cores.
The advantage of configurability is that you can build a right-sized processor by selecting only the options your application requires, while the advantage of extensibility is that designers can tailor the processor to their video application by creating instructions, register files, functional units and interfaces that speed up the application. Note, however, that only today's advanced configurable processors offer this kind of designer-defined extensibility.
Building a video engine using configurable processors
Create functional units that can perform multiple operations
This step returns to the SAD calculation discussed above and shows how to accelerate it.
For a configurable processor, adding such a combined operation is straightforward. A new instruction called "sub.abs.acc" (subtraction-absolute value-addition) can be added to perform the subtraction, absolute value and addition in a single operation, as shown in Figure 3.
Figure 3: New instruction that subtracts, takes the absolute value, and accumulates
The software tools provided with modern configurable processors (such as Tensilica's Xtensa processor) automatically update the entire tool chain, including the C/C++ compiler, assembler, debugger and instruction set simulator (ISS). The C compiler then recognizes the new C intrinsic "sub.abs.acc" and schedules the corresponding instruction, and the debugger displays the internal signals used in the sub.abs.acc functional unit. Likewise, the assembler handles it as a new instruction, and the ISS simulates it cycle-accurately.
Figure 4 is a simplified diagram of the data path after this new video-specific functional unit has been added. Note that the hardware generation tool not only generates the functional unit logic automatically, but also inserts the forwarding paths, control logic and bypass logic that connect the new functional unit to the rest of the data path.
Figure 4: Simplified diagram of the data path after embedding the sub.abs.acc video-specific functional unit
Now, the C code using C intrinsic instructions to perform SAD calculation becomes:
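A sketch of what this might look like is shown here. Because C identifiers cannot contain dots, the intrinsic is written as sub_abs_acc, and it is emulated with an ordinary C function so that the example compiles on any machine. Both the spelling and the emulation are assumptions; on the configured processor the generated compiler would map the call to the single new instruction.

```c
#include <stdint.h>

/* Plain-C stand-in for the sub.abs.acc intrinsic: returns acc + |a - b|.
 * Hypothetical emulation; on the configured processor this would be
 * one instruction rather than a function. */
static inline unsigned sub_abs_acc(unsigned acc, uint8_t a, uint8_t b)
{
    return acc + (a > b ? (unsigned)(a - b) : (unsigned)(b - a));
}

/* SAD over a numrows x numcols block stored contiguously, row by row. */
unsigned sad_fused(const uint8_t *cur, const uint8_t *ref,
                   int numrows, int numcols)
{
    unsigned sum = 0;
    for (int row = 0; row < numrows; row++) {
        for (int col = 0; col < numcols; col++) {
            /* one fused operation per pixel pair */
            sum = sub_abs_acc(sum, cur[row * numcols + col],
                                   ref[row * numcols + col]);
        }
    }
    return sum;
}
```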
As mentioned above, this reduces the number of calculations for a 16x16 macroblock (i.e. numrows=numcols=16) to 256.
Creating SIMD functional units
Further improvement is possible. In this kernel, the inner loop traverses the entire macroblock and performs the same calculation on every pixel. We can therefore create a SIMD (single instruction, multiple data) functional unit and a corresponding instruction, sub.abs.acc16, that performs the "subtraction, absolute value and addition" operations on 16 pixels at the same time, as shown in Figure 5.
Figure 5: SIMD performs "subtraction, absolute value and addition" operations on 16 pixels simultaneously
The corresponding C intrinsic instruction is sub.abs.acc16, which is used to rewrite the C code in the SAD operation:
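A sketch of the rewritten loop follows, again with the intrinsic spelled sub_abs_acc16 and emulated in plain C. These names, and the use of raw pointers as operands, are assumptions made for illustration; the sketch also assumes numcols is a multiple of 16.

```c
#include <stdint.h>

/* Plain-C stand-in for the sub.abs.acc16 intrinsic: adds the absolute
 * differences of 16 pixel pairs to acc. Hypothetical emulation; the
 * SIMD functional unit would process all 16 lanes at once. */
static unsigned sub_abs_acc16(unsigned acc,
                              const uint8_t *a, const uint8_t *b)
{
    for (int lane = 0; lane < 16; lane++)
        acc += a[lane] > b[lane] ? (unsigned)(a[lane] - b[lane])
                                 : (unsigned)(b[lane] - a[lane]);
    return acc;
}

/* SAD over a contiguous block; assumes numcols is a multiple of 16. */
unsigned sad_simd(const uint8_t *cur, const uint8_t *ref,
                  int numrows, int numcols)
{
    unsigned sum = 0;
    for (int row = 0; row < numrows; row++) {
        for (int col = 0; col < numcols; col += 16) {
            /* 16 pixel pairs per call */
            sum = sub_abs_acc16(sum, &cur[row * numcols + col],
                                     &ref[row * numcols + col]);
        }
    }
    return sum;
}
```

With numrows=numcols=16, the inner loop body executes once per row, so the intrinsic is issued 16 times for the whole macroblock.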
This reduces the number of operations needed for the SAD of a 16x16 macroblock from 768 to only 16.
However, this C code is not quite accurate. It glosses over one detail: the sub.abs.acc16 instruction needs a 128-bit input from each of the two macroblocks. This requires support for two features, a 128-bit register file and a wide load/store interface, which are discussed in the following sections.
Creating a Custom Register File
It is very simple to create a custom register file of any size in a configurable processor. For example, defining a 128-bit register file named "myRegFile128" with 4 registers also creates a corresponding new C data type that can be used to declare variables in C/C++ code. In addition, the software tools provide "move" operations to convert the standard C data types into this new custom data type.
Therefore, the correct C code for the SAD operation using the sub.abs.acc16 intrinsic and the new register file is:
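A sketch of such code is given here. The type myRegFile128 is emulated as a 16-byte struct, and load128 is a hypothetical helper standing in for the move and load operations that the generated tools would supply; only the type name comes from the article.

```c
#include <stdint.h>
#include <string.h>

/* Emulation of the custom 128-bit data type: 16 lanes of 8-bit pixels. */
typedef struct { uint8_t lane[16]; } myRegFile128;

/* Hypothetical helper: fill a 128-bit value from 16 consecutive bytes. */
static myRegFile128 load128(const uint8_t *p)
{
    myRegFile128 r;
    memcpy(r.lane, p, sizeof r.lane);
    return r;
}

/* Plain-C stand-in for sub.abs.acc16 operating on 128-bit operands. */
static unsigned sub_abs_acc16(unsigned acc, myRegFile128 a, myRegFile128 b)
{
    for (int i = 0; i < 16; i++)
        acc += a.lane[i] > b.lane[i] ? (unsigned)(a.lane[i] - b.lane[i])
                                     : (unsigned)(b.lane[i] - a.lane[i]);
    return acc;
}

/* SAD over a contiguous block; assumes numcols is a multiple of 16. */
unsigned sad_regfile(const uint8_t *cur, const uint8_t *ref,
                     int numrows, int numcols)
{
    unsigned sum = 0;
    for (int row = 0; row < numrows; row++) {
        for (int col = 0; col < numcols; col += 16) {
            myRegFile128 a = load128(&cur[row * numcols + col]);
            myRegFile128 b = load128(&ref[row * numcols + col]);
            sum = sub_abs_acc16(sum, a, b);
        }
    }
    return sum;
}
```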
Next, the C/C++ compiler generates move instructions to convert the data from the normal C data types to the custom C data type "myRegFile128" and performs register allocation for the new register file.
Create new load/store interface
Reading and writing data in such a wide register file (and feeding the corresponding SIMD functional unit) requires wide loads and stores. Here too, the configurable processor lets the designer define custom load and store instructions that move data directly to and from the custom register file. The compiler then automatically generates the load/store instructions for this interface to bring data from memory into the register file.
Figure 6 is an updated diagram of the processor data path. As shown in the figure, the hardware generation tools automatically generate the wide custom register file and load/store interface, along with all related forwarding, control and bypass logic. Note that these tools also generate the hardware logic that transfers data from the basic register file to the user-defined register file.
Figure 6: Updated processor data path with the custom register file and load/store interface
Update address when loading or storing
When creating instructions to do custom loads or stores, it is often useful to be able to update the address while loading or storing. This new load/store instruction can do both:
Load A1←Memory (Address 1); Address 1 = Address 1 + Index Update
This ability to simultaneously load/store data and update the address allows the processor to perform back-to-back loads/stores without the need for an intervening instruction to perform the address update.
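The effect on the SAD loop can be sketched as follows. The helper load128_update emulates a combined load and address update in plain C; its name, the emulated myRegFile128 type and the stride parameter are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Emulation of the custom 128-bit data type. */
typedef struct { uint8_t lane[16]; } myRegFile128;

/* Hypothetical load-with-update:
 * A1 <- Memory(Address 1); Address 1 = Address 1 + update. */
static myRegFile128 load128_update(const uint8_t **addr, ptrdiff_t update)
{
    myRegFile128 r;
    memcpy(r.lane, *addr, sizeof r.lane);  /* the load */
    *addr += update;                       /* the address update */
    return r;
}

/* Plain-C stand-in for sub.abs.acc16 on 128-bit operands. */
static unsigned sub_abs_acc16(unsigned acc, myRegFile128 a, myRegFile128 b)
{
    for (int i = 0; i < 16; i++)
        acc += a.lane[i] > b.lane[i] ? (unsigned)(a.lane[i] - b.lane[i])
                                     : (unsigned)(b.lane[i] - a.lane[i]);
    return acc;
}

/* SAD of one 16x16 macroblock; stride is the distance in bytes
 * between the starts of consecutive rows. */
unsigned sad16x16(const uint8_t *cur, const uint8_t *ref, ptrdiff_t stride)
{
    unsigned sum = 0;
    for (int row = 0; row < 16; row++) {
        /* no separate pointer arithmetic between the loads */
        myRegFile128 a = load128_update(&cur, stride);
        myRegFile128 b = load128_update(&ref, stride);
        sum = sub_abs_acc16(sum, a, b);
    }
    return sum;
}
```

The loop body contains only the two loads and the SIMD operation; stepping to the next row is folded into the loads themselves.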
Create FIFO interface and general IO port
Another important feature of configurable processors is the ability to define FIFO interfaces and general-purpose IO (GPIO) ports that read and write data directly from the data path. The width of these FIFO interfaces and GPIO ports can be chosen freely (1024 bits in this example), with no numerical restrictions (e.g., both the FIFO and GPIO ports can be 1024 bits wide). These wide interfaces, connected directly to the data path, provide the high data throughput that multimedia and networking applications need to read, process, and write data through the processor core.
Figure 7 shows the data path with such a FIFO interface and GPIO port. With this approach, we can create an instruction that reads from two input FIFOs (as long as they are not empty), performs a complex calculation (such as a multiply-add), and writes the result to an output FIFO (as long as it is not full). The hardware generation tool again generates the appropriate interface signals, control logic and bypass logic, along with the complete RTL for the configured processor, while the software generation tool automatically generates a complete set of compiler tools and a cycle-accurate ISS that models the new instruction.
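The behavior of such a FIFO instruction can be modeled in software. The sketch below is an assumption-laden illustration: the fifo_t type, its depth of 8 entries, the 32-bit data width and the function names are all hypothetical, and a return value of false stands in for the stall that the hardware would perform.

```c
#include <stdbool.h>
#include <stdint.h>

enum { FIFO_DEPTH = 8 };  /* hypothetical depth */

/* Simple ring-buffer model of a hardware FIFO. */
typedef struct {
    int32_t  data[FIFO_DEPTH];
    unsigned head;   /* index of the oldest entry */
    unsigned count;  /* number of valid entries */
} fifo_t;

static bool fifo_empty(const fifo_t *f) { return f->count == 0; }
static bool fifo_full(const fifo_t *f)  { return f->count == FIFO_DEPTH; }

static void fifo_push(fifo_t *f, int32_t v)
{
    f->data[(f->head + f->count) % FIFO_DEPTH] = v;
    f->count++;
}

static int32_t fifo_pop(fifo_t *f)
{
    int32_t v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return v;
}

/* Model of one issue of the instruction: pop one value from each input
 * FIFO, compute a * b + addend, push the result to the output FIFO.
 * Returns false without touching any FIFO when an input is empty or
 * the output is full (the hardware would stall instead). */
bool fifo_mul_add(fifo_t *in_a, fifo_t *in_b, fifo_t *out, int32_t addend)
{
    if (fifo_empty(in_a) || fifo_empty(in_b) || fifo_full(out))
        return false;
    int32_t a = fifo_pop(in_a);
    int32_t b = fifo_pop(in_b);
    fifo_push(out, a * b + addend);
    return true;
}
```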
Figure 7: High-speed communication via FIFO interface and GPIO port
Speeding up complex control code
The amount and complexity of control code in multimedia applications have increased to the point where it consumes almost as much computation time and effort as the data-intensive parts of the code. An example is the CABAC algorithm (Context-Adaptive Binary Arithmetic Coding), a key part of the H.264 Main Profile decoder: this algorithm is essentially a control flow decision tree combined with various complex data calculations and comparisons.
Because CABAC calculations are so complex, many traditional processor solutions give up on running CABAC in software and use a dedicated RTL accelerator instead. On a configurable processor, however, CABAC can be implemented as a set of instruction extensions. Its performance is then comparable to that of RTL solutions, and it has a further advantage over RTL accelerators: the data does not need to enter or leave the processor. This illustrates another benefit of processor instruction extensions: since the application-specific hardware sits inside the processor, hardware and software can be partitioned more cleanly.
Summary
Modern configurable and extensible processors are an excellent choice for creating video and audio engines, and have already been widely adopted by many semiconductor ASIC suppliers. Video and audio IP products are also available as embeddable SoC blocks. For example, Tensilica and its partners can supply a complete set of video and audio IP products, including the Xtensa HiFi 2 audio engine and a series of multi-standard, multi-resolution video solutions, together with encoder and decoder software (codecs) for H.26? (Baseline, Main and High profiles), MPEG-4 (SP and ASP), MPEG-2, VC-1/WM9 and other standards. These video solutions cover QCIF, CIF and SD resolutions, with HD resolution as the goal, and are designed from the outset for low power consumption and small size.
As consumer demand expands the technical specifications of ASICs in consumer devices, more and more applications will be implemented using configurable processors. With the automated design flow that configurable processors bring, supporting a new feature will be as simple as a software upgrade, and design and verification time will be greatly reduced.