As more and more mobile handheld terminals support video, the demand for network support for streaming media content and real-time communication is rising significantly. Although upgrading already-deployed 3G media gateways can add support for lower resolutions and frame rates, the limited processing power of those gateways means such upgrades cannot meet the needs of video as a mainstream application.
Supporting scalable, high-density video applications requires a significant increase in video processing capability. Multi-core digital signal processors (DSPs) not only provide the enhanced video processing performance to meet this demand, but also fully satisfy operators' requirements for scalability and low power consumption.
This paper introduces a new multi-core platform that achieves high-density video processing by optimizing inter-core communication, task management, and memory access. It also explains how the resulting scalable implementation supports multi-channel, multi-core HD video applications.
1 Introduction
The widespread deployment of 3G and 4G mobile networks around the world and the continuous emergence of wireless innovation hotspots have created critical demand for the data bandwidth used by handheld-terminal users. In addition to web/data applications, video has become another driving force behind the popularization of mobile data.
As more and more users turn to video applications, the network infrastructure must deliver significant performance improvements to support video content, as evidenced by the recent popularity of Apple's FaceTime video-calling application and similar apps.
Handheld terminals can now capture and play back video at higher resolutions and frame rates. Traditionally deployed 3G media gateways are designed to support high-density, multi-channel voice and low-resolution video, and often fail to meet user expectations for quality.
In addition, because handheld terminals can usually support only a few standards with limited parameter sets, due to constraints such as battery life and memory size, media gateways need to support more codecs and more conversion modes: transcoding (changing the standard), transsizing (changing the resolution), and transrating (changing the bit rate). For example, when a mobile phone user drives through an area at high speed, it is more efficient to let the network adapt to the bandwidth offered by the instantaneous link conditions, providing the appropriate compression, resolution, and bit rate, so that the video session is not interrupted and the handheld terminal does not waste bandwidth or battery power on scaling or editing.
To fully meet these needs, the video processing capability of high-density media gateways must increase significantly. Multi-core DSPs can deliver a scalable solution at lower operating cost, directly addressing the power consumption and footprint concerns that matter most to operators.
This article is organized as follows:
First, the challenges and resource requirements of processing high-resolution video are explained, along with how to implement video coding algorithms efficiently and scalably so that both low-resolution and high-resolution channels are supported.
Second, how hardware and software options can improve the efficiency of multi-core operations is discussed.
Finally, the article reviews the state of the art in multi-core DSPs and discusses a platform available to developers.
2 Challenges of HD Video in Infrastructure
Figure 1 depicts a video communication system built around a central-office network. A typical system should support multiple functions, including:
- Transcoding and Rate Adaptation for High-Density Media
- Audio transcoding related to video transcoding
- Large-scale multi-party video conferencing
- Processing of other media such as speech
Figure 1 Network-based video communication
Transcoding is a typical communications-infrastructure video application: a YUV-domain video stream is decoded from a compressed input stream and then re-encoded using a different standard (transcoding), a different bit rate (transrating), a different resolution (transsizing), or any combination of the above. Video content comes from a wide range of sources, from high-quality professional HD cameras to low-resolution smartphone recordings, and is consumed on everything from large HDTV screens to low-resolution handheld terminals. Video infrastructure must therefore meet a full range of requirements, including:
- Multiple encoding and decoding standards, such as DV, MPEG-2/4, H.264, and the future H.265.
- Multiple resolutions and frame rates, from 128×96-pixel Sub-Quarter Common Intermediate Format (SQCIF) and even lower, up to HD (1920×1080) and even UHD (4320p, 8K), at 10 to 60 frames per second.
- Input/output (I/O) bit rates for various encodings range from 48 Kbps for low-resolution, low-quality handheld video streams to professional-quality 50 Mbps (H.264 level 4.2) and higher. The bandwidth requirements for YUV domain video streams are very high, for example, a YUV 1080p60 video stream with a 4:2:0 color scheme requires about 1.5 Gbps of bandwidth.
Latency requirements vary by application: video conferencing and real-time gaming applications have very strict latency requirements of less than 100 milliseconds; video-on-demand applications can tolerate moderate latency (up to a few seconds), and non-real-time processing applications such as storage can tolerate longer delays.
The challenge facing infrastructure networks is how to deliver all content to all desired users while maintaining high utilization and efficiency of hardware resources. To illustrate, a single 1080i60 channel requires the same processing load as 164 channels of Quarter Common Intermediate Format (QCIF) at 15 fps (assuming load scales linearly with resolution and frame rate). The hardware that supports a single 1080i60 channel should therefore also be able to support 164 QCIF channels with the same efficiency and high utilization. Scalability at this level is a challenge.
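Under the stated linearity assumption, the 164:1 figure can be checked directly. A minimal sketch (the helper name is ours, for illustration only):

```python
# Illustrative check of the scaling claim: processing load is assumed
# proportional to pixel throughput (resolution x frame rate).
def pixel_rate(width, height, fps):
    """Pixels processed per second for one channel."""
    return width * height * fps

# 1080i60: 1920x1080 interlaced, 60 fields/s = 30 full frames/s
hd_rate = pixel_rate(1920, 1080, 30)
# QCIF at 15 fps
qcif_rate = pixel_rate(176, 144, 15)

ratio = hd_rate / qcif_rate
print(f"One 1080i60 channel ~= {ratio:.0f} QCIF@15fps channels")  # ~164
```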
In order to meet the high scalability requirements, a programmable hardware solution must be adopted. Some video applications require the processor to have very high bit rate signal input and output, so the processor-based solution must have ideal peripherals to support sufficient external interfaces. Such a processor must have enough processing power to handle real-time high-definition high-quality video, and it must also be equipped with sufficient local resources such as fast memory, internal bus and DMA support to make efficient use of the processor's processing power.
Single-core DSPs are highly flexible and can execute a wide range of algorithms efficiently; they can process voice, audio, and video as well as other functions. However, a single-core DSP lacks the computing power to process real-time video at arbitrary resolutions and is limited to lower resolutions such as SQCIF, QCIF, and Common Intermediate Format (CIF); moreover, its power consumption also keeps it out of high-density video processing systems.
New multi-core DSPs have very high processing capabilities and consume less power per operation than single-core DSPs. To determine whether multi-core processors can be an effective hardware solution for communication infrastructure, their interfaces, processing performance, memory requirements, and multi-core cooperation and synchronization mechanisms need to be verified for compliance with various use cases.
2.1 External I/O Interface
The bitstream for a typical transcoding application is packaged in IP packets. The bandwidth required for the transcoding application is related to the resolution and the available bandwidth of the network required by the user. The following are common bandwidth requirements for a single-channel consumer quality H.264 encoded video stream as a function of resolution:
- HD resolution, 720p or 1080i - 6 to 10 Mbps
- D1 resolution, 720×480, 30 frames per second (fps), or 720×576, 25 fps – 1 to 3 Mbps
- CIF resolution, 352×288, 30 frames/second – 300 to 768 Kbps
- QCIF resolution, 176×144, 15 frames/sec – 64 to 256 Kbps
The total external interface required for a transcoding application is the sum of the bandwidth required for the incoming media stream and the outgoing media stream. To support multiple HD resolution channels or a large number of lower resolution channels, at least one Serial Gigabit Media Independent Interface (SGMII) is required.
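As a rough sizing sketch of this sum (the channel count and per-stream rates below are illustrative values drawn from the consumer-quality list above):

```python
# Sketch: total external I/O for a transcoding application is the sum of
# input and output stream bandwidths (rates in Mbps, from the list above).
def transcode_io_mbps(in_rate_mbps, out_rate_mbps, channels):
    """Aggregate ingress + egress bandwidth for `channels` transcoded streams."""
    return (in_rate_mbps + out_rate_mbps) * channels

# e.g. 40 HD channels arriving at 8 Mbps, re-encoded and sent out at 6 Mbps
total = transcode_io_mbps(8, 6, 40)
print(f"{total} Mbps aggregate")  # 560 Mbps -- fits within one SGMII link
```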
Non-transcoding video applications involve encoding raw video to, or decoding it from, the YUV (or equivalent) domain. Raw video streams have high bit rates and are typically moved to or from the processor over a fast multi-lane bus such as PCI, PCI Express, or Serial RapidIO (SRIO).
The following lists the bandwidth required to transmit a single-channel raw video stream in the YUV domain using 8-bit pixel data and a 4:2:0 or 4:1:1 color scheme:
- 1080i60 - 746.496 Mbps
- 720p60 - 663.552 Mbps
- D1 (30 fps NTSC or 25 fps PAL) - 124.416 Mbps
- CIF (30 fps) - 36.49536 Mbps
- QCIF (15 fps) - 4.56192 Mbps
Therefore, a processor capable of decoding four 1080i60 H.264 channels would require a bus capable of supporting over 4 Gbps, assuming 60% bus utilization.
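These figures follow from simple pixel-rate arithmetic. A short sketch, assuming 8-bit pixels at 1.5 bytes each (4:2:0 or 4:1:1), with the 60% bus utilization stated above:

```python
# Raw YUV bandwidth: width x height x bytes/pixel x fps x 8 bits, in Mbps.
def yuv_bandwidth_mbps(width, height, frames_per_sec, bytes_per_pixel=1.5):
    return width * height * bytes_per_pixel * frames_per_sec * 8 / 1e6

b_1080i60 = yuv_bandwidth_mbps(1920, 1080, 30)   # 60 fields = 30 full frames
b_720p60 = yuv_bandwidth_mbps(1280, 720, 60)
print(b_1080i60)  # 746.496
print(b_720p60)   # 663.552

# Bus sizing for four decoded 1080i60 channels at 60% bus utilization:
required = 4 * b_1080i60 / 0.6
print(f"{required:.1f} Mbps")  # ~4976.6 Mbps, i.e. over 4 Gbps
```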
2.2 Processing Performance
The processing performance required for video processing on an H.264 channel of a programmable processor depends on many parameters, including resolution, bit rate, image quality, and video clip content. This chapter will not only discuss the factors that affect cycle consumption, but also provide rules of thumb for average cycle consumption for common application examples.
Like other video standards, H.264 defines only the decoder algorithm. For a given coded media stream, all compliant decoders generate the same YUV-domain video data.
The decoder therefore does not determine image quality; the encoder does. The encoder's choices can, however, affect the decoder's cycle consumption.
The cycles consumed by the entropy decoder depend on the type of entropy coder and the bit rate. H.264 MP/HP defines two lossless entropy-coding algorithms: Context-Adaptive Binary Arithmetic Coding (CABAC) and Context-Adaptive Variable-Length Coding (CAVLC). CABAC provides a higher compression ratio, and therefore better image quality for the same number of bits, but consumes approximately 25% more cycles per media-stream bit than CAVLC. The number of cycles required to decode a CABAC or CAVLC media stream is a nonlinear, monotonic function of the number of bits.
The processing load of all other decoder functions is a function of resolution: higher resolutions require more cycles, almost linearly with the total number of macroblocks. The video stream content and the encoder's algorithms and tools can also affect decoder cycle consumption to some extent. Appendix A – Decoder Performance Dependencies lists the encoder algorithms and tools that may affect decoder cycle consumption.
Implementing an encoder for a given bit rate on a programmable device requires a trade-off between quality and processing load. Appendix B – Motion Estimation and Rate Control analyzes two encoder algorithms that affect encoder quality and consume a large number of cycles.
For high-quality video streams with typical motion on consumer electronic devices, the following rules of thumb estimate the number of cycles consumed by the H.264 encoder in common use cases:
- QCIF resolution, 15 fps, 128 Kbps – 27 million cycles per channel
- CIF resolution, 30 fps, 300 Kbps – 200 million cycles per channel
- D1 resolution, NTSC or PAL, 2 Mbps – 660 million cycles per channel
- 720p resolution, 30 fps, 6 Mbps – 1.85 billion cycles per channel
- 1080i60, 60 fields per second, 9 Mbps – 3.45 billion cycles per channel
Similarly, the number of cycles consumed by the H.264 decoder is:
- QCIF resolution, 15 fps, 128 Kbps – 14 million cycles per channel
- CIF resolution, 30 fps, 300 Kbps – 70.5 million cycles per channel
- D1 resolution, NTSC or PAL, 2 Mbps – 292 million cycles per channel
- 720p resolution, 30 fps, 6 Mbps – 780 million cycles per channel
- 1080i60, 60 fields per second, 9 Mbps – 1.66 billion cycles per channel
The number of cycles consumed by a transcoding application (full decoder plus full encoder) is the sum of the encoder and decoder cycles, plus the cost of any additional processing (such as scaling) where required.
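Under the rules of thumb above, a rough channels-per-core estimate can be sketched. The 1 GHz core clock and the assumption that channels pack linearly are our illustrative assumptions, not vendor data:

```python
# Cycle figures per channel, taken from the rules of thumb listed above.
ENCODE_CYCLES = {"QCIF15": 27e6, "CIF30": 200e6, "D1": 660e6,
                 "720p30": 1.85e9, "1080i60": 3.45e9}
DECODE_CYCLES = {"QCIF15": 14e6, "CIF30": 70.5e6, "D1": 292e6,
                 "720p30": 780e6, "1080i60": 1.66e9}

def transcode_channels_per_core(resolution, core_hz=1e9):
    """Channels of full decode + re-encode at the same resolution per core."""
    per_channel = ENCODE_CYCLES[resolution] + DECODE_CYCLES[resolution]
    return core_hz // per_channel

print(transcode_channels_per_core("QCIF15"))  # 24.0 channels per 1 GHz core
print(transcode_channels_per_core("D1"))      # 1.0 channel per 1 GHz core
```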
2.3 Memory Considerations
The trade-off between cost and memory requirements is an important factor to consider in any hardware design. When analyzing the memory requirements of a multicore video processing solution, several questions need to be addressed:
- How much storage is required, and what type of storage (dedicated or shared)?
- Is the storage fast enough to support the traffic demands?
- Is the access bus fast enough to support traffic demands?
- Can the memory architecture support multi-core access with minimal multi-core performance loss?
- Does the memory architecture support the flow of data into and out of the processor with minimal data contention?
- What existing hardware supports memory access (such as DMA channels, DMA controllers, pre-fetch mechanisms, and a fast smart cache architecture)?
The amount of memory required depends on the application. The following three application examples illustrate three different memory requirements:
Wireless transrating example: converting QCIF H.264 BP to QCIF H.264 BP at a different bit rate with very low latency, using a single motion-estimation reference frame, requires enough capacity to store 5 frames. Each frame requires 38016 bytes, so the memory required for one channel (including storage of the input and output media streams) is under 256 KB. Processing 200 channels simultaneously requires 50 MB of data storage.
Multi-channel decoder example: for an H.264 HP 1080p decoder, if the number of B frames between consecutive P and I frames is 5 or fewer, only 7–8 frames need to be stored, so the storage required for a single channel (including the input and output media streams) is under 25 MB. Processing 5 channels simultaneously requires 125 MB of data storage.
High-quality broadcast example: H.264 HP 720p60 encoding of a live TV broadcast requires 600 MB of storage per channel, given the 7-second system delay required by the FCC. Processing two channels in parallel requires 1.2 GB of data storage.
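The per-channel figures above follow from simple frame-size arithmetic; this illustrative sketch reproduces them:

```python
# YUV 4:2:0 frame size: width x height x 1.5 bytes/pixel.
def frame_bytes(width, height, bytes_per_pixel=1.5):
    return int(width * height * bytes_per_pixel)

# Transrating example: a QCIF frame is 38016 bytes, and 5 frames are held.
qcif = frame_bytes(176, 144)
print(qcif, 5 * qcif)  # 38016 bytes/frame, 190080 bytes -- under 256 KB

# Decoder example: 7-8 buffered 1080p frames per channel.
print(f"{8 * frame_bytes(1920, 1080) / 2**20:.1f} MB")  # 23.7 MB, under 25 MB

# Broadcast example: 7 s of 720p60 = 420 buffered frames.
print(f"{420 * frame_bytes(1280, 720) / 2**20:.1f} MB")  # 553.7 MB, ~600 MB
```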
To minimize the cost of a video processing system, bulk data must reside in external memory, whose size is chosen for the worst-case application state. At the same time, the data being processed must sit in internal memory to sustain the processor's throughput. An optimized system uses a ping-pong mechanism, moving data between external and internal memory while the processor works on data already in internal memory. A typical processor has a small L1 memory configurable as cache or RAM, a larger dedicated L2 memory per core (also configurable as cache or RAM), and a shared L2 memory accessible by every core. To support the ping-pong mechanism, multiple independent DMA channels are needed to read and write external memory.
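A minimal, serialized model of the ping-pong idea (illustrative Python only; real DSP code would overlap the DMA transfer with processing using completion interrupts):

```python
# While the core processes one internal buffer, the other buffer is filled
# ("DMA in") with the next block from external memory; the roles then swap.
def process_stream(blocks):
    buffers = [None, None]          # two internal-memory buffers (ping/pong)
    results = []
    buffers[0] = blocks[0]          # prime the first buffer ("DMA in")
    for i, _ in enumerate(blocks):
        active = i % 2
        if i + 1 < len(blocks):
            buffers[1 - active] = blocks[i + 1]   # fill the idle buffer
        results.append(buffers[active] * 2)       # stand-in for processing
    return results

print(process_stream([1, 2, 3, 4]))  # [2, 4, 6, 8]
```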
Appendix C – External Memory Bandwidth estimates the bandwidth needed to move data between external and internal memory to support the ping-pong mechanism in the three examples above; the effective external-memory bandwidth must exceed 3.5 Gbps.
2.4 Collaboration and Synchronization Between Multiple Cores
When multiple cores process the same video channel, the cores must communicate with each other to synchronize, split, or share input data, merge output data, or exchange data during processing. Appendix A - Decoder Performance Dependencies describes several algorithms for partitioning video processing functions into multiple cores.
Two commonly used partitioning approaches are parallel processing and pipelining. In parallel processing, two or more cores process the same input channel. There must be a mechanism for sharing information between cores that is free of race conditions. Semaphores can protect global regions from races; the hardware needs to support both blocking and non-blocking semaphores to eliminate race conditions effectively, that is, to eliminate the possibility of two cores occupying the same memory address at the same time.
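The race-free sharing requirement can be illustrated on a host system. This sketch uses Python's `threading.Lock` as a stand-in for hardware semaphores, showing both blocking and non-blocking acquisition (the worker function names are hypothetical):

```python
import threading

shared_counter = 0
sem = threading.Lock()              # stand-in for a hardware semaphore

def worker_blocking(n):
    """Blocking acquire: wait until the shared region is free."""
    global shared_counter
    for _ in range(n):
        with sem:
            shared_counter += 1     # protected read-modify-write

def worker_nonblocking():
    """Non-blocking acquire: return immediately if the region is busy."""
    if sem.acquire(blocking=False):
        try:
            return True             # got the region
        finally:
            sem.release()
    return False                    # region busy; could do other work

threads = [threading.Thread(target=worker_blocking, args=(10000,))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(shared_counter)  # 40000 -- no updates lost to races
```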
If a pipeline algorithm is used, one or more cores perform the first part of the operation and then pass the intermediate results to a second set of cores for further processing. Since the video processing load depends on the content being processed, this delivery mechanism may face the following problems:
- If more than one core processes the first stage of the pipeline, then frame N+1 may be processed before frame N. Therefore the delivery mechanism must be able to order the input/output.
- Even if the cores in the pipeline are balanced overall (in terms of processing load), this may not be the case for individual frames. The delivery mechanism must provide buffers between the different pipeline stages so that cores that have not completed their work do not cause other cores to stall and wait.
- If the algorithm requires that two stages of the pipeline be tightly linked (for example, to resolve dependencies), then the mechanism must be able to support both tight and loose links.
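A delivery mechanism combining buffering and input/output ordering can be sketched as follows (illustrative Python, run serially; the bounded `Queue` stands in for the inter-stage buffer, and the frame data is hypothetical):

```python
import heapq
from queue import Queue

delivery = Queue(maxsize=8)          # bounded buffer between pipeline stages

# Stage 1 (several cores): frames may complete out of order.
for frame_no, data in [(1, "b"), (0, "a"), (3, "d"), (2, "c")]:
    delivery.put((frame_no, data))   # blocks if stage 2 falls behind

# Stage 2: restore frame order with a min-heap before further processing.
reorder, expected, out = [], 0, []
while not delivery.empty():
    heapq.heappush(reorder, delivery.get())
    while reorder and reorder[0][0] == expected:
        out.append(heapq.heappop(reorder)[1])
        expected += 1
print(out)  # ['a', 'b', 'c', 'd']
```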
2.5 Multi-chip System
Real-time processing of Super Video Graphics Array (SVGA), 4K, and higher resolutions, or of H.264 HP level 5, may require more than one chip working together. To build a two-chip system with such processing power, a very fast bus connecting the two chips is critical.
Section 3 describes the KeyStone DSP architecture, which meets all of the above requirements and challenges.
3 KeyStone DSP – TI's latest multi-core processor
The TI KeyStone architecture defines a family of multi-core devices for applications, such as video processing, that demand high performance and bandwidth. Figure 2 provides an overview of the KeyStone DSP. This section describes KeyStone DSP features in light of the video processing hardware requirements set out in Section 2.
Figure 2 KeyStone DSP block diagram
Table 1 illustrates how the KeyStone DSP meets video processing requirements.
Appendix A – Decoder Performance Dependencies
The tools and algorithms used by the encoder, together with the video content, affect decoder performance. The following factors matter:
- Choice of CABAC or CAVLC entropy decoder
- Number of frame skips
- Complexity of intra prediction modes
- Prediction type—motion estimation or intra prediction. (Motion compensation requires a different number of decoding cycles than intra prediction compensation. Whether motion compensation or intra prediction is used is up to the encoder.)
- Different motion-partitioning tools (one, four, or up to sixteen motion vectors per macroblock) can change the decoder complexity and cycle count.
- Motion compensation for B-frame macroblocks involves two reference macroblocks and consumes more cycles.
- The amount of motion in the media stream not only changes the number of skipped macroblocks, but also changes the processing requirements of the decoder.
- The distribution of the bitstream between different values of motion vectors, block values, flags, etc. depends on the content of the media stream and the encoder algorithm. Different distributions will change the number of cycles of the entropy decoder accordingly.
Appendix B – Motion Estimation and Rate Control
Motion estimation is a large part of H.264 encoding. The quality of the H.264 encoder depends on the quality of the motion estimation algorithm. The number of cycles required for motion estimation depends on the functional characteristics and features of the motion estimation algorithm. The following are the main factors that affect the consumption of motion estimation cycles:
- Frequency of I-frames, P-frames, and B-frames
- Number of reference frames in L0 (for P and B frames) and L1 (for B frames)
- Number of search areas
- Search area size
- Search Algorithms
A good motion estimation algorithm may consume 40–50% or more of the total encoding cycles.
The rate control algorithm is a major factor affecting encoding quality. To maximize the perceived quality of the video, an intelligent rate control algorithm distributes the available bits between macroblocks and frames.
Some systems can perform multiple processing passes to better distribute the available bits among macroblocks. Multiple passes improve perceived quality but require more intensive processing.
Appendix C - External Memory Bandwidth
Because of motion estimation, the encoder typically requires higher internal bandwidth than the decoder. The encoder requirement is calculated for two cases: low-bit-rate QCIF and high-bit-rate 1080p.
- Case 1 – QCIF H.264 BP encoder:
Two complete QCIF frames can reside in the cache or in an L2 ping-pong buffer; each frame requires less than 40 KB. When encoding a frame against one reference frame, the system transfers about 80 KB of data per frame and outputs a small amount of data. The total internal bandwidth required for 200 QCIF channels at 15 fps is:
80 KB × 15 (fps) × 200 (channels) + 200 (channels) × 256/8 KB/s (output bit rate per QCIF channel) = 240 MB/s + 6.4 MB/s ≈ 250 MB/s
- Case 2 – 1080p60 H.264 HP encoder:
Assume a worst-case algorithm in which the reference frame must be moved from external to internal memory up to three times, and an advanced algorithm using up to four reference frames. The motion-estimation traffic for a single 1080p60 channel is then:
3 (copies) × 1920 × 1080 × 1 (1 byte per pixel for motion estimation) × 60 (fps) × 4 (reference frames) = 1492.992 MBps
Moving the current frame in for processing and the reconstructed frame out after motion compensation adds:
2 (current frame in, reconstructed frame out) × 1920 × 1080 × 1.5 (bytes/pixel) × 60 = 373.248 MBps
The sum of the two results above is 1866.24 MBps for one channel (the compressed output bitstream adds comparatively little), or 3732.48 MBps for two H.264 HP 1080p60 encoders, which is about 30% of the raw external-memory data bandwidth.
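Both cases can be reproduced with straightforward arithmetic (decimal units, as in the figures above):

```python
# Case 1: QCIF encoder aggregate bandwidth.
qcif_traffic = 80 * 15 * 200 / 1000        # 80 KB/frame x 15 fps x 200 ch -> MB/s
qcif_output = 200 * 256 / 8 / 1000         # 256 Kbps output per channel -> MB/s
print(round(qcif_traffic + qcif_output, 1))  # 246.4, i.e. roughly 250 MB/s

# Case 2: 1080p60 encoder bandwidth.
luma = 1920 * 1080                         # 1 byte/pixel for motion estimation
me = 3 * luma * 60 * 4 / 1e6               # 3 copies, 60 fps, 4 reference frames
io = 2 * luma * 1.5 * 60 / 1e6             # current frame in + reconstructed out
print(me, io)                              # 1492.992 373.248
print(round(me + io, 2), round(2 * (me + io), 2))  # 1866.24 per ch, 3732.48 for two
```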