As more and more mobile handheld terminals support video, the demand for network support for streaming media content and real-time communication is rising significantly. Although upgrading already-deployed 3G media gateways can add support for lower resolutions and frame rates, the limited processing power of those gateways means such upgrades cannot meet the needs of video as a mainstream application.
Supporting scalable, high-density video applications requires a significant increase in video processing capability. Multi-core digital signal processors (DSPs) not only provide the enhanced video processing performance to meet this demand, but also fully satisfy operators' requirements for scalability and low power consumption.
This paper introduces a new multi-core platform that achieves high-density video processing by optimizing inter-core communication, task management, and memory access. It also explains how the resulting scalable implementation supports multi-channel, multi-core HD video applications.
1 Introduction
The widespread deployment of 3G and 4G mobile networks around the world and the continuous emergence of wireless innovation hotspots have created critical demand for the data bandwidth used by handheld-terminal users. In addition to web/data applications, video has become another driving force behind the popularization of mobile data.
As more and more users turn to video applications, the network infrastructure must deliver significant performance improvements to support video content, as evidenced by the recent popularity of Apple's FaceTime video-calling application and similar apps.
Handheld terminals can now capture and play back video at higher resolutions and frame rates. Traditionally deployed 3G media gateways are designed to support high-density, multi-channel voice and low-resolution video, and often fail to meet user expectations for quality.
In addition, because handheld terminals can usually support only a few standards with limited parameter sets, due to constraints such as battery life and memory size, media gateways need to support more codecs and more conversion modes: transcoding (changing the standard), transsizing (changing the resolution), and transrating (changing the bit rate). For example, when a mobile phone user drives through an area at high speed, it is more efficient to let the network adapt to the bandwidth offered by the instantaneous link conditions, providing the appropriate compression, resolution, and bit rate, so that the video session is not interrupted and the handheld terminal does not waste bandwidth or battery power on scaling or editing.
To fully meet these needs, the video processing capability of high-density media gateways must increase significantly. Multi-core DSPs can deliver a scalable solution at lower operating cost, directly addressing the power consumption and footprint concerns that matter most to operators.
This article is organized as follows:
First, the challenges and resource requirements of processing high-resolution video are explained, along with how to implement video coding algorithms efficiently and scalably so that both low-resolution and high-resolution channels are supported.
Second, how hardware and software options can improve the efficiency of multi-core operations is discussed.
Finally, the article reviews the state of the art in multi-core DSPs and discusses a platform available to developers.
2 Challenges of HD Video in Infrastructure
Figure 1 depicts a video communication system built around a central-office network. A typical system should support multiple functions, including:
- Transcoding and Rate Adaptation for High-Density Media
- Audio transcoding related to video transcoding
- Large-scale multi-party video conferencing
- Processing of other media such as speech
Figure 1 Network-based video communication
Transcoding is a typical communications-infrastructure video application: a YUV-domain video stream is decoded from a compressed input stream and then re-encoded using a different standard (transcoding), a different bit rate (transrating), a different resolution (transsizing), or any combination of the above. Video content comes from a wide range of sources, from high-quality professional HD cameras to low-resolution smartphone recordings, and is consumed on everything from large HDTV screens to low-resolution handheld terminals. Video infrastructure must therefore meet a full range of requirements, including:
- Multiple encoding and decoding standards, such as DV, MPEG-2/4, H.264, and the future H.265.
- Multiple resolutions and frame rates, from 128×96-pixel Sub-Quarter Common Intermediate Format (SQCIF) and even lower, up to HD (1920×1080) and even UHD (4320p, 8K), at 10 to 60 frames per second.
- Input/output (I/O) bit rates for various encodings range from 48 Kbps for low-resolution, low-quality handheld video streams to professional-quality 50 Mbps (H.264 level 4.2) and higher. The bandwidth requirements for YUV domain video streams are very high, for example, a YUV 1080p60 video stream with a 4:2:0 color scheme requires about 1.5 Gbps of bandwidth.
Latency requirements vary by application: video conferencing and real-time gaming applications have very strict latency requirements of less than 100 milliseconds; video-on-demand applications can tolerate moderate latency (up to a few seconds), and non-real-time processing applications such as storage can tolerate longer delays.
The challenge facing infrastructure networks is how to deliver all content to all desired users while maintaining high utilization and efficiency of hardware resources. To illustrate, a single 1080i60 channel requires the same processing load as 164 channels of Quarter Common Intermediate Format (QCIF) at 15 fps (assuming load scales linearly with resolution and frame rate). The hardware that supports a single 1080i60 channel should therefore also be able to support 164 QCIF channels with the same efficiency and high utilization. Scalability at this level is a challenge.
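Under the stated linearity assumption, the 164:1 figure can be checked directly. A minimal sketch (the helper name is ours, for illustration only):

```python
# Illustrative check of the scaling claim: processing load is assumed
# proportional to pixel throughput (resolution x frame rate).
def pixel_rate(width, height, fps):
    """Pixels processed per second for one channel."""
    return width * height * fps

# 1080i60: 1920x1080 interlaced, 60 fields/s = 30 full frames/s
hd_rate = pixel_rate(1920, 1080, 30)
# QCIF at 15 fps
qcif_rate = pixel_rate(176, 144, 15)

ratio = hd_rate / qcif_rate
print(f"One 1080i60 channel ~= {ratio:.0f} QCIF@15fps channels")  # ~164
```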
In order to meet the high scalability requirements, a programmable hardware solution must be adopted. Some video applications require the processor to have very high bit rate signal input and output, so the processor-based solution must have ideal peripherals to support sufficient external interfaces. Such a processor must have enough processing power to handle real-time high-definition high-quality video, and it must also be equipped with sufficient local resources such as fast memory, internal bus and DMA support to make efficient use of the processor's processing power.
Single-core DSPs are highly flexible and can execute a wide range of algorithms efficiently; they can process voice, audio, and video as well as other functions. However, a single-core DSP lacks the computing power to process real-time video at arbitrary resolutions and is limited to lower resolutions such as SQCIF, QCIF, and Common Intermediate Format (CIF); moreover, its power consumption also keeps it out of high-density video processing systems.
New multi-core DSPs have very high processing capabilities and consume less power per operation than single-core DSPs. To determine whether multi-core processors can be an effective hardware solution for communication infrastructure, their interfaces, processing performance, memory requirements, and multi-core cooperation and synchronization mechanisms need to be verified for compliance with various use cases.
2.1 External I/O Interface
The bitstream for a typical transcoding application is packaged in IP packets. The bandwidth required for the transcoding application is related to the resolution and the available bandwidth of the network required by the user. The following are common bandwidth requirements for a single-channel consumer quality H.264 encoded video stream as a function of resolution:
- HD resolution, 720p or 1080i - 6 to 10 Mbps
- D1 resolution, 720×480, 30 frames per second (fps), or 720×576, 25 fps – 1 to 3 Mbps
- CIF resolution, 352×288, 30 frames/second – 300 to 768 Kbps
- QCIF resolution, 176×144, 15 frames/sec – 64 to 256 Kbps
The total external interface required for a transcoding application is the sum of the bandwidth required for the incoming media stream and the outgoing media stream. To support multiple HD resolution channels or a large number of lower resolution channels, at least one Serial Gigabit Media Independent Interface (SGMII) is required.
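As a rough sizing sketch of this sum (the channel count and per-stream rates below are illustrative values drawn from the consumer-quality list above):

```python
# Sketch: total external I/O for a transcoding application is the sum of
# input and output stream bandwidths (rates in Mbps, from the list above).
def transcode_io_mbps(in_rate_mbps, out_rate_mbps, channels):
    """Aggregate ingress + egress bandwidth for `channels` transcoded streams."""
    return (in_rate_mbps + out_rate_mbps) * channels

# e.g. 40 HD channels arriving at 8 Mbps, re-encoded and sent out at 6 Mbps
total = transcode_io_mbps(8, 6, 40)
print(f"{total} Mbps aggregate")  # 560 Mbps -- fits within one SGMII link
```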
Non-transcoding video applications involve encoding raw video to, or decoding it from, the YUV (or equivalent) domain. Raw video streams have high bit rates and are typically moved to or from the processor over a fast multi-lane bus such as PCI, PCI Express, or Serial RapidIO (SRIO).
The following lists the bandwidth required to transmit a single-channel raw video stream in the YUV domain using 8-bit pixel data and a 4:2:0 or 4:1:1 color scheme:
- 1080i60 - 746.496 Mbps
- 720p60 - 663.552 Mbps
- D1 (30 fps NTSC or 25 fps PAL) - 124.416 Mbps
- CIF (30 fps) - 36.49536 Mbps
- QCIF (15 fps) - 4.56192 Mbps
Therefore, a processor capable of decoding four 1080i60 H.264 channels would require a bus capable of supporting over 4 Gbps, assuming 60% bus utilization.
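These figures follow from simple pixel-rate arithmetic. A short sketch, assuming 8-bit pixels at 1.5 bytes each (4:2:0 or 4:1:1), with the 60% bus utilization stated above:

```python
# Raw YUV bandwidth: width x height x bytes/pixel x fps x 8 bits, in Mbps.
def yuv_bandwidth_mbps(width, height, frames_per_sec, bytes_per_pixel=1.5):
    return width * height * bytes_per_pixel * frames_per_sec * 8 / 1e6

b_1080i60 = yuv_bandwidth_mbps(1920, 1080, 30)   # 60 fields = 30 full frames
b_720p60 = yuv_bandwidth_mbps(1280, 720, 60)
print(b_1080i60)  # 746.496
print(b_720p60)   # 663.552

# Bus sizing for four decoded 1080i60 channels at 60% bus utilization:
required = 4 * b_1080i60 / 0.6
print(f"{required:.1f} Mbps")  # ~4976.6 Mbps, i.e. over 4 Gbps
```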
2.2 Processing Performance
The processing performance required for video processing on an H.264 channel of a programmable processor depends on many parameters, including resolution, bit rate, image quality, and video clip content. This chapter will not only discuss the factors that affect cycle consumption, but also provide rules of thumb for average cycle consumption for common application examples.
Like other video standards, H.264 defines only the decoder algorithm. For a given coded media stream, all compliant decoders generate the same YUV-domain video data.
The decoder therefore does not determine image quality; the encoder does. The encoder's choices can, however, affect the decoder's cycle consumption.
The cycles consumed by the entropy decoder depend on the type of entropy coder and the bit rate. H.264 MP/HP defines two lossless entropy-coding algorithms: Context-Adaptive Binary Arithmetic Coding (CABAC) and Context-Adaptive Variable-Length Coding (CAVLC). CABAC provides a higher compression ratio, and therefore better image quality for the same number of bits, but consumes approximately 25% more cycles per media-stream bit than CAVLC. The number of cycles required to decode a CABAC or CAVLC media stream is a nonlinear, monotonic function of the number of bits.
The processing load of all other decoder functions is a function of resolution: higher resolutions require more cycles, almost linearly with the total number of macroblocks. The video stream content and the encoder's algorithms and tools can also affect decoder cycle consumption to some extent. Appendix A – Decoder Performance Dependencies lists the encoder algorithms and tools that may affect decoder cycle consumption.
Implementing an encoder for a given bit rate on a programmable device requires a trade-off between quality and processing load. Appendix B – Motion Estimation and Rate Control analyzes two encoder algorithms that affect encoder quality and consume a large number of cycles.
For high-quality video streams with typical motion on consumer electronic devices, the following rules of thumb estimate the number of cycles consumed by the H.264 encoder in common use cases:
- QCIF resolution, 15 fps, 128 Kbps – 27 million cycles per channel
- CIF resolution, 30 fps, 300 Kbps – 200 million cycles per channel
- D1 resolution, NTSC or PAL, 2 Mbps – 660 million cycles per channel
- 720p resolution, 30 fps, 6 Mbps – 1.85 billion cycles per channel
- 1080i60, 60 fields per second, 9 Mbps – 3.45 billion cycles per channel
Similarly, the number of cycles consumed by the H.264 decoder is:
- QCIF resolution, 15 fps, 128 Kbps – 14 million cycles per channel
- CIF resolution, 30 fps, 300 Kbps – 70.5 million cycles per channel
- D1 resolution, NTSC or PAL, 2 Mbps – 292 million cycles per channel
- 720p resolution, 30 fps, 6 Mbps – 780 million cycles per channel
- 1080i60, 60 fields per second, 9 Mbps – 1.66 billion cycles per channel
The number of cycles consumed by a transcoding application (full decoder plus full encoder) is the sum of the encoder and decoder cycles, plus the cost of any additional processing (such as scaling) where required.
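Under the rules of thumb above, a rough channels-per-core estimate can be sketched. The 1 GHz core clock and the assumption that channels pack linearly are our illustrative assumptions, not vendor data:

```python
# Cycle figures per channel, taken from the rules of thumb listed above.
ENCODE_CYCLES = {"QCIF15": 27e6, "CIF30": 200e6, "D1": 660e6,
                 "720p30": 1.85e9, "1080i60": 3.45e9}
DECODE_CYCLES = {"QCIF15": 14e6, "CIF30": 70.5e6, "D1": 292e6,
                 "720p30": 780e6, "1080i60": 1.66e9}

def transcode_channels_per_core(resolution, core_hz=1e9):
    """Channels of full decode + re-encode at the same resolution per core."""
    per_channel = ENCODE_CYCLES[resolution] + DECODE_CYCLES[resolution]
    return core_hz // per_channel

print(transcode_channels_per_core("QCIF15"))  # 24.0 channels per 1 GHz core
print(transcode_channels_per_core("D1"))      # 1.0 channel per 1 GHz core
```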
2.3 Memory Considerations
The trade-off between cost and memory requirements is an important factor to consider in any hardware design. When analyzing the memory requirements of a multicore video processing solution, several questions need to be addressed:
- How much storage is required, and what type of storage (dedicated or shared)?
- Is the storage fast enough to support the traffic demands?
- Is the access bus fast enough to support traffic demands?
- Can the memory architecture support multi-core access with minimal multi-core performance loss?
- Does the memory architecture support the flow of data into and out of the processor with minimal data contention?
- What existing hardware supports memory access (such as DMA channels, DMA controllers, pre-fetch mechanisms, and a fast smart cache architecture)?
The amount of memory required depends on the application. The following three application examples illustrate three different memory requirements:
Wireless transrating example: converting QCIF H.264 BP to QCIF H.264 BP at a different bit rate with very low latency, using a single motion-estimation reference frame, requires enough capacity to store 5 frames. Each frame requires 38016 bytes, so the memory required for one channel (including storage of the input and output media streams) is under 256 KB. Processing 200 channels simultaneously requires 50 MB of data storage.
Multi-channel decoder example: for an H.264 HP 1080p decoder, if the number of B frames between consecutive P and I frames is 5 or fewer, only 7–8 frames need to be stored, so the storage required for a single channel (including the input and output media streams) is under 25 MB. Processing 5 channels simultaneously requires 125 MB of data storage.
High-quality broadcast example: H.264 HP 720p60 encoding of a live TV broadcast requires 600 MB of storage per channel, given the 7-second system delay required by the FCC. Processing two channels in parallel requires 1.2 GB of data storage.
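The per-channel figures above follow from simple frame-size arithmetic; this illustrative sketch reproduces them:

```python
# YUV 4:2:0 frame size: width x height x 1.5 bytes/pixel.
def frame_bytes(width, height, bytes_per_pixel=1.5):
    return int(width * height * bytes_per_pixel)

# Transrating example: a QCIF frame is 38016 bytes, and 5 frames are held.
qcif = frame_bytes(176, 144)
print(qcif, 5 * qcif)  # 38016 bytes/frame, 190080 bytes -- under 256 KB

# Decoder example: 7-8 buffered 1080p frames per channel.
print(f"{8 * frame_bytes(1920, 1080) / 2**20:.1f} MB")  # 23.7 MB, under 25 MB

# Broadcast example: 7 s of 720p60 = 420 buffered frames.
print(f"{420 * frame_bytes(1280, 720) / 2**20:.1f} MB")  # 553.7 MB, ~600 MB
```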
To minimize the cost of a video processing system, bulk data must reside in external memory, whose size is chosen for the worst-case application state. At the same time, the data being processed must sit in internal memory to sustain the processor's throughput. An optimized system uses a ping-pong mechanism, moving data between external and internal memory while the processor works on data already in internal memory. A typical processor has a small L1 memory configurable as cache or RAM, a larger dedicated L2 memory per core (also configurable as cache or RAM), and a shared L2 memory accessible by every core. To support the ping-pong mechanism, multiple independent DMA channels are needed to read and write external memory.
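A minimal, serialized model of the ping-pong idea (illustrative Python only; real DSP code would overlap the DMA transfer with processing using completion interrupts):

```python
# While the core processes one internal buffer, the other buffer is filled
# ("DMA in") with the next block from external memory; the roles then swap.
def process_stream(blocks):
    buffers = [None, None]          # two internal-memory buffers (ping/pong)
    results = []
    buffers[0] = blocks[0]          # prime the first buffer ("DMA in")
    for i, _ in enumerate(blocks):
        active = i % 2
        if i + 1 < len(blocks):
            buffers[1 - active] = blocks[i + 1]   # fill the idle buffer
        results.append(buffers[active] * 2)       # stand-in for processing
    return results

print(process_stream([1, 2, 3, 4]))  # [2, 4, 6, 8]
```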
Appendix C – External Memory Bandwidth estimates the bandwidth needed to move data between external and internal memory to support the ping-pong mechanism in the three examples above; the effective external-memory bandwidth must exceed 3.5 Gbps.
2.4 Collaboration and Synchronization Between Multiple Cores
When multiple cores process the same video channel, the cores must communicate with each other to synchronize, split, or share input data, merge output data, or exchange data during processing. Appendix A - Decoder Performance Dependencies describes several algorithms for partitioning video processing functions into multiple cores.
Two commonly used partitioning approaches are parallel processing and pipelining. In parallel processing, two or more cores process the same input channel. There must be a mechanism for sharing information between cores that is free of race conditions. Semaphores can protect global regions from races; the hardware needs to support both blocking and non-blocking semaphores to eliminate race conditions effectively, that is, to eliminate the possibility of two cores occupying the same memory address at the same time.
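The race-free sharing requirement can be illustrated on a host system. This sketch uses Python's `threading.Lock` as a stand-in for hardware semaphores, showing both blocking and non-blocking acquisition (the worker function names are hypothetical):

```python
import threading

shared_counter = 0
sem = threading.Lock()              # stand-in for a hardware semaphore

def worker_blocking(n):
    """Blocking acquire: wait until the shared region is free."""
    global shared_counter
    for _ in range(n):
        with sem:
            shared_counter += 1     # protected read-modify-write

def worker_nonblocking():
    """Non-blocking acquire: return immediately if the region is busy."""
    if sem.acquire(blocking=False):
        try:
            return True             # got the region
        finally:
            sem.release()
    return False                    # region busy; could do other work

threads = [threading.Thread(target=worker_blocking, args=(10000,))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(shared_counter)  # 40000 -- no updates lost to races
```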
If a pipeline algorithm is used, one or more cores perform the first part of the operation and then pass the intermediate results to a second set of cores for further processing. Since the video processing load depends on the content being processed, this delivery mechanism may face the following problems:
- If more than one core processes the first stage of the pipeline, then frame N+1 may be processed before frame N. Therefore the delivery mechanism must be able to order the input/output.
- Even if the cores in the pipeline are balanced overall (in terms of processing load), this may not be the case for individual frames. The delivery mechanism must provide buffers between the different pipeline stages so that cores that have not completed their work do not cause other cores to stall and wait.
- If the algorithm requires that two stages of the pipeline be tightly linked (for example, to resolve dependencies), then the mechanism must be able to support both tight and loose links.
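A delivery mechanism combining buffering and input/output ordering can be sketched as follows (illustrative Python, run serially; the bounded `Queue` stands in for the inter-stage buffer, and the frame data is hypothetical):

```python
import heapq
from queue import Queue

delivery = Queue(maxsize=8)          # bounded buffer between pipeline stages

# Stage 1 (several cores): frames may complete out of order.
for frame_no, data in [(1, "b"), (0, "a"), (3, "d"), (2, "c")]:
    delivery.put((frame_no, data))   # blocks if stage 2 falls behind

# Stage 2: restore frame order with a min-heap before further processing.
reorder, expected, out = [], 0, []
while not delivery.empty():
    heapq.heappush(reorder, delivery.get())
    while reorder and reorder[0][0] == expected:
        out.append(heapq.heappop(reorder)[1])
        expected += 1
print(out)  # ['a', 'b', 'c', 'd']
```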
2.5 Multi-chip System
Real-time processing of Super Video Graphics Array (SVGA), 4K, and higher resolutions, or of H.264 HP level 5, may require more than one chip working together. To build a two-chip system with such processing power, a very fast bus connecting the two chips is critical.
Section 3 describes the KeyStone DSP architecture, which meets all of the above requirements and challenges.
3 KeyStone DSP – TI's latest multi-core processor
The TI KeyStone architecture defines a family of multi-core devices for applications, such as video processing, that demand high performance and bandwidth. Figure 2 provides an overview of the KeyStone DSP. This section describes KeyStone DSP features in light of the video processing hardware requirements set out in Section 2.
Figure 2 KeyStone DSP block diagram
Table 1 illustrates how the KeyStone DSP meets video processing requirements.
Appendix A – Decoder Performance Dependencies
The tools and algorithms used by the encoder, together with the video content, affect decoder performance. The following factors matter:
- Choice of CABAC or CAVLC entropy decoder
- Number of frame skips
- Complexity of intra prediction modes
- Prediction type—motion estimation or intra prediction. (Motion compensation requires a different number of decoding cycles than intra prediction compensation. Whether motion compensation or intra prediction is used is up to the encoder.)
- Different motion-partitioning tools (one, four, or up to sixteen motion vectors per macroblock) can change the decoder complexity and cycle count.
- Motion compensation for B-frame macroblocks involves two reference macroblocks and consumes more cycles.
- The amount of motion in the media stream not only changes the number of skipped macroblocks, but also changes the processing requirements of the decoder.
- The distribution of the bitstream between different values of motion vectors, block values, flags, etc. depends on the content of the media stream and the encoder algorithm. Different distributions will change the number of cycles of the entropy decoder accordingly.
Appendix B – Motion Estimation and Rate Control
Motion estimation is a large part of H.264 encoding. The quality of the H.264 encoder depends on the quality of the motion estimation algorithm. The number of cycles required for motion estimation depends on the functional characteristics and features of the motion estimation algorithm. The following are the main factors that affect the consumption of motion estimation cycles:
- Frequency of I-frames, P-frames, and B-frames
- Number of reference frames in L0 (for P and B frames) and L1 (for B frames)
- Number of search areas
- Search area size
- Search Algorithms
A good motion estimation algorithm may consume 40–50% or more of the total encoding cycles.
The rate control algorithm is a major factor affecting encoding quality. To maximize the perceived quality of the video, an intelligent rate control algorithm distributes the available bits between macroblocks and frames.
Some systems can perform multiple processing passes to better distribute the available bits among macroblocks. Multiple passes improve perceived quality but require more intensive processing.
Appendix C - External Memory Bandwidth
Because of motion estimation, the encoder typically requires higher internal bandwidth than the decoder. The encoder requirement is calculated for two cases: low-bit-rate QCIF and high-bit-rate 1080p.
- Case 1 – QCIF H.264 BP encoder:
Two complete QCIF frames can reside in the cache or in an L2 ping-pong buffer; each frame requires less than 40 KB. When encoding a frame against one reference frame, the system transfers about 80 KB of data per frame and outputs a small amount of data. The total internal bandwidth required for 200 QCIF channels at 15 fps is:
80 KB × 15 (fps) × 200 (channels) + 200 (channels) × 256/8 KB/s (output bit rate per QCIF channel) = 240 MB/s + 6.4 MB/s ≈ 250 MB/s
- Case 2 – 1080p60 H.264 HP encoder:
Assume a worst-case algorithm in which the reference frame must be moved from external to internal memory up to three times, and an advanced algorithm using up to four reference frames. The motion-estimation traffic for a single 1080p60 channel is then:
3 (copies) × 1920 × 1080 × 1 (1 byte per pixel for motion estimation) × 60 (fps) × 4 (reference frames) = 1492.992 MBps
Moving the current frame in for processing and the reconstructed frame out after motion compensation adds:
2 (current frame in, reconstructed frame out) × 1920 × 1080 × 1.5 (bytes/pixel) × 60 = 373.248 MBps
The sum of the two results above is 1866.24 MBps for one channel (the compressed output bitstream adds comparatively little), or 3732.48 MBps for two H.264 HP 1080p60 encoders, which is about 30% of the raw external-memory data bandwidth.
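Both cases can be reproduced with straightforward arithmetic (decimal units, as in the figures above):

```python
# Case 1: QCIF encoder aggregate bandwidth.
qcif_traffic = 80 * 15 * 200 / 1000        # 80 KB/frame x 15 fps x 200 ch -> MB/s
qcif_output = 200 * 256 / 8 / 1000         # 256 Kbps output per channel -> MB/s
print(round(qcif_traffic + qcif_output, 1))  # 246.4, i.e. roughly 250 MB/s

# Case 2: 1080p60 encoder bandwidth.
luma = 1920 * 1080                         # 1 byte/pixel for motion estimation
me = 3 * luma * 60 * 4 / 1e6               # 3 copies, 60 fps, 4 reference frames
io = 2 * luma * 1.5 * 60 / 1e6             # current frame in + reconstructed out
print(me, io)                              # 1492.992 373.248
print(round(me + io, 2), round(2 * (me + io), 2))  # 1866.24 per ch, 3732.48 for two
```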