Hello everyone. Today I would like to share the difficulties and challenges we encountered in supporting real-time encoding, media processing, system architecture, and proprietary-cloud traffic distribution for 4K/8K ultra-high-definition video. I hope you find it helpful.
The content is divided into six parts:
01
Background of 4K/8K UHD video
With CCTV's Winter Olympics broadcast and the launch of the CCTV 8K channel, ultra-high-definition video has entered people's lives and demand has gradually grown. However, 8K video is still far from widespread, for the following reasons:
1. 8K has a great impact on existing live-streaming system architectures. For example, the bit rate of an 8K live stream is generally above 100 Mbps. Moreover, distributing a live stream usually requires converting the original 8K stream into different resolutions, bit rates, and frame rates through media processing before distribution. This brings new difficulties in computing power consumption, processing speed, and processing cost.
2. In addition, producing true 8K video is expensive, so 8K sources are relatively scarce. AI may help here, for example by super-resolving existing 4K video to 8K to compensate for the scarcity of high-definition sources. A low-cost real-time live media processing platform with high compression efficiency and enhancement capabilities can solve the above problems and pain points.
Based on this background, we began investing in an ultra-high-definition live media processing platform. The first problem we encountered was technology selection: should it be supported by CPUs or by dedicated chips? The advantages of the hardware solution are extremely stable encoding frame rates and low computing cost. The advantages of the software solution are that algorithm design is more flexible, and a pure-CPU encoder can achieve a higher compression ratio than a hardware solution through algorithm design. The software solution is also easier to upgrade. For example, if a hardware chip supports 8K H.265 encoding and you later want to support H.266, the hardware must be redesigned, whereas the software only needs a code upgrade, so the system can keep iterating to support the latest capabilities.
Taking all of the above into consideration, we finally chose a pure-CPU real-time encoding solution. Its advantages are a higher compression ratio, freedom to extend encoders and decoders, and easier support for complex logic and business requirements. In addition, the pure-CPU solution uses general-purpose computing power: when no 8K transcoding is running, these resources can easily be released for general CPU workloads.
Our current achievements:
1. For 8K real-time encoding, a single machine can achieve 60 FPS, and distributed cluster transcoding can achieve 120 FPS;
2. Due to the flexibility of the software, the 8K real-time transcoding system can support all mainstream video codec standards;
3. Ranked first in every category of the latest 2022 MSU encoder evaluation report;
4. Hold more than 100 H.266/VVC codec patents.
02
Codec acceleration
When optimizing the 8K encoder, there are two main directions:
1. Optimize the encoder's parallelism, allowing more frames to be encoded simultaneously and improving CPU utilization. We currently support multi-tile parallel encoding. At 8K, copying a video frame becomes a time-consuming operation, and the encoding process cannot avoid converting the standard YUV layout into the encoder's internal, optimized YUV layout, which requires copy operations. For this reason, we split an 8K frame into multiple slice intervals, with one thread per slice, to accelerate the frame copy in parallel (a sketch follows after this list). We also support pre-analysis and frame-level parallelism, multi-threaded decoupling between frames, and adaptive reference-frame selection based on the hierarchical structure, improving overall encoding parallelism.
2. The other direction is to optimize the encoder algorithms directly. Given the characteristics of 8K ultra-high resolution, the pre-analysis MVP can be used to skip part of the search process in the original encoder; when performing intra-frame and inter-frame analysis, the encoder can adaptively choose whether to run intra or inter analysis first for a fast search.
After accelerating and optimizing the encoding algorithm, we found that in 8K scenes the DCT accounts for a large proportion of the time, so non-standard DCT acceleration is supported to improve overall speed.
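As a rough illustration of the slice-parallel frame copy described in point 1 above, here is a minimal C++ sketch; the `Plane` struct, strides, and thread count are illustrative assumptions rather than the platform's actual interfaces.

```cpp
// Hypothetical sketch: split one plane of an 8K frame into horizontal slices
// and copy each slice on its own thread, so the repack from the standard YUV
// layout into an encoder-internal layout is not serialized on one core.
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

struct Plane {                       // illustrative description of one plane
    const uint8_t* src;
    uint8_t* dst;
    int src_stride, dst_stride;      // bytes per row in source and destination
    int width, height;               // width in bytes, height in rows
};

// Copy rows [row_begin, row_end) of one plane.
static void copy_rows(const Plane& p, int row_begin, int row_end) {
    for (int y = row_begin; y < row_end; ++y)
        std::memcpy(p.dst + (size_t)y * p.dst_stride,
                    p.src + (size_t)y * p.src_stride, p.width);
}

// Split the plane into num_slices row ranges and copy them in parallel.
void parallel_plane_copy(const Plane& p, int num_slices) {
    std::vector<std::thread> workers;
    int rows_per_slice = (p.height + num_slices - 1) / num_slices;
    for (int s = 0; s < num_slices; ++s) {
        int begin = s * rows_per_slice;
        int end = std::min(p.height, begin + rows_per_slice);
        if (begin >= end) break;
        workers.emplace_back(copy_rows, std::cref(p), begin, end);
    }
    for (auto& t : workers) t.join();
}
```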
After optimizing the encoder, the pure encoding speed can reach 60 or even 70FPS. However, once decoding is added to the processing flow, the transcoding speed and efficiency of the entire system will drop significantly.
Normally, 4K/1080P decoding does not consume a lot of resources, and the bottleneck is often encoding. But in the 8K scenario, decoding becomes a new bottleneck.
For example, the AVS3 decoder natively outputs NV12. In FFmpeg, a single thread then converts NV12 to planar YUV before the YUV frames are passed to subsequent operations. This is fine at 4K/1080p, because the NV12-to-YUV conversion is fast, but at 8K a single-threaded conversion struggles to reach 50 FPS. We therefore moved the NV12-to-YUV conversion into the decoder, where it runs on multiple threads after decoding, improving overall decoding speed.
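A minimal sketch of the idea, assuming 8-bit NV12 input and tightly packed I420 output; the function names and the band-splitting scheme are assumptions for illustration, not the actual decoder integration.

```cpp
// Hypothetical sketch: convert NV12 (Y plane + interleaved UV plane) to planar
// YUV 4:2:0 in parallel, with each thread handling a band of rows.
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

static void nv12_to_i420_rows(const uint8_t* y_src, const uint8_t* uv_src,
                              uint8_t* y_dst, uint8_t* u_dst, uint8_t* v_dst,
                              int width, int y_begin, int y_end) {
    // Luma rows are a straight copy.
    std::memcpy(y_dst + (size_t)y_begin * width, y_src + (size_t)y_begin * width,
                (size_t)(y_end - y_begin) * width);
    // Chroma is subsampled 2x2: deinterleave the corresponding UV rows.
    for (int cy = y_begin / 2; cy < y_end / 2; ++cy) {
        const uint8_t* uv = uv_src + (size_t)cy * width;   // one interleaved UV row
        uint8_t* u = u_dst + (size_t)cy * (width / 2);
        uint8_t* v = v_dst + (size_t)cy * (width / 2);
        for (int x = 0; x < width / 2; ++x) { u[x] = uv[2 * x]; v[x] = uv[2 * x + 1]; }
    }
}

void nv12_to_i420_parallel(const uint8_t* y_src, const uint8_t* uv_src,
                           uint8_t* y_dst, uint8_t* u_dst, uint8_t* v_dst,
                           int width, int height, int num_threads) {
    std::vector<std::thread> workers;
    // Keep each band an even number of luma rows so chroma rows split cleanly.
    int band = std::max(2, ((height / num_threads) + 1) & ~1);
    for (int y = 0; y < height; y += band)
        workers.emplace_back(nv12_to_i420_rows, y_src, uv_src, y_dst, u_dst, v_dst,
                             width, y, std::min(height, y + band));
    for (auto& t : workers) t.join();
}
```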
An 8K multi-tile H.265 bitstream is a different case. In FFmpeg, the H.265 decoder only updates its decoding progress after a full row of the picture has been decoded. With multiple tiles this slows decoding significantly: as shown in the figure above, progress is updated only after the first two tiles are fully decoded and the first row of the third tile is complete, which seriously limits parallel decoding. We optimized the H.265 decoder for the multi-tile scenario so that a progress notification is issued as soon as a single tile finishes decoding a row. After this optimization, a single machine can exceed 50 FPS.
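The following generic sketch (not FFmpeg's actual implementation) illustrates why finer-grained progress reporting helps: the decoding thread reports each completed row, and a consumer waiting for reference pixels can proceed as soon as the rows it needs are ready instead of waiting for a picture-wide row.

```cpp
// Hypothetical row-progress primitive between a producer (decoder thread) and
// a consumer (e.g. a frame that references the one being decoded).
#include <algorithm>
#include <condition_variable>
#include <mutex>

class RowProgress {
public:
    // Called by the decoding thread when all rows up to `row` are decoded
    // within its tile.
    void report(int row) {
        { std::lock_guard<std::mutex> lk(m_); done_row_ = std::max(done_row_, row); }
        cv_.notify_all();
    }
    // Called by a thread that needs reference data up to `row`.
    void wait_for(int row) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return done_row_ >= row; });
    }
private:
    int done_row_ = -1;
    std::mutex m_;
    std::condition_variable cv_;
};
```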
After optimizing the encoding and decoding rates, we found that two online devices of the same model, both with 128 GB of memory, had very different processing speeds. The faster device had 8×16 GB DIMMs, while the slower one had 4×32 GB DIMMs.
Memory bandwidth is not a limitation at low resolutions, but it has a big impact at 8K. For example, a 3.2 GHz CPU with four 32 GB DIMMs has a peak memory bandwidth of about 102 GB/s, while 8K 50 FPS 10-bit real-time encoding requires moving about 4.7 GB of raw frame data per second. In actual operation, however, memory transfers can only occupy a small fraction of the encoding timeline, since most of the time is spent on encoding computation, so the instantaneous bandwidth may be dozens of times higher than 4.7 GB/s; the multi-threaded parallel frame copying the system supports raises the instantaneous bandwidth further, which ultimately limits 8K encoding speed.
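A quick back-of-the-envelope check of these numbers; the channel count and transfer rate are assumptions that happen to match the ~102 GB/s figure (four DDR channels at 3200 MT/s, 8 bytes per transfer).

```cpp
// Rough arithmetic only; values are approximate.
#include <cstdio>

int main() {
    const double width = 7680, height = 4320, fps = 50;
    const double samples_per_pixel = 1.5;    // YUV 4:2:0
    const double bytes_per_sample = 2.0;     // 10-bit samples stored in 16 bits
    double raw_gib_per_s = width * height * samples_per_pixel * bytes_per_sample
                           * fps / (1024.0 * 1024.0 * 1024.0);
    // Assumed: four DDR channels at 3200 MT/s, 8 bytes per transfer.
    double peak_gb_per_s = 4 * 3200e6 * 8 / 1e9;
    std::printf("raw 8K 50FPS 10-bit video: ~%.1f GiB/s, peak DRAM: ~%.1f GB/s\n",
                raw_gib_per_s, peak_gb_per_s);  // ~4.6 GiB/s vs ~102.4 GB/s
    return 0;
}
```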
8K video encoding and decoding consume a lot of memory bandwidth. When configuring hardware, populate the memory according to the CPU's memory channels to maximize memory bandwidth.
The above problems show that in the 8K scenario every operation must be considered more carefully: a single extra copy may slow the system below real time. We therefore rebuilt the entire memory pool. No new memory is allocated during decoding, pre-processing, or watermarking; all operations are performed in place, and memory data is copied at most once wherever possible, reducing memory bandwidth usage.
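A minimal sketch of the reuse idea, assuming a fixed number of equally sized frame buffers; the class and its interface are illustrative, not the platform's actual memory pool.

```cpp
// Hypothetical frame-buffer pool: decode / watermark / scale stages borrow a
// buffer, work in place, and return it, so the hot path allocates nothing.
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

class FramePool {
public:
    FramePool(size_t frame_bytes, size_t count) {
        for (size_t i = 0; i < count; ++i) {
            storage_.emplace_back(frame_bytes);          // allocate once, up front
            free_.push_back(storage_.back().data());
        }
    }
    // Borrow a buffer; blocks if every buffer is currently in flight.
    uint8_t* acquire() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !free_.empty(); });
        uint8_t* buf = free_.back();
        free_.pop_back();
        return buf;
    }
    // Return a buffer once the last stage (e.g. the encoder) is done with it.
    void release(uint8_t* buf) {
        { std::lock_guard<std::mutex> lk(m_); free_.push_back(buf); }
        cv_.notify_one();
    }
private:
    std::vector<std::vector<uint8_t>> storage_;
    std::vector<uint8_t*> free_;
    std::mutex m_;
    std::condition_variable cv_;
};
```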
After optimizing encoding/decoding speed and memory usage, the system ran stably in most cases, but stability was still not high. Encoding the same video stream repeatedly, the speed fluctuated: sometimes it fully met real-time requirements, even reaching 60 or 70 FPS, but sometimes it dropped severely to only 40 FPS.
Modern operating systems usually support the NUMA architecture to improve multi-core memory access efficiency. Under NUMA, the processors are divided into multiple nodes, each with its own local memory and memory controller. A CPU accesses the memory of its own node through the integrated memory controller and the memory of other nodes through the QPI bus, so memory access within the same node is faster than cross-node access.
During encoding and decoding, each CPU performs both encoding and decoding, so the memory of a CPU's node holds both decoded frames and encoded frames. When references are made during encoding, a large amount of cross-node memory access is generated. In the 8K scenario, the bandwidth and latency pressure of cross-node access is rapidly amplified, causing I/O blocking and reduced concurrency.
The solution is to bind CPU cores and finely control the overall transcoding pipeline. Operations such as decoding, watermarking, resolution conversion, and encoding are each assigned to specific CPU cores, ensuring that interdependent operations complete on the same CPU and the same node and minimizing cross-node memory access (see the sketch below).
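A minimal Linux sketch of the core-binding idea, assuming glibc pthreads and that cores 0-31 belong to NUMA node 0 on the machine in question; the core range is hypothetical.

```cpp
// Hypothetical example: pin the calling thread to a core range so that decode,
// watermark, scale, and encode for one stream stay on one NUMA node.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for CPU_SET / pthread_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>

static void pin_current_thread_to_cores(int first_core, int last_core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = first_core; c <= last_core; ++c)
        CPU_SET(c, &set);
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        std::fprintf(stderr, "failed to set thread affinity\n");
}

int main() {
    std::thread stream_worker([] {
        pin_current_thread_to_cores(0, 31);   // assumed to be node 0's cores
        // ... run decode / filter / encode for this stream here ...
    });
    stream_worker.join();
    return 0;
}
```

A similar effect can also be approximated externally with `numactl --cpunodebind=0 --membind=0 <command>`, which additionally keeps memory allocations on the chosen node.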