What kind of coding future are we pursuing?
Imagine: ubiquitous video penetration, explosive traffic growth, diverse technical requirements across scenarios, and users who refuse to compromise on video experience. The rapid development of the audio and video industry has been accompanied by challenges such as the slow upgrading of encoding standards, the exhaustion of hardware dividends, and rising costs driven by encoding complexity. Is there still room for video encoding to be pushed further? What kind of video encoding technology can balance experience and cost? Video coding for machine vision, virtual-reality video, intelligent-application video... the next wave is already rolling in, so how will the "future" of video encoding unfold? This article is based on an interview planned by LiveVideoStack with Chen Gaoxing, head of the video encoding service of IMMENSE and "Alibaba Cloud Video Cloud".
Many demands, more contradictions
Has the pace of technological iteration stagnated? Has Moore's Law come to an end?
Video codec technology has improved compression efficiency by roughly 50% about every ten years, but this "decade-long grind" has long since fallen behind the pace at which video information is expanding. The increase in coding complexity brought by each new standard far exceeds the growth of CPU processing power, making it hard for new coding technology to become truly universal. And as video expands into more application scenarios, a single coding standard can no longer cover the needs of every video application. On one hand, the arrival of the AR and VR era, together with 4K and 8K resolutions, 60-120 fps high frame rates, and 10-12 bit wide color gamut, has multiplied the amount of information carried by the video itself; on the other hand, the old approach of trading stacked hardware resources for compression efficiency is losing steam as "Moore's Law" approaches its end. Add to this the demand for "ultra-low latency" in encoding speed, and the contradictions among video experience, bandwidth, computing cost, and encoding speed become even more pronounced. We therefore keep facing demands for higher-definition, lower-latency, and more efficient encoding, while facing just as many contradictions between technology and demand. Against the backdrop of these seemingly irreconcilable contradictions, a number of questions worth deeper discussion emerge:
➤ What aspects do existing coding standards not pay enough attention to?
➤ How can we first make the best use of existing coding standards?
➤ Which dimensions can existing video encoding technologies not cover?
➤ Beyond bitrate and quality, what other goals should video encoding focus on?
➤ How can we break the inertia of trading stacked resources for better video compression efficiency?
…
From these needs, contradictions, and questions we can draw a deeper conclusion: the goal of coding optimization is no longer only the traditional dimensions of subjective and objective quality, complexity, and latency; it must also consider friendliness to AI processing and consistent performance across multiple platforms. Raising a problem always goes hand in hand with choosing how to solve it and which technical direction to take, and so the codec architecture is being pushed to evolve from the traditional toward something smarter and more adaptable.
An "ultimate goal" that misses the mark
What exactly do we need to pursue when optimizing codecs?
When Alibaba Cloud Video Cloud introduced the concept of "narrowband HD" to the industry in 2015, and officially launched and productized the Narrowband HD technology brand in 2016, this approach of "reducing bitrate" while "improving clarity" quickly became an almost universal solution across the industry. As the technology has continued to evolve, however, a kind of "involution" has taken hold: an excessive pursuit of improvements in particular objective metrics. Yet from the perspective of video built around "people", what ultimately matters in the end-user experience is subjective quality. In actual R&D, by contrast, and especially in encoder optimization, work is usually driven by full-reference objective metrics such as PSNR, SSIM, and VMAF-NEG. It is true that in most cases an improvement in objective quality is reflected, to some extent, in an improvement in subjective quality; when the sample size is large enough and the objective gain is substantial, objective metrics and subjective perception generally agree. In the optimization practice of Narrowband HD, however, there are also cases where subjective and objective optimization diverge. For example: the SAO tool in the H.265 standard reduces ringing artifacts, but lowers VMAF and VMAF-NEG scores; the psy (psycho-visual) options in the x265 encoder add high-frequency detail that improves subjective quality, but are unfriendly to objective metrics; JND and ROI techniques, which mine visually imperceptible redundancy, inevitably cause full-reference objective metrics to drop; Alibaba Cloud's self-developed rate-control algorithm allocates more bitrate to areas prone to subjective problems such as blocking artifacts in order to protect subjective quality, which also lowers objective quality; and the various restoration and generation techniques used in pre-processing enhancement directly modify the source, which is inherently unfriendly to full-reference metrics designed to measure "difference from the source". Moreover, "over-optimizing" a single objective metric can itself push that metric into conflict with the subjective experience. The value of any single objective metric, high or low, should therefore not be the "ultimate goal" of video coding optimization.
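As a minimal illustration of what "full-reference" means in practice, the sketch below scores a processed frame against its source with PSNR and SSIM. The use of scikit-image and the noise-like "enhancement" step are assumptions for the example; the article does not name any specific tool. The point is simply that any step which alters pixels relative to the source can pull these scores down even when viewers prefer the result.

```python
# Minimal sketch: full-reference metrics compare a processed frame to its source.
# scikit-image is assumed here; the article does not prescribe a measurement tool.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def full_reference_scores(source: np.ndarray, processed: np.ndarray) -> dict:
    """Return PSNR and SSIM of `processed` measured against `source` (8-bit grayscale)."""
    return {
        "psnr": peak_signal_noise_ratio(source, processed, data_range=255),
        "ssim": structural_similarity(source, processed, data_range=255),
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.integers(0, 256, size=(480, 640), dtype=np.uint8)

    # A hypothetical "detail-generating" step: it changes pixels relative to the source,
    # so full-reference scores fall even if a viewer might judge the frame as sharper.
    enhanced = np.clip(src.astype(np.int16) + rng.integers(-8, 9, src.shape), 0, 255).astype(np.uint8)

    print(full_reference_scores(src, enhanced))
```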
The subtleties reveal the world
What refined solutions can we find in our vision for encoding and decoding?
Supported by the technical concepts and intelligent coding architecture described above, "Narrowband HD 2.0" starts from a model of human visual perception and shifts the encoder's optimization goal from "higher fidelity" to "better subjective experience". This plays out along two dimensions: visual coding and detail restoration. In the visual coding dimension, Narrowband HD 2.0 makes frame-type decisions and block-level bitrate allocation based on scene and content, and uses a subjectively friendly algorithm for mode decisions. In the content-adaptive coding part, since the human eye perceives spatial brightness, contrast, and temporal distortion discontinuously, adaptive coding based on just-noticeable distortion (JND) discards visually redundant information, saving substantial bandwidth without a significant loss in subjective quality; at the same time, ROI-based rate control adjusts the bitrate allocation strategy to further improve clarity in the regions the human eye cares about. In the detail restoration dimension, Narrowband HD 2.0 uses detail restoration and generation technology based on generative adversarial networks (GANs). While repairing the blocking artifacts and edge burrs caused by compression, it also "fills in" natural texture details, making the picture richer, more natural, and more textured. More importantly, for vertical scenarios the model performs more intelligent, scene-aware texture generation. For example, for concert scenes we built a customized portrait template exclusively for BesTV that improves detail restoration and generation in portrait regions, delivering the idol "fancam" experience to viewers' screens over live streaming. In NBA basketball scenes, the AI restoration model strengthens the restoration and generation of elements unique to basketball broadcasts, such as the texture of the court floor, player close-ups, court boundaries, on-court advertising lettering, jersey numbers, and the net, greatly improving picture clarity and overall visual impact. It is in these subtleties that the craft of the technology shows.
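To make the block-level idea behind ROI-based rate control concrete, here is a toy sketch (not Alibaba Cloud's production algorithm): blocks covered by a region-of-interest mask receive a negative QP offset (finer quantization, more bits), while background blocks receive a positive offset. The block size and offset values are placeholder assumptions.

```python
# Toy sketch of ROI-driven block-level QP offsets (illustrative only; not the
# production rate-control algorithm described in the article).
import numpy as np

def roi_qp_offsets(roi_mask: np.ndarray, block: int = 16,
                   roi_delta: int = -3, bg_delta: int = 2) -> np.ndarray:
    """Map a per-pixel ROI mask (H x W, bool) to a per-block QP offset grid.

    Blocks that are mostly inside the ROI get a negative offset (finer quantization,
    more bits); background blocks get a positive offset (coarser quantization).
    """
    h, w = roi_mask.shape
    rows, cols = h // block, w // block
    offsets = np.full((rows, cols), bg_delta, dtype=np.int8)
    for r in range(rows):
        for c in range(cols):
            patch = roi_mask[r * block:(r + 1) * block, c * block:(c + 1) * block]
            if patch.mean() > 0.5:          # block is mostly inside the ROI
                offsets[r, c] = roi_delta
    return offsets

if __name__ == "__main__":
    mask = np.zeros((480, 640), dtype=bool)
    mask[120:360, 200:440] = True           # hypothetical face/ROI region
    qp_map = roi_qp_offsets(mask)
    print(qp_map.shape, int(qp_map.min()), int(qp_map.max()))
```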
The inevitable “cost, cost, cost”
Cost versus experience is a "non-zero-sum game": how do we strike the balance in encoding and decoding?
Just as "clarity" and "bandwidth" sit at the two ends of the scale that narrowband HD has to balance, in today's climate of "cutting costs and improving efficiency" the "non-zero-sum game" between experience and cost is a topic that cannot be avoided. Cost (computational complexity) and experience (quality) are in a trade-off relationship, yet to some extent each can also be improved on its own. For example, algorithm optimization can push the encoder's RD curve in a more cost-effective direction while complexity stays unchanged; well-designed, cost-effective adaptive fast algorithms can convert quality gains into cost savings; and low-level optimization, fully integrated with the computing platform, can tap the potential of heterogeneous encoding to further reduce computing cost while quality stays unchanged.
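One common way to quantify how far an RD curve has moved is the Bjøntegaard Delta rate (BD-rate). The article does not name this metric, so the sketch below is only an illustration of the standard calculation: fit log-bitrate as a cubic function of quality for each encoder, integrate over the overlapping quality range, and report the average bitrate difference at equal quality. The RD points in the example are hypothetical.

```python
# Minimal sketch of a Bjontegaard Delta rate (BD-rate) calculation, a standard way to
# compare two encoders' RD curves at equal quality. Shown only to illustrate how
# "moving the RD curve in a more cost-effective direction" is usually quantified.
import numpy as np

def bd_rate(rates_ref, quality_ref, rates_test, quality_test) -> float:
    """Average bitrate difference (%) of the test encoder vs the reference at equal quality.

    Inputs are matched lists of bitrates (kbps) and quality scores (e.g. PSNR or VMAF).
    A negative result means the test encoder needs fewer bits for the same quality.
    """
    lr_ref, lr_test = np.log(rates_ref), np.log(rates_test)
    # Fit log-rate as a cubic polynomial of quality, as in the standard BD metric.
    p_ref = np.polyfit(quality_ref, lr_ref, 3)
    p_test = np.polyfit(quality_test, lr_test, 3)
    lo = max(min(quality_ref), min(quality_test))
    hi = min(max(quality_ref), max(quality_test))
    # Integrate both fits over the overlapping quality range and average the gap.
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0

if __name__ == "__main__":
    # Hypothetical RD points: (bitrate kbps, VMAF) for a reference and a tuned encoder.
    print(bd_rate([1000, 2000, 4000, 8000], [80, 88, 93, 96],
                  [850, 1700, 3400, 6800], [80, 88, 93, 96]))
```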
Of course, Alibaba Cloud Video Cloud has done more than this on the road to "making high-compression algorithms and AI truly universal". As with video encoding, deep learning has far surpassed traditional methods in video processing and is still evolving rapidly; its heavy consumption of computing resources, however, remains the main obstacle to widespread practical use. Alibaba Cloud Video Cloud has developed its own encoder cores, s264 and s265, implementing more than 100 algorithms to support live streaming, video-on-demand, and RTC scenarios; compared with open-source encoders, they lead by more than 20% in compression efficiency across all scenarios. At the same time, AI-assisted encoding decisions improve content adaptivity in bitrate allocation and mode decisions and push the exploitation of visual redundancy further; at the same subjective quality, bitrate can be reduced by 50%.
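To illustrate the flavor of content-adaptive bitrate decisions, here is a toy sketch: estimate simple spatial and temporal complexity proxies from raw frames and map them to a target bitrate. The features, thresholds, and linear mapping are placeholders standing in for a learned model; this is not the AI-assisted decision logic described in the article.

```python
# Toy sketch of content-adaptive bitrate selection: busier content gets more bits.
# The feature definitions and the linear mapping are placeholder assumptions.
import numpy as np

def complexity_features(frames: np.ndarray) -> tuple[float, float]:
    """frames: (N, H, W) grayscale. Returns (spatial, temporal) complexity proxies."""
    gy, gx = np.gradient(frames.astype(np.float32), axis=(1, 2))
    spatial = float(np.sqrt(gx ** 2 + gy ** 2).mean())                   # mean gradient magnitude
    temporal = float(np.abs(np.diff(frames.astype(np.float32), axis=0)).mean())  # frame-difference energy
    return spatial, temporal

def target_bitrate_kbps(spatial: float, temporal: float, base: float = 1500.0) -> float:
    """Hand-tuned placeholder for a learned regressor mapping complexity to bitrate."""
    return base * (1.0 + 0.02 * spatial + 0.05 * temporal)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    clip = rng.integers(0, 256, size=(8, 360, 640)).astype(np.uint8)
    s, t = complexity_features(clip)
    print(round(target_bitrate_kbps(s, t)), "kbps")
```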