Armv9 Technology Lecture | Accelerating video decoding and image processing using Armv9 CPU and SVE2
By Poulomi Dasgupta, Senior Manager, Consumer Computing Market, Device Business Unit, Arm
Each new generation of Arm CPUs delivers higher performance and introduces architectural improvements to meet the needs of evolving computing workloads. This article uses three use cases to show the impact of Armv9 CPU architectural features in real-world scenarios: HDR video decoding (10% faster), image processing (20% faster), and LibYUV (26% faster) in major mobile applications.
The good news is that some of the Arm SVE2 optimizations discussed in this article are now available to developers, and are expected to improve the user experience of popular media applications and further enhance the way people communicate, work and play.
Challenges facing application developers and OEMs
First, let's look at the challenges mobile application developers face. More than 2 million Android applications [1] currently compete for users' attention. To remain competitive, these applications must bring their innovations quickly to a wide variety of mobile devices. If they rely on fixed-function hardware, they face challenges in time to market and portability.
Metrics tied to a good user experience, including app startup time, UI smoothness, tokens per second, and FPS stability, need to meet user expectations. At the same time, OEMs need to balance performance improvements against broader user needs, such as longer battery life, reduced data usage, and lower device cost. A shortcoming in any of these areas may lead to user dissatisfaction and negate the value of upgrading a mobile device.
Developing software on Armv9 CPUs addresses the challenges faced by OEMs and developers.
Practical use cases for SVE2 in Armv9 CPUs
Let's look at three case studies showing that software optimizations can accelerate real workloads. First, here is a subset of the new SVE2 vector instructions in Armv9 CPUs that can accelerate key workloads on mobile devices:
16-bit dot products and 8-bit matrix multiplication to accelerate HDR video playback and video conferencing.
Histogram instructions for image processing.
Gather loads and scatter stores for de-interleaving camera sensor data.
Complex-number instructions to accelerate fast Fourier transforms in video codecs.
Using these vector instructions allows optimized software to use fewer CPU cycles, which brings two major benefits. First, fewer CPU cycles lead to lower energy consumption and increased battery life; second, improved application performance.
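To make the first item in the list concrete, here is a minimal scalar sketch of what a 16-bit dot product instruction computes, assuming the SVE form that sums four adjacent 16-bit products into each 64-bit accumulator lane. The function name `sdot_16_to_64` and the sample data are hypothetical, and Python is used only as a reference model; real code would use compiler intrinsics or assembly.

```python
def sdot_16_to_64(acc, a, b):
    """Scalar model of a signed 16-bit dot product into 64-bit lanes.

    Output lane i is acc[i] plus the sum of the four products
    a[4*i + k] * b[4*i + k] for k = 0..3.
    Wrap-around of the 64-bit accumulator is not modelled.
    """
    if len(a) != len(b) or len(a) != 4 * len(acc):
        raise ValueError("expected four 16-bit inputs per accumulator lane")
    for x in list(a) + list(b):
        if not -32768 <= x <= 32767:
            raise ValueError("inputs must fit in signed 16 bits")
    out = []
    for i, lane in enumerate(acc):
        out.append(lane + sum(a[4 * i + k] * b[4 * i + k] for k in range(4)))
    return out


# Hypothetical data: 10-bit pixel samples and signed filter taps.
pixels = [512, 1023, 0, 256, 100, 200, 300, 400]
taps = [-1, 5, 5, -1, 1, 1, 1, 1]
result = sdot_16_to_64([0, 0], pixels, taps)
print(result)  # [4347, 1000]
```

A single vector instruction performs this work for every lane at once, which is where the reduction in CPU cycles comes from.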
Case 1
SVE2 increases video decoding speed by 10%
Watching multimedia content is one of the most common workloads on mobile devices and a large source of traffic on mobile networks. Therefore, manufacturers are constantly pursuing more efficient codecs, hoping to save network bandwidth while supporting excellent image quality.
HDR technology allows for more realistic details, even in very dark or very bright scenes, due to its higher color accuracy. It uses 10 bits instead of 8 bits to represent each color channel. AV1 and VP9, as well as other modern codecs, support HDR video.
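The arithmetic behind the move from 8 to 10 bits can be sketched as follows. The sketch assumes 4:2:0 chroma subsampling and that each 10-bit sample is stored unpacked in a 16-bit container, which is a common choice in software decoders but is an assumption here; the helper names are hypothetical.

```python
def levels(bit_depth):
    """Number of distinct code values one color channel can take."""
    return 1 << bit_depth


def container_bytes(bit_depth):
    """Bytes needed to hold one sample in a whole-byte container."""
    return (bit_depth + 7) // 8


def frame_bytes_420(width, height, bit_depth):
    """Size of one 4:2:0 frame: a full-resolution luma plane plus two
    quarter-resolution chroma planes, i.e. 1.5 samples per pixel."""
    samples = width * height * 3 // 2
    return samples * container_bytes(bit_depth)


sdr_levels = levels(8)                       # 256 values per channel
hdr_levels = levels(10)                      # 1024 values per channel
sdr_frame = frame_bytes_420(1920, 1080, 8)   # 3,110,400 bytes
hdr_frame = frame_bytes_420(1920, 1080, 10)  # 6,220,800 bytes
```

Under these assumptions a 10-bit frame occupies twice the memory of an 8-bit frame and every sample is a 16-bit value, which is why 16-bit vector arithmetic matters for HDR decoding.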
AV1 is a newer format that provides better compression, while VP9 has wider compatibility across browsers and devices. Some popular apps use both AV1 and VP9 formats to play videos.
SVE2 optimization increases HDR video decoding speed by about 10%, VP9 decoding speed by 8%, and AV1 decoding speed by 10%. This reduces CPU cycles by about 10%, with a corresponding reduction in power consumption, so users get longer battery life when playing on-demand video on mobile devices. Whether users watch short clips, short films, or long videos, playback becomes smoother.
Optimized code for libdav1d (AV1 decoder) and libvpx (VP9 decoder) has been uploaded and is now available to developers.
Case 2
SVE2 makes LibYUV 26% faster
Most of us use LibYUV every day without knowing it.
LibYUV is an open source library for color space conversion between RGB and YUV, scaling of camera sensor data, and camera filters and rotations. It processes the data from the camera sensor before it is consumed by the video encoder. In many cases, the data from the video decoder is also processed by LibYUV before being sent to the display.
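As an illustration of the kind of per-pixel arithmetic involved in color space conversion, the sketch below converts one RGB pixel to YUV using the full-range BT.601 (JFIF) coefficients in floating point. This is not LibYUV's implementation or API: the library supports several color matrices and uses fixed-point, vectorized code, and the function names here are hypothetical.

```python
def clamp8(x):
    """Round to the nearest integer and clamp to the 8-bit range 0..255."""
    return max(0, min(255, int(round(x))))


def rgb_to_yuv_bt601_full(r, g, b):
    """Convert one 8-bit RGB pixel to full-range BT.601 YUV (JFIF style)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.168736 * r - 0.331264 * g + 0.5 * b + 128
    v = 0.5 * r - 0.418688 * g - 0.081312 * b + 128
    return clamp8(y), clamp8(u), clamp8(v)


print(rgb_to_yuv_bt601_full(255, 255, 255))  # (255, 128, 128)
print(rgb_to_yuv_bt601_full(255, 0, 0))      # (76, 85, 255)
```

Each output channel is a small weighted sum over the input channels, repeated for every pixel of every frame, which makes this workload a natural fit for vector multiply-accumulate instructions.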
SVE2 optimizations make LibYUV 26% faster (geometric mean across multiple kernels on an Armv9 CPU). About 100 kernels in LibYUV have been optimized with SVE2, and work is in progress on the remaining kernels. Some of this work has been uploaded and can be viewed at https://chromium.googlesource.com/libyuv/libyuv/ .
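For readers unfamiliar with the metric, a geometric mean summarizes the speedup ratios of individual routines without letting one outlier dominate. The sketch below shows the calculation; the routine names and ratios are made up for illustration and are not measured LibYUV results.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive speedup ratios (1.0 means no change)."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty list of positive numbers")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical per-routine speedups, NOT measured data.
speedups = {"routine_a": 1.10, "routine_b": 1.25, "routine_c": 1.50}
overall = geometric_mean(list(speedups.values()))
arithmetic = sum(speedups.values()) / len(speedups)
print(f"geometric mean: {overall:.3f}, arithmetic mean: {arithmetic:.3f}")
```

The geometric mean is never larger than the arithmetic mean of the same ratios, so it gives the more conservative summary.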
LibYUV is distributed as part of Chromium, the open source browser project that underpins Chrome and custom browsers from major phone makers, including Xiaomi Browser and Samsung Browser. It is also integrated into AOSP and Android Jetpack. Because LibYUV is so widely used on mobile devices, these optimizations are expected to have a significant impact on the overall mobile experience: better video conferencing, smoother switching between portrait and landscape modes, better video playback, and longer battery life.
Case 3
SVE2 makes computational photography 20% faster
Halide is a domain-specific language for image processing. It is used in applications such as Adobe Photoshop, and some OEMs also use it in their camera pipelines.
SVE2 instructions such as gather loads and scatter stores, together with TBL (programmable table lookup, used to vectorize small lookup tables), accelerate some key computer vision pipelines in Halide. Compute-intensive algorithms such as iToFDepth (for depth sensing), bilateral grid (for edge-aware tone mapping), and local Laplacian (for filters) have seen performance improvements of nearly 20% with SVE2.
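The three operations named above can be modelled in scalar form as follows. This is a reference sketch of the semantics only, with hypothetical function names and data; it assumes the SVE behavior in which a table lookup with an out-of-range index yields zero.

```python
def gather_load(memory, indices):
    """Gather: read one element per lane from data-dependent positions."""
    return [memory[i] for i in indices]


def scatter_store(memory, indices, values):
    """Scatter: write one element per lane to data-dependent positions."""
    for i, v in zip(indices, values):
        memory[i] = v


def table_lookup(table, indices):
    """TBL model: out-of-range indices produce 0 instead of faulting."""
    return [table[i] if 0 <= i < len(table) else 0 for i in indices]


memory = [10, 20, 30, 40, 50, 60]
gathered = gather_load(memory, [5, 0, 3])    # [60, 10, 40]

out = [0, 0, 0]
scatter_store(out, [2, 0], [7, 9])           # out becomes [9, 0, 7]

tone_curve = [0, 16, 64, 255]                # hypothetical small lookup table
mapped = table_lookup(tone_curve, [3, 1, 4, 0])  # [255, 16, 0, 0]
```

Algorithms such as a bilateral grid index memory with values computed from the pixels themselves, so without these instructions the lookups would have to be done one element at a time.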
Using SVE2 to optimize software allows some photographic effects to be applied in real time, opening up new possibilities for entry-level mobile devices, allowing users to get higher quality photos without the need for dedicated hardware.
Arm has optimized the Halide backend for SVE2 code generation. Some patches are already available, and others are in development.
Figure: Halide-SVE2 and Halide-Neon CPU cycle comparison
Figure: Depth effect example image
Figure: Edge-aware tone mapping example image
How to make better use of SVE2
SVE2 introduces several new instructions that are well suited for accelerating key real-world workloads and applications. We will discuss in more detail how to achieve some performance gains with the Armv9 CPU in subsequent technical articles, so stay tuned for more on the Arm Community WeChat Official Account!
Arm is committed to striking a good balance for the ecosystem between developer support and performance improvements. Some open source libraries and kernels optimized for SVE2 are already available, and more resources will follow.
The latest advances in Armv9 CPUs will enable developers to innovate faster and bring better user experiences to end consumers on all types of mobile devices. What are you waiting for? Start your development project with SVE2 and innovate now!
[1] Please click here to read the original article.