
Helium Technical Lecture | Overcoming the Effects of Amdahl's Law

Latest update time: 2024-08-15


This article is reproduced from: Arm Community



As artificial intelligence (AI) finds its way into a wide variety of applications, IoT devices, the largest device category on the market, are also being endowed with intelligence. Arm® Helium™ technology brings key machine learning and digital signal processing performance improvements to devices based on Arm Cortex®-M processors.

Last week, we talked about memory access. In this final Helium Technical Lecture, we will learn how to overcome the effects of Amdahl's Law. If you want to learn how to use Helium effectively, don't miss the video at the end of the article, in which an Arm technical expert demonstrates in detail how Helium can bring more intelligence to endpoint devices.

How Arm Helium technology was born

Overcoming the Effects of Amdahl's Law

By Thomas Grocutt, M-Series Chief Architect and Fellow, Architecture and Technology Group, Arm

In previous posts, we covered how the Armv8.1-M architecture with Arm Helium technology (also known as MVE) handles vector instructions. The problem is that as soon as code is vectorized, the effects of Amdahl's Law quickly start to kick in. If you're not familiar with Amdahl's Law, it states that the parts of an algorithm that cannot be parallelized quickly become the performance bottleneck. For example, if 50% of a workload can be parallelized, then even if that portion could be parallelized infinitely, the best you can get is a factor-of-2 speedup. I don't know about you, but if I could parallelize something infinitely and it only got 2x faster, that would really annoy me! When designing Helium, we therefore had to consider not just the vector instructions but everything that goes with them in order to maximize performance.
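
For reference, here is the standard formulation of Amdahl's Law (not shown in the original article): if p is the fraction of the workload that can be parallelized and s is the speedup applied to that fraction, the overall speedup is

    S(s) = \frac{1}{(1 - p) + p/s}, \qquad \lim_{s \to \infty} S(s) = \frac{1}{1 - p}

With p = 0.5, even infinite parallelism yields only 1/(1 - 0.5) = 2, the factor-of-2 ceiling mentioned above.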

Loop handling is a common source of serial code, and the overhead it causes can be considerable, especially for small loops. The following memory-copy code is a good example:
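
(The original article showed this code as an image. Below is a minimal sketch of such a byte-by-byte copy loop; the register roles, r0 = destination, r1 = source, r2 = byte count, are illustrative assumptions.)

    copy_loop:
        LDRB    r3, [r1], #1      ; load one byte, post-incrementing the source pointer
        STRB    r3, [r0], #1      ; store it, post-incrementing the destination pointer
        SUBS    r2, r2, #1        ; decrement the remaining byte count and set the flags
        BNE     copy_loop         ; branch back to the top while bytes remain

Note that the count decrement (SUBS) and the conditional branch (BNE) make up half of the four-instruction loop body.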

Decrementing the loop iteration count and the conditional branch back to the top of the loop account for 50% of the loop's instructions. Many small Cortex-M processors do not have branch predictors (small Cortex-M processors are extremely area-efficient, to the point that many branch predictors are several times larger than an entire Cortex-M processor), so branch penalties push the runtime overhead even higher than 50%. Loop unrolling can help by amortizing the overhead over many iterations, but it increases code size and makes vectorizing the code more complicated. Given that small loops are common in DSP code, it was important for the Helium research project to address these issues. Many dedicated DSP processors support zero-overhead loops. One way to implement this is a REPEAT instruction, which tells the processor to repeat the following instruction N times:
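
(REPEAT is not an Arm instruction; the sketch below, with invented mnemonics, only illustrates the general DSP idiom.)

    REPEAT  #N                    ; hypothetical: execute the next instruction N times
    MAC     a0, x0, y0            ; hypothetical single-instruction loop body, repeated
                                  ; N times with no per-iteration branch overhead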

The processor must record several pieces of data:

The address where the loop starts

The number of instructions remaining before branching back to the beginning of the loop

The number of loop iterations remaining

Keeping track of all this data while handling an interrupt can be problematic, so some DSPs simply delay the interrupt until the loop is complete. This can take quite a long time if there are a large number of iterations to perform, and is completely inconsistent with the requirements for fast and deterministic interrupt latency that a Cortex-M processor should achieve. This approach is also not suitable for handling precise faults, such as memory management fault exceptions (MemManage) caused by permission violations. Another approach is to add additional registers to handle loop state. But these new registers must be saved and restored on exception entry and return, which in turn increases interrupt latency. To address this problem, Armv8.1-M uses a pair of loop instructions:
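
(The original article showed these instructions as an image. Here is a minimal sketch of the earlier byte copy rewritten with the low-overhead loop instructions; labels and register roles are again illustrative assumptions.)

        WLS     lr, r2, copy_done     ; LR = iteration count from r2; branch past the loop if zero
    copy_loop:
        LDRB    r3, [r1], #1          ; copy one byte per iteration
        STRB    r3, [r0], #1
        LE      lr, copy_loop         ; decrement LR and branch back while iterations remain
    copy_done: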

The loop first executes the While Loop Start (WLS) instruction, which copies the loop iteration count into the LR and branches past the end of the loop if the count is zero. There is also a Do Loop Start (DLS) instruction for setting up loops that always execute at least one iteration. The Loop End (LE) instruction checks the LR to see whether another iteration is required and, if so, branches back to the beginning. Interestingly, the processor can cache the information provided by the LE instruction (i.e., where the loop starts and ends), so on subsequent iterations it can branch back to the top of the loop before it has even fetched the LE instruction. The sequence of instructions executed by the processor is therefore as follows:
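
(The original figure is not reproduced; schematically, the executed sequence for the byte-copy loop above looks something like this.)

    WLS   lr, r2, copy_done       ; loop setup
    LDRB, STRB, LE                ; iteration 1: LE is fetched, executed, and its loop info cached
    LDRB, STRB                    ; iteration 2: the branch back happens before LE is even fetched
    LDRB, STRB                    ; iteration 3, and so on...
    LDRB, STRB                    ; final iteration falls through to copy_done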

Adding a loop instruction at the end of the loop has a nice side effect: if the cached loop information is flushed, the LE instruction is simply executed again, which refills the cache. And because the loop start and end addresses never need to be saved, the existing fast interrupt handling is preserved.

Apart from the first iteration, and a little re-setup when recovering from an interrupt, all of the time is now actually spent copying memory rather than handling the loop. Furthermore, since the processor knows the sequence of instructions in advance, it can always fill the pipeline with the correct instructions, which eliminates pipeline flushes and the resulting branch penalties. We can therefore vectorize this loop without worrying about the effects of Amdahl's Law, and we have (temporarily) overcome these difficulties.

When code is vectorized, a loop often consists of alternating types of instructions, such as vector loads (VLDR) and vector multiply-accumulates (VMLA). Executing such a loop produces a long, uninterrupted chain of alternating VLDR/VMLA operations. This uninterrupted chain lets the processor get the most benefit from instruction overlapping, since it can even overlap the end of one loop iteration with the beginning of the next, further improving performance. For more information on instruction overlapping, see: "The origin of Arm Helium technology: why not just use Neon?"
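
(A sketch of what such a loop body might look like; the register choices are illustrative, and the loop is assumed to have been set up earlier with DLS or WLS.)

    mac_loop:
        VLDRH.16  q0, [r1], #16       ; load the next eight 16-bit elements
        VMLA.S16  q1, q0, r2          ; multiply them by the scalar in r2, accumulating into q1
        LE        lr, mac_loop        ; the next iteration's VLDRH can overlap the tail of this VMLA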

Vectorized code runs into problems when the amount of data to process is not a multiple of the vector length. The typical solution is to process full vectors first and then handle the remaining elements with a serial, non-vectorized tail-cleanup loop. Before you know it, Amdahl's Law is back, and it's as annoying as ever! A Helium vector can hold 16 8-bit values, so if we vectorize a 31-byte memcpy this way, just under half of the copies are performed serially by the tail loop instead of in parallel by the vector instructions. To address this, we added tail-predicated variants of the loop instructions (e.g., WLSTP, LETP). For these tail-predicated loops, the LR holds the number of vector elements to process instead of the number of loop iterations to execute. The loop start instruction (WLSTP) has a size field (the ".8" in the memcpy example below) that specifies the width of the elements being processed.
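
(A minimal sketch of the tail-predicated memcpy described here; the register roles, r0 = destination, r1 = source, r2 = byte count, are illustrative assumptions.)

        WLSTP.8 lr, r2, copy_done     ; LR = number of 8-bit elements; branch past the loop if zero
    copy_loop:
        VLDRB.8 q0, [r1], #16         ; load up to 16 bytes; lanes past the end are predicated off
        VSTRB.8 q0, [r0], #16         ; store the same lanes
        LETP    lr, copy_loop         ; subtract 16 from LR and branch back while elements remain
    copy_done: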

If you've seen other optimized memcpy routines, you might be surprised at how simple this example is, but for Helium this is all you need for the best fully vectorized solution. Here's how it works: the processor uses the size field and the number of remaining elements to calculate the number of remaining iterations. If the last iteration has fewer elements to process than fit in a vector, the corresponding lanes at the end of the vector are disabled. So in the 31-byte copy example above, Helium copies 16 bytes in parallel on the first iteration and the remaining 15 bytes in parallel on the second. This not only avoids the impact of Amdahl's Law and achieves the expected performance, but also eliminates the serial tail code entirely, reducing code size and simplifying development.

With high performance goals and tight area and interrupt-latency constraints, designing Helium was like solving a multi-dimensional puzzle in which half of the pieces' shapes were already fixed. Seemingly unrelated parts of the architecture can interact with each other to produce unexpected effects, or to help solve interesting puzzles.

The entire Helium research team and I look forward to seeing Helium technology bring strong support to new applications. Currently, three Cortex-M products support Helium technology: Cortex-M52, Cortex-M55 and Cortex-M85. I can't wait to see Helium continue to empower the AI innovations of our ecosystem partners.

* This article is an original Arm article. For reprinting, please contact "Arm Community" for authorization and indicate the source.




END




