We software engineers love to find the perfect solution to any problem we encounter. Oddly enough, in this particular area there is no perfect solution. Clever tricks may save some power, but the field is dominated by a few simpler factors. There are several large elephants in the room, and we should hunt the elephants we can see before spending energy stalking the smaller animals.
When considering the power consumption of a system, it is important to understand what we are actually measuring. When we say "saving power", we can mean several things. Do we mean "power" or "energy"? In fact, we need both. Most handheld portable devices have two different budgets: a power budget, which caps instantaneous power consumption to avoid overheating or thermal stress, and an energy budget, which manages the total amount of energy used over the long term. Software needs to meet both the short-term power budget and the long-term energy budget.
Obviously, we can reduce the power consumption of any device to nearly zero simply by not asking it to do anything meaningful! However, performing useful work requires energy, so we face a constant trade-off between meaningful operation and energy conservation. To deliver the required functions we must consume energy; what we can do is make sure those functions are carried out in an energy-efficient manner.
Power-time product

A better metric, commonly used in the academic literature on the subject, is the "power-time product." Although there are no standard units or methods for it, this metric combines energy and performance. Using more energy or delivering less performance both increase the power-time product, so the goal is to find the lowest acceptable value: the lowest energy consumption consistent with executing the required tasks in the allowed time.
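As a hypothetical illustration (the numbers are invented for the example): suppose implementation A finishes a task in 2 s drawing an average of 1 W, consuming 2 J, for a power-time product of 2 J x 2 s = 4 J·s. Implementation B finishes the same task in 1 s drawing 1.5 W, consuming 1.5 J, for a product of 1.5 J x 1 s = 1.5 J·s. B wins on both energy and the combined metric, even though its instantaneous power draw is higher.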
Where does the energy go?

All computing machines perform two basic functions. Both are necessary; without them, no meaningful task can be accomplished.
The first thing that comes to mind is computation, or data processing. Computation is usually performed on values held in machine registers. To compute as efficiently as possible, we want to execute the fewest instructions in the shortest time. Crucially, efficient computation gives us a choice between two options: either finish the computation early and go to sleep, or slow the clock down and still complete the computation in the required time.
What is often overlooked is data communication, that is, data movement. In most architectures (ARM's load/store architecture is no exception), data movement is unavoidable: unless values are moved from one location to another, and often back again, no information can be processed at all. For example, a value in memory must be loaded into a register for processing, and the result then written back to memory.
But which one uses more energy? Where is the biggest payoff?
Figure 1 shows the commonly quoted breakdown: about 60% of a program's memory accesses are instruction fetches, and the remaining 40% are data accesses.
Figure 1: Memory access distribution
Figure 2: Memory access energy consumption
Figure 2 shows some research conducted by ARM. If the energy cost of executing an instruction is 1, then a Tightly Coupled Memory (TCM) access costs about 1/25, a cache access about 1/6, and an external RAM access about 7.
In other words, for the energy cost of one external RAM access, we could execute 7 instructions, make about 40 cache accesses, or make about 170 TCM accesses.
Computation is cheap but communication is expensive
So data movement is more expensive than data processing, and the first elephant is data efficiency.
We can propose two rules for managing the energy consumption of memory accesses.
Proximity - From an energy perspective, the closer the memory is to the core, the lower the relative energy cost of accessing the memory.
Fewer accesses - Reducing the number of memory accesses is more important than reducing the number of instructions.
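To illustrate the second rule, here is a minimal C sketch (the function and variable names are hypothetical). Because the compiler cannot prove that the pointer does not alias the global, the first version reads and writes memory on every iteration; accumulating in a local lets the running sum live in a register.

    /* One load and one store of 'total' per element: the compiler must
       assume 'data' might alias the global, so it cannot cache it. */
    extern int total;

    void sum_slow(const int *data, unsigned n)
    {
        for (unsigned i = 0; i != n; i++)
            total += data[i];
    }

    /* Accumulating in a local keeps the running sum in a register:
       one data access per element plus a single store at the end. */
    void sum_fast(const int *data, unsigned n)
    {
        int acc = 0;
        for (unsigned i = 0; i != n; i++)
            acc += data[i];
        total = acc;
    }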
Take advantage of on-chip memory
It is clear from our energy figures that TCM is by far the most efficient type of memory a system can have. Not all systems have what ARM calls TCM (memory connected to the core via a dedicated, optimized interface), but most have at least some kind of fast on-chip memory. For the sake of discussion, we will refer to this generically as scratchpad memory (SPM). Given that a single SPM access consumes roughly 1/170 the energy of an external RAM access, making full use of this memory should be the first priority.
Figure 3: Energy advantage of SPM
The graph in Figure 3 shows that for a simple "multi-class" benchmark, even a 128-byte SPM region can cut power consumption roughly in half, and a 1-kbyte SPM can cut it by up to 70%. The approach taken in this study (Marwedel, 2004) is to dynamically relocate code and data segments from external RAM into the SPM. Even allowing for the overhead of moving things on demand, power consumption falls and performance also improves, by about 60%.
Clearly there is a point of diminishing returns: beyond 1 kbyte of SPM the performance gains slow, and total system power actually rises slightly. At that point we are paying for SPM capacity that this particular application cannot use, simply because the application is not large enough.
Notice also that, with the allocation algorithm used, this particular application cannot exploit SPM regions smaller than 64 bytes, because none of its fragments are small enough to fit. The study also demonstrated a more sophisticated algorithm that achieves energy savings of over 80% in the best case.
Always do cache-friendly things
Analyzing the benefits of a cache can be more complicated than analyzing SPM. On the one hand, caches are essentially self-managing. On the other hand, they operate not on individual memory locations but on fixed-size "lines", so accessing a single cacheable location may load an entire line, producing a burst of memory accesses. If the extra data is never used, the energy spent fetching it is wasted.
Another disadvantage is the cost (in terms of silicon area and power) of the additional logic required for the cache.
Figure 4: Energy advantage of cache
Figure 4 is taken from a Princeton paper (Brooks, 2000) and shows three sets of data for a simple application benchmark. For each cache size, the bars show performance in IPC (instructions per cycle), power consumption, and the energy-delay product (EDP), the paper's equivalent of our power-time product. In general, performance rises as the cache grows, but so does system power, since the extra cache logic consumes power of its own. The EDP lets us balance performance against cache size: in this example there is an optimum at a 64-kbyte cache, where the EDP is smallest.
Minimize data memory access
An awkward characteristic of the ARM architecture is its handling of constants. In particular, it is not possible to put an arbitrary 32-bit constant into a register with a single instruction. Since all memory accesses take their addresses from registers, programs frequently need to get these and other constants into registers, and that takes work. The standard solution is to embed constants as literal data in the code section and load them at run time using PC-relative loads.
It is therefore worth minimizing the impact of constants. Make sure constants are known at compile time and, where possible, can be encoded in a single ARM instruction. Minimize the need to load a base pointer for access to global variables: if related globals sit adjacent in memory at run time, a single pointer can reach several of them. The simplest way to achieve this is to put the global variables in a structure.
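A minimal sketch of the structure trick (the names are hypothetical): with the globals gathered into one struct, the compiler loads a single base address and reaches every field with immediate-offset addressing.

    /* One literal-pool load of the base address serves all fields. */
    struct app_state {
        int mode;
        int count;
        int last_error;
    };

    static struct app_state g_state;

    void record_error(int err)
    {
        g_state.last_error = err;   /* base + offset */
        g_state.count++;            /* same base, different offset */
    }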
Although stack access on ARM is relatively efficient (the load-multiple and store-multiple instructions handle it well), programmers can still reduce it in many ways: keep fewer variables live, avoid taking the address of local variables, take advantage of tail-call optimization where possible, keep the number of function parameters to four or fewer so they pass in registers, allow the compiler to inline functions aggressively, and so on.
Recursion is a more complicated case. Compilers can often tail-call-optimize recursive functions very well, and keeping all working data on the stack can actually give better locality than the alternatives. Perhaps the advice is better expressed as: "don't use a recursive algorithm unless other approaches make data locality worse, or you are sure the compiler can tail-call-optimize the recursion." Exception handlers should be written to increase the chances of tail-chaining, thereby avoiding unnecessary saves and restores of stack context.
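As a small illustration (a textbook function, not one from the study): when the recursive call is the very last action, an optimizing compiler can usually replace it with a branch, so the stack does not grow at all.

    /* The recursive call is in tail position: nothing happens after it
       returns, so the compiler can turn the recursion into a loop. */
    unsigned gcd(unsigned a, unsigned b)
    {
        if (b == 0)
            return a;
        return gcd(b, a % b);
    }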
Now let's turn our attention to the second elephant in the room, instruction execution.
Minimize the number of instructions
Reducing the number of instructions executed is essentially the same exercise as performance optimization: the fewer instructions executed, the lower the energy consumption. A few obvious pointers are worth adding.
First, configure the tools correctly. Unless the compiler and linker know exactly what the target platform is, even some basic optimizations are impossible.
Be smart when writing code, and avoid unnecessary operations. On the ARM architecture, 32-bit data types are efficient; 8-bit and 16-bit data types take less storage but are less efficient to process. In the v6 and v7 architectures, the pack and unpack instructions and SIMD operations can help somewhat, but be aware that these instructions are not generally accessible from C.
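A hedged sketch of the data-width point (the exact instruction sequences depend on the compiler and architecture version): a narrow loop counter can force the compiler to insert truncation or extension operations that a natural 32-bit counter avoids.

    /* An 8-bit counter may need an extra truncation (e.g. UXTB or AND)
       after each increment, since ARM registers are 32 bits wide. */
    void fill_narrow(char *dst, unsigned char n)
    {
        for (unsigned char i = 0; i != n; i++)
            dst[i] = 0;
    }

    /* A 32-bit counter matches the register width: no fix-up needed. */
    void fill_wide(char *dst, unsigned n)
    {
        for (unsigned i = 0; i != n; i++)
            dst[i] = 0;
    }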
Be careful when writing loops
Loops benefit from a few simple rules: use unsigned integer counters, count down, and terminate at zero. This makes loops shorter and faster and uses fewer registers. Also write loops with vectorization in mind: some simple discipline in control structures and data declarations makes it much easier for the compiler to unroll and vectorize even the simplest loops.
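A minimal sketch of the count-down rule (a hypothetical function): decrementing toward zero lets the subtraction itself set the condition flags, so no separate comparison against a limit is needed.

    /* Runs n times; the decrement of the unsigned counter doubles as
       the termination test, saving a compare and a register. */
    void scale(int *data, unsigned n, int k)
    {
        while (n-- != 0)
            *data++ *= k;
    }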
Figure 5: Loop unrolling
Figure 5 shows some data for one particular loop optimization, loop unrolling (Brooks, 2000). As expected, execution time and instruction count fall as the unroll factor rises: loop overhead shrinks and fewer address calculations are needed. The power results are more interesting, though less dramatic. Branch-predictor accuracy falls as the loop is unrolled further, because there are fewer branches on which the predictor can train, and the final misprediction at loop exit becomes a larger fraction of the total. On the other hand, because the sequential stream of instruction fetches is interrupted less often, the fetch stage runs more efficiently. The combined result is a net reduction in energy per instruction.
So even though the improvement in execution time essentially stops beyond an unroll factor of 4, the all-important power-time product keeps falling because power continues to decrease, so an energy-conscious compiler or developer will be more inclined to unroll loops than one that considers execution time alone.
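For illustration, here is a manual 4x unroll of a trivial accumulation loop, assuming the element count is a multiple of four (in practice the compiler usually does this better than we can): only one branch executes per four elements, and the independent accumulators keep the pipeline busy.

    /* 4x-unrolled sum; assumes n is a multiple of 4. */
    int sum4(const int *data, unsigned n)
    {
        int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (unsigned i = 0; i != n; i += 4) {
            s0 += data[i];
            s1 += data[i + 1];
            s2 += data[i + 2];
            s3 += data[i + 3];
        }
        return s0 + s1 + s2 + s3;   /* one branch per four elements */
    }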
Precision that meets the requirements
You must also consider how much precision the output really requires. Fixed-point implementations of a calculation are usually more efficient than floating-point ones, even when floating-point hardware is available. If you are rendering an image to be viewed on a screen, you probably do not need full standards compliance; you just need to render an acceptable picture.
A study of the progressive optimization of a standard MPEG-4 decoding function (Shin, 2002) showed that switching from soft floating point to fixed-point arithmetic cut power consumption by 72%. The loss of precision meant the results were no longer standards-compliant, but they were still perfectly good enough for display on the systems studied.
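As a hedged sketch of the fixed-point idea, using the common Q16.16 convention (not necessarily the format the study used): a multiply becomes one integer multiplication and a shift, instead of a floating-point operation or a soft-float library call.

    #include <stdint.h>

    /* Q16.16 fixed point: the low 16 bits hold the fraction,
       so 1.0 is represented as 1 << 16 = 65536. */
    typedef int32_t q16_t;

    /* Multiply via a 64-bit intermediate, then shift the binary
       point back into place. */
    static inline q16_t q16_mul(q16_t a, q16_t b)
    {
        return (q16_t)(((int64_t)a * b) >> 16);
    }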
About Thumb
The Thumb instruction set was specifically designed to improve code density, and it also improves performance on systems with narrow memory. But while code density does improve, the instruction count rises, because individual Thumb instructions do less than their ARM equivalents. It therefore seems plausible that recompiling for Thumb could increase energy consumption, and that is indeed what has been observed.
In the case studied, code size shrank by 4%, while the number of instructions executed rose by 38% and energy consumption rose by 28%. To find the third elephant, we need to look beyond the processor and its memory to the larger system. The systems our hardware-design colleagues put together these days offer plenty of energy-saving options.
Energy savings in the wider system
Obviously, unused components should be placed in low-power states whenever possible. This is an integral part of any sensibly designed system, and "components" here includes the memory and cache systems and even the processors themselves. In multi-core systems, we must consider suspending one or more cores when the processing load is light.
First, a small but worthwhile point: when dealing with peripherals, always prefer interrupts to polling. A polling loop simply burns power to no purpose. Almost all architectures include some kind of wait-for-interrupt instruction that puts the system into a standby state; on ARM systems, the core is typically clock-gated in this state, leaving only static leakage.
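A minimal sketch of an interrupt-driven wait, using GCC-style inline assembly for the ARM WFI instruction (on Cortex-M, CMSIS offers the __WFI() intrinsic instead; the flag name here is hypothetical):

    /* Set from the peripheral's interrupt handler. */
    volatile int data_ready;

    void wait_for_data(void)
    {
        /* NB: a production version must also guard against the race in
           which the interrupt lands between the test and the wfi. */
        while (!data_ready)
            __asm__ volatile ("wfi");   /* core sleeps, clock-gated,
                                           until an interrupt arrives */
        data_ready = 0;
    }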
Unnecessary sleep-wake cycles can generally be avoided by designing the interrupt architecture to support tail-chaining, so that back-to-back interrupts are serviced without restoring and re-saving the stacked context. The ARM Cortex-M3 does this automatically.
For individual system components, choosing a shutdown scheme is easy. Units whose use is predictable can be stopped by the application or operating system when they are not needed. Units whose use is unpredictable can be powered up on demand, or powered down automatically after some period of idleness. The right timescale for powering a subsystem down can be derived from its power draw when powered but idle and the energy cost of a sleep-wake cycle. Ultimately it depends on the application, but a simple cycle count of the power-cycling code is the obvious starting point.
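A hedged sketch of the break-even arithmetic (all names and figures are hypothetical): powering down only pays if the expected idle time exceeds the energy cost of one sleep-wake cycle divided by the idle power draw.

    #include <stdbool.h>

    /* Power down only when the predicted idle time exceeds the
       break-even point cycle_energy / idle_power. For example,
       10 mJ per sleep-wake cycle at 50 mW idle gives 0.2 s. */
    bool worth_powering_down(double idle_power_w,
                             double cycle_energy_j,
                             double expected_idle_s)
    {
        return expected_idle_s > cycle_energy_j / idle_power_w;
    }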
Measurements show that the Neon engine adds about 10% to the power consumption of a core such as the Cortex-A9, while improving performance on traditional signal-processing algorithms by 40% to 150%. The benefit of enabling Neon for a task and shutting it down when no longer needed is clear. Often, finishing sooner means that not only the Neon engine but the entire processor subsystem can be shut down earlier, saving even more power.
A harder choice often arises: is it better to enable an extra computation unit so the job finishes early (and things stay off longer), or to slow the processor down so the job completes just in time? Figure 6 shows the energy consumption per iteration of a simple benchmark (Domeika, 2009). The benchmark was run four times at each of two clock speeds, using different combinations of instruction cache and floating-point coprocessor. Two key points are clear. First, although both the instruction cache and the floating-point unit reduce energy consumption, the floating-point unit delivers the larger saving.
Figure 6: System component power utilization
Second, for any given configuration, the energy consumed per iteration is essentially the same at either clock speed, so it is more efficient to enable everything and run at full speed, finishing the task sooner, than to slow the clock down.
Multiprocessing
It is well known that multiple cores can deliver higher performance with better energy efficiency than driving a single core harder. When using a multi-core system, we must consider the option of suspending one or more cores when they are not needed. ARM's research shows that taking a core offline in an SMP Linux system costs around 50,000 cycles (most of them spent cleaning the level-1 cache), which means a core should only be taken down if it will stay down for a reasonable time, on the order of a few hundred milliseconds or more; otherwise the energy cost of the operation outweighs its benefit.
An area of active research at ARM is a system consisting of two cores: a high-performance core for full-capability operation, and a smaller companion core that runs at lower power with lower performance. When more processing power is needed, the system runs on the larger core. When the heavy task completes, the system can hand everything over to the small core and shut the large one down, switching back again when demand rises. If the two cores are connected coherently, the energy cost of switching can be minimized.
About the operating system
Unfortunately, application programmers running under an operating system do not have all of this flexibility. Cache configuration, whether SPM is used, the power cycling of components and so on are largely determined by the operating system architecture and the device drivers. Even so, application programmers still have plenty to consider.
Research has shown that poorly designed inter-process communication (IPC) can significantly increase a system's energy consumption. A simple technique called "vectorizing" IPC, which batches many small messages and sends them as a single large one, can often reduce context-switching overhead. Reducing the number of processes also cuts the need for IPC in the first place: processes that must communicate frequently can be combined into a single process.
Recent research on embedded Linux (Tan, 2003) has shown that analyzing and properly designing inter-process communication can reduce energy consumption by up to 60%.
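On a POSIX system, one literal form of "vectorized" IPC is writev(), which pushes several small buffers through a single system call, amortizing one kernel crossing (and a possible context switch) over many messages. A minimal sketch:

    #include <sys/types.h>
    #include <sys/uio.h>

    /* Deliver three small messages with one system call instead of
       three separate write() calls. */
    ssize_t send_batch(int fd,
                       void *m1, size_t n1,
                       void *m2, size_t n2,
                       void *m3, size_t n3)
    {
        struct iovec iov[3] = {
            { .iov_base = m1, .iov_len = n1 },
            { .iov_base = m2, .iov_len = n2 },
            { .iov_base = m3, .iov_len = n3 },
        };
        return writev(fd, iov, 3);
    }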
Conclusion
Although I have emphasized that much of this is still the province of academic research, there is plenty that can be done today. The conclusions are relatively simple: reduce external memory accesses, reduce the number of instructions executed, and turn units off when they are not in use.
In reaching this conclusion, I am reminded of a conversation with customers in a training class in mid-2009. They were implementing signal-processing algorithms on a Cortex-A8 platform with Neon, and wanted to know the exact energy consumption of individual instructions. I explained that in practice much of this information is unknown, and in any case hard to exploit with current tools. Looking back, that information is irrelevant to the business of hunting elephants: the elephant those customers were stalking is tiny compared with the others in the room. Better advice is to estimate, whether by analysis or by tracing, the number and type of data accesses each candidate implementation makes. Combined with instruction counts, this allows a much more informed choice. The power cost of individual instructions is almost irrelevant next to poorly placed memory accesses.
We, the software developers, need to continue to put pressure on academia and tool vendors to build these capabilities into the next generation of tools. It won’t be easy but it will happen.
Finally, I must remind you that all of this depends on your system, platform, application, OS, battery, and user. As the saying goes, "your mileage may vary."