Code compression technologies for several mainstream embedded architectures
For embedded software, the smaller the code size, the better. Compressing the code to fit into a memory subsystem that is constrained by cost or space has become an important issue in embedded system development. ARM, MIPS, IBM, and ARC all offer techniques to reduce memory usage. This article will compare and analyze the implementation of code compression techniques in these architectures.
Today, it is not uncommon for the memory subsystem to cost more than the microprocessor. Therefore, it makes sense to choose a processor that can save memory costs. Writing compact code is only half the story; the processor's instruction set also has a great impact on memory consumption. On a processor with poor code density, no amount of tightening your C source will help much. If memory consumption matters to you, it pays to choose the right processor first and then tune the code carefully.
Not all processors have, or need, code compression. It is mainly 32-bit RISC (Reduced Instruction Set Computer) processors that need it, because RISC processors have poor code density. RISC architectures were originally designed for general-purpose computers and workstations, and memory was considered cheap when they were designed. But even if memory is cheap, wouldn't the product be cheaper still if it used less of it? For cell phones and other cost-conscious embedded systems, a $5 difference in RAM or ROM can make a huge difference in profits at volume. Typically the memory size is fixed while product features vary, so more compact object code means room for more features: extra speed-dialing capacity, better voice recognition, or perhaps a sharper screen.
ARM, MIPS, and PowerPC were the first 32-bit embedded processors to find ways to reduce memory consumption and increase code density. Earlier processors, such as Motorola's 68k series and Intel's x86 series, never needed code compression; in fact, their native code density was better than that of RISC processors even in compressed mode.
Easy-to-use Thumb Technology
Let's start with ARM's code compression scheme, Thumb, because it is widely used, well supported, representative of processor code compression schemes, and quite simple and effective.
Thumb is actually a separate instruction set added to ARM's standard RISC instruction set. In your code, you can switch between the two instruction sets with a mode switch instruction. The Thumb instruction set architecture (ISA) consists of about 36 16-bit instructions. These instructions alone cannot accomplish much, but the Thumb instruction set includes basic addition, subtraction, rotation, and jump instructions. By replacing the ARM standard 32-bit instructions with these shorter instructions, the size of some code can be reduced by about 20% to 30%. But there are some issues that need to be noted:
First, Thumb code and standard ARM code cannot be mixed. You must explicitly switch between the two modes, as if Thumb is a completely different instruction set (which it actually is). This forces programmers to separate all 16-bit code from 32-bit code and isolate them into separate modules.
Second, because Thumb is a simplified and streamlined instruction set architecture, you cannot do everything you want in Thumb mode. Thumb mode cannot handle interrupts, long jumps, atomic memory operations, or coprocessor operations. Thumb's limited instructions mean that it is only useful for basic arithmetic and logic operations; anything else must be done using ARM's standard 32-bit instruction set.
Thumb's limitations extend beyond the instruction set. In Thumb mode the ARM processor exposes only eight registers (instead of 16), and Thumb instructions cannot be conditionally executed or fold in shift and rotate operations the way standard ARM instructions can. Passing parameters between standard ARM code and Thumb code is not difficult, however: just place them on the stack or in the processor's first eight registers.
Switching back and forth between standard and Thumb modes also takes time and adds code: dozens of preamble and postamble instructions are needed to set up pointers and flush the CPU pipeline. If the stretch of code running in Thumb mode is shorter than a few dozen instructions, the switch is not worth the cost.
Finally, Thumb has a small performance impact. Typically, using Thumb instructions to compress code will cause the code to run about 15% slower, mainly due to switching between 16-bit mode and 32-bit mode. Thumb instructions are also less flexible than standard 32-bit instructions, so more instructions are often required to accomplish the same task as 32-bit code. On the positive side, since the instructions are half the size of the 32-bit instruction set, Thumb makes more efficient use of cache.
If the task can be accomplished within these constraints, Thumb can save a lot of money. Every ARM processor now supports Thumb, and most ARM compilers and assemblers support the Thumb instruction set whether you use it or not, so adopting it should be fairly painless. The sketch below shows one way the ARM/Thumb split might look at the source level.
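As a minimal sketch, assuming a GCC-style ARM toolchain that supports the per-function target("arm")/target("thumb") attributes (the function names here are hypothetical), the 16-bit and 32-bit worlds might be separated like this:

```c
/* Hypothetical example; assumes a GCC-style ARM toolchain.
 * Compile with something like: arm-none-eabi-gcc -mthumb-interwork -O2 demo.c */

/* A tight arithmetic loop: a good candidate for the 16-bit Thumb set. */
__attribute__((target("thumb")))
int checksum(const unsigned char *buf, int len)
{
    int sum = 0;
    while (len-- > 0)
        sum += *buf++;   /* add, compare, branch all have Thumb encodings */
    return sum;
}

/* Code that needs the full instruction set stays in 32-bit ARM state. */
__attribute__((target("arm")))
int checksum_from_arm(const unsigned char *buf, int len)
{
    /* The toolchain inserts the ARM<->Thumb interworking veneer (BX),
     * and arguments pass in the low registers exactly as usual.        */
    return checksum(buf, len);
}
```

The veneer the toolchain inserts at that call boundary is exactly the preamble/postamble overhead described above, which is why the split only pays off for sizable Thumb regions.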
MIPS Processors
Once you understand Thumb, MIPS16e holds no surprises. Some MIPS processors add a second, 16-bit instruction set very much like ARM's: MIPS16e provides simplified 16-bit versions of the standard MIPS arithmetic, logic, and jump instructions. It is used the same way as Thumb, too. You must switch back and forth between standard mode and MIPS16e mode, which incurs the same time and code overhead, so unless you can stay in "compressed" mode for a good stretch there is no point in switching. Its compression efficiency is also similar to ARM's: 20% to 30% for most programs.
Neither MIPS16e nor Thumb really compresses code; each simply provides alternative opcodes for some instructions, and the "compression ratio" depends on how much of the program can use short opcodes instead of long ones. In other words, it depends on what the code does. System-level code such as operating systems and interrupt handlers cannot use the 16-bit instructions at all, so it does not shrink. Ordinary algorithms compress well as long as they avoid large operands. Finally, remember that only code is compressed, never data: if your application contains large static data structures, the total memory savings will be small, and the roughly 15% performance loss may not be worth it. On the other hand, MIPS16e and Thumb are both free (assuming your processor already includes them), so the cost of trying them is very low. The back-of-envelope model below shows where the commonly quoted savings figures come from.
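This is a purely illustrative model, not a measured result: if a fraction f of a program's instructions have 16-bit equivalents, the code size relative to pure 32-bit code is (1 - f) + f/2.

```c
#include <stdio.h>

/* Illustrative back-of-envelope model: a fraction f of instructions
 * shrink from 4 bytes to 2, the rest stay at 4 bytes.               */
static double compressed_ratio(double f)
{
    return (1.0 - f) + f / 2.0;
}

int main(void)
{
    /* f between 0.4 and 0.6 reproduces the commonly quoted
     * 20% to 30% savings for Thumb and MIPS16e.            */
    for (double f = 0.4; f <= 0.61; f += 0.1)
        printf("f = %.1f -> size = %.0f%% of original\n",
               f, 100.0 * compressed_ratio(f));
    return 0;
}
```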
PowerPC's CodePack Technology
It is worth saying up front that IBM's CodePack is the most complex of these code compression technologies. Unlike Thumb and MIPS16e, CodePack actually compresses the code, rather like running WinZip over your PowerPC software. CodePack analyzes and compresses the entire program, and the processor then decompresses the compressed image on the fly as it executes. Despite the complexity, CodePack delivers the same 20% to 30% space savings as the other techniques.
CodePack is an attractive technology. To use it, you compile your embedded PowerPC code as usual with standard tools; CodePack even works with existing binaries, with or without source code. Before you burn the code into ROM or load it onto disk, you run the CodePack compression tool over it. The tool analyzes the program's instruction distribution and generates a pair of keys specific to that program. When you run the compressed code, a CodePack-enabled processor uses the key pair to decompress it on the fly, just as if it were executing ordinary uncompressed code. Decompression adds a small delay to the processor's pipeline, but the effect is masked by instruction fetch latency and other delays; for most applications the performance impact of CodePack is negligible.
However, CodePack has some other effects. Because each compressed program has its own compression key, CodePack is essentially both a compression system and an encryption system. Without the key, neither you nor anyone else can run the program. If the key is missing or not available, the compressed program is just a bunch of useless gibberish, which also means that compressed PowerPC programs are not binary compatible. You can't easily exchange a compressed program with another system unless you also include its decompression key. This makes field distribution of embedded system software a little more complicated.
In addition, CodePack generates two keys per program because the upper 16 bits and lower 16 bits of each instruction are compressed separately. IBM's engineers discovered that the upper halfword (where the opcode lives) and the lower halfword (usually operands, offsets, or masks) of a PowerPC instruction have quite different statistical distributions, so compressing each half with its own table beats any single algorithm applied to the whole instruction. The sketch below illustrates the split-dictionary idea.
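The following is only a conceptual sketch: real CodePack uses variable-length codes, while this toy version replaces each halfword with a fixed 8-bit index into one of two per-program tables standing in for the "keys".

```c
#include <stdint.h>

/* Toy model of CodePack's split-halfword idea (not IBM's actual format):
 * two per-program dictionaries, one per halfword class.                  */
typedef struct {
    uint16_t upper[256];   /* dictionary for opcode halfwords              */
    uint16_t lower[256];   /* dictionary for operand/offset/mask halfwords */
} codepack_key;

/* Rebuild one 32-bit PowerPC instruction from two compressed indices. */
static uint32_t decompress_insn(const codepack_key *key,
                                uint8_t hi, uint8_t lo)
{
    return ((uint32_t)key->upper[hi] << 16) | key->lower[lo];
}
```

Because each half is looked up in its own table, each table can be tuned to the statistics of its halfword class, which is the point of the two-key design.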
ARCompact
ARC International has taken another approach to code compression. Because the ARCtangent processor has a user-definable instruction set, ARC (and its customers) can modify the instruction set at will. With ARCompact, ARC decided to add a set of 16-bit instructions to improve its processors' code density.
ARCompact differs from Thumb and MIPS16e in that 16-bit code can be mixed with 32-bit code at will. Since there are no mode switches, there is no overhead for a few 16-bit instructions scattered around the code. The default configuration of the ARC compiler generates 16-bit operations whenever possible (you can turn this off to force the compiler to generate 32-bit code or to maintain compatibility with older processors).
ARC can mix the two instruction lengths without the corresponding overhead because its instruction architecture is newer than ARM's and MIPS's. Those RISC instruction sets (PowerPC's included) have no bits in the instruction word to indicate the instruction's length. Newer pseudo-RISC architectures such as ARC and Tensilica, like the older x86 and 68k, do have such bits. Whether by accident or by foresight, variable-length instruction architectures enjoy the advantage of more compact code; the sketch below shows how a length bit lets a decoder walk a mixed instruction stream.
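As a minimal sketch, under an invented encoding (this is not ARC's actual format), suppose the top bit of the first halfword tells the decoder whether the instruction occupies one halfword or two:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding for illustration only: the top bit of the
 * first halfword selects a 16-bit or 32-bit instruction.           */
static size_t insn_size_bytes(uint16_t first_halfword)
{
    return (first_halfword & 0x8000u) ? 4 : 2;
}

/* Walk a freely mixed 16/32-bit stream with no mode switches:
 * the decoder learns each instruction's length from the word itself. */
static size_t count_insns(const uint16_t *stream, size_t n_halfwords)
{
    size_t i = 0, count = 0;
    while (i < n_halfwords) {
        i += insn_size_bytes(stream[i]) / 2;   /* advance in halfwords */
        ++count;
    }
    return count;
}
```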
Thumb-2, an Improvement of Thumb
ARM has since revamped its code compression system and released Thumb-2. Thumb-2 is not an upgrade of Thumb but a new instruction set intended to replace both Thumb and the original ARM instruction set entirely. Like ARCompact or Motorola's 68k, it can run mixed 16-bit and 32-bit code without switching modes. In general, Thumb-2 compresses slightly less efficiently than Thumb, but its performance loss is also smaller.
To pull this off, ARM needed to find a hole in its opcode map, and it found the opening it needed in the BL (branch-with-link) instruction. Some bit patterns of BL were left unused in the original instruction set, and those previously undefined encodings became the entry point for the new 32-bit instructions. The resulting encoding is admittedly not elegant, but it works; the sketch below shows the decode rule it produces.
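A small illustrative helper, based on the Thumb-2 encoding as published in ARM's architecture manuals, shows how a decoder tells the two lengths apart without any mode bit:

```c
#include <stdbool.h>
#include <stdint.h>

/* In Thumb-2, a halfword whose top five bits are 0b11101, 0b11110, or
 * 0b11111 (the space carved out of the old BL encoding) is the first
 * half of a 32-bit instruction; any other halfword is a complete
 * 16-bit instruction.                                                 */
static bool is_32bit_thumb2(uint16_t first_halfword)
{
    return (first_halfword >> 11) >= 0x1Du;   /* 0b11101 and above */
}
```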
The biggest advantage of Thumb-2 is that it is a complete instruction set, and the program does not need to switch back to the "standard" 32-bit ARM mode. The original Thumb mode restrictions are gone. The program can now handle interrupts, set up the MMU, and manage caches, just like a real microprocessor.
Thumb-2 still exacts some performance cost. Even without mode-switching overhead, a given task takes more Thumb-2 instructions than standard ARM code, and those extra instructions (and extra cycles) slow things down by about 15 to 20 percent.
Future ARM processors will eventually run only Thumb-2 code. Since it effectively replaces both the ARM and Thumb instruction sets with a single, more compact set, why wouldn't it ultimately displace them? The open question is ARM software compatibility. Until now, all ARM processors (except Intel's XScale) have been binary compatible. New processors that support Thumb-2 will run existing ARM and Thumb code, but the reverse is not true, so as Thumb-2 spreads it will create a separate-but-equal set of software libraries.