Memory alignment processing for ARM processors

Jacktang

There are three main alignment issues: variable alignment, structure alignment, and data alignment. The first two concern how the compiler places variables and lays out structures. The last one depends on the CPU architecture (CISC/RISC). In most cases alignment is a matter for the compiler and the CPU, and the programmer does not need to care. In some cases, however, programmers must take alignment into account, otherwise they will run into trouble.

0 Conventions and preliminary knowledge

0.1 Address boundaries

If we regard bytes as small houses, memory is a row of small houses arranged in sequence. Each house has a sequential house number, for example: 0, 1, 2, ..., 0xffffffff. We call this house number an address. This article calls an address that is an integer multiple of 2 a 2n boundary, an address that is an integer multiple of 4 a 4n boundary, and so on. Obviously, every address is a 1n boundary, every 4n boundary is also a 2n boundary, and every 8n boundary is also a 4n boundary. "Alignment" refers to the kind of address boundary on which a variable is placed, for example a 1n, 2n, or 4n boundary.

0.2 Classification of variables

Classification depends on the point of view: there are as many classifications as there are points of view. Recently I have often been forced to listen to "One World, One Dream". In my opinion, every life has its own unique dream, let alone every country. If a bear had religious beliefs, the God in its mind would be a bear of elegant appearance.

0.2.1 Basic types and composite types

From the point of view of composition, variables can be divided into basic type variables and composite type variables. Basic types are the simple types supported by the language, such as char, short, int, and double. Composite types, such as structures, are built from basic types. This article calls variables of basic types basic variables, and variables of composite types composite variables or structure variables.
Basic variables are currently 1, 2, 4, or 8 bytes long. Larger basic variables may appear in the future. Embedded environments often do not support floating point, so the common lengths there are 1, 2, and 4 bytes.

0.2.2 Address of variables

From the point of view of addresses, variables can be divided into variables with fixed addresses and variables without fixed addresses. A "fixed address" is one that is already determined before the program runs. The address of a variable without a fixed address is determined at run time. Global variables and static variables have fixed addresses. Local variables and dynamically allocated variables do not. This article calls variables with fixed addresses addressed variables.

1 Variable alignment

1.1 Variables without fixed addresses

Local variables are allocated from the stack, and the compiler usually ensures that the address of each local variable is on a 4n boundary.

Dynamically allocated variables are allocated from the heap. The implementation of the heap depends on the standard library and the operating system. In some simple embedded systems we have to implement dynamic memory allocation ourselves. In that case we must make sure that the address of every allocated memory block is on a 4n boundary, to avoid the data alignment problem discussed later.

1.2 Variables with fixed addresses

The address of an addressed variable is determined at link time. The compiler usually has an option that sets variable alignment, and we usually leave this option at its default value. By default, the compiler aligns addressed variables in the default way: a basic variable of length 1 is placed on a 1n boundary, a basic variable of length 2 on a 2n boundary, a basic variable of length 4 on a 4n boundary, and so on.

Every structure variable is ultimately composed of basic variables.
Structure variables are aligned according to the longest basic variable in the structure. If the longest basic variable in a structure has length 1, the compiler may place the structure on a 1n boundary. If the longest basic variable has length 4, the compiler should place the structure on a 4n boundary. How the member variables inside the structure are aligned is the subject of Section 2.

1.3 What trouble can variable alignment cause?

I was once bitten by a variable alignment problem, which serves as the example for this section. To understand it, readers must know one feature of the ARM CPU: a basic variable of length m must be placed on an mn boundary, where m = 2 or 4, otherwise data access errors occur when reading and writing. This is the data alignment introduced in Section 3.

Here is what happened. I defined several buffers (large arrays) and then allocated memory dynamically out of them. My mistake was to define these arrays as byte arrays. My allocation algorithm allocates by block, and the size of each block is an integer multiple of 4. Can readers guess the cause of the error?

Because I defined the buffers as byte arrays, the compiler was free to place them on 1n boundaries. If the starting address of a buffer is odd, the starting address of every memory block allocated from it is odd as well. If such a block is used for variables that need 2-byte or 4-byte alignment, data access errors occur when reading and writing. If the compiler happens to place the buffers on 4n boundaries, the problem is not exposed. So one build may work fine while the next build fails inexplicably.

Debugging a program is similar to solving a case: the farther the murderer is from the crime scene, the harder he is to find. Until the root cause is traced back through the symptoms, a certain amount of suffering is inevitable. The solution to the problem is simple.
Define the buffer as an array of unsigned int (hereinafter uint32), and the compiler will naturally place it on a 4n boundary. In embedded systems we often need to define stacks for tasks, and these stacks are usually arrays of uint32. This is the reason they are defined as uint32 arrays rather than byte arrays.

2 Structure alignment

2.1 Basic length

For convenience of description, we define the concept of basic length. The basic length of a basic variable is its length. The basic length of a structure variable is the maximum length of the basic variables among its members. As mentioned earlier, by default structure variables are aligned according to their basic length.

2.2 Alignment

By default, the members of a structure can be considered to be aligned in the default way, that is, a basic variable of length m is placed on an mn boundary, where m = 1, 2, 4, or 8. Because members need to be aligned, padding bytes may appear between the members of a structure, and the size of the structure may be greater than the sum of the sizes of its members. For example:

    typedef struct St1Tag {
        char  ch1;
        int   num1;
        short sh1;
        short sh2;
        char  ch2;
    } St1;

The basic length of this structure is 4, so variables of this structure should be placed on a 4n boundary. The basic length of member num1 is 4, so it should also be placed on a 4n boundary. Member ch1 starts at the 4n boundary and occupies only 1 byte, so there are 3 padding bytes between ch1 and num1.

When aligning, the compiler also rounds the structure size up to an integer multiple of the basic length. In this way, the elements of an array of this structure can be laid out contiguously with every element aligned. Therefore sizeof(St1) is 16, and there are 3 padding bytes after the last member ch2 of St1.
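The layout described in section 2.2 can be checked on a given compiler with offsetof. The following is only a minimal sketch: it assumes a 4-byte int, a 2-byte short, and default alignment settings, and the helper names (st1_padding_after_ch1, st1_tail_padding, print_st1_layout) are illustrative and do not come from the original post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef struct St1Tag {
    char  ch1;
    int   num1;
    short sh1;
    short sh2;
    char  ch2;
} St1;

/* Padding bytes the compiler inserted between ch1 and num1. */
size_t st1_padding_after_ch1(void)
{
    return offsetof(St1, num1) - (offsetof(St1, ch1) + sizeof(char));
}

/* Padding bytes after the last member ch2. */
size_t st1_tail_padding(void)
{
    return sizeof(St1) - (offsetof(St1, ch2) + sizeof(char));
}

/* Print the offset of every member and the total size. */
void print_st1_layout(void)
{
    /* The size is rounded up to a multiple of the basic length
       (assumed here to be sizeof(int)). */
    assert(sizeof(St1) % sizeof(int) == 0);

    printf("ch1  at offset %zu\n", offsetof(St1, ch1));
    printf("num1 at offset %zu\n", offsetof(St1, num1));
    printf("sh1  at offset %zu\n", offsetof(St1, sh1));
    printf("sh2  at offset %zu\n", offsetof(St1, sh2));
    printf("ch2  at offset %zu\n", offsetof(St1, ch2));
    printf("sizeof(St1) = %zu\n", sizeof(St1));
}
```

Calling print_st1_layout() from main on the target compiler shows where the padding actually lands. Under the assumptions above, both helper functions return 3.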
2.3 Compacting

Mainstream compilers support compacting (packing) structures, that is, arranging the member variables contiguously with no padding bytes between them. The size of the structure then equals the sum of the sizes of its members. Variables of a compact structure can be placed on 1n boundaries, that is, at arbitrary addresses.

In gcc, a compact structure can be defined like this:

    typedef struct St2Tag {
        St1  st1;
        char ch2;
    } __attribute__ ((packed)) St2;

In armcc it looks like this:

    typedef __packed struct St2Tag {
        St1  st1;
        char ch2;
    } St2;

VC has the most cumbersome syntax:

    #pragma pack(1)
    typedef struct St2Tag {
        St1  st1;
        char ch2;
    } St2;
    #pragma pack()

To support gcc, armcc, and VC at the same time, the code can be written like this:

    #ifdef __GNUC__
    #define GNUC_PACKED __attribute__((packed))
    #else
    #define GNUC_PACKED
    #endif

    #ifdef __arm
    #define ARM_PACKED __packed
    #else
    #define ARM_PACKED
    #endif

    #ifdef WIN32
    #pragma pack(1)
    #endif

    typedef ARM_PACKED struct St2Tag {
        St1  st1;
        char ch2;
    } GNUC_PACKED St2;

    #ifdef WIN32
    #pragma pack()
    #endif

Here __GNUC__ is a predefined macro of gcc and __arm__ is a predefined macro of the ARM compiler (both __arm and __arm__ are accepted). They can be used to identify the current compiler.

2.4 Global settings

In VC, some programmers are used to setting the struct member alignment for the whole project, which corresponds to the command line option "/Zpi", where i = 1, 2, 4, 8, or 16. If this value is set to 1, all structures in the project are compact. Compact layout increases code size and reduces the efficiency of structure access, so we should use compact structures only when necessary.

"/Zp1" gives a compact layout, so how do options such as "/Zp2" and "/Zp4" lay out a structure?
Suppose the length set by option "/Zpi" is i and the basic length of a structure member is m. The member is then aligned according to the smaller of m and i. For example, with "/Zp2", members whose basic length is not greater than 2 are aligned according to their basic length, and members whose basic length is greater than 2 are aligned to 2. In practice we should not use such a strange option as "/Zp2" unless there is a specific reason to do so.

2.5 The use of compact structures

The structure alignment options used most often in practice are default alignment and compact layout. When transferring data between two programs or two platforms, we usually make the data structures compact. This not only reduces the amount of data transmitted but also avoids the trouble caused by alignment. Suppose Party A and Party B communicate across platforms, Party A uses a strange alignment option such as "/Zp2", and Party B's compiler does not support this alignment: Party B will then learn what it means to be in tears.

When we need to access structure data byte by byte, we usually want the structure to be compact, so that we do not have to work out which bytes are padding. When we save data to non-volatile storage, we usually use compact structures as well, to save space and to make the data easy for other programs to read.

2.6 Details

Finally, a small detail. Both gcc and VC allow a packed structure to contain non-packed structures. For example, St2 in the previous example can contain the non-packed St1. With the ARM compiler, however, any structure contained in a packed structure must itself be packed. If the packed St2 contains the non-packed St1, compilation fails with an error:

    error: #1031: Definition of "struct St1Tag" in packed "struct St2Tag" must be __packed

3 Data alignment

3.1 CISC and RISC

Based on the characteristics of their instruction sets, CPUs can be divided into two categories: CISC and RISC.
CISC and RISC are the abbreviations of Complex Instruction Set Computer and Reduced Instruction Set Computer, respectively. The work of a CPU can be seen as a repeated cycle of the following steps:

    step 1: fetch the instruction
    step 2: fetch the data
    step 3: execute the instruction
    step 4: output the result

A CISC CPU supports many addressing modes, so the time needed to fetch data is not predictable. The biggest feature of a RISC CPU is that it simplifies the addressing modes of instructions. Except for Load/Store instructions, all instructions use register addressing, that is, they read and write data in registers. This design makes the data fetch time relatively stable and simplifies the design of the instruction pipeline. In general, the RISC architecture reduces the complexity of the CPU and allows more powerful CPUs to be built at the same process level, but it places higher demands on compiler design.

3.2 Aligned Data Access

The Load/Store instructions of a RISC CPU require the data to be aligned: data of length 4 should be placed on a 4n boundary, and data of length 2 on a 2n boundary. Take the Load instructions of the ARM CPU as an example:

    LDR   R5, [R4]
    LDRSH R7, [R6]
    LDRB  R9, [R8]

LDR, LDRSH, and LDRB read a word, a signed half-word, and a byte from memory, respectively, and place it in the specified register. For example, "LDR R5,[R4]" reads a word (length 4) from the memory location pointed to by R4 and places it in R5. LDR requires the data address to be on a 4n boundary, and LDRSH requires it to be on a 2n boundary, otherwise an error occurs.

What kind of error occurs depends on the specific CPU. On ARM7TDMI, an unaligned access makes the program jump to the data access error vector (Data Abort), that is, address 0x00000010. On ARM920T, the LDR instruction may return incorrect data. CISC CPUs generally support unaligned data access.
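The "incorrect data" returned by an unaligned LDR on older cores is usually not random. As far as the ARM architecture documentation for ARMv4/ARMv5 describes it, when alignment fault checking is disabled a word load from an unaligned address reads the aligned word that contains the address and rotates it right by 8 bits for each byte of misalignment, while a word store simply ignores the low two address bits. The sketch below is a simplified software model of that behavior, not code from the original post. It assumes little-endian memory, treats mem[0] as sitting on a 4n boundary, and uses hypothetical function names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Little-endian word stored at the 4n-aligned offset `base`. */
static uint32_t word_at(const uint8_t *mem, size_t base)
{
    return (uint32_t)mem[base]
         | ((uint32_t)mem[base + 1] << 8)
         | ((uint32_t)mem[base + 2] << 16)
         | ((uint32_t)mem[base + 3] << 24);
}

/* Model of a legacy unaligned LDR: load the aligned word,
   then rotate right by 8 * (addr & 3) bits. */
uint32_t model_legacy_ldr(const uint8_t *mem, size_t addr)
{
    assert(mem != NULL);
    size_t   base = addr & ~(size_t)3;
    unsigned rot  = (unsigned)(addr & 3) * 8;
    uint32_t w    = word_at(mem, base);

    return rot ? (w >> rot) | (w << (32 - rot)) : w;
}

/* Model of a legacy unaligned STR: the low two address bits are
   ignored, so the word lands on the aligned address. */
void model_legacy_str(uint8_t *mem, size_t addr, uint32_t value)
{
    assert(mem != NULL);
    size_t base = addr & ~(size_t)3;

    for (size_t i = 0; i < 4; i++) {
        mem[base + i] = (uint8_t)(value >> (8 * i));
    }
}
```

For a buffer holding the bytes 1, 2, 3, 4, 5 from offset 0, model_legacy_ldr(mem, 1) returns 0x01040302 rather than 0x05040302, and model_legacy_str(mem, 1, 0x11223344) overwrites bytes 0 to 3 and leaves byte 4 untouched. Real hardware behavior depends on the core and its configuration, so the model should be read as an aid to understanding, not as a specification.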
3.3 Example

Let's look at an example:

    // Example 1
    void test(void)
    {
        char a[] = {1,2,3,4,5};
        int *pi, i;

        printf("&a[1]=%p\n", &a[1]);
        pi = (int *)&a[1];      /* deliberately unaligned pointer */
        i = *pi;
        printf("0x%08x\n", i);
        *pi = 0x11223344;
        for(i = 0; i < sizeof(a)/sizeof(a[0]); i++)
        {
            printf("0x%02x ", a[i]);
        }
    }

The key statements are i = *pi; and *pi = 0x11223344;. We know that the 4 bytes at the address held in pi are 0x02, 0x03, 0x04, 0x05. On a little-endian CPU, the output we expect is 0x05040302 followed by 0x01 0x44 0x33 0x22 0x11. Let's see how this code behaves on different platforms.

3.3.1 PC/Windows

The output is:

    &a[1]=0x0012FF25
    0x05040302
    0x01 0x44 0x33 0x22 0x11

This is in line with our expectations and also shows that the PC's CPU supports unaligned data access.

3.3.2 PC/Linux

The output is:

    &a[1]=0xbfa0c36c
    0x05040302
    0x01 0x44 0x33 0x22 0x11

Note that gcc placed the local variable a on a 1n boundary (0xbfa0c36b), so &a[1] happens to be aligned. We want pi to be an odd address, so we modify the test code as follows:

    // Example 2
    void test1(void)
    {
        int a[] = {0x04030201, 0x08070605};
        int *pi, i;

        pi = (int *)&((char *)&a)[1];   /* one byte past an aligned address */
        printf("pi=%p ", pi);
        i = *pi;
        printf("0x%08x\n", i);
        *pi = 0x11223344;
        for(i = 0; i < sizeof(a)/sizeof(a[0]); i++)
        {
            printf("0x%08x ", a[i]);
        }
    }

The output is:

    pi=0xbfe87fe9 0x05040302
    0x22334401 0x08070611

This is in line with our expectations. Data alignment is a CPU issue and has nothing to do with the compiler or the operating system.

3.3.3 ARM920T/Linux

The output of Example 1 is:

    &a[1]=0xbec49e55
    0x01040302
    0x44 0x33 0x22 0x11 0x05

Taking little-endian byte order into account, the 4 bytes actually read by the CPU are 0x02, 0x03, 0x04, and 0x01. This is not the result we expected: the CPU has returned wrong data. Why?

ARM has two instruction sets: ARM and Thumb.

ARM instructions: each time an instruction is executed, the value of PC increases by 4 bytes (32 bits).
To access 4 bytes at a time, the starting address must be 4-byte aligned, that is, the lowest two bits of the address must be 0b00, so the address must be a multiple of 4.

Thumb instructions: each time an instruction is executed, the value of PC increases by 2 bytes (16 bits). To access 2 bytes at a time, the starting address must be 2-byte aligned, that is, the lowest bit of the address must be 0b0, so the address must be a multiple of 2.

Testing so far shows that memory writes go to the aligned address: for example, *pi = 0x11223344 above is actually written to the address ((uintptr_t)(pi)) & ~(4-1). No pattern has been found for read operations.

But is there a way to perform unaligned access? For this purpose the ARM compiler provides the __packed keyword, which means one-byte alignment:

    void test2(void)
    {
        char a[] = {1,2,3,4,5};
        __packed int *pi, i;

        printf("&a[1]=%p\n", &a[1]);
        pi = (int *)&a[1];
        i = *pi;
        printf("0x%08x\n", i);
        *pi = 0x11223344;
        for(i = 0; i < sizeof(a)/sizeof(a[0]); i++)
        {
            printf("0x%02x ", a[i]);
        }
    }

The output is:

    &a[1]=0xbec49e55
    0x01040302
    0x01 0x44 0x33 0x22 0x11

3.3.4 ARM7TDMI

When i = *pi; is executed, the program jumps directly to the Data Abort vector, that is, address 0x00000010.

3.4 Countermeasures

When reading a compact structure or a compact member of a structure, the compiler automatically generates code that reads byte by byte. We only need to be careful with forced pointer conversions: a pointer to narrow data should not be cast to a pointer to wide data. Wherever alignment problems may occur, read the data byte by byte.
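The byte-by-byte countermeasure can be sketched in portable C as follows. This is only an illustration under stated assumptions: the data is treated as a 32-bit little-endian value, and the function names (load_u32_le, store_u32_le, load_u32_native) are hypothetical and not part of the original post. The memcpy variant is an alternative that copies in the host's own byte order and leaves it to the compiler to pick safe instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit little-endian value from any address, one byte at a
   time, so no word access is ever issued on an unaligned address. */
uint32_t load_u32_le(const void *p)
{
    const uint8_t *b = (const uint8_t *)p;
    assert(b != NULL);

    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

/* Write a 32-bit value in little-endian order to any address. */
void store_u32_le(void *p, uint32_t v)
{
    uint8_t *b = (uint8_t *)p;
    assert(b != NULL);

    for (size_t i = 0; i < 4; i++) {
        b[i] = (uint8_t)(v >> (8 * i));
    }
}

/* Alternative: copy into an aligned local variable. The result uses
   the host byte order instead of a fixed little-endian order. */
uint32_t load_u32_native(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```

With char a[] = {1,2,3,4,5}, load_u32_le(&a[1]) yields 0x05040302 regardless of where the compiler places a, and store_u32_le(&a[1], 0x11223344) leaves a[0] unchanged. The cost is a few extra instructions per access, which is the same trade-off the compiler makes for packed structures.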