Memory alignment processing for ARM processors-EEWORLD


There are three main alignment issues: variable alignment, structure alignment, and data alignment. The first two are variable mapping and structure layout determined by the compiler. The last one is related to the CPU architecture (CISC/RISC). In most cases, alignment is a matter for the compiler and the CPU, and has nothing to do with the programmer. But in some cases, programmers must consider alignment issues, otherwise there will be some trouble.
0 Conventions and preliminary knowledge
0.1 Address boundaries
  If bytes are regarded as small houses, memory is a row of small houses arranged in sequence. Each house has a sequential house number, for example: 0, 1, 2, ..., 0xffffffff. We call this house number an address. This article refers to addresses that are integer multiples of 2 as 2n boundaries, addresses that are integer multiples of 4 as 4n boundaries, and so on. Obviously, every address is a 1n boundary, every 4n boundary is also a 2n boundary, and every 8n boundary is also a 4n boundary.
  "Alignment" means choosing which kind of address boundary a variable is placed on, for example: a 1n boundary, a 2n boundary, or a 4n boundary.
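
The boundary terminology can be expressed as a one-line check. This is only a minimal sketch in C; the helper name is_on_boundary is made up for illustration and does not come from any library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when addr lies on an "n boundary",
 * that is, when addr is an integer multiple of n (n must be nonzero). */
static bool is_on_boundary(uintptr_t addr, uintptr_t n)
{
    return (addr % n) == 0;
}
```

For example, is_on_boundary(0x1002, 2) is true and is_on_boundary(0x1002, 4) is false, while every address satisfies is_on_boundary(addr, 1).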

0.2 Classification of variables
  Classification comes from perspective. There are as many categories as there are angles. Recently, I have been forced to listen to "One World, One Dream". In fact, in my opinion, every life has a unique dream, not to mention the country. If a bear has religious beliefs, the God in its mind should be a bear with an elegant appearance.

0.2.1 Basic types and composite types
  From the perspective of composition, variables can be divided into variables of basic types and variables of composite types. Basic types are simple types supported by the language, such as char, short, int, double, etc. Composite types are composed of basic types, such as structures. This article will record variables of basic types as basic variables, and variables of composite types as composite variables or structure variables.
  The length of a basic variable is currently 1, 2, 4, or 8 bytes. There may be larger basic variables in the future. Many embedded environments do not use floating point, so the common lengths there are 1, 2, and 4 bytes.

0.2.2 Address of variables
  From the perspective of address, variables can be divided into variables with fixed addresses and variables without fixed addresses. The so-called "fixed address" means that there is a fixed address before the program runs. For variables with "unfixed addresses", their addresses are determined at runtime.
  Global variables and static variables have fixed addresses. Local variables and dynamically allocated variables do not have fixed addresses. This article will record variables with fixed addresses as addressed variables.

1 Variable alignment
1.1 Variables without fixed addresses
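  Local variables are allocated from the stack. The compiler usually keeps the stack frame on a 4n boundary (or a stricter one) and places each local variable according to its type, so an int local lands on a 4n boundary, while a char array may land on a 1n boundary, as the gcc result in Section 3.3.2 shows.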
  Dynamically allocated variables are allocated from the heap. The implementation of the heap is related to the standard library and the operating system. In some simple embedded systems, we need to implement dynamic memory allocation ourselves. At this time, we must ensure that the address of each allocated memory block is on the 4n boundary to avoid the data alignment problem discussed later.
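
As a rough illustration of such a home-made allocator, the sketch below rounds both the start of the pool and every request up to a multiple of 4, so each returned block sits on a 4n boundary. The names Pool, pool_init, and pool_alloc are hypothetical, and the allocator never frees memory; it only shows the rounding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bump allocator over a caller-supplied buffer. */
typedef struct {
    uint8_t *next;   /* next free byte, always on a 4n boundary */
    uint8_t *end;    /* one past the last usable byte */
} Pool;

static void pool_init(Pool *p, void *buf, size_t len)
{
    uint8_t *base = buf;
    /* Bytes to skip so that the first block starts on a 4n boundary. */
    size_t skip = (size_t)(-(uintptr_t)base & 3u);
    if (skip > len) {
        skip = len;
    }
    p->next = base + skip;
    p->end  = base + len;
}

static void *pool_alloc(Pool *p, size_t size)
{
    /* Round the request up to a multiple of 4 so the next block stays aligned. */
    size_t need = (size + 3u) & ~(size_t)3u;
    if (need < size || need > (size_t)(p->end - p->next)) {
        return NULL;   /* arithmetic overflow or pool exhausted */
    }
    void *block = p->next;
    p->next += need;
    return block;
}
```

Because the start address is rounded up in pool_init, the blocks stay aligned even if the caller passes a buffer that begins at an odd address.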

1.2 Variables with fixed addresses
  The address of the addressed variable is determined at link time. The compiler usually has a compilation option to set the variable alignment, and we usually use the default value of this option. By default, the compiler will align and place the addressed variable in the default way.
  The so-called "align by default" means placing a basic variable with a length of 1 on the 1n boundary. Place a basic variable with a length of 2 on the 2n boundary. Place a basic variable with a length of 4 on the 4n boundary, and so on.
  Each structure variable is always composed of basic variables. Structure variables are aligned according to the longest basic variable in the structure. If the maximum length of a structure basic variable is 1, the compiler can place the structure on the 1n boundary. If the maximum length of a structure basic variable is 4, the compiler should place the structure on the 4n boundary.
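
The rule for structure variables can be checked at compile time. The sketch below assumes a C11 compiler; the three structure names are invented for illustration. Each assertion states that a structure is aligned like its most strictly aligned member.

```c
#include <stdalign.h>
#include <stdint.h>

struct OnlyBytes { char a; char b[3]; };              /* longest basic member: 1 byte  */
struct HasShort  { char a; int16_t b; };              /* longest basic member: 2 bytes */
struct HasWord   { char a; int16_t b; int32_t c; };   /* longest basic member: 4 bytes */

/* Compile-time checks (C11): the build fails if any of them does not hold. */
_Static_assert(alignof(struct OnlyBytes) == alignof(char),
               "byte-only structure can sit on a 1n boundary");
_Static_assert(alignof(struct HasShort) == alignof(int16_t),
               "structure with a short is aligned like a short");
_Static_assert(alignof(struct HasWord) == alignof(int32_t),
               "structure with a 32-bit member is aligned like that member");
```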
  So how are the member variables in the structure aligned?

1.3 What troubles will variable alignment bring?
  I once suffered a loss on the variable alignment issue, which can be used as an example in this section. However, to understand this example, readers must know a feature of ARM CPU: basic variables with a length of m must be placed on the mn boundary, otherwise data access errors will occur when reading and writing, where m=2 or 4. This is the data alignment to be introduced in Section 3.
  The thing is that I defined several buffers (large arrays) and then dynamically allocated these memories. My mistake was to define these arrays as byte arrays. My allocation algorithm is to allocate by block, and the size of each data block is an integer multiple of 4. Can readers guess the reason for the error?
  Since I defined the buffer as a byte array, the compiler can place them on the 1n boundary. If the starting address of the buffer is an odd address, the starting address of the memory block allocated from the buffer is an odd address. If these memory blocks are used for variables that need to be aligned to 2 or 4 bytes, data access errors will occur when reading and writing. If the compiler happens to place these buffers on 4n boundaries, the problem will not be exposed. So the previous compilation may be fine, but the next compilation will cause inexplicable errors. Debugging a program is similar to solving a case. The farther the murderer is from the crime scene, the harder it is to find. Before I find the root cause through various appearances, it is inevitable to suffer a little.
  The solution to the problem is simple. Define the buffer as an array of unsigned int (hereinafter referred to as uint32), and the compiler will naturally put them on the 4n boundary. In embedded systems, we often need to define stacks for tasks. These stacks are usually arrays of uint32 type. Do you know why they are defined as uint32 arrays instead of byte arrays?
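
A minimal sketch of the fix described above, with an assumed buffer size. Option 1 is the uint32 array from the text. Option 2 is not from the original article: it is an alternative for C11 compilers that keeps a byte array but requests the alignment explicitly. The helper on_4n_boundary is hypothetical.

```c
#include <stdint.h>

#define TASK_STACK_BYTES 512u   /* assumed size, for illustration only */

/* Option 1: declare the storage as uint32_t so the compiler places it on a 4n boundary. */
static uint32_t task_stack[TASK_STACK_BYTES / sizeof(uint32_t)];

/* Option 2 (C11): keep a byte array but request the alignment of uint32_t explicitly. */
static _Alignas(uint32_t) uint8_t byte_buffer[TASK_STACK_BYTES];

/* Hypothetical helper: nonzero when p lies on a 4n boundary. */
static int on_4n_boundary(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}
```

With either declaration, blocks carved out of the buffer in multiples of 4 bytes start on 4n boundaries, regardless of where the linker places the buffer.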

2 Structure alignment
2.1 Basic length
  For the convenience of description, we define a concept of basic length. The basic length of a basic variable is its length, and the basic length of a structure variable is the maximum length of the basic variables in the structure members. As mentioned earlier: by default, structure variables are aligned according to their basic length.

2.2 Alignment
  By default, it can be assumed that the members of a structure are aligned in the default way, that is, a basic variable of length m is placed on an mn boundary, where m = 1, 2, 4, or 8. Because the members need to be aligned, padding bytes may appear between the members of the structure, and the size of the structure may be greater than the sum of the sizes of the members.
  For example:
  typedef struct St1Tag {
          char ch1;
          int num1;
          short sh1;
          short sh2;
          char ch2;
  } St1;
  The basic length of this structure is 4, so the variables of this structure are placed on the 4n boundary. The basic length of the member num1 is 4, so it is also placed on the 4n boundary. The member ch1 starts at the 4n boundary and occupies only 1 byte, so there are 3 padding bytes between ch1 and num1.
  When aligning, the compiler will round the length of the structure to an integer multiple of the basic length. In this way, the array with this structure as the basic type can be arranged continuously and each element can be aligned. Therefore, the value of sizeof(St1) is 16, and there are 3 padding bytes after the last member ch2 of St1.
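
The layout described above can be inspected with offsetof. The sketch below repeats the St1 definition and prints each member's offset; print_st1_layout is a name chosen for illustration. It assumes the common case of a 4-byte int and a 2-byte short.

```c
#include <stddef.h>
#include <stdio.h>

typedef struct St1Tag {
    char  ch1;
    int   num1;
    short sh1;
    short sh2;
    char  ch2;
} St1;

/* Print where each member starts and how large the whole structure is. */
static void print_st1_layout(void)
{
    printf("ch1  at offset %zu\n", offsetof(St1, ch1));
    printf("num1 at offset %zu\n", offsetof(St1, num1));
    printf("sh1  at offset %zu\n", offsetof(St1, sh1));
    printf("sh2  at offset %zu\n", offsetof(St1, sh2));
    printf("ch2  at offset %zu\n", offsetof(St1, ch2));
    printf("sizeof(St1) = %zu\n", sizeof(St1));
}
```

Under the stated assumptions the offsets are 0, 4, 8, 10, and 12, and the size is 16: 3 padding bytes follow ch1 and 3 more follow ch2.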

2.3 Compression
  Most compilers support packing (compression) of structures, that is, the member variables of the structure are arranged contiguously with no padding bytes between them. The size of the structure then equals the sum of the sizes of its members. Variables of a packed structure can be placed on a 1n boundary, that is, at any address.
  In gcc, the packed structure can be defined like this:
  typedef struct St2Tag {
          St1 st1;
          char ch2;
  } __attribute__ ((packed)) St2;

  armcc is like this:
  typedef __packed struct St2Tag {
          St1 st1;
          char ch2;
  } St2;

  VC's writing is the most troublesome:
#pragma pack(1)
  typedef struct St2Tag {
          St1 st1;
          char ch2;
  } St2;
#pragma pack()

  If you want to support gcc, armcc, and VC platforms at the same time, you can write the code like this:
#ifdef __GNUC__
#define GNUC_PACKED __attribute__((packed))
#else
#define GNUC_PACKED
#endif

#ifdef __arm
#define ARM_PACKED __packed
#else
#define ARM_PACKED
#endif

#ifdef WIN32
#pragma pack(1)
#endif
typedef ARM_PACKED struct St2Tag {
          St1 st1;
          char ch2;
} GNUC_PACKED St2;
#ifdef WIN32
#pragma pack()
#endif

Among them: __GNUC__ is a predefined macro of gcc, and __arm is a predefined macro of the ARM compiler (both __arm and __arm__ are acceptable). They can be used to identify the current compiler.

2.4 Global Settings
  In VC, some programmers are accustomed to setting the struct member alignment of the entire project, which corresponds to the command line option "/Zpi", where i=1,2,4,8,16. If this value is set to 1, all structures in the project are compactly arranged. Tight arrangement will increase the amount of code and reduce the efficiency of structure access. We should use compact structures only when necessary.
   "/Zp1" is a compact arrangement, so how are options such as "/Zp2" and "/Zp4" arranged?
  Suppose the length set in the option "/Zpi" is i, and the basic length of a structure member is m, then the structure member is aligned according to the smaller value of m and i. For example: if we set "/Zp2", members with a basic length not greater than 2 will be aligned according to the basic length, and members with a basic length greater than 2 will be aligned according to 2.
  In fact, we should not use such a strange option as "/Zp2" unless there is a reason to do so.
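
The min(m, i) rule can be tried on a single structure with #pragma pack, which Microsoft documents as the per-declaration counterpart of /Zp and which gcc and clang also accept. The sketch below is only an illustration; member_alignment, DefaultLayout, and Pack2Layout are invented names.

```c
#include <stddef.h>
#include <stdint.h>

/* Rule from the text: a member of basic length m under "/Zpi" is aligned to min(m, i). */
static size_t member_alignment(size_t m, size_t i)
{
    return (m < i) ? m : i;
}

/* Default layout: n is aligned to 4, so 3 padding bytes follow c. */
typedef struct {
    char    c;
    int32_t n;
    int16_t s;
} DefaultLayout;

/* Same members with a packing limit of 2: n is aligned to min(4, 2) = 2. */
#pragma pack(push, 2)
typedef struct {
    char    c;
    int32_t n;
    int16_t s;
} Pack2Layout;
#pragma pack(pop)
```

On a compiler where int32_t is aligned to 4 by default, offsetof(DefaultLayout, n) is 4 and sizeof(DefaultLayout) is 12, while offsetof(Pack2Layout, n) is 2 and sizeof(Pack2Layout) is 8.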

2.5 The use of compact structures
  In fact, the most commonly used structure alignment options are: default alignment and compaction. When transferring data between two programs or two platforms, we usually set the data structure to be compact. This not only reduces the amount of communication, but also avoids the trouble caused by alignment. Suppose Party A and Party B communicate across platforms, Party A uses such a strange alignment option as "/Zp2", and Party B's compiler does not support this alignment, then Party B can understand what it means to want to cry but have no tears.
  When we need to access structure data byte by byte, we usually hope that the structure is compact, so that we don't have to consider which byte is the padding byte. When we save data to non-volatile devices, we usually use compact structures, which not only reduces the storage volume, but also facilitates other programs to read out.
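
As an illustration of such a compact record, the sketch below defines a made-up message header in gcc/clang syntax and copies it into a byte buffer. WireHeader and wire_header_write are hypothetical names. Note that packing only removes padding; both sides still have to agree on byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 7-byte message header; packed, so it contains no padding bytes. */
typedef struct {
    uint8_t  type;
    uint32_t sequence;
    uint16_t length;
} __attribute__((packed)) WireHeader;   /* gcc/clang syntax */

/* Copy the header into an output buffer byte by byte; returns the bytes written. */
static size_t wire_header_write(uint8_t *out, const WireHeader *h)
{
    memcpy(out, h, sizeof *h);
    return sizeof *h;
}
```

Here sizeof(WireHeader) is 1 + 4 + 2 = 7, so every byte in the buffer belongs to a member and none of them is padding.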

2.6 Details
  Finally, record a small detail. Both the gcc compiler and the VC compiler support including non-packed structures in packed structures. For example, St2 in the previous example can include non-packed St1. But for the ARM compiler, other structures included in the packed structure must be packed. If the packed St2 includes non-packed St1, an error will be reported during compilation:
  error: #1031 Definition of "struct St1Tag" in packed "struct St2Tag" must be __packed

3 Data alignment
3.1 CISC and RISC
  CPUs can be divided into two categories based on the characteristics of the instruction set: CISC and RISC. CISC and RISC are the abbreviations of Complex Instruction Set Computer and Reduced Instruction Set Computer, respectively.
  The work of the CPU can be seen as a repeated cycle of the following steps:
  step 1: fetch instructions
  step 2: fetch data
  step 3: execute instructions
  step 4: output results
CISC CPU supports many addressing modes, so the time to fetch data is uncertain. The biggest feature of RISC CPU is that it simplifies the addressing mode of instructions. Except for Load/Store instructions, other instructions use register addressing, that is, reading and writing data from registers. This design makes the time to fetch data relatively stable and simplifies the design of instruction pipelines.
  Generally speaking, RISC architecture can reduce the complexity of CPU and allow more powerful CPUs to be produced at the same process level, but it has higher requirements for compiler design.

3.2 Aligned data access
  The Load/Store instructions of a RISC CPU generally require data to be aligned. Data with a length of 4 should be placed on a 4n boundary, and data with a length of 2 should be placed on a 2n boundary. Take the Load instructions of the ARM CPU as an example:
  LDR R5, [R4]
  LDRSH R7, [R6]
  LDRB R9, [R8]
  LDR, LDRSH, and LDRB read a word, half word, and byte from the memory respectively and put them into the specified register. For example, "LDR R5, [R4]" reads a word (length 4) from the storage unit pointed to by R4 and puts it into R5. LDR requires the data address to be on the 4n boundary, otherwise an error will occur. LDRSH requires the data address to be on the 2n boundary, otherwise an error will occur.
  What kind of error occurs? This depends on the specific CPU. On ARM7TDMI, an unaligned access makes the program jump to the data abort exception vector, that is, address 0x00000010. On ARM920T, the LDR instruction may return incorrect data. CISC CPUs such as x86 generally support unaligned data access.

3.3 Examples
  Let's look at an example:
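  // Example 1
  #include <stdio.h>

  void test(void) {
         char a[] = {1,2,3,4,5};
         int *pi, i;

         printf("&a[1]=%p\n", (void *)&a[1]);
         pi = (int *)&a[1];      // deliberately misaligned pointer
         i = *pi;                // unaligned 4-byte read
         printf("0x%08x\n", i);
         *pi = 0x11223344;       // unaligned 4-byte write
         for(i = 0; i < sizeof(a)/sizeof(a[0]); i++)
         {
              printf("0x%02x ", a[i]);
         }
}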
The key statements are i = *pi; and *pi = 0x11223344;. The 4 bytes at the address held in pi are 0x02, 0x03, 0x04, and 0x05. On a little-endian CPU, we expect the output to be 0x05040302 followed by 0x01 0x44 0x33 0x22 0x11. Let's see how this code behaves on different platforms.

3.3.1 PC/Windows
  The output is:
   &a[1]=0x0012FF25
   0x05040302
   0x01 0x44 0x33 0x22 0x11
   This is in line with our expectations, and also shows that the PC CPU supports unaligned data access.

3.3.2 PC/Linux
The output is:
  &a[1]=0xbfa0c36c
  0x05040302
  0x01 0x44 0x33 0x22 0x11
Note that the gcc compiler placed the local variable a on a 1n boundary (0xbfa0c36b), so &a[1] happens to fall on a 4n boundary and the access above is actually aligned. We want pi to be an odd address, so we modify the test code to:
  // Example 2
  void test1(void) {
         int a[] = {0x04030201, 0x08070605};
         int *pi, i;

         pi = (int *)&((char *)&a)[1];   // one byte past the start of a: an odd address
         printf("pi=%p ", (void *)pi);
         i = *pi;                        // unaligned 4-byte read
         printf("0x%08x\n", i);

         *pi = 0x11223344;               // unaligned 4-byte write
         for(i = 0; i < sizeof(a)/sizeof(a[0]); i++)
         {
              printf("0x%08x ", a[i]);
         }
}
The output result is:
  pi=0xbfe87fe9 0x05040302
  0x22334401 0x08070611
This is in line with our expectations. Data alignment is a CPU problem and has nothing to do with the compiler or operating system.

3.3.3 ARM920T/Linux
The output result is:
  &a[1]=0xbec49e55
  0x01040302
  0x44 0x33 0x22 0x11 0x05
Considering little-endian byte order, the 4 bytes actually read by the CPU are 0x02, 0x03, 0x04, and 0x01. This is not the result we expected: the CPU handled the unaligned access incorrectly.
Why?
In ARM, there are two instruction sets: ARM and Thumb.
ARM instructions: each instruction is 4 bytes (32 bits) long, so the PC increases by 4 each time an instruction is executed. To access 4 bytes at a time, the starting address must be 4-byte aligned, that is, the lowest two bits of the address are 0b00 and the address is a multiple of 4.
Thumb instructions: each instruction is 2 bytes (16 bits) long, so the PC increases by 2 each time an instruction is executed. To access 2 bytes at a time, the starting address must be 2-byte aligned, that is, the lowest bit of the address is 0b0 and the address is a multiple of 2.
Testing so far shows that a write is performed at the aligned address: for *pi = 0x11223344 above, the address actually written is ((uintptr_t)pi) & ~(4-1). No pattern was identified for the read operation during testing. One observation, though: the value read, 0x01040302, equals the aligned word 0x04030201 rotated right by 8 bits, which is how ARM cores before ARMv6 are documented to handle an unaligned LDR when alignment checking is disabled.
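
The ARM920T output above can be reproduced with a small software model. This is an interpretation rather than part of the original test: it assumes the load rotation documented for ARM cores before ARMv6 (the aligned word is fetched and rotated right by 8 bits for each byte of misalignment) and the aligned-address store observed above. The names ror32, model_ldr, and model_str are invented for the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate a 32-bit value right by r bits, 0 <= r < 32. */
static uint32_t ror32(uint32_t v, unsigned r)
{
    return r ? (v >> r) | (v << (32u - r)) : v;
}

/* Simplified model of an unaligned word load: fetch the aligned
 * little-endian word, then rotate it right by 8 * (addr & 3) bits. */
static uint32_t model_ldr(const uint8_t *mem, size_t addr)
{
    size_t base = addr & ~(size_t)3;
    uint32_t word = (uint32_t)mem[base]
                  | ((uint32_t)mem[base + 1] << 8)
                  | ((uint32_t)mem[base + 2] << 16)
                  | ((uint32_t)mem[base + 3] << 24);
    return ror32(word, 8u * (unsigned)(addr & 3u));
}

/* Simplified model of the observed store: the low two address bits are ignored. */
static void model_str(uint8_t *mem, size_t addr, uint32_t value)
{
    size_t base = addr & ~(size_t)3;
    mem[base]     = (uint8_t)(value & 0xffu);
    mem[base + 1] = (uint8_t)((value >> 8) & 0xffu);
    mem[base + 2] = (uint8_t)((value >> 16) & 0xffu);
    mem[base + 3] = (uint8_t)((value >> 24) & 0xffu);
}
```

With mem = {1,2,3,4,5} and address 1, model_ldr returns 0x01040302, and model_str with 0x11223344 leaves the bytes 0x44 0x33 0x22 0x11 0x05, matching the output shown for Example 1 on ARM920T.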
But is there a way to perform unaligned access? For this purpose, the ARM compiler provides the __packed keyword, which declares one-byte alignment:
  void test2(void) {
         char a[] = {1,2,3,4,5};
         __packed int *pi, i;    // accesses through pi need not be aligned

         printf("&a[1]=%p\n", (void *)&a[1]);
         pi = (int *)&a[1];
         i = *pi;
         printf("0x%08x\n", i);
         *pi = 0x11223344;
         for(i = 0; i < sizeof(a)/sizeof(a[0]); i++)
         {
              printf("0x%02x ", a[i]);
         }
}

The output result is:
  &a[1]=0xbec49e55
  0x01040302
  0x01 0x44 0x33 0x22 0x11

3.3.4 ARM7TDMI
The program jumps directly to the Data Abort handling vector, that is, address 0x00000010, when executing: i = *pi;

3.4 Countermeasures
When reading a compact (packed) structure or a compact member of a structure, the compiler automatically generates code that reads byte by byte. We only need to be careful with pointer casts: do not cast a pointer to narrow data into a pointer to wide data. Wherever data alignment problems may occur, read the data byte by byte.
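
A minimal sketch of reading by byte, as an alternative to the pointer cast in Example 1. The helper names read_u32_le and write_u32_le are made up for illustration. They assemble a little-endian 32-bit value from individual bytes, so they work at any address and on any CPU.

```c
#include <stdint.h>

/* Read a 32-bit little-endian value one byte at a time; src may be unaligned. */
static uint32_t read_u32_le(const void *src)
{
    const uint8_t *p = src;
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Write a 32-bit value as 4 little-endian bytes; dst may be unaligned. */
static void write_u32_le(void *dst, uint32_t value)
{
    uint8_t *p = dst;
    p[0] = (uint8_t)(value & 0xffu);
    p[1] = (uint8_t)((value >> 8) & 0xffu);
    p[2] = (uint8_t)((value >> 16) & 0xffu);
    p[3] = (uint8_t)((value >> 24) & 0xffu);
}
```

Applied to the array from Example 1, read_u32_le(&a[1]) yields 0x05040302, and write_u32_le(&a[1], 0x11223344) leaves the bytes 0x01 0x44 0x33 0x22 0x11, which is the result expected in Section 3.3 without any unaligned word access.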

