A brief discussion on the bus differences of ARM Cortex-M0/M4 series
Let's start with a simple question: how fast can an STM32 toggle a GPIO pin (for example, when bit-banging a timing-critical signal)? Write a piece of code to test:
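A minimal sketch of what such a test loop could look like, not the exact listing used for the measurements. It assumes the usual STM32 GPIO register layout with ODR at offset 0x14 (20), which matches the `[r3, #20]` stores in the disassembly further down. The names `gpio_regs_t`, `PULSE` and `toggle_burst`, and the count of 20 pulses per burst, are hypothetical choices for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed STM32 GPIO register layout; only ODR matters here. */
typedef struct {
    volatile uint32_t MODER;   /* 0x00 */
    volatile uint32_t OTYPER;  /* 0x04 */
    volatile uint32_t OSPEEDR; /* 0x08 */
    volatile uint32_t PUPDR;   /* 0x0C */
    volatile uint32_t IDR;     /* 0x10 */
    volatile uint32_t ODR;     /* 0x14 = 20, the offset seen in the STRs */
} gpio_regs_t;

#define PIN5_MASK (1u << 5)

/* One pulse = two back-to-back stores to ODR (high, then low). */
#define PULSE(p, hi, lo) \
    do { (p)->ODR = (hi); (p)->ODR = (lo); } while (0)
#define PULSE5(p, hi, lo) \
    do { PULSE(p, hi, lo); PULSE(p, hi, lo); PULSE(p, hi, lo); \
         PULSE(p, hi, lo); PULSE(p, hi, lo); } while (0)

/* Emit 20 pulses on pin 5 with no branch in between. Both values are
 * computed once up front so the unrolled body is nothing but stores. */
static inline void toggle_burst(gpio_regs_t *port)
{
    const uint32_t lo = port->ODR & ~PIN5_MASK;
    const uint32_t hi = lo | PIN5_MASK;

    PULSE5(port, hi, lo);
    PULSE5(port, hi, lo);
    PULSE5(port, hi, lo);
    PULSE5(port, hi, lo);
}

/* On target (assumption): for (;;) toggle_burst(GPIOA);
 * The branch back to the top of the loop is the pause between bursts. */
```

Because the function only takes a pointer, it can be tried on a PC against an ordinary struct in RAM before pointing it at the real GPIOA block from the vendor header.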
This code makes PA5 flip back and forth between high and low. After 20 consecutive flips there is a gap caused by a jump. After compiler optimization, it becomes a series of STR instructions:

    ......
    100001c0: 6159 str r1, [r3, #20]
    100001c2: 615a str r2, [r3, #20]
    100001c4: 6159 str r1, [r3, #20]
    100001c6: 615a str r2, [r3, #20]
    100001c8: 6159 str r1, [r3, #20]
    100001ca: 615a str r2, [r3, #20]
    100001cc: 6159 str r1, [r3, #20]
    ...

In this way, two different values are alternately written to the same memory address (the address of the GPIO ODR register), which makes the I/O pin level change. If each STR instruction takes only 1 machine cycle (the ideal case), the program above makes the GPIO output a square wave at half the system clock frequency.

But does each STR instruction really take one machine cycle? Back when I was playing with 8-bit AVRs, I wrote an ISP downloader in assembly that also bit-banged its timing. I still remember clearly that the AVR manual states that an OUT instruction writing an I/O-space register takes 1 machine cycle, while an ST instruction writing a register or SRAM takes 2 machine cycles. What about ARM? I don't remember whether the manual gives instruction cycle counts, but we can test first. To make the oscilloscope measurement easier, I first lowered the system clock frequency to 200 kHz. Then...
This is very good, except that there is an extra pause after every 20 consecutive pulses because of the jump instruction. Both the high and low levels last 5 µs. At 200 kHz one clock period is 1/200 kHz = 5 µs, so each STR instruction takes exactly one machine cycle. But don't get too excited yet. In the test above, the code was executed from SRAM2 of the STM32L452. Now I put the code in SRAM1 and execute it. The result is:
Isn't it strange? Not only is it slower, but the execution time of the same instruction also varies. I would guess that if the same code were executed on a Cortex-M0, the result would look like this second case. Where does the difference come from?

If you know something about how computers work, you know that the CPU has to fetch instructions before it can execute them. So where does it fetch them from? On a microcontroller it is most commonly Flash ROM. It may also be on-chip SRAM, or even external SRAM, SDRAM, NOR Flash, etc. attached to the chip. To read a memory device and fetch instructions, the CPU must access the bus. In the program above, the CPU executes the STR instruction to write the GPIO register, which is another bus write operation.

So here's the question: can the instruction read from memory and the write to the GPIO peripheral happen on the bus at the same time? The internal bus of the microcontroller runs at a fixed clock frequency, and a bus master can issue at most one request per bus cycle. The picture below is the internal structure of the STM32F0x series (Cortex-M0):
There is only one System bus between the CPU core and everything outside it. Therefore, its accesses to SRAM/Flash and its accesses to GPIO must be staggered in time. That produces the second picture above: the result of bus contention. Although my experiments were not conducted on a Cortex-M0, the reasoning is the same. So where did the ideal result in the first picture come from? Take a look at the internal structure diagram of the Cortex-M4 I experimented with:
Attention, attention: the Cortex-M4 comes with three buses, and they can be accessed in parallel. They are called the I-Code bus, the D-Code bus and the System bus. In this way, while the CPU writes the GPIO peripheral on the AHB2 bus through the System bus, it can also fetch program instructions from SRAM2 through the I-Code bus at the same time. Thus (taking pipelining into account) the GPIO can be flipped once per machine cycle.

In my two experiments above, the code ran with different efficiency from SRAM2 and from SRAM1. This is because SRAM1 is accessed through the System bus, so instruction fetches contend with the GPIO accesses. What if the code is executed from Flash? It can also reach the same efficiency as execution from SRAM2 (both are accessed through the I-Code bus), but there are conditions: Flash is not as fast as SRAM, so wait states must be inserted when the CPU frequency is high, and without a cache this affects the speed.

Now for SRAM1. In the picture above, SRAM1 is connected to all three buses of the Cortex-M4. I once asked about this at https://bbs.eeworld.com.cn/forum.php?mod=viewthread&tid=508085&extra=, and can now answer: the split is by address space. For addresses at or above 0x20000000, the Cortex-M4 (the M3 too) uses the System bus, while for addresses below 0x20000000 it uses the I-Code and D-Code buses. If you want to improve the execution efficiency of code in SRAM1, you need to enable address remapping:
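A minimal sketch of the remap step, under assumptions that should be checked against the reference manual for the exact part. It assumes the STM32L4 SYSCFG_MEMRMP register has its MEM_MODE field in bits [2:0] and that the value 0b011 maps SRAM1 at address 0x00000000. The helper names `remap_sram1_to_zero` and `sram1_alias` are hypothetical. The register is passed in as a pointer, so the logic can be checked on a PC.

```c
#include <stdint.h>

/* Assumed from the STM32L4 reference manual: SYSCFG_MEMRMP.MEM_MODE is
 * bits [2:0], and 0b011 selects "SRAM1 mapped at 0x00000000". */
#define MEM_MODE_MASK  0x7u
#define MEM_MODE_SRAM1 0x3u

#define SRAM1_BASE 0x20000000u  /* normal SRAM1 address (System bus side) */
#define ALIAS_BASE 0x00000000u  /* remapped window (I-Code/D-Code side)   */

/* Read-modify-write the remap register so only MEM_MODE changes.
 * On target the argument would be &SYSCFG->MEMRMP, with the SYSCFG
 * clock enabled beforehand (assumption, not shown here). */
static inline void remap_sram1_to_zero(volatile uint32_t *memrmp)
{
    *memrmp = (*memrmp & ~MEM_MODE_MASK) | MEM_MODE_SRAM1;
}

/* Translate an address inside SRAM1 to its alias in the low window. */
static inline uint32_t sram1_alias(uint32_t sram1_addr)
{
    return sram1_addr - SRAM1_BASE + ALIAS_BASE;
}
```

As far as I understand, the remap alone is not enough: the code also has to be called through its low alias address (below 0x20000000), otherwise the fetches still go out on the System bus.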
To summarize: the Cortex-M0 CPU has only one bus (so it is a von Neumann structure, with instructions and data sharing one addressing path). Even when the same machine-code program is executed, its efficiency differs from that of the Cortex-M3/M4 with three buses (a Harvard structure, with instructions and data accessed separately). To take full advantage of this on the Cortex-M3/M4, try to have the program execute from a memory device that can be accessed through the I-Code bus.
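The address rule above can be written down as a small check. This is only a sketch of the rule as described in this post: instruction fetches below 0x20000000 go out on the I-Code bus, everything from 0x20000000 upward on the System bus. It ignores the core's private peripheral region, and the names `fetch_bus_t` and `fetch_bus_for` are made up for illustration.

```c
#include <stdint.h>

typedef enum {
    BUS_ICODE,  /* code region, below 0x20000000    */
    BUS_SYSTEM  /* SRAM region and everything above */
} fetch_bus_t;

/* Which bus a Cortex-M3/M4 instruction fetch from 'addr' would use,
 * following the split by address space described in the text. */
static inline fetch_bus_t fetch_bus_for(uint32_t addr)
{
    return (addr < 0x20000000u) ? BUS_ICODE : BUS_SYSTEM;
}
```

For example, 0x100001c0 (the address of the first STR in the disassembly, in SRAM2) comes out as BUS_ICODE, while any code address of the form 0x2000xxxx in SRAM1 comes out as BUS_SYSTEM. That is the difference between the two oscilloscope results.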