Analysis of the impact of RMW on the running speed of the STM32F7xx core

Last updated: 2017-03-16

Preface

While testing the Cortex-M7 based STM32F7xx, customers found that the Cortex-M4 based STM32F4xx executed the same simple program faster than the STM32F7xx at the same clock frequency. This undermines customers' confidence in switching to the STM32F7xx and calls into question ST's and ARM's claims that the Cortex-M7 core is much faster than the Cortex-M4 core. This article uses a specific case to analyze why this happens and how to solve it.

Problem Description

The customer first measured complex programs. For example, at the same 180 MHz clock frequency, the CoreMark test program runs in much less time on the STM32F7xx than on the STM32F4xx; that is, the STM32F7xx delivers better performance. However, when the customer ran simple, sequentially executed code, the STM32F7xx took longer than the STM32F4xx. The following test code, for example, shows a significant difference:

volatile uint16_t i;
static volatile uint16_t j = 0;

i = 0;
while (i < 300)
{
    i++;
}

if (j < 100)
{
    j++;
}
else
{
    j = 0;
}

To quantify the time, Timer2 is used to time this section of code. Timer2 runs at 90 MHz and counts up, and the resulting count is stored in Test_Counter. The instrumented code is as follows:

volatile uint16_t i;
static volatile uint16_t j = 0;

TIM2->CNT = 0;
__HAL_TIM_ENABLE(&htim2);

i = 0;
while (i < 300)
{
    i++;
}

if (j < 100)
{
    j++;
}
else
{
    j = 0;
}

__HAL_TIM_DISABLE(&htim2);
Test_Counter = __HAL_TIM_GET_COUNTER(&htim2);

With this instrumentation in place, the measured Test_Counter values are:

STM32F446 data is 1543

STM32F746 data is 1836

Measuring the same code with Keil's built-in States (cycle) counter gives the following values. These cycle counts are used for the execution-time comparisons in the rest of this article.

STM32F446 data is 3009

STM32F746 data is 3635
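As a rough cross-check of the two measurement methods, the Timer2 counts can be converted to time and to equivalent core cycles. The sketch below assumes the core runs at 180 MHz (the frequency quoted for the CoreMark comparison) and Timer2 at 90 MHz as stated above; the function names are illustrative and not part of the original test code.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a tick count into microseconds for a clock running at clock_hz. */
double ticks_to_us(uint32_t ticks, uint32_t clock_hz)
{
    assert(clock_hz > 0);
    return (double)ticks * 1e6 / (double)clock_hz;
}

/* Scale a timer count to the equivalent number of core cycles.
 * Assumed clocks for this article: timer_hz = 90 MHz, core_hz = 180 MHz. */
uint32_t timer_to_core_cycles(uint32_t timer_ticks,
                              uint32_t timer_hz,
                              uint32_t core_hz)
{
    assert(timer_hz > 0);
    return (uint32_t)(((uint64_t)timer_ticks * core_hz) / timer_hz);
}
```

Under these assumptions, the STM32F446 count of 1543 corresponds to 3086 core cycles and the STM32F746 count of 1836 to 3672, which is within a few percent of the Keil figures of 3009 and 3635.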

Cause Analysis:

All of the above tests were run with the cache and ART accelerator enabled. For guidance on optimizing STM32F7xx performance, see application note AN4667, "STM32F7 Series system architecture and performance". This example already applies the optimizations described in that document, yet the STM32F7xx is still slower than the STM32F4xx. Both chips run the same code, and the generated assembly for the loop is identical on both:

LDRH r2,[sp,#0x00]
ADDS r2,r2,#1
STRH r2,[sp,#0x00]
LDRH r2,[sp,#0x00]
CMP r2,r3
BCC 0x00000128

The ARM Cortex-M7 core documentation describes a read-modify-write (RMW) mechanism that explains this behavior.

In this example, i is defined as 16-bit data, and the generated assembly contains the halfword store instruction STRH. With the RMW (read-modify-write) mechanism enabled, a write to byte or halfword data is carried out by first reading the data, modifying it, and then writing it back. This read-modify-write sequence is what reduces the core's execution efficiency. Defining the variable as 32-bit avoids the problem.
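To make the read-modify-write idea concrete, here is a minimal host-side model of storing a halfword into a 32-bit word. It is only a conceptual sketch: the function name, the little-endian halfword indexing, and the three-step breakdown are illustrative assumptions, not a description of how the core implements the sequence internally.

```c
#include <assert.h>
#include <stdint.h>

/* Conceptual model: update one 16-bit half of a 32-bit word.
 * half_index 0 selects bits [15:0], half_index 1 selects bits [31:16]. */
uint32_t rmw_store_halfword(uint32_t word, unsigned half_index, uint16_t value)
{
    assert(half_index < 2u);
    unsigned shift = half_index * 16u;
    uint32_t mask = (uint32_t)0xFFFFu << shift;   /* bits owned by the halfword */

    uint32_t old = word;                          /* 1. read the full word      */
    uint32_t merged = (old & ~mask)               /* 2. modify: drop old half,  */
                    | ((uint32_t)value << shift); /*    merge in the new value  */
    return merged;                                /* 3. write the word back     */
}
```

A full 32-bit store needs none of this merging, which matches the observation that a 32-bit variable avoids the extra work.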

Solution:

Following the documentation, change the 16-bit variable i to a 32-bit variable, that is:

volatile uint32_t i;
static volatile uint16_t j = 0;

The generated assembly code is as follows:

LDR r0,[sp,#0x00]
ADDS r0,r0,#1
STR r0,[sp,#0x00]
CMP r0,r1
BCC 0x08001F28

The test results are as follows:

STM32F446 data is 2102

STM32F746 data is 1807

Both the STM32F4xx and the STM32F7xx speed up significantly when the data is defined as 32-bit, and the improvement on the STM32F7xx is the more pronounced. Under the same test conditions, the STM32F7xx now executes in less time than the STM32F4xx.
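The size of the improvement can be worked out from the figures above. The helper below is a small illustrative sketch (the name speedup is not from the original), and it assumes that the 32-bit results are Keil cycle counts like the earlier 16-bit ones.

```c
#include <assert.h>
#include <stdint.h>

/* Ratio of the cycle count before a change to the count after it.
 * A result above 1.0 means the code became faster. */
double speedup(uint32_t cycles_before, uint32_t cycles_after)
{
    assert(cycles_after > 0);
    return (double)cycles_before / (double)cycles_after;
}
```

With the measured values, speedup(3009, 2102) is about 1.43 for the STM32F446 and speedup(3635, 1807) is about 2.01 for the STM32F746.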

A 32-bit definition uses more memory, and a byte or halfword definition is sometimes more convenient. If the speed is still needed in those cases, the core documentation gives another option, which is to disable the RMW mechanism.

In practice, this means writing 0 to bit 1 (the RMW bit) of the CM7_DTCMCR register, which can be done in the program as follows:

__IO uint32_t *DTCM_CR = (__IO uint32_t *)(0xE000EF94); /* CM7_DTCMCR register */
*DTCM_CR &= 0xFFFFFFFD; /* Clear bit 1: disable read-modify-write */
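The mask 0xFFFFFFFD is simply every bit set except bit 1. A slightly more self-documenting variant is sketched below. The macro and function names are hypothetical, and the register pointer is passed in as a parameter so that the logic can also be exercised on a host machine with an ordinary variable.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 1 of CM7_DTCMCR; ~DTCMCR_RMW_BIT equals 0xFFFFFFFD. */
#define DTCMCR_RMW_BIT (1u << 1)

/* Clear the RMW bit and leave every other bit of the register unchanged. */
void dtcm_disable_rmw(volatile uint32_t *dtcmcr)
{
    assert(dtcmcr != 0);
    *dtcmcr &= ~DTCMCR_RMW_BIT;
}

/* Return nonzero if the RMW bit is currently set. */
int dtcm_rmw_enabled(const volatile uint32_t *dtcmcr)
{
    assert(dtcmcr != 0);
    return (*dtcmcr & DTCMCR_RMW_BIT) != 0u;
}
```

On the target, the call would be dtcm_disable_rmw((volatile uint32_t *)0xE000EF94), using the register address given above.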

After disabling RMW, the STM32F746 test data is as follows:

With 16-bit data, the STM32F746 cycle count is 3022

With 32-bit data, the STM32F746 cycle count is 1808

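Comparing the test data above shows that disabling RMW removes the STM32F7xx's disadvantage. The 16-bit case drops from 3635 to 3022 cycles, roughly level with the STM32F446 (3009), and the 32-bit case (1808) stays well ahead of the STM32F446 (2102). The specific test data is as follows: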