Memcpy ARM/NEON assembly performance test on Cortex-A8 platform-EEWORLD

Collect

This article introduces the performance comparison of memcpy implemented in C language, ARM assembly and NEON assembly on Cortex-A8 chips based on ARMv7 architecture (FreeScale i.MX51 / i.MX53 / QualComm msm8x50 / msm7x30 / Samsung s5pc100 / s5pc110 / TI omap 3430 / omap 3730 chips), and inputs and analyzes the impact of NEON instructions (NEON memory bit width of different processors varies from 64-bit to 128-bit) and cache preload (preload engine instruction) on performance. The final conclusion shows that 1. There is a performance peak between the copy block size = 512B ~ 32K, and there is also a performance turning point when the block size = 256K. This feature reflects the influence of the chip's 32KB L1 / 256KB L2 cache; 2. The performance of NEON instructions is always higher than that of ARM instructions. With the development, the performance gap between ARM/NEON instructions is narrowing. The performance of alternating ARM/NEON instructions is often worse than that of the NEON version; 3. If there is no good model design, the software will interfere with the use of cache, which will easily cause performance degradation; 4. Under the condition of fit in cache, the Snapdragon platform has the best performance; 5. Under the condition of out of cache, s5pc110 has the best performance; 6. Under the same hardware platform, overclocking has little effect on memory performance; 7. The same implementation has different performances on different hardware platforms. No implementation is the best on all platforms.

Preface

In the C run time library, memcpy is an important function that has a significant impact on the performance of application software. The development of ARM chips to the Cortex-A8[1][2] architecture has not only greatly improved the frequency, but also greatly improved the architectural design. The added NEON instructions are similar to the original MMX instructions under the X86 platform and are designed for multimedia. However, because these instructions can process 64-bit data at a time, they are also helpful for improving the performance of the memcpy function. This article mainly tests various memcpy implementations using the NEON[2] instruction, explores the impact of the NEON instruction and preload instruction on performance, and the changing trend of these impacts after chip optimization and process improvement. At the same time, it is hoped that chip designers can give an explanation based on the understanding of software implementation, so as to guide the direction of further improving performance.
Platform Introduction

The test platform for this time is derived from the Cortex-A8 platform that the author encountered in his work project. See the list below:

FreeScale i.MX51 / i.MX53
Qualcomm msm8x50 / msm7x30
Samsung s5pc100 / s5pc110
TI omap 3430 / omap 3730

i.MX5 family

For an introduction to the i.MX5 family, see [6][7]. The i.MX535 can run at two frequencies: 800MHZ and 1000MHZ.

i.MX515
- freq: 800MHZ
- cache size: 32KB/32KB I/D Cache and 256KB L2 Cache
- cache line: 64-bit wide (NEON), 64-byte / line
i.MX535
freq: 800MHZ / 1000MHZ
cache size: 32KB/32KB I/D Cache and 256KB L2 Cache
cache line: 64-bit wide (NEON), 64-byte / line

Snapdragon family

For an introduction to Snapdragon, see [8][9][10]. The msm7x30 can run at two frequencies: 800MHZ and 1000MHZ. In addition, the Snapdragon cache is special in that it is 128-bit wide (NEON), 128-byte / line. In the standard Cortex-A8, this value is 64-bit wide (NEON), 64-byte / line. This has a significant impact on performance.

msm8x50
- freq: 1000MHZ
- cache size: 32KB/32KB I/D Cache and 256KB L2 Cache
- cache line: 128-bit wide (NEON), 128-byte / line
msm7x30
- freq: 800MHZ / 1000MHZ
- cache size: 32KB/32KB I/D Cache and 256KB L2 Cache
- cache line: 128-bit wide (NEON), 128-byte / line

s5pc family

The s5pc family reference platform can be found in [11].

s5pc100
- freq: 665MHZ
- cache size: 32KB/32KB I/D Cache and 256KB L2 Cache
- cache line: 64-bit wide (NEON), 64-byte / line
s5pc110
- freq: 1000MHZ
- cache size: 32KB/32KB I/D Cache and 512KB L2 Cache
- cache line: 64-bit wide (NEON), 64-byte / line

omap3 family

For the omap3 family reference platform, see [12][13][14].

omap3430
- freq: 550MHZ
- cache size: 16KB/16KB I/D Cache and 256KB L2 Cache
- cache line: 64-bit wide (NEON), 64-byte / line
omap3730
- freq: 1000MHZ
- cache size: 32KB/32KB I/D Cache and 256KB L2 Cache
- cache line: 64-bit wide (NEON), 64-byte / line

Introduction to memcpy implementation

There are three versions of memcpy implementation on the ARM platform:
C language version
ARM assembly version
NEON assembly version

ARM's document [4] has a good description of the implementation of memcpy. Others [5][19][20] have further elaborated on the implementation principles and techniques. A brief description is as follows:

NEON instructions can process 64-bit data at a time, which is more efficient.
The NEON architecture has a direct connection to the L1/L2 cache, and better performance can be achieved after being enabled at the OS level.
The ARM/NEON pipeline may be processed asynchronously, and alternating ARM/NEON instructions may achieve better performance.
In one loop, use as many registers as possible to copy more data to ensure better pipeline efficiency. Currently, the maximum processing block is 128-byte.
The operation of cache is particular.
- memcpy is a one-time scan operation without backtracking. The cache preload strategy can improve the hit rate. Therefore, the pld instruction must be used in the assembly version to prompt ARM to fill the cache line in advance.
- The offset in the pld instruction is very particular. It is usually a multiple of 64 bytes. On the ARMv5TE platform, one pld instruction is used in one loop. On the Cortex-A8 platform, it is faster and requires 2~3 pld instructions in one loop to fill a cache line. Such a loop consumes 2~3 clock cycles in exchange for an increase in cache hit rate, which is worth the effect.
- Furthermore, the Cortex-A8 architecture provides preload engine instructions, which allow software to have a deeper impact on the cache, thereby improving the cache hit rate. However, to use the preload engine instruction in user space, it is necessary to patch the OS to open permissions.

C language version

The C language version is mainly for comparison. Two implementations are used:
32-bit wide copy. Later marked as in32_cpy.
16-byte wide copy. It is marked as vec_cpy. The trick of this implementation is to use the gcc vector extension "__attribute__ ((vector_size(16)))" to implement 16-byte wide copy at the C language level and leave the specific implementation to the compiler.

It is worth noting that the compiler will not actively insert pld instructions because the compiler cannot determine the application's memory access pattern.
ARM assembly version

The ARM assembly version is also mainly for comparison. Two implementations are used:
Implemented by Siarhei Siamashka [15]. Later marked as arm9_memcpy. It is optimized for Nokia N770.
Nicolas Pitre implemented this [16]. It is denoted as armv5te_memcpy. This is the default arm memcpy implementation in glibc.
NEON assembly version

The NEON assembly version uses four implementations:
M?ns Rullg?rd implementation [19]. This is the simplest implementation of a 128-byte-aligned block. It does not detect the case where the block is not 128-byte aligned. Therefore, it is not a practical version. However, this type of implementation can be used to examine the performance limit of memcpy. He provides a total of 4 implementations.
The full ARM assembly implementation is marked as memcpy_arm. In addition, the author also removes the pld instruction as a comparative experiment to examine the impact of the pld instruction. It is marked as memcpy_arm_nopld.
The full NEON assembly implementation is marked as memcpy_neon. In addition, the author also removes the pld instruction as a comparative experiment to examine the impact of the pld instruction. It is marked as memcpy_neon_nopld.
The implementation of alternating use of ARM / NEON instructions. It is marked as memcpy_armneon. In addition, the author also removes the pld instruction as a comparative experiment to examine the impact of the pld instruction. It is marked as memcpy_armneon_nopld.
ple + NEON implementation. It is marked as memcpy_ple_neon. In addition, the author also replaced the NEON instructions with ARM instructions as a comparative test to examine the impact of ple instructions on ARM/NEON instructions. It is marked as memcpy_ple_arm. Because this implementation requires patching the Linux kernel, it did not succeed on the omap3430 platform. It is a bit troublesome to replace the kernel on the Snapdragon platform, so it was not tested.
CodeSourcery implementation [17]. This is the implementation in glibc in the CodeSourcery toolchain. There are also two implementations.
ARM implementation. The following is marked as memcpy_arm_codesourcery. The author also removed the pld instruction as a comparative experiment to examine the impact of the pld instruction. The following is marked as memcpy_arm_codesourcery_nopld.
NEON implementation. It is marked as memcpy_neon_codesourcery. This is also the NEON implementation used in Android bionic. The author also removed the pld instruction as a comparative experiment to examine the impact of the pld instruction. It is marked as memcpy_neon_codesourcery_nopld.
QualComm implementation [18]. It is marked as memcpy_neon_qualcomm. This is an optimized version developed by QualComm for the Snapdragon platform in the Code Aurora Forum. It is mainly optimized for the 8660/8650A platform. The feature of this version is that it is designed for L2 cache line size = 128 bytes, and the pld offset is set to a particularly large value. As a result, it has no effect on other Cortex-A8 platforms. Therefore, the author changed the pld offset to the value implemented by M?ns Rullg?rd. The author also removed the pld instruction as a comparative experiment to examine the impact of the pld instruction. It is marked as memcpy_neon_qualcomm_nopld.
Implemented by Siarhei Siamashka [20]. Later marked as memcpy_neon_siarhei. This is the NEON version submitted by Siarhei Siamashka to glibc, which was not adopted by glibc. However, it was adopted in the MAEMO project. The feature of this version is that the pld offset increases from small to large in order to adapt to the change of block size.
Test plan introduction

The test plan is very simple. It refers to the implementation of the moving memory tester [21]. The execution steps are as follows:
First, verify the correctness of each implementation. The main method is to fill random content with random block size & offset, then perform memcpy operation, and then use the system's memcmp function to verify the two blocks of memory.
Then call each implementation 400 times with different block sizes. If total copy size < 1MB, increase count until the requirement is met. Time the total operation.

Calculate memcpy bandwidth using the formula total copy size / total copy time.

The block size mentioned above = 2^n ( 7 <= n <= 23 ).

In addition, this test program runs in the openembedded-gpe software system. QualComm / Samsung hardware platforms only provide Android software systems, and it is a bit troublesome to switch to the GPE system, so the chroot method is used for testing. Regardless of the software platform, after entering the graphics system, wait for the black screen and then test.

The following table shows the statistics of the operating environment.

Hardware platform	Software Environment
imx51 800MHZ	openembedded-gpe
imx53 1000MHZ	openembedded-gpe
imx53 800MHZ	openembedded-gpe
msm7230 1000MHZ	Android + chroot
msm7230 800MHZ	Android + chroot
msm8250 1000MHZ	Android + chroot
omap3430 550MHZ	openembedded-gpe
omap3730 1000MHZ	openembedded-gpe
s5pc100 665MHZ	Android + chroot
s5pc110 1000MHZ	Android + chroot

The following table shows the statistics of the test items.

Implementation	i.MX51	i.MX53	Snapdragon	s5pc1xx	omap3430	omap3730
int32_cpy	YES	YES	YES	YES	YES	YES
vec_cpy	YES	YES	YES	YES	YES	YES
arm9_memcpy	YES	YES	YES	YES	YES	YES
armv5te_memcpy	YES	YES	YES	YES	YES	YES
memcpy_arm	YES	YES	YES	YES	YES	YES
memcpy_arm_nopld	YES	NO	YES	YES	YES	YES
memcpy_neon	YES	YES	YES	YES	YES	YES
memcpy_neon_nopld	YES	NO	YES	YES	YES	YES
memcpy_armneon	YES	YES	YES	YES	YES	YES
memcpy_ple_arm	YES	YES	N/A	YES	N/A	YES
memcpy_ple_neon	YES	YES	N/A	YES	N/A	YES
memcpy_arm_codesourcery	YES	YES	YES	YES	YES	YES
memcpy_arm_codesourcery_nopld	YES	NO	YES	YES	YES	YES
memcpy_neon_codesourcery	YES	YES	YES	YES	YES	YES
memcpy_neon_codesourcery_nopld	YES	NO	YES	YES	YES	YES
memcpy_neon_qualcomm	YES	YES	YES	YES	YES	YES
memcpy_neon_qualcomm_nopld	YES	NO	YES	YES	YES	YES
memcpy_neon_siarhei	YES	YES	YES	YES	YES	YES

Note 1: Because the i.MX53 EVK board malfunctioned, all no pld test items could not be tested.

Note 2: After opening the preload engine for omap3430, the test generated an illegal instruction error and failed to test the ple test items.

Note 3: It is a bit troublesome to replace the Snapdragon kernel, and the test items of ple cannot be tested.

Test results and analysis

The following chart is limited by the page size and cannot show the details well. The specific data and large picture can be viewed in the data sheet document.
Performance of various implementations on various hardware platforms
imx51 800MHZ
imx53 1000MHZ
imx53 800MHZ
msm7230 1000MHZ
msm7230 800MHZ
msm8250 1000MHZ
omap3430 550MHZ
omap3730 1000MHZ
s5pc100 665MHZ
s5pc110 1000MHZ
1. summary
There is a performance plateau between block size = 512B ~ 32K, and there is also a performance turning point at block size = 256K.
This feature reflects the impact of 32KB L1 / 256KB L2 cache.
The poor performance of less than 512B may be related to the loss caused by the block alignment technique at the beginning of the function call, or it may be related to the block size being too small and the cache being not ready before the function ends.
The document [] is still instructive for the implementation of memcpy. However, with the optimization of the chip and the improvement of the process, some rules have changed.
The performance of NEON instructions is always higher than that of ARM instructions. However, using ARM/NEON instructions alternately does not always lead to performance improvements. With the development, the performance gap between ARM/NEON instructions is narrowing.
The pld instruction is becoming less and less useful. On older chips, such as the omap3430, the same implementation can get a 50% performance improvement with the pld instruction. On newer chips, such as the msm7230/s5pc110, there is basically no difference in performance, and even the same implementation without the pld instruction has a slight performance improvement. This may be because the pld instruction has no effect, but instead wastes clock cycles in each loop.
The performance of the implementation using ple instructions is disappointing. This also shows that if there is no good model design, software intervention in the use of cache can easily cause performance degradation.
The Snapdragon platform has the best cache performance. Beyond the cache, the performance of various implementations (including C language implementation) is basically the same and very efficient. This may be due to the design of the Snapdragon platform's 13-stage load/store pipeline[][]. This feature is good for high-level languages. Because programming cannot use assembly language in many places, developers do not have to consider assembly optimization too much and can rely on the compiler.
The s5pc110 platform has the best average performance. After out of cache, NEON achieves the best performance, which is basically the same.
Performance of various hardware platforms under small/big block size

Since the block size can be divided into two types: fit in cache and out of cache, two profiles are made for comparative analysis.

8K block size. This reflects the performance when it fits in cache.
8M block size. Reflects out of cache performance.

Realization of Möns Rullgörd

Because M?ns Rullg?rd's implementation is the simplest, with only a loop body and no other judgment code, it can be considered as an implementation that reflects the speed limit of the platform.
ARM Implementation
NEON Implementation
summary
The performance of NEON instructions is always higher than that of ARM instructions. With the development, the performance gap between ARM and NEON instructions is narrowing.
The performance of the ARM/NEON version is worse than that of the NEON version when fit in cache is used alternately. When out of cache is used, the performance of the two versions is basically the same.
Under the condition of fit in cache, the Snapdragon platform has the best performance, surpassing the second place s5pc110 by about 43%.
Under out of cache conditions, s5pc110 has the best performance, surpassing the second place omap3730 by about 57%.
On the same hardware platform, overclocking (such as i.MX53 800/1000MHZ & msm7x30 800/1000MHZ) has little effect on memory performance.
Practical ARM/NEON implementation on various hardware platforms

By comparing the performance of the same implementation on different hardware platforms and combining it with the charts in the previous section, we can evaluate the average performance, or adaptability, of an implementation.
ARM Implementation
NEON Implementation
summary
The same implementation may perform differently on different hardware platforms. No one implementation is the best on all platforms.
Codesourcery version, including ARM/NEON version, has good adaptability. It is worthy of being a toolchain company.
Siarhei Siamashka's NEON version is also very adaptable. NOKIA's technical strength is also very strong. This guy seems to be the main force in NEON optimization in the pixman project.
The Qualcomm version is only suitable for Snapdragon platform. Looking forward to testing it on msm8660 and subsequent chips in the future.
Summarize
There is a performance plateau between block size = 512B ~ 32K, and there is also a performance transition at block size = 256K. This feature reflects the impact of 32KB L1 / 256KB L2 cache.
The performance of NEON instructions is always higher than that of ARM instructions. The performance difference between ARM and NEON instructions is narrowing as the development progresses. When using ARM and NEON instructions alternately, the performance is often worse than the NEON version.
Without a good model design, software intervention in cache usage can easily lead to performance degradation.
Under the fit in cache condition, the Snapdragon platform has the best performance.
Under out of cache conditions, s5pc110 has the best performance.
On the same hardware platform, overclocking has little impact on memory performance.
The same implementation may perform differently on different hardware platforms. No one implementation is the best on all platforms.
Further testing

Because in the Cortex-A8 series chips, the NEON module is required. In the Cortex-A9 series chips, the NEON module is optional. Because the NEON module affects the die size, thus affecting power consumption and cost. Therefore, some Cortex-A9 chips, such as Nvidia Tegra250, do not have a NEON module. So what impact will the presence or absence of a NEON module have on software performance?

Reference address：Memcpy ARM/NEON assembly performance test on Cortex-A8 platform

Previous article：Soft floating point and hard floating point issues when compiling ARM code with ARMCC and GCC
Next article：C code optimization method for embedded platform ARM

Recommended ReadingLatest update time:2024-11-16 14:40

Preparation before calling the main function in ARM bare metal driver

Hardware 1. Turn off the CPU watchdog 2 Configure the CPU working clock 3. The program needs to run in SDRAM, so SDRAM must be initialized Software 1 To run a function, stack space is required, so the stack pointer SP must be initialized 2 Set the return address of the main function 3 Calling main 4. Cleaning up

[Microcontroller]

ARM's 9 addressing modes

ARM's 9 addressing modes 1) Immediate addressing The operand is an immediate value, prefixed with "#", and "0x" is used to represent a hexadecimal value. example: MOV R0,#0xFF00 ;0xFF00 - R0 SUBS R0,R0,#1 ;R0 – 1 - R0 2) Register addressing The value of the operand is in the register, and the register value is direc

[Microcontroller]

Miniaturized remote monitoring intelligent power supply system based on ARM Cortex-M3

Traditional power supply maintenance adopts a manual maintenance management mode, while the intelligent power supply monitoring system is based on embedded technology, computer technology, communication technology, etc., realizing the transformation of the power supply system to an intelligent and automated managem

[Power Management]

Miniaturized remote monitoring intelligent power supply system based on ARM Cortex-M3

The difference and usage of ARM function pointer and pointer function

When I was learning ARM, I found that it was easy to confuse "pointer function" with "function pointer", so today, I want to clarify it once and for all, and I found some information and some summary from everyone, and I organized it here to share with you. First, their definitions: 1. A pointer function is a func

[Microcontroller]

Design of Audio Analyzer Based on ARM Microcontroller LPC2148

0 Introduction With the rapid development of microelectronics and information technology, digital technology represented by single-chip microcomputers is developing rapidly. Single-chip microcomputers are widely used in the control of various instruments, computer network communication and data transmission

[Microcontroller]

Embedded processor ARM (Advanced RISC Machines) technology and chips

ARM (Advanced RISC Machines) has become the world's leading provider of embedded RISC processor intellectual property (IP) after more than a decade of unremitting efforts since it launched the first embedded RISC core ARM6 in 1991. ARM is a conceptually innovative company that first proposed the concept of public sale

[Microcontroller]

ARM Linux.2.6.34 kernel porting

ARM-LINUX-GCC version 4.3.2. Installed in /usr/local/arm/4.3.2. Step 1: Modify the linux-2.6.34/Makefile file, find the following two pieces of information in the makefile and modify them ARCH? = arm CROSS_COMPILE? =/usr/local/arm/4.3.2/bin/arm-linux- Step 2: Modify the platform input clock Modify the platform clo

[Microcontroller]

Linux arm mmu basics

ARM MMU page table framework First, here is a general block diagram of the page table structure of the arm mmu (the following discussion is gradually expanded from this diagram): The above is the typical structure of the arm's page table diagram: that is, the secondary page table structure: The first-level page ta

[Microcontroller]

Popular Resources
Popular amplifiers

Memcpy ARM/NEON assembly performance test on Cortex-A8 platform

Preface

Platform Introduction

i.MX5 family

Snapdragon family

s5pc family

omap3 family

Introduction to memcpy implementation

C language version

ARM assembly version

NEON assembly version

Test plan introduction

Test results and analysis

Performance of various implementations on various hardware platforms

imx51 800MHZ

imx53 1000MHZ

imx53 800MHZ

msm7230 1000MHZ

msm7230 800MHZ

msm8250 1000MHZ

omap3430 550MHZ

omap3730 1000MHZ

s5pc100 665MHZ

s5pc110 1000MHZ

summary

Performance of various hardware platforms under small/big block size

Realization of Möns Rullgörd

ARM Implementation

NEON Implementation

summary

Practical ARM/NEON implementation on various hardware platforms

ARM Implementation

NEON Implementation

summary

Summarize

Further testing