Memcpy ARM/NEON assembly performance test on Cortex-A8 platform

Publisher: 小悟空111 | Latest update time: 2016-07-14 | Source: eefocus
This article compares the performance of memcpy implemented in C, ARM assembly, and NEON assembly on Cortex-A8 chips based on the ARMv7 architecture (Freescale i.MX51 / i.MX53, Qualcomm msm8x50 / msm7x30, Samsung s5pc100 / s5pc110, TI omap3430 / omap3730). It also analyzes how NEON instructions (the NEON memory bit width varies between processors, from 64-bit to 128-bit) and cache preload (preload engine instructions) affect performance. The conclusions are:

  1. There is a performance plateau for copy block sizes from 512B to 32K, and another performance turning point at block size = 256K. This reflects the influence of the chips' 32KB L1 / 256KB L2 caches.
  2. NEON instructions always outperform ARM instructions, although the gap between them is narrowing as chips evolve. Alternating ARM and NEON instructions often performs worse than the pure NEON version.
  3. Without a good model design, software intervention in cache usage easily causes performance degradation.
  4. When the data fits in cache, the Snapdragon platform performs best.
  5. Out of cache, the s5pc110 performs best.
  6. On the same hardware platform, overclocking has little effect on memory performance.
  7. The same implementation performs differently on different hardware platforms. No implementation is the best on all platforms.
  1. Preface

    In the C runtime library, memcpy is an important function with a significant impact on application performance. With the Cortex-A8[1][2], ARM chips gained not only much higher clock frequencies but also major architectural improvements. The newly added NEON instructions are similar to the MMX instructions on the x86 platform and are designed for multimedia. However, because they can process 64 bits of data at a time, they also help improve memcpy performance. This article tests various memcpy implementations that use NEON[2] instructions, explores the impact of the NEON and preload instructions on performance, and examines how those impacts change as chips are optimized and manufacturing processes improve. The author also hopes that chip designers, with an understanding of these software implementations, can offer explanations that point the way to further performance gains.

  2. Platform Introduction

    The test platforms are the Cortex-A8 platforms the author has encountered in work projects:

  • Freescale i.MX51 / i.MX53
  • Qualcomm msm8x50 / msm7x30
  • Samsung s5pc100 / s5pc110
  • TI omap3430 / omap3730
  2.1 i.MX5 family

    For an introduction to the i.MX5 family, see [6][7]. The i.MX535 can run at two frequencies: 800 MHz and 1000 MHz.

  • i.MX515
    • freq: 800 MHz
    • cache size: 32KB/32KB I/D Cache and 256KB L2 Cache
    • cache line: 64-bit wide (NEON), 64-byte / line
  • i.MX535
    • freq: 800 MHz / 1000 MHz
    • cache size: 32KB/32KB I/D Cache and 256KB L2 Cache
    • cache line: 64-bit wide (NEON), 64-byte / line
  2.2 Snapdragon family

    For an introduction to Snapdragon, see [8][9][10]. The msm7x30 can run at two frequencies: 800 MHz and 1000 MHz. The Snapdragon cache is unusual: it is 128-bit wide (NEON) with 128-byte lines, whereas the standard Cortex-A8 is 64-bit wide (NEON) with 64-byte lines. This has a significant impact on performance.

  • msm8x50
    • freq: 1000 MHz
    • cache size: 32KB/32KB I/D Cache and 256KB L2 Cache
    • cache line: 128-bit wide (NEON), 128-byte / line
  • msm7x30
    • freq: 800 MHz / 1000 MHz
    • cache size: 32KB/32KB I/D Cache and 256KB L2 Cache
    • cache line: 128-bit wide (NEON), 128-byte / line
  2.3 s5pc family

    For the s5pc family reference platform, see [11].

  • s5pc100
    • freq: 665 MHz
    • cache size: 32KB/32KB I/D Cache and 256KB L2 Cache
    • cache line: 64-bit wide (NEON), 64-byte / line
  • s5pc110
    • freq: 1000 MHz
    • cache size: 32KB/32KB I/D Cache and 512KB L2 Cache
    • cache line: 64-bit wide (NEON), 64-byte / line
  2.4 omap3 family

    For the omap3 family reference platform, see [12][13][14].

  • omap3430
    • freq: 550 MHz
    • cache size: 16KB/16KB I/D Cache and 256KB L2 Cache
    • cache line: 64-bit wide (NEON), 64-byte / line
  • omap3730
    • freq: 1000 MHz
    • cache size: 32KB/32KB I/D Cache and 256KB L2 Cache
    • cache line: 64-bit wide (NEON), 64-byte / line
  3. Introduction to memcpy implementations

    There are three kinds of memcpy implementation on the ARM platform:

  1. C language version
  2. ARM assembly version
  3. NEON assembly version

    ARM's document [4] describes the implementation of memcpy well, and others [5][19][20] elaborate further on the principles and techniques. In brief:

  • NEON instructions can process 64 bits of data at a time, which is more efficient.
  • The NEON unit has a direct connection to the L1/L2 cache, and better performance can be achieved once this is enabled at the OS level.
  • The ARM and NEON pipelines may run asynchronously, so alternating ARM and NEON instructions may achieve better performance.
  • Within one loop iteration, use as many registers as possible to copy more data and keep the pipeline efficient. Currently, the largest block processed per iteration is 128 bytes.
  • Cache handling requires particular care.
    • memcpy is a single sequential pass with no backtracking, so a cache preload strategy can improve the hit rate. The assembly versions therefore use the pld instruction to hint that ARM should fill cache lines in advance.
    • The offset used in the pld instruction matters a great deal. It is usually a multiple of 64 bytes. On the ARMv5TE platform, one pld instruction is used per loop iteration. The Cortex-A8 is faster and needs 2~3 pld instructions per iteration to fill the cache lines. Each iteration thus spends 2~3 extra clock cycles in exchange for a higher cache hit rate, which is worth the cost.
    • Furthermore, the Cortex-A8 architecture provides preload engine (ple) instructions, which give software deeper control over the cache and can thereby improve the hit rate. However, using the preload engine instructions from user space requires patching the OS to grant permission.
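
    The preload idea above can be pictured in C rather than assembly. The following is only a minimal sketch: it uses the gcc/clang builtin __builtin_prefetch, which compilers typically lower to pld on ARMv7, and the names prefetch_memcpy, BLOCK_BYTES and PREFETCH_DIST as well as the values 64 and 256 are illustrative assumptions, not taken from any implementation tested in this article.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning values for illustration only. */
#define BLOCK_BYTES   64u   /* bytes copied per loop iteration */
#define PREFETCH_DIST 256u  /* how far ahead of the read pointer to prefetch */

/* Copy n bytes from src to dst while hinting upcoming source data into
 * the cache. The two regions must not overlap. */
static void *prefetch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

    while (n >= BLOCK_BYTES) {
        /* Only prefetch addresses that are still inside the source buffer. */
        if (n > PREFETCH_DIST)
            __builtin_prefetch(s + PREFETCH_DIST, 0 /* read */, 0 /* low reuse */);

        /* Fixed-size inner copy; an optimizing compiler can widen this. */
        for (size_t i = 0; i < BLOCK_BYTES; i++)
            d[i] = s[i];

        d += BLOCK_BYTES;
        s += BLOCK_BYTES;
        n -= BLOCK_BYTES;
    }

    /* Tail of fewer than BLOCK_BYTES bytes. */
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];

    return dst;
}
```

    The prefetch distance is the knob that corresponds to the pld offset discussed above; a real implementation would tune it per platform.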

     

  3.1 C language version

    The C language version serves mainly as a baseline. Two implementations are used:

  1. 32-bit wide copy, labeled int32_cpy below.
  2. 16-byte wide copy, labeled vec_cpy below. The trick in this implementation is to use the gcc vector extension "__attribute__ ((vector_size(16)))" to express a 16-byte wide copy at the C level and leave the actual instruction selection to the compiler.

    It is worth noting that the compiler does not insert pld instructions on its own, because it cannot determine the application's memory access pattern.
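
    As a rough illustration of the two C variants, here is a minimal sketch. The function names match the labels used in this article, but the bodies are an assumption about how such copies are commonly written, not the author's actual test code. The helper types word_t and vec_t are hypothetical; the may_alias attribute is added so the casts do not violate strict aliasing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper types. may_alias lets them access memory of any type. */
typedef uint32_t word_t __attribute__((may_alias));
typedef uint32_t vec_t  __attribute__((vector_size(16), may_alias));

/* 32-bit wide copy. n must be a multiple of 4 and both pointers
 * must be 4-byte aligned. */
static void int32_cpy(void *dst, const void *src, size_t n)
{
    word_t *d = dst;
    const word_t *s = src;

    assert(n % sizeof(word_t) == 0);
    for (size_t i = 0; i < n / sizeof(word_t); i++)
        d[i] = s[i];
}

/* 16-byte wide copy using the gcc vector extension. n must be a multiple
 * of 16 and both pointers must be 16-byte aligned. The compiler decides
 * which load/store instructions implement each vector assignment. */
static void vec_cpy(void *dst, const void *src, size_t n)
{
    vec_t *d = dst;
    const vec_t *s = src;

    assert(n % sizeof(vec_t) == 0);
    for (size_t i = 0; i < n / sizeof(vec_t); i++)
        d[i] = s[i];
}
```

    Neither function handles unaligned pointers or odd lengths, which a production memcpy has to do before and after its main loop.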

  3.2 ARM assembly version

    The ARM assembly version also serves mainly as a baseline. Two implementations are used:

  1. Siarhei Siamashka's implementation [15], labeled arm9_memcpy below. It is optimized for the Nokia N770.
  2. Nicolas Pitre's implementation [16], labeled armv5te_memcpy below. This is the default ARM memcpy implementation in glibc.
  3.3 NEON assembly version

    The NEON assembly version uses four implementations:

  1. Måns Rullgård's implementation [19]. This is the simplest implementation: it handles only 128-byte-aligned blocks and does not detect blocks that are not 128-byte aligned, so it is not a practical version. However, this kind of implementation can be used to examine the performance limit of memcpy. He provides four implementations in total:
    • A pure ARM assembly implementation, labeled memcpy_arm. As a comparison experiment, the author also removed the pld instruction to examine its impact; that variant is labeled memcpy_arm_nopld.
    • A pure NEON assembly implementation, labeled memcpy_neon. The variant with the pld instruction removed is labeled memcpy_neon_nopld.
    • An implementation that alternates ARM and NEON instructions, labeled memcpy_armneon. The variant with the pld instruction removed is labeled memcpy_armneon_nopld.
    • A ple + NEON implementation, labeled memcpy_ple_neon. As a comparison test, the author also replaced the NEON instructions with ARM instructions to examine the impact of the ple instruction on ARM versus NEON code; that variant is labeled memcpy_ple_arm. Because this implementation requires patching the Linux kernel, it did not succeed on the omap3430 platform. Replacing the kernel on the Snapdragon platform is troublesome, so it was not tested there.
  2. The CodeSourcery implementation [17]. This is the implementation in the glibc shipped with the CodeSourcery toolchain. There are two versions:
    • An ARM implementation, labeled memcpy_arm_codesourcery. The variant with the pld instruction removed is labeled memcpy_arm_codesourcery_nopld.
    • A NEON implementation, labeled memcpy_neon_codesourcery. This is also the NEON implementation used in Android bionic. The variant with the pld instruction removed is labeled memcpy_neon_codesourcery_nopld.
  3. The Qualcomm implementation [18], labeled memcpy_neon_qualcomm. This is an optimized version developed by Qualcomm for the Snapdragon platform in the Code Aurora Forum, mainly optimized for the 8660/8650A platform. It is designed for an L2 cache line size of 128 bytes, and its pld offset is set to a particularly large value, so it brings no benefit on other Cortex-A8 platforms. The author therefore changed the pld offset to the value used in Måns Rullgård's implementation. The variant with the pld instruction removed is labeled memcpy_neon_qualcomm_nopld.
  4. Siarhei Siamashka's implementation [20], labeled memcpy_neon_siarhei. This is the NEON version Siarhei Siamashka submitted to glibc. glibc did not adopt it, but the MAEMO project did. Its distinguishing feature is that the pld offset grows from small to large in order to adapt to different block sizes.
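
    All of the implementations listed above are hand-written assembly. To give a feel for what a NEON inner loop does, the following minimal sketch uses NEON intrinsics from arm_neon.h instead (vld1q_u8 and vst1q_u8 are standard intrinsics). The function name neon_block_cpy, the HAVE_NEON macro and the 64-byte block size are assumptions made for illustration, and on targets without NEON the sketch falls back to the C library so that it still compiles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Copy n bytes in 64-byte blocks. The two regions must not overlap. */
static void *neon_block_cpy(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    assert(dst != NULL && src != NULL);

    while (n >= 64) {
#if HAVE_NEON
        /* Four 128-bit loads followed by four 128-bit stores. */
        uint8x16_t q0 = vld1q_u8(s);
        uint8x16_t q1 = vld1q_u8(s + 16);
        uint8x16_t q2 = vld1q_u8(s + 32);
        uint8x16_t q3 = vld1q_u8(s + 48);
        vst1q_u8(d, q0);
        vst1q_u8(d + 16, q1);
        vst1q_u8(d + 32, q2);
        vst1q_u8(d + 48, q3);
#else
        memcpy(d, s, 64);  /* portable stand-in when NEON is unavailable */
#endif
        d += 64;
        s += 64;
        n -= 64;
    }

    memcpy(d, s, n);  /* remaining tail of fewer than 64 bytes */
    return dst;
}
```

    With intrinsics the compiler chooses registers and instruction order, so this does not reproduce the scheduling or pld placement that distinguishes the assembly versions compared in this article.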
  4. Test plan introduction

    The test plan is simple and is modeled on the implementation of the moving memory tester [21]. The steps are:

  1. First, verify the correctness of each implementation: fill memory with random content at a random block size and offset, perform the memcpy, then compare the two blocks with the system's memcmp function.
  2. Then call each implementation 400 times at each block size. If the total copy size is < 1MB, increase the call count until it reaches 1MB. Time the whole operation.
  3. Calculate memcpy bandwidth as total copy size / total copy time.
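
    The steps above can be sketched as a small harness. This is not the author's test program: the names memcpy_fn, verify_memcpy and measure_bandwidth are hypothetical, the maximum random offset of 64 bytes is an assumption, and clock() is used for timing only because it is available in standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef void *(*memcpy_fn)(void *dst, const void *src, size_t n);

/* Step 1: returns 1 if fn copies correctly for `trials` random sizes
 * and offsets, 0 otherwise (or if allocation fails). */
static int verify_memcpy(memcpy_fn fn, size_t max_size, int trials)
{
    enum { MAX_OFFSET = 64 };  /* assumed offset range */
    unsigned char *src = calloc(max_size + MAX_OFFSET, 1);
    unsigned char *dst = calloc(max_size + MAX_OFFSET, 1);
    int ok = (src != NULL && dst != NULL);

    for (int t = 0; ok && t < trials; t++) {
        size_t n = (size_t)rand() % (max_size + 1);
        size_t src_off = (size_t)rand() % MAX_OFFSET;
        size_t dst_off = (size_t)rand() % MAX_OFFSET;

        for (size_t i = 0; i < n; i++)
            src[src_off + i] = (unsigned char)rand();

        fn(dst + dst_off, src + src_off, n);
        ok = (memcmp(dst + dst_off, src + src_off, n) == 0);
    }

    free(src);
    free(dst);
    return ok;
}

/* Steps 2 and 3: time `calls` copies of `block` bytes and return
 * total copy size / total copy time in bytes per second.
 * Returns 0.0 if allocation fails or the timer shows no elapsed time. */
static double measure_bandwidth(memcpy_fn fn, size_t block, size_t calls)
{
    assert(block > 0);
    unsigned char *src = calloc(block, 1);
    unsigned char *dst = calloc(block, 1);
    double bandwidth = 0.0;

    if (src != NULL && dst != NULL) {
        clock_t start = clock();
        for (size_t i = 0; i < calls; i++)
            fn(dst, src, block);
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

        /* Read the result so the copies are not optimized away. */
        volatile unsigned char sink = dst[0];
        (void)sink;

        if (secs > 0.0)
            bandwidth = (double)block * (double)calls / secs;
    }

    free(src);
    free(dst);
    return bandwidth;
}
```

    For example, measure_bandwidth(memcpy, 8192, 400) would time 400 copies of an 8K block with the C library's memcpy.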

     

    The block sizes used are 2^n bytes for 7 <= n <= 23, that is, from 128 bytes to 8MB.
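
    To make the call-count rule concrete, here is a minimal sketch of how the number of calls per block size could be derived. It assumes that "1MB" means 2^20 bytes and that the count is raised just far enough for block size × calls to reach that total; the function name calls_for_block and the constant names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum {
    MIN_CALLS      = 400,  /* baseline call count from the test plan */
    MIN_LOG2_BLOCK = 7,    /* smallest block: 2^7  = 128 bytes */
    MAX_LOG2_BLOCK = 23    /* largest block:  2^23 = 8 MB */
};

/* Assumption: 1 MB is taken as 2^20 bytes. */
#define MIN_TOTAL_BYTES ((size_t)1 << 20)

/* Number of calls for one block size: at least MIN_CALLS, and enough
 * that block * calls >= MIN_TOTAL_BYTES. */
static size_t calls_for_block(size_t block)
{
    assert(block > 0);
    size_t needed = (MIN_TOTAL_BYTES + block - 1) / block;  /* ceiling division */
    return needed > (size_t)MIN_CALLS ? needed : (size_t)MIN_CALLS;
}
```

    Under these assumptions a 128-byte block needs 8192 calls (400 calls would move only 51200 bytes), a 2K block needs 512 calls, and every block of 4K or more uses the baseline 400 calls.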

     

    The test program runs in the openembedded-gpe software system. The Qualcomm and Samsung hardware platforms ship only with Android, and switching them to the GPE system is troublesome, so they were tested via chroot. On every software platform, the test is started only after the graphical system has come up and the screen has gone black.

     

    The following table lists the software environment used on each hardware platform.

    Hardware platform    Software environment
    imx51 800 MHz        openembedded-gpe
    imx53 1000 MHz       openembedded-gpe
    imx53 800 MHz        openembedded-gpe
    msm7230 1000 MHz     Android + chroot
    msm7230 800 MHz      Android + chroot
    msm8250 1000 MHz     Android + chroot
    omap3430 550 MHz     openembedded-gpe
    omap3730 1000 MHz    openembedded-gpe
    s5pc100 665 MHz      Android + chroot
    s5pc110 1000 MHz     Android + chroot

     

    The following table lists which test items were run on each platform.

    Implementation                  i.MX51  i.MX53  Snapdragon  s5pc1xx  omap3430  omap3730
    int32_cpy                       YES     YES     YES         YES      YES       YES
    vec_cpy                         YES     YES     YES         YES      YES       YES
    arm9_memcpy                     YES     YES     YES         YES      YES       YES
    armv5te_memcpy                  YES     YES     YES         YES      YES       YES
    memcpy_arm                      YES     YES     YES         YES      YES       YES
    memcpy_arm_nopld                YES     NO      YES         YES      YES       YES
    memcpy_neon                     YES     YES     YES         YES      YES       YES
    memcpy_neon_nopld               YES     NO      YES         YES      YES       YES
    memcpy_armneon                  YES     YES     YES         YES      YES       YES
    memcpy_ple_arm                  YES     YES     N/A         YES      N/A       YES
    memcpy_ple_neon                 YES     YES     N/A         YES      N/A       YES
    memcpy_arm_codesourcery         YES     YES     YES         YES      YES       YES
    memcpy_arm_codesourcery_nopld   YES     NO      YES         YES      YES       YES
    memcpy_neon_codesourcery        YES     YES     YES         YES      YES       YES
    memcpy_neon_codesourcery_nopld  YES     NO      YES         YES      YES       YES
    memcpy_neon_qualcomm            YES     YES     YES         YES      YES       YES
    memcpy_neon_qualcomm_nopld      YES     NO      YES         YES      YES       YES
    memcpy_neon_siarhei             YES     YES     YES         YES      YES       YES

    Note 1: Because the i.MX53 EVK board malfunctioned, none of the no-pld test items could be run on it.

    Note 2: After the preload engine was enabled on the omap3430, the test raised an illegal instruction error, so the ple test items could not be run.

    Note 3: Replacing the Snapdragon kernel is troublesome, so the ple test items could not be run.

  5. Test results and analysis

    The charts below are constrained by the page size and cannot show the details well. The detailed data and full-size charts can be found in the data sheet document.

  5.1 Performance of various implementations on various hardware platforms

  5.1.1 imx51 800 MHz

    [chart]

  5.1.2 imx53 1000 MHz

    [chart]

  5.1.3 imx53 800 MHz

    [chart]

  5.1.4 msm7230 1000 MHz

    [chart]

  5.1.5 msm7230 800 MHz

    [chart]

  5.1.6 msm8250 1000 MHz

    [chart]

  5.1.7 omap3430 550 MHz

    [chart]

  5.1.8 omap3730 1000 MHz

    [chart]

  5.1.9 s5pc100 665 MHz

    [chart]

  5.1.10 s5pc110 1000 MHz

    [chart]

  5.1.11 Summary

  1. There is a performance plateau for block sizes from 512B to 32K, and another performance turning point at block size = 256K.
    • This reflects the impact of the 32KB L1 / 256KB L2 caches.
    • The poor performance below 512B may be due to the overhead of the block alignment handling at the start of the function call, or to the block being so small that the cache is not yet warmed up before the function returns.
  2. The document [] is still instructive for implementing memcpy. However, as chips have been optimized and processes improved, some rules have changed.
    • NEON instructions always outperform ARM instructions, but alternating ARM and NEON instructions does not always improve performance. The gap between ARM and NEON instructions is narrowing as chips evolve.
    • The pld instruction is becoming less and less useful. On older chips such as the omap3430, the same implementation gains 50% in performance with the pld instruction. On newer chips such as the msm7230 / s5pc110, there is basically no difference, and the version without pld is sometimes even slightly faster. This may be because pld no longer has any effect and merely wastes clock cycles in each loop iteration.
    • The implementations using ple instructions perform disappointingly. This also shows that, without a good model design, software intervention in cache usage can easily degrade performance.
  3. The Snapdragon platform has the best in-cache performance. Out of cache, the various implementations (including the C implementations) perform basically the same and very efficiently. This may be due to the Snapdragon platform's 13-stage load/store pipeline[][]. This property is good for high-level languages: since much code cannot be written in assembly, developers need not worry much about assembly optimization and can rely on the compiler.
  4. The s5pc110 platform has the best average performance. Out of cache, the NEON implementations perform best, and their results are basically the same.
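
    The cache reading in the first point can be written down as a simple classification of block size against cache sizes. This is only a sketch of that reading: the names cache_tier and tier_for_block are hypothetical, and the model deliberately ignores the fact that memcpy touches both a source and a destination buffer, as well as associativity and other cache traffic.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TIER_L1, TIER_L2, TIER_MEMORY } cache_tier;

/* Classify a copy block by the smallest cache level whose capacity
 * is at least the block size. */
static cache_tier tier_for_block(size_t block, size_t l1_bytes, size_t l2_bytes)
{
    assert(l1_bytes <= l2_bytes);
    if (block <= l1_bytes)
        return TIER_L1;      /* expected on the performance plateau */
    if (block <= l2_bytes)
        return TIER_L2;      /* between the L1 and L2 turning points */
    return TIER_MEMORY;      /* out of cache */
}
```

    With 32KB L1 and 256KB L2, an 8K block lands in TIER_L1 and an 8M block in TIER_MEMORY. On the omap3430, whose L1 data cache is 16KB, a 32K block already falls into TIER_L2 under this model.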
  5.2 Performance of various hardware platforms at small/big block sizes

    Since block sizes fall into two classes, fit in cache and out of cache, two profiles are used for comparative analysis:

  • 8K block size. This reflects performance when the data fits in cache.
  • 8M block size. This reflects out-of-cache performance.
  5.2.1 Måns Rullgård's implementation

    Because Måns Rullgård's implementation is the simplest, consisting of only a loop body with no other decision code, it can be regarded as reflecting the speed limit of each platform.

    [charts]

  5.2.2 ARM implementation

    [charts]

  5.2.3 NEON implementation

    [charts]

  5.2.4 Summary

  1. NEON instructions always outperform ARM instructions. The gap between ARM and NEON instructions is narrowing as chips evolve.
  2. The version that alternates ARM and NEON instructions performs worse than the pure NEON version when the data fits in cache. Out of cache, the two versions perform basically the same.
  3. When the data fits in cache, the Snapdragon platform performs best, beating the second-place s5pc110 by about 43%.
  4. Out of cache, the s5pc110 performs best, beating the second-place omap3730 by about 57%.
  5. On the same hardware platform, overclocking (e.g. i.MX53 800/1000 MHz and msm7x30 800/1000 MHz) has little effect on memory performance.
  5.3 Practical ARM/NEON implementations on various hardware platforms

    By comparing the performance of the same implementation across hardware platforms, together with the charts in the previous section, we can evaluate an implementation's average performance, or adaptability.

  5.3.1 ARM implementation

    [charts]

  5.3.2 NEON implementation

    [charts]

  5.3.3 Summary

  1. The same implementation performs differently on different hardware platforms. No single implementation is the best on all platforms.
  2. The CodeSourcery versions, both ARM and NEON, adapt well across platforms, as one would expect from a toolchain company.
  3. Siarhei Siamashka's NEON version is also very adaptable, which speaks to Nokia's technical strength. He appears to be the main contributor to NEON optimization in the pixman project.
  4. The Qualcomm version is only suitable for the Snapdragon platform. It would be interesting to test it on the msm8660 and later chips.
  6. Summary

  1. There is a performance plateau for block sizes from 512B to 32K, and another performance turning point at block size = 256K. This reflects the impact of the 32KB L1 / 256KB L2 caches.
  2. NEON instructions always outperform ARM instructions, although the gap between them is narrowing as chips evolve. Alternating ARM and NEON instructions often performs worse than the pure NEON version.
  3. Without a good model design, software intervention in cache usage can easily lead to performance degradation.
  4. When the data fits in cache, the Snapdragon platform performs best.
  5. Out of cache, the s5pc110 performs best.
  6. On the same hardware platform, overclocking has little impact on memory performance.
  7. The same implementation performs differently on different hardware platforms. No single implementation is the best on all platforms.
  7. Further testing

    In the Cortex-A8 series the NEON module is mandatory, whereas in the Cortex-A9 series it is optional. Because the NEON module increases die size, and with it power consumption and cost, some Cortex-A9 chips, such as the Nvidia Tegra250, do not include one. What impact does the presence or absence of a NEON module have on software performance?
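
    Software that wants to pick between an ARM and a NEON memcpy at run time first has to know whether NEON is present. On 32-bit ARM Linux, the "Features" line of /proc/cpuinfo lists "neon" when the unit is available. The following minimal sketch checks such a line for a feature token; the function name has_cpu_feature is hypothetical and the feature strings in the usage note are made-up examples, not captured from any of the boards tested here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if `name` appears as a whole token in `features`
 * (tokens are separated by spaces, tabs or the leading colon). */
static int has_cpu_feature(const char *features, const char *name)
{
    size_t len = strlen(name);
    assert(len > 0);

    for (const char *p = strstr(features, name); p != NULL;
         p = strstr(p + 1, name)) {
        int starts = (p == features) || p[-1] == ' ' || p[-1] == '\t' || p[-1] == ':';
        int ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';
        if (starts && ends)
            return 1;
    }
    return 0;
}
```

    For an illustrative line such as "Features : swp half thumb fastmult vfp edsp neon vfpv3" the check for "neon" succeeds, while for a line that lists only "vfpv3 vfpv3d16" it fails, and a caller could then fall back to an ARM-only implementation.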
