NEON is ARM's Advanced SIMD extension. SIMD (Single Instruction, Multiple Data) is a form of parallel processing in which a single instruction operates on several data elements at once. Compared with processing one element per instruction, this can greatly increase throughput on data-parallel workloads.
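In the AArch64 state, ARMv8 has 31 general-purpose 64-bit registers, X0-X30. Register number 31 is special: depending on the instruction, it refers to either the zero register (XZR/WZR) or the stack pointer (SP). Each general-purpose register can be accessed as a 64-bit X register or as a 32-bit W register, which is the lower 32 bits of the corresponding X register.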
ARMv8 also has 32 128-bit vector registers, V0-V31. The same registers can be accessed at narrower widths: as 128-bit Q0-Q31, 64-bit D0-D31, 32-bit S0-S31, 16-bit H0-H31, or 8-bit B0-B31. Each narrower name refers to the low bits of the corresponding V register.
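To make the register views concrete, here is a minimal sketch in portable C. It only models one 128-bit register as a union; the type `vreg128` and the helper `vreg_iota` are hypothetical names chosen for illustration, not part of any ARM header. The member names follow the lane arrangements used in Neon assembly syntax (16B, 8H, 4S, 2D).

```c
#include <stdint.h>

/* Hypothetical model of one 128-bit V register: the same 16 bytes,
   viewed with the lane arrangements used in Neon assembly syntax. */
typedef union {
    uint8_t  b[16]; /* like Vn.16B: sixteen 8-bit lanes */
    uint16_t h[8];  /* like Vn.8H:  eight 16-bit lanes  */
    uint32_t s[4];  /* like Vn.4S:  four 32-bit lanes   */
    uint64_t d[2];  /* like Vn.2D:  two 64-bit lanes    */
} vreg128;

/* Fill the register model with the byte values 0, 1, ..., 15. */
static vreg128 vreg_iota(void) {
    vreg128 v;
    for (int i = 0; i < 16; i++) {
        v.b[i] = (uint8_t)i;
    }
    return v;
}
```

After `vreg128 v = vreg_iota();`, reading `v.h`, `v.s`, or `v.d` reinterprets the same 16 bytes as wider lanes. The numeric values seen through the wider views depend on the byte order of the machine.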
Let’s take a simple example to illustrate the benefits of using Neon.
Consider a very simple task. There are two arrays, each holding 16 × 1024 × 1024 unsigned integers, and every value is in the range 0 to 255. Add the arrays element by element; if a sum is greater than 255, store 255 instead (a saturating add).
A plain C implementation:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAX_LEN (16 * 1024 * 1024)

typedef unsigned char uint_8t;
typedef unsigned short uint_16t;

int main() {
    double start_time;
    double end_time;
    uint_8t *dist1 = (uint_8t *)malloc(sizeof(uint_8t) * MAX_LEN);
    uint_8t *dist2 = (uint_8t *)malloc(sizeof(uint_8t) * MAX_LEN);
    uint_16t *ref_out = (uint_16t *)malloc(sizeof(uint_16t) * MAX_LEN);

    // 2 sets of data are randomly assigned
    for (int i = 0; i < MAX_LEN; i++) {
        dist1[i] = rand() % 256;
        dist2[i] = rand() % 256;
    }

    start_time = clock();
    for (int i = 0; i < MAX_LEN; i++) {
        ref_out[i] = dist1[i] + dist2[i];
        if (ref_out[i] > 255) {
            ref_out[i] = 255;  // saturate at 255
        }
    }
    end_time = clock();

    // clock() counts ticks; divide by CLOCKS_PER_SEC to get seconds
    printf("C use time %f s\n", (end_time - start_time) / CLOCKS_PER_SEC);
    return 0;
}
```

The C implementation performs one addition per instruction. Every input and output fits in 8 bits, yet each addition occupies a whole register to process a single byte, so most of the register width is wasted.

Using Neon for acceleration:

```asm
    .text
    .global asm_add_neon
asm_add_neon:
LOOP:
    LDR   Q0, [X0], #0x10           // load 16 bytes from the first array, X0 += 16
    LDR   Q1, [X1], #0x10           // load 16 bytes from the second array, X1 += 16
    UQADD V0.16B, V0.16B, V1.16B    // unsigned saturating add on 16 byte lanes
    STR   Q0, [X2], #0x10           // store 16 result bytes, X2 += 16
    SUBS  W3, W3, #0x10             // len is a 32-bit int, so count down in W3
    B.NE  LOOP
    RET
```

Q0 holds 16 bytes from the first array and Q1 holds 16 bytes from the second; each iteration reads 128 bits (16 elements) from each array. The unsigned saturating add instruction UQADD adds all 16 byte lanes at once and clamps each result at 255, and STR writes the result to the output buffer addressed by X2. The counter decreases by 16 per iteration, so the length must be a nonzero multiple of 16.
Compare the performance of the C implementation and the Neon implementation:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAX_LEN (16 * 1024 * 1024)

typedef unsigned char uint_8t;
typedef unsigned short uint_16t;

// Implemented in assembly; len must be a nonzero multiple of 16
extern void asm_add_neon(uint_8t *dist1, uint_8t *dist2, uint_8t *out, int len);

int main() {
    double start_time;
    double end_time;
    uint_8t *dist1 = (uint_8t *)malloc(sizeof(uint_8t) * MAX_LEN);
    uint_8t *dist2 = (uint_8t *)malloc(sizeof(uint_8t) * MAX_LEN);
    uint_8t *out = (uint_8t *)malloc(sizeof(uint_8t) * MAX_LEN);
    uint_16t *ref_out = (uint_16t *)malloc(sizeof(uint_16t) * MAX_LEN);

    for (int i = 0; i < MAX_LEN; i++) {
        dist1[i] = rand() % 256;
        dist2[i] = rand() % 256;
    }

    // Reference result in plain C
    start_time = clock();
    for (int i = 0; i < MAX_LEN; i++) {
        ref_out[i] = dist1[i] + dist2[i];
        if (ref_out[i] > 255) {
            ref_out[i] = 255;
        }
        //printf("%d dist1[%d] dist2[%d] refout[%d] \n", i, dist1[i], dist2[i], ref_out[i]);
    }
    end_time = clock();
    printf("C use time %f s\n", (end_time - start_time) / CLOCKS_PER_SEC);

    // Same computation with the Neon assembly routine
    start_time = clock();
    asm_add_neon(dist1, dist2, out, MAX_LEN);
    end_time = clock();
    printf("asm use time %f s\n", (end_time - start_time) / CLOCKS_PER_SEC);

    // Check the Neon result against the reference
    for (int i = 0; i < MAX_LEN; i++) {
        if (out[i] != ref_out[i]) {
            printf("ERROR:%d\n", i);
            return -1;
        }
    }
    printf("PASS!\n");
    return 0;
}
```

In this test, the Neon assembly implementation runs about 16 times faster than the pure C implementation, which matches the 16 elements processed per UQADD instruction. The exact ratio depends on the compiler's optimization level.
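The same saturating add can also be sketched in C with Neon intrinsics instead of hand-written assembly. This is not from the original article: the function names `sat_add_u8` and `add_sat_u8` are hypothetical, and the block assumes a compiler that defines `__ARM_NEON` and ships `arm_neon.h` when targeting ARM. On other targets it falls back to the scalar loop, which also handles any tail when the length is not a multiple of 16.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar saturating add of two unsigned bytes (reference behaviour). */
static uint8_t sat_add_u8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Hypothetical helper: out[i] = min(a[i] + b[i], 255) for any len. */
void add_sat_u8(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t len) {
    size_t i = 0;
#if defined(__ARM_NEON)
    /* Process 16 bytes per iteration, as the assembly loop does. */
    for (; i + 16 <= len; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);       /* load 16 bytes */
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(out + i, vqaddq_u8(va, vb));  /* saturating add, then store */
    }
#endif
    /* Scalar tail, or the whole array when Neon is unavailable. */
    for (; i < len; i++) {
        out[i] = sat_add_u8(a[i], b[i]);
    }
}
```

A call such as `add_sat_u8(dist1, dist2, out, MAX_LEN)` would take the place of `asm_add_neon` in the comparison program. Unlike the assembly routine, this version does not require the length to be a multiple of 16.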