ARM aarch64 assembly learning notes (IX): Using Neon instructions (I)

Publisher: 温柔之风 | Latest update time: 2021-11-30 | Source: eefocus | Keywords: ARM

NEON is an ARM technology based on the SIMD concept. SIMD (Single Instruction, Multiple Data) is a parallel processing technique in which a single instruction operates on several data elements at once. Compared with processing one element per instruction, this can greatly increase computing speed.


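ARMv8 (in its 64-bit AArch64 state) has 31 general-purpose 64-bit registers, X0-X30, plus one special register encoding whose meaning depends on the instruction: it acts as either the zero register or the stack pointer. Each general-purpose register can be accessed as a 64-bit X register or as a 32-bit W register (the lower 32 bits of the corresponding X register).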
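The relationship between the X and W views can be modelled in portable C. This is only a sketch: the helper names `read_w` and `write_w` are hypothetical and invented for illustration, and the code imitates the register behaviour with ordinary integers rather than touching real registers.

```c
#include <stdint.h>

/* Reading Wn yields the low 32 bits of Xn. */
static inline uint32_t read_w(uint64_t x)
{
    return (uint32_t)x;
}

/* Writing Wn zero-extends: the upper 32 bits of Xn become zero. */
static inline uint64_t write_w(uint64_t x_old, uint32_t w_new)
{
    (void)x_old; /* the previous contents, including the upper half, are discarded */
    return (uint64_t)w_new;
}
```

For example, `read_w(0x1122334455667788)` gives `0x55667788`, and writing 1 through the W view of a register that held all ones leaves the full X register equal to 1.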


ARMv8 also has 32 128-bit vector registers, V0-V31. Each one can be accessed through narrower scalar views as well: as 64-bit D0-D31, 32-bit S0-S31, 16-bit H0-H31, or 8-bit B0-B31, each view being the low-order bits of the corresponding V register. The full 128-bit view is named Q0-Q31.
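When a vector register is used by a SIMD instruction, its 128 bits are treated as a row of equal-width lanes (for example 16 byte lanes, written V0.16B). The following C sketch models that idea with a union. The type `vreg128_t` and the helpers `lanes_for_width` and `vreg_fill_bytes` are hypothetical names made up for this illustration; real NEON code would use the vector types from `<arm_neon.h>` instead.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one 128-bit V register; not a real NEON type. */
typedef union {
    uint8_t  b[16]; /* 16 x 8-bit lanes, as in V0.16B */
    uint16_t h[8];  /*  8 x 16-bit lanes, as in V0.8H */
    uint32_t s[4];  /*  4 x 32-bit lanes, as in V0.4S */
    uint64_t d[2];  /*  2 x 64-bit lanes, as in V0.2D */
} vreg128_t;

/* Number of lanes for a given lane width in bits (8, 16, 32 or 64). */
static inline int lanes_for_width(int lane_bits)
{
    return 128 / lane_bits;
}

/* Fill every byte lane with the same value. */
static inline vreg128_t vreg_fill_bytes(uint8_t value)
{
    vreg128_t v;
    memset(v.b, value, sizeof v.b);
    return v;
}
```

The same 16 bytes of storage are reinterpreted by each member, so one register holds 16 bytes, 8 halfwords, 4 words, or 2 doublewords depending on how the instruction addresses it.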


Let’s take a simple example to illustrate the benefits of using Neon.

Consider a very simple task. There are two data sets, each containing 16 x 1024 x 1024 integers, and corresponding elements are to be added pairwise. Every input value is at most 255, and any sum greater than 255 is clamped to 255.
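The clamping rule is known as saturating addition. As a small worked example, 200 + 100 = 300 exceeds 255 and becomes 255, while 3 + 4 = 7 is returned unchanged. A minimal scalar sketch of the rule follows; the function name `sat_add_u8` is hypothetical and used only for illustration.

```c
#include <stdint.h>

/* Add two unsigned 8-bit values, clamping the result at 255. */
static inline uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b; /* widen so the sum cannot wrap */
    return (uint8_t)(sum > 255u ? 255u : sum);
}
```

Widening before the addition matters: adding in 8 bits would wrap 300 around to 44 instead of clamping it.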


A straightforward implementation in C:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAX_LEN (16 * 1024 * 1024)

typedef unsigned char uint_8t;
typedef unsigned short uint_16t;

int main()
{
    double start_time;
    double end_time;
    uint_8t *dist1 = (uint_8t *)malloc(sizeof(uint_8t) * MAX_LEN);
    uint_8t *dist2 = (uint_8t *)malloc(sizeof(uint_8t) * MAX_LEN);
    uint_16t *ref_out = (uint_16t *)malloc(sizeof(uint_16t) * MAX_LEN);

    // Fill both data sets with random values in 0..255
    for (int i = 0; i < MAX_LEN; i++)
    {
        dist1[i] = rand() % 256;
        dist2[i] = rand() % 256;
    }

    start_time = clock();
    for (int i = 0; i < MAX_LEN; i++)
    {
        // Add element by element and clamp the sum at 255
        ref_out[i] = dist1[i] + dist2[i];
        if (ref_out[i] > 255)
        {
            ref_out[i] = 255;
        }
    }
    end_time = clock();

    // clock() counts ticks; divide by CLOCKS_PER_SEC to get seconds
    printf("C use time %f s\n", (end_time - start_time) / CLOCKS_PER_SEC);
    return 0;
}


The C implementation, as written, performs one addition per instruction and occupies a full general-purpose register for each operand. Because every input and every output is at most 255, each value needs only 8 bits, so most of each register's width is wasted.

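Using NEON for acceleration:

.text

.global asm_add_neon

// int asm_add_neon(uint_8t *dist1, uint_8t *dist2, uint_8t *out, int len)
// X0 = dist1, X1 = dist2, X2 = out, W3 = len (must be a positive multiple of 16)
asm_add_neon:
LOOP:
    LDR Q0, [X0], #0x10            // load 16 bytes from dist1, then advance X0 by 16
    LDR Q1, [X1], #0x10            // load 16 bytes from dist2, then advance X1 by 16
    UQADD V0.16B, V0.16B, V1.16B   // unsigned saturating add on 16 byte lanes
    STR Q0, [X2], #0x10            // store 16 result bytes to out, then advance X2 by 16
    SUBS W3, W3, #0x10             // len is a 32-bit int, so count down in W3
    B.NE LOOP                      // repeat until the remaining length reaches 0
    RET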


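Q0 holds data from the first array and Q1 holds data from the second; each iteration loads 128 bits (16 bytes) from each array. The ARM vector unsigned saturating addition instruction UQADD then adds the 16 byte lanes in parallel, clamping each lane at 255, and the result is stored to the output buffer whose address is in X2. The post-indexed loads and stores advance all three pointers by 16 bytes per iteration, and the loop ends when the remaining length (the fourth argument) reaches zero.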
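For comparison, the same loop can be sketched in C with NEON intrinsics instead of hand-written assembly. This is a swapped-in technique, not the author's method: the function name `add_sat_u8` is hypothetical, and the scalar fallback is an addition so that the sketch also compiles on targets without NEON and handles lengths that are not a multiple of 16. The intrinsics `vld1q_u8`, `vqaddq_u8`, and `vst1q_u8` come from `<arm_neon.h>`; `vqaddq_u8` is the intrinsic corresponding to UQADD on 16 byte lanes.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical C counterpart of asm_add_neon: out[i] = min(a[i] + b[i], 255). */
static void add_sat_u8(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t len)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    /* 16 bytes per iteration: two vector loads, one saturating add, one store. */
    for (; i + 16 <= len; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(out + i, vqaddq_u8(va, vb));
    }
#endif
    /* Scalar path: handles the tail, or everything when NEON is unavailable. */
    for (; i < len; i++) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        out[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

Unlike the assembly routine, this version does not require the length to be a multiple of 16, at the cost of a short scalar loop for the last few elements.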


Compare the performance of C language and ARM NEON acceleration:


#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAX_LEN (16 * 1024 * 1024)

typedef unsigned char uint_8t;
typedef unsigned short uint_16t;

// Implemented in assembly; len must be a positive multiple of 16
extern int asm_add_neon(uint_8t *dist1, uint_8t *dist2, uint_8t *out, int len);

int main()
{
    double start_time;
    double end_time;
    uint_8t *dist1 = (uint_8t *)malloc(sizeof(uint_8t) * MAX_LEN);
    uint_8t *dist2 = (uint_8t *)malloc(sizeof(uint_8t) * MAX_LEN);
    uint_8t *out = (uint_8t *)malloc(sizeof(uint_8t) * MAX_LEN);
    uint_16t *ref_out = (uint_16t *)malloc(sizeof(uint_16t) * MAX_LEN);

    // Fill both data sets with random values in 0..255
    for (int i = 0; i < MAX_LEN; i++)
    {
        dist1[i] = rand() % 256;
        dist2[i] = rand() % 256;
    }

    // Time the C reference implementation
    start_time = clock();
    for (int i = 0; i < MAX_LEN; i++)
    {
        ref_out[i] = dist1[i] + dist2[i];
        if (ref_out[i] > 255)
        {
            ref_out[i] = 255;
        }
        //printf("%d dist1[%d] dist2[%d] refout[%d] \n", i, dist1[i], dist2[i], ref_out[i]);
    }
    end_time = clock();
    // clock() counts ticks; divide by CLOCKS_PER_SEC to get seconds
    printf("C use time %f s\n", (end_time - start_time) / CLOCKS_PER_SEC);

    // Time the NEON assembly implementation
    start_time = clock();
    asm_add_neon(dist1, dist2, out, MAX_LEN);
    end_time = clock();
    printf("asm use time %f s\n", (end_time - start_time) / CLOCKS_PER_SEC);

    // Check the NEON result against the C reference
    for (int i = 0; i < MAX_LEN; i++)
    {
        if (out[i] != ref_out[i])
        {
            printf("ERROR:%d\n", i);
            return -1;
        }
    }
    printf("PASS!\n");
    return 0;
}


In this test, the ARM NEON assembly implementation runs about 16 times as fast as the pure C implementation, which is consistent with each NEON instruction processing 16 bytes at a time.
