Summary of assembly optimization for arm architecture 64-bit (AArch64)-EEWORLD


1. Reference

https://blog.csdn.net/SoaringLee_fighting/article/details/81906495

https://blog.csdn.net/SoaringLee_fighting/article/details/82155608

https://blog.csdn.net/u011514906/article/details/38142177

https://blog.csdn.net/listener51/article/details/82530464

2. Introduction

This article summarizes NEON optimization for the 64-bit ARM architecture (the AArch64 execution state). It covers the basics of 64-bit ARM optimization, special usages, print debugging, precautions for commonly used instructions, and reference material. An earlier article summarized 32-bit ARM assembly optimization, covering 32-bit NEON optimization and the ARM assembly syntax in detail. The examples below use the GNU assembler (GNU ASM) syntax.

[Figure: mind map of the ARM architecture assembly optimization summary]

3. Basic knowledge of 64-bit optimization of arm architecture

[arm] ARM architecture 64-bit entry basics: architecture analysis, registers, calling rules, instruction set and reference manual

The blog post above covers the basics of 64-bit ARM assembly optimization: architecture analysis, registers, calling rules, the instruction set, and program printing and debugging. It serves as background for the rest of this article.

4. ARMv8/AArch64 neon instruction format

In the AArch64 execution state, the syntax of NEON instructions has changed. It can be described as follows:

{<prefix>}<op>{<suffix>} Vd.<T>, Vn.<T>, Vm.<T>

Where:

<prefix> - prefix, such as S/U/F/P for signed/unsigned/floating-point/polynomial data types.

<op> - operation, such as ADD, AND, etc.

<suffix> - suffix:

P: "pairwise" operations, such as ADDP. (LDP and STP use the same letter for load/store pair.)

V: the new reduction (across-all-lanes) operations, such as ADDV, SMAXV, FMAXV.

2: new widening/narrowing "second part" instructions, such as ADDHN2, SADDL2, SMULL2.

<T> - data type (arrangement): 8B/16B/4H/8H/2S/4S/2D. B is a byte (8-bit), H a half-word (16-bit), S a word (32-bit), and D a double-word (64-bit).

For example:

UADDLP V0.8H, V0.16B

FADD V0.4S, V0.4S, V0.4S
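To make the pairwise and widening behavior of the first example concrete, here is a minimal sketch in portable C that models what UADDLP V0.8H, V0.16B computes. The function name uaddlp_16b_to_8h is made up for illustration; this is a model of the semantics, not NEON code.

```c
#include <stdint.h>

/* Models UADDLP Vd.8H, Vn.16B: each adjacent pair of unsigned bytes
 * is added, and the sum is widened into one 16-bit lane. */
void uaddlp_16b_to_8h(const uint8_t src[16], uint16_t dst[8])
{
    for (int i = 0; i < 8; i++) {
        /* The sum of two bytes is at most 510, so it always fits in 16 bits. */
        dst[i] = (uint16_t)(src[2 * i] + src[2 * i + 1]);
    }
}
```

Because the result lanes are twice as wide as the source lanes, the addition cannot overflow, which is why the destination arrangement is 8H rather than 16B.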

References:

https://community.arm.com/android-community/b/android/posts/arm-neon-programming-quick-reference

5. ARM related compilation parameters

When compiling for embedded devices (i.e., ARM-based boards), it is best to add -fsigned-char, because plain char defaults to unsigned char on ARM targets rather than signed char. In addition, when compiling ARM assembly optimization code, add the -c option, which compiles or assembles the source files without linking.

ARM-related or hardware-related compilation options generally start with -m. Common ARM platform options include the following (several of them, such as -mfpu and -mfloat-abi, apply only to 32-bit ARM targets):

-mcpu=cortex-a7

-mabi=atpcs

-march=armv7

-mtune=cortex-a53

-mfpu=neon or -mfpu=neon-vfpv4

-mfloat-abi=soft, softfp or hard

For more details, please refer to: https://gcc.gnu.org/onlinedocs/gcc-5.2.0/gcc.pdf Section 3.17.1 AArch64 options and Section 3.17.4 ARM options.

6. How to check the status flag NZCV

mrs x15, nzcv // copy the NZCV flags into x15

mov w0, w15 // pass the value as the first argument

bl print // call the print routine
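The value that arrives in the print routine still has to be interpreted. The following is a minimal sketch in C of how it could be decoded, assuming only the architectural layout of the NZCV register (N, Z, C and V in bits 31, 30, 29 and 28). The struct and function names are hypothetical.

```c
#include <stdint.h>

/* One field per condition flag; each is 0 or 1. */
struct nzcv_flags { int n, z, c, v; };

/* Decodes a value read with "mrs xN, nzcv". */
struct nzcv_flags decode_nzcv(uint32_t raw)
{
    struct nzcv_flags f;
    f.n = (int)((raw >> 31) & 1u); /* negative */
    f.z = (int)((raw >> 30) & 1u); /* zero */
    f.c = (int)((raw >> 29) & 1u); /* carry */
    f.v = (int)((raw >> 28) & 1u); /* overflow */
    return f;
}
```

For example, after comparing a register with itself the printed value is expected to be 0x60000000, which decodes to Z = 1 and C = 1.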

7. Instructions unique to the A64 instruction set and their usage

1. shl and ushr instructions

shl <Vd>.<T>, <Vn>.<T>, #<shift>

ushr <Vd>.<T>, <Vn>.<T>, #<shift>

ushr d2, d2, #8

Notes on use: in the scalar form shown in the last line, these two instructions operate only on 64-bit data, that is, only on D registers.

The scalar ushr shifts right by at most 64 bits, and writing d2 also clears the upper 64 bits of the V2 register. If the upper 64 bits are still needed, save them before the shift, or use the vector form (for example, ushr v2.2d, v2.2d, #8), which shifts each 64-bit lane separately.
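The side effect on the upper half is easy to miss, so here is a small C model of the scalar form. It is a sketch under the assumption that a V register is viewed as two 64-bit halves; struct vreg and ushr_d are illustrative names only.

```c
#include <assert.h>
#include <stdint.h>

/* A 128-bit V register viewed as two 64-bit halves. */
struct vreg { uint64_t lo, hi; };

/* Models the scalar "ushr dD, dN, #shift" with shift in 1..64. */
struct vreg ushr_d(struct vreg n, unsigned shift)
{
    struct vreg d;
    assert(shift >= 1 && shift <= 64);
    /* A C shift by 64 is undefined, so that case is handled explicitly. */
    d.lo = (shift == 64) ? 0 : (n.lo >> shift);
    /* Writing a D register clears the upper 64 bits of the V register. */
    d.hi = 0;
    return d;
}
```

In this model the hi field of the input is lost after the call, which mirrors why the upper 64 bits must be saved beforehand.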

2. INS instructions

The usage is basically the same as the MOV instruction. It copies one NEON vector element to another, or copies an ARM general-purpose register into a NEON vector element, leaving the other elements of the destination unchanged.

INS Vd.<T>[index1], Vn.<T>[index2]

INS Vd.<T>[index1], Rn
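As a rough illustration of the two INS forms, the following C sketch models a V register as four 32-bit lanes. The function names are invented for this example, and the 4S arrangement is just one of the possible lane sizes.

```c
#include <assert.h>
#include <stdint.h>

/* Models "ins vD.s[index], wN": only the selected 32-bit lane changes. */
void ins_s_from_w(uint32_t vd[4], unsigned index, uint32_t wn)
{
    assert(index < 4);
    vd[index] = wn;
}

/* Models "ins vD.s[index1], vN.s[index2]": a lane-to-lane copy. */
void ins_s_from_lane(uint32_t vd[4], unsigned index1,
                     const uint32_t vn[4], unsigned index2)
{
    assert(index1 < 4 && index2 < 4);
    vd[index1] = vn[index2];
}
```

In both cases the lanes that are not named keep their previous contents.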

3. SUQADD, USQADD instructions

There are both scalar and vector usages.

SUQADD <V><d>, <V><n> // signed saturating accumulate of unsigned value

SUQADD <Vd>.<T>, <Vn>.<T>

USQADD <V><d>, <V><n> // unsigned saturating accumulate of signed value

USQADD <Vd>.<T>, <Vn>.<T>
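The mixed signedness of these two instructions can be sketched in C for a single 8-bit lane. This is a model of the arithmetic only, with made-up function names; the destination register acts as the accumulator.

```c
#include <stdint.h>

/* One 8-bit lane of SUQADD: signed accumulator plus unsigned value,
 * saturated to the signed range [-128, 127]. */
int8_t suqadd8(int8_t acc, uint8_t val)
{
    int sum = (int)acc + (int)val; /* can only overflow upwards */
    return (int8_t)(sum > INT8_MAX ? INT8_MAX : sum);
}

/* One 8-bit lane of USQADD: unsigned accumulator plus signed value,
 * saturated to the unsigned range [0, 255]. */
uint8_t usqadd8(uint8_t acc, int8_t val)
{
    int sum = (int)acc + (int)val;
    if (sum < 0) return 0;
    if (sum > UINT8_MAX) return UINT8_MAX;
    return (uint8_t)sum;
}
```

Under this model, suqadd8(100, 200) saturates to 127 and usqadd8(10, -20) saturates to 0.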

4. RBIT, REV instructions

RBIT <Rd>, <Rn> // reverse bits

REV <Rd>, <Rn> // reverse bytes
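The difference between the two is easiest to see on a 32-bit value. The C functions below are a sketch that models the W-register forms; rbit32 and rev32 are illustrative names, not library calls.

```c
#include <stdint.h>

/* Models "rbit wD, wN": bit 0 swaps with bit 31, bit 1 with bit 30, and so on. */
uint32_t rbit32(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u); /* move the lowest bit of x into r */
        x >>= 1;
    }
    return r;
}

/* Models "rev wD, wN": the four bytes are reversed, the bits inside each byte are not. */
uint32_t rev32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}
```

With the input 0x11223344, rev32 gives 0x44332211, while rbit32 of 0x0000000F gives 0xF0000000.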

5. ADDV, SADDLV, SMAXV, SMINV (vector reduce across lanes)

ADDV <V><d>, <Vn>.<T> // Integer sum of elements to scalar (vector)

SADDLV <V><d>, <Vn>.<T> // Signed integer sum of elements to long scalar (vector)

SMAXV <V><d>, <Vn>.<T> // Signed integer maximum element to scalar (vector)

SMINV <V><d>, <Vn>.<T> // Signed integer minimum element to scalar (vector)

For example:

addv b0, v1.8b // Sum the eight 8-bit elements in the lower 64 bits of v1 and write the result to the lowest 8 bits of v0.

For more detailed explanation, please refer to: https://static.docs.arm.com/ddi0487/a/DDI0487A_j_armv8_arm.pdf
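One practical difference between ADDV and SADDLV is the width of the result. The following C sketch models the 8B arrangement; the function names are hypothetical and the code only mirrors the arithmetic of the reductions.

```c
#include <stdint.h>

/* Models "addv b0, v1.8b": the 8-bit result wraps modulo 256. */
uint8_t addv_8b(const uint8_t v[8])
{
    unsigned sum = 0;
    for (int i = 0; i < 8; i++) sum += v[i];
    return (uint8_t)sum;
}

/* Models "saddlv h0, v1.8b": the signed sum is widened to 16 bits. */
int16_t saddlv_8b(const int8_t v[8])
{
    int sum = 0;
    for (int i = 0; i < 8; i++) sum += v[i];
    return (int16_t)sum;
}

/* Models "smaxv b0, v1.8b": the largest signed element. */
int8_t smaxv_8b(const int8_t v[8])
{
    int8_t best = v[0];
    for (int i = 1; i < 8; i++) {
        if (v[i] > best) best = v[i];
    }
    return best;
}
```

If the sum of the elements may exceed the element width, the long variant avoids the wrap-around that the plain ADDV model shows.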


6. Precautions for using sxtw

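A negative 32-bit value held in a W register must be sign-extended before it is used as a 64-bit value, because writing a W register zeroes the upper 32 bits of the corresponding X register.

For example: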

sxtw x4, w4
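What goes wrong without the extension can be shown with a short C model. It is a sketch with illustrative function names, assuming the usual two's complement representation.

```c
#include <stdint.h>

/* What the X register holds after a 32-bit write: the upper 32 bits are zero. */
int64_t as_x_without_sxtw(int32_t w)
{
    return (int64_t)(uint64_t)(uint32_t)w;
}

/* Models "sxtw xD, wN": bit 31 is copied into bits 63..32. */
int64_t sxtw(int32_t w)
{
    return (int64_t)w;
}
```

For a stride of -4, the first function yields 4294967292 while the second yields -4, so only the sign-extended value is usable as a negative 64-bit offset.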

7. W register to V register

Use the dup instruction directly. It copies the low bits of the W register into every lane of the destination vector:

dup v0.8b, w2
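A minimal C model of this instruction, with a made-up function name, shows that only the low byte of the W register is used for the 8B arrangement:

```c
#include <stdint.h>

/* Models "dup v0.8b, w2": the low byte of w2 is replicated to eight byte lanes. */
void dup_8b(uint8_t vd[8], uint32_t wn)
{
    for (int i = 0; i < 8; i++) {
        vd[i] = (uint8_t)wn; /* the upper 24 bits of wn are ignored */
    }
}
```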

8. Common instruction correspondence (arm32---->arm64)

vmovl ----> uxtl, sxtl

vqmovn ----> sqxtn

vqmovun ----> sqxtun

vqrshrun ----> sqrshrun

vceq ----> cmeq

vcge ----> cmge

vadd ----> add

vsub ----> sub

vaddl ----> saddl, uaddl

vaddw ----> saddw, saddw2, uaddw, uaddw2

vmull ----> smull, smull2, umull, umull2

vmax, vmin ----> smax, umax, smin, umin

vmlal ----> smlal, smlal2, umlal, umlal2

vrshl ----> urshl, srshl

vtrn ----> trn1, trn2

vstm/vstr ----> stp/str

vld1.32 {d0[]}, [r0], r2 ----> ld1r {v0.2s}, [x0], x2

addgt, addle, subgt, suble ----> csel, csetm, cset, csinc, csinv

More references:

https://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf

8. Document review

Before writing 64-bit ARM assembly, it is recommended to first read the official ARM manual in English (https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf), focusing on part C7 (AArch64 NEON instructions) and part C3 (ARM instructions). Once you understand the basic instructions and the 64-bit assembly format, you can start writing code.

If you already have 32-bit ARM code and are migrating it to arm64, you can refer to the instruction comparison table in Section 5.7.23 of (https://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf).

For code migration methods, you can refer to my blog: Some ways of Migrating code from ARM32 to AArch64.

For quick instruction lookup, refer to the instruction quick reference card:

https://courses.cs.washington.edu/courses/cse469/18wi/Materials/arm64.pdf

9. Summary of optimization experience (full of useful information)

About parameter stacking and register stacking

It is recommended to read the parameters passed on the stack first, and only then push the ARM or NEON registers that need to be saved.

Try to remove data dependencies and make instructions parallel

Do not use the destination register of the current instruction as the source register of the next instruction, especially for multiply and multiply-accumulate instructions (vmul and vmla in arm32, mul and mla in arm64).
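The same idea can be sketched in C. The hypothetical dot_i16 function below keeps four independent accumulators so that consecutive multiply-accumulate steps do not wait on each other's results; it illustrates the principle and makes no claim about the code a particular compiler generates.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product with four independent accumulators. */
int32_t dot_i16(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;

    /* Each statement writes a different accumulator, so no step
     * depends on the result of the one just before it. */
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i] * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++) {
        acc0 += a[i] * b[i];
    }
    return acc0 + acc1 + acc2 + acc3;
}
```

In assembly, the equivalent is to alternate destination registers across adjacent multiply-accumulate instructions and combine them once at the end.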

Minimize branching

Conditional execution or logical instructions can replace branch jumps, such as addgt, suble, vceq, vcge, vbit and vbsl in arm32 (csel, cmeq, cmge, bit and bsl in arm64).
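A compare followed by a bitwise select can be modeled per lane in C. The sketch below uses a made-up function name and mirrors the pattern of a compare that produces an all-ones or all-zeros mask, followed by a select on that mask.

```c
#include <stdint.h>

/* Per-lane model of a compare (a >= b) followed by a bitwise select. */
uint8_t select_ge(int8_t a, int8_t b, uint8_t if_true, uint8_t if_false)
{
    /* 0xFF when a >= b, otherwise 0x00, like the result of a vector compare. */
    uint8_t mask = (uint8_t)-(a >= b);
    /* Take bits from if_true where the mask is set, from if_false elsewhere. */
    return (uint8_t)((if_true & mask) | (if_false & (uint8_t)~mask));
}
```

Both candidate results are computed and the mask picks one, so no branch is needed inside the loop body.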

Focus on instruction cycle latency

Multiplication instructions have a relatively long latency. Try not to use their result immediately; otherwise the following instruction has to wait for it.

Data operations should be performed in neon registers as much as possible, and operations between arm registers and neon registers should be avoided.

Minimize the number of memory accesses.

Try to use registers that do not need to be saved, because pushing registers to the stack and popping them again takes time.

When the width is a multiple of 4, try to process it in the width direction to improve the cache hit rate.

Write the code with as few instructions as possible: ARM is a reduced instruction set (RISC) architecture, and most instructions execute in a single cycle.

If there are enough registers, try to split the data processing of one row into two or four rows for parallel processing; try to avoid operations between large data, and split the operations on large data into operations on small data.

Reduce loop judgment and conditional comparison

When data processing involves many loop tests or conditional tests, the loops can be unrolled appropriately, which can improve performance to some extent.

THE END!
