Summary of assembly optimization for arm architecture 64-bit (AArch64)

Publisher: ziyunting. Latest update time: 2020-02-07. Source: eefocus. Keywords: AArch64

1. Reference

https://blog.csdn.net/SoaringLee_fighting/article/details/81906495

https://blog.csdn.net/SoaringLee_fighting/article/details/82155608

https://blog.csdn.net/u011514906/article/details/38142177

https://blog.csdn.net/listener51/article/details/82530464


2. Introduction

This article summarizes NEON optimization for the 64-bit ARM architecture (AArch64 execution state). It covers the basics of 64-bit ARM optimization, special usages, print debugging, precautions for commonly used instructions, and reference sources. A previous article summarized 32-bit ARM assembly optimization, covering 32-bit NEON optimization and the ARM assembly syntax in detail. The examples below use the GNU assembler (GNU ASM) syntax.


[Figure: mind map of the ARM architecture assembly optimization summary]

3. Basics of 64-bit ARM architecture optimization

[arm] ARM architecture 64-bit entry basics: architecture analysis, registers, calling rules, instruction set and reference manual


The blog post above covers the basics of 64-bit ARM assembly optimization: architecture analysis, registers, calling rules, the instruction set, and printing and debugging a program. It serves as background for the rest of this article.


4. ARMv8/AArch64 neon instruction format

In the AArch64 execution state, the syntax of NEON instructions has changed compared with AArch32. It can be described as follows:


{<prefix>}<op>{<suffix>}  Vd.<T>, Vn.<T>, Vm.<T>

Where:

<prefix> - prefix, such as S/U/F/P for signed/unsigned/floating-point/polynomial data types.

<op> - operation, such as ADD, AND etc.

<suffix> - suffix:

P: "pairwise" operations, such as ADDP. (The P in LDP and STP also stands for "pair", but those are load/store-pair instructions rather than pairwise vector operations.)

V: the new reduction (across-all-lanes) operations, such as ADDV, SMAXV, FMAXV.

2: the new widening/narrowing "second part" instructions, such as ADDHN2, SADDL2, SMULL2. They work on the upper half of the 128-bit source register (widening) or destination register (narrowing).

<T> - arrangement (lane count and element size): 8B/16B/4H/8H/2S/4S/2D. B represents a byte (8-bit), H a half-word (16-bit), S a word (32-bit), and D a double-word (64-bit).


For example:

UADDLP    V0.8H, V0.16B

FADD V0.4S, V0.4S, V0.4S
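To make the arrangement notation concrete, here is a minimal scalar sketch in C of what UADDLP with an 8H destination and a 16B source computes. It is a reference model for reasoning about the instruction, not NEON code, and the helper name uaddlp_16b_model is hypothetical.

```c
#include <stdint.h>

/* Scalar model of UADDLP Vd.8H, Vn.16B: each pair of adjacent unsigned
   bytes is added and the sum is widened to one 16-bit lane.
   The function name is illustrative only. */
void uaddlp_16b_model(const uint8_t src[16], uint16_t dst[8])
{
    for (int i = 0; i < 8; i++) {
        dst[i] = (uint16_t)(src[2 * i] + src[2 * i + 1]);
    }
}
```

Because the result lanes are twice as wide as the source lanes, the pairwise sum cannot overflow: the largest possible value is 255 + 255 = 510.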

References:

https://community.arm.com/android-community/b/android/posts/arm-neon-programming-quick-reference


5. ARM related compilation parameters

When compiling for embedded devices (i.e., ARM-based boards), it is best to add -fsigned-char if the code assumes that plain char is signed, because on ARM the plain char type is unsigned by default. In addition, when building ARM assembly-optimized code, add the -c option, which compiles or assembles the source files without linking.
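The signedness of plain char can also be handled in the C code itself. The following is a minimal sketch, with hypothetical helper names, that reports how the current target treats plain char and converts a raw byte to a signed sample without depending on that choice.

```c
#include <limits.h>

/* Returns 1 when plain char is signed on this target and 0 when it is
   unsigned (the usual default on ARM). */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Interprets a raw byte as a signed 8-bit value (two's complement),
   independent of the signedness of plain char. */
int sample_from_byte(unsigned char raw)
{
    return raw >= 128 ? (int)raw - 256 : (int)raw;
}
```

Code written this way behaves the same with or without -fsigned-char, which is useful when the build flags of a project cannot be controlled.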


ARM-related or hardware-related compilation parameters generally start with -m. Common ARM platform compilation options include:


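-mcpu=cortex-a7

-mabi=atpcs

-march=armv7

-mtune=cortex-a53

-mfpu=neon or -mfpu=neon-vfpv4

-mfloat-abi=soft, softfp or hard

Note that -mfpu and -mfloat-abi are 32-bit ARM options. They are not accepted when targeting AArch64, where NEON and hardware floating point are enabled by default.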


For more details, please refer to: https://gcc.gnu.org/onlinedocs/gcc-5.2.0/gcc.pdf Section 3.17.1 AArch64 options and Section 3.17.4 ARM options.


6. How to check the status flag NZCV

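mrs  x15, nzcv     // read the condition flags; N, Z, C, V are in bits 31..28

mov  w0, w15       // pass the flags as the first argument

bl   print         // call the print helper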
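The value read from NZCV holds the four flags in bits 31 to 28 (N, Z, C, V in that order). As a minimal sketch of what a print helper could do with it, the C function below splits the raw value into the individual flags; the type and function names are hypothetical.

```c
#include <stdint.h>

/* One field per condition flag, each 0 or 1. */
typedef struct {
    unsigned n; /* negative */
    unsigned z; /* zero */
    unsigned c; /* carry */
    unsigned v; /* signed overflow */
} nzcv_flags;

/* Decodes the raw value produced by "mrs xN, nzcv". */
nzcv_flags decode_nzcv(uint64_t raw)
{
    nzcv_flags f;
    f.n = (unsigned)((raw >> 31) & 1u);
    f.z = (unsigned)((raw >> 30) & 1u);
    f.c = (unsigned)((raw >> 29) & 1u);
    f.v = (unsigned)((raw >> 28) & 1u);
    return f;
}
```

For instance, after comparing a register with itself the Z and C flags are set, so the raw value is 0x60000000.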


7. Instructions unique to the A64 instruction set and their usage

1. shl and ushr instructions

shl  <Vd>.<T>, <Vn>.<T>, #<shift>

ushr  <Vd>.<T>, <Vn>.<T>, #<shift>

ushr  d2, d2,  #8

Notes on use: the scalar forms of these two instructions operate only on 64-bit data, that is, only on D registers. The vector forms shift every lane of the given arrangement independently.

The scalar ushr can shift right by at most 64 bits. Writing d2 also clears the upper 64 bits of the V2 register, so save the upper 64 bits before the shift if they are still needed; otherwise that data is lost.
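The effect on the upper half of the register is easy to overlook, so here is a minimal C model of the scalar form, assuming the 128-bit register is represented as two 64-bit halves. The type vreg128 and the function name are hypothetical.

```c
#include <stdint.h>

/* A 128-bit V register modelled as two 64-bit halves. */
typedef struct {
    uint64_t lo; /* bits 63..0, the D view of the register */
    uint64_t hi; /* bits 127..64 */
} vreg128;

/* Scalar model of "ushr dD, dN, #shift" for shift in 1..64.
   The low half is shifted right logically; the high half of the
   destination is cleared because a scalar write zeroes it. */
vreg128 ushr_d_model(vreg128 vn, unsigned shift)
{
    vreg128 vd;
    vd.lo = (shift >= 64) ? 0 : (vn.lo >> shift);
    vd.hi = 0;
    return vd;
}
```

With ushr d2, d2, #8 the low half loses its lowest byte and whatever was stored in bits 127..64 of v2 is gone afterwards.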


2. INS instructions

The usage is basically the same as the MOV instruction. It transfers data between NEON vector elements, and from an ARM general-purpose register into a NEON vector element (the opposite direction uses UMOV or SMOV).


INS   <Vd>.<Ts>[index1], <Vn>.<Ts>[index2]

INS   <Vd>.<Ts>[index1], Rn


3. SUQADD, USQADD instructions

There are both scalar and vector usages.


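SUQADD <V><d>, <V><n>       // signed saturating accumulate of unsigned value (scalar)

SUQADD <Vd>.<T>, <Vn>.<T>   // vector form

USQADD <V><d>, <V><n>       // unsigned saturating accumulate of signed value (scalar)

USQADD <Vd>.<T>, <Vn>.<T>   // vector form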
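The mixed signedness of these two instructions is easier to see in a scalar model. The sketch below shows the behaviour for a single 8-bit lane; the function names are hypothetical and wider lanes follow the same pattern with wider limits.

```c
#include <stdint.h>

/* Model of SUQADD on one 8-bit lane: the signed accumulator receives an
   unsigned addend and the result saturates to the signed range. */
int8_t suqadd8_model(int8_t acc, uint8_t addend)
{
    int sum = (int)acc + (int)addend;
    if (sum > INT8_MAX) {
        sum = INT8_MAX;
    }
    return (int8_t)sum;
}

/* Model of USQADD on one 8-bit lane: the unsigned accumulator receives a
   signed addend and the result saturates to the unsigned range. */
uint8_t usqadd8_model(uint8_t acc, int8_t addend)
{
    int sum = (int)acc + (int)addend;
    if (sum < 0) {
        sum = 0;
    }
    if (sum > UINT8_MAX) {
        sum = UINT8_MAX;
    }
    return (uint8_t)sum;
}
```

In both cases the destination register is the accumulator, so it is read as well as written.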


4. RBIT, REV instructions


 RBIT <Xd>, <Xn>   // reverse bits

 REV <Xd>, <Xn>    // reverse bytes
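As a reference for what the two instructions compute, here is a minimal C model of their 32-bit (W register) forms. The function names are hypothetical; the 64-bit forms work the same way on 64 bits.

```c
#include <stdint.h>

/* Model of "rbit wD, wN": bit 0 becomes bit 31, bit 1 becomes bit 30, ... */
uint32_t rbit32_model(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* Model of "rev wD, wN": the four bytes are stored in reverse order. */
uint32_t rev32_model(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}
```

REV is the usual way to convert between little-endian and big-endian data, while RBIT is handy for bit-reversed indexing and for counting trailing zeros together with CLZ.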

5. ADDV, SADDLV, SMAXV, SMINV (vector reduce across lanes)

ADDV <V><d>, <Vn>.<T>     // integer sum of elements to scalar

SADDLV <V><d>, <Vn>.<T>   // signed integer sum of elements to long scalar

SMAXV <V><d>, <Vn>.<T>    // signed integer maximum of elements to scalar

SMINV <V><d>, <Vn>.<T>    // signed integer minimum of elements to scalar


e.g.:

addv b0, v1.8b // Add the eight 8-bit elements in the lower 64 bits of v1 and write the sum to the lowest 8 bits of v0. The sum wraps to 8 bits; use SADDLV or UADDLV when a wider result is needed.

For more detailed explanation, please refer to: https://static.docs.arm.com/ddi0487/a/DDI0487A_j_armv8_arm.pdf
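The difference between the plain and the long reduction matters when the lanes can overflow. The following is a minimal C model of both on eight 8-bit lanes; the function names are hypothetical.

```c
#include <stdint.h>

/* Model of "addv bD, vN.8b": the sum has the same width as the lanes,
   so it wraps modulo 256. */
uint8_t addv_8b_model(const uint8_t lanes[8])
{
    uint8_t sum = 0;
    for (int i = 0; i < 8; i++) {
        sum = (uint8_t)(sum + lanes[i]);
    }
    return sum;
}

/* Model of "saddlv hD, vN.8b": the signed lanes are summed into a
   16-bit result, so eight 8-bit lanes can never overflow it. */
int16_t saddlv_8b_model(const int8_t lanes[8])
{
    int16_t sum = 0;
    for (int i = 0; i < 8; i++) {
        sum = (int16_t)(sum + lanes[i]);
    }
    return sum;
}
```

For lanes holding 200 and 100 (the rest zero), the 8-bit sum is 44 rather than 300, which is why the long variants are usually the safer choice for pixel data.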



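6. Precautions for using sxtw

A 32-bit value that may be negative must be sign-extended before it is used as a 64-bit value, for example as an address offset or a stride. Writing a W register zeroes the upper 32 bits of the X register, and the upper 32 bits of a 32-bit argument passed in a register are not guaranteed to hold the sign extension.

For example:


sxtw   x4, w4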
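The following minimal C sketch shows why the extension matters, by comparing the 64-bit value seen when the upper half is simply zero with the value after sign extension. The function names are hypothetical.

```c
#include <stdint.h>

/* What the X register holds when a negative 32-bit value sits in the
   W register and the upper 32 bits are zero. */
uint64_t zero_extended_view(int32_t w)
{
    return (uint64_t)(uint32_t)w;
}

/* Model of "sxtw xD, wN": the sign bit is copied into bits 63..32. */
int64_t sxtw_model(int32_t w)
{
    return (int64_t)w;
}
```

A stride of -16 without sign extension is read as 4294967280, so an address computed from it points far outside the buffer.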


7. W register to V register

Use the dup instruction directly:


dup v0.8B,  w2    // copy the low 8 bits of w2 into each of the eight byte lanes of v0


8. Common instruction correspondence (arm32---->arm64)

 vmovl------>uxtl/sxtl

 vqmovn----->sqxtn

 vqmovun---->sqxtun

 vqrshrun--->sqrshrun

 vceq------->cmeq

 vcge------->cmge

 vadd------->add

 vsub------->sub

 vaddl------>saddl,uaddl

 vaddw------>saddw,saddw2,uaddw,uaddw2

 vmull------>smull,smull2,umull,umull2

 vmax,vmin-->smax,umax,smin,umin

 vmlal------>smlal,smlal2,umlal,umlal2

 vrshl------>urshl,srshl

 vtrn------->trn1,trn2

 vstm/vstr-->stp/str

 vld1.32 {d0[]}, [r0], r2----->ld1r {v0.2s}, [x0], x2

 addgt,addle,subgt,suble----->csel,csetm,cset,csinc,csinv

More references:

https://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf


8. Document review

    Before writing 64-bit ARM assembly, it is recommended to first read the official Arm manual in English (https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf), focusing on chapter C7 (the AArch64 NEON instructions) and chapter C3 (the ARM instructions). Once you understand the basic instructions and the 64-bit assembly format, you can start writing code.

    If you already have 32-bit ARM code, when migrating it to arm64 you can refer to the instruction comparison table in Section 5.7.23 of (https://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf).


    For code migration methods, you can refer to my blog post: Some ways of Migrating code from ARM32 to AArch64.


    For quick instruction lookup, refer to the instruction quick reference card:

https://courses.cs.washington.edu/courses/cse469/18wi/Materials/arm64.pdf


9. Summary of optimization experience (full of useful information)

About parameter stacking and register stacking

It is recommended to load the parameters that were passed on the stack first, and only then push the ARM or NEON registers that need to be preserved.

Try to remove data dependencies and make instructions parallel

Do not use the destination register of the current instruction as the source register of the next instruction, especially for multiply and multiply-accumulate instructions (vmul and vmla on arm32, mul and mla on arm64).
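The same idea can be sketched in C with a dot product. The first function below feeds every multiply-accumulate into one accumulator, so each step waits for the previous one; the second spreads the work over four independent accumulators. Both function names are hypothetical, and whether the second version is actually faster depends on the core and the compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every multiply-accumulate depends on the previous one. */
int64_t dot_single(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int64_t)a[i] * b[i];
    }
    return acc;
}

/* Four accumulators: consecutive multiply-accumulates are independent,
   so they can overlap in the pipeline. */
int64_t dot_four(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += (int64_t)a[i] * b[i];
        acc1 += (int64_t)a[i + 1] * b[i + 1];
        acc2 += (int64_t)a[i + 2] * b[i + 2];
        acc3 += (int64_t)a[i + 3] * b[i + 3];
    }
    for (; i < n; i++) { /* leftover elements */
        acc0 += (int64_t)a[i] * b[i];
    }
    return acc0 + acc1 + acc2 + acc3;
}
```

In hand-written NEON code the equivalent is to accumulate into several V registers and add them together once after the loop.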


Minimize branching

Conditional execution instructions or logical operation instructions can be used to replace branches, such as addgt, suble, vceq, vcge, vbit and vbsl on arm32 (on arm64: csel, cmeq, cmge, bit and bsl).
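A minimal C sketch of the compare-and-select pattern follows: a comparison produces an all-ones or all-zeros mask, as the NEON compare instructions do, and a bitwise select then picks one of the two inputs, as bsl does. The function name is hypothetical.

```c
#include <stdint.h>

/* Branch-free maximum of two unsigned values.
   mask is 0xFFFFFFFF when a >= b and 0 otherwise (compare step);
   the result takes the bits of a where mask is set and the bits of b
   elsewhere (bitwise-select step). */
uint32_t select_max_u32(uint32_t a, uint32_t b)
{
    uint32_t mask = 0u - (uint32_t)(a >= b);
    return (a & mask) | (b & ~mask);
}
```

Applied to whole vectors, this replaces a per-element if/else with two or three instructions and no branch at all.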

Focus on instruction cycle latency

Multiplication instructions have a relatively long latency. Try not to use the result of such an instruction immediately, otherwise the pipeline stalls while waiting for it.

Keep data operations in NEON registers as much as possible, and avoid transfers between ARM registers and NEON registers.

Minimize the number of memory accesses.

Prefer registers that do not need to be saved, because pushing registers onto the stack and popping them off takes time.

When the width is a multiple of 4, try to process the data along the width direction to improve the cache hit rate.

Write the code with as few instructions as possible: ARM is a reduced instruction set, and most instructions complete in a single cycle.

If there are enough registers, try to split the processing of one row into two or four rows that are processed in parallel. Try to avoid operations on wide data, and split operations on wide data into operations on narrower data.
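The tip above about processing along the width direction can be sketched in C as follows: the inner loop walks consecutive bytes of one row, so every cache line that is fetched gets fully used before the next one is needed. The function name and the stride parameter (bytes between the starts of two rows) are assumptions made for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Sums all pixels of an 8-bit image, row by row.
   stride is the distance in bytes between the starts of two rows and
   may be larger than width when the rows are padded. */
uint32_t sum_image_by_rows(const uint8_t *img, size_t width,
                           size_t height, size_t stride)
{
    uint32_t total = 0;
    for (size_t y = 0; y < height; y++) {
        const uint8_t *row = img + y * stride;
        for (size_t x = 0; x < width; x++) { /* consecutive addresses */
            total += row[x];
        }
    }
    return total;
}
```

Swapping the two loops would touch a different row on every access, which typically costs far more cache misses on large images.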


Reduce loop judgment and conditional comparison

When the data processing contains many loop tests or conditional tests, the loop can be unrolled appropriately, which improves performance to some extent.
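As a minimal sketch of unrolling, the C function below computes a sum of absolute differences and handles four elements per iteration, so the loop condition is tested once per four elements instead of once per element. The function names are hypothetical and the unroll factor of four is only an example.

```c
#include <stddef.h>
#include <stdint.h>

/* Absolute difference of two bytes. */
static uint32_t abs_diff_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint32_t)(a - b) : (uint32_t)(b - a);
}

/* Sum of absolute differences, unrolled by four. */
uint32_t sad_unrolled(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sad = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) { /* one loop test per four elements */
        sad += abs_diff_u8(a[i], b[i]);
        sad += abs_diff_u8(a[i + 1], b[i + 1]);
        sad += abs_diff_u8(a[i + 2], b[i + 2]);
        sad += abs_diff_u8(a[i + 3], b[i + 3]);
    }
    for (; i < n; i++) { /* leftover elements */
        sad += abs_diff_u8(a[i], b[i]);
    }
    return sad;
}
```

In assembly the same structure appears as a main loop that handles a fixed block per iteration, followed by a short tail for the remaining elements.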


THE END!

