1. Reference
https://blog.csdn.net/SoaringLee_fighting/article/details/81906495
https://blog.csdn.net/SoaringLee_fighting/article/details/82155608
https://blog.csdn.net/u011514906/article/details/38142177
https://blog.csdn.net/listener51/article/details/82530464
2. Introduction
This article summarizes NEON optimization for 64-bit ARM (the AArch64 execution state). It covers the basics of 64-bit ARM optimization, special usages, print debugging, precautions for commonly used instructions, and reference material. A previous article summarized 32-bit ARM assembly optimization, covering 32-bit NEON optimization and ARM assembly syntax. The examples below use GNU assembler (GAS) syntax.
The following figure is a mind map of the ARM assembly optimization summary:
3. Basic knowledge of 64-bit ARM optimization
The blog post "[arm] ARM architecture 64-bit entry basics: architecture analysis, registers, calling rules, instruction set and reference manual" covers the background needed for 64-bit ARM assembly optimization: architecture analysis, registers, calling rules, the instruction set, and print debugging.
4. ARMv8/AArch64 neon instruction format
In the AArch64 execution state, the syntax of NEON instruction has changed. It can be described as follows:
    {<prefix>}<op>{<suffix>} Vd.<T>, Vn.<T>, Vm.<T>
Where:
<prefix> - prefix; S/U/F/P stand for signed/unsigned/floating-point/polynomial data types.
<op> - operation, such as ADD, AND, etc.
<suffix> - suffix:
    P: "pairwise" operations, such as ADDP; LDP and STP use P for register pairs in the same way.
    V: the new reduction (across-all-lanes) operations, such as ADDV, SMAXV, FMAXV.
    2: the new widening/narrowing "second part" instructions, such as ADDHN2, SADDL2, SMULL2.
<T> - data type: 8B/16B/4H/8H/2S/4S/2D. B is a byte (8 bits), H a half-word (16 bits), S a word (32 bits), and D a double-word (64 bits).
For example:
    UADDLP V0.8H, V0.16B
    FADD V0.4S, V0.4S, V0.4S
References: https://community.arm.com/android-community/b/android/posts/arm-neon-programming-quick-reference
5. ARM related compilation parameters
When compiling for embedded devices (i.e., ARM-based boards), it is best to add -fsigned-char, because plain char is unsigned by default on ARM targets, not signed.
In addition, when compiling ARM assembly optimization code, add the -c option. -c compiles or assembles the source files without linking.
ARM-related or hardware-related compilation parameters generally start with -m. Common ARM platform compilation options include:
    -mcpu=cortex-a7
    -mabi=atpcs
    -march=armv7
    -mtune=cortex-a53
    -mfpu=neon, -mfpu=neon-vfpv4
    -mfloat-abi=soft, -mfloat-abi=softfp, -mfloat-abi=hard
-mfpu and -mfloat-abi belong to the 32-bit ARM options. For more details, please refer to https://gcc.gnu.org/onlinedocs/gcc-5.2.0/gcc.pdf, Section 3.17.1 "AArch64 Options" and Section 3.17.4 "ARM Options".
6. How to check the status flag NZCV
    mrs x15, nzcv
    mov w0, w15
    bl print
mrs copies the NZCV flags into x15, where they occupy bits 31:28. The value is then moved to w0 and passed to print, which is assumed here to be a debugging routine that prints its first argument.
7. Instructions unique to the A64 instruction set and their usage
1. shl and ushr instructions
Scalar forms: shl, ushr. For example:
    ushr d2, d2, #8
Notes on use:
The scalar forms of these two instructions operate on 64-bit data only, that is, on D registers. Other element sizes require the vector forms.
The scalar ushr shifts right by at most 64 bits. Because it writes d2, it also clears the upper 64 bits of the v2 register. If the upper 64 bits are still needed, save them before shifting; otherwise that data is lost.
2. INS instructions
The usage is basically the same as the MOV instruction. INS transfers data between NEON scalars (vector elements) and from ARM registers to NEON scalars. There are two forms:
    INS (element): copies one vector element to another vector element.
    INS (general): copies a general-purpose register to a vector element.
3. SUQADD, USQADD instructions
Both have scalar and vector forms.
    SUQADD: signed saturating accumulate of an unsigned value (scalar and vector).
    USQADD: unsigned saturating accumulate of a signed value (scalar and vector).
4. RBIT, REV instructions
    RBIT: reverses the bit order.
    REV: reverses the byte order.
5. ADDV, SADDLV, SMAXV, SMINV (vector reduce, across lanes)
For example:
    addv b0, v1.8b // Add the eight 8-bit values in the lower 64 bits of v1 and write the sum to the lowest 8 bits of v0.
For a more detailed explanation, please refer to: https://static.docs.arm.com/ddi0487/a/DDI0487A_j_armv8_arm.pdf
6. Precautions for using sxtw
A 32-bit value that may be negative must be sign-extended before it is used as a 64-bit value! For example:
    sxtw x4, w4
7. W register to V register
Use the dup instruction directly:
    dup v0.8b, w2
8. Common instruction correspondence (arm32 ----> arm64)
    vmovl ----> uxtl/sxtl
    vqmovn ----> sqxtn
    vqmovun ----> sqxtun
    vqrshrun ----> sqrshrun
    vceq ----> cmeq
    vcge ----> cmge
    vadd ----> add
    vsub ----> sub
    vaddl ----> saddl, uaddl
    vaddw ----> saddw, uaddw, saddw2, uaddw2
    vmull ----> smull, smull2, umull, umull2
    vmax, vmin ----> smax, umax, smin, umin
    vmlal ----> smlal, smlal2, umlal, umlal2
    vrshl ----> urshl, srshl
    vtrn ----> trn1, trn2
    vstm/vstr ----> stp/str
    vld1.32 {d0[]}, [r0], r2 ----> ld1r {v0.2s}, [x0], x2
    addgt, addle, subgt, suble ----> csel, csetm, cset, csinc, csinv
More references: https://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf
8. Document review
Before writing 64-bit ARM assembly, it is recommended to first read the official ARM English manual (https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf), focusing on part C7 (AArch64 NEON instructions) and part C3 (ARM instructions). Once you understand the basic instructions and the 64-bit assembly format, you can start writing code.
If you already have 32-bit ARM code and are migrating it to arm64, refer to the instruction comparison table in Section 5.7.23 of https://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf. For code migration methods, refer to my blog: Some ways of Migrating code from ARM32 to AArch64.
For quick instruction lookup, refer to the instruction quick reference card: https://courses.cs.washington.edu/courses/cse469/18wi/Materials/arm64.pdf
9. Summary of optimization experience (full of useful information)
About parameter stacking and register stacking
Load the arguments passed on the stack first, and only then push the ARM or NEON registers that must be preserved.
Try to remove data dependencies and make instructions parallel
Do not use the destination register of the current instruction as the source register of the next instruction, especially for vmul and vmla instructions.
Minimize branching
Conditional execution instructions or logical operation instructions can replace branch jumps, such as addgt, suble, vceq, vcge, vbit, vbsl, etc. (on AArch64, use their counterparts from the correspondence list above, such as csel, cmeq and cmge).
Focus on instruction cycle latency
Multiplication instructions have a relatively long latency. Try not to use their result immediately; otherwise the pipeline has to wait for it.
Perform data operations in NEON registers as much as possible, and avoid operations that move data between ARM registers and NEON registers.
Minimize the number of data accesses.
Prefer registers that do not need to be saved, because pushing registers onto the stack and popping them again takes time.
When the width is a multiple of 4, process the data along the width direction to improve the cache hit rate.
Write the code with as few instructions as possible: ARM is a reduced instruction set, and most instructions execute in a single cycle.
If there are enough registers, split the processing of one row into two or four rows handled in parallel. Avoid operations on wide data where possible by splitting them into operations on narrower data.
Reduce loop judgment and conditional comparison
When the data processing contains many loop tests or conditional tests, unrolling the loop appropriately can improve performance to a certain extent.