Preface
This series of blog posts is used to introduce the NEON instruction optimization under ARM CPU.
Blog post github address: github
Related code github address: github
NEON instruction set
Most mainstream compilers that support ARM CPU as target platform support NEON instructions. You can use NEON by embedding NEON assembly in your code, but a more common way is to write NEON code through NEON Instrinsic similar to C function. Just like NEON hello world. NEON Instrinsic is a set of buildin types and functions supported by the compiler, which basically covers all NEON instructions. Usually these Instrinsic are included in the arm_neon.h header file.
This article takes arm_neon.h of armv7 in android-ndk-r11c as an example to explain the NEON instruction type.
register
The ARMV7 architecture includes:
16 general registers (32 bits), R0-R15
16 NEON registers (128 bit), Q0-Q15 (can also be regarded as 32 64 bit registers, D0-D31)
16 VFP registers (32 bits), S0-S15
The difference between NEON and VFP is that VFP is a hardware that accelerates floating point calculations and does not have data parallelism capabilities. At the same time, VFP is more suitable for double-precision floating point calculations, while NEON only has single-precision floating point calculation capabilities. For more information, please refer to stackoverflow:neon vs vfp
Basic Data Types
64-bit data type, mapped to registers D0-D31
The corresponding C/C++ language types (types in stdint.h or csdtint header files) are described in the comments.
//typedef int8_t[8] int8x8_t;
typedef __builtin_neon_qi int8x8_t __attribute__ ((__vector_size__ (8)));
//typedef int16_t[4] int16x4_t;
typedef __builtin_neon_hi int16x4_t __attribute__ ((__vector_size__ (8)));
//typedef int32_t[2] int32x2_t;
typedef __builtin_neon_si int32x2_t __attribute__ ((__vector_size__ (8)));
//typedef int64_t[1] int64x1_t;
typedef __builtin_neon_di int64x1_t;
//typedef float16_t[4] float16x4_t;
// (Note: This type is half-precision and is supported on some new CPUs. This basic data type is not yet available in the C/C++ language annotation)
typedef __builtin_neon_hf float16x4_t __attribute__ ((__vector_size__ (8)));
//typedef float32_t[2] float32x2_t;
typedef __builtin_neon_sf float32x2_t __attribute__ ((__vector_size__ (8)));
//poly8 and poly16 types are basically not used in common algorithms
//Detailed explanation:
//http://stackoverflow.com/questions/22224282/arm-neon-and-poly8-t-and-poly16-t
typedef __builtin_neon_poly8 poly8x8_t __attribute__ ((__vector_size__ (8)));
typedef __builtin_neon_poly16 poly16x4_t __attribute__ ((__vector_size__ (8)));
#ifdef __ARM_FEATURE_CRYPTO
typedef __builtin_neon_poly64 poly64x1_t;
#endif
//typedef uint8_t[8] uint8x8_t;
typedef __builtin_neon_uqi uint8x8_t __attribute__ ((__vector_size__ (8)));
//typedef uint16_t[4] uint16x4_t;
typedef __builtin_neon_uhi uint16x4_t __attribute__ ((__vector_size__ (8)));
//typedef uint32_t[2] uint32x2_t;
typedef __builtin_neon_usi uint32x2_t __attribute__ ((__vector_size__ (8)));
//typedef uint64_t[1] uint64x1_t;
typedef __builtin_neon_udi uint64x1_t;
128-bit data type, mapped to registers Q0-Q15
The corresponding C/C++ language types (types in stdint.h or csdtint header files) are described in the comments.
//typedef int8_t[16] int8x16_t;
typedef __builtin_neon_qi int8x16_t __attribute__ ((__vector_size__ (16)));
//typedef int16_t[8] int16x8_t;
typedef __builtin_neon_hi int16x8_t __attribute__ ((__vector_size__ (16)));
//typedef int32_t[4] int32x4_t;
typedef __builtin_neon_si int32x4_t __attribute__ ((__vector_size__ (16)));
//typedef int64_t[2] int64x2_t;
typedef __builtin_neon_di int64x2_t __attribute__ ((__vector_size__ (16)));
//typedef float32_t[4] float32x4_t;
typedef __builtin_neon_sf float32x4_t __attribute__ ((__vector_size__ (16)));
//poly8 and poly16 types are basically not used in common algorithms
//Detailed explanation:
//http://stackoverflow.com/questions/22224282/arm-neon-and-poly8-t-and-poly16-t
typedef __builtin_neon_poly8 poly8x16_t __attribute__ ((__vector_size__ (16)));
typedef __builtin_neon_poly16 poly16x8_t __attribute__ ((__vector_size__ (16)));
#ifdef __ARM_FEATURE_CRYPTO
typedef __builtin_neon_poly64 poly64x2_t __attribute__ ((__vector_size__ (16)));
#endif
//typedef uint8_t[16] uint8x16_t;
typedef __builtin_neon_uqi uint8x16_t __attribute__ ((__vector_size__ (16)));
//typedef uint16_t[8] uint16x8_t;
typedef __builtin_neon_uhi uint16x8_t __attribute__ ((__vector_size__ (16)));
//typedef uint32_t[4] uint32x4_t;
typedef __builtin_neon_usi uint32x4_t __attribute__ ((__vector_size__ (16)));
//typedef uint64_t[2] uint64x2_t;
typedef __builtin_neon_udi uint64x2_t __attribute__ ((__vector_size__ (16)));
typedef float float32_t;
typedef __builtin_neon_poly8 poly8_t;
typedef __builtin_neon_poly16 poly16_t;
#ifdef __ARM_FEATURE_CRYPTO
typedef __builtin_neon_poly64 poly64_t;
typedef __builtin_neon_poly128 poly128_t;
#endif
Structured data types
The following data types are structured data types that are combinations of the above basic data types and are usually mapped to multiple registers.
typedef struct int8x8x2_t
{
int8x8_t val[2];
} int8x8x2_t;
...
//omission...
...
#ifdef __ARM_FEATURE_CRYPTO
typedef struct poly64x2x4_t
{
poly64x2_t val[4];
} poly64x2x4_t;
#endif
Basic instruction set
NEON instructions can be divided into normal instructions, wide instructions, narrow instructions, saturated instructions, and long instructions according to the operand type.
Normal instructions: generate a result vector of the same size and usually the same type as the operand vectors.
Long instructions: perform operations on doubleword vector operands, producing quadword vectors as results. The generated elements are generally twice the width of the operand elements and are of the same type. L flag, such as VMOVL.
Wide instructions: One doubleword vector operand and one quadword vector operand perform an operation, producing a quadword vector result. W flag, such as VADDW.
Narrow instructions: perform operations on quadword vector operands and generate doubleword vector results, where the elements generated are generally half the width of the operand elements. N notation, such as VMOVN.
Saturation instruction: When the data type exceeds the specified range, it is automatically limited to the range. Q flag, such as VQSHRUN
NEON instructions can be divided into the following categories according to their functions: loading data, storing data, addition, subtraction, multiplication and division, logical AND/OR/XOR operations, comparison operations, etc. For more information, please refer to Appendix C and Appendix D in [1].
Commonly used instruction sets include:
Initialize registers
Each lane of the register is assigned a value N.
Result_t vcreate_type(Scalar_t N)
Result_t vdup_type(Scalar_t N)
Result_t vmov_type(Scalar_t N)
Lanes are described below.
Load memory data into registers
Load data into NEON registers at intervals of x
Result_t vld[x]_type(Scalar_t* N)
Result_t vld[x]q_type(Scalar_t* N)
The interval is x, and the data is loaded into the relevant lane (channel) of the NEON register. The data of other lanes (channels) does not change.
Result_t vld[x]_lane_type(Scalar_t* N,Vector_t M,int n)
Result_t vld[x]q_lane_type(Scalar_t* N,Vector_t M,int n)
Load x data from N and duplicate the data to all channels of registers 0-(x-1)
Result_t vld[x]_dup_type(Scalar_t* N)
Result_t vld[x]q_dup_type(Scalar_t* N)
Lane: For example, a float32x4_t NEON register has 4 lanes, each with a float32 value. Therefore, c++ float32x4_t dst = vld1q_lane_f32(float32_t* ptr,float32x4_t src,int n=2) means to first copy the value of the src register to the dst register, and then load the third (lane index starts at 0) float from the memory address ptr to the third lane (channel) of the dst register. Finally, the value of dst is: {src[0], src[1], ptr[2], src[3]}.
Interleaving: Interleaving access is an instruction unique to ARM NEON. For example, in c++ float32x4x3_t = vld3q_f32(float32_t* ptr), the interval is 3, which means that 12 float32s are read interleaved into 3 NEON registers. The values of the 3 registers are: {ptr[0],ptr[3],ptr[6],ptr[9]}, {ptr[1],ptr[4],ptr[7],ptr[10]}, {ptr[2],ptr[5],ptr[8],ptr[11]}.
Store register data to memory
The interval is x, storing the data of the NEON register to the memory
void vstx_type(Scalar_t* N)
void vstxq_type(Scalar_t* N)
Store the relevant lanes of NEON registers into memory at intervals of x
Result_t vst[x]_lane_type(Scalar_t* N,Vector_t M,int n)
Result_t vst[x]q_lane_type(Scalar_t* N,Vector_t M,int n)
Read/modify register data
Read the data of the nth channel of the register
Result_t vget_lane_type(Vector_t M,int n)
Read the high/low part of the register into a new register, and the data becomes narrower (halved in length).
Result_t vget_low_type(Vector_t M)
Result_t vget_high_type(Vector_t M)
Returns the register data of channel n set to N based on the copy of M
Result_t vset_lane_type(Scalar N,Vector_t M,int n)
Register data reordering
The data of the next n channels are taken out from register M and placed in the low position, and then the data of xn channels are taken out from register N and placed in the high position to form a new register data.
Result_t vext_type(Vector_t N,Vector_t M,int n)
Result_t vextq_type(Vector_t N,Vector_t M,int n)
Other data reordering instructions include:
vtbl_tyoe,vrev_type,vtrn_type,vzip_type,vunzip_type,vcombine ...
I will explain them one by one when I have time later.
Type conversion instructions
Force reinterpretation of register value type, from SrcType to DstType, the internal actual value remains unchanged and the total number of bytes remains unchanged, for example: vreinterpret_f32_s32(int32x2_t), converting from int32x2_t to float32x2_t.
vreinterpret_DstType_SrcType(Vector_t N)
Arithmetic instructions
[Normal instruction] Normal addition operation res = M+N
Result_t vadd_type(Vector_t M,Vector_t N)
Result_t vaddq_type(Vector_t M,Vector_t N)
[Long instruction] Variable-length addition operation res = M+N. To prevent overflow, one approach is to use the following instruction to store the addition result in a register of length x2, such as: vuint16x8_t res = vaddl_u8(uint8x8_t M,uint8x8_t N).
Previous article:ARM NEON Programming Series 1 - Introduction
Next article:ARM Address Space
Recommended ReadingLatest update time:2024-11-15 17:49
Professor at Beihang University, dedicated to promoting microcontrollers and embedded systems for over 20 years.
- LED chemical incompatibility test to see which chemicals LEDs can be used with
- Application of ARM9 hardware coprocessor on WinCE embedded motherboard
- What are the key points for selecting rotor flowmeter?
- LM317 high power charger circuit
- A brief analysis of Embest's application and development of embedded medical devices
- Single-phase RC protection circuit
- stm32 PVD programmable voltage monitor
- Introduction and measurement of edge trigger and level trigger of 51 single chip microcomputer
- Improved design of Linux system software shell protection technology
- What to do if the ABB robot protection device stops
- Keysight Technologies Helps Samsung Electronics Successfully Validate FiRa® 2.0 Safe Distance Measurement Test Case
- Innovation is not limited to Meizhi, Welling will appear at the 2024 China Home Appliance Technology Conference
- Innovation is not limited to Meizhi, Welling will appear at the 2024 China Home Appliance Technology Conference
- Huawei's Strategic Department Director Gai Gang: The cumulative installed base of open source Euler operating system exceeds 10 million sets
- Download from the Internet--ARM Getting Started Notes
- Learn ARM development(22)
- Learn ARM development(21)
- Learn ARM development(20)
- Learn ARM development(19)
- Learn ARM development(14)
- 【15th Anniversary】Want to meet up? Let's DIY an electronic tool box~You decide the functions
- [2022 Digi-Key Innovation Design Competition] Fully Automatic High-Pressure Steam Sterilization Controller - Unboxing
- The DAC output of the MCU is passed through the DAC0832
- What should I do if the current of the small motor is too large when it starts and the microcontroller is reset?
- MSP430 driver function for LCD1602
- With vias in vogue, where will high-speed DDR4 signals go?
- Micropython can be simulated in proteus
- EEWORLD University Hall----Live Replay: Introduction of ON Semiconductor's Photovoltaic and Energy Storage Products
- Watch the video to win a JD card | Taixiang test of Shuige cheats
- The motor coil is an inductive load, so the current in the coil will have a certain delay relative to the load voltage on the coil.