ARM NEON Programming Series 2 - Basic Instruction Set-EEWORLD

Collect

Preface

This series of blog posts is used to introduce the NEON instruction optimization under ARM CPU.

Blog post github address: github

Related code github address: github

NEON instruction set

Most mainstream compilers that support ARM CPU as target platform support NEON instructions. You can use NEON by embedding NEON assembly in your code, but a more common way is to write NEON code through NEON Instrinsic similar to C function. Just like NEON hello world. NEON Instrinsic is a set of buildin types and functions supported by the compiler, which basically covers all NEON instructions. Usually these Instrinsic are included in the arm_neon.h header file.

This article takes arm_neon.h of armv7 in android-ndk-r11c as an example to explain the NEON instruction type.

The ARMV7 architecture includes:

16 general registers (32 bits), R0-R15

16 NEON registers (128 bit), Q0-Q15 (can also be regarded as 32 64 bit registers, D0-D31)

16 VFP registers (32 bits), S0-S15

The difference between NEON and VFP is that VFP is a hardware that accelerates floating point calculations and does not have data parallelism capabilities. At the same time, VFP is more suitable for double-precision floating point calculations, while NEON only has single-precision floating point calculation capabilities. For more information, please refer to stackoverflow:neon vs vfp

Basic Data Types

64-bit data type, mapped to registers D0-D31

The corresponding C/C++ language types (types in stdint.h or csdtint header files) are described in the comments.

//typedef int8_t[8] int8x8_t;

typedef __builtin_neon_qi int8x8_t __attribute__ ((__vector_size__ (8)));

//typedef int16_t[4] int16x4_t;

typedef __builtin_neon_hi int16x4_t __attribute__ ((__vector_size__ (8)));

//typedef int32_t[2] int32x2_t;

typedef __builtin_neon_si int32x2_t __attribute__ ((__vector_size__ (8)));

//typedef int64_t[1] int64x1_t;

typedef __builtin_neon_di int64x1_t;

//typedef float16_t[4] float16x4_t;

// (Note: This type is half-precision and is supported on some new CPUs. This basic data type is not yet available in the C/C++ language annotation)

typedef __builtin_neon_hf float16x4_t __attribute__ ((__vector_size__ (8)));

//typedef float32_t[2] float32x2_t;

typedef __builtin_neon_sf float32x2_t __attribute__ ((__vector_size__ (8)));

//poly8 and poly16 types are basically not used in common algorithms

//Detailed explanation:

//http://stackoverflow.com/questions/22224282/arm-neon-and-poly8-t-and-poly16-t

typedef __builtin_neon_poly8 poly8x8_t __attribute__ ((__vector_size__ (8)));

typedef __builtin_neon_poly16 poly16x4_t __attribute__ ((__vector_size__ (8)));

#ifdef __ARM_FEATURE_CRYPTO

typedef __builtin_neon_poly64 poly64x1_t;

#endif

//typedef uint8_t[8] uint8x8_t;

typedef __builtin_neon_uqi uint8x8_t __attribute__ ((__vector_size__ (8)));

//typedef uint16_t[4] uint16x4_t;

typedef __builtin_neon_uhi uint16x4_t __attribute__ ((__vector_size__ (8)));

//typedef uint32_t[2] uint32x2_t;

typedef __builtin_neon_usi uint32x2_t __attribute__ ((__vector_size__ (8)));

//typedef uint64_t[1] uint64x1_t;

typedef __builtin_neon_udi uint64x1_t;

128-bit data type, mapped to registers Q0-Q15

The corresponding C/C++ language types (types in stdint.h or csdtint header files) are described in the comments.

//typedef int8_t[16] int8x16_t;

typedef __builtin_neon_qi int8x16_t __attribute__ ((__vector_size__ (16)));

//typedef int16_t[8] int16x8_t;

typedef __builtin_neon_hi int16x8_t __attribute__ ((__vector_size__ (16)));

//typedef int32_t[4] int32x4_t;

typedef __builtin_neon_si int32x4_t __attribute__ ((__vector_size__ (16)));

//typedef int64_t[2] int64x2_t;

typedef __builtin_neon_di int64x2_t __attribute__ ((__vector_size__ (16)));

//typedef float32_t[4] float32x4_t;

typedef __builtin_neon_sf float32x4_t __attribute__ ((__vector_size__ (16)));

//poly8 and poly16 types are basically not used in common algorithms

//Detailed explanation:

//http://stackoverflow.com/questions/22224282/arm-neon-and-poly8-t-and-poly16-t

typedef __builtin_neon_poly8 poly8x16_t __attribute__ ((__vector_size__ (16)));

typedef __builtin_neon_poly16 poly16x8_t __attribute__ ((__vector_size__ (16)));

#ifdef __ARM_FEATURE_CRYPTO

typedef __builtin_neon_poly64 poly64x2_t __attribute__ ((__vector_size__ (16)));

#endif

//typedef uint8_t[16] uint8x16_t;

typedef __builtin_neon_uqi uint8x16_t __attribute__ ((__vector_size__ (16)));

//typedef uint16_t[8] uint16x8_t;

typedef __builtin_neon_uhi uint16x8_t __attribute__ ((__vector_size__ (16)));

//typedef uint32_t[4] uint32x4_t;

typedef __builtin_neon_usi uint32x4_t __attribute__ ((__vector_size__ (16)));

//typedef uint64_t[2] uint64x2_t;

typedef __builtin_neon_udi uint64x2_t __attribute__ ((__vector_size__ (16)));

typedef float float32_t;

typedef __builtin_neon_poly8 poly8_t;

typedef __builtin_neon_poly16 poly16_t;

#ifdef __ARM_FEATURE_CRYPTO

typedef __builtin_neon_poly64 poly64_t;

typedef __builtin_neon_poly128 poly128_t;

#endif

Structured data types

The following data types are structured data types that are combinations of the above basic data types and are usually mapped to multiple registers.

typedef struct int8x8x2_t

{

int8x8_t val[2];

} int8x8x2_t;

...

//omission...

...

#ifdef __ARM_FEATURE_CRYPTO

typedef struct poly64x2x4_t

{

poly64x2_t val[4];

} poly64x2x4_t;

#endif

Basic instruction set

NEON instructions can be divided into normal instructions, wide instructions, narrow instructions, saturated instructions, and long instructions according to the operand type.

Normal instructions: generate a result vector of the same size and usually the same type as the operand vectors.

Long instructions: perform operations on doubleword vector operands, producing quadword vectors as results. The generated elements are generally twice the width of the operand elements and are of the same type. L flag, such as VMOVL.

Wide instructions: One doubleword vector operand and one quadword vector operand perform an operation, producing a quadword vector result. W flag, such as VADDW.

Narrow instructions: perform operations on quadword vector operands and generate doubleword vector results, where the elements generated are generally half the width of the operand elements. N notation, such as VMOVN.

Saturation instruction: When the data type exceeds the specified range, it is automatically limited to the range. Q flag, such as VQSHRUN

NEON instructions can be divided into the following categories according to their functions: loading data, storing data, addition, subtraction, multiplication and division, logical AND/OR/XOR operations, comparison operations, etc. For more information, please refer to Appendix C and Appendix D in [1].

Commonly used instruction sets include:

Initialize registers

Each lane of the register is assigned a value N.

Result_t vcreate_type(Scalar_t N)

Result_t vdup_type(Scalar_t N)

Result_t vmov_type(Scalar_t N)

Lanes are described below.

Load memory data into registers

Load data into NEON registers at intervals of x

Result_t vld[x]_type(Scalar_t* N)

Result_t vld[x]q_type(Scalar_t* N)

The interval is x, and the data is loaded into the relevant lane (channel) of the NEON register. The data of other lanes (channels) does not change.

Result_t vld[x]_lane_type(Scalar_t* N,Vector_t M,int n)

Result_t vld[x]q_lane_type(Scalar_t* N,Vector_t M,int n)

Load x data from N and duplicate the data to all channels of registers 0-(x-1)

Result_t vld[x]_dup_type(Scalar_t* N)

Result_t vld[x]q_dup_type(Scalar_t* N)

Lane: For example, a float32x4_t NEON register has 4 lanes, each with a float32 value. Therefore, c++ float32x4_t dst = vld1q_lane_f32(float32_t* ptr,float32x4_t src,int n=2) means to first copy the value of the src register to the dst register, and then load the third (lane index starts at 0) float from the memory address ptr to the third lane (channel) of the dst register. Finally, the value of dst is: {src[0], src[1], ptr[2], src[3]}.

Interleaving: Interleaving access is an instruction unique to ARM NEON. For example, in c++ float32x4x3_t = vld3q_f32(float32_t* ptr), the interval is 3, which means that 12 float32s are read interleaved into 3 NEON registers. The values of the 3 registers are: {ptr[0],ptr[3],ptr[6],ptr[9]}, {ptr[1],ptr[4],ptr[7],ptr[10]}, {ptr[2],ptr[5],ptr[8],ptr[11]}.

Store register data to memory

The interval is x, storing the data of the NEON register to the memory

void vstx_type(Scalar_t* N)

void vstxq_type(Scalar_t* N)

Store the relevant lanes of NEON registers into memory at intervals of x

Result_t vst[x]_lane_type(Scalar_t* N,Vector_t M,int n)

Result_t vst[x]q_lane_type(Scalar_t* N,Vector_t M,int n)

Read/modify register data

Read the data of the nth channel of the register

Result_t vget_lane_type(Vector_t M,int n)

Read the high/low part of the register into a new register, and the data becomes narrower (halved in length).

Result_t vget_low_type(Vector_t M)

Result_t vget_high_type(Vector_t M)

Returns the register data of channel n set to N based on the copy of M

Result_t vset_lane_type(Scalar N,Vector_t M,int n)

The data of the next n channels are taken out from register M and placed in the low position, and then the data of xn channels are taken out from register N and placed in the high position to form a new register data.

Result_t vext_type(Vector_t N,Vector_t M,int n)

Result_t vextq_type(Vector_t N,Vector_t M,int n)

Other data reordering instructions include:

vtbl_tyoe,vrev_type,vtrn_type,vzip_type,vunzip_type,vcombine ...

I will explain them one by one when I have time later.

Type conversion instructions

Force reinterpretation of register value type, from SrcType to DstType, the internal actual value remains unchanged and the total number of bytes remains unchanged, for example: vreinterpret_f32_s32(int32x2_t), converting from int32x2_t to float32x2_t.

vreinterpret_DstType_SrcType(Vector_t N)

Arithmetic instructions

[Normal instruction] Normal addition operation res = M+N

Result_t vadd_type(Vector_t M,Vector_t N)

Result_t vaddq_type(Vector_t M,Vector_t N)

[Long instruction] Variable-length addition operation res = M+N. To prevent overflow, one approach is to use the following instruction to store the addition result in a register of length x2, such as: vuint16x8_t res = vaddl_u8(uint8x8_t M,uint8x8_t N).

[1] [2]

Keywords：ARM NEON Reference address：ARM NEON Programming Series 2 - Basic Instruction Set

Previous article：ARM NEON Programming Series 1 - Introduction
Next article：ARM Address Space

Recommended ReadingLatest update time:2024-11-15 17:49

Epoch-making update! Arm's new generation Armv9 architecture targets Intel

On Tuesday, local time, Arm launched a new generation of instruction set architecture Armv9, with increasingly powerful security and artificial intelligence capabilities to cope with ubiquitous professional processing needs. This is Arm's biggest technological innovation in a decade. The previous generation Armv8 was

[Mobile phone portable]

Design of network camera based on ARM and Ethernet power supply

　　1 System Structure 　　The whole system consists of AT91RM9200 processor, CMOS sensor, audio acquisition system, Ethernet power supply system and Ethernet data communication. First, the image is collected through the CMOS sensor lens, and the audio can also be collected at the same time. After being processed by the

[Microcontroller]

Design of network camera based on ARM and Ethernet power supply

3. Arm machine code

First the assembly program is converted into machine code before it can be run in the machine. First, we disassemble the .elf file generated in the bare metal code above: start.elf: file format elf32-littlearm Disassembly of section .text: 50008000 _start : .text .global _start

[Microcontroller]

ARM Basic Learning-ATPCS Subroutine Call Basic Specifications

ATPCS is the abbreviation of Arm Thumb Procedure Call Standard, which means the basic specification of subroutine calls in arm programs and thumb programs. Register usage rules When the number of parameters is less than or equal to 4, the parameters are passed between subroutines through R0~R3, and there is no nee

[Microcontroller]

ARM instruction analysis

Today I will summarize the study of arm instructions. Today I will not analyze all arm instructions one by one. Here I hope everyone will read the arm assembly manual. I put the Chinese version of this manual at http://download.csdn.net/detail/wrjvszq/8324589. Everyone should get this document first. This document

[Microcontroller]

Design of speaker-independent speech recognition system based on ARM processor

　　With the widespread application of high-tech in the military field, weapons and equipment are gradually developing towards high, precise and advanced directions. Traditional military training often fails to achieve the expected training effect due to long training time, high training costs and narrow training space,

[Microcontroller]

Design of speaker-independent speech recognition system based on ARM processor

ARM bare metal development: C language lights up LED

1. Hardware Platform: Zhengdian Atom i.MX6U Alpha Development Board 2. Compile and build C development environment To develop software using C language, you first need to use assembly to build the C language runtime environment and use assembly to initialize the C language environment, such

[Microcontroller]

ARM bare metal development: C language lights up LED

ARM register introduction

Summary: In the study of ARM, registers run through the whole process. Basically, the beginning of each textbook will first introduce the working mode and register knowledge. This part of the content is very important, but it often does not attract the attention of beginners. I will not go into details about the ARM pr

[Microcontroller]

Popular Resources
Popular amplifiers