Detailed explanation of C6000 optimization instructions (with modifications)

Aguilera


1. Absolute value functions

(1) _abs()
    C code: int _abs(int src)
    Assembly: ABS
    Function: Calculate the absolute value of 32-bit data.

(2) _labs()
    C code: int _labs(long src)
    Assembly: ABS
    Function: Calculate the absolute value of 40-bit data.

(3) _abs2()
    C code: int _abs2(int src)
    Assembly: ABS2
    Function: Calculate the absolute value of the high 16 bits and the low 16 bits separately, that is,
        return[31:16] = |src[31:16]|
        return[15: 0] = |src[15: 0]|

2. Arithmetic instructions

(1) _add2()
    C code: int _add2(int src1, int src2)
    Assembly: ADD2
    Function: Add the high 16 bits of src1 and src2 and the low 16 bits of src1 and src2 at the same time, ignoring any carry, that is,
        return[31:16] = src1[31:16] + src2[31:16]
        return[15: 0] = src1[15: 0] + src2[15: 0]

(2) _sadd()
    C code: int _sadd(int src1, int src2)
    Assembly: SADD
    Function: Add src1 and src2 and saturate the 32-bit result.

(3) _lsadd()
    C code: long _lsadd(int src1, long src2)
    Assembly: SADD
    Function: Add 32-bit data to 40-bit data with saturation and return 40-bit data.

(4) _add4()
    C code: int _add4(int src1, int src2)
    Assembly: ADD4
    Function: Perform 4 additions on the corresponding bytes of src1 and src2 at the same time, ignoring any carries, that is,
        return[31:24] = src1[31:24] + src2[31:24]
        return[23:16] = src1[23:16] + src2[23:16]
        return[15: 8] = src1[15: 8] + src2[15: 8]
        return[ 7: 0] = src1[ 7: 0] + src2[ 7: 0]
    Note: Each 8-bit field of src1 and src2 is treated as signed data.

(5) _sadd2()
    C code: int _sadd2(int src1, int src2)
    Assembly: SADD2
    Function: Add the high 16 bits and the low 16 bits of src1 and src2 simultaneously, saturating each 16-bit result instead of wrapping.
    That is,
        return[31:16] = src1[31:16] + src2[31:16]
        return[15: 0] = src1[15: 0] + src2[15: 0]
    Note: Each 16-bit field of src1 and src2 is treated as signed data.

(6) _saddus2()
    C code: int _saddus2(unsigned src1, int src2)
    Assembly: SADDUS2
    Function: Perform the same operation as _sadd2(), but src1 is interpreted differently (see the note).
    Note: Each 16-bit field of src1 is treated as unsigned data, and each 16-bit field of src2 is treated as signed data.

(7) _saddu4()
    C code: unsigned _saddu4(unsigned src1, unsigned src2)
    Assembly: SADDU4
    Function: Perform the same operation as _add4(), but treat the data as unsigned and saturate each byte at 0xff.

(8) _addsub()
    C code: long long _addsub(int src1, int src2)
    Assembly: ADDSUB
    Function: Perform src1 + src2 and src1 - src2 at the same time, that is,
        hi32(return)  = src1 + src2
        low32(return) = src1 - src2

(9) _addsub2()
    C code: long long _addsub2(int src1, int src2)
    Assembly: ADDSUB2
    Function: Perform _add2() and _sub2() at the same time, that is,
        return[63:48] = hi16(src1) + hi16(src2)
        return[47:32] = low16(src1) + low16(src2)
        return[31:16] = hi16(src1) - hi16(src2)
        return[15: 0] = low16(src1) - low16(src2)

(10) _saddsub()
    C code: long long _saddsub(unsigned src1, unsigned src2)
    Assembly: SADDSUB
    Function: Perform _sadd() and _ssub() (saturated addition and subtraction) at the same time, that is,
        return[63:32] = src1 + src2
        return[31: 0] = src1 - src2

(11) _saddsub2()
    C code: long long _saddsub2(unsigned src1, unsigned src2)
    Assembly: SADDSUB2
    Function: Perform _sadd2() and _ssub2() at the same time, that is,
        return[63:48] = src1[31:16] + src2[31:16]
        return[47:32] = src1[15: 0] + src2[15: 0]
        return[31:16] = src1[31:16] - src2[31:16]
        return[15: 0] = src1[15: 0] - src2[15: 0]

(12) _ssub2()
    C code: int _ssub2(unsigned src1, unsigned src2)
    Assembly: SSUB2
    Function: Subtract the high 16 bits and the low 16 bits at the same time, saturating each result, that is,
        return[31:16] = src1[31:16] - src2[31:16]
        return[15: 0] = src1[15: 0] - src2[15: 0]

(13) _mpy2(), _mpy2ll()
    C code: double _mpy2(int src1, int src2), long long _mpy2ll(int src1, int src2)
    Assembly: MPY2
    Function: Execute two signed 16-bit * 16-bit operations at the same time, that is,
        return[63:32] = src1[31:16] * src2[31:16]
        return[31: 0] = src1[15: 0] * src2[15: 0]

(14) _mpyhi(), _mpyhill()
    C code: double _mpyhi(int src1, int src2), long long _mpyhill(int src1, int src2)
    Assembly: MPYHI
    Function: Execute a 16-bit * 32-bit operation, that is,
        return = src1[31:16] * src2[31: 0]

(15) _mpyli(), _mpylill()
    C code: double _mpyli(int src1, int src2), long long _mpylill(int src1, int src2)
    Assembly: MPYLI
    Function: Execute a 16-bit * 32-bit operation, that is,
        return = src1[15: 0] * src2[31: 0]

(16) _mpyhir()
    C code: int _mpyhir(int src1, int src2)
    Assembly: MPYHIR
    Function: Execute a (16-bit * 32-bit) >> 15 operation, that is,
        return = (src1[31:16] * src2[31: 0]) >> 15
    Note: The result is rounded rather than truncated. For example, for 0x1122 * 0x55667788 a plain shift gives 0x0b6e4b17, but the simulation result is 0x0b6e4b18.

(16) _mpylir()
    C code: int _mpylir(int src1, int src2)
    Assembly: MPYLIR
    Function: Execute a (16-bit * 32-bit) >> 15 operation, that is,
        return = (src1[15: 0] * src2[31: 0]) >> 15
    Note: The result is rounded rather than truncated. For example, for 0x1122 * 0x55667788 a plain shift gives 0x0b6e4b17, but the simulation result is 0x0b6e4b18.

(17) _mpy*u4(), _mpy*u4ll()
    C code: double _mpysu4(int src1, int src2), long long _mpysu4ll(int src1, int src2)
            double _mpyu4(unsigned src1, unsigned src2), long long _mpyu4ll(unsigned src1, unsigned src2)
    Assembly: MPYSU4, MPYU4 (for example, MPYU4.M2X B4,A3,B5:A4)
    Function: Execute four 8-bit * 8-bit operations at the same time, that is,
        return[63:48] = src1[31:24] * src2[31:24]
        return[47:32] = src1[23:16] * src2[23:16]
        return[31:16] = src1[15: 8] * src2[15: 8]
        return[15: 0] = src1[ 7: 0] * src2[ 7: 0]

(18) _smpy2(), _smpy2ll()
    C code: double _smpy2(int src1, int src2), long long _smpy2ll(int src1, int src2)
    Assembly: SMPY2
    Function: Execute two 16-bit * 16-bit operations at the same time, and then shift each result left by 1 bit, that is,
        return[63:32] = sat((src1[31:16] * src2[31:16]) << 1)
        return[31: 0] = sat((src1[15: 0] * src2[15: 0]) << 1)
    where sat() saturates to the signed 32-bit range.

(19) _mpy32**()
    C code: int _mpy32(int src1, int src2), long long _mpy32ll(int src1, int src2)
            long long _mpy32su(int src1, unsigned src2), long long _mpy32us(unsigned src1, int src2)
            long long _mpy32u(unsigned src1, unsigned src2)
    Assembly: MPY32, MPY32SU, MPY32US, MPY32U (for example, MPY32U.M2X B4,A3,B5:A4)
    Function: Execute a 32-bit * 32-bit operation. _mpy32() returns the low 32 bits of the product; the other variants return the full 64-bit product.

(20) _mpy2ir()
    C code: long long _mpy2ir(int src1, int src2)
    Assembly: MPY2IR
    Function: Return the following result:
        return[63:32] = (src1[31:16] * src2) >> 15
        return[31: 0] = (src1[15: 0] * src2) >> 15
    Remarks: Each part is rounded.

(21) _gmpy()
    C code: unsigned _gmpy(unsigned src1, unsigned src2)
    Assembly: GMPY
    Function: Execute a Galois field multiply.

(22) _smpy**()
    C code: int _smpy(int src1, int src2), int _smpyh(int src1, int src2)
            int _smpyhl(int src1, int src2), int _smpylh(int src1, int src2)
    Assembly: SMPY, SMPYH, SMPYHL, SMPYLH
    Function: Execute a 16-bit * 16-bit operation, shift the result left by one bit, and saturate it (a result of 0x80000000 is replaced by 0x7fffffff).
        _smpy:   return[31: 0] = (src1[15: 0] * src2[15: 0]) << 1
        _smpyh:  return[31: 0] = (src1[31:16] * src2[31:16]) << 1
        _smpyhl: return[31: 0] = (src1[31:16] * src2[15: 0]) << 1
        _smpylh: return[31: 0] = (src1[15: 0] * src2[31:16]) << 1

(23) _mpy**()
    C code: int _mpy(int src1, int src2), int _mpyus(unsigned src1, int src2)
            int _mpysu(int src1, unsigned src2), unsigned _mpyu(unsigned src1, unsigned src2)
    Assembly: MPY, MPYUS, MPYSU, MPYU
    Function: Return the result of src1[15: 0] * src2[15: 0]. The u/s letters in the name select the unsigned or signed interpretation of each operand.

(24) _mpyh**()
    C code: int _mpyh(int src1, int src2), int _mpyhus(unsigned src1, int src2)
            int _mpyhsu(int src1, unsigned src2), int _mpyhu(unsigned src1, unsigned src2)
    Assembly: MPYH, MPYHUS, MPYHSU, MPYHU
    Function: Return the result of src1[31:16] * src2[31:16].

(25) _mpyh*l*()
    C code: int _mpyhl(int src1, int src2), int _mpyhuls(unsigned src1, int src2)
            int _mpyhslu(int src1, unsigned src2), int _mpyhlu(unsigned src1, unsigned src2)
    Assembly: MPYHL, MPYHULS, MPYHSLU, MPYHLU
    Function: Return the result of src1[31:16] * src2[15: 0].

(26) _mpyl*h*()
    C code: int _mpylh(int src1, int src2), int _mpyluhs(unsigned src1, int src2)
            int _mpylshu(int src1, unsigned src2), int _mpylhu(unsigned src1, unsigned src2)
    Assembly: MPYLH, MPYLUHS, MPYLSHU, MPYLHU
    Function: Return the result of src1[15: 0] * src2[31:16].

(27) _*ssub()
    C code: int _ssub(int src1, int src2), long _lssub(int src1, long src2)
    Assembly: SSUB (for example, SSUB.L2X B4,A3,B4)
    Function: Execute src1 - src2 and saturate the result to the int (32-bit) or long (40-bit) range.

(28) _subc()
    C code: unsigned _subc(int src1, int src2)
    Assembly: SUBC
    Function: Conditional subtract and shift, used as one step of a division loop: if src1 - src2 >= 0, return ((src1 - src2) << 1) + 1; otherwise return src1 << 1.

(29) _sub2()
    C code: int _sub2(int src1, int src2)
    Assembly: SUB2
    Function: Execute the high 16-bit and low 16-bit subtractions at the same time, that is,
        return[31:16] = src1[31:16] - src2[31:16]
        return[15: 0] = src1[15: 0] - src2[15: 0]

(30) _sub4()
    C code: int _sub4(int src1, int src2)
    Assembly: SUB4
    Function: Perform four 8-bit subtractions simultaneously, that is,
        return[31:24] = src1[31:24] - src2[31:24]
        return[23:16] = src1[23:16] - src2[23:16]
        return[15: 8] = src1[15: 8] - src2[15: 8]
        return[ 7: 0] = src1[ 7: 0] - src2[ 7: 0]

(31) _subabs4()
    C code: int _subabs4(int src1, int src2)
    Assembly: SUBABS4
    Function: Execute four 8-bit subtractions at the same time and then take the absolute value of each, that is,
        return[31:24] = |src1[31:24] - src2[31:24]|
        return[23:16] = |src1[23:16] - src2[23:16]|
        return[15: 8] = |src1[15: 8] - src2[15: 8]|
        return[ 7: 0] = |src1[ 7: 0] - src2[ 7: 0]|

(32) _avg2()
    C code: int _avg2(int src1, int src2)
    Assembly: AVG2
    Function: Calculate the averages of the two pairs of 16-bit values and round the results:
        return[31:16] = (src1[31:16] + src2[31:16] + 1) / 2
        return[15: 0] = (src1[15: 0] + src2[15: 0] + 1) / 2

(33) _avgu4()
    C code: int _avgu4(int src1, int src2)
    Assembly: AVGU4
    Function: Calculate the averages of the four pairs of unsigned 8-bit values and round the results:
        return[31:24] = (src1[31:24] + src2[31:24] + 1) / 2
        return[23:16] = (src1[23:16] + src2[23:16] + 1) / 2
        return[15: 8] = (src1[15: 8] + src2[15: 8] + 1) / 2
        return[ 7: 0] = (src1[ 7: 0] + src2[ 7: 0] + 1) / 2

3. Bit operation instructions

(1) _clr()
    C code: int _clr(unsigned src, unsigned csta, unsigned cstb)
    Assembly: CLR
    Function: Clear bits csta to cstb of src, that is, src[cstb:csta] = 0.
    Note: csta must be <= cstb, and both must be < 32.

(2) _clrr()
    C code: int _clrr(unsigned src, int shift)
    Assembly: CLR
    Function: Clear the bits of src from position shift[ 9: 5] to position shift[ 4: 0].

(3) _set()
    C code: int _set(unsigned src, unsigned csta, unsigned cstb)
    Assembly: SET
    Function: Set bits csta to cstb of src to '1', that is, src[cstb:csta] = all ones.
    Note: csta must be <= cstb, and both must be < 32.

(4) _setr()
    C code: int _setr(unsigned src, int shift)
    Assembly: SET
    Function: Set the bits of src from position shift[ 9: 5] to position shift[ 4: 0] to '1'.

(5) _sshl()
    C code: int _sshl(int src, unsigned shift)
    Assembly: SSHL
    Function: return[31: 0] = sat(src << shift)
    Note: The shifted result is saturated to the signed 32-bit range.

(6) _rotl()
    C code: int _rotl(unsigned src, unsigned shift)
    Assembly: ROTL
    Function: return[31: 0] = (src << shift) | (src >> (32 - shift))
    Note: This is a rotation: bits shifted out on the left re-enter on the right, and there is no sign extension.

(7) _shlmb(), _shrmb()
    C code: int _shlmb(int src1, int src2), int _shrmb(int src1, int src2)
    Assembly: SHLMB, SHRMB
    Function:
        _shlmb: return[31: 0] = (src2 << 8) | src1[31:24]
        _shrmb: return[31: 0] = (src2 >> 8) | (src1[ 7: 0] << 24)

(8) _shr2(), _shru2()
    C code: int _shr2(int src1, unsigned shift), int _shru2(unsigned src1, unsigned shift)
    Assembly: SHR2, SHRU2
    Function:
        return[31:16] = src1[31:16] >> shift
        return[15: 0] = src1[15: 0] >> shift
    Note: For the signed operation each 16-bit result is sign-extended (the vacated bits are filled with copies of the sign bit).

(9) _sshvl(), _sshvr()
    C code: int _sshvl(int src, int shift), int _sshvr(int src, int shift)
    Assembly: SSHVL, SSHVR
    Function:
        _sshvl: return[31: 0] = (src << shift) > MAX_INT ? MAX_INT : (src << shift)
        _sshvr: return[31: 0] = (src >> shift) < MIN_INT ? MIN_INT : (src >> shift)

(10) _shfl()
    C code: int _shfl(int src)
    Assembly: SHFL
    Function: The lower 16 bits are placed in the even bit positions and the upper 16 bits in the odd bit positions, that is,
        return[31: 0] = src[31] src[15] src[30] src[14] ... src[16] src[0]

(11) _ext()
    C code: int _ext(int src, unsigned lshift, unsigned rshift)
    Assembly: EXT
    Function: return[31: 0] = (src << lshift) >> rshift, where the right shift is arithmetic (sign-extending).

(12) _extr()
    C code: int _extr(int src, int shift)
    Assembly: EXT
    Function: return[31: 0] = (src << shift[ 9: 5]) >> shift[ 4: 0], where the right shift is arithmetic (sign-extending).

(13) _extu()
    C code: int _extu(uint src, unsigned lshift, unsigned rshift)
    Assembly: EXTU
    Function: return[31: 0] = (src << lshift) >> rshift, where the right shift is logical (zero-filling).

(14) _extur()
    C code: int _extur(uint src, int shift)
    Assembly: EXTU
    Function: return[31: 0] = (src << shift[ 9: 5]) >> shift[ 4: 0], where the right shift is logical (zero-filling).

(15) _lmbd()
    C code: unsigned _lmbd(int zero_or_one, int src)
    Assembly: LMBD
    Function: Search src from left to right for the first bit equal to zero_or_one and return its position, counted from the left.
    Remarks: zero_or_one must be 0 or 1. For any other value the compiler will not generate an LMBD instruction. For example, if src = 0x0fff0000, then
        _lmbd(0, src) == 0   /* D31 is '0', so return 0 */
        _lmbd(1, src) == 4   /* D27 is '1', so return 4 */

(16) _*norm()
    C code: unsigned _norm(int src), unsigned _lnorm(long src)
    Assembly: NORM (for example, NORM B4,B4)
    Function: Get the number of redundant sign bits.

(17) _bitc4()
    C code: unsigned _bitc4(unsigned src)
    Assembly: BITC4
    Function: Count the number of '1' bits in each byte and pack the four counts into an unsigned return value.
    Note: For example, for src = 0x0103070f the four bytes contain 1, 2, 3 and 4 '1' bits respectively, so the return value is 0x01020304.

(18) _bitr()
    C code: unsigned _bitr(unsigned src)
    Assembly: BITR
    Function: Reverse all bits, that is, return[31: 0] = src[ 0:31].
    Remarks: For example, if src = '00010001000100010001000100010001', the return value is '10001000100010001000100010001000'.

(19) _deal()
    C code: unsigned _deal(unsigned src)
    Assembly: DEAL
    Function: All odd bits are gathered into one 16-bit value and all even bits into another, and the two are returned as one 32-bit value, that is,
        return[31:16] = src[31,29,27,...,1]
        return[15: 0] = src[30,28,26,...,0]

4. Memory operation instructions

(1) _amem*()
    C code: ushort& _amem2(void* ptr), const ushort& _amem2_const(void* ptr)
            unsigned& _amem4(void* ptr), const unsigned& _amem4_const(void* ptr)
            long long& _amem8(void* ptr), const long long& _amem8_const(void* ptr)
            double& _amemd8(void* ptr), const double& _amemd8_const(void* ptr)
    Assembly: Omitted
    Function: Read/write n bytes of data at an address aligned to n bytes, where n is the number in the name.
    Remarks:
        Read:
            double val;
            char test[8] = {0,1,2,3,4,5,6,7};
            val = _amem2_const(&test) + _amem4_const(&test) + _amem8_const(&test);
        Write:
            _amem2(&test) = 0x0011;
            _amem4(&test) = 0x00112233;
            _amem8(&test) = 0x0011223344556677;

(2) _mem*()
    C code: ushort& _mem2(void* ptr), const ushort& _mem2_const(void* ptr)
            unsigned& _mem4(void* ptr), const unsigned& _mem4_const(void* ptr)
            long long& _mem8(void* ptr), const long long& _mem8_const(void* ptr)
            double& _memd8(void* ptr), const double& _memd8_const(void* ptr)
    Assembly: Omitted
    Function: Read/write n bytes of data at an unaligned address, where n is the number in the name.
    Remarks:
        Read:
            double val;
            char test[8] = {0,1,2,3,4,5,6,7};
            val = _mem2_const(&test) + _mem4_const(&test) + _mem8_const(&test);
        Write:
            _mem2(&test) = 0x0011;
            _mem4(&test) = 0x00112233;
            _mem8(&test) = 0x0011223344556677;

(3) _mvd()
    C code: int _mvd(int src)
    Assembly: MVD
    Function: Use the 4-cycle multiplier pipeline to copy data, return[31: 0] = src[31: 0].
    Remarks: This needs to be scheduled together with _mpy**() to achieve parallel work.

5. Data packaging/conversion instructions

(1) _hi**()
    C code: unsigned _hi(double src), unsigned _hill(long long src)
    Assembly: None
    Function: Return the high 32 bits of 64-bit data.

(2) _lo**()
    C code: unsigned _lo(double src), unsigned _loll(long long src)
    Assembly: None
    Function: Return the low 32 bits of 64-bit data.

(3) _*to*()
    C code: ulong _dtol(double src), unsigned _ftoi(float src)
            double _itod(unsigned hi32, unsigned low32), float _itof(unsigned src)
            long long _itoll(unsigned hi32, unsigned low32), double _ltod(long src)
    Assembly: None
    Function: Reinterpret the bits of one data type as another data type. No numeric conversion is performed.

(4) _sat()
    C code: int _sat(long src2)
    Assembly: SAT
    Function: Convert 40-bit long data to 32-bit data with saturation.

(5) _pack*2()
    C code: unsigned _pack2(unsigned src1, unsigned src2), unsigned _packh2(unsigned src1, unsigned src2)
    Assembly: PACK2, PACKH2
    Function:
        _pack2:  return[31:16] = src1[15: 0], return[15: 0] = src2[15: 0]
        _packh2: return[31:16] = src1[31:16], return[15: 0] = src2[31:16]

(6) _pack*4()
    C code: unsigned _packh4(unsigned src1, unsigned src2), unsigned _packl4(unsigned src1, unsigned src2)
    Assembly: PACKH4, PACKL4
    Function: Return the alternating bytes of the two sources packed into 4 bytes.
    Remarks: For example, if src1 = 0x11223344 and src2 = 0x55667788, then
        _packh4(src1, src2) returns 0x11335577
        _packl4(src1, src2) returns 0x22446688

(7) _pack**2()
    C code: unsigned _packhl2(unsigned src1, unsigned src2), unsigned _packlh2(unsigned src1, unsigned src2)
    Assembly: PACKHL2, PACKLH2
    Function:
        _packhl2: return[31:16] = src1[31:16], return[15: 0] = src2[15: 0]
        _packlh2: return[31:16] = src1[15: 0], return[15: 0] = src2[31:16]

(8) _spack2()
    C code: int _spack2(int src1, int src2)
    Assembly: SPACK2
    Function: Saturate two 32-bit values to 16 bits each and combine them into one 32-bit value.
    Remarks:
        return[31:16] = sat16(src1)
        return[15: 0] = sat16(src2)
    where sat16() saturates to the signed 16-bit range.

(9) _spacku4()
    C code: unsigned _spacku4(int src1, int src2)
    Assembly: SPACKU4
    Function: Saturate four signed 16-bit values to four unsigned 8-bit values and return them as one 32-bit value.
    Remarks:
        return[31:24] = satu8(src1[31:16])
        return[23:16] = satu8(src1[15: 0])
        return[15: 8] = satu8(src2[31:16])
        return[ 7: 0] = satu8(src2[15: 0])
    where satu8() saturates to the unsigned 8-bit range 0 to 255.

(10) _swap4()
    C code: unsigned _swap4(unsigned src)
    Assembly: SWAP4
    Function: Swap the two bytes within each 16-bit half (big-endian/little-endian conversion of each halfword).
    Remarks: return[31:24] and return[23:16] are swapped; return[15: 8] and return[ 7: 0] are swapped.

(11) _unpkhu4()
    C code: unsigned _unpkhu4(unsigned src)
    Assembly: UNPKHU4
    Function: Convert the two high 8-bit values into two 16-bit values.
    Remarks:
        return[31:16] = (uint16_t)src[31:24]
        return[15: 0] = (uint16_t)src[23:16]

(12) _unpklu4()
    C code: unsigned _unpklu4(unsigned src)
    Assembly: UNPKLU4
    Function: Convert the two low 8-bit values into two 16-bit values.
    Remarks:
        return[31:16] = (uint16_t)src[15: 8]
        return[15: 0] = (uint16_t)src[ 7: 0]

6. Comparison/Miscellaneous Instructions

(1) _cmpeq*(), _cmpgt*()
    C code: int _cmpeq2(int src1, int src2), int _cmpeq4(int src1, int src2)
            int _cmpgt2(int src1, int src2), int _cmpgtu4(unsigned src1, unsigned src2)
    Assembly: CMPEQ2, CMPEQ4, CMPGT2, CMPGTU4
    Function: Compare two pairs of 16-bit values or four pairs of 8-bit values at the same time. The comparison results are placed in the low 2 bits or the low 4 bits of the return value, one bit per field.
    Remarks:
        _cmpeq2(0x11223344, 0x11220000) returns 0x02
        _cmpeq4(0x11223344, 0x00223344) returns 0x07
        _cmpgt2(0x00001111, 0x0000ffff) returns 0x01
        _cmpgtu4(0x0000ffff, 0x0000aaaa) returns 0x03

(2) _xpnd*()
    C code: int _xpnd2(int src), int _xpnd4(int src)
    Assembly: XPND2, XPND4
    Function:
        _xpnd2() expands the low 2 bits of src into two 16-bit logical values.
        _xpnd4() expands the low 4 bits of src into four 8-bit logical values.
    Remarks: _xpnd*() is usually used together with _cmp*() to turn comparison results into masks.
        _xpnd2(0x01) = 0x0000ffff
        _xpnd2(0x03) = 0xffffffff
        _xpnd2(0x00) = 0x00000000
        _xpnd4(0x00) = 0x00000000
        _xpnd4(0x08) = 0xff000000
        _xpnd4(0x07) = 0x00ffffff
        _xpnd4(0x01) = 0x000000ff
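
The remark that _xpnd*() is usually paired with _cmp*() can be made concrete with a small sketch. The code below is plain portable C that models the documented behaviour of _cmpgtu4() and _xpnd4(); it does not call the TI intrinsics, and the names ref_cmpgtu4, ref_xpnd4 and ref_maxu4_via_select are hypothetical helpers invented for this illustration. It combines the two steps into a byte-wise unsigned maximum, which is one typical use of the compare-then-expand pattern.

```c
#include <stdint.h>

/* Portable model of _cmpgtu4() (hypothetical helper, not the intrinsic):
   bit i of the result is 1 when unsigned byte i of a is greater than
   unsigned byte i of b. */
uint32_t ref_cmpgtu4(uint32_t a, uint32_t b)
{
    uint32_t mask = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t ba = (a >> (8 * i)) & 0xffu;
        uint32_t bb = (b >> (8 * i)) & 0xffu;
        if (ba > bb)
            mask |= 1u << i;
    }
    return mask;
}

/* Portable model of _xpnd4() (hypothetical helper, not the intrinsic):
   bit i of the low nibble becomes byte i of the result, 0xff when the
   bit is set and 0x00 otherwise. */
uint32_t ref_xpnd4(uint32_t bits)
{
    uint32_t out = 0;
    for (int i = 0; i < 4; i++) {
        if (bits & (1u << i))
            out |= 0xffu << (8 * i);
    }
    return out;
}

/* Byte-wise unsigned maximum of two packed words: compare, expand the
   result bits into a byte mask, then select a or b per byte. */
uint32_t ref_maxu4_via_select(uint32_t a, uint32_t b)
{
    uint32_t sel = ref_xpnd4(ref_cmpgtu4(a, b));
    return (a & sel) | (b & ~sel);
}
```

With these models, ref_cmpgtu4(0x0000ffff, 0x0000aaaa) gives 0x03 and ref_xpnd4(0x07) gives 0x00ffffff, matching the values listed above. On the target, the same select would presumably be written with the real _cmpgtu4() and _xpnd4() intrinsics in place of the loops.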
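
The rounding observation for _mpyhir() in section 2 (0x1122 * 0x55667788 giving 0x0b6e4b18 instead of 0x0b6e4b17) can be reproduced with a short portable C model. This is a sketch under one stated assumption: that the instruction adds 2^14 to the product before shifting right by 15. The function names are hypothetical and nothing here calls the TI intrinsic.

```c
#include <stdint.h>

/* Floor division by 2^15, written so that it does not depend on how the
   compiler shifts negative values. */
static int64_t floor_shift15(int64_t v)
{
    return v >= 0 ? v / 32768 : -((-v + 32767) / 32768);
}

/* Sign-extend the upper 16 bits of a 32-bit word. */
static int64_t high_half(uint32_t w)
{
    int64_t h = (int64_t)(w >> 16);
    return (h & 0x8000) ? h - 0x10000 : h;
}

/* (src1[31:16] * src2) >> 15 with the fraction simply dropped. */
int32_t ref_mpyhi_truncated(uint32_t src1, int32_t src2)
{
    return (int32_t)floor_shift15(high_half(src1) * (int64_t)src2);
}

/* Same product with 2^14 added before the shift. The rounding constant
   is an assumption made for this sketch. */
int32_t ref_mpyhir_rounded(uint32_t src1, int32_t src2)
{
    return (int32_t)floor_shift15(high_half(src1) * (int64_t)src2 + 16384);
}
```

For src1 = 0x11220000 and src2 = 0x55667788, the truncated model returns 0x0b6e4b17 and the rounded model returns 0x0b6e4b18, which agrees with the simulation result quoted in the note.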