A Hybrid Programming Design Example of Motion Compensation Based on TMS320C62X
This post was last edited by fish001 on 2018-6-18 22:19

Motion compensation is an important algorithm in the MPEG-4 standard. It locates a reference block in a reference frame according to a motion vector. If the X and Y components of the motion vector are both integer-pixel lengths, the reference block is read directly from the reference frame; if a component is a half-pixel length, the reference block must be computed by interpolation. The resulting reference block is then added to the error block obtained by decoding to give the reconstructed current block.

This article presents a motion compensation method for the case where the X and Y components of the motion vector are both integer-pixel lengths, so the 8×8 reference block can be read directly from the reference frame. The C function for this case is as follows:

    void mc_case_a2(unsigned char *pSrc, short SrcOffset, short SrcWidth,
                    unsigned char *pDst, short RoundCtrl)
    {
        ……
        for (i=0; i<8; i++) {
            *(tmp_P_Dst+i) = *(tmp_P_Src+i);
            ……
        }
    }

The remainder of the motion vector parameter SrcOffset modulo 4 (4 bytes make one 32-bit word) may be 0, 1, 2, or 3. When the remainder is 0, the compiled code reads by word (LDW), which takes full advantage of the TMS320C62X and makes the program run efficiently. When the remainder is not 0, the data may be read by byte (LDB) or by halfword (LDH), which makes the program run less efficiently.

Both video encoding and decoding require motion compensation to reconstruct images. It is a time-consuming operation and sits at the core of the image-processing code, so an efficient implementation is needed. To make the code run faster and to exploit the hardware characteristics of the TMS320C62X, the goal is to use word-based loads and stores for motion compensation regardless of the motion vector.
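As a rough illustration of the integer-pixel case and of the role the remainder modulo 4 plays, the following C sketch copies an 8×8 reference block byte by byte and reports the remainder of the offset. The names copy_block_8x8 and offset_remainder, the destination stride of 8 bytes, and the assumption that the frame buffer itself starts on a word boundary are illustrative only; they are not taken from the original function, whose body is only partly shown.

```c
#include <assert.h>

/* Hypothetical helper: remainder of a non-negative byte offset modulo 4.
 * Assumes the frame buffer itself starts on a 4-byte boundary, so the
 * remainder also describes the alignment of the source address. */
static int offset_remainder(int src_offset)
{
    return src_offset & 3;
}

/* Hypothetical byte-wise copy of an 8x8 block.
 * src        : start of the reference frame
 * src_offset : byte offset of the block's top-left pixel
 * src_width  : width of the reference frame in bytes
 * dst        : 64-byte destination, rows stored back to back (assumed) */
static void copy_block_8x8(const unsigned char *src, int src_offset,
                           int src_width, unsigned char *dst)
{
    const unsigned char *row = src + src_offset;
    int x, y;

    assert(src_width >= 8);
    for (y = 0; y < 8; y++) {
        for (x = 0; x < 8; x++)
            dst[y * 8 + x] = row[x];   /* one byte per access */
        row += src_width;              /* advance to the next frame row */
    }
}
```

With a 16-pixel-wide frame, an offset of 2 * 16 + 5 = 37 selects the block whose top-left pixel is at row 2, column 5, and offset_remainder(37) returns 1, which is the misaligned case discussed in the text.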
Word-based access requires dividing the motion vector parameter by 4 and adjusting the pointer according to the remainder, so that the pointer always refers to a word-aligned address. (In the C program the current block is of char type and stored as bytes, and shifting can only be done byte by byte, so the C program cannot be optimized in the same way as the assembly program.) If the remainder after dividing the motion vector by 4 is 1, the 8 pixels to be fetched are aligned for word access as shown in Figure 1.

The core code that derives the word-aligned address and the shift counts from the motion vector parameter is:

    MVK .S2 0xFFFC,temp            ; mask that clears the two LSBs of the address
    ADD .L1X pSrc,offset,pSrc      ; address of the first element of the reference block
    AND .L2X pSrc,temp,tmp_pSrc    ; word-aligned address for access
    AND .S1 0x0003,pSrc,rshiftA    ; the two LSBs give the number of bytes to shift right
    SUB .L1 0x04,rshiftA,lshiftA   ; number of bytes to shift left
    MPY .M1 rshiftA,8,rshiftA      ; number of bits to shift right
    MPY .M1 lshiftA,8,lshiftA      ; number of bits to shift left

As a design example of mixed C and assembly programming, this function is optimized using parallel assembly. Only part of the assembly program is given here:

    .text                ; this assembly code is placed in the .text section; with #pragma CODE_SECTION in C, code can also be placed in a section you name yourself
    .global _mc_case_a   ; function name; declare it with .def or .global so that C code can call the function
    _mc_case_a:          ; label; the interface between the calling C function and the called assembly function
    ...
    .asg B10,ocsr
    .asg B11,r_w4
    STW .D2 ocsr,*stack--[1]   ; registers B10 to B15 and A10 to A15 used by the called function
    STW .D2 r_w4,*stack--[1]   ; must be saved
    MVC .S2 CSR,ocsr
    AND .S2 -2,ocsr,ocsr
    MVC .S2 ocsr,CSR           ; clear the GIE bit to disable maskable interrupts
    ...
    loop:
    LDW .D2 *tmp_pSrc++[src_width1],r_w1   ; read the first word
    LDW .D1 *pSrc++[1],r_w2                ; read the second word
    LDW .D1 *pSrc++[src_width2],r_w3       ; read the third word
    SHRU .S2 r_w1,rshiftB,r_w1
    SHL .S1 r_w3,lshiftA,r_w3
    SHL .S2X r_w2,rshiftB,r_w4
    SHRU .S1 r_w2,rshiftA,r_w2
    OR .L2 r_w1,r_w4,r_w1
    OR .L1 r_w2,r_w3,r_w2      ; these steps perform the operation shown in Figure 1
    STW .D2 r_w1,*pDst++[2]
    STW .D1 r_w2,*tmpDst++[2]  ; store the two words
    B .S2 loop                 ; delayed branch to label loop to implement the loop
    ...
    LDW .D2T2 *++stack[1],r_w4
    LDW .D2T2 *++stack[1],ocsr ; restore the registers saved by the called function
    MVC .S2 ocsr,CSR           ; restore the interrupt environment
    B .S2 B3                   ; return to the calling function
    ...

The C program and the parallel assembly program for this algorithm were tested in TI CCS using the library function clock(). In pure C, the function takes about 33 instruction cycles when the remainder of the motion vector offset modulo 4 is 0, about 93 cycles when the remainder is 1, about 51 cycles when it is 2, and about 93 cycles when it is 3, for an average of about 67 cycles. Written in parallel assembly, the function takes a constant 33 instruction cycles. An execution time of 33 cycles essentially reaches the maximum optimization achievable for this function.
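The shift-and-merge step that the assembly loop applies to each row can be mirrored in portable C, which may help when checking the logic off-target. The sketch below is not the author's code: the names load_word_le, store_word_le and copy_row8_aligned are hypothetical, little-endian byte order is assumed, and the aligned LDW/STW accesses are emulated by assembling words from bytes. The case of remainder 0 is handled separately because a 32-bit shift by 32 is undefined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for an aligned LDW: build a 32-bit little-endian word. */
static uint32_t load_word_le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Stand-in for an aligned STW: write a 32-bit little-endian word. */
static void store_word_le(unsigned char *p, uint32_t w)
{
    p[0] = (unsigned char)(w & 0xFF);
    p[1] = (unsigned char)((w >> 8) & 0xFF);
    p[2] = (unsigned char)((w >> 16) & 0xFF);
    p[3] = (unsigned char)((w >> 24) & 0xFF);
}

/* Copy 8 pixels starting at base + offset using only word-sized reads
 * from word-aligned positions. base is assumed to be word-aligned, and
 * 12 bytes from the aligned position must be readable when the
 * remainder is not 0. */
static void copy_row8_aligned(const unsigned char *base, int offset,
                              unsigned char *dst)
{
    int rem = offset & 3;                       /* two LSBs of the offset */
    const unsigned char *aligned = base + (offset - rem);
    uint32_t w1 = load_word_le(aligned);        /* first word  */
    uint32_t w2 = load_word_le(aligned + 4);    /* second word */
    uint32_t out1, out2;

    assert(offset >= 0);
    if (rem == 0) {
        out1 = w1;                              /* already aligned */
        out2 = w2;
    } else {
        int rshift = 8 * rem;                   /* bits to shift right */
        int lshift = 8 * (4 - rem);             /* bits to shift left  */
        uint32_t w3 = load_word_le(aligned + 8);/* third word  */
        out1 = (w1 >> rshift) | (w2 << lshift);
        out2 = (w2 >> rshift) | (w3 << lshift);
    }
    store_word_le(dst, out1);
    store_word_le(dst + 4, out2);
}
```

For a remainder of 1, rshift is 8 and lshift is 24, so the first output word takes bytes 1 to 3 of the first word and byte 0 of the second, which matches the alignment operation the text attributes to Figure 1.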