Application techniques for parallel processing instructions in the ADSP2106x - EEWORLD

Abstract: The ADSP2106x is a high-speed, high-performance DSP chip from Analog Devices (AD). It contains multiple internal processing units that operate in parallel and offers a rich set of parallel instructions. This article introduces techniques for applying the parallel operation instructions of this DSP chip.

Keywords: DSP; parallel processing instructions; ADSP2106x

Introduction

    Processor chips can be classified by their instruction sets and the way these are implemented into complex instruction set computers (CISC) and reduced instruction set computers (RISC). The former aims for powerful individual instructions in order to simplify programming; the latter emphasizes simple instructions in order to improve hardware efficiency. Because RISC offers uniform instruction length, single-cycle execution, and easy parallelization and pipelining, the vast majority of DSP chips use RISC. Computer systems can also be classified by memory structure and bus connection into von Neumann and Harvard structures. In the former, data and programs share one storage space and one memory bus; in the latter, data and programs have separate storage spaces and separate access buses. Since the Harvard structure allows an instruction fetch and a data fetch to proceed in parallel during instruction execution, it achieves higher execution efficiency, so most DSP chips use the Harvard structure.

    The ADSP2106x is a DSP chip that uses a Super Harvard structure and RISC. Its powerful floating-point and fixed-point arithmetic and its large on-chip memory make it suitable for demanding real-time signal processing tasks; its rich external interfaces and 10 DMA channels keep data flowing in and out without bottlenecks [1]; and with the on-chip arbitration logic, six ADSP2106x chips and a host can easily be connected together to form a parallel processing system. Powerful signal processing systems can be built with the ADSP2106x [2].

    Although the ADSP2106x hardware offers excellent performance, that performance cannot be reached without matching software. For example, the peak speed of the ADSP2106x is 120 MFLOPS at a 40 MHz clock, which means that one multiplication, one addition and one subtraction are completed in every clock cycle; these three parallel operations are only achieved if the instructions are arranged properly. Likewise, the Super Harvard structure inside the chip allows two data items to be accessed at the same time under certain conditions, but parallel access instructions only take effect if the data is placed properly in data memory and program memory.

    This article introduces some techniques for using the parallel instructions of the ADSP2106x, focusing on parallel operation instructions and data access instructions. Applying these techniques improves program efficiency, makes full use of the hardware, and gives a deeper understanding of the internal structure of the ADSP2106x.

The arithmetic processing units in the ADSP2106x

    The processor core of the ADSP2106x contains three arithmetic units: the ALU, the multiplier and the shifter. Their connections to the register file are shown in Figure 1 [3]. The functions of the three arithmetic units are as follows:

    (1) ALU: fixed-point and floating-point addition, subtraction and averaging; logical operations; absolute value, maximum, minimum, clipping and comparison; conversion between fixed-point and floating-point formats.

    (2) Multiplier: floating-point multiplication; fixed-point multiplication and multiplication-accumulation.

    (3) Shifter: shift operation; bit operation; bit field extraction and storage.

    The data paths of the three arithmetic units in the ADSP2106x are connected only to the register file; the units cannot fetch operands directly from memory. This is a typical RISC structure, and it usually requires a large number of registers to hold and exchange intermediate results. The ADSP2106x has 32 such registers, 16 working in the foreground and 16 in the background. The multiplier takes its operands from two registers and stores the result in another register; the shifter takes data from three registers and stores one result in another register; the ALU takes its operands from two registers and can store two results (for example a sum and a difference) in two other registers.

    In one clock cycle the registers can exchange one data item with data memory and one with program memory. This is the advantage of the Super Harvard structure of the ADSP2106x. In addition, the operations of the arithmetic units can run in parallel with register accesses to memory, but in a program this parallelism usually has to be organized as a pipeline.

    The multiplier and the ALU of the ADSP2106x can also operate in parallel: in one clock cycle the multiplier can complete one multiplication while the ALU completes one addition and one subtraction. This is what gives the ADSP2106x its peak speed of 120 MFLOPS at a 40 MHz clock.

    The following sections discuss what to watch for when programming these parallel operations.

The basic format of parallel operation instructions

    The basic format of parallel operation instructions in the ADSP2106x is shown in Figure 2. Floating-point operations are used as the example here. For fixed-point operations, simply replace every "F" prefix with "R".

    The operations in a parallel instruction must appear in the order shown in the figure and be separated by commas; otherwise an error is reported when the program is assembled.

    The address generation registers used for DM accesses are numbered 0 to 7, and those used for PM accesses are numbered 8 to 15; that is, I0 ≤ Ia ≤ I7 and M0 ≤ Mb ≤ M7, while I8 ≤ Ic ≤ I15 and M8 ≤ Md ≤ M15. It is also very important that the address DM(Ia,Mb) actually points to lies in data memory (for the ADSP21060 with 32-bit data words, the data memory address range is 0x30000~0x3FFFF) and that the address PM(Ic,Md) actually points to lies in program memory. Otherwise the program still assembles, but the accesses do not run in parallel at run time and errors are likely. Such errors are somewhat unpredictable, hard to find during debugging, and potentially very harmful.
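    The two conditions above (register numbering and target address) can be written down as a simple check. The following Python sketch is only an illustration: the names `bus_of` and `check_dm_access` are hypothetical, and the address bounds are the ADSP21060 data memory range quoted in the text.

```python
# Data memory range quoted in the text for the ADSP21060 (32-bit data words).
DM_FIRST, DM_LAST = 0x30000, 0x3FFFF


def bus_of(reg_number):
    """Return the bus served by address register number 0..15 (I or M)."""
    if 0 <= reg_number <= 7:
        return "DM"
    if 8 <= reg_number <= 15:
        return "PM"
    raise ValueError(f"no such address register: {reg_number}")


def check_dm_access(i_reg, m_reg, address):
    """List the problems of an access written as DM(I<i_reg>, M<m_reg>)."""
    problems = []
    if bus_of(i_reg) != "DM":
        problems.append(f"I{i_reg} is not in I0..I7")
    if bus_of(m_reg) != "DM":
        problems.append(f"M{m_reg} is not in M0..M7")
    if not DM_FIRST <= address <= DM_LAST:
        problems.append(f"address {address:#x} is outside data memory")
    return problems


if __name__ == "__main__":
    print(check_dm_access(0, 0, 0x30010))   # no problems expected
    print(check_dm_access(8, 0, 0x20000))   # wrong register and wrong address
```

    A check of this kind is useful because, as noted above, a misplaced buffer is not reported when the program is assembled.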

    When the multiplier and the ALU operate in parallel, there are strict rules on which registers supply their operands. The 16 registers are divided into four groups: F0~F3 form the first group, F4~F7 the second, F8~F11 the third, and F12~F15 the fourth. In a parallel multiplier/ALU instruction, the two operands of the multiplier must come from the first and second groups respectively, and the two operands of the ALU must come from the third and fourth groups respectively.
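    The grouping rule can be checked mechanically. The Python sketch below is a hypothetical helper (the names `group_of` and `multifunction_operands_ok` are not part of any tool) that takes register numbers 0..15 and tests the rule exactly as stated above.

```python
def group_of(reg_number):
    """Map register number 0..15 to its group: 0 for F0-F3, ..., 3 for F12-F15."""
    if not 0 <= reg_number <= 15:
        raise ValueError(f"no such register: F{reg_number}")
    return reg_number // 4


def multifunction_operands_ok(mul_x, mul_y, alu_x, alu_y):
    """True if the operands obey the rule for a parallel multiplier/ALU instruction.

    The multiplier operands must come from groups 1 and 2 (F0-F3, F4-F7),
    the ALU operands from groups 3 and 4 (F8-F11, F12-F15), in that order.
    """
    return (group_of(mul_x), group_of(mul_y)) == (0, 1) and \
           (group_of(alu_x), group_of(alu_y)) == (2, 3)


if __name__ == "__main__":
    # f12=f0*f4, f8=f8+f12 : multiplier reads F0 and F4, ALU reads F8 and F12
    print(multifunction_operands_ok(0, 4, 8, 12))
    # f12=f0*f1, f8=f8+f12 : both multiplier operands come from the first group
    print(multifunction_operands_ok(0, 1, 8, 12))
```

    Only the source operands are constrained in this sketch; the text places no such restriction on the result registers.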

    In the parallel operations above, any register may be both read and written in the same instruction. Execution follows a read-first, write-second rule: operands are read from the registers in the first half of the clock cycle, and results are written back to the registers in the second half of the same cycle. A solid grasp of this rule helps when designing pipeline steps in a program.
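    The read-first, write-second rule can be modeled in a few lines of Python. This is a behavioral sketch only, not a simulator of the chip: the function `execute_parallel` and the register dictionary are assumptions made for illustration. Every operation of one instruction reads from a snapshot of the registers taken before any result is written.

```python
def execute_parallel(regs, operations):
    """Execute one parallel instruction on the register dictionary `regs`.

    `operations` is a list of (destination, function) pairs. Each function
    receives the register values as they were at the start of the cycle.
    """
    snapshot = dict(regs)                      # first half-cycle: read
    results = {dest: fn(snapshot) for dest, fn in operations}
    regs.update(results)                       # second half-cycle: write
    return regs


if __name__ == "__main__":
    regs = {"f0": 2.0, "f4": 3.0, "f8": 10.0, "f12": 1.0}
    # Model of: f12=f0*f4, f8=f8+f12;
    execute_parallel(regs, [
        ("f12", lambda r: r["f0"] * r["f4"]),
        ("f8", lambda r: r["f8"] + r["f12"]),  # uses the OLD value of f12
    ])
    print(regs["f12"], regs["f8"])
```

    In this model the addition sees the value f12 held before the instruction started, which is what allows a multiplication and the accumulation of the previous product to share one instruction.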

Pipeline Steps in Parallel Processing

    When programming with ADSP2106x parallel instructions, data moving between memory and the arithmetic units must pass through registers, so the program has to be organized in pipeline steps. The following example serves as a general template for these pipeline steps.

    Assume that memory holds N data items xn, 0≤n≤N-1. Applying some operation to each of them yields N results yn, 0≤n≤N-1, which are written back to memory. Without parallel processing, the steps are as follows:

    For n=0 to N-1

    Fx ← Memory (xn);

    Fy = Operation (Fx);

    Fy → Memory (yn);

    End

    The processing above takes 3*N clock cycles in total (ignoring loop initialization). If we instead use parallel processing and pipeline the flow of data along the path memory → register → operation unit → register → memory, the steps become:

    Fx ← memory (x0);

    Fy = operation (Fx), Fx ← memory (x1 ); /*Preparation operation for entering the loop*/

    For n=2 to N-1

    Fy = Operation (Fx), Fx ← Memory (xn), Fy → Memory (yn-2); /*Parallel processing of the loop body; the Fy stored here is the result for xn-2*/

    End

    Fy = Operation (Fx), Fy → Memory (yN-2);

    Fy → Memory (yN-1); /*Write-back operations after exiting the loop*/

    The total processing time is shortened to N+2 clock cycles. To make the parallel instruction in the loop body possible, the data prefetch must be completed before the loop is entered, and the write-back of the remaining results must be completed after the loop exits. In addition, xn and yn must be placed in different memories, one in program memory and the other in data memory.
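    The cycle counts 3*N and N+2 can be checked with a small Python model. This is a sketch under simplifying assumptions: every step or parallel instruction costs exactly one cycle, loop overhead is ignored, and the function names `sequential` and `pipelined` are hypothetical. Tuple assignment is used so that all right-hand sides are read before anything is written, mirroring the read-first, write-second rule.

```python
def sequential(xs, op):
    """Load, compute and store one element at a time: 3 cycles per element."""
    ys, cycles = [], 0
    for x in xs:
        fx = x;          cycles += 1   # Fx <- Memory(xn)
        fy = op(fx);     cycles += 1   # Fy = Operation(Fx)
        ys.append(fy);   cycles += 1   # Fy -> Memory(yn)
    return ys, cycles


def pipelined(xs, op):
    """Overlap load, compute and store: N + 2 cycles for N >= 2 elements."""
    n = len(xs)
    if n < 2:
        raise ValueError("the pipelined form needs at least two elements")
    ys, cycles = [None] * n, 0
    fx = xs[0];                 cycles += 1      # prefetch x0
    fy, fx = op(fx), xs[1];     cycles += 1      # compute y0, prefetch x1
    for k in range(2, n):
        # One parallel instruction: store the old Fy, compute, fetch.
        ys[k - 2], fy, fx = fy, op(fx), xs[k]
        cycles += 1
    ys[n - 2], fy = fy, op(fx); cycles += 1      # drain: store y(N-2)
    ys[n - 1] = fy;             cycles += 1      # drain: store y(N-1)
    return ys, cycles


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]
    print(sequential(data, lambda v: v * v))
    print(pipelined(data, lambda v: v * v))
```

    In the loop body of this model, the value stored in iteration n is the result for element n-2, because one cycle is spent fetching and one computing before a result is available.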

    Although this example is simple, it basically expresses the common pipeline steps in parallel processing. In the following section, we will use two specific examples to illustrate the application of parallel processing instructions.

Two examples

    Example 1 computes the inner product of two arrays. Let xn and yn, 0≤n≤N-1, be two arrays; their inner product is defined as $\sum_{n=0}^{N-1} x_n y_n$. To fetch xn and yn in the same cycle, xn is placed in data memory and yn in program memory. After the computation, the inner product is in register f8.
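    Before reading the assembly listing for this example, it may help to see its schedule modeled in Python. The sketch below is an assumption-laden model, not the author's code: variable names mirror the registers f0, f4, f8 and f12, the index `i` stands for the two post-incremented pointers, and the loop count of N-2 is chosen in this sketch so that the last fetch inside the loop reads the final array elements and nothing beyond them. Tuple assignment reads all right-hand sides before writing, like the read-first, write-second rule.

```python
def inner_product_pipelined(x, y):
    """Inner product computed with a prefetch / loop / drain schedule."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two arrays of equal length, at least 3 elements")
    f8 = 0.0                                   # result register
    i = 0                                      # position of both pointers
    f0, f4 = x[i], y[i]; i += 1                # prefetch x0, y0
    f12, f0, f4 = f0 * f4, x[i], y[i]; i += 1  # first product, fetch x1, y1
    for _ in range(n - 2):
        # Multiply, accumulate the previous product, and fetch, all at once.
        f12, f8, f0, f4 = f0 * f4, f8 + f12, x[i], y[i]
        i += 1
    f12, f8 = f0 * f4, f8 + f12                # last product
    f8 = f8 + f12                              # last accumulation
    return f8


if __name__ == "__main__":
    print(inner_product_pipelined([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

    In this model the accumulation always lags the multiplication by one step, which is why two extra steps are needed after the loop.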

    f8=0; /* Clear the result register */

    i0=x; m0=1; /* i0 points to the first element of array xn (data memory) */

    i8=y; m8=1; /* i8 points to the first element of array yn (program memory) */

    f0=dm(i0,m0), f4=pm(i8,m8); /* Prefetch x0 and y0 */

    f12=f0*f4, f0=dm(i0,m0), f4=pm(i8,m8); /* Preparation before entering the loop */

    lcntr=N-2, do loop until lce; /* N-2 iterations, so that the last fetch in the loop reads xN-1 and yN-1 */

    loop: f12=f0*f4, f8=f8+f12, f0=dm(i0,m0), f4=pm(i8,m8); /* Parallel processing in the loop body */

    f12=f0*f4, f8=f8+f12;

    f8=f8+f12; /* Remaining operations after the loop exits; the inner product is in f8 */

    The second example is complex multiplication. One complex multiplication requires four real multiplications, one real addition and one real subtraction. Since the multiplier completes one multiplication per cycle, at least four instructions are needed for one complex multiplication. Within these four instructions, four operands must be read and two results written back. Assume two complex arrays xn=xrn+j*xin and yn=yrn+j*yin, 0≤n≤N-1; multiplying them element by element gives zn=zrn+j*zin=(xrn*yrn-xin*yin)+j*(xrn*yin+xin*yrn), 0≤n≤N-1. xr, yr and zi are placed in data memory; xi, yi and zr are placed in program memory. The program is as follows:

    i0=xr; i1=yr; i3=zi; i8=xi; i9=yi; i10=zr; m0=1; m8=1; /* Initialize the address generation registers */

    f0=dm(i0,m0), f4=pm(i9,m8); /* f0=xr0, f4=yi0 */

    f5=dm(i1,m0), f1=pm(i8,m8); /* f5=yr0, f1=xi0 */

    f8=f0*f5; /* f8=xr0*yr0 */

    f12=f1*f4; /* f12=xi0*yi0 */

    f9=f0*f4, f2=f8-f12, f0=dm(i0,m0), f4=pm(i9,m8); /* f9=xr0*yi0, zr0=f2=xr0*yr0-xi0*yi0; f0=xr1, f4=yi1 */

    f13=f1*f5, f5=dm(i1,m0), f1=pm(i8,m8); /* f13=xi0*yr0, f5=yr1, f1=xi1 */

    lcntr=N-2, do CMTI until lce;

    f8=f0*f5, f3=f9+f13; /* f8=xr*yr for the current element; zi=f3=xr*yi+xi*yr for the previous element */

    f12=f1*f4, dm(i3,m0)=f3, pm(i10,m8)=f2; /* f12=xi*yi; save zi and zr of the previous element */

    f9=f0*f4, f2=f8-f12, f0=dm(i0,m0), f4=pm(i9,m8); /* f9=xr*yi, zr=f2=xr*yr-xi*yi; f0=next xr, f4=next yi */

    CMTI: f13=f1*f5, f5=dm(i1,m0), f1=pm(i8,m8); /* f13=xi*yr; f5=next yr, f1=next xi */

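    f8=f0*f5, f3=f9+f13; /* After the loop: f8=xr*yr for the last element, f3=zi for the element before it */

    f12=f1*f4, dm(i3,m0)=f3, pm(i10,m8)=f2; /* f12=xi*yi; save zi and zr of the element before the last */

    f9=f0*f4, f2=f8-f12; /* f9=xr*yi, f2=zr of the last element */

    f13=f1*f5; /* f13=xi*yr */

    f3=f9+f13; /* f3=zi of the last element */

    dm(i3,m0)=f3, pm(i10,m8)=f2; /* Save zi and zr of the last element */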
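    The schedule of the second example can be replayed in Python to confirm that each stored value belongs to the intended element. This is a model written for illustration, not a translation tool: variable names mirror the registers, the indices `a` and `b` stand for the post-incremented pointer pairs (i0, i9) and (i1, i8), and tuple assignment reproduces the read-first, write-second rule. The function name `complex_multiply_pipelined` is hypothetical.

```python
def complex_multiply_pipelined(xr, xi, yr, yi):
    """Element-wise complex product using the prelude / loop / drain schedule."""
    n = len(xr)
    if n < 3 or not (len(xi) == len(yr) == len(yi) == n):
        raise ValueError("need four arrays of equal length, at least 3 elements")
    zr, zi = [], []
    a = b = 0
    f0, f4 = xr[a], yi[a]; a += 1                       # xr0, yi0
    f5, f1 = yr[b], xi[b]; b += 1                       # yr0, xi0
    f8 = f0 * f5                                        # xr0*yr0
    f12 = f1 * f4                                       # xi0*yi0
    f9, f2, f0, f4 = f0 * f4, f8 - f12, xr[a], yi[a]; a += 1
    f13, f5, f1 = f1 * f5, yr[b], xi[b]; b += 1
    for _ in range(n - 2):
        f8, f3 = f0 * f5, f9 + f13                      # f3: zi of previous element
        f12 = f1 * f4; zi.append(f3); zr.append(f2)     # store previous element
        f9, f2, f0, f4 = f0 * f4, f8 - f12, xr[a], yi[a]; a += 1
        f13, f5, f1 = f1 * f5, yr[b], xi[b]; b += 1
    f8, f3 = f0 * f5, f9 + f13
    f12 = f1 * f4; zi.append(f3); zr.append(f2)         # store element N-2
    f9, f2 = f0 * f4, f8 - f12
    f13 = f1 * f5
    f3 = f9 + f13
    zi.append(f3); zr.append(f2)                        # store element N-1
    return zr, zi


if __name__ == "__main__":
    # x = 1+2j, 3+4j, 5+6j and y = 7+8j, 9+10j, 11+12j
    print(complex_multiply_pipelined([1.0, 3.0, 5.0], [2.0, 4.0, 6.0],
                                     [7.0, 9.0, 11.0], [8.0, 10.0, 12.0]))
```

    In this model each of the N-2 loop iterations performs four multiplications, which matches the minimum of four instructions per complex multiplication stated above.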

Conclusion

    This article analyzes the structure of the arithmetic processing units inside the ADSP2106x chip and, on this basis, summarizes the general format of parallel processing instructions and the pipeline steps used in practice. Finally, two typical examples are given: the inner product of two arrays and the multiplication of two complex arrays.