Design of a high-performance 32-bit shift register unit

1 Introduction

As CPU word widths and performance continue to increase, so do the demands on the dedicated hardware shift register in the CPU execution unit. The performance of the shift register directly determines how quickly the CPU can execute shift instructions. In traditional CPU structures, the shift register generally adopts either a matrix structure or a tree structure. Once the CPU reaches 32 bits and a clock speed above 100 MHz, these earlier design methods struggle to shift 32-bit data by an arbitrary amount (up to 32 bits) within one instruction cycle. A behavioral description of a 32-bit barrel shifter has been published, but it applies only to RISC instruction sets. Moreover, as dedicated hardware inside a CPU, the circuit is usually fully customized to obtain the best power consumption, speed and area.

This paper presents a shift register circuit suitable for CPU execution units of 32 bits and wider, optimized for the CISC instruction set of the INTEL X86 (shift instructions in RISC instruction sets are relatively simple to implement and are not discussed here). By using instruction preprocessing and redundant bits, shifts through the carry flag CF and the setting of CF are implemented conveniently, and the average execution time of a shift instruction is two instruction cycles. The design effectively improves the CPU's performance on shift instructions and, as a basic core unit, can easily be ported to CPU designs with different instruction sets (RISC or CISC).

2 Overall structure of execution unit in 32-bit CPU

The execution part of the 32-bit CPU we designed adopts a dual-bus structure, with 32-bit data buses (Abus, Bbus). Because shift instructions would consume too many CPU cycles if implemented with the ALU, a dedicated hardware shift register is needed in the execution unit so that 32-bit data can be shifted by an arbitrary number of bits within one instruction cycle.

Figure 1 shows a simplified diagram of the data flow in the 32-bit CPU execution unit, with all control signals omitted. In the figure, Abus is a bidirectional 32-bit data bus and Bbus is a unidirectional 32-bit data bus. To implement all the shift instructions of the INTEL X86 series (RCR, RCL, ROR, ROL, etc.), the shift register is designed with dual inputs, so it actually operates on up to 64 bits of input data. The flag bit is set by means of instruction preprocessing and added redundant bits.

3 Design of shift register unit

3.1 Matrix shifter and tree shifter

The shift register unit in a CPU is generally designed as either a matrix-structure or a tree-structure shifter.

3.1.1 Matrix Style Shifter

The structure is an array of transmission gates. The number of rows equals the width of the operand, and the number of columns equals the maximum shift amount, as shown in Figure 2 (using 4 bits as an example).

Here A3~A0 are the 4-bit data input lines and sh3~sh0 are the 4 control signal lines. To shift by N bits, the corresponding line shN is driven high and all other control signals are low.

The advantages of this structure are: (1) fast data transmission, since each signal reaches the output after passing through only one gate level, regardless of the width of the shifter; (2) a very regular layout. The disadvantages are: (1) each control signal is heavily loaded; in a 32-bit shifter, for example, each signal line (sh0, sh1, ... sh31) must drive 32 switches; (2) too many transistors are required: an n-bit shifter needs 2 × n × n = 2n² transistors (with the transmission gates implemented in CMOS), which increases power consumption and chip area; (3) each shift operation requires exactly one control line to be 1, so an additional decoding unit is needed.
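The behaviour described above can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the circuit of Figure 2: the shift direction (right, with zero fill), the LSB-first bit list and all function names are chosen here for illustration only.

```python
def decode_one_hot(amount, n):
    """The extra decoder the matrix shifter needs: exactly one of sh0..sh(n-1) is 1."""
    return [1 if k == amount else 0 for k in range(n)]


def matrix_shift_right(data_bits, sh):
    """Model an n x n array of pass gates; data_bits[0] is A0 (LSB first, an assumption)."""
    n = len(data_bits)
    assert len(sh) == n and sum(sh) == 1, "control lines must be one-hot"
    out = [0] * n
    for k, enabled in enumerate(sh):       # column k shifts by k bits
        if not enabled:
            continue
        for j in range(n):                 # line sh_k controls one gate per output bit
            if j + k < n:
                out[j] = data_bits[j + k]  # a single gate between input and output
    return out


def matrix_transistor_count(n):
    """n * n CMOS transmission gates, 2 transistors each."""
    return 2 * n * n


bits = [1, 1, 0, 1]                                        # A3..A0 = 1011, listed LSB first
shifted = matrix_shift_right(bits, decode_one_hot(1, 4))   # [1, 0, 1, 0], i.e. 0101
```

For n = 32 the formula gives 2 × 32 × 32 = 2048 transistors, which is the cost the text refers to.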

3.1.2 Tree Style Shifter

The number of stages required by this structure is log2 M for an M-bit shifter. Each stage is controlled by a pair of complementary signal lines (shn and shn#). At the i-th stage, the data is either shifted by 2^i bits or passed through unshifted. The tree style shifter is shown in Figure 3.

The advantages of this structure are: (1) a small number of transistors: an n-bit shifter needs 2 × n × log2 n transistors (with the transmission gates implemented in CMOS), and the layout area is smaller than that of a matrix shifter; (2) the control signals shN~sh0 are already a binary representation of the shift amount, so no additional decoding unit is needed. The disadvantage is that the data path passes through too many switches: an M-bit shifter needs log2 M stages in series, which leads to a large delay.
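A comparable sketch of the tree (logarithmic) shifter makes the stage-by-stage behaviour concrete. As before, the right-shift direction, the integer representation and the function names are assumptions made for illustration; only the stage rule and the transistor formula come from the text.

```python
def tree_shift_right(value, amount, width):
    """Logarithmic shifter: stage i shifts by 2**i bits when bit i of amount is set."""
    stages = width.bit_length() - 1        # log2(width); width is assumed a power of two
    assert width == 1 << stages and 0 <= amount < width
    value &= (1 << width) - 1
    for i in range(stages):                # the data passes through every stage in series
        if (amount >> i) & 1:              # sh_i selects "shift", its complement "pass"
            value >>= 1 << i
    return value


def tree_transistor_count(n):
    """The text's formula: 2 * n * log2(n), for n a power of two."""
    return 2 * n * (n.bit_length() - 1)


result = tree_shift_right(0xF0, 5, 8)      # stages 0 and 2 shift: 0xF0 >> 1 >> 4 = 0x07
```

For n = 32 the formula gives 2 × 32 × 5 = 320 transistors, against 2048 for the matrix structure, at the price of five gate levels in series instead of one.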

3.2 Matrix-tree structure shifter

From the above analysis, if the processor being designed is 16 bits or narrower, either of the above schemes can meet the requirements. When the data width is 32 bits or more, however, the inherent disadvantages of each scheme become prominent in power consumption, speed and layout area. In this design the shift register is fed by both buses, so its actual input is 64 bits. To combine the advantages of the matrix structure (high speed, regular layout) with those of the tree structure (few transistors, simple decoding), we use a matrix-tree structure. Table 1 lists the number of transistors required by matrix-tree structures with different level splits (n1 is the number of tree levels, n2 is the number of matrix control lines, and n3 is the number of transistors used in the matrix). After comprehensive consideration, we chose the split in the second row of the table: the matrix part has 8 control lines, that is, 8 selectable shift positions, and the tree part provides 4 shift positions.

Accordingly, the front stage of the shift register is the matrix part (64-bit data input, 8-bit control signal), which produces 36 bits of data and passes them to the tree part (36-bit data input, 2-bit control signal). The tree part completes the remaining shift of up to 3 bits and forms the 32-bit output data. The structure is shown in Figure 4.

In this structure, the tree shifter in the back stage can shift by at most 3 bits. Its 2-bit control input is binary coded and is taken directly from the lowest two bits of the shift counter sh4~sh0 (introduced in the next section). The matrix in the front stage takes the 64-bit input and produces the 36-bit output. We assume that the 64-bit input is supplied by Abus and Bbus, as shown in Figure 5, where each small cell represents 4 bits of data. Once the 64-bit data enters the matrix shifter, the upper three bits of the counter, sh4~sh2, are decoded into 8 control lines, exactly one of which is high, selecting a shift of 4, 8, 12, 16, 20, 24, 28 or 32 bits. The 36-bit output is then sent to the tree shifter, which completes the shift of the remaining bits. The format of the 36-bit output is shown in Figure 6, where COUNT is the total shift amount.
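The two-stage data path can be modelled behaviourally as follows. This is a sketch under several assumptions, because Figures 5 and 6 are not reproduced here: Abus is taken as the upper half of the 64-bit input, the shift is treated as a left funnel shift, the coarse amounts are taken as 0 to 28 in steps of 4 (the figure may use a different offset convention), and the 36-bit intermediate width is read as 32 result bits plus 3 bits of slack for the fine shift plus 1 redundant bit for the carry. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def matrix_tree_shift(abus, bbus, count):
    """Left funnel shift of the 64-bit value abus:bbus by count (0..31).

    Returns (result32, redundant_bit). Bit positions are assumptions of this sketch.
    """
    assert 0 <= count <= 31
    x = ((abus & MASK32) << 32) | (bbus & MASK32)   # assumed order: Abus is the upper half
    coarse = count & ~0b11        # multiple of 4, selected by the decoded sh4~sh2
    fine = count & 0b11           # 0..3, taken directly from sh1, sh0

    # Matrix stage: one gate level selects a 36-bit window of the 64-bit input.
    window36 = (x >> (29 - coarse)) & ((1 << 36) - 1)

    # Tree stage: fine alignment by 0..3 bits (two binary stages in hardware).
    out33 = (window36 >> (3 - fine)) & ((1 << 33) - 1)

    return out33 & MASK32, out33 >> 32


def rol32(value, count):
    """Rotate left: feed the same operand to both inputs of the funnel."""
    result, _ = matrix_tree_shift(value, value, count)
    return result
```

For example, `rol32(0x80000001, 1)` returns `0x00000003`, and the redundant top bit returned by `matrix_tree_shift` for the same inputs is 1, the bit that was rotated out of the top of the operand.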

4 Instruction preprocessing and implementation of shift instructions

The CPU we designed must be compatible with the shift instructions of INTEL's X86 series. The shift register unit, working with the surrounding decoding and latch units, therefore has to implement ROL, ROR, RCL, RCR, SHL, SHR and SAR within one instruction beat. Of these, RCL and RCR rotate through the carry flag CF (for the instruction descriptions, see reference [4]). For this reason, the processor control unit preprocesses each type of shift instruction before the shift is performed.

4.1 Overall structure of the shift register unit

The overall structure of the shift register unit is shown in Figure 7. Its core, the matrix-tree shift register, uses the structure described in the previous section. The counter (sh4~sh0) is loaded from Bbus before the shift and then decoded: the lower two bits (sh1, sh0) go directly to the tree part, and the upper three bits (sh4, sh3, sh2) are decoded into an 8-bit control signal for the matrix part. The Abus and Bbus input latches each latch a 32-bit data input and manipulate it as the particular instruction requires, which is how the instruction preprocessing is carried out. The shift result is sent to the ALU output latch, and the CF register is set.
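The split of the counter into the two control fields can be illustrated with a short sketch. The function name and the mapping of decoded values to individual matrix lines (line k high for the value k of sh4~sh2) are assumptions; the text only states that three bits are decoded into eight lines and two bits pass through unchanged.

```python
def decode_shift_count(count):
    """Split the 5-bit counter sh4~sh0 into tree and matrix control signals."""
    assert 0 <= count <= 31
    tree_ctrl = count & 0b11          # sh1, sh0 go to the tree part unchanged
    group = (count >> 2) & 0b111      # sh4, sh3, sh2
    matrix_ctrl = 1 << group          # 3-to-8 decode: exactly one of 8 lines is high
    return tree_ctrl, matrix_ctrl


tree_ctrl, matrix_ctrl = decode_shift_count(13)   # 13 = 0b01101 -> (0b01, 0b00001000)
```

Only the three upper bits need a decoder, so the decoding logic is an 8-output block rather than the 32-output block a pure matrix shifter would require.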

4.2 Instruction preprocessing

Shifts through the carry flag CF must be supported, and CF must be set after the shift. Ordinarily this would require the CPU control unit to provide a multi-cycle instruction beat. In this design, the Abus and Bbus input latches can, depending on the instruction, clear their contents to 0 or shift CF into the data from the left or the right by one bit. This prepares the data so that a shift of 0 to 32 bits can be completed within one instruction cycle. The settings for the different instructions are shown in Figure 8. In the figure, CF is the carry flag; len is the operand length (for example, 32 bits); n is the shift count; DATA means the input latch outputs the operand itself; 0 means the input latch outputs 0; CF:DATA(-1) means the input latch outputs the operand shifted right by one bit with CF shifted in; DATA(-1):CF means the input latch outputs the operand shifted left by one bit with CF shifted in; SIGN_EXT means the input latch outputs the sign extension of the operand. The horizontal line shows the format of the data in the Abus and Bbus latches after preprocessing and before the shift; the area above the line shows the output data and the position of the carry CF after the shift.

Example: RCL AX, CL
Let AX = 0001H, CL = 3, CF = 1.
The Abus latch outputs the operand itself, 0001H.
The Bbus latch outputs the operand shifted right by one bit with CF shifted in at the top, that is, 8000H.
After the shift the result is AX = 000CH, and the new CF = 0 appears at the leftmost end of the output result.
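The example can be checked with a small model. The first function below follows the standard bit-by-bit definition of RCL; the second is a sketch of how the preprocessed latch contents could yield the same result in a single funnel shift. The latch order (Abus latch above Bbus latch), the placement of the redundant carry bit and both function names are assumptions of this sketch, not details confirmed by Figure 8.

```python
def rcl_reference(data, cf, n, length):
    """RCL semantics: rotate the (length + 1)-bit value CF:data left, one bit at a time."""
    mask = (1 << length) - 1
    for _ in range(n):
        new_cf = (data >> (length - 1)) & 1
        data = ((data << 1) | cf) & mask
        cf = new_cf
    return data, cf


def rcl_preprocessed(data, cf, n, length):
    """The same result from one funnel shift over the preprocessed latches (0 <= n <= length)."""
    assert 0 <= n <= length
    mask = (1 << length) - 1
    a_latch = data & mask                                    # DATA
    b_latch = ((cf << (length - 1)) | (data >> 1)) & mask    # CF:DATA(-1)
    x = (a_latch << length) | b_latch                        # assumed order: A latch above B latch
    window = (x >> (length - n)) & ((1 << (length + 1)) - 1)  # result plus one redundant bit
    new_cf = (window >> length) if n else cf                 # a count of 0 leaves CF unchanged
    return window & mask, new_cf


ax, cf = rcl_preprocessed(0x0001, 1, 3, 16)                  # (0x000C, 0)
```

Because the carry already sits in the top bit of the second latch, no separate cycle is needed to merge it into the data, which is the point of the preprocessing step.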

5 Verification and conclusion

Verilog behavioral simulation and starsim timing simulation show that the performance fully meets the requirements. The standard execution time of shift instructions in the INTEL X86 instruction set is 4 to 7 machine cycles; in this design the average is 2 instruction cycles, which greatly improves the execution efficiency of shift instructions. As dedicated hardware in the CPU execution unit, the shift register directly affects the speed and efficiency with which the CPU processes shift instructions. The matrix-tree shift register presented in this paper, combined with instruction preprocessing, effectively implements 32-bit shift operations and is compatible with all the shift instructions of the INTEL X86 series. As general-purpose hardware, it can also be easily ported to CPU designs with other instruction sets.
