Design and implementation of high-speed pipeline floating-point multiplier based on FPGA

1 Introduction

As digital technology advances, the demands placed on microprocessor performance keep rising. Clock frequency, the principal measure of that performance, is closely tied to the number of cycles the multiplier needs to complete one multiplication, so improving microprocessor performance requires high-speed, high-precision multipliers. Floating-point arithmetic based on the IEEE754 standard offers a large dynamic range and high precision, and its operation rules are simpler than those of fixed-point arithmetic, so the design of floating-point units has received widespread attention. This paper presents the design of a 32-bit floating-point multiplier that combines the radix-4 Booth algorithm, an improved Booth encoder, and an improved 4:2 compressor with a pipelined architecture matched to the characteristics of the FPGA. The resulting design achieves high-speed floating-point multiplication and is stable, regular in structure, easy to implement on an FPGA, and easy to migrate to an ASIC through HardCopy.

2 Operation rules and system structure

2.1 Representation rules of floating point numbers

This design uses the single-precision IEEE754 format [2]. Assume that the two numbers A and B involved in the operation are both single-precision floating-point numbers, that is:
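In the standard single-precision layout, a normalized number has the value (-1)^S × 1.M × 2^(E-127), with a 1-bit sign S, an 8-bit biased exponent E, and a 23-bit fraction M. The Python sketch below is a software model only (the design itself is written in VHDL); the function names are illustrative assumptions, and it shows how the three fields of A or B can be separated and recombined.

```python
import struct

BIAS = 127  # exponent bias for IEEE 754 single precision


def unpack_single(x):
    """Round x to single precision and return (sign, exponent, fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased
    fraction = bits & 0x7FFFFF       # 23 bits, hidden leading 1 not stored
    return sign, exponent, fraction


def value_of(sign, exponent, fraction):
    """Value of a normalized number: (-1)^S * 1.M * 2^(E - 127)."""
    mantissa = 1 + fraction / 2**23  # restore the hidden bit
    return (-1) ** sign * mantissa * 2.0 ** (exponent - BIAS)
```

For example, -6.5 is -1.625 × 2^2, so its fields are sign 1, exponent 129, and fraction 0x500000.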

2.2 Hardware system structure of floating-point multiplier

This design is intended for a dedicated floating-point FFT processor, so it must operate at high speed. To ensure that the floating-point multiplier runs stably at clock frequencies up to 80 MHz, the design is pipelined. Pipelining raises the operating speed of a synchronous circuit and increases data throughput, and the internal structure of an FPGA suits pipelined designs well, since the added registers cost little or nothing extra. Based on the system partitioning, the design uses a five-stage pipeline. Figure 1 shows the hardware structure of the floating-point multiplier.
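The effect of pipelining on latency and throughput can be illustrated with a small Python model. This is a sketch under stated assumptions: it models only a chain of five registers, computes the product at entry for simplicity, and does not reflect the actual stage partitioning of Figure 1. The name `run_pipeline` is hypothetical.

```python
from collections import deque

STAGES = 5  # pipeline depth given in the text


def run_pipeline(operand_pairs, stages=STAGES):
    """Return the value on the output register after each clock edge."""
    regs = deque([None] * stages, maxlen=stages)
    outputs = []
    # Real operands first, then idle cycles so the pipeline drains.
    for pair in list(operand_pairs) + [None] * stages:
        # Real hardware spreads the multiplication over the stages;
        # here the result is computed at entry and merely delayed.
        result = None if pair is None else pair[0] * pair[1]
        regs.appendleft(result)   # the oldest value drops off the far end
        outputs.append(regs[-1])  # contents of the last register
    return outputs
```

In this model the first result appears after the fifth clock edge, and a new result follows on every edge after that.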

3 Main module design and simulation

3.1 Exponent processing module (E_Adder) design

The 32-bit floating-point format is defined in [2]. As described above, floating-point multiplication consists mainly of multiplying the two mantissas while adding the exponents and detecting overflow in parallel. In a 32-bit floating-point multiplier the exponent is 8 bits wide, so this design uses an 8-bit carry-lookahead adder with carry output to perform exponent addition, bias removal, and related operations. The process is as follows.

The E_Adder module is responsible for completing the summation operation of the exponent field in the floating-point multiplier operation, as shown in the following formula:

Here E[8] is the carry out of the most significant bit, and Bias = 127 is the exponent bias defined in the IEEE754 standard. The Normalization term accounts for normalization of the product: the exponent sum depends on the result of the mantissa multiplication. This design uses a selection scheme, so the exponent of the product is available almost immediately after the Normalization flag is generated, which keeps E_Adder off the critical path.
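One way to picture such a selection scheme is the Python sketch below. It assumes the exponent sum takes the usual form E = E_A + E_B - Bias + Normalization, which is consistent with the terms defined above but is an assumption here; the function names are hypothetical. Both candidate exponents are prepared before the mantissa product is known, and the Normalization flag only selects between them.

```python
BIAS = 127  # IEEE 754 single-precision exponent bias


def exponent_candidates(ea, eb):
    """Both possible product exponents, computed in parallel with the mantissa."""
    base = ea + eb - BIAS
    return base, base + 1  # without / with a normalization shift


def select_exponent(ea, eb, normalization):
    """Pick the candidate once the mantissa unit raises (or not) its flag."""
    e0, e1 = exponent_candidates(ea, eb)
    return e1 if normalization else e0
```

For instance, 1.5 × 1.5 = 2.25 has a mantissa product of at least 2, so Normalization is 1 and the biased exponent becomes 127 + 127 - 127 + 1 = 128.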

The design collects the three-level carry signals and combines them with the Normalization signal from the mantissa multiplication unit to normalize the result and to decide whether to output infinity, an infinitesimal value, or a normal value.
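The output decision can be sketched in Python as follows. This is a behavioral model under assumptions, not the carry-based logic of the design: it works on the full integer sum, treats any result below the smallest normal exponent as underflow, and returns exponent field 0 in that case. The name `classify_product` and the returned labels are illustrative.

```python
BIAS = 127
E_MAX = 254  # largest biased exponent of a normal single-precision number


def classify_product(ea, eb, normalization):
    """Return (kind, exponent_field) for biased operand exponents ea and eb."""
    total = ea + eb - BIAS + normalization
    if total > E_MAX:
        return "infinity", 0xFF   # overflow: exponent field all ones
    if total < 1:
        return "underflow", 0x00  # assumed: too small for a normal number
    return "normal", total
```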

The timing simulation of E_Adder shows that the design meets the application requirements.

3.2 Improved Booth Encoder Design

The delay of the whole multiplier is determined mainly by the number of partial products to be added, so reducing that number shortens the multiplier's delay. This design uses a Booth encoder to reduce the number of partial products to 13 and improves on the traditional encoding scheme. The encoding algorithm is shown in Table 1.
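For orientation, the standard radix-4 Booth recoding can be modeled in a few lines of Python. This sketch shows the textbook recoding only; the signal-level details of the improved encoder in Table 1 may differ, and the function name is a hypothetical choice. Each overlapping group of three multiplier bits becomes one digit in {-2, -1, 0, 1, 2}, so a 26-bit operand yields 13 digits and therefore 13 partial products.

```python
def booth_radix4_digits(multiplier, width=26):
    """Recode a two's-complement multiplier into width/2 radix-4 Booth digits."""
    assert width % 2 == 0
    bits = multiplier & ((1 << width) - 1)
    padded = bits << 1  # append the implicit bit y[-1] = 0
    digits = []
    for i in range(width // 2):
        triple = (padded >> (2 * i)) & 0b111  # (y[2i+1], y[2i], y[2i-1])
        y_hi = (triple >> 2) & 1
        y_mid = (triple >> 1) & 1
        y_lo = triple & 1
        digits.append(-2 * y_hi + y_mid + y_lo)  # least significant digit first
    return digits
```

The digits weighted by powers of 4 reproduce the original multiplier, which is a convenient check on any encoder implementation.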

Because an FPGA has abundant AND and OR gate resources, this method makes full use of the FPGA's internal resources, saving area and meeting the low-power requirement while maintaining speed and accuracy.

3.3 Partial product generation and compression structure design

3.3.1 Partial product generation structure

Given the output of the Booth encoder, partial products are generated according to the following formula [4]:

Here PPi is the partial product and Ai is the multiplicand. After extension with the hidden bit and the sign bits, the 26-bit extended mantissa yields 13 partial products. The mantissa arithmetic in the floating-point multiplier uses two's complement. PPi performs only the bit inversion, so when NEG=1 a 1 must be added at the lowest bit of the partial product. To increase parallelism, this add-one operation is carried out in the partial product compression structure. In addition, to add signed numbers correctly, the sign bit of each partial product must be extended; the result is shown in Figure 4. Of the 13 partial products, the first is 29 bits wide and the rest are extended to 32 bits. The first partial product carries a 3-bit sign extension "SSS", the 2nd to 13th partial products carry the sign extension "SS", and the add-one bits are "NN", according to the following formula:

Here i is the row index of the partial product, and sign(i) is the sign of the partial product in row i.
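The split between inversion and the deferred add-one can be checked with a behavioral Python model. It is a sketch under assumptions: it uses the textbook radix-4 recoding, relies on Python's unbounded integers in place of the explicit "SSS"/"SS" sign-extension bits, and works modulo 2^48. All names are illustrative.

```python
WIDTH = 26                  # extended mantissa width from the text
PROD_BITS = 48              # product width from the text
MASK = (1 << PROD_BITS) - 1


def booth_digits(y, width=WIDTH):
    """Textbook radix-4 Booth digits, least significant first."""
    padded = (y & ((1 << width) - 1)) << 1
    digits = []
    for i in range(width // 2):
        t = (padded >> (2 * i)) & 0b111
        digits.append(-2 * (t >> 2) + ((t >> 1) & 1) + (t & 1))
    return digits


def partial_products(a, b):
    """Return (row, neg_bit) pairs, each already shifted to its column weight."""
    rows = []
    for i, d in enumerate(booth_digits(b)):
        magnitude = a * abs(d)                   # 0, A or 2A
        neg = 1 if d < 0 else 0
        row = ~magnitude if neg else magnitude   # inversion only, no +1 yet
        # Python's ~ sign-extends indefinitely; the mask keeps 48 bits.
        rows.append(((row << (2 * i)) & MASK, neg << (2 * i)))
    return rows


def sum_rows(rows):
    """Add every row and its deferred NEG bit, as the compression tree would."""
    total = 0
    for row, neg_bit in rows:
        total += row + neg_bit
    return total & MASK
```

Adding the rows together with their NEG bits reproduces the product of two 24-bit mantissas, which confirms that deferring the add-one to the compression stage does not change the result.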

3.3.2 Partial Product Compression Structure

This design uses a mixture of 4:2 compressors, 3:2 compressors, full adders, and half adders to achieve fast compression of 13 partial products while ensuring accuracy. The division of the partial product compression structure in this paper is shown in Figure 2.

In Figure 2, the dotted line gives the compression division of the traditional partial product, while the solid line describes the division of the partial product compression structure used in this paper. Such a division is conducive to simplifying the second-level compression structure, thereby saving FPGA internal resources while ensuring speed. As can be seen from Figure 2, some bits do not need to be calculated because these bits are generated by the sign bit of the multiplier mantissa introduced during Booth encoding, and 48 bits are sufficient to express the calculation result.
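The idea of compressing many rows down to two before a final carry-propagate addition can be sketched in Python. This model is a simplification under assumptions: it uses only word-wide 3:2 compressors rather than the mixed 4:2, 3:2, full-adder, and half-adder partition of Figure 2, and the names are hypothetical.

```python
MASK = (1 << 48) - 1  # 48 bits are sufficient for the result, as stated above


def csa(a, b, c):
    """3:2 compressor applied bitwise to three rows: (sum_row, carry_row)."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1  # carries move one column left
    return s & MASK, carry & MASK


def compress(rows):
    """Reduce a list of rows to two rows with the same sum modulo 2**48."""
    rows = list(rows)
    while len(rows) > 2:
        a, b, c = rows[:3]
        rows = rows[3:] + list(csa(a, b, c))
    return rows
```

Each compressor step replaces three rows with two without propagating carries along the word, so only the final two-row addition needs a full carry chain.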

3.3.3 Improved 4:2 compressor

This design adopts the widely used 4:2 compressor and adapts it to the internal resources of the FPGA, as shown in Figure 3. The traditional 4:2 compressor consists of two cascaded full adders and requires four XOR gates and eight NAND gates. The improved 4:2 compressor requires four XOR gates and two multiplexers (MUX). Eight NAND gates take 36 transistors, whereas two MUXes take 20. Moreover, an FPGA integrates large numbers of XOR gates and multiplexers, so this approach also makes full use of the FPGA's resources.

Compressing the partial products requires a large number of 4:2 compressors, so the improved circuit reduces layout area to some extent, which also benefits the ASIC back-end design of the multiplier. In addition, all four inputs of the improved compressor have the same delay to the output S: three XOR gate delays.
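A common XOR/MUX formulation of the 4:2 compressor is modeled below in Python. The wiring is an assumption; it matches the gate count (four XOR gates, two multiplexers) and the three-XOR path to S described above, but the circuit of Figure 3 may differ in detail. The function and signal names are illustrative.

```python
from itertools import product


def mux(sel, when0, when1):
    """2:1 multiplexer."""
    return when1 if sel else when0


def compressor_4to2(a, b, c, d, cin):
    """Return (s, carry, cout) with a+b+c+d+cin == s + 2*(carry + cout)."""
    x1 = a ^ b               # XOR 1
    x2 = c ^ d               # XOR 2
    x3 = x1 ^ x2             # XOR 3
    s = x3 ^ cin             # XOR 4: every input reaches s through 3 XORs
    cout = mux(x1, a, c)     # carry of a+b+c, independent of cin
    carry = mux(x3, d, cin)  # carry of (a^b^c) + d + cin
    return s, carry, cout


def check_all_inputs():
    """Exhaustively verify the arithmetic identity over all 32 input patterns."""
    return all(
        a + b + c + d + cin == s + 2 * (carry + cout)
        for a, b, c, d, cin in product((0, 1), repeat=5)
        for s, carry, cout in [compressor_4to2(a, b, c, d, cin)]
    )
```

Because `cout` does not depend on `cin`, compressors in adjacent columns can be chained without creating a ripple path across the row.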

4 Implementation and simulation of the 32-bit floating-point multiplier

Figure 4 shows the FPGA timing simulation results of this design. Timing simulation was run in Quartus II 7.0 targeting the Cyclone-series EP1C6Q240C8, and functional simulation was run in ModelSim 6.0b. The entire design is described structurally in VHDL, and synthesis was optimized for area. The simulation shows that the floating-point multiplier operates stably at clock frequencies of 80 MHz and below. After an initial latency of 5 cycles it outputs one multiplication result in every subsequent cycle, achieving high throughput. With a full-custom back-end layout and routing flow, the performance of the multiplier would be better still.

5 Conclusion

The author's innovation: a 5-stage pipelined high-speed floating-point multiplier is proposed that is tailored to the internal resources of FPGA devices. The multiplier supports IEEE754 32-bit single-precision floating-point numbers and uses the radix-4 Booth algorithm, an improved Booth encoder, and the partial product compression structure. These reduce the hardware scale while maintaining high speed, making the multiplier suitable for engineering applications and scientific computing and easy to carry into an ASIC back-end layout. The design has been used in the author's floating-point FFT processor with good results.
