1 Introduction
With the rapid development of digital technology, the performance demands placed on microprocessors keep rising. Clock frequency, the main measure of microprocessor performance, is closely tied to the cycle time of a single multiplication in the multiplier. Improving microprocessor performance therefore requires high-speed, high-precision multipliers. At the same time, floating-point arithmetic based on the IEEE754 standard offers a large dynamic range and high precision, and its operation rules are simpler than those of fixed-point arithmetic, so the design of floating-point units has received widespread attention. This paper presents the design of a 32-bit floating-point multiplier that uses the radix-4 Booth algorithm, an improved Booth encoding scheme and an improved 4:2 compressor. Combined with pipelining and the characteristics of the FPGA itself, the design achieves high-speed floating-point multiplication while remaining stable, regular in structure, and easy to implement on an FPGA or to convert to an ASIC through HardCopy.
2 Operation rules and system structure
2.1 Representation rules of floating point numbers
This design uses the single-precision IEEE754 format [2]. Assume that the two numbers A and B involved in the operation are both single-precision floating-point numbers, that is:
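For reference, a normalized single-precision number in the standard IEEE754 layout has the value (-1)^S × 1.M × 2^(E-127), where S is the 1-bit sign, E the 8-bit biased exponent and M the 23-bit fraction. The following Python sketch unpacks these three fields; the function names are illustrative and do not come from the original design.

```python
import struct


def unpack_single(x: float) -> tuple[int, int, int]:
    """Return (sign, biased_exponent, fraction) of x stored as IEEE754 single."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF        # 23 bits, hidden leading 1 not stored
    return sign, exponent, fraction


def value_of(sign: int, exponent: int, fraction: int) -> float:
    """Rebuild the value of a normalized number: (-1)^S * 1.M * 2^(E-127)."""
    return (-1.0) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)
```

For example, -2.5 is -1.25 × 2^1, so its fields are sign 1, biased exponent 128 and fraction 0x200000.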
2.2 Hardware system structure of floating-point multiplier
This design is used in a dedicated floating-point FFT processor, so it must meet demanding speed requirements. To ensure that the floating-point multiplier runs stably at frequencies up to 80 MHz, the design is pipelined. Pipelining raises the operating speed of a synchronous circuit and increases data throughput. The internal structure of an FPGA is well suited to pipelined designs, which incur little or no additional cost. Based on the system partitioning, this design uses a 5-stage pipeline. Figure 1 is the hardware structure diagram of the floating-point multiplier.
3 Main module design and simulation
3.1 Exponent processing module (E_Adder) design
The 32-bit floating-point format is defined in [2]. As mentioned above, floating-point multiplication consists mainly of multiplying the two mantissas while the exponent addition and overflow detection proceed in parallel. In a 32-bit floating-point multiplier the exponent is 8 bits wide, so this design uses an 8-bit carry-lookahead adder with carry output to perform the exponent addition, bias removal and related operations. The specific process is as follows.
The E_Adder module is responsible for completing the summation operation of the exponent field in the floating-point multiplier operation, as shown in the following formula:
Here E[8] is the carry out of the MSB, and Bias = 127 is the exponent bias defined in the IEEE754 standard. Normalization is the normalization adjustment, which is needed because the exponent sum depends on the result of the mantissa multiplication. In this design, a selection scheme makes the exponent of the product available almost immediately after the Normalization flag is generated, so E_Adder is not on the critical path.
The design collects the three-level carry signal and combines it with the Normalization signal from the mantissa multiplication unit to normalize the result and to decide whether to output infinity, an infinitesimal (underflow) value, or a normal value.
The timing simulation of E_Adder shows that the design fully meets the application requirements.
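The exponent path described above can be sketched in Python as follows. This is a behavioral model only: it assumes normalized inputs, treats a result of 255 or more as overflow and a result of 0 or less as underflow, and does not model the carry-lookahead structure or the three-level carry signal. The function name and the status strings are illustrative.

```python
BIAS = 127  # exponent bias defined by IEEE754 single precision


def exponent_add(ea: int, eb: int, normalization: int) -> tuple[int, str]:
    """Return (biased exponent of the product, status).

    ea, eb: 8-bit biased exponents of the two operands.
    normalization: 1 if the mantissa product needs a one-bit right shift
    (product in [2, 4)), otherwise 0.
    """
    e = ea + eb - BIAS + normalization
    if e >= 255:          # exponent field would be all ones or larger
        return 255, "overflow"
    if e <= 0:            # too small for a normalized result (assumed flush)
        return 0, "underflow"
    return e, "normal"
```

For example, multiplying 2.0 (exponent 128) by 2.0 with no normalization shift gives a biased exponent of 129, which is 4.0.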
3.2 Improved Booth Encoder Design
The delay of the whole multiplier is determined mainly by the number of partial products that must be added, so reducing that number shortens the overall delay. This design uses a radix-4 Booth encoder to reduce the number of partial products to 13, and it improves on the traditional encoding scheme. The encoding algorithm is shown in Table 1.
Because an FPGA has abundant AND and OR gate resources, this method makes full use of the FPGA's internal resources. It saves area while preserving speed and accuracy, and it also meets the low-power requirement.
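As an illustration of why radix-4 recoding yields 13 partial products for a 26-bit operand, the following Python sketch implements the textbook radix-4 Booth recoding. It is not the improved encoding of Table 1, which is not reproduced here; the function name is hypothetical.

```python
def booth_radix4_digits(multiplier: int, width: int) -> list[int]:
    """Recode a width-bit two's-complement multiplier into width/2 digits.

    Each digit is in {-2, -1, 0, 1, 2} and is derived from the overlapping
    bit triple (y[2i+1], y[2i], y[2i-1]) with y[-1] = 0.
    """
    assert width % 2 == 0
    y = multiplier & ((1 << width) - 1)
    digits = []
    prev = 0  # implicit bit y[-1]
    for i in range(0, width, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits
```

With a 24-bit mantissa extended to 26 bits, `booth_radix4_digits(m, 26)` returns 13 digits whose weighted sum (digit i has weight 4^i) equals the mantissa.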
3.3 Partial product generation and compression structure design
3.3.1 Partial product generation structure
Based on the output of the Booth encoder, the partial products are generated according to the following formula [4]:
Here PPi is the partial product and Ai is the multiplicand. After the hidden bit and the sign bit are added, the 26-bit mantissa generates 13 partial products. In the floating-point multiplier, the mantissa arithmetic uses two's complement. Therefore, when NEG=1, a 1 must be added to the lowest bit of the partial product, because PPi only performs the bit inversion. To increase the parallelism of the design, this addition of 1 to the lowest bit is carried out in the partial product compression structure. In addition, to add signed numbers correctly, the sign bit of each partial product must be extended; the result is shown in Figure 4. Of the 13 partial products, the first is 29 bits wide and the others are extended to 32 bits. The first partial product includes a 3-bit sign extension "SSS", the 2nd to 13th partial products use the sign extension bits "SS", and the add-one bits are "NN", according to the following formula:
Here i is the row index of the partial product, and sign(i) is the sign of the partial product in row i.
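The idea of generating each partial product by inversion only and deferring the "+1" of the two's complement to the compression stage can be checked with a small Python model. This is a sketch under simplifying assumptions: it uses standard radix-4 Booth digits, keeps every row at a uniform 52-bit width instead of the 29-bit and 32-bit "SSS"/"SS" sign-extension layout described above, and all names are illustrative.

```python
WIDTH = 26      # mantissa with hidden bit and sign bit
PP_BITS = 52    # uniform row width used by this sketch (an assumption)
MASK = (1 << PP_BITS) - 1


def partial_products(a: int, b: int) -> tuple[list[int], list[int]]:
    """Return (rows, neg_bits) for multiplicand a and multiplier b.

    Each row is 0, A or 2A, bit-inverted when the Booth digit is negative.
    The matching +1 is returned separately in neg_bits, to be added later.
    """
    rows, neg_bits = [], []
    prev = 0  # implicit bit b[-1]
    for i in range(0, WIDTH, 2):
        b0 = (b >> i) & 1
        b1 = (b >> (i + 1)) & 1
        digit = b0 + prev - 2 * b1
        prev = b1
        magnitude = abs(digit) * a
        neg = 1 if digit < 0 else 0
        row = (~magnitude if neg else magnitude) & MASK  # inversion only
        rows.append((row << i) & MASK)
        neg_bits.append(neg << i)                        # deferred +1
    return rows, neg_bits


def multiply(a: int, b: int) -> int:
    """Sum the rows and the deferred +1 bits modulo 2**PP_BITS."""
    rows, neg_bits = partial_products(a, b)
    return (sum(rows) + sum(neg_bits)) & MASK
```

For 24-bit mantissas the product fits in 48 bits, so `multiply(a, b)` equals `a * b` exactly.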
3.3.2 Partial Product Compression Structure
This design uses a mixture of 4:2 compressors, 3:2 compressors, full adders, and half adders to achieve fast compression of 13 partial products while ensuring accuracy. The division of the partial product compression structure in this paper is shown in Figure 2.
In Figure 2, the dotted line gives the compression division of the traditional partial product, while the solid line describes the division of the partial product compression structure used in this paper. Such a division is conducive to simplifying the second-level compression structure, thereby saving FPGA internal resources while ensuring speed. As can be seen from Figure 2, some bits do not need to be calculated because these bits are generated by the sign bit of the multiplier mantissa introduced during Booth encoding, and 48 bits are sufficient to express the calculation result.
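The principle of compressing many partial products into two rows before a final addition can be sketched with word-level 3:2 carry-save steps in Python. This is a generic reduction, not the specific mixed 4:2/3:2 partitioning of Figure 2, and the function names are hypothetical.

```python
def csa(x: int, y: int, z: int) -> tuple[int, int]:
    """3:2 carry-save step: three rows in, sum row and carry row out."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def compress_to_two(rows: list[int]) -> tuple[int, int]:
    """Reduce at least two non-negative rows to two rows with the same sum."""
    rows = list(rows)
    while len(rows) > 2:
        nxt = []
        while len(rows) >= 3:
            nxt.extend(csa(rows.pop(), rows.pop(), rows.pop()))
        nxt.extend(rows)  # zero, one or two rows pass through unchanged
        rows = nxt
    return rows[0], rows[1]
```

No carry propagates inside a `csa` step, which is why compression is fast; only the final addition of the two remaining rows needs a carry-propagate adder.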
3.3.3 Improved 4:2 compressor
This design adopts the widely used 4:2 compressor and improves it to suit the internal resources of the FPGA, as shown in Figure 3. The traditional 4:2 compressor consists of two cascaded full adders and requires four XOR gates and eight NAND gates. The improved 4:2 compressor requires four XOR gates and two selectors (MUX). Eight NAND gates require 36 transistors, whereas two MUXs require 20 transistors. In addition, the FPGA integrates a large number of XOR gates and selectors, so this approach also makes full use of the FPGA's resources.
Because compressing the partial products requires a large number of 4:2 compressors, the improved circuit reduces layout area to some extent, which also benefits the ASIC back-end design of the multiplier. In addition, the delay from each of the four inputs to the output S of the improved compressor is the same, namely three XOR gate delays.
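A 4:2 compressor built from four XOR gates and two MUXs can be modeled bit by bit in Python as below. The wiring shown is one common XOR/MUX arrangement and is an assumption, since Figure 3 is not reproduced here; the signal names are illustrative.

```python
def mux(sel: int, when1: int, when0: int) -> int:
    """2:1 selector: returns when1 if sel is 1, otherwise when0."""
    return when1 if sel else when0


def compressor_4_2(x1: int, x2: int, x3: int, x4: int, cin: int):
    """Return (s, carry, cout) with x1+x2+x3+x4+cin == s + 2*(carry + cout)."""
    x12 = x1 ^ x2                 # XOR 1
    x34 = x3 ^ x4                 # XOR 2
    x1234 = x12 ^ x34             # XOR 3
    s = x1234 ^ cin               # XOR 4: three XOR levels from any x input
    cout = mux(x12, x3, x1)       # MUX 1: independent of cin
    carry = mux(x1234, cin, x4)   # MUX 2
    return s, carry, cout
```

Because `cout` does not depend on `cin`, a row of these compressors can be chained without a carry ripple along the row.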
4 Implementation and simulation of the 32-bit floating-point multiplier
Figure 4 shows the FPGA timing simulation results of this design. The timing simulation environment is Quartus II 7.0, the target chip is the EP1C6Q240C8 of the Cyclone series, and the functional simulation environment is ModelSim 6.0b. The entire design is described structurally in VHDL, and the synthesis strategy is area priority. The simulation shows that the floating-point multiplier operates stably at frequencies of 80 MHz and below. After an initial latency of 5 cycles, it outputs one multiplication result in every subsequent cycle, achieving high throughput. If full-custom back-end layout and routing were used, the performance of the multiplier would be better still.
5 Conclusion
The author's contribution: based on the internal resource characteristics of FPGA devices, a 5-stage pipelined high-speed floating-point multiplier suitable for FPGA implementation is proposed. The multiplier supports IEEE754 32-bit single-precision floating-point numbers and combines the radix-4 Booth algorithm, an improved Booth encoder and a partial product compression structure. This reduces the hardware scale while maintaining high speed, makes the multiplier suitable for engineering applications and scientific computing, and eases ASIC back-end layout. The design has been used in a floating-point FFT processor designed by the author and has achieved good results.