FPGA Implementation of 20×18-bit Signed Fixed-point Multiplier-EEWORLD

With the rapid development of computer and information technology, demands on device processing speed and performance keep rising. In chips such as high-speed digital signal processors (DSPs), microprocessors, and RISC processors, the multiplier is an essential arithmetic logic unit and often lies on the critical delay path. A multiplication must complete within one clock cycle, so the time needed for one multiplication largely determines the clock frequency of the microprocessor. High-performance multipliers are therefore important components of modern microprocessors and high-speed digital signal processing.

At present, four multiplier design approaches are common in China: the parallel multiplier, the shift-add multiplier, the lookup-table multiplier, and the adder-tree multiplier. The parallel multiplier is easy to implement and fast, but it consumes a lot of resources, and the cost becomes very large for wide operands. The shift-add multiplier accumulates the product term by term through shifts and additions; it uses few devices but needs many clock cycles and is slow. The lookup-table multiplier stores the products in memory and uses the operands as the address, so the data read out is the product. Its speed is limited by the memory access time, and because the memory size grows sharply with operand width, it is unsuitable for wide multiplications. The adder-tree multiplier uses a pipelined structure and can complete the multiplication of two numbers in one clock, but as the operand width grows the number of pipeline stages grows too, which requires many registers and increases device consumption.

By comparison, multipliers using the Booth algorithm have great advantages in speed, device count, accuracy, and power consumption.

This paper presents the design of a 20×18-bit fixed-point array multiplier that uses the radix-4 Booth algorithm and 4-2 compression. The basic logic cells come from the standard cell library of SMIC's 0.18 μm process. The design reduces the device count of the multiplier while achieving high speed and low power consumption. Its structure is regular and easy to implement in an FPGA, and it is also a good choice for ASIC design.

1 Multiplier Structure

The logic design of the 20×18-bit multiplier can be divided into Booth encoding, partial product generation, the 4-2 compression tree, the carry-lookahead adder, and rounding with overflow handling. The Booth algorithm removes 50% of the partial products, and the 4-2 compression tree reduces the number of addends to be summed, which reduces the number of adders and saves devices. Compared with the traditional approach, it also shortens the multi-stage propagation delay of serial accumulation or a Wallace tree structure, which raises the speed of the whole multiplier.

After 4-2 compression, the last two numbers are added directly, and the delay of this step is that of one carry-lookahead adder. The result then goes through overflow handling and rounding according to the required data precision. Figure 1 shows the overall structure.

Figure 1 Multiplier structure

1.1 Booth Coding and Partial Product Design

Radix-4 Booth encoding is used here. In two's-complement representation, extending the most significant (sign) bit does not change the value. Let the bit width of the multiplier A be N. If N is odd, A is sign-extended to A' so that its bit width is even. Let the width of A' after this step be H, where H is even and not less than N. Then A' can be expressed as:

formula
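For reference, a standard way to write the radix-4 Booth expansion of an H-bit two's-complement number is given below. The bit names a_i and the convention a_{-1} = 0 are notational assumptions and are not taken from the original equation. Each bracketed coefficient takes one of the values -2, -1, 0, 1, 2.

```latex
A' = -a_{H-1}\,2^{H-1} + \sum_{i=0}^{H-2} a_i\,2^{i}
   = \sum_{k=0}^{H/2-1} \left( a_{2k-1} + a_{2k} - 2\,a_{2k+1} \right) 4^{k},
\qquad a_{-1} = 0
```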

The values are shown in Table 1:

Table 1 Booth encoding truth table

Radix-4 Booth encoding examines 3 bits at a time (the current bit, the adjacent high bit, and the adjacent low bit), processes 2 bits per step, selects the operand 0, B, or 2B together with its sign, and forms H/2 encoded terms and partial products. 2B is obtained simply by shifting B left by 1 bit. The radix-2 algorithm, by contrast, examines only 2 bits at a time, processes 1 bit per step, and forms N encoded terms and partial products, so it is merely convenient. The radix-4 algorithm is both convenient and fast. In SMIC's 0.18 μm standard cell library, the logic expression of the Booth encoding cell is:

formula

M2 is the adjacent high bit, M1 is the current bit, and M0 is the adjacent low bit. S gives the sign: 0 means positive and 1 means negative. When A is 0 the operand is 0, and when A is 1 the operand is B. When X2 is 0 the operand is 0, and when X2 is 1 the operand is 2B.

0, B, and 2B are easy to generate, since 2B=(B<<1). (-2B) is implemented as follows: -2B=2×(-B)=[~(B<<1)]+1.

In hardware, the weights of adjacent partial products differ by a factor of 4, that is, the partial products are staggered by two bits. The "+1" of each negative partial product is not added in place. Instead, all the "+1" terms are collected and treated as one separate addend, which saves several adders and devices. An 18 b multiplier produces 9 partial products. With this modified Booth encoding, the extra addend that carries the complement "+1" terms brings the total to 10 addends.
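The encoding and the collected "+1" terms can be checked with a small behavioral model. The Python sketch below is not the SMIC cell logic or the authors' Verilog. The function names booth_digits and booth_addends are hypothetical, and unbounded Python integers stand in for the 38-bit datapath. It only illustrates that 9 one's-complement partial products plus one correction word sum to the product.

```python
def booth_digits(a, width=18):
    """Radix-4 Booth digits (each in -2..2) of a signed `width`-bit integer, LSB first."""
    assert width % 2 == 0
    bits = a & ((1 << width) - 1)        # two's-complement bit pattern of a
    digits = []
    prev = 0                             # implicit bit to the right of bit 0
    for i in range(0, width, 2):
        m0 = prev                        # adjacent low bit
        m1 = (bits >> i) & 1             # current bit
        m2 = (bits >> (i + 1)) & 1       # adjacent high bit
        digits.append(m0 + m1 - 2 * m2)
        prev = m2
    return digits


def booth_addends(a, b, width=18):
    """Return width/2 partial products plus one correction word (10 addends for 18 bits)."""
    addends = []
    correction = 0
    for i, d in enumerate(booth_digits(a, width)):
        magnitude = abs(d) * b           # 0, B or 2B (2B is B << 1)
        if d < 0:
            # Negative term: store the one's complement and defer the "+1".
            addends.append((~magnitude) << (2 * i))
            correction += 1 << (2 * i)
        else:
            addends.append(magnitude << (2 * i))
    addends.append(correction)           # all deferred "+1" terms in one word
    return addends
```

For example, sum(booth_addends(126999, 68850)) equals 126999 * 68850, and the list always has 10 entries for an 18-bit multiplier.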

1.2 4-2 Compression Logic Implementation

Figure 2 shows the schematic of 4-2 compression. The cell has 5 inputs (A, B, C, D, ICI) and 3 outputs (S, CO, ICO), so on its own it is a 5-3 counter. Placing such cells side by side gives a row of 5-3 counters, and connecting the ICO of each adjacent lower position to the ICI of the current position turns the row into a 4-2 compressor, which reduces four operands to two. The arithmetic relation of the 5-3 counter is:

S+CO×2+ICO×2=A+B+C+D+ICI

That is, the inputs A, B, C, D, and ICI and the output S have weight 1, while the outputs CO and ICO have weight 2.
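The relation can be verified exhaustively with a bit-level model. The sketch below assumes the common construction of a 4-2 compressor cell from two cascaded full adders. The actual CMPR42 cell expression is not reproduced here and may differ in its internal structure. The names full_adder and cmpr42 are illustrative.

```python
def full_adder(x, y, z):
    """One-bit full adder: returns (sum, carry)."""
    s = x ^ y ^ z
    c = (x & y) | (y & z) | (x & z)
    return s, c


def cmpr42(a, b, c, d, ici):
    """5-3 counter built from two full adders: returns (S, CO, ICO)."""
    t, ico = full_adder(a, b, c)     # ICO depends only on A, B, C, not on ICI
    s, co = full_adder(t, d, ici)
    return s, co, ico
```

Because ICO does not depend on ICI in this construction, chaining ICO to the next cell's ICI does not create a carry ripple along the row.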

Figure 2 4-2 compression schematic

In SMIC's 0.18 μm standard cell library, the logic expression of the 4-2 compressor cell CMPR42 is:

formula

In hardware, the 10 partial products require 4 instances of 4-2 compression, arranged in 3 levels of 2, 1, and 1 instances from top to bottom. These reduce the number of addends from 10 to 6, then to 4, and finally to 2. Figure 3 shows the 4-2 compression interconnection.
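The 2-1-1 arrangement can be sketched at word level, where each call models a whole row of 4-2 compressor cells with ICO chained to the next ICI. The exact wiring of Figure 3 is not known here, so which addends bypass each level is an assumption. The names compress_4_2 and reduce_10_to_2 are illustrative, and Python integers replace fixed-width buses.

```python
def compress_4_2(w, x, y, z):
    """Word-level 4-2 compression: four addends in, two out, same total."""
    s1 = w ^ x ^ y                                  # first full-adder row: sum
    c1 = ((w & x) | (x & y) | (w & y)) << 1         # ICO, shifted into the next column's ICI
    s2 = s1 ^ z ^ c1                                # second full-adder row: S
    c2 = ((s1 & z) | (z & c1) | (s1 & c1)) << 1     # CO, weight 2
    return s2, c2


def reduce_10_to_2(p):
    """Reduce 10 addends to 2 with four 4-2 rows in a 2-1-1 arrangement."""
    assert len(p) == 10
    # Level 1: two rows take 8 addends, 2 pass through (10 -> 6).
    l1 = [*compress_4_2(*p[0:4]), *compress_4_2(*p[4:8]), p[8], p[9]]
    # Level 2: one row (6 -> 4).
    l2 = [*compress_4_2(*l1[0:4]), l1[4], l1[5]]
    # Level 3: one row (4 -> 2); the two outputs go to the final adder.
    return compress_4_2(*l2)
```

The two returned words are what the final carry-lookahead adder would add.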

Figure 3 4-2 compression interconnection for 10 partial products

1.3 Overflow Handling and Rounding

The full-width product of a fixed-point multiplication cannot overflow, but the result is wider than the operands: 20 b × 18 b gives 38 b. Often the full 38 b is not stored and only some of the bits are kept, which requires rounding. Suppose A has 20 bits (1 sign bit, 5 integer bits, and 14 fractional bits), B has 18 bits (1 sign bit, 2 integer bits, and 15 fractional bits), and the result has the same format as A.

As shown in Figure 4, because only 5 integer bits are retained, the top 3 bits are treated as sign bits. If they differ, the result has overflowed; otherwise it has not. The direction of the overflow is then determined from the true sign in the top two bits. If the sign is 0, the result overflows and is set to 20'h7ffff. Otherwise it underflows and is set to 20'h80001. In the logic design this is implemented with a selector (multiplexer). The Verilog HDL code is: assign ceil = data_in[37] ? 20'h80001 : 20'h7ffff; where data_in[37] is the most significant bit.

Figure 4 Rounding diagram

Verilog HDL is one of the most widely used hardware description languages. It supports logic design at various levels, simulation and verification, and timing analysis, it can be ported across chips from different manufacturers, and the code is highly readable, so this module is designed in Verilog HDL.

If the number does not overflow, the discarded fractional part must still be rounded. For a positive number, 1 is added to the retained part when the bit immediately below the retained bits is 1; otherwise nothing is added. For a negative number, 1 is added only when that bit is 1 and at least one further discarded bit below it is also 1; otherwise nothing is added. An exact half therefore rounds away from zero in both cases.
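A behavioral sketch of the overflow and rounding rules follows. It assumes the formats given above (29 fractional bits in the product, 14 in the result, so 15 bits are dropped) and tests overflow by range, which is equivalent to requiring product bits 37 down to 34 to be identical. Clamping a round-up that would pass the positive limit is an added assumption, since the text does not cover that case. The name round_saturate is illustrative.

```python
FRAC_DROP = 15          # 29 product fraction bits minus 14 kept fraction bits
MAX_OUT = 0x7FFFF       # 20'h7ffff
MIN_OUT = -0x7FFFF      # 20'h80001 read as a signed 20-bit value


def round_saturate(p):
    """Map a signed 38-bit product to the signed 20-bit result format."""
    kept = p >> FRAC_DROP                       # arithmetic shift keeps the sign
    if kept > MAX_OUT:                          # positive overflow
        return MAX_OUT
    if kept < -(1 << 19):                       # negative overflow
        return MIN_OUT
    half = (p >> (FRAC_DROP - 1)) & 1           # bit just below the kept bits
    below = p & ((1 << (FRAC_DROP - 1)) - 1)    # remaining discarded bits
    if p >= 0:
        add_one = half == 1
    else:
        add_one = half == 1 and below != 0
    # Assumption: a round-up past the positive limit is clamped.
    return min(kept + add_one, MAX_OUT)
```

For instance, a product of 3.5 result LSBs rounds to 4 and a product of -3.5 result LSBs rounds to -4.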

2 Implementation and Simulation Test of the 20×18-bit Fixed-point Multiplier

This module is simulated with Mentor Graphics' ModelSim SE 6.0d. Figure 5 shows the simulation results of the FPGA design. In Figure 5, in1 is the 20 b multiplicand and in2 is the 18 b multiplier. reset is the active-low reset signal. booth_multiplier_out is the 38 b result computed by the Booth-encoded multiplier. derect_multiplier_out is the result obtained directly with the multiplication operator "*", which is also 38 b. The two results agree. round_out is the rounded 20 b result. eq is a 1 b signal added for the test; it is 1 when booth_multiplier_out and derect_multiplier_out are equal and 0 otherwise.

Figure 5 FPGA simulation results

Because the inputs and the outputs are each registered for one clock during the test, the final output is delayed by two clocks. In Figure 5, the multiplier and multiplicand applied in the first clock are 126,999 and 68,850 respectively, and the result 8,743,881,150 appears in the third clock. Since 126,999×68,850=8,743,881,150, the result is correct.

The amount of test data is large. Sweeping in1 over its full range, from -2^19 to 2^19-1, takes ModelSim SE 6.0d about 1 minute. Sweeping in1 from -2^19 to 2^19-1 and in2 from -2^17 to 2^17-1 would therefore take about T = 2^18 min ≈ 4,369 hours ≈ 182 days, so exhaustive testing on a PC is not practical. The testbench therefore uses the random function to generate random test inputs. The multiplier was simulated in ModelSim for 12 hours and the eq signal stayed at 1 throughout, that is, the result computed by the multiplier always matched direct multiplication. The design is therefore considered feasible.
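The arithmetic behind the run-time estimate can be reproduced in a few lines. The sketch below takes the figure of about 1 minute per full in1 sweep from the text. The function name exhaustive_sweep_estimate and its parameters are illustrative.

```python
def exhaustive_sweep_estimate(minutes_per_in1_sweep=1.0, in2_bits=18):
    """Estimate the run time of testing every (in1, in2) pair."""
    in2_values = 2 ** in2_bits                  # -2^17 .. 2^17 - 1 gives 2^18 values
    minutes = in2_values * minutes_per_in1_sweep
    hours = minutes / 60
    days = hours / 24
    return minutes, hours, days


minutes, hours, days = exhaustive_sweep_estimate()
print(f"{minutes:.0f} min = {hours:.0f} h = {days:.0f} days")
```

With the default arguments this gives 262,144 minutes, which is roughly 4,369 hours or 182 days.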

3 Performance Comparison and Innovation

The module was synthesized with Synplify Pro 8.1 and placed and routed with Xilinx ISE 7.1i. Table 2 shows the system resource usage, taken from the Map report under Implement Design in Xilinx ISE.

Table 2 System resource usage

The static timing analysis report shows that the speed and delay are 62.805 MHz and 15.922 ns respectively.

The design uses 4-2 compression, which has a compression ratio of 50%, whereas ordinary 3-2 compression achieves 33%. It also uses an advanced integrated circuit process, SMIC's 0.18 μm standard cell library, so it raises speed while reducing the device count. The multiplication completes within 1 clock. A pipelined structure, by contrast, can raise the clock frequency to 105.38 MHz, but it needs 3 clocks and a large number of latches, which increases both device count and power consumption, and it still takes 24.30 ns to complete one multiplication. Because domestic integrated circuit manufacturing started late, 80% of China's integrated circuit design companies still use processes of 0.35 μm and below. Comparable domestic multipliers built with Shanghua's 0.5 μm standard cell library need close to 30 ns per multiplication and use 1,914 logic units. This design completes a multiplication in only 15.922 ns and occupies only 494 slices, a significant improvement in performance.

4 Conclusion

This paper presents the design of a 20×18-bit signed fixed-point multiplier. The entire design is described in Verilog HDL, and the target device is xc2vp70-6ff1517. The design uses radix-4 Booth encoding, 4-2 compression, and the SMIC 0.18 μm standard cell library, which reduces both the area and the delay of the multiplier and achieves a good compromise between chip performance and design complexity. The design serves as the 20×18-bit signed fixed-point multiplier of the 3780-point FFT unit in the China Terrestrial Digital Television Broadcasting (DTMB) ASIC, where it works well at 60 MHz, meets the predetermined performance requirements, and has practical value.
