Design of an improved Wallace tree multiplier

Introduction

In microprocessor chips, the multiplier is at the core of digital signal processing and is a key data-processing component. The time the multiplier needs to complete one operation largely determines the clock frequency of the microprocessor, so optimizing multiplier speed and area matters a great deal to overall CPU performance. Speeding up the multiplier while reducing its area requires in-depth study of its algorithm, structure and circuit implementation.

The radix-4 Booth algorithm and the general structure of the multiplier.

A multiplier works by first generating partial products and then adding them to obtain the product. In current multiplier designs, the radix-4 Booth algorithm is commonly used to generate the partial products. For an N-bit signed multiplication A×B, conventional multiplication produces N partial products. If the multiplier B is radix-4 Booth-encoded, 3 bits are examined at a time: the adjacent high bit, the current bit and the adjacent low bit. Encoding reduces the number of partial products to ⌊(N+1)/2⌋ (where ⌊X⌋ is the largest integer not greater than X), and each encoded term selects one of the operands 0, ±A, ±2A. Generating 2A only requires shifting A left by one position. For signed multiplication, the radix-4 Booth algorithm is therefore convenient and fast. For unsigned numbers, the high bits only need to be extended with 0; everything else is handled the same way. Although the extension may produce one more partial product than signed multiplication does, this approach keeps the hardware uniform and is easy to implement. For 32-bit multiplication, taking the design of the instruction set into account, usually no more than 18 partial products need to be added.
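The recoding described above can be sketched in a few lines of Python. This is a minimal behavioural model, not the circuit used in the article; the function names `booth_radix4_digits` and `booth_multiply` are hypothetical, and the multiplier is assumed to be given as an n-bit two's-complement value.

```python
def booth_radix4_digits(b, n):
    """Recode an n-bit two's-complement multiplier into radix-4 Booth digits.

    Returns floor((n + 1) / 2) digits, least significant first,
    each taken from {-2, -1, 0, 1, 2}.
    """
    num_digits = (n + 1) // 2
    width = 2 * num_digits
    b &= (1 << n) - 1
    if (b >> (n - 1)) & 1:
        # Sign-extend to an even number of bits.
        b |= ((1 << width) - 1) & ~((1 << n) - 1)
    bits = b << 1  # append the implicit 0 below bit 0
    digits = []
    for i in range(num_digits):
        window = (bits >> (2 * i)) & 0b111  # high bit, current bit, low bit
        high = (window >> 2) & 1
        mid = (window >> 1) & 1
        low = window & 1
        digits.append(-2 * high + mid + low)
    return digits


def booth_multiply(a, b, n):
    """Multiply a by the n-bit signed value b; return (product, partial product count)."""
    digits = booth_radix4_digits(b, n)
    # Each digit selects 0, +-A or +-2A; digit i carries weight 4**i.
    partial_products = [(d * a) << (2 * i) for i, d in enumerate(digits)]
    return sum(partial_products), len(partial_products)


product, count = booth_multiply(-7, 5, 8)
print(product, count)
```

For an unsigned 32-bit multiplier, calling `booth_radix4_digits(b, 33)` models the zero extension and yields 17 digits, one more than the 16 of the signed case.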

Different adder array structures can be used to add the partial products, and the choice directly affects the time needed to complete a multiplication. The adder array structure is therefore an important factor in multiplier performance. The iterative array (IA) and the Wallace tree are the two most typical adder array structures. The IA has a regular structure and is easy to lay out, but it is the slowest and has a large area. In theory, the Wallace tree is the fastest adder array structure for multiplication, but the traditional Wallace tree has complex circuit interconnections and is difficult to lay out. To address this problem, tree structures with relatively simple connections have been introduced, such as ZM trees and OS trees. They all divide the IA array into several segments, each called a subtree. Connections inside a subtree follow the IA structure, while the subtrees are connected to one another as a tree, which reduces the connection complexity. However, this approach reduces the speed at which the partial products are added.

While improving the tree structure, designers have also tried to improve the basic adder unit of the array. In the earliest scheme, proposed by Wallace, the CSA (carry-save adder) was the basic unit of the adder array. The basic method is to use CSAs to compress the partial products stage by stage at a compression ratio of 3:2 until only two outputs remain, and then to use a carry-propagate adder to add the two resulting rows, the pseudo-sum and the local carries, to obtain the true result. Later, Dadda proposed a new adder unit called the "(j,k) counter", which has j inputs and k outputs, where j ≤ 2^k − 1. Research and practice showed that the 4-2 compressor (actually a 5-3 counter) is well balanced and symmetric, and that multipliers built from it as the basic adder unit have certain advantages in overall performance. The 4-2 compressor has therefore become the adder unit commonly used in multipliers today.
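The 3:2 compression scheme can be illustrated with a word-level model. The sketch below assumes non-negative integer rows and a simple greedy grouping (every complete group of three rows goes through a CSA, leftovers pass to the next level); the names `csa` and `reduce_with_csa` are hypothetical, and the grouping is not necessarily the one Wallace or the article uses.

```python
def csa(a, b, c):
    """Carry-save adder: compress three rows into a sum row and a carry row."""
    sum_row = a ^ b ^ c
    carry_row = ((a & b) | (a & c) | (b & c)) << 1  # carries have weight 2
    return sum_row, carry_row


def reduce_with_csa(rows):
    """Apply 3:2 compression level by level until only two rows remain."""
    rows = list(rows)
    levels = 0
    while len(rows) > 2:
        complete = len(rows) - len(rows) % 3
        next_rows = []
        for i in range(0, complete, 3):
            next_rows.extend(csa(rows[i], rows[i + 1], rows[i + 2]))
        next_rows.extend(rows[complete:])  # rows that did not fill a group
        rows = next_rows
        levels += 1
    return rows, levels


rows = list(range(1, 19))            # 18 example rows
final_rows, levels = reduce_with_csa(rows)
# A carry-propagate addition of the two remaining rows gives the true sum.
print(sum(final_rows) == sum(rows), levels)
```

Under this grouping, 18 rows shrink as 18, 12, 8, 6, 4, 3, 2, that is, six CSA levels before the final carry-propagate addition.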

　　Figure 1 lists the structures of several adder arrays in the multiplier. They all use a 4-2 compressor as the basic adder unit to complete the sum of 18 partial products. Each rectangle in the figure represents a set of 4-2 compressors, and the line segments with arrows represent partial products and intermediate results.

(a) IA array (b) Wallace tree
(c) First-order OS tree (d) Tree structure in reference [5]

Figure 1 Addition array structure used for adding 18 partial products

As mentioned before, the IA array in Figure 1(a) has the most regular structure, but it clearly has many more delay levels than the other structures. (b) is a Wallace tree. Since the 4-2 compressor is the only adder unit and 18 is not divisible by 4, two of the 18 partial products need additional handling during summation. The Wallace tree first passes 16 partial products through three stages of 4-2 compressors to produce two results, and then performs one more stage of 4-2 compression on these together with the remaining two partial products. The first-order OS tree in (c) takes a similar approach, except that the order of processing is changed. Both structures destroy the symmetry of the tree and create paths of unequal length, which wastes hardware resources and increases the complexity of layout and routing. (d) is the improved tree structure proposed in reference [5]. Its summation process is: divide the 18 partial products into 3 groups, sum the 6 partial products of each group to produce two intermediate results per group, and then add the resulting 6 intermediate results. Since every group sums 6 partial products, two sets of 4-2 compressors with the same structure can be used, which greatly reduces the complexity of layout and routing. The disadvantage is that path imbalance cannot be avoided when 4-2 compressors add the six intermediate results, so the delay of the critical path is still increased unnecessarily.
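The number of 4-2 compressor levels in these structures can be checked with a small counting model. The sketch below tracks only how many rows remain after each level; it ignores bit widths, wiring and the exact grouping drawn in Figure 1, and the function names are hypothetical.

```python
def levels_tree_4_2(rows):
    """Levels needed when every complete group of four rows is compressed
    to two at each level and leftover rows pass straight through."""
    levels = 0
    while rows > 2:
        groups, leftover = divmod(rows, 4)
        if groups == 0:
            rows = 2  # three rows left: one compressor with an unused input
        else:
            rows = 2 * groups + leftover
        levels += 1
    return levels


def levels_iterative_array(rows):
    """IA: the first compressor takes four rows; each later compressor takes
    the two running results plus two new rows."""
    if rows <= 2:
        return 0
    remaining = max(rows - 4, 0)
    return 1 + (remaining + 1) // 2


ia = levels_iterative_array(18)
# Wallace tree as described in the text: 16 rows through three levels,
# then one more level together with the remaining two rows.
wallace = levels_tree_4_2(16) + 1
# Structure of reference [5]: three groups of six, then the six results.
ref5 = levels_tree_4_2(6) + levels_tree_4_2(6)
print(ia, wallace, ref5)
```

With 18 partial products the model gives 8 levels for the IA array and 4 levels for both the Wallace tree and the structure of reference [5].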

Circuit structure and delay analysis of CSA and 4-2 compressors.

Since the CSA and the 4-2 compressor are the main basic units of the summing array, their circuit characteristics need to be analyzed and compared. As shown in Figure 2, the circuit logic of the CSA is actually a one-bit full adder, and its critical path has a delay of two levels of XOR gate logic. The 4-2 compressor can be regarded as two CSAs connected as shown in Figure 3.

Figure 2 CSA circuit structure

Figure 3 4-2 compressor circuit structure composed of two CSAs connected

A 4-2 compressor can easily be implemented with the connections shown in Figure 3. However, this unoptimized circuit structure is likely to lengthen the critical path unnecessarily. As mentioned above, the 4-2 compressor actually has 5 inputs of weight 1 and produces 2 outputs of weight 2 (Cout, C) and 1 output of weight 1 (S). This article calls it a 4-2 compressor rather than a 5-3 counter because, when these units are arranged side by side in a row, the compression ratio achieved on the number of addends is 4:2. Based on the truth table, a better 4-2 circuit structure can be designed, as shown in Figure 4, in which an XOR gate built from a 2-to-1 multiplexer replaces the traditional XOR gate.
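The arithmetic behaviour of the unit can be checked with a bit-level model built from two full adders, in the spirit of Figure 3. The exact wiring of Figure 3 is not reproduced in the text, so the connection order below (three inputs into the first adder, the fourth input and Cin into the second) is an assumption, and the function names are hypothetical.

```python
from itertools import product


def full_adder(x, y, z):
    """One-bit full adder (the CSA cell): returns (sum, carry)."""
    s = x ^ y ^ z
    carry = (x & y) | (x & z) | (y & z)
    return s, carry


def compressor_4_2(p1, p2, p3, p4, cin):
    """4-2 compressor from two full adders: returns (S, C, Cout)."""
    s1, cout = full_adder(p1, p2, p3)  # Cout goes to the neighbouring column
    s, c = full_adder(s1, p4, cin)     # Cin comes from the neighbouring column
    return s, c, cout


def verify():
    """Exhaustively check the weights and the carry-chain independence."""
    for p1, p2, p3, p4, cin in product((0, 1), repeat=5):
        s, c, cout = compressor_4_2(p1, p2, p3, p4, cin)
        # Five weight-1 inputs equal S (weight 1) plus C and Cout (weight 2).
        if p1 + p2 + p3 + p4 + cin != s + 2 * (c + cout):
            return False
        # Cout must not change with Cin, so carries do not ripple along a row.
        if cout != compressor_4_2(p1, p2, p3, p4, 1 - cin)[2]:
            return False
    return True


print(verify())
```

Because Cout is formed from the first adder alone, a row of these cells passes each lateral carry to exactly one neighbour instead of rippling it across the row.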

Figure 4 Circuit structure of 4-2 compressor based on multiplexer

In addition, by balancing the paths, this structure keeps the lateral carry chain from affecting the delay of the critical path. That is, the time needed to generate the C and S signals does not depend on the Cin signal. The critical path of the circuit has a delay of three XOR gates. Table 1 shows the circuit delay data obtained by simulation with Mentor's eldoD simulation tool under 90nm process conditions. The maximum delay of one 4-2 compressor stage is about 1.5 times the maximum delay of one CSA stage, yet it does the addition work of two CSA stages.

Table 1 Delay simulation data for the 4-2 compressor and the CSA

(a) 4-2 compressor delay simulation data

Signal delay  P1      P2      P3      P4
S (ps)        187.76  201.30  194.99  92.77
C (ps)        185.79  187.5   195.14

(b) CSA delay simulation data

Signal delay  A       B       C
Sum (ps)      134.46  138.11  94.492
Carry (ps)    118.97  111.98  100.73
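The two ratios quoted in the text can be recomputed from the largest entries of Table 1. The short sketch below only uses the worst-case S delay of the 4-2 compressor and the worst-case Sum delay of the CSA; the variable names are hypothetical.

```python
# Worst-case delays in picoseconds, read from Table 1.
COMPRESSOR_4_2_MAX_PS = 201.30  # S output of the 4-2 compressor, input P2
CSA_MAX_PS = 138.11             # Sum output of the CSA, input B

# One 4-2 compressor stage expressed in CSA stages.
compressor_vs_csa = COMPRESSOR_4_2_MAX_PS / CSA_MAX_PS
# One CSA stage expressed in 4-2 compressor stages.
csa_vs_compressor = CSA_MAX_PS / COMPRESSOR_4_2_MAX_PS
# Two cascaded CSA stages compared with the single 4-2 stage that replaces them.
two_csa_vs_compressor = 2 * CSA_MAX_PS / COMPRESSOR_4_2_MAX_PS

print(round(compressor_vs_csa, 2), round(csa_vs_compressor, 2),
      round(two_csa_vs_compressor, 2))
```

The first ratio is about 1.46 and the second about 0.69, consistent with the rounded figures of 1.5 and 0.7 used in the article.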

Improved Wallace tree multiplier structure and performance comparison.

For 32-bit signed multiplication, radix-4 Booth encoding forms 16 encoded terms and therefore produces 16 partial products; for unsigned multiplication there is one more encoded term and one more partial product. In addition, multiply-add/subtract (C±A×B) instructions are widely used in current CPU instruction sets, and the addend C has to be summed in the array as well. A single multiplication operation can therefore require up to 18 partial products to be added in the adder array. The number of partial products has a significant impact on the design of the array structure, which in turn affects the complexity of layout and routing and the number of delay stages of the array, as the analysis of the array structures in Figure 1 above shows.

The structures in Figure 1 suffer from poor symmetry and regularity of the tree, high layout and routing complexity, and an unnecessary increase in critical-path delay when summing the partial products. To solve these problems, this article improves on the traditional Wallace tree structure and proposes the tree array structure shown in Figure 5.

Figure 5 Tree array structure combining CSA with 4-2 compressor

In this structure, CSAs and 4-2 compressors together serve as the basic adder units that compress the 18 partial products. The process is as follows: CSAs first compress the 18 partial products into 12 intermediate results; 4-2 compressors then perform the second compression, producing 6 intermediate results; a CSA stage compresses these 6 results into 4; and a final 4-2 compressor stage compresses the 4 results into the two pseudo-sums, which are sent to the carry-propagate adder to obtain the final result. Using CSAs in the first and third compressions allows all 18 initial partial products, and all 6 intermediate results of the second compression, to be processed at the same time. The paths are therefore balanced in delay, which avoids the unnecessary waiting found in an array that uses only 4-2 compressors as the basic adder unit. Replacing two 4-2 compressor stages with two CSA stages also significantly shortens the critical-path delay, which is of high practical value for high-speed integrated circuit design.
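The row counts and the delay of this schedule can be traced with a small model. The sketch below counts rows only and charges each CSA stage 0.7 of a 4-2 compressor stage, the approximation the article derives from Table 1; the names `UNITS` and `run_schedule` are hypothetical, and the all-4-2 schedule uses a simple greedy grouping rather than the exact trees of Figure 1.

```python
# Inputs, outputs and delay (in 4-2 compressor stages) of each adder unit.
UNITS = {
    "csa": {"inputs": 3, "outputs": 2, "delay": 0.7},
    "4-2": {"inputs": 4, "outputs": 2, "delay": 1.0},
}


def run_schedule(rows, schedule):
    """Apply one compression stage per entry of `schedule`.

    Returns the row count after every stage and the accumulated delay.
    """
    trace = [rows]
    delay = 0.0
    for unit_name in schedule:
        unit = UNITS[unit_name]
        groups, leftover = divmod(rows, unit["inputs"])
        rows = groups * unit["outputs"] + leftover  # leftovers wait a stage
        delay += unit["delay"]
        trace.append(rows)
    return trace, round(delay, 2)


proposed_trace, proposed_delay = run_schedule(18, ["csa", "4-2", "csa", "4-2"])
only_4_2_trace, only_4_2_delay = run_schedule(18, ["4-2"] * 4)
print(proposed_trace, proposed_delay)
print(only_4_2_trace, only_4_2_delay)
```

In the mixed schedule every stage consumes all of its input rows (18, 12, 6, 4, 2), while the all-4-2 schedule leaves two rows waiting at each of its first three stages, which is the path imbalance the text describes.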

In addition, Figure 5 shows that this structure has good symmetry and regularity and a small number of macro modules, which helps layout and routing. For the multiplication instructions commonly used in current instruction sets, the structure also achieves a very high hardware utilization. In summary, it retains the fast summation of the traditional Wallace tree while remedying the shortcomings of arrays built from a single type of adder unit.

To compare the area of this structure with that of each array structure in Figure 1, this article uses a fully custom design method in the 90nm process and lays out each case with Cadence's layout tool Virtuoso. The delay of the critical path is measured in units of 4-2 compressor stages, without considering interconnect delay, and the structures are further compared using the AT² metric. The results are shown in Table 2. (From the data in Table 1, the delay of one CSA stage ≈ the delay of 0.7 of a 4-2 compressor stage.)

Table 2 Comparison of various structures

Array structure                     Area A (μm²)  Delay T (4-2 levels)  AT²     Normalized AT²
IA array                            0.0362        8                     2.3168  3.3
Wallace tree                        0.0437        4                     0.6992  1
First-order OS tree                 0.0402        4                     0.6432  0.92
Structure in reference [5]          0.0414        4                     0.6624  0.95
Structure proposed in this article  0.0418        3.4                   0.4832  0.69
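The AT² column and its normalized values follow directly from the area and delay columns of Table 2. The sketch below recomputes them with the Wallace tree as the normalization baseline; the dictionary keys are shorthand labels chosen for this example.

```python
# (area A, delay T in 4-2 compressor levels), taken from Table 2.
STRUCTURES = {
    "IA array": (0.0362, 8),
    "Wallace tree": (0.0437, 4),
    "First-order OS tree": (0.0402, 4),
    "Reference [5]": (0.0414, 4),
    "Proposed": (0.0418, 3.4),
}

# AT^2 = area * delay^2 for every structure.
at2 = {name: area * delay ** 2 for name, (area, delay) in STRUCTURES.items()}

# Normalize against the traditional Wallace tree.
baseline = at2["Wallace tree"]
normalized = {name: value / baseline for name, value in at2.items()}

for name in STRUCTURES:
    print(f"{name:22s} AT2={at2[name]:.4f} normalized={normalized[name]:.2f}")
```

Because the delay enters squared, the reduction from 4 to 3.4 levels outweighs the small area difference between the structures, giving the proposed structure a normalized AT² of about 0.69.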

Conclusion:

By combining CSAs with 4-2 compressors, the circuit makes the most efficient use of hardware when summing the partial products. The structure benefits from the small layout area of the CSA and also from the high compression ratio and speed of the 4-2 compressor. Compared with the other structures, the improved structure proposed in this article therefore achieves relatively good results in both area and speed. Although its layout and routing still involve a certain complexity, it is a considerable improvement over the traditional Wallace tree. The layout design of a multiplier with this structure is now basically complete and is being used in an ongoing 64-bit high-performance embedded CPU design project, which is expected to tape out in March 2007.

References
[1] Bewick G. Fast multiplication: algorithms and implementation[D]. Stanford University, 1994
[2] Poornaiah D. Algorithm for designing efficient VLSI concurrent add-multiply and add-multiply-add cells for DSP applications[J]. Electronics Letters, 2000, 36(5): 399-400
[3] Jessani R M, Putrino M. Comparison of Single- and Dual-Pass Multiply-Add Fused Floating-Point Units[J]. IEEE Trans Comput, 1998, 47(9): 927-937
[4] Sousa L, Chaves R. A universal architecture for designing ef
