FPGA Implementation Method of FIR Filter-EEWORLD

Collect

A very important basic unit. In recent years, FPGA has developed rapidly due to its high speed, high integration and high reliability. With the demand for high precision and high processing speed in modern digital communication systems, more and more research has turned to the use of FPGA to implement FIR filters. For FIR filters, it is necessary to fully consider the reasonable optimization of their resources and operating speed. Various FIR filter structures have their own advantages and disadvantages. Only after understanding the advantages and
disadvantages of various structures can we better choose the appropriate structure to implement FIR filtering.

1 FIR digital filter
FIR digital filter consists of a finite number of sampling values. In the design, while meeting the amplitude characteristics, it can also ensure accurate and strict phase characteristics. Therefore, it is widely used in signal processing and other fields.
For FIR filters, its output y(n) is expressed as follows:

Where: N is the order (or number of taps) of the filter; x(i) represents the input sample at the i-th moment; h(i) is the i-th tap coefficient of the FIR filter.
Since the impulse response of the FIR filter is a finite sequence, its system function can be expressed as:

The basic structure of the FIR filter is shown in Figure 1. The FIR filter has poles only at the origin, so this makes the FIR filter globally stable. At the same time, the FIR filter satisfies the linear phase condition, and its impulse response sequence is real and satisfies the odd or even symmetry conditions, that is:

2 Implementation Methods
There are many different structures for implementing FIR digital filters using FPGA, but they are mainly divided into the following categories: serial structure, parallel structure, transposed structure, FFT algorithm-based structure, and distributed structure. Other types of FIR filter structures can be derived from the above structures.
2.1 Serial Structure
From expression (1), it can be seen that the FIR filter is actually a multiplication-accumulation operation. The order of the filter determines the number of multiplications and accumulations. Its serial structure is shown in Figure 2.

The serial structure FIR filter has a simple structure and occupies less hardware resources. It only needs to reuse one multiplier and one adder, so the cost is low. However, the FIR filter of this structure needs to go through multiple clock cycles before it can have an output. At the same time, the internal clock cycle is also affected by the operation speed of the multiplier. Therefore, the FIR filter of this structure has a slow processing speed and is only suitable for systems with low filter order and low processing speed requirements.
2.2 Parallel structure
The parallel structure FIR filter can be obtained by expanding the serial structure FIR filter. The parallel FIR filter structure is also called the direct FIR filter structure. This structure is directly based on the filter structure of Figure 1, and is implemented in parallel with multiple multipliers and adders. Usually, considering the symmetry of its filter coefficients, the input value is first added, then multiplied, and finally accumulated and output, so as to reduce the number of multipliers. Its structure is shown in Figure 3.

The parallel FIR filter can complete one filtering in one cycle and has a fast running speed. Although it uses the symmetry of the filter coefficient, it still takes up a large number of multipliers and adders, especially for filters with high filter orders, which occupy more resources. For example, for a 256-order filter, it requires 128 multipliers to implement. In order to increase the speed of the filter, a pipeline structure is often introduced, that is, a register is added after each addition or multiplication operation to store data, so that the filter can run at a higher frequency. 2.3 Transposed structure
According to the transposition theorem, if the direction of all branches in the network is reversed, and the input x(n) and output y(n) are interchanged, the system function H(z) remains unchanged. Through the transposition theorem, the parallel FIR filter can be transformed to obtain a transposed FIR filter, whose structure is shown in Figure 4.

The transposed FIR filter based on parallel structure realizes parallel input of data, and can complete one filtering in one cycle. Moreover, the structure of each level is the same, and data can be read out at each stage. The filter order can be expanded or reduced to realize filters of any order. However, since it is based on a parallel structure, it has some disadvantages of parallel structure, mainly for high-order filters, the resource occupation is huge and the design cost is high. Despite this, the transposed FIR filter is still a widely used filter.
2.4 Structure based on FFT
Using fast Fourier transform (FFT) to realize FIR filter is an important way to quickly realize filtering algorithm. As can be seen from formula (1), the output y(n) of FIR filter is the convolution of input x(n) and system impulse response sequence h(n). The convolution transform can be realized quickly by using FFT. As shown in Figure 5, the input signal x(n) is first transformed into its spectrum sampling value X(k) through FFT, and then multiplied with the frequency response sampling value H(k) of the FIR filter. H(k) can be stored in the memory in advance. Finally, the product X(k)H(k) is restored to a time domain sequence through inverse fast Fourier transform (IF-FT), and the output y(n) is obtained.

To implement FFT, the linear convolution of two finite length sequences can be replaced by their circular convolution without aliasing. The circular convolution length N ≥ N1 + N2-1 must be selected, that is, x(n) and h(n) are padded with zeros to a sequence of length N, that is:

in the FIR filter structure based on FFT, the number of multiplications required to calculate X(k), H(k) and the inverse Fourier transform y(n) is N/2log2N, and the calculation of X(k)H(k) requires N multiplications, so the total number of multiplications based on FFT is mf = 3/2Nlog2N + N. Since h(n) satisfies the condition of formula (3), the number of multiplications required for direct convolution is md = 1/2N1N2. Assuming N1=N2, the calculation amount of these two multiplications is compared as follows:

From Table 1, it can be seen that when N1<42, the calculation amount of the FFT method is less than that of the direct convolution. When N1=42, the calculation amount of the FFT method is equivalent to that of the direct convolution. When N1>42, the calculation amount of the FFT method is greater than that of the direct convolution. As N1 increases, the calculation speed of the FFT method becomes faster and faster, especially when N1=8 192, the calculation speed of the FFT method is nearly 100 times faster than that of the direct convolution.

2.5 Distributed structure
2.5.1 Principle of distributed algorithm
Distributed arithmetic (DA) was proposed by Croisier in 1973, but it was not widely used in FPGA to calculate the sum of multiplication and accumulation until the emergence of FPGA. For the signed number x(n), it can be expressed in the form of the complement of the following formula:

For h(i)xb(ni) in formula (7), it represents the product of the i-th bit of the input data x(ni) and the tap coefficient h(i). For the FIR filter, its coefficient h(i) is a constant, so a lookup table can be constructed in advance. The lookup table stores all the product values of h(i)xb(ni). The table is addressed by inputting (xb(N-1), xb(N-2), ..., xb(0)), and then the value is multiplied by 2b and shifted and accumulated to obtain the filter output y(n). The construction rules of the lookup table are shown in Table 2.

2.5.2 FIR filter structure based on distributed algorithm
There are three main types of FIR filter structures based on distributed algorithm.
(1) The first structure is a serial distributed structure. The principle of serial distributed FIR filter is to first use the lowest bit of all N input quantities to address the DA lookup table to obtain a partial product. The partial product is shifted right by one bit, which is equivalent to dividing by 2 and then temporarily stored in the register. At the same time, the second lowest bit of the N input quantities starts to address the DA lookup table to obtain another partial product. The partial product is added to the previous value stored in the register, and the added value is shifted right by one bit and placed in the register. Repeat the cycle accumulation until all bits are addressed
. Note that the partial product after the highest bit is addressed is subtracted, and the final value is the desired result.
When N is too large, that is, the filter order of the FIR filter is very high, using a lookup table to implement it will make the ROM storing the lookup table very large. For this purpose, a partial table structure can be used, that is, the lookup table is divided into multiple parts, and the same bit of the N input quantities corresponds to different partial table addresses. FIG6 shows a serial DA structure based on a 4-input partial table structure.

(2) The second structure is a parallel distributed structure. The parallel distributed structure is to look up the different bits of N input quantities at the same time, and the same bits are sent to the same ROM for addressing, and different bits have different ROMs. Its structure is shown in Figure 7.

The third structure is a serial-parallel combined distributed structure. It is a compromise solution that requires both a low speed and low resource usage. For the serial distributed algorithm, it is a one bit-at-a-time (1BAAT) lookup table, while the parallel distributed algorithm is B bits-at-a-time. Therefore, the serial-parallel combined distributed algorithm uses multiple bits at a time, such as 2BAAT and 3BA-AT. Figure 8 shows the structure diagram of the 4BAAT lookup table.
The median B in Figure 8 is a multiple of 4, and SRL is a shift register. In the SRL, the first column of the first row from the right is the 0th bit of the data, the second column is the 1st bit of the data, the third column is the 2nd bit of the data, and the fourth column is the 3rd bit of the data. Similarly, the first column from the right of the second row is the 4th bit of the data, the second column is the 5th bit of the data, the third column is the 6th bit of the data, and the fourth column is the 7th bit of the data. The subsequent rows are arranged in a similar digit order. In the first clock cycle, bits 0, 4, ..., B-4 of the data enter the lookup table ROM at the same time to find the required data. In the second clock cycle, bits 1, 5, ..., B-3 enter the ROM at the same time to find the required data. The data found is passed to the next level accumulator for accumulation, and the same operation is performed on the remaining data bits in turn. Since each block differs by 4 bits, that is, 16 times, in order to add the corresponding bits, it is multiplied by 16. The distributed algorithm structure of the FIR filter is faster than that implemented by a single multiplier, especially the higher the filter order, the more obvious its advantage. In the distributed structure, the serial structure queries 1 bit at a time, so for B bits of data, it takes B clock cycles to complete a filter without counting the time of shift register, etc.; while the parallel structure only needs 1 clock cycle to complete the filter, so the parallel structure is the structure with the best speed, but the parallel structure requires B DA lookup tables, which requires a large amount of ROM to store, increasing the consumption of hardware resources, especially the higher the order, the larger the hardware scale will be; the serial-parallel structure combines the advantages of the two structures to achieve coordination in speed and scale. In practical applications. The appropriate structure should be selected according to the system requirements.

3 Conclusion
This article qualitatively analyzes the FPGA implementation methods of various FIR filters. For low-order FIR filters, serial structure, parallel structure and transposed structure can be used to implement them. The FIR filters with parallel structure and transposed structure achieve speed advantages at the expense of resource loss. For high-order FIR filters, the serial structure, parallel structure and transposed structure based on the multiplier structure are difficult to meet the requirements of high-speed processing. The distributed algorithm converts multiplication into a table lookup and accumulation structure, which improves the speed of the FIR filter with distributed structure.
However, the three different forms of distributed structures should be reasonably selected based on comprehensive consideration of resources and speed. The FIR filter implemented with FFT also achieves speed improvement by reducing the amount of calculation, especially the higher the filter order, the more obvious the speed improvement.
The modern engineering technology field has higher and higher requirements for the bandwidth, high speed, and real-time signal processing of the filtering system. When using FPGA to implement FIR filtering, the FIR filter based on the multiplier structure cannot meet the above requirements. The distributed structure FIR filter cleverly uses the ROM lookup table to implement the multiplication and accumulation operation of fixed coefficients, avoiding the multiplication operation, and introduces the pipeline structure in each subsequent addition operation to improve the speed. Therefore, the use of distributed algorithms to implement FIR filters is a hot topic in current research. At the same time, no matter which distributed algorithm is used, ROM must be used as a lookup table. However, as the filter order increases, the number of ROMs will increase. How to reduce the number of ROMs while further improving the speed is a problem that needs to be solved in the future.

Reference address：FPGA Implementation Method of FIR Filter

Previous article：Design of 24×24-bit Low-power Multiplier Based on FPGA
Next article：Inverter welding power supply controller based on FPGA and NiosII