Implementing high-performance DSP functions using Spartan-3 FPGA

feifei

Spartan-3 FPGAs enable embedded DSP functions at a breakthrough price point. This article describes the features of Spartan-3 FPGAs that are optimized for DSP and analyzes their performance and cost advantages through implementation examples.

All low-cost FPGAs offer basic logic performance at an attractive price and can meet a wide range of general-purpose design needs. However, embedding DSP functions in the FPGA fabric has traditionally meant selecting a high-end FPGA to gain platform features such as embedded multipliers and distributed memory.

The introduction of Spartan-3 FPGAs has changed the landscape of embedded DSP applications. Although Spartan-3 devices are priced lower, they still provide the platform features required for DSP design. These features allow signal processing functions to be implemented with higher area utilization, so designs can reach lower price points.

Spartan-3 devices are ideal for use as coprocessors or pre-/post-processors, offloading computationally intensive functions from the programmable DSP to enhance system performance.

Optimized for DSP

Xilinx's Spartan-3 devices use 90nm process technology and 300mm wafers, which greatly reduces the cost of the FPGAs. At the same time, these devices include key DSP resources such as embedded 18×18-bit multipliers, large block memories (18 Kb each), distributed RAM, and shift registers. With these features, DSP algorithms can be implemented in Spartan-3 FPGAs at a much lower price than in competing FPGAs.

Figure 1: The enhanced architecture allows 16 registers to be replaced with a single LUT.

In addition to increasing the basic performance of the system, these embedded features also improve device utilization. For example, if the function of a Spartan-3 embedded multiplier were instead built from the logic fabric, it would take up 300 to 400 logic elements (LEs). In addition, because the embedded multiplier sits next to the logic fabric, it is simple to extend its functionality, for example by adding an adder or by cascading multiple multipliers to support more complex arithmetic functions.
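As a rough illustration of cascading multipliers, the following Python sketch models one way a wider signed multiply can be assembled from 18×18 signed products plus shifts and adds in the surrounding logic. The 35-bit operand width, the function names (`mult18`, `mult35`, `mac`), and the split into a signed upper half and a 17-bit unsigned lower half are assumptions made for this example, not details taken from the article.

```python
MULT_BITS = 18  # operand width of one embedded multiplier, per the article


def fits_signed(value, bits=MULT_BITS):
    """True if value is representable as a signed two's-complement number of `bits` bits."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def mult18(a, b):
    """Model of a single signed 18x18 multiplier; rejects operands that are too wide."""
    if not (fits_signed(a) and fits_signed(b)):
        raise ValueError("operand does not fit in 18 signed bits")
    return a * b


def mac(acc, a, b):
    """Multiplier followed by an adder in the fabric: one multiply-accumulate step."""
    return acc + mult18(a, b)


def mult35(a, b):
    """Signed 35x35 multiply built from four 18x18 products (hypothetical arrangement)."""
    low_bits = MULT_BITS - 1            # 17 low bits, treated as a non-negative value
    mask = (1 << low_bits) - 1
    a_hi, a_lo = a >> low_bits, a & mask  # a == a_hi * 2**17 + a_lo
    b_hi, b_lo = b >> low_bits, b & mask
    return ((mult18(a_hi, b_hi) << (2 * low_bits))
            + ((mult18(a_hi, b_lo) + mult18(a_lo, b_hi)) << low_bits)
            + mult18(a_lo, b_lo))
```

Each partial operand fits in 18 signed bits, so every call to `mult18` corresponds to one multiplier block, while the shifts and additions stand in for fabric logic.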

To improve efficiency, many DSP functions are best implemented as pipelines operating in a time-multiplexed fashion. This yields faster, higher-bandwidth systems, but at the cost of more temporary storage. For example, a time-multiplexed filter requires the result of each multiply-accumulate unit to be stored in a shift register. Such a design can exhaust register or memory resources before the FPGA's logic resources are used up. The Spartan-3 FPGA family is unique in that it offers a mode in which a lookup table (LUT) can either perform a logic function or be configured as a 16-bit shift register.

As shown in Figure 1, this enhanced architecture allows a single LUT to replace 16 registers, maximizing area utilization when implementing time-multiplexed DSP functions.
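The LUT-as-shift-register mode can be pictured with a small behavioral model. The Python sketch below is only an illustration under stated assumptions: the class name `SRL16`, the addressable read port, and the helper `delay_line` are invented for this example and do not represent a vendor primitive or API.

```python
class SRL16:
    """Behavioral model of one LUT used as a 1-bit, 16-deep shift register."""

    DEPTH = 16

    def __init__(self):
        # bits[0] is the most recently shifted-in bit, bits[15] the oldest.
        self.bits = [0] * self.DEPTH

    def read(self, addr):
        """Return the bit that was shifted in addr + 1 clocks ago (addr in 0..15)."""
        if not 0 <= addr < self.DEPTH:
            raise ValueError("addr must be in 0..15")
        return self.bits[addr]

    def clock(self, d):
        """Shift one bit in on a clock edge; the oldest bit is discarded."""
        self.bits = [d & 1] + self.bits[:-1]


def delay_line(samples, delay, width=16):
    """Delay unsigned `width`-bit words by `delay` clocks (1..16), one SRL16 per bit."""
    srls = [SRL16() for _ in range(width)]
    out = []
    for word in samples:
        # Read the current outputs first, then apply the clock edge.
        out.append(sum(srl.read(delay - 1) << bit for bit, srl in enumerate(srls)))
        for bit, srl in enumerate(srls):
            srl.clock((word >> bit) & 1)
    return out
```

In this model a 16-bit word delayed by 16 clocks needs 16 LUTs, where a register-based delay line would need 16 × 16 = 256 flip-flops.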

Many DSP functions are also memory-intensive, requiring scratch memory to store coefficients, implement FIFOs, and obtain large buffers. Spartan-3 devices offer more memory bits than other low-cost FPGAs currently in use. For many DSP designs, the most important resource is the embedded memory within the FPGA, not the logic circuits or multipliers. Because of the lack of memory resources, designers using competing low-cost devices have to choose larger devices or use external memory to build systems that can be implemented with just a small Spartan-3 FPGA.

Common DSP function implementation

The following analysis shows how these characteristics affect device utilization by examining two implementations of finite impulse response (FIR) filters: one based on multiply-accumulate (MAC) units and the other a multi-channel filter based on distributed arithmetic (DA).

FIR filters are commonly used in base stations, digital video, wireless LAN, xDSL, and cable modems. The test case is a 64-tap MAC FIR filter with 16-bit data and coefficients, running at 130MHz in a Spartan-3 XC3S400 FPGA. The first implementation uses only one MAC, while the second implementation uses four MACs.

Going from a single MAC implementation to a quad MAC implementation significantly increases the performance of the FIR filter while only doubling the number of LUTs and still using only 4% of the total available logic resources. The quad MAC implementation uses four RAMs and four MACs to efficiently implement the FIR filter with minimal device logic resources.
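The trade-off between one and four MACs can be sketched with a simple cycle-level model. The Python code below is an assumption-laden illustration: it supposes each MAC performs one multiply-accumulate per clock and handles an equal slice of the taps, and it ignores pipeline latency and memory access details. The function names are hypothetical.

```python
def fir_reference(x, h):
    """Direct-form FIR with zero initial state: y[n] = sum over k of h[k] * x[n-k]."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_time_multiplexed(x, h, num_macs):
    """Same filter computed by num_macs MAC units that share the taps equally.

    Returns (outputs, clock_cycles_per_output).
    """
    taps = len(h)
    if taps % num_macs:
        raise ValueError("taps must divide evenly among the MACs")
    per_mac = taps // num_macs
    history = [0] * taps                  # history[k] holds x[n-k]
    out = []
    for sample in x:
        history = [sample] + history[:-1]
        acc = [0] * num_macs
        for cycle in range(per_mac):      # one loop pass models one clock cycle
            for m in range(num_macs):     # these run in parallel in hardware
                k = m * per_mac + cycle
                acc[m] += h[k] * history[k]
        out.append(sum(acc))              # final adder combines the MAC results
    return out, per_mac


def sample_rate_msps(clock_mhz, cycles_per_output):
    """Upper bound on output rate, ignoring pipeline fill and control overhead."""
    return clock_mhz / cycles_per_output
```

Under these assumptions, a 64-tap filter clocked at 130MHz needs 64 cycles per output with one MAC (about 2.03 Msamples/s) and 16 cycles with four MACs (8.125 Msamples/s); the figures are bounds from the model, not measurements from the article.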

Another interesting case is a multi-channel FIR function, which shows how device utilization changes when going from a single-channel FIR filter to an 8-channel FIR filter.

A single-channel distributed arithmetic FIR filter uses 29% of the logic resources and 39% of the register resources of the XC3S1000 Spartan-3 device. When the same filter is implemented for 8 channels, the channels are usually time-multiplexed to save logic, but this takes up many registers or a large amount of on-chip memory to store intermediate results.
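For readers unfamiliar with DA filters, the following Python sketch shows the general idea of distributed arithmetic as it is commonly described: the coefficient sums are precomputed into a lookup table, and the dot product is built bit by bit from the input samples without any multiplier. The 4-tap size, the 16-bit signed sample format, and the function names are assumptions for illustration; the article does not describe the internals of its filter.

```python
def build_da_lut(h):
    """Table of coefficient sums, indexed by one bit taken from each tap's sample."""
    return [sum(c for k, c in enumerate(h) if (addr >> k) & 1)
            for addr in range(1 << len(h))]


def da_dot(samples, lut, bits=16):
    """Bit-serial dot product of signed `bits`-wide samples with the coefficients in lut."""
    acc = 0
    for j in range(bits):
        # Gather bit j of every sample into a table address.
        addr = 0
        for k, s in enumerate(samples):
            addr |= ((s >> j) & 1) << k
        partial = lut[addr] << j
        # In two's complement the top bit carries negative weight.
        acc += -partial if j == bits - 1 else partial
    return acc
```

A 4-tap table has 16 entries, which matches the size of a 4-input LUT; longer filters are typically split into several such groups whose outputs are added, though that arrangement is likewise an assumption here.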

If a Spartan-3 FPGA is used, the intermediate results are stored in 16-bit shift registers (SRL-16) configured from LUTs. In this way, the 8-channel filter uses only 10% more of the available logic resources and 7% more of the available register resources; that is, building 8 channels only occupies 25% more device resources.

These significant resource savings are directly attributable to the SRL-16 mode of Spartan-3 devices: the 8-channel implementation uses an additional 1,343 LUTs in SRL-16 mode.

If this design were implemented in an FPGA that does not support SRL-16 mode, an additional 10,744 (1,343 × 8) flip-flops would be required as storage elements. This would force the use of a larger device to supply the registers and would also consume the associated combinational logic resources.
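The link between channel count and storage depth can be illustrated with a short model. In the Python sketch below, samples from all channels are interleaved into one stream and a single shared filter is used, so each tap-to-tap delay becomes as deep as the number of channels. This interleaving scheme and all function names are assumptions for illustration; the final helper simply reproduces the article's 1,343 × 8 arithmetic, reading it as 8 stored bits per SRL-16 LUT.

```python
def fir(x, h):
    """Single-channel FIR with zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def interleave(channels):
    """Merge equal-length channels into one stream: c0[0], c1[0], ..., c0[1], ..."""
    return [s for frame in zip(*channels) for s in frame]


def multichannel_fir(stream, h, num_channels):
    """One shared filter; every tap delay is num_channels samples deep."""
    return [sum(h[k] * stream[n - k * num_channels]
                for k in range(len(h)) if n - k * num_channels >= 0)
            for n in range(len(stream))]


def flip_flops_replaced(srl_luts, bits_per_lut):
    """Flip-flops needed if each SRL-16 LUT's stored bits were held in registers."""
    return srl_luts * bits_per_lut
```

De-interleaving the output of `multichannel_fir` gives the same result as filtering each channel separately, and `flip_flops_replaced(1343, 8)` evaluates to 10,744, matching the figure quoted above.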