The ubiquitous adoption of video and still images, along with the growing demand for configurable systems such as software radio, continues to drive the expansion of DSP applications, many of which require cost-effective DSP processing. Although custom implementations of DSP functions are common, a few functions recur across many applications: FIR (finite impulse response) filters, IIR (infinite impulse response) filters, FFTs (fast Fourier transforms) and mixers. All of these functions require multipliers combined with addition, subtraction and accumulation. A FIR filter (Figure 1) stores a series of n data elements, each delayed by one additional cycle. These data elements are typically called taps. Each tap is multiplied by a coefficient, and the products are summed to produce the output. Some implementations perform all n multiplications in parallel. A more general approach divides the multiplications into N stages, using an accumulator to carry the partial sum from one stage to the next. Such implementations trade speed against functional resources: each output takes N computational stages but requires only n/N multipliers. Many other design optimizations are common, depending on whether the coefficients are static or dynamic and on the coefficient values themselves.
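
The two FIR structures described above can be compared with a small software model. This is only a sketch of the arithmetic, not a hardware description: the function names `fir_parallel` and `fir_staged` and the `stages` parameter are hypothetical, and the staged version assumes that the number of coefficients n is an exact multiple of the number of stages N.

```python
from collections import deque


def fir_parallel(samples, coeffs):
    """Direct form: all n products are formed in the same step."""
    n = len(coeffs)
    delay_line = deque([0] * n, maxlen=n)  # newest sample sits at index 0
    out = []
    for x in samples:
        delay_line.appendleft(x)  # shift every stored element by one cycle
        out.append(sum(c * d for c, d in zip(coeffs, delay_line)))
    return out


def fir_staged(samples, coeffs, stages):
    """Same filter computed in `stages` passes using n/stages multipliers."""
    n = len(coeffs)
    if n % stages:
        # Assumption of this sketch, not a requirement stated in the text.
        raise ValueError("n must be a multiple of stages")
    mults = n // stages  # multipliers that are active in each stage
    delay_line = deque([0] * n, maxlen=n)
    out = []
    for x in samples:
        delay_line.appendleft(x)
        acc = 0  # accumulator carries the partial sum between stages
        for s in range(stages):
            first = s * mults
            for k in range(first, first + mults):
                acc += coeffs[k] * delay_line[k]
        out.append(acc)
    return out
```

With coefficients `[1, 2, 3, 4]` and input `[1, 2, 3, 4, 5]`, both `fir_parallel` and `fir_staged` (with `stages=2`) return `[1, 4, 10, 20, 30]`. The staged version reaches the same result with half as many multiplications per stage, at the cost of two stages per output sample.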