Designing Software Radio and Modems Using FPGA-EEWORLD


This article uses the design of a 16-QAM RF transmit data pump as an example to introduce techniques and device selection for designing digital filters with FPGAs, and explains the advantages of FPGAs over DSPs when distributed arithmetic (DA) is used.

The basic structure of all digital logic

16-QAM Modulator

Encoding and Symbol Mapping

Square Root Raised Cosine Filter

Design Tips

5 MHz carrier

Distributed Arithmetic (DA) Technology

Implementation of filters

How do software radios and modems designed with field programmable gate arrays (FPGAs) compare with those built on DSP chips? Although FPGAs can easily implement complex logic functions such as convolutional encoders, they have serious shortcomings when a large number of complex calculations must be performed. Even the fastest FPGA implementing a matrix multiplier cannot match the cost and performance of a DSP chip worth only $5. The DSP is still the chip of choice when designing with CAD tools, but with the application of distributed arithmetic (DA), FPGAs are once again popular with designers. One characteristic of FPGAs is their flexible structure: the functional modules of radio and modem data paths can be mapped easily onto independent, parallel hardware nodes. A digital signal processor can only time-share its resources, so scheduling multiple time-critical tasks requires very complex programming; an FPGA avoids this problem.

We will introduce FPGA features while designing the 16-QAM RF transmit data pump, and describe in detail how the data path function blocks convert into the logic of a Xilinx 4000 series FPGA, so that the amount of logic required can be estimated accurately. Although designs of 16-QAM data pumps that meet the same system requirements and use the same type of FPGA have been published in the open literature, the amount of logic they report seems much larger than what is actually needed. In the rush to market, a product may be designed with CAD tools alone. Relying entirely on CAD tools does not always lead to the best solution, however; that also takes hard work, experience, and creativity.

Basic structure of all digital logic

Any digital logic can be constructed with enough general logic gates such as NAND gates and NOR gates. FPGAs have plenty of logic gates. The logic gates of the Xilinx 4000 series take the form of truth tables or the more general 16-word x 1-bit lookup tables (LUTs) that can implement any Boolean function of four input variables (the address lines of the lookup table). Since the function generated is usually equivalent to the combination of multiple NAND gates, the LUT is considered the basic logic unit. The Xilinx 4000 series configurable logic block (CLB) includes two 16-word LUTs that can be combined to generate any Boolean function of five input variables. In addition, the LUT can be set up as two 16 x 1 RAMs or one 32 x 1 RAM.
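As a rough illustration of the lookup-table idea described above, the following Python sketch models a 16 x 1 LUT as a truth table addressed by four input bits. The helper names `make_lut` and `lut_read` are hypothetical; they only model the behavior and are not part of any Xilinx tool.

```python
def make_lut(func):
    """Build a 16-entry truth table for a Boolean function of four inputs."""
    table = []
    for address in range(16):
        # Split the 4-bit address into its individual input bits a0..a3.
        bits = [(address >> i) & 1 for i in range(4)]
        table.append(func(*bits) & 1)
    return table


def lut_read(table, a0, a1, a2, a3):
    """Look up the stored output; the four inputs act as address lines."""
    address = a0 | (a1 << 1) | (a2 << 2) | (a3 << 3)
    return table[address]


# Two example functions stored in the same kind of table:
parity_lut = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)      # 4-input XOR
nand_lut = make_lut(lambda a, b, c, d: 1 - (a & b & c & d))  # 4-input NAND

print(lut_read(parity_lut, 1, 0, 1, 1))
print(lut_read(nand_lut, 1, 1, 1, 1))
```

Changing the function only changes the 16 stored bits, not the structure, which is why the LUT can stand in for an arbitrary network of gates with four inputs.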

The CLBs are arranged in a two-dimensional square matrix, and the CLBs and the interconnections between them can be configured separately. The smallest XC4002 contains an 8 x 8 CLB matrix, and the largest XC4085XL contains a 48 x 48 CLB matrix. Each LUT feeds a flip-flop that can be clocked at up to 100 MHz.

16-QAM Modulator

The 16-QAM modulator includes the key functional blocks of the RF transmit data pump (see Figure 1). The 20-Mbps serial data is divided into 4-bit symbols and sent in parallel to a differential encoder and symbol mapper at a rate of 5 megasymbols per second. The mapper produces pairs of 3-bit quadrature components. These components are pulse-shaped by a pair of square root raised cosine filters, interpolated to 20 megasamples per second, and modulated onto a 5 MHz carrier. The outputs are summed and converted to analog. The key to the design is the pair of interpolating pulse-shaping filters.

Figure 1. 16-QAM Modulator

To implement this design approach effectively, it is necessary to take the encoding and mapping functional blocks and a 5MHz modulator into account when determining the total number of logic gates.

Encoding and Symbol Mapping

To estimate the logic required for the encoder and signal mapper, we can look to earlier standard modem designs. For example, the encoder in V.32 includes a differential encoder that protects against 180-degree phase ambiguity and a convolutional encoder that adds redundancy to reduce the receiver's bit error rate (BER). Both the encoder and mapper are implemented as finite state machines, with all states held in five registers (2.5 CLBs) and connection logic consisting of eight two-input XOR gates (4 CLBs) and three two-input AND gates (1.5 CLBs). In this 16-QAM transmitter, a serial-to-parallel conversion register (2 CLBs) captures four 20-Mbps serial bits to form a 4-bit symbol, so the encoder only has to run at 5 megasymbols per second, which the CLBs handle easily. Data path control requires clocking registers along the data path and takes fewer than 15 CLBs. Next, the encoded 5-bit output symbol drives the address lines of the mapper, which is simply a pair of LUTs with 3-bit outputs.

These outputs are mapped as quadrature components (I and Q) to symbol positions in a two-dimensional plane (constellation). Only 16 of the 64 intersection points (stars) represent valid symbol positions. The size of the mapper is 32 words x 3 bits x 2 or 6 CLBs. The total number of CLBs for these functional modules is 31.
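The CLB figures quoted for the encoder and mapper can be tallied in a few lines of Python. This is only a bookkeeping sketch: the dictionary labels are informal, the data path control entry uses the stated upper bound of 15 CLBs, and the mapper size assumes one CLB holds 32 words x 1 bit, as described earlier for the XC4000 CLB.

```python
# CLB estimates quoted in the text for the encoder side of the data pump.
encoder_clbs = {
    "state registers (5 flip-flops)": 2.5,
    "eight 2-input XOR gates": 4.0,
    "three 2-input AND gates": 1.5,
    "serial-to-parallel register": 2.0,
    "data path control (upper bound)": 15.0,
}

# Mapper: two LUTs (I and Q), each 32 words x 3 bits.
# One CLB is assumed to store 32 words x 1 bit.
WORDS_PER_CLB = 32
mapper_clbs = (32 * 3 * 2) / WORDS_PER_CLB

total_clbs = sum(encoder_clbs.values()) + mapper_clbs
print(mapper_clbs, total_clbs)
```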

Square Root Raised Cosine Filters

Square root raised cosine filters are a practical way to suppress intersymbol interference within the limited bandwidth of a transmission channel. The raised cosine spectrum is split between the transmitter and the receiver, each of which applies a square root raised cosine filter. The filter shape and its coefficients were developed with the aid of QEDesign 1000 software. Figure 2 shows the response of a 32-tap finite impulse response (FIR) filter computed in 12-bit fixed point. We will use the 12-bit filter model to determine the logic count (with 12-bit quantization, the QEDesign program only requires 28 symmetric coefficients, but this design uses a full 32-tap symmetric FIR filter).

Figure 2. Square Root Raised Cosine Filter

Design Tips

Square root raised cosine filters shape the spectrum on both the I and Q channels. With I and Q samples arriving at 5 megasamples per second, each filter produces 20 megasamples per second for the modulator, so it acts as a 1:4 interpolator. The corresponding computational load (exploiting the symmetric coefficients) is 2 channels x 16 symmetric taps x 20 megasamples per second = 640 million multiply-accumulate (MAC) operations per second. This is significantly faster than most fixed-point DSP chips can run. FPGAs are therefore an attractive option, but it is still necessary to select a filter form that maps most efficiently to a CLB-based design.
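The load estimate above is simple arithmetic, reproduced here as a short Python check. The variable names are illustrative only; the figures are the ones given in the text.

```python
input_rate = 5e6     # I/Q samples per second entering each filter
output_rate = 20e6   # samples per second delivered to the modulator
channels = 2         # I and Q
taps = 32            # full FIR length
unique_taps = taps // 2  # symmetry halves the number of distinct products

interpolation = output_rate / input_rate
mac_per_second = channels * unique_taps * output_rate

print(interpolation)          # interpolation ratio of each filter
print(mac_per_second / 1e6)   # load in millions of MACs per second
```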

There are many configurations or forms of logic circuits that can implement FIR filters. The most important are the direct form (i.e., a commonly used software model), the transposed form with variables (which has been implemented by dedicated filter chips), and the polyphase filter (for multirate applications). However, none of these forms can use the method of symmetric coefficients to reduce the number of multiplication calculations. One trick for designing multirate filters is to plot the signal flow trajectory on the sample point-coefficient plane.

The vertical axis represents the sample points and the horizontal axis represents the coefficients. The data trajectory is drawn as the filter response rotated by 90 degrees. Because the coefficients are symmetrical, only half of them need to be listed. With an interpolation factor of K, K-1 zeros are inserted between input sample points, which gives a V-shaped trajectory for the 32-tap FIR. Although the input samples are spaced 200 ns apart, a new trajectory point is needed every 50 ns.

Two computational models can be derived from this figure. The first is a variation of the transposed form, in which the products of the nonzero input sample values and all 32 coefficients are added in the partial sum register. After the 32 products are added and the full filter response is output, the multiply-accumulate circuit can be used to calculate a new trajectory. Here, 32 MAC operations are performed every 200ns. The second model is a delay-and-add, which is a direct form of the FIR filter. As can be seen in the filter trajectory, eight stored samples are required to calculate a filter response. By calculating five consecutive filter responses, we can observe the model given in Table 1.

Four consecutive 20 MHz responses can be calculated from the same group of eight input samples. Only two sets of filter coefficients are used: for the third and fourth responses (yd and ye) of each sample group, the coefficients are applied in the opposite order. Can these response equations be mapped into an efficient FPGA circuit? Of course they can! The key is distributed arithmetic, which not all current design tools support. Before implementing the response equations, some simplifications can be made.
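The structure described here can be checked numerically. The Python sketch below is illustrative only: the coefficient values are placeholders (any symmetric 32-tap set would do), the coefficients are indexed from 0, and the function names are hypothetical. It shows that a 1:4 zero-stuffing interpolator equals four 8-tap sub-filters run on the same input samples, and that symmetry leaves only two distinct coefficient sets, the other two being the same sets in reverse order.

```python
K = 4      # interpolation factor
TAPS = 32  # filter length

# Placeholder symmetric coefficients: h[k] == h[TAPS - 1 - k].
half = list(range(1, TAPS // 2 + 1))
h = half + half[::-1]


def fir(x, coeffs):
    """Direct-form FIR, y[n] = sum_k coeffs[k] * x[n - k], truncated to len(x)."""
    return [sum(coeffs[k] * x[n - k] for k in range(len(coeffs)) if n - k >= 0)
            for n in range(len(x))]


def interpolate_zero_stuff(x):
    """Insert K-1 zeros after every sample, then run the full 32-tap filter."""
    up = []
    for sample in x:
        up.extend([sample] + [0] * (K - 1))
    return fir(up, h)


def interpolate_polyphase(x):
    """Compute the same outputs with four 8-tap sub-filters at the input rate."""
    y = [0] * (len(x) * K)
    for p in range(K):
        y[p::K] = fir(x, h[p::K])  # phase p uses h[p], h[p+4], ..., h[p+28]
    return y


x = [3, -1, 2, 0, -3, 1, 1, -2, 3, 2]
same_output = interpolate_zero_stuff(x) == interpolate_polyphase(x)
print(same_output)
print(h[3::K] == h[0::K][::-1], h[2::K] == h[1::K][::-1])
```

Each output phase needs only eight stored samples and eight coefficients, which is what makes an eight-input lookup-table implementation possible.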

5 MHz Carrier

The simple equation for carrier modulation is: Y(k) = yI(k) cos(ωc t) + yQ(k) sin(ωc t), where ωc = 2π(5 MHz) is the angular carrier frequency, and I and Q denote the in-phase and quadrature components of the symbol.

This equation is evaluated every 50 ns. There are only four carrier values in one symbol period (200 ns). These values can be conveniently defined as: cos(ωc t) = 1, 0, -1, 0 and sin(ωc t) = 0, 1, 0, -1.

As a result, the modulated output requires no multiplications or additions, and the I and Q filter responses do not both have to be calculated every 50 ns. An I response is calculated in one 50 ns slot, a Q response in the next, then an I response, then a Q response, and so on.
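A minimal Python sketch of this simplification follows. It assumes the carrier is sampled four times per period so that the cosine and sine take the values 1, 0, -1, 0 and 0, 1, 0, -1; the function names and the test data are hypothetical. The first function evaluates the modulation equation directly, the second simply selects I, Q, -I, -Q in turn.

```python
COS = [1, 0, -1, 0]  # cosine carrier at four samples per period
SIN = [0, 1, 0, -1]  # sine carrier at four samples per period


def modulate(y_i, y_q):
    """Evaluate Y(k) = yI(k) * cos + yQ(k) * sin directly."""
    return [i * COS[k % 4] + q * SIN[k % 4]
            for k, (i, q) in enumerate(zip(y_i, y_q))]


def modulate_by_selection(y_i, y_q):
    """Same result without multiplies or adds: pick I, Q, -I, -Q in turn."""
    out = []
    for k in range(min(len(y_i), len(y_q))):
        phase = k % 4
        if phase == 0:
            out.append(y_i[k])
        elif phase == 1:
            out.append(y_q[k])
        elif phase == 2:
            out.append(-y_i[k])
        else:
            out.append(-y_q[k])
    return out


y_i = [1, 2, 3, 4, 5, 6, 7, 8]
y_q = [10, 20, 30, 40, 50, 60, 70, 80]
print(modulate_by_selection(y_i, y_q))
```

Only one of the two filter responses is used in any 50 ns slot, so the unused response never has to be computed.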

Distributed Arithmetic (DA) Technology

DA is a computing technique specifically for sum-of-products equations in which one factor of each product is a constant. DA designs range from highly gate-efficient bit-serial arithmetic to high-performance bit-parallel operation; it is a classic serial/parallel trade-off. DA can be applied to many important linear, time-invariant digital signal processing algorithms, such as filters (FIR and IIR), transforms (the fast Fourier transform [FFT]), and matrix-vector products such as the 8 x 8 discrete cosine transform (DCT).

DA technology has been around for more than 20 years and has proven to be unsuitable for the fixed-point instruction set structure of programmable DSPs. However, DA is very suitable for FPGA implementation, especially LUT logic blocks such as Xilinx CLB. The design of DA FIR filters using Xilinx XC3000 series FPGAs was proposed as early as 1992.

There are no separate multipliers in a DA circuit; the multiplication is performed by LUTs. A DA lookup table (DALUT) stores the sums of all combinations of the partial product terms in an equation and is addressed by one bit from each of the input variables. The serial DA circuit has a single DALUT that is addressed one bit position at a time, starting from the least significant bit. Each DALUT output, a sum of partial products, is added into an accumulator whose contents are shifted down one binary place per step, much like the shift-and-add subroutines of early computers. This gives a true double-precision result.
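The bit-serial scheme can be sketched in Python as follows. This is a behavioral model under stated assumptions, not the circuit itself: the function names and the coefficient values are hypothetical, inputs are two's complement integers, and instead of shifting the accumulator down each step the model weights each DALUT output by 2^b, which yields the same sum.

```python
def build_dalut(coeffs):
    """Entry 'addr' holds the sum of the coefficients whose address bit is 1."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def da_sum_of_products(samples, dalut, bits):
    """Compute sum(coeffs[k] * samples[k]) one bit position at a time."""
    acc = 0
    for b in range(bits):  # least significant bit first
        # Bit b of every sample forms the DALUT address.
        addr = 0
        for k, s in enumerate(samples):
            addr |= ((s >> b) & 1) << k
        partial = dalut[addr]
        if b == bits - 1:
            acc -= partial << b  # the sign bit carries negative weight
        else:
            acc += partial << b
    return acc


coeffs = [3, -5, 7, 2]    # hypothetical constant coefficients
samples = [3, -4, 1, -1]  # 3-bit two's complement inputs (-4..3)
dalut = build_dalut(coeffs)
result = da_sum_of_products(samples, dalut, bits=3)
print(result, sum(c * s for c, s in zip(coeffs, samples)))
```

The loop runs once per input bit rather than once per coefficient, which is why a 3-bit input such as the mapper output needs only three table lookups per filter response.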

Filter Implementation

The data path of the square root raised cosine filter is built from standard functional blocks that can be converted to CLBs. The 3-bit I and Q signals output by the mapper are loaded into a parallel-to-serial shift register (PSR) every 200 ns. A chain of RAM shift registers (SR) stores the seven previous symbols. The first three filter responses, Yb, Yc and Yd, are calculated with the data recirculating in the shift registers. The PSR requires a feedback path for this, whereas the RAM SR recirculates through block addressing in read-only mode. There are six blocks here, the first three shifts are for Yb, the next three for Yc, and the last three for Yd. When calculating Ye, the data is shifted down the SR chain. This block addressing pattern is repeated as the data is transferred (written) by the previous stage. All twelve shifts and the corresponding PSR loading, RAM SR addressing and write control are derived from the 60 MHz system clock (twelve shifts in each 200 ns symbol period).

Since the same coefficient group is used for two sampling cycles, one for I channel data and the other for Q channel data, a single set of DALUTs is used, with 2:1 multiplexers directing the serial data streams to the corresponding address ports. These ports can represent the DALUT structure shown in Figure 1. A logic high at the h3 port selects all addresses whose stored partial sums contain h3. Similarly, a logic high at the h7 port selects all addresses that contain h7, and a logic high at both the h3 and h7 ports selects all addresses that contain both h3 and h7. The remaining six coefficients follow this pattern. The eight coefficients therefore require 2^8, or 256, words of storage. With 12-bit coefficients, (256 / 32 words per CLB) x 12 = 96 CLBs are required. Another trick is to use two DALUTs, each covering four coefficients, and add their outputs. This reduces the number of CLBs to (2 x 2^4)/32 x 12 + 13/2 (parallel adder) = 18.5 CLBs.
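The saving from splitting the table can be checked with a short Python sketch. The CLB formula follows the expressions quoted above (32 words x 1 bit per CLB, 12-bit words, and a 13-bit adder counted as 13/2 CLBs); the function names and coefficient values are hypothetical. The second half confirms that two 16-word tables plus an adder return the same values as one 256-word table.

```python
WORDS_PER_CLB = 32  # one CLB stores 32 words x 1 bit
WORD_BITS = 12      # width of each stored sum, as in the text


def dalut_clbs(num_inputs):
    """CLBs needed for one DALUT addressed by num_inputs serial data bits."""
    return (2 ** num_inputs) / WORDS_PER_CLB * WORD_BITS


single_table = dalut_clbs(8)             # one 256-word table
adder = 13 / 2                           # 13-bit parallel adder, as quoted
split_tables = 2 * dalut_clbs(4) + adder
print(single_table, split_tables)


def build_dalut(coeffs):
    """Entry 'addr' holds the sum of the coefficients whose address bit is 1."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


coeffs = [1, -2, 3, -4, 5, -6, 7, -8]  # placeholder values
full = build_dalut(coeffs)
low, high = build_dalut(coeffs[:4]), build_dalut(coeffs[4:])
split_matches = all(full[a] == low[a & 0xF] + high[a >> 4] for a in range(256))
print(split_matches)
```

Table size grows exponentially with the number of address inputs but the adder grows only linearly with word width, so partitioning the inputs is almost always worthwhile.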

The same simplification applies to the second set of filter coefficients, starting with h1. The parallel adder can be time-shared through a 2:1 multiplexer. The adder output is extended to 13 bits and fed into the aforementioned scaling accumulator, which performs the shift-and-add operations. When the sign bits of the input variables address the DALUT, a subtraction is performed. This can be done by placing XOR gates at the DALUT output and feeding a carry into the first stage of the accumulator in the standard way. For the negative responses Yd and Ye, the sign-bit inversion can be skipped and all other DALUT outputs complemented instead.

For I and Q data in fractional two's complement format, the filter coefficients are adjusted to prevent overflow in the final output. The ten most significant bits can be loaded into the D/A conversion driver register.

The total number of CLBs for the filter data path is 71.5, and the FPGA output ports have flip-flops that can serve as the driver register for the D/A converter. Including the encoder (31 CLBs) and the timing and control functions (estimated at fewer than 50 CLBs), the total is about 159 CLBs, which fits neatly into one of the smaller devices (slightly larger than the smallest) in the Xilinx XC4000 series, the XC4005 (196 CLBs). With a more advanced FPGA such as the Xilinx Virtex, the CLB count can be reduced and performance improved.

The entire design sustains its performance with a 60 MHz system clock. The data flow is uniform and unidirectional. Pipeline registers can be inserted (without increasing the CLB count) to shorten combinatorial paths. The fourteen-stage carry chain through the scaling accumulator is the longest combinatorial path, but the built-in fast carry logic ensures sufficient speed margin.
