Implementing a 4G Wireless Spherical Detector in an FPGA


System Generator is the key to building a quasi-maximum likelihood detector (4x4, 64-QAM) for spatially multiplexed MIMO-OFDM systems.

WiMAX is to broadband Internet access what the cell phone is to voice communications. It can replace DSL and cable services, providing Internet access anytime, anywhere. All you need to do is turn on your computer and connect to the nearest WiMAX antenna, and you can browse the web.

One of the biggest challenges for broadband Internet access is mobility, which the latest WiMAX standard aims to address. IEEE 802.16e-2005 introduces the use of multiple antennas at both the transmitter and the receiver, known as multiple-input multiple-output (MIMO), which is a key feature of mobile WiMAX.

Spatial division multiplexing (SDM) MIMO processing can significantly improve spectral efficiency and thus greatly increase the capacity of wireless communication systems. SDM MIMO systems have recently attracted widespread attention as a way to improve both system capacity and link reliability.

The optimal hard-decision detection method for MIMO wireless systems is the maximum likelihood (ML) detector. ML detection is popular because of its excellent bit error rate (BER) performance. However, the complexity of a straightforward implementation grows exponentially with the number of antennas and the modulation order, which limits ASIC and FPGA implementations to low-order modulation schemes with only a few antennas.
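To make the exponential growth concrete, the following minimal sketch counts the symbol vectors that an exhaustive ML search would have to evaluate. The function name `ml_candidates` is hypothetical and used only for illustration; the 4x4, 64-QAM case is the configuration targeted later in this article.

```python
def ml_candidates(num_streams: int, constellation_size: int) -> int:
    """Number of symbol vectors an exhaustive ML search must evaluate."""
    return constellation_size ** num_streams


# Compare a few configurations, ending with the 4x4, 64-QAM case.
for num_streams, qam_order in [(2, 4), (2, 16), (4, 16), (4, 64)]:
    count = ml_candidates(num_streams, qam_order)
    print(f"{num_streams} streams, {qam_order}-QAM: {count:,} candidates")
```

Each added stream multiplies the search space by the constellation size, which is why a brute-force search is impractical at 64-QAM with four streams.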

In MIMO detection, sphere detection is the leading approach for maintaining BER performance comparable to optimal ML detection while significantly reducing computational complexity, in both SDM and space division multiple access systems. There are many ways to implement a sphere detector, and each has several algorithmic variants, so designers can find the best balance between performance indicators such as wireless channel throughput, BER, and implementation complexity.

While the algorithm (for example, K-best or depth-first search) and the hardware architecture clearly have a large impact on the final BER performance of a MIMO detector, so does the channel matrix preprocessing that is typically performed before sphere detection. This preprocessing can be simple, such as prioritizing the spatially multiplexed data streams based on variance computations on the channel matrix, or complex, such as using matrix factorization methods to determine a better (in terms of BER) processing order for the data streams.

Signum Concepts, a San Diego-based communications systems development company, has been working with Xilinx and Rice University to design an FPGA-based MIMO detector for spatial division multiplexing in 802.16e broadband wireless systems. The detector uses a channel matrix preprocessor that implements a successive interference cancellation technique similar to that used in the Bell Labs Layered Space-Time (BLAST) architecture, ultimately achieving near-maximum-likelihood performance.

System Considerations

Ideally, the detection process requires computing the ML metric for all possible combinations of symbol vectors. The sphere detector aims to reduce this computational complexity by using simple arithmetic operations while maintaining the numerical integrity of the final result. The first step of our approach is to decompose the complex-valued channel matrix into a representation with only real numbers. This operation increases the matrix dimension but simplifies the computations on the matrix elements. The second step is to reduce the number of candidate symbols that the detection scheme analyzes and processes. Here, QR decomposition of the channel matrix is a crucial step.
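The real-valued decomposition can be illustrated with a short sketch. It assumes the commonly used block form [[Re(H), -Im(H)], [Im(H), Re(H)]], which the article does not spell out; the helper names and the example values are hypothetical.

```python
def real_decompose_matrix(H):
    """Map an N x M complex matrix to the 2N x 2M real matrix [[Re, -Im], [Im, Re]]."""
    top = [[h.real for h in row] + [-h.imag for h in row] for row in H]
    bottom = [[h.imag for h in row] + [h.real for h in row] for row in H]
    return top + bottom


def real_decompose_vector(v):
    """Stack the real parts on top of the imaginary parts."""
    return [x.real for x in v] + [x.imag for x in v]


def matvec(A, x):
    """Plain matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


# Small 2x2 complex example (values chosen arbitrarily).
H = [[1 + 2j, 0.5 - 1j], [-0.3 + 0.7j, 2 - 0.5j]]
s = [3 - 1j, -5 + 7j]

y_complex = matvec(H, s)
y_real = matvec(real_decompose_matrix(H), real_decompose_vector(s))

# Both routes describe the same received vector.
print(real_decompose_vector(y_complex))
print(y_real)
```

With this mapping, a 4x4 complex system becomes an 8x8 real system, and each 64-QAM symbol splits into two 8-level real symbols.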

Figure 1 shows the mathematical transformations that lead to the final expression for the partial Euclidean distance (PED) metric, which is the basis of the sphere detection process. R is the triangular matrix used to process the candidate symbols iteratively, starting with the matrix element r_{M,M}, where M is the dimension of the real-valued channel matrix. The solution defines a tree traversal through M iterations, where each level i of the tree corresponds to the processed symbols of the i-th antenna.

Figure 1. Partial Euclidean distance metric equations for sphere detection in MIMO systems
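Since the figure itself is not reproduced here, the following is a reconstruction of the standard form of these equations, not necessarily the exact notation of the original figure. It assumes the real-valued model with H = QR, R upper triangular of dimension M, and the partial symbol vector s^{(i)} = [s_i, ..., s_M]^T.

```latex
\begin{align}
\hat{\mathbf{s}}_{\mathrm{ML}}
  &= \arg\min_{\mathbf{s}} \lVert \mathbf{y} - \mathbf{H}\mathbf{s} \rVert^2
   = \arg\min_{\mathbf{s}} \lVert \hat{\mathbf{y}} - \mathbf{R}\mathbf{s} \rVert^2,
  \qquad \mathbf{H} = \mathbf{Q}\mathbf{R},\quad \hat{\mathbf{y}} = \mathbf{Q}^{T}\mathbf{y} \\
\lVert \hat{\mathbf{y}} - \mathbf{R}\mathbf{s} \rVert^2
  &= \sum_{i=1}^{M} \Bigl| \hat{y}_i - \sum_{j=i}^{M} r_{i,j}\, s_j \Bigr|^2 \\
T_i\bigl(\mathbf{s}^{(i)}\bigr)
  &= T_{i+1}\bigl(\mathbf{s}^{(i+1)}\bigr)
   + \Bigl| \hat{y}_i - \sum_{j=i}^{M} r_{i,j}\, s_j \Bigr|^2,
  \qquad T_{M+1} = 0
\end{align}
```

Because R is triangular, the term for row M depends only on s_M and r_{M,M}, so the recursion starts at the last row and accumulates one non-negative term per tree level.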

The order in which the sphere detector processes antennas has a significant impact on the BER performance. Therefore, our design uses a channel reordering technique similar to the V-BLAST technique before performing sphere detection.

There are several options for implementing the tree traversal. In our implementation, we use breadth-first search because its feed-forward structure is hardware-friendly. At each level, we keep only the K surviving nodes with the smallest distances for expansion.
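A breadth-first K-best search of this kind can be sketched in a few lines. This is a floating-point reference model under stated assumptions (a real-valued upper-triangular R, a 4-level alphabet, and made-up example values), not the fixed-point pipeline described in this article; the name `k_best_detect` is hypothetical.

```python
def k_best_detect(R, y_hat, alphabet, K):
    """Breadth-first K-best search for min ||y_hat - R s||^2 with R upper triangular.

    Returns (distance, symbols) for the best surviving path.
    """
    M = len(R)
    survivors = [(0.0, [])]  # (PED so far, symbols decided so far: s_{i+1}..s_{M-1})
    for i in range(M - 1, -1, -1):  # start at the last row, where only r[M-1][M-1] matters
        candidates = []
        for ped, tail in survivors:
            # Contribution of the symbols that were already decided on this path.
            interference = sum(R[i][i + 1 + k] * s for k, s in enumerate(tail))
            for s_i in alphabet:
                err = y_hat[i] - interference - R[i][i] * s_i
                candidates.append((ped + err * err, [s_i] + tail))
        candidates.sort(key=lambda c: c[0])
        survivors = candidates[:K]  # keep only the K smallest distances
    return survivors[0]


PAM4 = [-3, -1, 1, 3]
R = [
    [2.0, 0.5, -0.3, 0.1],
    [0.0, 1.5, 0.4, -0.2],
    [0.0, 0.0, 1.2, 0.3],
    [0.0, 0.0, 0.0, 1.0],
]
s_true = [1, -3, 3, -1]
noise = [0.05, -0.1, 0.08, -0.02]
y_hat = [sum(R[i][j] * s_true[j] for j in range(4)) + noise[i] for i in range(4)]

ped, s_hat = k_best_detect(R, y_hat, PAM4, K=4)
print(s_hat, ped)
```

A small K keeps the work per level fixed, at the cost of occasionally discarding the path that an exhaustive search would have selected.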

The channel-ordering method computes the row norms of the pseudo-inverse of the channel matrix over multiple iterations and uses them to determine the column detection order of the channel matrix. Depending on the iteration, the method selects the row with either the largest or the smallest norm. The row of the pseudo-inverse with the smallest Euclidean norm corresponds to the strongest antenna stream, while the row with the largest Euclidean norm corresponds to the weakest. This novel method processes the weakest data stream first, and the subsequent iterations process the remaining data streams from highest to lowest power.

FPGA Hardware Implementation

To implement the system, we used Xilinx Virtex®-5 FPGA technology. The design flow used Xilinx System Generator for design capture, simulation, and verification. To support a range of antenna/user configurations and modulation orders, the detector was designed for the most demanding case: 4x4, 64-QAM.

Our model assumes that the receiver has good knowledge of the channel matrix, which can be achieved by traditional channel estimation methods. After channel reordering and QR decomposition, we start using the sphere detector. In preparation for using soft-input, soft-output channel decoders (such as turbo decoders), we generate soft outputs by computing the log-likelihood ratios (LLRs) of the detected bits.

The main architectural elements of the system include data subcarrier processing and submodule management functions, which must process the required number of subcarriers in real time while minimizing processing latency. A channel matrix is estimated for each data subcarrier, which limits the processing time available for each channel matrix. For the selected FPGA, the target clock frequency is 225 MHz, the communication bandwidth is 5 MHz (corresponding to 360 data subcarriers in a WiMAX system), and 64 processing clock cycles are available for each channel matrix interval.
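The 64-cycle budget can be checked with a short calculation. The clock frequency and subcarrier count come from the text; the OFDMA numerology (512-point FFT, sampling factor 28/25, cyclic prefix 1/8) is an assumption based on typical 5 MHz mobile WiMAX profiles and is not stated in this article.

```python
CLOCK_HZ = 225e6        # target FPGA clock frequency (from the text)
DATA_SUBCARRIERS = 360  # data subcarriers at 5 MHz (from the text)

# Assumed numerology for a 5 MHz mobile WiMAX channel (not from the article).
BANDWIDTH_HZ = 5e6
FFT_SIZE = 512
SAMPLING_FACTOR = 28 / 25
CP_RATIO = 1 / 8

sample_rate = BANDWIDTH_HZ * SAMPLING_FACTOR   # about 5.6 MHz
useful_time = FFT_SIZE / sample_rate           # about 91.4 microseconds
symbol_time = useful_time * (1 + CP_RATIO)     # about 102.9 microseconds

cycles_per_symbol = CLOCK_HZ * symbol_time
cycles_per_matrix = int(cycles_per_symbol // DATA_SUBCARRIERS)

print(f"OFDM symbol time: {symbol_time * 1e6:.1f} us")
print(f"Clock cycles per channel matrix: {cycles_per_matrix}")
```

Under these assumptions, one OFDM symbol spans roughly 23,000 clock cycles, which leaves 64 whole cycles for each of the 360 channel matrices.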

We exploit extensive pipelining and time-division multiplexing (TDM) of the hardware functional units to meet the real-time requirements of WiMAX OFDM symbols.

In addition to high data rates, controlling submodule latency is an important consideration guiding the architectural design. We address latency by time-division multiplexing consecutive channel matrices. This approach allows more processing time between elements of the same channel matrix while maintaining high data throughput. The number of channels in a TDM group varies from submodule to submodule: 5 channels are multiplexed in the channel matrix inversion process, while 15 channels are multiplexed in the real-valued QR decomposition module. Figure 2 is a high-level flow chart of the system.

Figure 2. High-level flow chart of a MIMO 802.16e broadband wireless receiver.

Channel matrix preprocessing

The channel matrix preprocessor determines the optimal order in which to detect each layer of the spatially multiplexed composite signal. The preprocessor computes the row norms of the pseudo-inverse of the channel matrix and, based on these norms, selects the next transmission stream to be processed. The rows with the smallest norms in the pseudo-inverse correspond to the strongest transmission streams (with the smallest noise amplification after detection), while the rows with the largest norms correspond to the poorest-quality layers (with the largest noise amplification after detection). Our implementation detects the weakest layer first and then proceeds layer by layer in order from lowest to highest noise amplification. At each step of the ordering process, the corresponding column of the channel matrix is zeroed and the reduced matrix enters the next stage of the antenna ordering pipeline.
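The ordering rule can be sketched as follows. This is an illustrative floating-point model only: it uses real-valued matrices for brevity, removes the selected column instead of zeroing it (an assumption that keeps the reduced matrix invertible), and uses hypothetical helper names and example values.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def inverse(A):
    """Gauss-Jordan inverse with partial pivoting (small, well-conditioned matrices)."""
    n = len(A)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def pseudo_inverse(H):
    """Left pseudo-inverse (H^T H)^-1 H^T for a full-column-rank real matrix."""
    Ht = transpose(H)
    return matmul(inverse(matmul(Ht, H)), Ht)


def detection_order(H):
    """Weakest stream first, then the remaining streams from strongest to weakest."""
    remaining = list(range(len(H[0])))
    order = []
    while remaining:
        sub = [[row[j] for j in remaining] for row in H]
        norms = [sum(v * v for v in row) for row in pseudo_inverse(sub)]
        # First pass picks the largest norm (weakest stream); later passes pick the smallest.
        pick = max if not order else min
        k = pick(range(len(norms)), key=norms.__getitem__)
        order.append(remaining.pop(k))
    return order


H_example = [
    [4.0, 0.2, 0.1, 0.0],
    [0.1, 1.0, 0.0, 0.2],
    [0.0, 0.1, 3.0, 0.1],
    [0.2, 0.0, 0.1, 2.0],
]
print(detection_order(H_example))
```

The pseudo-inverse is recomputed after each selection, which is why this step dominates the preprocessing cost.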

Among the preprocessing steps, computing the pseudo-inverse is the most computationally demanding. The core of this process is matrix inversion, usually achieved by QR decomposition (QRD) with Givens rotations. Commonly used angle estimation and plane rotation algorithms (such as CORDIC) introduce severe latency, which is unacceptable for our system. Therefore, our goal was to find an alternative solution for vector rotation and phase estimation using the embedded DSP resources of the FPGA (such as the DSP48E in Virtex-5 devices).
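For reference, QR decomposition by Givens rotations can be modeled in software as below. This is a generic textbook sketch on a real-valued matrix that uses floating-point square roots and divisions; it is not the multiplier-based hardware scheme described in this article, and the function name and example matrix are hypothetical.

```python
import math


def givens_qr(A):
    """QR decomposition of a real matrix via Givens rotations. Returns (Q, R) with A = Q R."""
    m, n = len(A), len(A[0])
    R = [[float(v) for v in row] for row in A]
    Qt = [[float(i == j) for j in range(m)] for i in range(m)]  # accumulates Q^T
    for col in range(n):
        for row in range(m - 1, col, -1):
            a, b = R[col][col], R[row][col]
            if b == 0.0:
                continue  # entry is already zero, no rotation needed
            r = math.hypot(a, b)
            c, s = a / r, b / r
            # Rotate the pivot row and the current row in both R and Q^T.
            for mat in (R, Qt):
                top, bot = mat[col], mat[row]
                mat[col] = [c * t + s * u for t, u in zip(top, bot)]
                mat[row] = [-s * t + c * u for t, u in zip(top, bot)]
    Q = [list(column) for column in zip(*Qt)]
    return Q, R


A_example = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]]
Q, R = givens_qr(A_example)
for r_row in R:
    print([round(v, 4) for v in r_row])
```

Each rotation zeroes one sub-diagonal entry, which is the operation that a systolic array distributes across its boundary and internal cells.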

The systolic array structure of the QRD consists of two types of processing elements: diagonal (boundary) elements and off-diagonal (internal) elements. The boundary elements perform the vectoring function that generates the rotation angles used by the internal elements of the array. To obtain the desired rotation angle, the value in the off-diagonal element is multiplied by the complex conjugate of the value in the diagonal element and then divided by the reciprocal of the complex number. The division is actually implemented as a multiplication by a reciprocal, which is computed from a polynomial approximation over a defined interval in which the function is close to linear. Figure 3 shows the signal flow diagram of this complex rotation in the diagonal systolic element using this approximation.

Figure 3. Structure of the diagonal systolic cell

The data sent to the off-diagonal units are the in-phase and quadrature parts of the rotation vector, each divided using the corresponding approximation. Pipelining the diagonal and off-diagonal units gives high data throughput, and time-division multiplexing the hardware across 5 channels controls the latency introduced by the approximation module and the complex multipliers.
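The idea of replacing a division with a multiplication by an approximated reciprocal can be sketched as follows. The article does not give its polynomial or interval, so the specific choices here are assumptions: the input is normalized to [0.5, 1), a well-known first-order estimate (48/17 - 32/17 * m) is used, and one Newton-Raphson refinement step is added. The function names are hypothetical.

```python
import math


def approx_reciprocal(x: float, newton_steps: int = 1) -> float:
    """Approximate 1/x for x > 0 using only multiplies and adds after normalization."""
    if x <= 0:
        raise ValueError("x must be positive")
    m, e = math.frexp(x)             # x = m * 2**e with 0.5 <= m < 1
    r = 48 / 17 - (32 / 17) * m      # first-order polynomial estimate of 1/m
    for _ in range(newton_steps):
        r = r * (2 - m * r)          # Newton-Raphson refinement (multiplies only)
    return math.ldexp(r, -e)         # undo the normalization


def divide(a: float, x: float) -> float:
    """Division implemented as a multiplication by the approximate reciprocal."""
    return a * approx_reciprocal(x)


for value in (0.6, 1.0, 3.7, 250.0):
    print(value, approx_reciprocal(value), 1 / value)
```

On the normalized interval the reciprocal is smooth and close to linear, so a low-order polynomial followed by multiplications maps naturally onto multiply-accumulate resources.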

For a 4x4 matrix, we used 1 diagonal unit and 7 off-diagonal units. The processing time to decompose a single matrix is 4x4=16 data cycles, and the design delivers data at a rate of one sample every three clock cycles, so the total time to decompose a single matrix is 3x4x4=48 clock cycles (less than the available 64 clock cycles). We used back substitution on the decomposed matrix and further reordered it in the same TDM manner.

Spherical detector

The sphere detector uses partial Euclidean distance (PED) units to compute the distance norms. We use three types of PED unit, depending on the tree level. The root-node PED module calculates all possible PEDs at the first level. The second-level PED module calculates the 8 possible PEDs for each of the 8 surviving paths from the previous level, so 64 PEDs are generated for the next level of the tree. The third type of PED module is used at the remaining tree levels and calculates the closest-node PED for each PED calculated at the previous level.

Because the pipelined architecture of the sphere detector (SD) can process data in every clock cycle, only one PED module is required per tree level. For a 4x4 64-QAM system, the total number of PED units is therefore 8, equal to the number of tree levels.

The SD can use two types of decoding: hard decoding and soft decoding. Hard decoding selects the sequence with the minimum distance metric through all levels of the tree; soft decoding represents each output bit by a log-likelihood ratio (LLR). LLRs are generally used as a priori inputs to a channel decoder, such as a turbo decoder.
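As an illustration of the difference, the sketch below derives both a hard decision and per-bit LLRs from a list of candidate paths. It assumes the common max-log approximation, a sign convention in which positive values favor bit 0, and clipping when no candidate supports one of the hypotheses. None of these details come from this article, and the names and example values are hypothetical.

```python
def hard_decision(candidates):
    """Pick the bit pattern of the candidate with the smallest distance."""
    return min(candidates, key=lambda c: c[0])[1]


def max_log_llrs(candidates, noise_scale=1.0, clip=8.0):
    """Max-log LLR per bit from (distance, bits) candidates.

    Positive values favor bit 0 (one common convention). The right value of
    noise_scale depends on the noise model and is left as a parameter here.
    """
    n_bits = len(candidates[0][1])
    llrs = []
    for b in range(n_bits):
        d0 = min((d for d, bits in candidates if bits[b] == 0), default=None)
        d1 = min((d for d, bits in candidates if bits[b] == 1), default=None)
        if d0 is None:
            llr = -clip  # no candidate with bit 0 survived
        elif d1 is None:
            llr = clip   # no candidate with bit 1 survived
        else:
            llr = max(-clip, min(clip, (d1 - d0) / noise_scale))
        llrs.append(llr)
    return llrs


# Three surviving paths with made-up distances and two-bit labels.
example_candidates = [(0.2, (0, 1)), (1.0, (1, 1)), (1.5, (0, 0))]
print(hard_decision(example_candidates))
print(max_log_llrs(example_candidates))
```

The hard decision keeps only the best path, while the LLRs also convey how much worse the best competing hypothesis is for each bit.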

FPGA resource usage

The implementation and simulation include the detection process shown in Figure 2, but do not include the soft output generation module. The target device is the Virtex-5 XC5VFX130T-2FF1738 FPGA. The design clock frequency is 225 MHz and the available data rate is 83.965 Mb/s.

Table 1 shows the resource usage of each major functional unit in the design. Utilization (%) is the share of the total resources of the XC5VFX130T device that each unit occupies.

Function	Number of slices	LUTs/FFs	DSP48	Block RAM
Channel preprocessing	9,999 (48%)	20,339/29,954 (24%)	159 (49%)	105 (17%)
RVD QRD	1,715 (8%)	4,418/5,556 (5%)	30 (9%)	27 (4%)
Spherical detector	2,445 (11%)	3,113/6,525 (3%)	48 (15%)	12 (2%)

Table 1. Resource usage by subsystem

Figure 4. BER curves for 4x4 64-QAM: floating-point MATLAB simulation (hard decision) and System Generator design (hard decision), compared with the maximum likelihood curve

System Generator and Model-Based Design

We implemented the complete hard-decision chain using the Xilinx System Generator for DSP design flow. Design verification used both the simulation semantics of the MATLAB®/Simulink® environment and the co-simulation capabilities of System Generator. The in-phase and quadrature parts of the channel matrix coefficients were drawn from normal distributions and delivered to the System Generator modeling environment by MATLAB. We also performed the bit error rate calculations within this simulation framework. Figure 4 compares the BER curve of our fixed-point hard-decision design with that of the floating-point hard-decision design and the optimal ML reference curve. We developed a hardware demonstration of this design using Ethernet-based hardware co-simulation on the Xilinx ML510 development platform. The channel matrix coefficients were supplied to the sphere detector using the Xilinx AWGN IP core. We measured the BER by embedding the design in a self-synchronizing BER tester, which sends inputs to the detector and counts bit errors.

This article provides a brief introduction to sphere detectors for communication systems that use spatially multiplexed MIMO, and describes the architecture of the sphere detector and the channel matrix preprocessor. There are many ways to implement the preprocessing; although our method is somewhat more computationally complex, it yields BER performance close to maximum likelihood. Although our discussion centers on WiMAX, designers can apply many of these methods to 3G LTE (Long Term Evolution) wireless systems.

The next step for our group is to improve the BER performance by using turbo convolutional codes and soft output generation modules to perform iterative soft detection.
