FPGA/DSP/GPU: Accelerating Radar Signal Processing
There are many ways to speed up processing in the underlying hardware. These methods appear in FPGA, digital signal processor (DSP), and microprocessor designs, and they can be reused in larger application environments where many processors work together to carry out radar or sensor data processing. Chief among these acceleration techniques are parallel computing and pipelining.
Pipelining and Parallel Computing
Pipelining has been used to improve the data throughput of algorithms since the early days of computer and radar data processing. The technique accepts a longer latency (the time it takes to obtain the first result) in exchange for higher computational throughput. Pipelining gains speed in the way parallel computing does, because the stages operate concurrently on different operands, but it is not true parallel computing. For example, an integer multiplication can be divided into eight 5 ns stages: the first result takes 40 ns to emerge, but once the pipeline is full, every 5 ns cycle accepts a new pair of operands and delivers a new result.
Three simple premises govern when pipelining applies (the sketch after this list illustrates the trade-off):
1. The operation to be performed can be divided into multiple small steps, each of which executes faster than the whole.
2. The operation will be performed many times.
3. The clock cycles spent filling the pipeline cause no problems.
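To make the trade-off concrete, here is a minimal software model of the eight-stage multiplier described above. It is an illustrative sketch only: the operand values, and the assumption that each stage handles four bits of the multiplier, are inventions for the example, not the design of any real device.

```cpp
// Software model of an 8-stage pipelined integer multiplier. Each
// clock tick, every occupied stage handles 4 bits of the multiplier,
// so the first product appears after 8 ticks (40 ns of latency at
// 5 ns per stage) but a new product emerges every tick thereafter.
#include <cstdint>
#include <cstdio>
#include <vector>

struct Slot {
    uint32_t a = 0, b = 0;   // operand pair in flight
    uint64_t partial = 0;    // accumulated partial product
    bool valid = false;
};

int main() {
    const int STAGES = 8;    // eight 5 ns stages -> 40 ns latency
    std::vector<Slot> pipe(STAGES);
    const uint32_t as[] = {3, 5, 7, 11, 13};
    const uint32_t bs[] = {10, 20, 30, 40, 50};
    const int N = 5;

    for (int tick = 0, fed = 0; tick < N + STAGES; ++tick) {
        // The slot leaving the last stage is a finished product.
        if (pipe[STAGES - 1].valid)
            std::printf("tick %2d: %u * %u = %llu\n", tick,
                        pipe[STAGES - 1].a, pipe[STAGES - 1].b,
                        (unsigned long long)pipe[STAGES - 1].partial);

        // Advance every slot one stage.
        for (int s = STAGES - 1; s > 0; --s) pipe[s] = pipe[s - 1];

        // One new operand pair enters per tick while input remains.
        pipe[0] = Slot();
        if (fed < N) {
            pipe[0].a = as[fed]; pipe[0].b = bs[fed];
            pipe[0].valid = true; ++fed;
        }

        // Stage s adds the partial product for multiplier bits
        // [4s, 4s+3]; after all 8 stages, the product is complete.
        for (int s = 0; s < STAGES; ++s)
            if (pipe[s].valid)
                pipe[s].partial +=
                    ((uint64_t)pipe[s].a * ((pipe[s].b >> (4 * s)) & 0xFu)) << (4 * s);
    }
    return 0;
}
```

Run it and the first product appears at tick 8 (the latency), after which one product completes on every tick (the throughput).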
There are many ways to build processing modules for applications today. Which one to choose depends on the needs of the application scenario and on the costs and benefits of each approach.
Full custom IC circuit design
Full custom design offers the highest performance, lowest power, and smallest size. However, in the latest CMOS technologies, the cost of implementing a large custom IC at the highest available performance is very high, so this approach is used only when annual demand reaches the millions of units or when no other digital logic can provide sufficient speed or performance.
ULAs, ASICs, and Gate Arrays
A method of creating semi-custom integrated circuits, developed in the late 1970s, let designers implement their processing circuits in integrated-circuit form without incurring the full one-time cost of a custom circuit.
The first versions of these semi-custom circuits were based on uncommitted logic arrays (ULAs). In these devices, transistors or gates are prefabricated on the chip but left unconnected. Designers create their circuit and then wire the required gates together: a dedicated metal layer connects the prefabricated transistors and gates into the desired circuit.
Over time, the range of circuits in the array (before metallization) grew from simple gates to more complex blocks such as arithmetic units and small memories. Depending on the details, these devices are also called application-specific integrated circuits (ASICs) or gate arrays. Their one-time engineering cost is still relatively high, but it is roughly one-tenth that of a full custom design.
FPGAs
The next stage in the evolution of custom processing circuits was the electrically programmable, erasable gate array. Instead of using expensive mask sets to implement the circuit interconnect, these devices use configuration data stored in on-chip memory to establish programmable internal connections. Because this allows the circuits to be reprogrammed in the field, the devices are called field-programmable gate arrays (FPGAs).
As CMOS geometries have shrunk, the cost of a mask set for a custom or semi-custom integrated circuit has risen from about $100,000 in the early 1990s to millions of dollars today. This has increased the use of FPGAs in devices that are not produced in high volume.
FPGAs offer many of the advantages of custom IC design. By exploiting the hardware parallelism of the FPGA fabric, designers can implement fast integer processing that keeps up with real-time, high-rate data streams. For example, if a digitized radar data stream must be processed as it is received, FPGAs are the only low-latency solution other than ASICs and custom ICs.
Other FPGA advantages include low non-recurring engineering cost, the many useful IP blocks (complex circuit modules) that FPGA vendors supply, and the ability to modify a design at low cost. An FPGA design can also incorporate embedded processors.
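As an illustration of the kind of structure that exploits this hardware parallelism, the sketch below models a fully parallel fixed-point FIR filter in C++, of the sort that high-level synthesis tools turn into FPGA fabric in which all multiply-accumulates happen in the same clock cycle. The tap count and coefficients are made-up values, not taken from any real design.

```cpp
// C++ model of a fully parallel 8-tap fixed-point FIR filter, the
// kind of streaming integer structure an FPGA implements in fabric:
// all eight multiply-accumulates occur in one clock cycle, producing
// one filtered sample per cycle. Taps and coefficients are illustrative.
#include <cstdint>
#include <cstdio>

const int TAPS = 8;
const int16_t coeff[TAPS] = {1, 3, 7, 12, 12, 7, 3, 1};  // made-up taps

// One "clock cycle": shift a new sample into the delay line and
// form the dot product with the coefficients.
int32_t fir_step(int16_t delay[TAPS], int16_t sample) {
    for (int i = TAPS - 1; i > 0; --i) delay[i] = delay[i - 1];
    delay[0] = sample;
    int32_t acc = 0;
    for (int i = 0; i < TAPS; ++i)   // unrolled into parallel MACs in fabric
        acc += (int32_t)coeff[i] * delay[i];
    return acc;
}

int main() {
    int16_t delay[TAPS] = {0};
    const int16_t input[] = {100, 0, 0, 0, 0, 0, 0, 0, 0, 0};  // impulse
    for (int16_t s : input)
        std::printf("%d ", fir_step(delay, s));  // prints the impulse response
    std::printf("\n");
    return 0;
}
```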
The disadvantage of FPGAs is that, at the same process generation (currently 28 nm), they are about five times slower than microprocessors or DSPs for integer operations and about ten times slower for single-precision floating point. Much of the transistor budget goes to programmable routing resources, and some circuit blocks must be replicated many times whether or not they are needed. An FPGA can therefore be less efficient than a custom integrated circuit, using roughly ten times as many gates or more, and this is reflected in unit cost and power consumption.
DSP
DSPs fall between FPGAs and microprocessors, providing multi-core integer and floating-point computation at clock rates comparable to those of microprocessors. DSPs usually provide SIMD operations and address generators that support multidimensional matrix addressing, gather/scatter operations, and separate instruction and data memories.
These features, coupled with DMA, let DSPs achieve very high operation counts per cycle. Reaching that performance requires careful programming and optimization; even with dedicated optimizing compilers available, device simulators are still needed to verify and tune the correctness and performance of each processing step.
The software development cost on a DSP is usually higher than on a microprocessor, and the code is usually not portable. On the other hand, DSPs offer lower power consumption than multi-core microprocessors and FPGAs.
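As a sketch of what the SIMD units and address generators described above are fed with, the loop below computes a block complex multiply (the core of frequency-domain pulse compression) in the portable style vectorizing compilers handle well: independent iterations, restrict-qualified pointers, and regular strides. Real DSP code would use vendor-specific intrinsics; everything here is illustrative.

```cpp
// The kind of inner loop DSP tool chains map onto SIMD units:
// independent iterations, restrict-qualified pointers, and a
// regular access pattern let the compiler issue wide
// multiply-accumulates. Illustrative only; production DSP code
// would use vendor intrinsics.
#include <cstdio>

// Complex multiply of two interleaved (re, im) sample blocks.
void cmul_block(const float* __restrict__ a,
                const float* __restrict__ b,
                float* __restrict__ out, int n) {
    for (int i = 0; i < n; ++i) {
        float ar = a[2*i], ai = a[2*i + 1];
        float br = b[2*i], bi = b[2*i + 1];
        out[2*i]     = ar * br - ai * bi;
        out[2*i + 1] = ar * bi + ai * br;
    }
}

int main() {
    float a[8] = {1, 0, 0, 1, 2, 2, -1, 3};   // four complex samples
    float b[8] = {1, 0, 0, -1, 1, 1, 2, 0};
    float out[8];
    cmul_block(a, b, out, 4);
    for (int i = 0; i < 4; ++i)
        std::printf("(%g, %g) ", out[2*i], out[2*i + 1]);
    std::printf("\n");
    return 0;
}
```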
Embedded Processors
These microprocessors are embedded in other integrated circuits to provide programmability. Both 32-bit and 64-bit versions are available, with multi-core and SIMD acceleration. Embedded processors are most widely used in mobile phones, tablets, and most home entertainment devices.
In 2012 alone, over 8 billion ARM embedded processor cores were shipped. Embedded processors can be multi-core, with up to eight cores at the time of writing. These devices offer less than one-tenth the performance of high-end multi-core general-purpose microprocessors but consume only about 2% of the power. One of their most promising current applications is as embedded cores inside FPGAs and DSPs, giving both of those devices a convenient high-level-language programming interface.
Microprocessors
Modern microprocessors have multiple identical cores with separate or shared caches. At the time of writing there are 12-core devices, and core counts will continue to grow. To keep these cores fed, a large fraction of the chip area is devoted to cache, and as the number of cores increases, so does the percentage of die area occupied by cache.
Each core contains multiple logic units, including a single-instruction, multiple-data (SIMD) coprocessor that provides flexible signal processing capability well suited to radar. Optimizing, vectorizing compilers let these devices perform close to their potential, but the best signal processing performance usually comes from heavily hand-optimized vector signal processing libraries.
However, the operating system's cache management and the multithreaded nature of microprocessor operating systems make it difficult to guarantee hard real-time behavior. In addition, data must remain in cache to sustain peak performance, which makes signal processing particularly sensitive to working-set size. Heat dissipation is another problem with this generation of devices; it limits processor performance or forces the use of liquid cooling.
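To illustrate this sensitivity to working-set size, the sketch below applies cache blocking: several processing passes are run over one cache-resident chunk at a time rather than streaming the entire buffer through each pass. The 256 KiB block size (sized to a typical per-core L2 cache) and the two toy passes are assumptions made for the example.

```cpp
// Cache blocking: run several processing passes over one
// cache-resident chunk at a time instead of streaming the whole
// buffer through each pass. Block size and the two toy passes
// are illustrative assumptions.
#include <algorithm>
#include <cstddef>
#include <vector>

const std::size_t BLOCK = 256 * 1024 / sizeof(float);  // ~L2-sized chunk

void scale_pass(float* x, std::size_t n, float g) {
    for (std::size_t i = 0; i < n; ++i) x[i] *= g;
}
void clamp_pass(float* x, std::size_t n, float lim) {
    for (std::size_t i = 0; i < n; ++i)
        if (x[i] > lim) x[i] = lim;
}

void process(std::vector<float>& buf) {
    for (std::size_t off = 0; off < buf.size(); off += BLOCK) {
        std::size_t n = std::min(BLOCK, buf.size() - off);
        // Both passes reuse the same chunk while it is hot in cache.
        scale_pass(buf.data() + off, n, 0.5f);
        clamp_pass(buf.data() + off, n, 1.0f);
    }
}

int main() {
    std::vector<float> samples(1 << 22, 2.0f);  // 16 MiB of samples
    process(samples);
    return 0;
}
```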
GPU
The graphics processing unit (GPU) is an example of large-scale parallel processing on a single integrated circuit. It can have thousands of logic units running in parallel, and it can be viewed as a single-instruction, multiple-data machine at the lowest level and a multiple-instruction, multiple-data device at the upper level.
Indeed, hardware and software support for massive multithreading is the main feature that distinguishes GPU programming from programming single-core or multi-core processors. Most multi-core processors today support two hardware threads running simultaneously on each core (Hyper-Threading), while at the time of writing the most advanced GPUs support more than 16,000 threads.
The figure shows a block diagram of a GPU with eight blocks, each containing two stream processors, and each stream processor containing eight processing units. GPUs currently still require an x86-family CPU as a host.
If a GPU can keep its stream processing units highly utilized, its performance can exceed that of a multi-core processor by 10 to 100 times. To achieve this, the problem must be highly parallel, the GPU's memory accesses must permit long contiguous data transfers, and traffic across the link to the host processor must be minimized. The GPU's drawback is power consumption, sometimes 2 to 3 times that of a large microprocessor.
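As a concrete sketch of this threading model, the CUDA kernel below assigns one thread per complex range sample and applies a per-sample weight, the kind of embarrassingly parallel step at which GPUs excel. The kernel name, buffer sizes, and launch configuration are illustrative choices, not taken from the article.

```cpp
// One-thread-per-sample CUDA kernel: each of many thousands of
// concurrent threads scales one complex range sample by a weight.
// Error handling is omitted for brevity; names and sizes are
// illustrative.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void apply_weights(const float2* in, const float2* w,
                              float2* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        float2 a = in[i], b = w[i];
        out[i] = make_float2(a.x * b.x - a.y * b.y,   // complex multiply
                             a.x * b.y + a.y * b.x);
    }
}

int main() {
    const int N = 1 << 20;             // 1M complex samples
    float2 *in, *w, *out;              // unified memory for brevity
    cudaMallocManaged(&in, N * sizeof(float2));
    cudaMallocManaged(&w, N * sizeof(float2));
    cudaMallocManaged(&out, N * sizeof(float2));
    for (int i = 0; i < N; ++i) {
        in[i] = make_float2(1.0f, 0.0f);
        w[i]  = make_float2(0.0f, 1.0f);
    }

    int threads = 256;                 // threads per block
    int blocks = (N + threads - 1) / threads;
    apply_weights<<<blocks, threads>>>(in, w, out, N);
    cudaDeviceSynchronize();

    std::printf("out[0] = (%g, %g)\n", out[0].x, out[0].y);  // expect (0, 1)
    cudaFree(in); cudaFree(w); cudaFree(out);
    return 0;
}
```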