Three Powers Compete for High-End FPGA (Part 3): Software Becomes the Main Battlefield

Latest update time: 2019-11-13

Source: Compiled by Semiconductor Industry Observer (ID: icbank) from EEJournal; author: Kevin Morris.

In Part 1 of this series, “The Big Three: High-End FPGAs (Part 1),” we looked at the new high-end FPGA families from Xilinx, Intel, and Achronix and discussed their underlying semiconductor processes, the type and amount of programmable-logic LUT fabric, the type and amount of DSP/arithmetic resources and their suitability for AI inference acceleration, the claimed TOPS/FLOPS performance, and on-chip interconnect such as FPGA routing resources and the network-on-chip (NoC). In Part 2, “The Big Three: High-End FPGAs (Part 2): Memory, I/O, and Customization,” we discussed memory architectures, in-package integration, and high-speed serial I/O. Those comparisons make a few things clear: these are among the most complex and sophisticated chips ever developed, the stakes in this battle are high, each vendor brings some unique value, and there are no clear winners or losers.


In this installment, we explore what may be the most important factor of all: the design tools that let engineering teams harness the power of these devices. It turns out that the flexibility and power of programmable logic is both its greatest asset and its greatest limitation. FPGAs are not processors, even though one of their most attractive uses is to accelerate computation. Engineers with a purely software background often fail to appreciate the complexity hidden behind these devices and the long learning curve required to use them proficiently. Unlike traditional von Neumann processors, FPGAs are not software programmable, at least not in the conventional sense.


Getting the most out of an FPGA requires designing digital logic, and despite decades of progress, we have not yet reached the point where a designer can make optimal use of an FPGA without some level of hardware expertise. There are, however, a few caveats to that statement. First, after wrestling with the massive (and growing) complexity of FPGA design for many years, FPGA vendors have developed some tricks to mitigate the problem. The simplest and most important is the use of pre-designed, single-function accelerators and applications. For certain common applications, the FPGA companies have already done the design for you: logic that experts have optimized and tuned for performance, power, and utilization. If you happen to be working on one of those applications, you may never know you are using an FPGA at all; it just sits quietly in the background, making your system better. We will return to this in a future article on the marketing and distribution of these devices.


The second way to reduce the required design expertise is to raise the level of abstraction. By creating tools that allow design at a higher level, FPGA vendors reduce the need for detailed work at the register-transfer level, greatly simplifying the process. We discuss this in more detail below.


Whatever the reality may be, today’s FPGA companies cater to three distinct audiences: digital hardware engineers with HDL skills, software engineers, and, most recently, AI/machine-learning engineers. Each of these groups can benefit greatly from FPGA technology, but providing each with the tools it needs to properly exploit FPGAs is a daunting task.


To unpack the vast landscape of FPGA tools and IP, we should start at the core, with the part that uniquely defines an FPGA: the lookup-table (LUT) fabric. Modern FPGAs have evolved the way the Swiss Army knife or the smartphone did. The “knife” part of a Swiss Army knife is now a small fraction of its capabilities, yet we still call it a “knife”; likewise, the “phone” part of a smartphone keeps shrinking, yet we still call it a “phone.” The LUT fabric now accounts for only a small part of the value these remarkable devices provide, but it is the one thing that makes an FPGA an FPGA.
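To make the idea concrete, here is a minimal sketch (plain C++, purely illustrative) of what a LUT is: a k-input LUT is just a 2^k-entry truth table whose contents come from the configuration bitstream, so the same physical cell can become an AND gate, an XOR, a mux slice, or any other 4-input function.

```cpp
#include <cstdint>

// Model of a 4-input LUT: 16 configuration bits select the output for each
// of the 16 input combinations. "Programming" the FPGA means loading `truth`.
struct Lut4 {
    std::uint16_t truth;  // 16 configuration bits from the bitstream

    bool eval(bool a, bool b, bool c, bool d) const {
        // The four inputs form an address into the truth table.
        unsigned addr = (unsigned)a | ((unsigned)b << 1)
                      | ((unsigned)c << 2) | ((unsigned)d << 3);
        return (truth >> addr) & 1u;
    }
};
```

For example, the configuration word 0x8888 makes the cell compute `a AND b` (ignoring c and d), while 0x6666 makes the very same cell compute `a XOR b`; only the stored bits differ.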


Xilinx points out that its “ACAP” Versal devices can boot and do useful work without configuring the LUT fabric, or even using it at all. This is a large part of the basis for the claim that Versal devices are not FPGAs but a new category. Still, all of these devices have enough in common in function and workload to be compared, and Versal is unlikely to find a home in many applications that make no use of the FPGA fabric.


It all starts with HDL design

The lowest-level design tools in the FPGA world handle place and route. This process takes a netlist of interconnected LUTs (along with each LUT’s configuration), seeks a near-optimal arrangement of them on the chip, and makes the required connections through the programmable interconnect fabric. Place and route is the heart and soul of FPGA implementation. As devices have grown, two trends have affected the process. First, the portion of the chip devoted to interconnect (relative to logic) has had to increase for large designs to route successfully. FPGA companies, of course, hate devoting silicon to routing, because it means less logic on their devices. You never see a vendor bragging about its enormous routing resources, but LUT count is usually front and center, so maximizing LUTs while minimizing routing is the basis of datasheet marketing.
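The flavor of the placement half of this problem can be shown with a toy sketch (illustrative only; real FPGA CAD uses far more sophisticated algorithms, timing models, and the routing step itself). Cells sit on a grid, and we shrink the total Manhattan wirelength of two-pin nets by trying random pairwise swaps and keeping the ones that don’t make things worse:

```cpp
#include <cstdlib>
#include <random>
#include <utility>
#include <vector>

// Toy placer: cells occupy slots on a w x h grid; each net connects two
// cells; the cost is total Manhattan wirelength. `improve` performs greedy
// iterative improvement by random pairwise swaps.
struct Placer {
    int w, h;                                // grid dimensions
    std::vector<int> slot_of_cell;           // cell index -> grid slot
    std::vector<std::pair<int,int>> nets;    // two-pin nets (cellA, cellB)

    int dist(int a, int b) const {           // Manhattan distance of two slots
        return std::abs(a % w - b % w) + std::abs(a / w - b / w);
    }
    int wirelength() const {
        int total = 0;
        for (const auto& n : nets)
            total += dist(slot_of_cell[n.first], slot_of_cell[n.second]);
        return total;
    }
    // Try `iters` random swaps; keep a swap only if it doesn't worsen cost.
    void improve(int iters, unsigned seed) {
        std::mt19937 rng(seed);
        std::uniform_int_distribution<int> pick(0, (int)slot_of_cell.size() - 1);
        int best = wirelength();
        for (int i = 0; i < iters; ++i) {
            int a = pick(rng), b = pick(rng);
            std::swap(slot_of_cell[a], slot_of_cell[b]);
            int cost = wirelength();
            if (cost <= best) best = cost;                         // keep swap
            else std::swap(slot_of_cell[a], slot_of_cell[b]);      // undo swap
        }
    }
};
```

Production placers use simulated annealing or analytical methods rather than pure greed, and, as the article notes next, they must also fold routing-delay feedback into the cost function rather than counting wirelength alone.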


To minimize the chip area consumed by routing, FPGA companies perform exhaustive experiments during chip development, placing and routing countless real-world designs over multiple iterations of the architecture, so that most user designs can exploit the chip’s logic resources without failing to route. Allocate too few routing resources, and a large fraction of the on-chip logic becomes unusable; this has happened in the past, when FPGA companies shipped chips that could typically route only about 60% of their logic, putting real capacity far below the advertised numbers. Conversely, allocate too much silicon to routing, and the chip has lower logic density than competitors in the same silicon area.


Obviously, the better the place-and-route algorithms perform, the fewer routing resources are needed to complete most designs. The performance of those algorithms therefore feeds back into the design of the device itself.


The second trend affecting place and route is that the dominant contributor to logic-path delay has shifted from “logic” delay to “interconnect” delay. This has profound implications for the design flow, because logic synthesis can no longer analyze timing independently of the delays introduced by routing. Synthesis and place and route must therefore be tightly coupled: placement determines routing delays, and those delays are fed back to drive different decisions in constructing the logic paths. Today, logic synthesis and place and route are often combined into a single (usually iterative) step in which candidate logic and placement solutions are evaluated and compared until a result meets the key goals of timing, area, and power.


All three companies provide powerful toolsets for synthesis and place-and-route, which are essential for traditional FPGA users (i.e., the "digital hardware engineers with HDL skills" mentioned above).


Xilinx created Vivado in 2012 with a complete overhaul of its then-aging ISE tool suite. Seven years on, Vivado has matured into a relatively robust, reliable platform whose architecture has generally kept up with the rapid evolution of the FPGA business. Its logic synthesis and place-and-route algorithms are state of the art, and it manages compile times and memory footprint well on today’s large designs. Vivado is scriptable through Tcl, which provides a great deal of control and customization, and it also includes a simulator and an IP integration tool.


Xilinx has done extensive iterations between their tools and FPGA architectures to find the sweet spot in terms of routing and logic resources that allows their devices to achieve consistently high utilization with their tools and get solid results in timing closure. Of the three vendors, Xilinx is the most “old school” in terms of placement and timing closure, while both Intel and Achronix have taken some novel architectural steps in their device architectures to help achieve timing closure on today’s large, complex designs.


Xilinx also leads the FPGA space in high-level synthesis (HLS). Vivado HLS is (we believe, by far) the industry’s most widely used HLS tool, supporting a C/C++ flow into logic for hardware designers seeking productivity beyond the register-transfer level (RTL). The high adoption of Xilinx’s HLS tools also helps with the timing-closure problem, since RTL generated automatically by an HLS tool often closes timing better than hand-written RTL.
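For a flavor of that flow, here is a sketch of the kind of untimed C++ an HLS compiler consumes. The directives appear only as comments (their names follow the general style of Vivado HLS pragmas, but treat the specifics as illustrative rather than a verified Vivado HLS design), so the code remains plain, compilable C++:

```cpp
#include <cstddef>

// Untimed C++ dot-product / FIR kernel of the sort an HLS tool turns into
// RTL. In a real HLS flow, directives such as
//   #pragma HLS PIPELINE  (accept a new input every clock)
//   #pragma HLS UNROLL    (replicate the loop body into parallel multipliers)
// tell the tool how to trade area for throughput.
const std::size_t TAPS = 8;

int fir(const int x[TAPS], const int h[TAPS]) {
    int acc = 0;
    for (std::size_t i = 0; i < TAPS; ++i) {
        // #pragma HLS UNROLL
        acc += x[i] * h[i];   // each tap could map to a hard DSP block
    }
    return acc;
}
```

The same source, with different directives, can become a tiny sequential circuit or eight parallel multiply-accumulate units, which is precisely the productivity lever the article describes.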


Intel's Quartus Prime Pro is the evolution of Altera's Quartus design suite, its flagship FPGA tooling for the past 20 years. As we mentioned before, Intel updated its chips a few generations ago with what it calls the "HyperFlex" architecture: essentially blanketing the device with a large number of small registers so the tools can retime critical logic paths. This makes timing closure easier on complex designs, though possibly at the expense of some peak performance.


Recently, Intel added an optional strategy called Fractal Synthesis to Quartus for arithmetic-intensive designs with small multipliers, such as machine-learning algorithms. Intel says Microsoft used fractal synthesis in its "Brainwave" project (which powers Bing search) to achieve 92% of the maximum performance of Stratix 10 devices. Another recent addition is Design Assistant design-rule checking (DRC), which helps find problems in constraints and placed netlists to reduce the number of iterations needed for timing closure.


Intel joined the HLS party much later than Xilinx, but it now includes the Intel HLS Compiler in the Quartus suite. The compiler takes untimed C++ as input and generates RTL optimized for Intel FPGAs. While Intel's HLS tool has far less field usage than Vivado HLS, we expect adoption to grow significantly, since the HLS Compiler underpins the "FPGA" branch of Intel's One API software development platform. At this point, Intel's HLS tool appears to be aimed more at software engineers than Xilinx's, which is clearly a power tool for hardware designers.


Achronix’s ACE tool suite relies on third-party tools for simulation and synthesis; an OEM version of Synopsys Synplify Pro is included, with advanced floorplanning and critical-path analysis to aid timing closure. With Speedster7t, however, Achronix has taken a unique approach to timing closure via its network-on-chip (NoC). The NoC “enables designers to transfer data anywhere in the FPGA fabric at speeds of up to 2 GHz without using logic resources.” It also means that placing and routing a user design consumes fewer of the precious FPGA routing resources than in a traditional FPGA, which must carry such signals through the device’s general routing fabric.


Since acceleration workloads often involve heavy “bus” routing of multi-bit paths, Achronix has also introduced byte- (or nibble-) based routing to further ease timing closure: additional routing resources that can be used when moving whole data words with no bit reordering required.


For its foray into HLS, Achronix has partnered with Mentor, whose Catapult-C is probably the most mature ASIC HLS tool on the market. Catapult-C is industrial-strength and affordable, and it will enable a full C/C++ flow to Speedster7t FPGAs, with the usual caveat that HLS remains a productivity and quality-of-results tool for hardware engineers rather than a way for software engineers to design hardware directly. Catapult-C should shine in 5G applications and in optimizing accelerated workloads such as AI inference.


Entry point for software developers


Over the past 20 years (since Altera introduced its ill-fated "Excalibur" family), many high-end FPGAs have included conventional von Neumann processors alongside their LUT fabric, ranging from "soft" microcontrollers implemented in LUTs to complex "hard" multi-core 64-bit processing subsystems with peripherals. Such FPGAs can all be classified as systems-on-chip (SoCs). With this trend, the design problem grew more complex, because we now need tools that support software engineers developing applications for the processor subsystems embedded in today's FPGAs.


For many years, this meant an FPGA SoC project required both hardware and software experts on the team. But as FPGA companies push their chips into new markets, they are also working to reduce customers' reliance on HDL and hardware expertise. Today they aim to provide software development suites that not only let software engineers write code for the embedded processing subsystems, but also let those engineers create accelerators that exploit the FPGA fabric to speed up their applications.


As we wrote recently, Xilinx just announced their Vitis unified platform, which is designed to help software engineers (and AI developers using "Vitis AI" as an entry point) accelerate specific applications using Xilinx devices. Xilinx says Vitis "provides a unified programming model for accelerating host CPUs, embedded CPUs, and hybrid (host + embedded) applications. In addition to the core development tools, Vitis provides a rich set of hardware acceleration libraries that are pre-optimized for Xilinx device-based hardware platforms."


Vitis may represent a behavioral and philosophical shift for Xilinx, which has historically been known for keeping a tight grip on its own ecosystem; the company now says it is "committed to open source and community engagement." Xilinx says all of its hardware-acceleration libraries are published on GitHub, and its runtime, XRT, is open source as well. Of course, all of this open-source software still ultimately targets Xilinx hardware, and porting it to a competitor's devices would be an arduous task. Still, coupled with the fact that Vitis is available for free, this is a big step in the right direction for the company, and it opens a convenient on-ramp for non-traditional users (such as CUDA developers) to adopt FPGAs.


Intel’s One API is designed with aims similar to those of Xilinx’s Vitis, but with broader scope. Because Intel’s computing portfolio is so wide (heavy-iron Xeon processors, GPUs, FPGAs, dedicated AI engines like Nervana and Movidius, and more), Intel has set the ambitious goal of a single software-development entry point covering what it calls its “SVMS” architectures (scalar, vector, matrix, and spatial, deployed in CPUs, GPUs, NNPs, and FPGAs). Intel says One API “promises to eliminate the current barriers to entry for hardware accelerators on FPGAs by abstracting the DMA of data from the host to the FPGA and back — a process that is very manual, tedious, and error-prone in HDL-based design flows. One API also shares the FPGA’s backend infrastructure with Intel’s HLS Compiler and Intel OpenCL SDK for FPGAs, making it easy for developers currently using these tools to migrate to One API.”


Intel reminds us that “FPGA developers need to modify their code to get the best performance on FPGAs or use libraries that are pre-optimized for FPGA architectures with spatial computing features. Software developers must be trained on FPGAs to take full advantage of the performance benefits of FPGA-specific architectural acceleration and achieve portability between architectures.”


Thanks, Intel-wan. We are all eager to get trained on FPGAs.


One API will use a new programming language, Data Parallel C++ (DPC++), along with API calls. It will include API libraries for a variety of workload domains, plus enhanced analysis and debugging tools tailored for DPC++.


One feature that sets Intel’s products apart is the optimized low-latency, cache-coherent UPI (and, in the future, CXL) interface between Xeon Scalable processors and FPGA accelerators. Intel says this capability “enables applications such as virtualization, artificial intelligence, and large in-memory databases to execute at a higher level than would otherwise be possible with traditional PCIe connections.”


Achronix is at a disadvantage relative to its larger, more diversified competitors when it comes to letting software developers tap FPGA acceleration directly. The company is building an ecosystem through various partnerships, but those are likely to target design teams that already have the hardware expertise to use FPGAs, and applications for which the Achronix ecosystem offers pre-optimized solutions and reference designs.


As artificial intelligence takes over the world…


Recently, AI has emerged as a new “killer app” for FPGAs. As we discussed in our last post, FPGAs have an excellent ability to dynamically create customized processing engines tailored to specific AI inference applications. But this has created a new type of user, the AI/ML engineer, and this third audience brings its own tool-flow requirements. These users need development tools that support TensorFlow, Caffe, and the other front ends of their trade, along with technical paths that turn the output of those tools into properly optimized FPGA-based solutions.


As part of the Vitis release, Xilinx also announced Vitis AI, which, as one might guess, is a plug-in aimed at AI developers and data scientists. It compiles AI models directly from standard frameworks such as TensorFlow. Vitis AI can instantiate what Xilinx calls domain-specific processing units (DPUs) in the FPGA fabric or in the AI engines of a Versal device. A key (and compelling) advantage of this approach is that Vitis AI compiles AI models into opcodes for the DPU, so a new model can be downloaded and running in minutes or even seconds, without re-running place and route. That makes iteration and workload provisioning far simpler and faster than flows that reconfigure the FPGA as part of development. Vitis AI also leverages Xilinx's pruning technology to help optimize inference models.
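The opcode idea can be illustrated with a toy interpreter (a sketch of the concept only; the instruction set below is invented and bears no relation to the actual DPU ISA). The fixed engine stands in for the already placed-and-routed hardware; deploying a new model means loading a new instruction stream, not a new bitstream:

```cpp
#include <cstdint>
#include <vector>

// Toy "overlay" engine: the interpreter plays the role of a fixed datapath
// on the FPGA. A network layer is expressed as a stream of opcodes, so a new
// model needs no re-synthesis and no re-place-and-route.
// This instruction set is hypothetical, invented purely for illustration.
enum class Op : std::uint8_t { LOAD, MAC, RELU, HALT };

struct Insn {
    Op op;
    float imm;  // weight operand for MAC; unused otherwise
};

float run(const std::vector<Insn>& program, float input) {
    float reg = 0.0f;  // operand register
    float acc = 0.0f;  // accumulator
    for (const Insn& i : program) {
        switch (i.op) {
            case Op::LOAD: reg = input; break;              // fetch activation
            case Op::MAC:  acc += reg * i.imm; break;       // multiply-accumulate
            case Op::RELU: acc = acc > 0 ? acc : 0; break;  // activation
            case Op::HALT: return acc;
        }
    }
    return acc;
}
```

Swapping one `program` for another retargets the same "hardware" almost instantly, which is the essence of why opcode-driven flows iterate so much faster than flows that rebuild the FPGA configuration.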


Intel took a similar approach (actually, in the reverse order) with One API for AI developers, via its OpenVINO toolkit. Intel’s version gives AI developers what the company calls “a single toolkit that accelerates their deployment of solutions on multiple hardware platforms, including Intel FPGAs.” Intel says OpenVINO can take AI developers and data scientists from frameworks like TensorFlow and Caffe straight to hardware, with no FPGA knowledge required.


Intel's solution supports a variety of popular neural networks and also allows the creation of custom networks. It includes APIs that let FPGA developers customize the engine for their applications; Intel says that tailoring the AI engine to an application's specific dataflows and modules can yield order-of-magnitude performance improvements.


The custom flows also include what Intel calls AI+, for applications that need capabilities beyond AI itself, for example letting application developers use the flexibility of the FPGA fabric to run pre-processing alongside inference, offloading those tasks from the CPU and reducing overall system latency.


Achronix provides low-level machine-learning library functions for frameworks like TensorFlow, as well as support for high-level frameworks. It supports both an “overlay architecture” and direct mapping of dataflow graphs from high-level frameworks onto FPGA logic. The company describes an overlay as “an application processor specifically optimized for one or more ML networks, where the hardware is instantiated with reprogrammable logic and the specific network is implemented in microcode running on that hardware instance.” In other words, an overlay implementation can be changed by modifying software alone (or hardware and software together), while a directly mapped data graph requires full or partial reconfiguration of the FPGA fabric. Achronix’s approach essentially lets you trade the depth of optimization against the time and expertise you have available.


As you can see, the range and scale of development tools and techniques for these complex programmable-logic devices is enormous, and we have only scratched the surface here. It will be interesting to see which capabilities prove decisive; we believe that, going forward, more hardware sockets will be won on the strength of the development flow than on the hardware itself.


Next time, we’ll look at the vast ecosystem of IP and reference designs, as well as pre-integrated boards and modules that can drive these complex devices into high-value applications where there aren’t always enough engineering resources for full, optimized development.
