Column | FPGA vs. ASIC, who will lead the trend of mobile artificial intelligence?
Synced
Author: Li Yilei
PhD from UCLA
Artificial intelligence is on the rise, and countless startups and established companies are developing smart hardware with AI applications as the selling point. Powerful cloud-based AI services (such as Google's AlphaGo) have already emerged, and people also hope to bring AI to mobile terminals, especially in combination with future Internet of Things applications.
The traditional method of implementing artificial intelligence on mobile terminals is to transmit all terminal data to the cloud through the network, and then send the results back to the mobile terminal after calculation in the cloud, such as Apple's Siri service.
However, this approach runs into several problems. First, transmitting data over the network introduces delay. The results may take several seconds or even tens of seconds to come back to the terminal (anyone who has used the Prisma app to process photos will know this well). Applications that need results immediately therefore cannot use this approach. For example, if a drone's deep learning obstacle avoidance algorithm ran entirely in the cloud, the drone might crash before the results came back.
Second, once data is transmitted over the network, it is at risk of being intercepted. Applications that require low computing latency and are very sensitive to data security therefore need to run the entire AI algorithm on the terminal, or at least complete some preprocessing on the terminal and send a small amount of intermediate results (rather than a large amount of raw data) to the cloud for the final computation. This requires the mobile terminal hardware to complete these operations quickly. At the same time, the hardware cannot consume too much energy in doing so, or the battery will run out in a short time (equipping a mobile phone with an Nvidia Pascal graphics card that draws 200W+ is definitely not an option!).
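To put the power constraint in numbers, here is a rough back-of-the-envelope sketch. The battery capacity (3000 mAh at 3.8 V) and the 1 W accelerator budget are assumptions chosen for illustration; only the 200 W figure comes from the text above.

```python
# Rough battery-life estimate at a constant power draw.
# The battery size and the 1 W accelerator budget are illustrative assumptions.

def battery_life_minutes(capacity_mah: float, voltage_v: float, power_w: float) -> float:
    """Return how many minutes a battery lasts when drained at constant power."""
    energy_wh = capacity_mah / 1000.0 * voltage_v  # convert mAh to Wh
    return energy_wh / power_w * 60.0

# Assumed phone battery: 3000 mAh at 3.8 V (about 11.4 Wh).
PHONE_MAH = 3000.0
PHONE_V = 3.8

gpu_minutes = battery_life_minutes(PHONE_MAH, PHONE_V, 200.0)  # 200 W class graphics card
chip_minutes = battery_life_minutes(PHONE_MAH, PHONE_V, 1.0)   # assumed 1 W accelerator

print(f"200 W GPU:       {gpu_minutes:.1f} minutes")
print(f"1 W accelerator: {chip_minutes / 60.0:.1f} hours")
```

Under these assumptions the 200 W card would drain the battery in a few minutes, while a chip in the 1 W range could run for hours, which is why mobile AI hardware is judged on energy per operation and not on raw speed alone.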
Currently, many companies are developing hardware for mobile AI. There are two major schools of thought on how to implement it: FPGA and ASIC. The FPGA school is represented by platforms such as Xilinx's Zynq, while the ASIC school is represented by companies such as Movidius. Each school has its strengths and weaknesses, which I will explain in detail below.
FPGA vs. ASIC
First, let's look at the difference between FPGA and ASIC. FPGA stands for "Field Programmable Gate Array". The basic principle is that an FPGA chip integrates a large number of basic digital logic gates and memories, and the user defines the connections between these gates and memories by loading a configuration file onto the FPGA. This loading is not a one-time operation: the user can configure the FPGA as a microcontroller (MCU) today, then edit the configuration file and configure the same FPGA as an audio codec tomorrow. ASIC stands for "Application-Specific Integrated Circuit". Once an ASIC has been designed and manufactured, its circuit is fixed and cannot be changed.
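The idea that a configuration file decides what the same hardware does can be sketched with a toy lookup table (LUT), the kind of small configurable logic element FPGAs are commonly built from. This is a deliberately simplified model: the class `Lut2` and the bit patterns are hypothetical names for illustration, and a real FPGA configuration file also programs routing, memories and much more.

```python
from typing import List

# Toy model of one 2-input lookup table (LUT). Its "configuration" is just a
# 4-entry truth table, so loading different bits changes the logic function
# without changing the hardware. A simplification for illustration only.

class Lut2:
    """A 2-input lookup table configured by 4 bits."""

    def __init__(self) -> None:
        self.table: List[int] = [0, 0, 0, 0]

    def configure(self, bits: List[int]) -> None:
        """Load a new truth table (the stand-in for a configuration file)."""
        if len(bits) != 4 or any(b not in (0, 1) for b in bits):
            raise ValueError("a 2-input LUT needs exactly four 0/1 bits")
        self.table = list(bits)

    def evaluate(self, a: int, b: int) -> int:
        """Look up the output for inputs a and b (each 0 or 1)."""
        return self.table[(a << 1) | b]

AND_BITS = [0, 0, 0, 1]  # output is 1 only when a = b = 1
XOR_BITS = [0, 1, 1, 0]  # output is 1 when a and b differ

lut = Lut2()

lut.configure(AND_BITS)  # "today" the LUT is an AND gate
and_out = [lut.evaluate(a, b) for a in (0, 1) for b in (0, 1)]

lut.configure(XOR_BITS)  # "tomorrow" the same LUT is an XOR gate
xor_out = [lut.evaluate(a, b) for a in (0, 1) for b in (0, 1)]

print(and_out, xor_out)
```

An ASIC corresponds to fixing the truth table at manufacturing time: faster and cheaper per unit, but no second `configure` call is possible.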
FPGA (Xilinx Kintex 7 UltraScale, top) and ASIC (Movidius Myriad 2, bottom) for deep learning accelerators
Comparing FPGAs and ASICs is like comparing Lego bricks and molded models. Suppose you notice that Yoda from Star Wars has become very popular and you want to make a Yoda toy to sell. What would you do?
There are two ways: build it from Lego bricks, or have a factory make a mold and produce a custom toy. If you build with Lego bricks, you only need to buy a set of bricks once you have designed the toy's appearance. If you have a factory make a mold, you have a lot more to deal with besides the toy's appearance, such as whether the material gives off an odor or whether the toy will melt at high temperatures. The preliminary work needed to make toys from Lego bricks is therefore much less than for a factory mold, and the time from finished design to market launch is also much shorter with Lego.
The same is true for FPGA and ASIC. With an FPGA, once you have written the Verilog code, you can use the tools provided by the FPGA manufacturer to implement the hardware accelerator. To design an ASIC, you also need to do a great deal of verification and physical design (ESD protection, packaging, etc.), which takes more time. For special applications (such as military and industrial applications with high reliability requirements), an ASIC needs even more time for dedicated design work, whereas with FPGAs you can simply buy military-grade, high-stability parts without affecting development time at all. However, although the design time is shorter, toys built from Lego bricks are much rougher (that is, they perform worse) than those customized by the factory (see the figure below). After all, factory molds are tailor-made.
In addition, if the shipment volume is large, mass-producing toys in a factory is much cheaper than building them from Lego bricks. The same is true for FPGAs and ASICs. An ASIC accelerator implemented in the best available process will be 5 to 10 times faster than an FPGA accelerator built on the same process, and once mass-produced, the ASIC will cost far less than the FPGA solution (10 to 100 times cheaper).
FPGA vs ASIC: Building Blocks vs Figures
Of course, another major feature of FPGA is that it can be reconfigured at any time to achieve different functions in different situations. However, when the accelerator implemented by FPGA is sold to users as a commodity, it takes a lot of effort to let users reconfigure it themselves.
Back to the Lego example: the toy manufacturer can claim that because this Master Yoda is built from bricks, players can reassemble the bricks into other characters (such as Luke Skywalker). But ordinary players do not know how to take the bricks apart and rebuild them. What can be done? Either target hardcore players who are skilled at building with bricks, or add a switch on the back of the toy that automatically reassembles the bricks when an ordinary player presses it. Obviously, the second solution has a very high technical threshold.
For FPGA accelerators, if reconfiguration is to be the selling point, the vendor must either sell to corporate users who can develop for FPGAs themselves (companies such as Baidu and Microsoft are indeed developing FPGA-based deep learning accelerators and configuring FPGAs as different accelerators for different application scenarios), or develop a convenient, easy-to-use compiler that converts a user's deep learning network into an FPGA configuration file (companies such as DeePhi are trying this).
At present, compiling for an FPGA takes several minutes even on a high-end server. On a mobile terminal with weaker computing power, it will take even longer. How to convince mobile users to try reconfiguring the FPGA, and to accept waiting tens of minutes for the network to compile and the FPGA to be configured, remains an open problem.
Summary
I have summarized the comparison between FPGA and ASIC in the table below. FPGAs are quick to bring to market but offer lower performance. ASICs are slow to bring to market, take a lot of time to develop, and have a much higher one-time cost (the cost of producing photolithography masks) than FPGAs, but they deliver much higher performance and a much lower average cost once mass-produced. FPGAs can be fully reconfigured, but ASICs can also offer a certain degree of configurability, as long as the circuit is designed so that certain parameters are adjustable.
In terms of target markets, FPGAs are expensive, so they suit fields that are not very price-sensitive, such as enterprise applications and military and industrial electronics (fields where reconfiguration may genuinely be needed). ASICs suit consumer electronics because of their low cost, and whether configurability is a real need in consumer electronics at all remains open to debate.
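The cost trade-off described above (high one-time cost but low unit cost for ASICs, the reverse for FPGAs) implies a break-even shipment volume. The sketch below computes it; every dollar figure is a made-up assumption for illustration, with the ASIC unit cost set 20 times lower than the FPGA's, inside the 10 to 100 times range mentioned earlier.

```python
import math

def total_cost(one_time_cost: float, unit_cost: float, volume: int) -> float:
    """Total cost = one-time (design, masks) cost + unit cost * units shipped."""
    return one_time_cost + unit_cost * volume

def break_even_volume(fpga_one_time: float, fpga_unit: float,
                      asic_one_time: float, asic_unit: float) -> int:
    """Smallest volume at which the ASIC's total cost is no higher than the FPGA's."""
    if asic_unit >= fpga_unit:
        raise ValueError("the ASIC only pays off if its unit cost is lower")
    extra_one_time = asic_one_time - fpga_one_time
    if extra_one_time <= 0:
        return 0
    return math.ceil(extra_one_time / (fpga_unit - asic_unit))

# Hypothetical numbers, not real prices.
FPGA_ONE_TIME, FPGA_UNIT = 100_000, 200    # low up-front cost, expensive chips
ASIC_ONE_TIME, ASIC_UNIT = 2_000_000, 10   # expensive masks, cheap chips

volume = break_even_volume(FPGA_ONE_TIME, FPGA_UNIT, ASIC_ONE_TIME, ASIC_UNIT)
print(f"break-even at {volume} units")
print(total_cost(FPGA_ONE_TIME, FPGA_UNIT, volume),
      total_cost(ASIC_ONE_TIME, ASIC_UNIT, volume))
```

Below the break-even volume the FPGA is the cheaper choice, which fits low-volume enterprise and military products; consumer electronics typically ship far more units, which favors the ASIC.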
The market situation we see is also the same: most of the users who use FPGA for deep learning acceleration are enterprise users. Baidu, Microsoft, IBM and other companies have teams dedicated to FPGA acceleration for servers, and the target market of Teradeep, a startup company that makes FPGA solutions, is also the server. ASICs are mainly aimed at consumer electronics, such as Movidius. Since mobile terminals belong to the field of consumer electronics, the solutions used in the future should be mainly ASICs.
SoC+IP Model
At this point, many readers may have questions: the network structure of deep learning is changing with each passing day, but ASIC is slow to market and cannot be changed once it is produced (taped out). How can it keep up with the development speed of deep learning? To answer this question, I think we need to clarify a concept first, that is, what exactly does ASIC for deep learning acceleration do?
Some people think that a neural network ASIC implements one particular network structure on the chip, so that once the network structure changes (for example, from 12 layers to 15 layers, or the weights change), the ASIC can no longer be used. This understanding is wrong. An ASIC accelerator helps the CPU complete the computations in deep learning (such as convolution) quickly. Whenever the CPU encounters such a computation while executing an AI algorithm, it hands it over to the accelerator. As long as the main computations of the neural network remain unchanged, the ASIC accelerator can still be used. What the network structure does affect is the accelerator's performance. An ASIC accelerator may be optimized for GoogLeNet and therefore execute GoogLeNet very quickly. When you switch to VGGNet, the same ASIC still works, with lower efficiency than on GoogLeNet, but in any case much faster than the CPU.
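The division of labor described above can be sketched in a few lines. This is a toy model, not any vendor's API: the `Accelerator` class, the layer tuples and the 1-D convolution are all hypothetical stand-ins, and the "accelerator" here simply runs ordinary Python where real hardware would run dedicated circuits.

```python
from typing import List, Optional, Tuple

Layer = Tuple[str, Optional[List[float]]]  # (operation name, kernel or None)

def conv1d(signal: List[float], kernel: List[float]) -> List[float]:
    """1-D 'valid' convolution in the cross-correlation form used in deep learning."""
    n = len(signal) - len(kernel) + 1
    return [sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
            for i in range(n)]

def relu(values: List[float]) -> List[float]:
    return [max(0.0, v) for v in values]

class Accelerator:
    """Stand-in for a fixed-function accelerator that only knows convolution."""

    def __init__(self) -> None:
        self.calls = 0

    def conv1d(self, signal: List[float], kernel: List[float]) -> List[float]:
        self.calls += 1  # count how often the CPU offloads work
        return conv1d(signal, kernel)

def run_network(layers: List[Layer], x: List[float], accel: Accelerator) -> List[float]:
    """The 'CPU' walks the network and offloads every convolution it meets."""
    for op, kernel in layers:
        if op == "conv":
            x = accel.conv1d(x, kernel)
        elif op == "relu":
            x = relu(x)  # cheap operation, stays on the CPU
        else:
            raise ValueError(f"unknown operation: {op}")
    return x

accel = Accelerator()
data = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]

shallow: List[Layer] = [("conv", [1.0, -1.0]), ("relu", None)]
deep: List[Layer] = shallow * 2 + [("conv", [0.5, 0.5])]  # different depth, same operations

shallow_out = run_network(shallow, data, accel)
deep_out = run_network(deep, data, accel)
print(shallow_out, deep_out, accel.calls)
```

The same `Accelerator` object serves both networks unchanged. Changing the depth or the weights only changes what the CPU asks it to do, which is the point made above.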
As for the problem of ASIC's slow time to market, there is a way to solve it, which is to use the SoC+IP method. Since it takes too much time for a company to design ASIC, can it be outsourced or even crowdfunded? Absolutely! Many SoC chips are made in this way.
Here I would like to introduce the concept of SoC. SoC stands for "System-on-chip", which means a chip that integrates many different modules. Take the chip for multimedia applications as an example. In the early years, each module of multimedia applications (audio codec, MPEG playback codec, 3D acceleration, etc.) was an ASIC.
Later, the electronics industry found that it was too expensive to make ASIC for each module, and it was difficult to make the size of electronic products small. It was better to integrate all modules into the same chip. This chip integrates multiple modules, and a central control unit controls the operation of each module through a bus, which is SoC. For example, Qualcomm's Snapdragon is a typical SoC, which integrates GPU, video/audio codec, camera image signal processing unit (ISP), GPS, and wired/wireless connection unit, etc.
Each module on an SoC can be called an IP. An IP can be designed by the company itself (for example, the modem on Snapdragon is designed by Qualcomm itself), or purchased from another company and integrated into the chip. For example, the GPU in Apple's A series processors uses Imagination's PowerVR IP. SoC+IP provides a flexible and fast model. If Apple had not purchased the IP but instead formed its own team to develop GPUs from scratch, the launch of its A series processors would probably have been delayed by at least one year.
Qualcomm's Snapdragon SoC integrates many IPs on the chip
For deep learning accelerators, making them into IP is also a model to accelerate time to market. When a deep learning accelerator becomes IP, it is no longer made into an ASIC itself, but becomes part of the SoC. When the SoC needs to perform deep learning-related operations, it is handed over to the accelerator to do it.
Moreover, IP can be more flexible in meeting customers' needs for accelerators. For example, suppose an accelerator IP design achieves a computing speed of 100 GFLOPS at a power consumption of 150 mW. Customer A then says they need a faster accelerator (150 GFLOPS) and do not care much about power consumption (300 mW is acceptable) or chip area. The IP company can quickly fine-tune its design to these requirements and deliver it within one or two months (because it only needs to deliver the design, not produce an actual chip).
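The trade-off in this example can be made explicit with a little arithmetic. The sketch below uses the two operating points from the paragraph above; the class name `AcceleratorConfig` is a hypothetical helper for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AcceleratorConfig:
    """One operating point of a hypothetical accelerator IP."""
    name: str
    gflops: float
    power_mw: float

    @property
    def gflops_per_watt(self) -> float:
        """Energy efficiency: throughput divided by power in watts."""
        return self.gflops / (self.power_mw / 1000.0)

baseline = AcceleratorConfig("baseline IP", gflops=100.0, power_mw=150.0)
customer_a = AcceleratorConfig("customer A variant", gflops=150.0, power_mw=300.0)

speedup = customer_a.gflops / baseline.gflops
power_ratio = customer_a.power_mw / baseline.power_mw

for cfg in (baseline, customer_a):
    print(f"{cfg.name}: {cfg.gflops_per_watt:.0f} GFLOPS/W")
print(f"speed x{speedup:.1f}, power x{power_ratio:.1f}")
```

Customer A's variant is 1.5 times faster for 2 times the power, so its efficiency drops from roughly 667 to 500 GFLOPS/W. That is an acceptable trade for a customer who cares about speed more than power, and it is exactly the kind of adjustment an IP vendor can make without fabricating a new chip.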
However, if the accelerator has been made into an ASIC, then if you want to change the design, you must make a new chip. This process involves time-consuming physical design and verification, and the modified chip may not be available for sale until a year later. In the SoC+IP model, IP companies can focus on the front-end design of the accelerator and tailor it to customer needs, while large companies can do their own back-end and chip/package-level verification. It can be said that both large and small companies can play to their strengths and avoid their weaknesses, and each can take what they need, ultimately achieving rapid accelerator design iterations (such as once every six months or even a quarter) and keeping up with the pace of deep learning development.
From a performance perspective, if the deep learning accelerator is made into an IP, it can use high-bandwidth on-chip interconnection when communicating with the CPU on the same chip. However, if it is made into an ASIC, it must use off-chip interconnection with lower bandwidth and higher power consumption. Therefore, the deep learning accelerator as an IP becomes part of the SoC, which also improves the overall performance of the system.
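To see why interconnect bandwidth matters, consider how long it takes to move one intermediate feature map between the CPU and the accelerator. The sketch below is pure arithmetic on assumed numbers: the feature map size and both bandwidth figures are hypothetical, and were chosen only to show the shape of the calculation.

```python
def transfer_time_us(num_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Time in microseconds to move num_bytes at the given bandwidth (GB/s)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e6

# Assumed feature map: 64 channels of 112 x 112 values, 1 byte each.
FEATURE_MAP_BYTES = 64 * 112 * 112

# Assumed bandwidths, for illustration only.
ON_CHIP_GB_PER_S = 100.0   # accelerator as IP inside the SoC
OFF_CHIP_GB_PER_S = 10.0   # accelerator as a separate chip

on_chip_us = transfer_time_us(FEATURE_MAP_BYTES, ON_CHIP_GB_PER_S)
off_chip_us = transfer_time_us(FEATURE_MAP_BYTES, OFF_CHIP_GB_PER_S)

print(f"on-chip:  {on_chip_us:.1f} us")
print(f"off-chip: {off_chip_us:.1f} us")
```

Whatever the real figures are for a given design, the transfer time scales inversely with bandwidth, and a network moves many such feature maps per inference, so the slower off-chip link is paid for at every layer.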
At present, established companies that make deep learning accelerator IP include Ceva, Cadence, etc. Most of these companies' designs are based on existing DSP architectures and are relatively conservative. Of course, some startups have also seen the deep learning accelerator IP market and tried to use a new accelerator architecture design to meet application needs, such as Kneron.
I am personally optimistic about startups that make IP. First, there is real market demand for deep learning accelerator IP. For example, in the HPU processor that Microsoft uses in its AR device HoloLens, the main computing units are purchased accelerator IPs. Second, making IP does not put a startup in direct competition with large chip companies (such as Nvidia and Intel), so the pressure is relatively small. Finally, since making IP requires fewer resources and the product reaches the market faster, less capital is needed to keep the company running, and the risk is lower than making chips directly. It is a relatively safe path.
The deep learning accelerator IP market has both established vendors that use traditional architectures (Ceva, Cadence) and startups that use innovative architectures (Kneron).
Conclusion
FPGA and ASIC have their own strengths in implementing deep learning accelerators. The configurability of FPGA is more suitable for enterprise and military applications, while the high performance and low cost of ASIC are suitable for consumer electronics (including mobile terminals). In order to achieve rapid iteration, ASIC can adopt the SoC+IP model, which also allows small and medium-sized companies that do not have the resources to mass-produce chips to focus on the architecture and front-end design of deep learning accelerator IP and gain a place in the artificial intelligence market.
© This article is an original work of Machine Heart.