Semiconductor manufacturers are developing promising AI inference chips

Publisher: BlissfulDreams | Last updated: 2019-06-11 | Source: ctimes | Keywords: AI

Recent years have seen successive waves of technology booms: the Internet of Things and wearable electronics in 2013 and 2014, artificial intelligence in 2016, and 5G at the end of 2018. Artificial intelligence was already hotly discussed in the 1950s and the 1980s, but it fell into slumps because of technical limitations and inflated expectations. In 2016 it took off again, driven by the growing volume of cloud data and the demand for audio and video recognition (Figure 1).



Figure 1 The third wave of AI


An artificial intelligence application has two stages: the training stage and the inference stage. This is analogous to software: the development phase corresponds to training, and the program running in production corresponds to inference. Put another way, development is the ship being built or repaired in dock, and execution is the ship sailing and operating at sea (Figure 2).



Figure 2 The difference between AI training and inference


The computational requirements of the training and inference stages differ. Training involves a large volume of complex calculations, and because the model's parameters must be tuned as finely as possible, relatively high numerical precision is required. Inference is the opposite: the model is already trained and no longer needs massive computation, and to return results as quickly as possible, lower-precision arithmetic is acceptable.
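As a rough, made-up illustration of this contrast (it is not taken from the article), the sketch below "trains" a one-parameter toy model with many gradient-descent passes over the data, while inference is a single cheap evaluation of the frozen parameter:

```python
import numpy as np

# Toy model y = w * x with synthetic data; all values here are illustrative.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y_true = 3.0 * x + rng.normal(scale=0.1, size=1000)

# Training: many passes over the data, each computing gradients and updating
# the parameter -- the computationally heavy, high-precision stage.
w = 0.0
for _ in range(200):
    grad = np.mean(2.0 * (w * x - y_true) * x)  # d(mean squared error)/dw
    w -= 0.1 * grad

# Inference: a single cheap evaluation with the frozen parameter.
new_input = 0.5
print(w, w * new_input)  # learned weight near 3.0, prediction for the new input
```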


For example, a cat-face recognition application needs thousands of photos of cat faces during training, from which it must extract many fine-grained recognition features. The inference system deployed at the front end, by contrast, only has to decide whether what it sees is a cat: it processes a single image at a time, involves far less computation, and can work with simplified features, so the result (cat or not) can be obtained with simple, fast calculations.


Demand for dedicated chips for inference emerges


In recent years, chips other than CPUs have been widely used to accelerate AI training and inference, including GPGPUs, FPGAs, and ASICs. GPGPUs are the most popular because they have the most complete high-level software ecosystem and support a wide range of AI frameworks. By contrast, FPGAs require developers familiar with low-level hardware circuits, and ASICs are usually optimized for only a limited set of software or frameworks (Table 1). Even so, technology giants are still willing to invest in the harder, more constrained options: Microsoft advocates using FPGAs for AI workloads, and Google developed an ASIC for the TensorFlow framework, the Cloud TPU chip.




In the past, the same chip was usually used for both the development (training) and the execution (inference) of an artificial intelligence model. In the last year or two, however, as training results accumulate and mature models become widespread, the drawbacks of using the same chip for inference have become apparent. Take the GPGPU: it carries a large number of parallel compute units designed for gaming graphics, professional visualization, or high-performance computing, and can process 32- and 64-bit floating-point numbers. That suits the training stage, but at the inference stage 16-bit floating-point, 16-bit integer, or 8-bit integer arithmetic may be enough to obtain the result, and in some cases even 4-bit integers suffice. The high-precision, massively parallel compute units are therefore overkill, wasting both circuitry and power, which is why a dedicated AI inference chip is needed.
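A minimal sketch of why lower precision suffices at inference time, assuming a simple symmetric 8-bit quantization scheme (the weight values and the scheme are illustrative, not any particular chip's method):

```python
import numpy as np

# FP32 weights as they might come out of training (values are made up).
fp32_weights = np.array([0.42, -1.37, 0.05, 2.11, -0.88], dtype=np.float32)

# Symmetric quantization: pick a scale so the largest magnitude maps to 127,
# then store each weight as a single signed 8-bit integer.
scale = float(np.abs(fp32_weights).max()) / 127.0
int8_weights = np.round(fp32_weights / scale).astype(np.int8)

# At inference time the INT8 codes (plus one FP32 scale factor) stand in for
# the original weights; dequantizing shows the approximation error.
recovered = int8_weights.astype(np.float32) * scale
print(int8_weights)                            # e.g. [ 25 -82   3 127 -53]
print(np.abs(recovered - fp32_weights).max())  # worst-case absolute error, small relative to the weight range
```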


Semiconductor manufacturers are developing inference chips


Demand for inference chips only became obvious about two years into the current AI revival, but several products existed before that. For example, Project Tango, unveiled by Google in 2014, used Movidius' Myriad chip (Figure 3).



Figure 3 Intel Movidius Myriad X chip


Movidius later launched the Myriad 2 chip in 2016. Intel acquired Movidius that same year, taking over the Myriad 1/2 series, and subsequently launched the Myriad X chip. Beyond Project Tango, Google also uses Intel/Movidius chips in other hardware, such as the Google Clips AI camera in 2017 and the Google AIY Vision kit for AI vision development in 2018.


What really caught the industry's attention, however, came in 2018: NVIDIA launched the T4 (strictly speaking, an accelerator card built around the chip) (Figure 4), Google launched the Edge TPU chip (Figure 5), and in November 2018 Amazon Web Services announced that it would ship the Inferentia chip in 2019. All of these are inference chips.



Figure 4 NVIDIA's T4 accelerator card on display



Figure 5 The Google Edge TPU is smaller than a one-cent coin


Facebook, for its part, anticipates that many different inference chips will appear over the next few years. To avoid the software-support headaches that such hardware diversity would cause, it proposed the Glow compiler, hoping that all AI chip makers will rally behind this common compilation standard. Intel, Cadence, Marvell, Qualcomm, and Esperanto Technologies (an AI chip startup) have already expressed support.


Facebook has also acknowledged that it is developing its own AI chip in cooperation with Intel. Its senior technical executives say the chip differs from Google's TPU but cannot disclose further technical details. Besides Movidius, Intel acquired another AI company, Nervana Systems, in the same year (2016), and will also develop inference chips based on Nervana's technology.


Inference chips attract not only large companies but also startups. Habana Labs delivered engineering samples of its HL-1000 inference chip to selected customers in September 2018 and will follow up with a PCIe inference accelerator card based on the chip, code-named Goya. Habana Labs claims the HL-1000 is currently the fastest inference chip in the industry (Figure 6).



Figure 6 In addition to the HL-1000 inference chip (used in the Goya card), Habana Labs also launched the Gaudi training chip


Inference chips split into two orientations: cloud data center efficiency and fast local response


As the above shows, many companies have invested in developing inference chips. Strictly speaking, though, these chips follow two orientations: one pursues better cloud data center efficiency, the other faster, more immediate response. In the first, the inference chip sits in a cloud data center and does nothing but inference; compared with a dual-purpose training-and-inference chip, it saves data center space, electricity, and cost. NVIDIA's T4 is an example.


In the second, the inference chip is deployed on site, for example in IoT gateways, access-control cameras, or in-vehicle computers, to perform real-time tasks such as image object recognition; the Intel Movidius Myriad series and the Google Edge TPU fall into this category.


Inference chips installed in a server room can draw essentially unlimited power from the wall, so they can still consume tens of watts; the TDP (thermal design power) of the NVIDIA T4, for example, is 70 W. On-site inference chips, by contrast, must cope with whatever environment they land in, including running on battery alone, so they save power wherever possible; the TDP of the Google Edge TPU is only 1.8 W. The one on-site exception observed so far is automotive use: a car has its own battery, putting its power budget somewhere between a small battery and a wall socket, so the chip can afford higher power consumption and performance.


For fast inference response, chip precision must be adjusted


As mentioned earlier, inference chips usually compute at lower precision so that they can deliver answers quickly and in real time. The 64-bit double-precision (DP) floating-point arithmetic of high-performance computing, or even the 32-bit single-precision (SP) floating point used for gaming and professional graphics, may be unnecessary; precision can drop to 16 bits or fewer.
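To make "dropping to 16 bits" concrete, here is a small illustration (the values are chosen for the example) of what half precision keeps and what it gives up:

```python
import numpy as np

# FP16 keeps roughly 3-4 significant decimal digits, which is often enough
# for a trained model's weights and activations.
x = np.float32(0.1234567)
print(np.float16(x))             # ~0.1235: detail beyond a few digits is rounded away

# FP16 also has a much smaller range than FP32: its largest finite value is 65504.
print(np.finfo(np.float16).max)
```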


For example, the Intel Movidius Myriad X natively supports 16-bit floating point (the newer "half precision", HP) and 8-bit integers; the Google Edge TPU supports only 8-bit and 16-bit integers, with no floating point; the NVIDIA T4 supports 16- and 32-bit floating point as well as 8-bit and 4-bit integers.


Furthermore, an inference chip may use two or more precisions at the same time. The NVIDIA T4, for example, can carry out 16-bit and 32-bit floating-point operations simultaneously, and AWS's yet-to-ship Inferentia is claimed to handle 8-bit integer and 16-bit floating-point operations together (Figure 7). Using two or more precisions at once has acquired its own name: mixed-precision (MP) computation.


image.png

Figure 7 AWS announced that it will launch its own inference chip, Inferentia, in 2019, capable of computing in both integer and floating-point formats. Image source: AWS
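One common mixed-precision pattern can be sketched as follows: low-precision operands are multiplied, but the running sum is kept in a wider format so it does not overflow. This is a generic illustration, not a description of the T4's or Inferentia's internals:

```python
import numpy as np

# Two INT8 vectors standing in for a row of weights and a column of activations.
rng = np.random.default_rng(1)
a = rng.integers(-128, 128, size=256, dtype=np.int8)
b = rng.integers(-128, 128, size=256, dtype=np.int8)

# Each product can reach 127 * 127 = 16129, and summing 256 of them can far
# exceed what INT8 (or even INT16) can hold, so the accumulator is INT32.
acc = int(np.sum(a.astype(np.int32) * b.astype(np.int32)))
print(acc)
```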


Integer and floating-point formats of different bit widths are generally written as INT4, INT8 (integer), FP16, FP32 (floating point), and so on. Some chips also stress support for unsigned formats, which represent only non-negative integers: Habana Labs' HL-1000, for example, supports UINT8/16/32 in addition to INT8/16/32, where the U stands for unsigned.
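As a rough illustration of where an unsigned format helps (the values and the simple scaling scheme below are made up, not the HL-1000's actual arithmetic): activations known to be non-negative, such as ReLU outputs, can use the full 0-255 range of UINT8, whereas a signed INT8 encoding would leave half of its codes unused.

```python
import numpy as np

# Non-negative activations, e.g. outputs of a ReLU layer (values are made up).
relu_outputs = np.array([0.0, 0.7, 1.9, 3.0, 6.35], dtype=np.float32)

# Map the observed range [0, max] onto the full UINT8 code space 0..255.
scale = float(relu_outputs.max()) / 255.0
uint8_codes = np.round(relu_outputs / scale).astype(np.uint8)
print(uint8_codes)   # [  0  28  76 120 255]
```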
