A review of the world's top ten AI training chips: Huawei Ascend 910 is the only Chinese chip selected
Edited by Qian Ming
Quantum Bit Report | Public Account QbitAI
Which AI chip is the best? Now there is a direct comparison to consult.
James W. Hanlon, a senior chip engineer in the UK, has listed the top ten AI training chips.
He also compares them side by side across a range of metrics, making this the most up-to-date survey of AI training chips.
Among them, the Huawei Ascend 910 is the only entry from a Chinese chipmaker, and the comparison shows how its performance stacks up.
△ In the comparison chart, * denotes estimated figures and † denotes single-chip data.
Cerebras Wafer-Scale Engine
Officially launched in August this year, this chip is billed as the "largest AI chip in history" and is named the Cerebras Wafer Scale Engine (WSE).
Its defining feature is that it integrates logic, communication, and memory on a single piece of silicon; it is a chip designed specifically for deep learning.
Set 4 world records in one fell swoop:
- The computing chip with the most transistors: 1.2 trillion in total. Samsung has made a chip with 2 trillion transistors, but it is an eUFS part used for storage.
- The largest chip area: roughly 20 cm × 23 cm, for a total area of 46,225 square millimeters.
- The largest on-chip cache: 18 GB of on-chip SRAM.
- The most compute cores: 410,592 processing cores.
These impressive numbers follow directly from its integration of 84 high-speed interconnected dies. Peak FP32 performance of the single chip is 40 TFLOPs, and it draws 15 kilowatts of power, comparable to a whole AI cluster.
The on-chip cache reaches 18 GB, 3,000 times that of a GPU, and it delivers 9 PB/s of memory bandwidth, 10,000 times that of a GPU (a quick sanity check of these multipliers appears below).
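As a rough sanity check, both multipliers line up with the V100 figures quoted later in this article (6 MB of L2 cache and 900 GBps of HBM2 bandwidth); here is a minimal sketch, with the GPU baseline taken from those later sections:

```python
# Rough check of the WSE-vs-GPU multipliers, using the V100 figures
# quoted later in this article (6 MB L2 cache, 900 GBps HBM2).
wse_sram_gb = 18              # WSE on-chip SRAM
wse_bw_gbps = 9e6             # 9 PB/s expressed in GB/s

gpu_cache_gb = 6 / 1024       # V100 L2 cache: 6 MB
gpu_bw_gbps = 900             # V100 HBM2 bandwidth

print(wse_sram_gb / gpu_cache_gb)  # ~3072 -> the "3,000x" cache claim
print(wse_bw_gbps / gpu_bw_gbps)   # 10000 -> the "10,000x" bandwidth claim
```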
Wafer-scale integration is not a new idea, but problems with yield, power delivery, and thermal expansion have made it hard to commercialize. Cerebras has solutions for each:
- To address the low yield caused by defects, Cerebras designed in 1~1.5% redundancy: spare cores are added, and any core that develops a fault is disabled, so an impurity does not scrap the whole chip.
- Cerebras partnered with TSMC to invent new techniques for the etching and cross-die communication challenges of a chip with over a trillion transistors.
- A "cold plate" sits above the chip, with multiple vertically mounted water pipes cooling the silicon directly.
Cerebras was founded in 2016 by Sean Lie (chief hardware architect), Andrew Feldman (CEO), and others. Feldman previously founded the microserver company SeaMicro, which he sold to AMD for $334 million.
The company has 194 employees in California, 173 of them engineers, and has raised $112 million to date from venture capital firms such as Benchmark.
Further reading:
The largest AI chip in history is born: 462 square centimeters, 400,000 cores, 1.2 trillion transistors, setting 4 world records
Google TPU (v1, v2, v3)
Google's TPU series was first released in 2016. The first generation, TPU v1, was used only for inference and supported only integer operations.
The host CPU sends it instructions over PCIe-3 to perform matrix multiplications and apply activation functions; providing acceleration this way saved a lot of design and verification time. Its main data:
- Chip area 331 square millimeters, 28nm process
- 700 MHz clock, 28-40W power consumption
- 28 MB of on-chip SRAM: 24 MB for activations and 4 MB for accumulators
- Chip area breakdown: 35% memory, 24% matrix multiply unit, and the remaining 41% logic
- 256x256x8b systolic matrix multiply unit (64K MACs/cycle); see the worked throughput check after the IO data below
- INT8 and INT16 arithmetic (peaks of 92 and 23 TOPS respectively)
IO data:
- 8 GB DDR3-2133 DRAM, accessible via two interfaces at 34 GB/s
- PCIe-3 x16 (14 GBps)
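The 92 TOPS INT8 peak follows directly from the systolic array dimensions and the clock listed above; a quick check:

```python
# TPU v1 peak INT8 throughput from the systolic array size and clock.
macs_per_cycle = 256 * 256    # 256x256 array -> 64K MACs per cycle
ops_per_mac = 2               # one multiply plus one add
clock_hz = 700e6              # 700 MHz

peak_tops = macs_per_cycle * ops_per_mac * clock_hz / 1e12
print(peak_tops)              # ~91.8, matching the quoted 92 TOPS
```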
In May 2017, Google released TPU v2, which added floating-point capability and increased memory capacity and bandwidth with integrated HBM; it can be used for training as well as inference. Single-chip data:
- 20nm process, power consumption 200-250W (estimated)
- 45 TFLOPs on BFloat16, with FP32 also supported
- Dual core, with scalar and matrix units
- 180 TFLOPs peak performance for a four-chip assembly
Single core data:
- 128x128x32b systolic matrix unit (MXU)
- 8 GB of dedicated HBM, 300 GBps access bandwidth
- 22.5 TFLOPs maximum BFloat16 throughput
IO data:
- 16 GB of integrated HBM memory, 600 GBps bandwidth (estimated)
- PCIe-3 x8 (8 GBps)
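The quoted TPU v2 numbers are self-consistent across the hierarchy: two 22.5 TFLOPs cores per chip, four chips per assembly. A quick check:

```python
# TPU v2 peak performance across the core/chip/board hierarchy.
core_tflops = 22.5                  # BFloat16 peak per core
chip_tflops = core_tflops * 2       # dual-core chip -> 45 TFLOPs
board_tflops = chip_tflops * 4      # four-chip assembly -> 180 TFLOPs
print(chip_tflops, board_tflops)
```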
One year after TPU v2, Google released the next version of the chip, TPU v3.
Few details are available, but it is likely an incremental revision of TPU v2: doubling performance and moving to HBM2 to double memory capacity and bandwidth. Single-chip data:
- 16nm or 12nm process, power consumption estimated at 200W
- 105 TFLOPs on BFloat16, likely achieved by doubling the MXUs to four
- Each MXU has access to 8 GB of dedicated memory
- 420 TFLOPs peak performance for a four-chip assembly
IO data:
- 32 GB of integrated HBM2 memory, 1200 GBps bandwidth (estimated)
- PCIe-3 x8 (8 GBps) (estimated)
Further reading:
Want to learn more about TPU 3.0? Jeff Dean recommends watching this video
Graphcore IPU
Founded in 2016, Graphcore is favored by investors and industry giants alike, and has won recognition from leading researchers.
In December 2018, it closed a $200 million Series D round at a valuation of $1.7 billion. Investors include industry giants such as BMW and Microsoft, along with well-known venture firms such as Sofina and Atomico.
AI pioneer Geoffrey Hinton and DeepMind founder Demis Hassabis have both praised it directly.
The Graphcore IPU is the company's star product. Its architecture is highly parallel: a large number of simple processors, each with a small memory, connected by a high-bandwidth "switch" interconnect.
It operates under the Bulk Synchronous Parallel (BSP) model, in which program execution proceeds as a series of compute and exchange phases; barrier synchronization ensures all processes are ready before the exchange begins.
BSP is a powerful programming abstraction that rules out concurrency hazards, and alternating compute and exchange phases lets the chip use its power budget more effectively (a minimal sketch of the BSP pattern follows the spec lists below). Larger IPU systems can be built by connecting chips through their 10 inter-IPU links. Its core data is as follows:
- 16nm process, 23.6 billion transistors, chip area about 800 square millimeters, 150W power consumption (300W for the PCIe card)
- 1,216 processors; FP16 arithmetic peaks at 125 TFLOPs with FP32 accumulation
- 300 MB of on-chip memory distributed across the processor cores, with 45 TBps of aggregate access bandwidth
- All model state is held on-chip; there is no directly attached DRAM
IO data:
- 2x PCIe-4 host transfer links
- 10x inter-card IPU-Links
- 384 GBps total transfer bandwidth
Single core data:
- Mixed-precision floating point with stochastic rounding
- Runs up to six threads
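To make the BSP execution model concrete, here is a minimal sketch in Python, using threads and a barrier; this illustrates the compute/sync/exchange pattern only and is not Graphcore's actual Poplar API:

```python
import threading

N = 4                               # workers standing in for IPU tiles
barrier = threading.Barrier(N)      # the BSP synchronization point
inbox = [0] * N                     # per-worker mailbox for the exchange phase

def worker(i: int, steps: int) -> None:
    local = i                       # private per-worker state
    for _ in range(steps):
        local += 1                  # compute phase: purely local work
        barrier.wait()              # sync: all workers finish computing
        inbox[(i + 1) % N] = local  # exchange phase: pass result to a neighbor
        barrier.wait()              # sync: all workers finish exchanging
        local = inbox[i]            # consume the value received this superstep

threads = [threading.Thread(target=worker, args=(i, 3)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(inbox)                        # final exchanged values
```

Because every worker hits the same barriers in the same order, no data race is possible between a write in one exchange phase and a read in the next: exactly the concurrency guarantee the BSP model provides.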
Further reading:
Two years after its founding, this AI chip company valued at $1.7 billion has received investment from BMW and Microsoft
Habana Labs Gaudi
Habana Labs, also founded in 2016, is an Israeli AI chip company.
In November 2018, it completed a $75 million Series B round, bringing its total funding to approximately $120 million.
The Gaudi chip was unveiled in June this year and competes directly with Nvidia's V100.
Its overall design resembles a GPU, notably in its wide SIMD parallelism and HBM2 memory.
The chip integrates ten 100G Ethernet links supporting remote direct memory access (RDMA). Unlike Nvidia's NVLink or OpenCAPI, this lets large-scale systems be built from commodity networking equipment. Its core data is as follows:
- TSMC 16nm process with CoWoS packaging, die size about 500 square millimeters
- Heterogeneous architecture: a GEMM engine plus 8 tensor processing cores (TPCs)
- Shared SRAM memory
- 200W power consumption for the PCIe card, 300W for the mezzanine card
- On-chip memory size undisclosed
TPC core data:
- VLIW SIMD parallelism with a local SRAM memory
- Mixed-precision support: FP32, BF16, and integer formats (INT32, INT16, INT8, UINT32, UINT8)
- Random number generation and transcendental functions: Sigmoid, Tanh, GeLU
IO data:
- 4x 8 GB HBM2-2000 DRAM stacks (32 GB total), providing 1 TBps
- 10x 100GbE on-chip interfaces, supporting RDMA over Converged Ethernet (RoCE v2)
- PCIe-4 x16 host interface
Huawei Ascend 910
Huawei Ascend 910, which is also directly targeted at Nvidia V100, was officially put into commercial use in August this year and is known as the most powerful AI training chip in the industry. It focuses on deep learning training scenarios and its main customers are AI data scientists and engineers.
Its core data are:
- 7nm+ EUV process, 456 square millimeters
- Integrates four 96 mm² HBM2 stacks and a Nimbus IO processor die
- 32 DaVinci cores
- 256 TFLOPs peak FP16 performance (32 × 4096 × 2), with INT8 at twice that; see the worked check after the core data below
- 32 MB of on-chip SRAM (L2 cache)
- 350W power consumption
Interconnect and IO data:
- Cores interconnected in a 6 x 4 2D mesh packet-switched network, with 128 GBps of bidirectional bandwidth per core
- 4 TBps L2 cache access
- 1.2 TBps HBM2 access bandwidth
- 3x 30 GBps inter-chip IOs
- 2x 25 GBps RoCE network interfaces
Single DaVinci core data:
- 3D 16x16x16 matrix multiply unit providing 4,096 FP16 MACs and 8,192 INT8 MACs
- 2,048-bit SIMD vector operations for FP32 (x64), FP16 (x128), and INT8 (x256)
- Scalar operations also supported
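The (32 × 4096 × 2) formula implies the clock needed to hit the 256 TFLOPs peak; the derived clock below is an inference from the quoted numbers, not an official Huawei figure:

```python
# Clock frequency implied by Ascend 910's quoted FP16 peak.
cores = 32                  # DaVinci cores
macs_per_core = 4096        # FP16 MACs from the 16x16x16 cube unit
ops_per_mac = 2             # multiply + add

flops_per_cycle = cores * macs_per_core * ops_per_mac   # 262,144
implied_clock_ghz = 256e12 / flops_per_cycle / 1e9
print(implied_clock_ghz)    # ~0.98 GHz implied core clock
```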
Further reading:
Huawei's most powerful AI chip goes commercial: twice as powerful as the Nvidia V100, with an open-source AI framework benchmarked against TensorFlow and PyTorch
Intel NNP-T
This is Intel's second foray into AI training chips after Xeon Phi. It took four years, involved the acquisition of four startups, and cost more than $500 million. It was released in August this year.
The "T" in the neural network training processor NNP-T stands for Train, which means that this chip is used for AI reasoning. The processor code name is Spring Crest.
The NNP-T will be manufactured by Intel's competitor TSMC using the 16nm FF+ process.
The NNP-T packs 27 billion transistors into a 680-square-millimeter die on a 16nm process, in a 60mm x 60mm 2.5D package, and contains a grid of 24 tensor processors.
The core clock reaches up to 1.1 GHz; there is 60 MB of on-chip memory and four 8 GB HBM2-2000 stacks; it uses a PCIe-4 x16 interface and has a TDP of 150~250W.
Each tensor processing unit has a microcontroller that directs the operations of a math coprocessor and can be extended with custom microcontroller instructions.
NNP-T supports the three mainstream machine learning frameworks TensorFlow, PyTorch, and PaddlePaddle, as well as nGraph, a C++ deep learning library and compiler.
In terms of computing power, the chip can reach up to 119 trillion operations per second (119 TOPS), though Intel did not disclose whether this figure is for INT8 or INT4.
For comparison, Nvidia's Tesla T4 delivers 130 TOPS on INT8 and 260 TOPS on INT4.
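Dividing the headline figure by the core count and peak clock gives a rough per-core width; this is a back-of-envelope inference, since Intel has not published the breakdown:

```python
# Rough per-core throughput implied by NNP-T's headline number.
peak_ops = 119e12           # 119 TOPS (precision undisclosed)
cores = 24                  # tensor processors
clock_hz = 1.1e9            # top core frequency

ops_per_core_per_cycle = peak_ops / (cores * clock_hz)
print(ops_per_core_per_cycle)   # ~4500 ops per cycle per tensor processor
```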
Further reading:
Intel's first AI chip is finally released: usable for both training and inference, four years in the making, with $500 million spent acquiring 4 companies
Nvidia Volta architecture chip
Nvidia's Volta architecture, announced in May 2017, upgraded on the Pascal architecture by introducing Tensor Cores, along with HBM2 and NVLink 2.0.
The Nvidia V100 was the first GPU chip based on this architecture. Its core data:
- TSMC 12nm FFN process, 21.1 billion transistors, 815 square millimeters
- 300W power consumption, 6 MB of L2 cache
- 84 SMs, each containing 64 FP32 CUDA cores, 32 FP64 CUDA cores, and 8 Tensor Cores (5,376 FP32 cores, 2,688 FP64 cores, and 672 TCs in total)
- A single Tensor Core performs 64 FMA operations per clock (128 FLOPS in total), and each SM has 8 of them, for 1,024 FLOPS per clock per SM; see the worked peak-throughput check after the IO data below
- By comparison, even with pure FP16 operations, the SM's standard CUDA cores produce only 256 FLOPS per clock
- Per SM: 128 KB of L1 data cache/shared memory and four register files of 16K 32-bit registers each
IO data:
- 32 GB HBM2 DRAM, 900 GBps bandwidth
- 300 GBps NVLink 2.0
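Combining the per-SM Tensor Core figures above with a boost clock reproduces the V100's well-known ~125 TFLOPs deep learning peak. Note two assumptions not stated above: shipping V100 parts enable 80 of the 84 SMs, and the boost clock is about 1,530 MHz:

```python
# V100 Tensor Core peak from the per-SM figures above.
flops_per_tc = 128          # 64 FMAs x 2 FLOPs per Tensor Core per clock
tcs_per_sm = 8
sms = 80                    # enabled SMs on shipping V100 parts (assumption)
boost_hz = 1.53e9           # ~1530 MHz boost clock (assumption)

peak_tflops = flops_per_tc * tcs_per_sm * sms * boost_hz / 1e12
print(peak_tflops)          # ~125 TFLOPs FP16 Tensor peak
```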
Nvidia Turing architecture chip
The Turing architecture, released in September 2018, is an upgrade of Volta, though with fewer CUDA cores and Tensor Cores.
It is therefore smaller and draws less power. Beyond machine learning tasks, it is also designed for real-time ray tracing. Its core data are:
- TSMC 12nm FFN process, 754 square millimeters, 18.6 billion transistors, 260W power consumption
- 72 SMs, each containing 64 FP32 cores, 64 INT32 cores, and 8 Tensor Cores (4,608 FP32 cores, 4,608 INT32 cores, and 576 TCs in total)
- Peak performance at boost clock: 16.3 TFLOPs FP32, 130.5 TFLOPs FP16, 261 TOPS INT8, 522 TOPS INT4; see the consistency check after the IO data below
- 24.5 MB of on-chip memory, split between 6 MB of L2 cache and the 256 KB register file in each SM
- 1,455 MHz base clock
IO data:
- 12x 32-bit GDDR6 memory channels, providing 672 GBps of aggregate bandwidth
- 2x NVLink x8 links, each providing up to 26 GBps of bidirectional bandwidth
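The FP32 and FP16 peaks above are consistent with the core counts at a boost clock of roughly 1,770 MHz (an assumption; only the 1,455 MHz base clock is listed), assuming Turing's Tensor Cores match Volta's 64 FMAs per clock:

```python
# Turing peak throughput from core counts and an assumed boost clock.
boost_hz = 1.77e9                           # ~1770 MHz boost (assumption)

fp32_tflops = 4608 * 2 * boost_hz / 1e12    # FMA counts as 2 FLOPs
print(fp32_tflops)                          # ~16.3, matching the FP32 peak

fp16_tflops = 576 * 128 * boost_hz / 1e12   # 64 FMAs x 2 per Tensor Core
print(fp16_tflops)                          # ~130.5, matching the FP16 peak
```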
Reference:
https://www.jameswhanlon.com/new-chips-for-machine-intelligence.html
-over-