Habana Labs has unveiled the Gaudi HL-2000, a custom AI processor that the company claims can outperform Nvidia’s best and brightest GPUs at training neural networks. Along with the release of the new chip, the Tel Aviv-based startup is launching a range of Gaudi-based PCIe cards, as well as an eight-processor server that can be used as the basis for building very large training clusters.
Gaudi represents Habana's second foray into the AI market. The company began shipping its Goya inference cards to customers in the fourth quarter of 2018. As we reported at the time, the HL-1000-powered Goya delivered more than 4x the throughput, 2x the energy efficiency, and half the latency when performing inference on ResNet-50 compared to Nvidia's V100 GPU. Habana has already amassed nearly 20 Goya customers who are currently evaluating the technology, according to Habana Chief Business Officer Eitan Medina.
The new HL-2000 was announced on Monday as a counterpart to the HL-1000. Again, using ResNet-50, Gaudi demonstrated that it could achieve 1,650 images per second with a batch size of 64. (For the V100, the best training result we could find was 1,360 images per second with an unspecified batch size.) “The fundamental properties that allow us to achieve this performance with small batch sizes have to do with the core architecture — it was designed from the ground up, rather than relying on older architectures like GPUs or classic CPUs,” Medina told The Next Platform.
Habana disclosed little about the chip's internals, saying only that it is based on second-generation Tensor Processing Cores (TPCs), the first generation of which went into its inference chips. Medina told us that the Gaudi processor supports typical floating-point formats used for training, such as FP32 and bfloat16, as well as some integer formats. On-package memory takes the form of 32GB of HBM2, mirroring what's available on GPU accelerators such as Nvidia's V100 and AMD's Radeon Instinct MI60.
Habana did not reveal any raw performance figures for the new processors. "If I told you how many multipliers I put on the chip and how often they run, but the architecture didn't allow you to use them, then all I would do is mislead you," Medina explained. According to him, their chips can achieve higher utilization than GPUs due to their clean-sheet design.
Perhaps Gaudi’s biggest potential advantage will be its ability to deliver performance at scale, which has been a challenge for building larger, more complex neural networks. For most training setups, performance levels off once you get beyond eight or 16 accelerators—that is, once you leave the server chassis. That’s not the case with Gaudi’s technology, Medina said. He noted that the same ResNet-50 training scaled to hundreds of HL-2000 processors with near-linear performance gains. Compared to the V100, the Habana technology is able to deliver a 3.8x throughput advantage at the 650-processor level.
Habana achieves this by building a lot of network bandwidth into its Gaudi chips, in the form of RDMA over Converged Ethernet (RoCE). The reasoning behind using Ethernet (rather than something more exotic like NVLink or OpenCAPI) is that it enables customers to easily drop Habana hardware into existing data centers, as well as build AI clusters using standard Ethernet switches from a variety of network providers.
In the case of the HL-2000 processor, ten 100GbE interfaces are integrated on the chip. Some of them can be used to connect to other HL-2000 processors within the node, and the rest can be used for inter-processor communication across nodes. The latter feature eliminates the need for a NIC.
You can see this working in Habana's own HLS-1 system, a 3U DGX-like box with eight HL-2000 processors. Internally, seven of each chip's 100GbE links connect the HL-2000 processors to one another in a non-blocking, all-to-all fashion, while the remaining three links per chip are exposed externally for building larger clusters, giving the box 24 external 100GbE ports. Connecting to host servers or flash storage does not take up Ethernet bandwidth; for this purpose, Habana provides four PCIe Gen4 x16 interfaces.
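The HLS-1 port counts follow from simple arithmetic on the figures quoted above. The sketch below is illustrative only: the function name `hls1_port_budget` and the dictionary keys are made up for this example, and it assumes exactly one direct 100GbE link between each pair of processors in the chassis.

```python
# Port budget for one HLS-1 chassis, using only figures quoted in the article.
PROCESSORS = 8             # HL-2000 chips per HLS-1
PORTS_PER_PROCESSOR = 10   # integrated 100GbE interfaces per chip


def hls1_port_budget(processors=PROCESSORS, ports=PORTS_PER_PROCESSOR):
    """Split each chip's ports into all-to-all internal links and external ports.

    Assumption (not stated by Habana): one direct link per pair of chips.
    """
    internal_per_chip = processors - 1               # one link to every other chip
    external_per_chip = ports - internal_per_chip    # whatever is left over
    if external_per_chip < 0:
        raise ValueError("not enough ports for a full all-to-all mesh")
    return {
        "internal_per_chip": internal_per_chip,
        "external_per_chip": external_per_chip,
        # Each internal link is shared by two chips, so divide by two.
        "internal_links": processors * internal_per_chip // 2,
        "external_ports": processors * external_per_chip,
    }


budget = hls1_port_budget()
print(budget)
```

With eight chips and ten ports each, this gives seven internal and three external ports per chip, matching the 24 external ports described for the HLS-1.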
Contrast this with a typical GPU-accelerated server, which is often limited by a single network interface. The best of the best in this regard is Nvidia’s latest 16-GPU DGX-2 system, which comes with up to eight 100G ports, but this is still a fraction of what the 24-port HLS-1 offers.
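To put that comparison on a per-accelerator basis, here is a rough calculation using only the port counts quoted above. The helper name is hypothetical, the DGX-2 figure is the "up to eight" upper bound, and the result says nothing about achieved training throughput.

```python
def external_gbps_per_accelerator(ports, gbps_per_port, accelerators):
    """Nominal scale-out bandwidth available to each accelerator in a box."""
    return ports * gbps_per_port / accelerators


# HLS-1: 24 external 100GbE ports shared by 8 HL-2000 processors.
hls1 = external_gbps_per_accelerator(ports=24, gbps_per_port=100, accelerators=8)

# DGX-2: up to 8 100G ports shared by 16 GPUs (best case for the DGX-2).
dgx2 = external_gbps_per_accelerator(ports=8, gbps_per_port=100, accelerators=16)

print(f"HLS-1: {hls1:.0f} Gb/s per processor")
print(f"DGX-2: {dgx2:.0f} Gb/s per GPU")
print(f"Ratio: {hls1 / dgx2:.1f}x")
```

On these nominal numbers the HLS-1 has three times the aggregate external bandwidth of a fully configured DGX-2, and six times as much per accelerator.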
A rack of the Habana Gaudi system can be built by interleaving six HLS-1 servers with six CPU host servers (HLS-1 has no host processor), plus an Ethernet switch on the top of the rack. Such racks can be linked together to build arbitrarily large clusters. While the lack of an onboard host processor may be a turnoff for some, it does allow customers to choose the model and brand of CPU and gives them the ability to fine-tune the ratio of CPU cores to AI accelerators.
Customers who want to build their own Gaudi-based systems can use Habana's HL-200 PCIe card, which provides eight 100GbE ports, or the HL-205 mezzanine card, which has 20 56Gbps SerDes interfaces, enough to support ten 100GbE or 20 50GbE RoCE ports. The HL-200 consumes 200 watts of power, while the HL-205 consumes 300 watts.
The mezzanine card is the basis of Habana's HLS-1 server, but it can also be used to build larger systems. For example, if you drop down to 50GbE for all-to-all connectivity in the chassis, you can use 16 HL-205 cards to build a 16-processor chassis and still leave 32 100GbE ports for expansion. If you want to build a smaller server, you can daisy-chain up to eight HL-200 cards in a single chassis.
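One way to check that the 16-card configuration fits within the HL-205's 20 SerDes lanes is sketched below. It rests on two assumptions that Habana has not confirmed: each card uses a single 50GbE lane to reach each peer, and a 100GbE port occupies two lanes (implied by 20 lanes yielding ten 100GbE ports). The function name is hypothetical.

```python
LANES_PER_CARD = 20   # 56Gbps SerDes interfaces on one HL-205
LANES_PER_100GBE = 2  # assumption: 20 lanes -> ten 100GbE ports


def lane_budget(cards, external_100gbe_ports):
    """Lane usage per card for an all-to-all 50GbE chassis (illustrative only)."""
    if external_100gbe_ports % cards:
        raise ValueError("external ports must divide evenly across cards")
    internal_lanes = cards - 1  # assumption: one 50GbE lane per peer
    external_per_card = external_100gbe_ports // cards
    external_lanes = external_per_card * LANES_PER_100GBE
    spare = LANES_PER_CARD - internal_lanes - external_lanes
    if spare < 0:
        raise ValueError("configuration needs more lanes than the card has")
    return {
        "internal_lanes": internal_lanes,
        "external_100gbe_per_card": external_per_card,
        "external_lanes": external_lanes,
        "spare_lanes": spare,
    }


chassis = lane_budget(cards=16, external_100gbe_ports=32)
print(chassis)
```

Under these assumptions each card spends 15 lanes on the in-chassis mesh and four lanes on two external 100GbE ports, leaving one lane unused, which is consistent with the 32 expansion ports quoted.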
Incidentally, the mezzanine card supports the OCP Accelerator Module (OAM) specification, an open hardware compute accelerator module format developed by Facebook, Microsoft, and Baidu. This tells us a lot about where Habana is targeting this particular product.
Unlike Nvidia with NVLink, Habana does not support a cache-coherent global memory space across multiple processors. Gaudi's designers believe that cache coherence is a performance killer that does not scale effectively beyond a small number of accelerators. From their perspective, scaling neural network training is fundamentally a networking problem, and RDMA is a far more efficient way to reach larger models.
Habana's competition may also be moving toward this way of thinking. As Medina points out, at the recent GTC conference, Nvidia CEO Jensen Huang touted RoCE as a way to greatly improve the scalability of deep learning workloads. This suggests that Nvidia has some very specific ideas about leveraging Mellanox's Ethernet technology once the GPU maker's acquisition of that company is completed later this year.
In terms of software, Gaudi is equipped with Habana's AI software stack, called SynapseAI. It consists of a graph compiler, runtime, debugger, deep learning library, and driver. At this point, Habana supports TensorFlow to build models, but Medina said that over time, they will add support for PyTorch and other machine learning frameworks.
There may still be a long way to go from evaluation systems to production deployments, but if Habana technology delivers as promised, the AI market will happily shift in pursuit of better performance. Still, Nvidia has proven itself to be a fast-moving target when it comes to AI hardware, both for startups and established chipmakers like Intel and AMD. One thing is certain: The demand for bigger and better AI is creating a highly competitive market where nimble execution by engineering teams is almost as important as architectural design.
Habana will make the Gaudi platform available to selected customers in the second half of 2019. Pricing has not yet been revealed, although Medina tells us that the Gaudi will be “competitive” with similar products on the market.