Habana Labs' most powerful AI processor competes with Nvidia-EEWORLD

Habana Labs has unveiled the Gaudi HL-2000, a custom AI processor that the company claims can outperform Nvidia’s best and brightest GPUs at training neural networks. Along with the release of the new chip, the Tel Aviv-based startup is launching a range of Gaudi-based PCIe cards, as well as an eight-processor server that can be used as the basis for building very large training clusters.

Gaudi represents Habana's second foray into the AI market. The company began shipping its Goya inference cards to customers in the fourth quarter of 2018. As we reported at the time, the HL-1000-powered Goya delivered more than 4x the throughput, 2x the energy efficiency, and half the latency when performing inference on ResNet-50 compared to Nvidia's V100 GPU. Habana has already amassed nearly 20 Goya customers who are currently evaluating the technology, according to Habana Chief Business Officer Eitan Medina.

The new HL-2000 was announced on Monday as a counterpart to the HL-1000. Again, using ResNet-50, Gaudi demonstrated that it could achieve 1,650 images per second with a batch size of 64. (For the V100, the best training result we could find was 1,360 images per second with an unspecified batch size.) “The fundamental properties that allow us to achieve this performance with small batch sizes have to do with the core architecture — it was designed from the ground up, rather than relying on older architectures like GPUs or classic CPUs,” Medina told The Next Platform.

Habana disclosed little about the chip's internals, saying only that it is based on the second-generation Tensor Processing Core (TPC), the first generation of which went into its inference chip. Medina told us that the Gaudi processor supports typical floating-point formats used for training, such as FP32 and bfloat16, as well as some integer formats. On-package memory takes the form of 32GB of HBM2, mirroring what's available on GPU accelerators such as Nvidia's V100 and AMD's Radeon Instinct MI60.

Habana did not reveal any raw performance figures for the new processors. "If I told you how many multipliers I put on the chip and how often they run, but the architecture didn't allow you to use them, then all I would do is mislead you," Medina explained. According to him, their chips can achieve higher utilization than GPUs due to their clean-sheet design.

Perhaps Gaudi’s biggest potential advantage will be its ability to deliver performance at scale, which has been a challenge for building larger, more complex neural networks. For most training setups, performance levels off once you get beyond eight or 16 accelerators—that is, once you leave the server chassis. That’s not the case with Gaudi’s technology, Medina said. He noted that the same ResNet-50 training scaled to hundreds of HL-2000 processors with near-linear performance gains. Compared to the V100, the Habana technology is able to deliver a 3.8x throughput advantage at the 650-processor level.

Habana achieves this by building a large amount of network bandwidth into its Gaudi chips, in the form of RDMA over Converged Ethernet (RoCE). The reasoning behind using Ethernet (rather than something more exotic like NVLink or OpenCAPI) is that it enables customers to easily drop Habana hardware into existing data centers, as well as build AI clusters using standard Ethernet switches from a variety of network vendors.

In the case of the HL-2000 processor, ten 100GbE interfaces are integrated on the chip. Some of them can be used to connect to other HL-2000 processors within the node, and the rest can be used for inter-processor communication across nodes. Having the interfaces on the chip eliminates the need for a separate NIC.

You can see this at work in Habana's own HLS-1 system, a 3U DGX-like box with eight HL-2000 processors. Internally, seven of each chip's 100GbE links are used to connect the HL-2000 processors to one another in a non-blocking, all-to-all fashion, while the remaining three links on each chip are exposed externally for building larger clusters, which gives the box 24 external 100GbE ports. Connecting to host servers or flash storage does not take up Ethernet bandwidth. For this purpose, Habana provides four PCIe Gen4 x16 interfaces.
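The port split described above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function name and its parameters are made up for this example and are not part of any Habana tooling. The only figures taken from the article are eight processors with ten 100GbE ports each.

```python
def node_port_budget(num_chips: int, ports_per_chip: int) -> dict:
    """Split each chip's ports between in-node all-to-all links and external links.

    Hypothetical helper for illustration; assumes one direct link per pair of chips.
    """
    internal_per_chip = num_chips - 1  # one direct link to every other chip in the node
    if internal_per_chip > ports_per_chip:
        raise ValueError("not enough ports for a direct all-to-all mesh")
    external_per_chip = ports_per_chip - internal_per_chip
    return {
        "internal_per_chip": internal_per_chip,
        "external_per_chip": external_per_chip,
        "external_total": external_per_chip * num_chips,
        # each internal link is shared by two chips, so halve the port count
        "internal_links": num_chips * internal_per_chip // 2,
    }


hls1 = node_port_budget(num_chips=8, ports_per_chip=10)
print(hls1)
```

With the HLS-1 figures, the sketch gives seven internal and three external ports per chip, and 24 external ports for the box, matching the numbers quoted in the article.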

Contrast this with a typical GPU-accelerated server, which is often limited by a single network interface. The best of the best in this regard is Nvidia’s latest 16-GPU DGX-2 system, which comes with up to eight 100G ports, but this is still a fraction of what the 24-port HLS-1 offers.
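To put the comparison in numbers, the sketch below works out nominal external bandwidth from the port counts quoted above. It assumes every port runs at its full 100 Gb/s line rate and ignores protocol overhead, so the results are upper bounds rather than measured throughput. The function name is hypothetical.

```python
PORT_GBPS = 100  # nominal line rate of one 100GbE / 100G port


def external_bandwidth(ports: int, accelerators: int, port_gbps: int = PORT_GBPS):
    """Return (total Gb/s, Gb/s per accelerator) for a box's external ports."""
    total = ports * port_gbps
    return total, total / accelerators


# Port and accelerator counts as quoted in the article.
hls1_total, hls1_per_acc = external_bandwidth(ports=24, accelerators=8)
dgx2_total, dgx2_per_acc = external_bandwidth(ports=8, accelerators=16)

print(f"HLS-1: {hls1_total} Gb/s total, {hls1_per_acc} Gb/s per accelerator")
print(f"DGX-2: {dgx2_total} Gb/s total, {dgx2_per_acc} Gb/s per accelerator")
```

Under these assumptions, the HLS-1 has three times the nominal external bandwidth of a fully configured DGX-2 in total, and six times as much per accelerator.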

A rack of Habana Gaudi systems can be built by interleaving six HLS-1 servers with six CPU host servers (the HLS-1 has no host processor), plus an Ethernet switch at the top of the rack. Such racks can be linked together to build arbitrarily large clusters. While the lack of an onboard host processor may be a turnoff for some, it does allow customers to choose the model and brand of CPU and gives them the ability to fine-tune the ratio of CPU cores to AI accelerators.

Customers who want to build their own Gaudi-based systems can use Habana's HL-200 PCIe card, which provides eight 100GbE ports, or the HL-205 mezzanine card, which has 20 56Gbps SerDes interfaces, enough to support ten 100GbE or 20 50GbE RoCE ports. The HL-200 consumes 200 watts of power, while the HL-205 consumes 300 watts.

The mezzanine card is the basis of Habana's HLS-1 server, but it can also be used to build larger systems. For example, if you drop down to 50GbE for all-to-all connectivity in the chassis, you can use 16 HL-205 cards to build a 16-processor chassis and still leave 32 100GbE ports for expansion. If you want to build a smaller server, you can daisy-chain up to eight HL-200 cards in a single chassis.
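The 16-card figure can be reproduced by budgeting SerDes lanes. The sketch below assumes that a 50GbE port uses one lane and a 100GbE port uses two, which is inferred from the statement that 20 lanes support either 20 50GbE or ten 100GbE ports. The function and parameter names are hypothetical.

```python
def lane_budget(num_cards: int, lanes_per_card: int = 20):
    """Return (external 100GbE ports, unused lanes) for an all-to-all 50GbE chassis.

    Assumes one SerDes lane per 50GbE port and two lanes per 100GbE port.
    """
    internal_lanes = num_cards - 1  # one single-lane 50GbE link to each peer card
    spare = lanes_per_card - internal_lanes
    if spare < 0:
        raise ValueError("not enough lanes for a direct all-to-all mesh")
    external_100g_per_card = spare // 2  # pair up the spare lanes into 100GbE ports
    unused_per_card = spare % 2          # an odd spare lane cannot form a 100GbE port
    return external_100g_per_card * num_cards, unused_per_card * num_cards


ports_100g, unused_lanes = lane_budget(num_cards=16)
print(ports_100g, unused_lanes)
```

Each card spends 15 lanes on the mesh, which leaves five. Four of those pair up into two 100GbE ports per card, for 32 across the chassis, with one lane per card left over.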

Incidentally, the mezzanine card supports the OCP Accelerator Module (OAM) specification, an open hardware compute accelerator module format developed by Facebook, Microsoft, and Baidu. This tells us a lot about where Habana is targeting this particular product.

Unlike Nvidia with NVLink, Habana does not support a cache-coherent global memory space across multiple processors. Gaudi's designers believe that cache coherence is a performance killer that does not scale effectively beyond a small number of accelerators. From their perspective, achieving scalability for training neural networks is fundamentally a networking problem, and RDMA can be used to train larger models very efficiently.

Habana's competition may also be moving toward this way of thinking. As Medina points out, at the recent GTC conference, Nvidia CEO Jensen Huang touted RoCE as a way to greatly improve the scalability of deep learning workloads. This suggests that Nvidia has some very specific ideas about leveraging Mellanox's Ethernet technology once the GPU maker's acquisition of Mellanox is completed later this year.

In terms of software, Gaudi is equipped with Habana's AI software stack, called SynapseAI. It consists of a graph compiler, runtime, debugger, deep learning library, and driver. At this point, Habana supports TensorFlow to build models, but Medina said that over time, they will add support for PyTorch and other machine learning frameworks.

There may still be a long way to go from evaluation systems to production deployments, but if Habana technology delivers as promised, the AI market will happily shift in pursuit of better performance. Still, Nvidia has proven itself to be a fast-moving target when it comes to AI hardware, both for startups and established chipmakers like Intel and AMD. One thing is certain: The demand for bigger and better AI is creating a highly competitive market where nimble execution by engineering teams is almost as important as architectural design.

Habana will make the Gaudi platform available to selected customers in the second half of 2019. Pricing has not yet been revealed, although Medina tells us that the Gaudi will be “competitive” with similar products on the market.
