New chips from NVIDIA and AMD break through PCIe limitations
Source: Semiconductor Industry Observation (ID: icbank), compiled from HPCwire.
Anyone who has studied microprocessors may remember that the original 8086/8088 processor did not have a floating-point unit. Motherboards of that era usually included an extra socket for an optional 8087 math coprocessor. The math coprocessor eventually made its way into the CPU itself, and today no CPU ships with an optional math coprocessor.
There are, however, many options for attached SIMD processors such as GPUs, which are well known to accelerate mathematical workloads (matrix operations in particular) far beyond what a host CPU can do.
With the introduction of the Nvidia GH200 processor and the AMD MI300A APU, the market is witnessing an "8087 moment": the point where the CPU absorbs external performance hardware. Both Nvidia and AMD have incorporated GPUs into their processors, and the result is a huge jump in HPC performance and a sign of things to come.
Goodbye PCIe
Traditionally, both AMD and Nvidia GPUs rely on the PCIe bus to communicate with the CPU. The CPU and GPU have two separate memory domains, and data must move from the CPU domain to the GPU domain (and back) across the PCIe interface.
The maximum bandwidth for a GPU using all 16 lanes of a Gen 5 PCIe bus is approximately 63 GB/s. This bottleneck limits memory movement between the CPU and GPU.
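As a quick sanity check on that 63 GB/s figure, here is a back-of-the-envelope calculation assuming the published PCIe 5.0 rate of 32 GT/s per lane and 128b/130b line encoding:

```python
# Rough PCIe Gen 5 x16 bandwidth estimate (one direction).
GT_PER_S_PER_LANE = 32           # PCIe 5.0 raw signaling rate per lane
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding
LANES = 16

gbytes_per_s = GT_PER_S_PER_LANE * ENCODING_EFFICIENCY * LANES / 8
print(f"PCIe 5.0 x16 ~ {gbytes_per_s:.1f} GB/s per direction")  # ~63.0 GB/s
```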
The Nvidia GH200 connects its Grace CPU and Hopper GPU via a 900 GB/s bidirectional NVLink-C2C interconnect, roughly 14 times the bandwidth of a 16-lane PCIe Gen 5 link. The GH200 also brings the benefit of a single shared CPU-GPU memory domain: there is no need to move data between the CPU and GPU over a PCIe bus. As shown in Figure 1, the CPU and GPU have a coherent view of all memory. The CPU memory is up to 480 GB of LPDDR5X (with ECC), and the GPU has 96 GB of HBM3 or 144 GB of HBM3e, so the total coherent (single-domain) memory ranges from 576 GB to 624 GB.
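As a tiny illustration of what that bandwidth gap means in practice, the sketch below uses only the figures quoted above to estimate how long a bulk copy of the GPU's 96 GB of HBM3 would take over each link. It ignores latency and protocol overhead, the 900 GB/s NVLink-C2C figure is a bidirectional total, and with a coherent single memory domain such bulk copies are largely unnecessary anyway:

```python
# Rough time to move 96 GB (the GH200's HBM3 capacity) across each link,
# using the bandwidth figures quoted in the text. Illustration only.
DATA_GB = 96
PCIE_GEN5_X16_GBPS = 63   # approximate, one direction
NVLINK_C2C_GBPS = 900     # NVLink-C2C bidirectional total

print(f"PCIe Gen 5 x16: {DATA_GB / PCIE_GEN5_X16_GBPS:.2f} s")        # ~1.52 s
print(f"NVLink-C2C    : {DATA_GB / NVLINK_C2C_GBPS:.2f} s")           # ~0.11 s
print(f"Ratio         : {NVLINK_C2C_GBPS / PCIE_GEN5_X16_GBPS:.1f}x") # ~14.3x
```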
The current AMD Instinct MI300A APU likewise uses a single memory domain: 128 GB of HBM3 is shared coherently between the CPU and GPU over Infinity Fabric, with a peak throughput of 5.3 TB/s. Although the MI300A does not currently support additional DDR memory expansion the way the GH200 does, CXL is a term worth remembering for the future.
For the GH200 and MI300A, the key phrase is "presenting a single memory domain." In a traditional CPU-PCIe-GPU combination, the GPU memory is usually much smaller than the CPU memory, and data must be shuffled across the PCIe interface. These two new designs eliminate that bottleneck. A single large memory domain has always been attractive for HPC, and the growth of GenAI has accelerated the need (i.e., the ability to load large models in memory and run them on the GPU). With traditional GPUs, the amount of GPU memory limits the model size, forcing a distributed multi-GPU approach. (Note: the GH200 can create massive unified memory through external NVLink connections; for example, the Nvidia-AWS GH200 NVL32 can provide up to 20 TB of unified memory.)
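To make the model-size point concrete, here is a rough sizing sketch; the model sizes and the 2-bytes-per-parameter (FP16) assumption are illustrative and not taken from the article:

```python
# Approximate memory needed for model weights alone at FP16 (2 bytes/param).
# Model sizes are illustrative examples, not figures from the article.
BYTES_PER_PARAM = 2  # FP16

for name, params in [("7B", 7e9), ("70B", 70e9), ("180B", 180e9)]:
    weights_gb = params * BYTES_PER_PARAM / 1e9
    print(f"{name:>4}: ~{weights_gb:4.0f} GB weights | "
          f"fits 80 GB H100: {weights_gb <= 80} | "
          f"fits 576 GB GH200 domain: {weights_gb <= 576}")
```

Even ignoring activations and KV caches, anything much beyond roughly 40 billion FP16 parameters overflows a single 80 GB card, while the 576 GB coherent domain still has headroom.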
Not far from your desktop
One of the clear trends in technology is the migration of expensive new technology into low-cost commodity markets, and high-performance computing is no exception. Given enough market demand, everything from multi-core processors to advanced memory has moved from the high end down to mobile phones. The move to a single memory domain is one of these changes.
Recently, Michael Larabel of the Linux benchmarking site Phoronix ran HPC benchmarks on a GH200 workstation. The system was provided by GPTshop.ai in Germany.
The tower system pairs the GH200 Grace Hopper Superchip with 576 GB of memory, dual 2,000+ W power supplies, a QCT motherboard, and multiple configuration options, including SSDs and NVIDIA BlueField/ConnectX adapters. An interesting and useful feature is that the TDP can be programmed from 450 W to 1,000 W (CPU + GPU + memory), which should help in non-data-center environments. The default air cooling is reportedly 25 decibels, and liquid cooling is also an option.
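Because the TDP is user-programmable, a quick estimate of what the two extremes mean for round-the-clock operation may be useful; the $0.30/kWh electricity price below is an assumed example, not a figure from the article:

```python
# Daily energy use and cost at the two programmable TDP limits.
PRICE_PER_KWH = 0.30  # USD -- assumed example price, adjust for your region

for tdp_watts in (450, 1000):
    kwh_per_day = tdp_watts / 1000 * 24
    print(f"{tdp_watts:>4} W -> {kwh_per_day:4.1f} kWh/day "
          f"(~${kwh_per_day * PRICE_PER_KWH:.2f}/day at full load)")
```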
Desktop super-workstations do not come cheap, however. The currently available GH200 576 GB model starts at €47,500 (according to Phoronix, this works out to about $41,000, since the 19% VAT does not apply when shipping outside the EU).
That price may seem high, but consider that the current market price for an Nvidia H100 PCIe GPU with 80 GB of HBM2e memory is between $30,000 and $35,000, not including the host system that powers and runs the GPU. In addition, users are limited to 80 GB of GPU memory, which is separated from the main memory domain by the PCIe bus.
The GPTshop workstation provides 576 GB of single-domain memory. HPC and GenAI users will find this half terabyte of coherent CPU-GPU memory attractive.
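One rough way to frame the price is cost per gigabyte of GPU-accessible memory, using only the numbers quoted above (taking the midpoint of the H100 price range and ignoring the cost of the H100's host system):

```python
# Rough cost per GB of GPU-accessible memory, from the prices quoted above.
gh200_usd, gh200_gb = 41_000, 576   # GPTshop GH200 workstation, ex-VAT
h100_usd, h100_gb = 32_500, 80      # midpoint of the quoted H100 PCIe range

print(f"GH200 workstation: ~${gh200_usd / gh200_gb:,.0f} per GB")  # ~$71/GB
print(f"H100 PCIe card   : ~${h100_usd / h100_gb:,.0f} per GB")    # ~$406/GB
```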
Preliminary benchmarks
With help from GPTshop, Phoronix was able to run multiple benchmarks remotely. The results should be viewed as preliminary rather than final measures of performance. In particular, the benchmarks exercise only the CPU and do not use the Hopper GPU, so the picture is incomplete. Phoronix plans to test GPU-based applications in the future.
According to Phoronix, the system ran Ubuntu 23.10 with Linux 6.5 and used GCC 13 as the standard compiler. A similar environment was used to test comparable processors, including Intel Xeon Scalable, AMD EPYC, and Ampere Altra Max parts. The complete list can be found on the Phoronix website.
Additionally, no power consumption data is available for the benchmark runs. According to Phoronix, the NVIDIA GH200 does not currently appear to expose any RAPL/PowerCap/HWMON interfaces under Linux for reading its power or energy usage. The system's BMC does expose overall system power consumption through its web interface, but the power data is not available via IPMI.
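For reference, the interfaces Phoronix is referring to live under /sys/class/powercap (RAPL/PowerCap) and /sys/class/hwmon on Linux. The sketch below simply lists what a system exposes there; on the GH200 it would come back without usable CPU/GPU power sensors:

```python
# List the standard Linux power/energy interfaces (PowerCap/RAPL and HWMON).
# On the GH200, Phoronix reports these do not expose CPU/GPU power readings.
from pathlib import Path

for base in (Path("/sys/class/powercap"), Path("/sys/class/hwmon")):
    entries = sorted(base.iterdir()) if base.is_dir() else []
    print(f"{base}: {len(entries)} entries")
    for entry in entries:
        name_file = entry / "name"
        label = name_file.read_text().strip() if name_file.is_file() else "(no name)"
        print(f"  {entry.name}: {label}")
```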
Despite these limitations, some important benchmarks were run on the GH200 for the first time outside of Nvidia.
Good Ole HPCG
The first test reported by Phoronix is the standard, memory-bandwidth-bound HPCG benchmark, shown in Figure 2.
As can be seen, the GH200's Arm CPU reaches a respectable 42 GFLOPS, slightly higher than the Xeon Platinum 8380 2P (40 GFLOPS) and slightly lower than the EPYC 9654 Genoa 2P (44 GFLOPS). Also worth noting: the 72-core Arm Grace CPU delivers almost twice the performance of the 128-core Ampere Altra Max Arm processor.
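A rough per-core view of those HPCG numbers helps explain the ordering. The core counts below are the published figures for each part (Grace 72, two 40-core Xeon Platinum 8380s, two 96-core EPYC 9654s), and the calculation ignores memory configuration and clock differences:

```python
# Per-core view of the HPCG results quoted above (GFLOPS / total cores).
results = {
    "GH200 Grace (72 cores)":           (42, 72),
    "Xeon Platinum 8380 2P (80 cores)": (40, 80),
    "EPYC 9654 Genoa 2P (192 cores)":   (44, 192),
}
for name, (gflops, cores) in results.items():
    print(f"{name}: {gflops / cores:.2f} GFLOPS/core")
```

The nearly flat totals despite very different core counts reinforce that HPCG is limited by memory bandwidth rather than by core count.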
The GH200 performed well in the other benchmarks too. One of the most impressive results is shown in Figure 3: the NWChem (C240 buckyball) run using the 72-core Arm GH200 finished in 1,404 seconds, just behind the leader, the 128-core EPYC 9554 (2P), at 1,323 seconds.
What's about to happen
The Nvidia GH200 and AMD MI300A introduce new processor architectures. As with the 8087 math coprocessor, high-end CPUs are beginning to absorb the GPU (or SIMD processing unit). The idea is not entirely new, however: AMD has integrated mid-range GPUs into its desktop and laptop APU processors since 2011. While these new high-end processors may be considered "specialized" and therefore expensive, the overwhelming interest in GenAI may push such designs toward commodity price points over time. The story will continue to evolve as more benchmarks become available.
In addition, the arrival of personal high-performance workstations with enough memory to run some of the largest LLMs right at your desk is a major milestone, not to mention the ability to run many large-memory, GPU-optimized HPC applications. Data centers and the cloud will remain the mainstay for now, but it is fair to say that a reset button has appeared.
Original link
https://www.hpcwire.com/2024/02/22/a-big-memory-nvidia-gh200-next-to-your-desk-closer-than-you-think/