Vicor powers the most powerful AI server
What is the hottest thing in the technology world right now? It goes without saying: artificial intelligence. These days, if you say that your company's products have nothing to do with artificial intelligence, you will probably be laughed at... Chips are the carrier of artificial intelligence and the foundation on which its software runs. Professor Wei Shaojun has said that most of today's AI chip startups will become martyrs, a stirring and admirable chapter in the development of AI. Yet that has not stopped talent and huge amounts of capital from pouring into the industry, rapidly driving its development and realignment.
Artificial intelligence is not new. The Dartmouth Conference in the summer of 1956 marked its birth, and the field has gone through many ups and downs since. IBM's "Deep Blue" supercomputer defeated the human world chess champion, but it was not until the AlphaGo vs. Lee Sedol man-machine match of the century in 2016 that artificial intelligence was brought back into the spotlight.
Earlier artificial intelligence was mostly implemented through explicit programming, and the amount of computation was modest. Contemporary artificial intelligence runs a set of learning algorithms on the machine and lets the machine generate its own behavior strategy. Network-based deep learning requires far more computation than the earlier program execution, so the demand on hardware computing speed has risen sharply. In recent years, image recognition and image processing have been widely used in artificial intelligence inference applications. Graphics workloads are well suited to parallel processing: each operation is simple, so running them on a CPU would be very wasteful. A GPU integrates hundreds or thousands of simple computing units that perform simple, repetitive calculations very efficiently. The most vivid explanation I have seen is this: if a job requires hundreds of millions of additions, subtractions, multiplications and divisions on numbers under 100, the most efficient approach is to hire hundreds of elementary school students to calculate at the same time. However brilliant a professor is, he cannot match the combined speed of those hundreds of students. The professor's main job is to assign tasks sensibly to the students, or to handle the hard problems such as calculus. The hundreds of elementary school students working together are the GPU, while the one or several professors are the CPU.
According to earlier statistics for the graphics computing market, Intel still firmly holds first place with 71%, because the CPU is the core of the server; NVIDIA accounts for 16% and AMD for 13%. Counting only discrete GPU boards, NVIDIA accounts for 71% and AMD for 29%. NVIDIA's GPU boards are widely used for artificial intelligence training and inference in data centers. The most powerful artificial intelligence hardware company today is NVIDIA! In the artificial intelligence industry, NVIDIA is the customer Vicor serves.
Over the past two years Nvidia's stock has soared, but a friend of mine at Nvidia tells me he is miserable because he did not sell at the peak...
Although the work of each GPU computing unit is simple and repetitive, the chip consumes a lot of power because there are so many units. The following is the power consumption and manufacturing information of different Tesla GPUs. From the table, we can see that GPU power has increased from 235W to 300W. According to the latest information, the rated power of the new generation of GPUs has reached 400W. Power at this level is exactly where Vicor comes in. We can inject a steady stream of power into artificial intelligence...
The power supply schemes for GPUs and CPUs are very similar. The current mainstream architecture takes a 12V input and steps it down with a multiphase interleaved buck converter. Larger loads are handled by adding phases; generally speaking, each phase carries about 25~30A. This architecture is widely used for CPU power and is quite cost-effective for designs up to 200W. The number of phases is determined not only by the load power but also, to some extent, by the dynamic requirements of the load. The output stage of a buck converter filters with inductors and capacitors, and the energy stored in the inductor prevents the current from responding quickly. Multiphase interleaving also improves the ability to cope with rapid load changes.
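As a rough illustration of how load power drives the phase count, the sketch below divides the load current by a per-phase current limit. The 25~30A per-phase figures come from the paragraph above; the 1.0V core voltage and the function name phases_needed are assumptions made for this example, not values from any Nvidia or Vicor datasheet. Real designs would add phases beyond this minimum to meet dynamic requirements.

```python
import math

# Assumed core voltage, for illustration only; the article does not state one.
ASSUMED_CORE_VOLTAGE_V = 1.0


def phases_needed(load_power_w, amps_per_phase, core_voltage_v=ASSUMED_CORE_VOLTAGE_V):
    """Minimum phase count so that no phase carries more than amps_per_phase."""
    load_current_a = load_power_w / core_voltage_v
    return math.ceil(load_current_a / amps_per_phase)


for power_w in (200, 300, 400):
    fewest = phases_needed(power_w, amps_per_phase=30)
    most = phases_needed(power_w, amps_per_phase=25)
    print(f"{power_w} W load: {fewest} to {most} phases at 25-30 A per phase")
```

Under these assumptions the count grows linearly with load power, which is why boards become crowded as processor power climbs.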
A GPU performs an enormous amount of computation, and its transient peak power is nearly twice its rated power. The figure below shows the Tesla V100 board, which carries a 300W GV100. To cope with a peak of nearly twice that, the board uses a 16-phase power supply. Fortunately, the GPU is only a coprocessor and does not need as many IO connections to the outside world as a CPU, but even so, 16 phases are already crowded, and the space around the GPU is packed with components. Nvidia is, after all, a professional image processing company: rendered from a different angle, the crowded board takes on a kind of brutal mechanical beauty... I dare not imagine what a 400W GPU power solution would look like. It would need at least 20 phases, and controlling and laying out that many phases is a great challenge.
Beyond the board area mentioned above, a buck solution with this many phases has an electrical problem that is easy to overlook. With so many phases, the inductors cannot all sit close to the actual load, which inevitably increases both the resistance and the inductance of the path from the power supply output to the point where the load draws current.
The added resistance directly increases power loss, lowers overall efficiency, and raises the heat generated on the board. The added inductance directly degrades dynamic response and affects the performance of the processor core. In addition, when the power of a single processor exceeds 200W, the current at the 12V input approaches or exceeds 20A, or even 30A. The PCB traces at the 12V input are not very thick, so the I²R loss there also reduces system efficiency. If a 12kW system is powered from 12V, the 12V bus must supply up to 1000A. Managing that much current is difficult, and limiting the bus current is necessary. As chip power continues to rise, these shortcomings of the 12V multiphase approach become more prominent, and new ideas are needed to meet the challenge. Vicor experts have always maintained that the 48V FPA architecture can readily cope with it and help processor performance reach a new level.
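The bus-current figures above follow from I = P / V. A minimal sketch of that arithmetic; the function name bus_current_a and the 240W and 360W example loads are illustrative choices, and the 48V column is included only for comparison:

```python
def bus_current_a(power_w, bus_voltage_v):
    """Current a DC bus must carry to deliver power_w at bus_voltage_v."""
    return power_w / bus_voltage_v


# Example loads chosen for illustration; only the 12 kW case is from the text.
loads = [("240 W processor", 240), ("360 W processor", 360), ("12 kW system", 12_000)]

for label, power_w in loads:
    at_12v = bus_current_a(power_w, 12)
    at_48v = bus_current_a(power_w, 48)
    print(f"{label}: {at_12v:.0f} A at 12 V, {at_48v:.1f} A at 48 V")
```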
At the GTC conference a few days ago (March 30), watched by artificial intelligence practitioners around the world, Nvidia released its latest GPU server. Here are the server's key figures: DGX-2, 2PFLOPS, 10 times faster than DGX-1, 16 built-in Tesla V100 SXM3 GPU boards, total power 10kW, total weight 350 pounds, and a staggering price of $399,000 (but by Nvidia's logic, since one server can replace a 360kW supercomputer system made up of 600 dual-CPU servers, the more you buy, the more you save).
Undoubtedly, the new generation of GPUs consumes more power, exceeding 400W. Here is a close-up of the SXM3 board in the picture above. With a rated power of 400W and a peak of nearly twice that, a multiphase design is extremely difficult to realize. For this reason, the SXM3 board uses the latest Vicor FPA technology to power the GPU.
Compared with the striking rows of inductors in the earlier multiphase design, this board adds three gold-colored modules, and its surface is far less cluttered. These modules are Vicor's latest products, the current multiplier module and current multiplier module driver (MCM/MCD), developed specifically for the artificial intelligence market. In the lower right corner of the board, the chip marked Picor is Vicor's 48V to 12V ZVS buck SiP, included for compatibility with the 12V IO voltage. The application of MCM/MCD on the board is shown in the figure below.
The MCD/MCM combination is similar to the earlier PRM/VTM combination and belongs to the FPA architecture. The difference is that the MCD integrates all of the PRM's functions plus the H-bridge circuit of the original VTM. The rest of the VTM is packaged as the MCM, which allows the MCM to be very thin. The upstream MCD converts the system's 48V input into a 30-40V AC differential-pair power signal, which is fed into the MCM and converted, through the MCM's transformer and synchronous rectification, into the low voltage and high current the processor requires. Changes in the processor voltage, or in the voltage the processor requests, are fed back to the MCD, which adjusts the amplitude of the differential pair and thereby the processor voltage output by the MCM.
The power density of this solution is more than twice that of a multiphase buck, freeing up PCB space to optimize system functionality. Because the MCM is only 2.7mm thick, it can be placed very close to the load, minimizing the resistance and inductance of the high-current power delivery path. In addition, the FPA architecture responds quickly by nature, so with close placement it can meet even demanding processor transient requirements.
Vicor's products provide strong power for the DGX-2, which can achieve a computing speed of 2PFLOPS with a power of 10kW.
Here, we would like to remind industry professionals working in artificial intelligence and experts who use GPUs for deep learning that the input of the SXM3 board is already 48V, not 12V! If your system is to use the SXM3 board, you need to provide a 48V power interface.
Important things should be said three times:
If you want to use the most powerful GPU, you must provide 48V input on the board!
If you want to use the most powerful GPU, you must provide 48V input on the board!
If you want to use the most powerful GPU, you must provide 48V input on the board!
The power of the most powerful GPU on the planet has risen along with the surge in its computing performance. The machines already released prove that a 48V input is an effective way to handle total system power. For the same power, the current is only 1/4 that of a 12V system, and the I²R loss over the same path is only 1/16. When processor power becomes too large, adding more phases is no longer feasible, and Vicor's mature FPA architecture can take on the task. The AI chip market is taking off quickly and competitors are each showing their own prowess, like the Eight Immortals crossing the sea; excellent performance is a precondition for not becoming martyrs. Some of these companies use high-end FPGAs, and others develop their own ASICs. Until new architectures and new algorithms emerge, the expectation of ever higher hardware performance will certainly drive a rapid increase in processor and system power. Vicor's 48V FPA technology and MCM/MCD are prepared for the future of these companies.
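The 1/4 and 1/16 figures follow from I = P / V and P_loss = I²R, assuming the same delivered power and the same path resistance at both voltages. A small sketch of that comparison; the 1 milliohm resistance is a made-up value used only for illustration, and the 10kW load is the DGX-2 total power quoted earlier:

```python
def conduction_loss_w(power_w, bus_voltage_v, path_resistance_ohm):
    """I^2 * R loss in the distribution path for a given delivered power."""
    current_a = power_w / bus_voltage_v
    return current_a ** 2 * path_resistance_ohm


POWER_W = 10_000                           # DGX-2 total power from the article
HYPOTHETICAL_PATH_RESISTANCE_OHM = 0.001   # assumed value, not a measured figure

loss_12v = conduction_loss_w(POWER_W, 12, HYPOTHETICAL_PATH_RESISTANCE_OHM)
loss_48v = conduction_loss_w(POWER_W, 48, HYPOTHETICAL_PATH_RESISTANCE_OHM)

current_ratio = (POWER_W / 48) / (POWER_W / 12)
loss_ratio = loss_48v / loss_12v

print(f"48 V current is {current_ratio:.2f} of the 12 V current")
print(f"48 V path loss is {loss_ratio:.4f} of the 12 V path loss")
```

The ratios do not depend on the resistance or the power chosen, since both cancel out.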