
IoD leads a new era of cloud computing: an in-depth analysis of the DPU technology white paper (download attached)

Latest update time: 2024-08-10


In July 2024, Zhongke Yushu and several authoritative industry organizations jointly released the latest DPU technology white paper. The white paper proposes and elaborates the new concept of "IaaS on DPU" (IoD): building a high-performance computing power base by offloading the components of the cloud computing infrastructure-as-a-service (IaaS) layer onto the DPU.


The white paper points out that DPU technology is no longer the "exclusive preserve" of giants such as AWS and Alibaba Cloud. With industry leaders such as Red Hat and VMware launching DPU-based solutions, the application of DPU has become a major trend in infrastructure and cloud computing. By integrating the DPU into their products, these companies provide more flexible and efficient computing platforms for enterprises of all sizes. To some extent, IoD technology has become the core technology and best practice for the next generation of high-performance computing power bases.


On June 19, Zhongke Yushu released its third-generation DPU chip, the K2-Pro. As the first company in China to mass-produce full-function DPU chips, Zhongke Yushu has deep experience and an authoritative position in DPU research and development and in the commercialization and deployment of DPU chips. The white paper provides a comprehensive and in-depth analysis of the architecture and advantages of IoD technology and of how it integrates with traditional cloud computing systems, and demonstrates the performance advantages and construction path of building cloud infrastructure services (IaaS) on the DPU, providing a valuable technical reference for the industry. [A download link for the white paper is provided at the end of the article.]





The AI industry has given rise to demand for high-performance cloud computing




2024 has been called the first year of AGI. With the rapid development of large models and generative AI, the parameter scale and dataset size of these models continue to grow. In the six years from 2017 to 2023, the parameter count of large AI models increased from 65 million in the original Transformer to 1.8 trillion in GPT-4, an increase in model size of more than 20,000 times.


The industry's demand for intelligent computing power has also grown dramatically. According to the "Computing Power and Artificial Intelligence" report, the computing power demand of early AI models doubled every 21.3 months. Since the start of the deep learning era in 2010 (the era of small models), that interval has shortened to a doubling every 5.7 months, and by 2023 the computing demand of large models was doubling every 1-2 months. The growth rate of Moore's Law lags far behind this exponential growth in society's demand for AI computing power; in other words, the "AI super demand curve" runs well ahead of the AI computing power that traditional architectures can supply, producing short-term market phenomena such as AI chip capacity bottlenecks and price increases.
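To see how quickly these doubling periods diverge, here is a rough back-of-envelope comparison. It is not taken from the report; it simply applies the doubling periods quoted above and assumes, for the supply side, that Moore's Law roughly doubles transistor density every 24 months:

```python
# Back-of-envelope comparison of demand vs. supply growth over a fixed horizon.
# The demand-side doubling periods come from the figures quoted above; the
# 24-month supply-side doubling is an assumed stand-in for Moore's Law, and
# 1.5 months is taken as the midpoint of the "1-2 months" range.

def growth_factor(months: float, doubling_period: float) -> float:
    """Multiplicative growth over `months` given a fixed doubling period."""
    return 2 ** (months / doubling_period)

HORIZON = 24  # look two years ahead

for label, period in [
    ("Early AI models (doubling every 21.3 months)", 21.3),
    ("Deep-learning era (doubling every 5.7 months)", 5.7),
    ("Large models, ~2023 (doubling every 1.5 months)", 1.5),
    ("Assumed Moore's Law supply (doubling every 24 months)", 24.0),
]:
    print(f"{label:55s} -> x{growth_factor(HORIZON, period):,.1f} over {HORIZON} months")
```

Over two years, a 1.5-month doubling period implies demand growing by a factor of tens of thousands, while a 24-month doubling period supplies only about a twofold increase; that widening gap is exactly what the "AI super demand curve" describes.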


Such enormous demand for intelligent computing power brings major technical and cost challenges to the performance, stability and security of the underlying intelligent computing infrastructure. In particular, the performance of the intelligent computing cloud infrastructure in terms of computing power, networking, storage and scheduling has a key impact on the AI training process and is a decisive factor in the quality of large-model training (efficiency, stability, energy consumption, cost, trustworthiness, etc.).


Currently, almost all major computing infrastructure in the world is managed and scheduled through cloud computing technology, especially in the large-model industry; it is fair to say that cloud computing has become the "operating system" of the digital world. Cloud computing not only provides the infrastructure needed to support AI training, but its flexibility, efficiency and scalability also directly improve the quality and efficiency of AI model training and drive the rapid development and widespread adoption of AI technology.




Using DPU to build a computing power foundation




Among the components of a cloud computing business, the hardware constitutes the physical foundation of cloud computing: servers, storage devices, network equipment (switches, routers, etc.) and, where needed, specialized hardware such as GPU servers and FPGA accelerators.


The functions of the traditional IaaS platform's components are all carried by CPU computing power. However, as cloud computing performance requirements have risen and the demand to squeeze the most out of CPU computing power has grown, the idea of building the IaaS platform on the DPU has been proposed and demonstrated.


Amazon Web Services (AWS) is particularly representative of this shift. According to publicly disclosed materials, since the release of the first Nitro (DPU) device in 2013, AWS's cloud service system has gradually been rebuilt around the DPU, with infrastructure services running inside Nitro devices. The CPU computing power of each server can thus be fully pooled and sold to customers at nearly 100% of its native performance. On this basis, AWS has built a complete set of high-performance, highly stable cloud services and has become the largest cloud service provider in the world. In China, Alibaba Cloud adopts a similar approach: its cloud service system works closely with its self-developed DPU devices, which has helped Alibaba Cloud achieve great success.


Therefore, IaaS on DPU, abbreviated IoD, is not a new concept but a technical direction that has been thoroughly demonstrated by leading companies in the industry, and whose commercial value has been validated by the market.


However, the DPUs and cloud platforms of companies such as AWS are highly customized and difficult to generalize across the industry. With leading chip companies such as NVIDIA, Intel and AMD, as well as strong DPU startups such as China's Zhongke Yushu, entering the DPU race, DPU technology has gradually matured. In terms of functional completeness, system stability and cost-effectiveness alike, the DPU now meets the conditions for deployment in large-scale production environments.


How to combine a general-purpose cloud computing system with standard DPU products has therefore become a focus of industry attention. The establishment of standardization organizations such as OPI and ODPU is a key step in promoting DPU development: cloud vendors and DPU suppliers are jointly discussing DPU API specifications, which can decouple cloud platforms from DPU devices, standardize IoD technology and promote it broadly across the cloud computing industry.
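The decoupling idea can be illustrated with a small, purely hypothetical sketch: the cloud platform codes against an abstract DPU interface, and each vendor supplies an implementation of it. The class and method names below are invented for illustration; they are not the actual OPI or ODPU API.

```python
# Hypothetical sketch of cloud-platform / DPU decoupling via an abstract API.
# All names here are illustrative, not a real vendor SDK or standard.

from abc import ABC, abstractmethod


class DpuBackend(ABC):
    """Vendor-neutral contract the IaaS control plane depends on."""

    @abstractmethod
    def create_virtual_nic(self, tenant_id: str, vlan: int) -> str:
        """Provision a hardware-backed vNIC on the DPU; return its device id."""

    @abstractmethod
    def attach_block_volume(self, vm_id: str, volume_url: str) -> None:
        """Expose a remote volume to the host as a local block device."""


class VendorADpu(DpuBackend):
    """One vendor's implementation; swapping it out does not touch the platform."""

    def create_virtual_nic(self, tenant_id: str, vlan: int) -> str:
        # ... call the vendor SDK / management endpoint here ...
        return f"vnic-{tenant_id}-{vlan}"

    def attach_block_volume(self, vm_id: str, volume_url: str) -> None:
        # ... program the DPU's storage emulation for this VM here ...
        pass


def provision_instance(dpu: DpuBackend, tenant_id: str) -> None:
    """The cloud platform only ever sees the abstract interface."""
    nic = dpu.create_virtual_nic(tenant_id, vlan=100)
    dpu.attach_block_volume(vm_id=f"vm-{tenant_id}", volume_url="nvmeof://pool/vol0")
    print(f"instance for {tenant_id} wired to {nic}")


provision_instance(VendorADpu(), tenant_id="acme")
```

Because the control plane depends only on the abstract contract, replacing one vendor's DPU with another's becomes a driver change rather than a platform rewrite, which is precisely the benefit a common API specification is meant to deliver.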




IoD leads a new paradigm of high-performance cloud computing




In the current development of high-performance computing, the network performance bottleneck has emerged as one of the major obstacles to the progress of cloud computing, and the pressure of processing large-scale data and meeting real-time computing needs makes it ever harder to resolve.


As a key optimization technology, network offloading significantly reduces the CPU burden and improves data processing speed and network throughput by offloading network-intensive tasks such as packet processing, encryption and decryption from the CPU to dedicated hardware such as the DPU. It also reduces latency and enhances security through hardware acceleration, thereby effectively solving the network performance bottleneck in high-performance cloud computing and helping it achieve more efficient, secure, and cost-effective network transmission and processing capabilities.
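A rough back-of-envelope model shows why this matters at modern line rates; the traffic profile and per-packet cycle count below are assumptions for illustration, not figures from the white paper.

```python
# Illustrative estimate of the CPU cost of a software packet-processing path,
# to show why offloading it to a DPU frees host cores. All input numbers are
# assumptions chosen for the example.

def cores_for_packet_processing(bandwidth_gbps: float,
                                avg_packet_bytes: int,
                                cycles_per_packet: int,
                                cpu_ghz: float) -> float:
    """Host cores fully occupied just moving packets in software."""
    packets_per_sec = bandwidth_gbps * 1e9 / 8 / avg_packet_bytes
    cycles_per_sec = packets_per_sec * cycles_per_packet
    return cycles_per_sec / (cpu_ghz * 1e9)

# Assumed figures: 200 Gbps of traffic, 800-byte average packets,
# ~1,500 cycles per packet for a software vSwitch plus crypto path, 2.5 GHz cores.
busy_cores = cores_for_packet_processing(200, 800, 1500, 2.5)
print(f"~{busy_cores:.1f} host cores consumed by the software datapath")
print("after offloading this datapath to the DPU, those cores return to tenant workloads")
```

Under these assumed numbers, a software datapath at 200 Gbps would tie up well over a dozen host cores; moving that work onto the DPU returns those cores to revenue-generating workloads.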


IoD technology is the main implementation of cloud computing offloading. Its core idea is to use the heterogeneous computing capabilities of the DPU to sink as many of the cloud platform's infrastructure components as possible onto the DPU, saving CPU overhead and improving IaaS service performance. Once these components run on the DPU, they provide a consistent network, storage and security base for the various workloads running on the server, and allow the scheduling of virtual machines, containers and bare metal to converge onto a unified platform, as the sketch below illustrates.
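The "unified base" idea can be sketched as follows; the service names and workload types are illustrative assumptions, not the white paper's implementation.

```python
# Illustrative model of the "unified base": once networking, storage and
# security live on the DPU, VMs, containers and bare-metal instances all
# consume the same services, so one scheduler can place all three kinds.

from dataclasses import dataclass


@dataclass
class DpuBase:
    """Services every host exposes identically because they run on its DPU."""
    network: str = "hardware vSwitch on the DPU"
    storage: str = "remote-block initiator on the DPU"
    security: str = "flow-based firewall on the DPU"


@dataclass
class Workload:
    name: str
    kind: str  # "vm", "container" or "bare-metal"


def schedule(workload: Workload, base: DpuBase) -> str:
    # The scheduler no longer cares whether the host runs a hypervisor,
    # a container runtime, or nothing at all: the base services are identical.
    return (f"{workload.kind:10s} {workload.name}: "
            f"net={base.network}, storage={base.storage}, security={base.security}")


base = DpuBase()
for w in [Workload("web-01", "vm"),
          Workload("batch-42", "container"),
          Workload("db-metal", "bare-metal")]:
    print(schedule(w, base))
```

Because every host exposes the same DPU-resident services, the scheduler can treat virtual machines, containers and bare metal uniformly instead of maintaining three separate management stacks.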


Figure: IoD Network Offload Acceleration Principle


Figure: IoD System Model


The white paper discusses the application paradigms of IoD high-performance cloud computing, which mainly include:


"Inclusive" public cloud: Public cloud services are the most typical cloud computing scenario. After settling on this technical route, some public cloud vendors adopt self-developed DPUs to gain deeper business customization, but the huge investment in chip R&D also brings great uncertainty. Most other cloud vendors instead bring in devices from hardware suppliers to build their own technical systems, in which case the standardization, customizability and service-support capabilities of the DPU device become crucial factors.


"Safe and powerful" private cloud: IoD technology has clear advantages for private cloud construction, including operation and maintenance isolation, high security, performance improvement, and energy saving and emission reduction. However, the current private cloud transformation still faces many problems, involving adaptation, business migration and other aspects.


"Small and exquisite" edge cloud: IoD technology is also of great significance to edge cloud development: it saves space, enables customization through the DPU's high programmability, and the DPU's network and storage offloading capabilities greatly help improve edge cloud performance. Large-scale deployment of edge cloud services is still in its early stages, making this the best time to introduce the DPU into the edge cloud technology system.


The "emerging" intelligent computing (Zhisuan) cloud: The infrastructure layer of the intelligent computing cloud mostly adopts a CPU+DPU+GPU "3U in one" integrated heterogeneous computing architecture, in which the network-layer hardware is the DPU product line. By offloading the computing, storage, network, security and management of intelligent computing to the DPU hardware layer, extreme performance can be achieved in an ultra-high-bandwidth, ultra-low-latency network environment. At the same time, the DPU provides security isolation for multi-tenant intelligent computing cloud services and supports AI inference and training in GDR (GPUDirect RDMA) and GDS (GPUDirect Storage) scenarios, ensuring the safe, stable and reliable operation of all services and data on the intelligent computing cloud platform.


"Lightning-fast" low-latency cloud: The heterogeneous computing power management capabilities of the IoD technology system incorporate low-latency transmission capabilities into cloud platform management and scheduling, which can better support the business needs of low-latency cloud scenarios.


In general, on the path to high-performance cloud computing infrastructure, IoD technology starts from the cloud computing architecture and, drawing on the actual capabilities of the DPU, offloads as many cloud capabilities as possible, such as networking, storage, security, management, and operation and maintenance, to the DPU. While preserving a smooth evolution of the existing technology stack as far as possible, it can also deliver huge performance improvements for cloud computing.


Figure: IoD Technology Panorama




Summary




A high-performance cloud base is an important direction for the development of cloud computing. Through IoD technology, the cloud computing system can be given a high-performance computing base with the DPU at its core and hardware and software tightly integrated, delivering unified management, high scalability, high performance and low-cost IaaS services. At the hardware level, it offers better solutions for the heterogeneous computing power management of "3U in one" and "one cloud, multiple cores". By offloading network, storage, security, management and other loads, the server's hardware resources are released, performance is accelerated, and the operating efficiency of the infrastructure improves. In addition, the unified IoD base provides the cloud computing system with unified scheduling and operation and maintenance management of containers, virtual machines and bare-metal services, improving operation and maintenance efficiency.


However, realizing this vision depends on the joint efforts of all parties in the industry. In the white paper, Zhongke Yushu calls on competent authorities and industry organizations to formulate policies that encourage innovation; on cloud service providers and hardware manufacturers to work closely together to develop high-performance cloud computing services and solutions; and on user enterprises to actively respond to cloud policy documents and to understand and evaluate the potential value of high-performance cloud base solutions in their own businesses.


Reference Links

"IaaS on DPU (IoD): Next-Generation High-Performance Computing Base Technology White Paper", July 2024, Zhongke Yushu et al.






To download the "IaaS on DPU (IoD): Next-Generation High-Performance Computing Base Technology White Paper", please click "Read Original Text" at the end of the article.



END


*Disclaimer: This article is originally written by the author. The content of the article is the author's personal opinion. Semiconductor Industry Observer reprints it only to convey a different point of view. It does not mean that Semiconductor Industry Observer agrees or supports this point of view. If you have any objections, please contact Semiconductor Industry Observer.




