

Goose Factory releases a large model computing cluster! Computing power up 3x, and a trillion-parameter large model can be trained in 4 days.

Latest update time: 2023-04-18 16:00
Mingmin comes from Ao Fei Temple
Qubit | Public account QbitAI

Unexpectedly, just as everyone was scrambling to release large models, Goose Factory (Tencent) took a different path and focused on computing power instead.

Just now, Tencent Cloud's latest-generation HCC (High-Performance Computing Cluster) has arrived!

For large model training, it uses the latest generation of Tencent Cloud's self-developed Xingxinghai servers, equipped with NVIDIA H800 Tensor Core GPUs, and delivers the industry's highest 3.2T ultra-high interconnect bandwidth.

Measured results show that the computing performance of Tencent Cloud's new-generation cluster is up to three times that of the previous generation.

Take Tencent's own large model training as an example: for the Hunyuan NLP large model with trillions of parameters, on the same dataset, training time was cut from 50 days to 11 days. On the new-generation cluster, training time shrinks further to 4 days.
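For a rough sense of scale, those figures translate into the following speedup ratios; the minimal Python sketch below simply restates the article's numbers (50, 11, and 4 days) as divisions, nothing more.

```python
# Speedup factors implied by the training-time figures quoted above
# (same trillion-parameter Hunyuan NLP model, same dataset).
baseline_days = 50   # original setup
prev_gen_days = 11   # previous-generation cluster
new_gen_days = 4     # new-generation cluster

print(f"previous gen vs. baseline: {baseline_days / prev_gen_days:.1f}x")  # ~4.5x
print(f"new gen vs. previous gen:  {prev_gen_days / new_gen_days:.1f}x")   # ~2.8x
print(f"new gen vs. baseline:      {baseline_days / new_gen_days:.1f}x")   # ~12.5x
```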

Amid this AIGC boom, the industry's demand for computing power has grown like never before, and all kinds of hardware and software related to intelligent computing have surged in popularity.

What new developments will Goose Factory's surprise release bring?

Brings 3.2T ultra-high communication bandwidth

It is understood that Tencent Cloud's new generation cluster can provide high-performance, high-bandwidth, and low-latency intelligent computing capability support for large model training by collaboratively optimizing single-machine computing power, network architecture, and storage performance.

At the compute level, single-server performance is the foundation of the cluster's computing power.

With non-sparse specifications, a single GPU card in the new-generation cluster delivers up to 495 TFLOPS (TF32), 989 TFLOPS (FP16/BF16), and 1,979 TFLOPS (FP8).
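To see how those per-card figures add up at cluster scale, here is a back-of-the-envelope Python sketch. The per-GPU TFLOPS values are the non-sparse numbers quoted above; the 8-GPUs-per-node and 1,000-node sizes are illustrative assumptions, not figures from the article, and real training throughput will be lower once communication and memory bottlenecks are accounted for.

```python
# Theoretical peak throughput for a hypothetical cluster built from GPUs
# with the per-card (non-sparse) figures quoted above.
PER_GPU_TFLOPS = {"TF32": 495, "FP16/BF16": 989, "FP8": 1979}

GPUS_PER_NODE = 8   # illustrative assumption
NUM_NODES = 1000    # illustrative assumption

for precision, tflops in PER_GPU_TFLOPS.items():
    total_pflops = tflops * GPUS_PER_NODE * NUM_NODES / 1000  # TFLOPS -> PFLOPS
    print(f"{precision}: {total_pflops:,.0f} PFLOPS theoretical peak")
```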

For large model training scenarios, the Tencent Cloud Xingxinghai server adopts an ultra-high-density 6U design, with rack density 30% higher than what the industry typically supports. Following a parallel-computing approach, it integrates the design of its CPU and GPU nodes to push single-node computing performance to its maximum.

At the network level, there are massive data interaction requirements between computing nodes. As the cluster scale expands, communication performance will directly affect training efficiency.

Tencent's self-developed Xingmai network brings the industry's highest 3.2T ultra-high communication bandwidth to the new generation of clusters.

By unifying AllReduce communication bandwidth within and across nodes, it maximizes the synergy between network and computing power.

Measured results show that with the same GPUs, the latest 3.2T Xingmai network raises the cluster's overall computing power by 20% compared with the 1.6T network.
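The article does not say how the 20% figure was obtained, but AllReduce bus bandwidth is the metric this kind of comparison usually rests on. Below is a minimal, generic PyTorch/NCCL measurement sketch, launched with torchrun; it is not Tencent's benchmark, and the tensor size and iteration counts are arbitrary choices.

```python
import os
import time

import torch
import torch.distributed as dist

def allreduce_busbw(num_bytes: int = 1 << 30, iters: int = 20) -> None:
    """Measure AllReduce bus bandwidth across all ranks (generic NCCL benchmark)."""
    dist.init_process_group("nccl")                  # RANK/WORLD_SIZE come from torchrun
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    x = torch.ones(num_bytes // 2, dtype=torch.float16, device="cuda")  # num_bytes of data

    for _ in range(5):                               # warm-up iterations
        dist.all_reduce(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    n = dist.get_world_size()
    busbw_gbps = 2 * (n - 1) / n * num_bytes / elapsed / 1e9  # ring AllReduce bus bandwidth
    if dist.get_rank() == 0:
        print(f"AllReduce bus bandwidth: {busbw_gbps:.1f} GB/s across {n} ranks")
    dist.destroy_process_group()

if __name__ == "__main__":
    allreduce_busbw()
```

Run with, for example, `torchrun --nproc_per_node=8 allreduce_bench.py` on each node (the file name is hypothetical).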

Built on a non-blocking, multi-rail-aggregated network architecture, active congestion control, and a customized acceleration communication library, Tencent Cloud offers industry-leading cluster-building capability and supports a single-cluster network scale of up to 100,000 GPU cards.

Even at extreme cluster scale, it maintains an excellent communication-overhead ratio and throughput, meeting the horizontal-scaling needs of large model training and inference workloads.

Meanwhile, TCCL, Tencent's self-developed high-performance collective communication library, is deeply optimized for the Xingmai network hardware platform and incorporates custom designs for global path planning, topology-aware affinity scheduling, and real-time alerting and self-healing of network faults.

Compared with open-source collective communication libraries, it improves load performance for large model training by 40% and eliminates training interruptions caused by a variety of network issues.

At the storage level, thousands of compute nodes in a training run simultaneously read a batch of datasets, so dataset loading time must be kept as short as possible. The new-generation cluster introduces Tencent Cloud's latest self-developed storage architecture to cover storage needs across different scenarios.

The COS+GooseFS object storage solution provides multi-layer cache acceleration and greatly improves end-to-end data reading performance; it stores public data sets, training data, and model results in the object storage COS to achieve unified storage and efficient transfer of data.

At the same time, GooseFS caches hot data into GPU memory and local disks on demand, leveraging data locality to provide high-performance access.
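Because GooseFS exposes cached data through an ordinary file-system mount, training code does not have to change. The sketch below shows a plain PyTorch dataset reading shards from a hypothetical GooseFS mount point; the `/mnt/goosefs/train` path and the `.pt` shard layout are illustrative assumptions, not details from the article.

```python
from pathlib import Path

import torch
from torch.utils.data import DataLoader, Dataset

class ShardDataset(Dataset):
    """Reads pre-serialized training shards from a (hypothetical) GooseFS mount."""

    def __init__(self, root: str = "/mnt/goosefs/train"):   # assumed mount point
        self.files = sorted(Path(root).glob("*.pt"))         # assumed shard layout

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, idx: int):
        # Looks like ordinary local file I/O; GooseFS serves hot shards from its
        # memory / local-disk cache and falls back to COS for cold data.
        return torch.load(self.files[idx])

loader = DataLoader(ShardDataset(), batch_size=1, num_workers=8, pin_memory=True)
```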

The CFS Turbo high-performance parallel file storage solution uses multi-level cache acceleration on a fully distributed architecture, delivering up to 100 GB/s of bandwidth and 10 million IOPS. Through persistent client-side caching, the bare-metal server's local NVMe SSDs and the Turbo file system form a unified namespace, achieving microsecond-level latency and meeting large-model requirements for large data volumes, high bandwidth, and low latency.

Meanwhile, intelligent tiering automatically separates hot and cold data, cutting storage costs by 80% and delivering strong cost-effectiveness.

On top of this underlying architecture, the new-generation cluster integrates Tencent Cloud's self-developed TACO Train acceleration engine for large model training scenarios. It applies extensive system-level optimizations to network protocols, communication strategies, AI frameworks, and model compilation, significantly reducing training, tuning, and computing power costs.
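The article does not describe TACO Train's API, but the optimizations it lists (communication strategies, AI frameworks, model compilation) target the standard hook points of a distributed training setup. The sketch below marks those hook points using only stock PyTorch calls; it is a generic skeleton, not TACO-specific code.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Generic hook points that an acceleration engine of this kind targets.
dist.init_process_group("nccl")                       # optimized communication backend plugs in here
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
model = DDP(model)                                    # gradient AllReduce path carried by the network fabric
model = torch.compile(model)                          # graph/compiler-level optimizations slot in here
```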

AngelPTM, the self-developed training framework of Tencent's Taiji machine learning platform, is also now offered as an external service through Tencent Cloud, helping enterprises accelerate the adoption of large models.

At present, Tencent's Hunyuan AI large model covers foundation models for natural language processing, computer vision, and multimodality, as well as models for many industries and fields.

On Tencent Cloud, enterprises can draw on these large model capabilities and toolboxes to fine-tune models on their own industry data, improve production efficiency, and quickly create and deploy AI applications.

Previously, a number of Tencent's self-developed chips had already entered mass production.

Among them, the Zixiao chip for AI inference and the Canghai chip for video transcoding have been delivered for use within Tencent, with performance metrics and overall cost-effectiveness significantly better than industry alternatives.

Zixiao uses a self-developed storage-compute architecture, increasing on-chip memory capacity and adopting more advanced memory technology to remove the memory-access bottleneck that constrains chip performance. It also integrates Tencent's self-developed acceleration modules to reduce handshake wait time with the CPU.

Currently, Zixiao has been deployed in Tencent's flagship businesses, delivering up to 3x compute acceleration and over 45% overall cost savings.

It is reported that Tencent Cloud's distributed cloud-native scheduling now spans more than 150 million cores in total, providing 16 EFLOPS (1.6×10^19 floating-point operations per second) of intelligent computing power. Going forward, the new-generation cluster will serve not only large model training but also scenarios such as autonomous driving, scientific computing, and natural language processing.

(Qubit recently launched its "China AIGC Computing Power Industry Panoramic Report" solicitation. Interested readers are welcome to fill in the registration form below ━(*`∀´*)ノ)

-over-

Solicitation for the "China AIGC Computing Power Industry Panoramic Report" is now open

The demand for AIGC computing power has exploded. Who will stand out in this transformation of the computing power industry?

Qubit's "China AIGC Computing Power Industry Panoramic Report" and "Most Noteworthy AIGC Computing Power Players" have officially opened external solicitations. We look forward to more outstanding institutions, products, cases, and technologies being seen by the public.

