The intelligent computing center dilemma: having cards doesn't mean having computing power
Last updated: 2024-09-11
“
Still on a buying spree? It's time someone stepped up and thought about how to use all those cards more efficiently.
”
Author | Hu Min
Editor | Zhou Lei
"If you could go back to 2018, what would you do?"
"Stock up on a large batch of Nvidia cards first."
This exchange is an internet joke, but it captures the frenzy of vendors hoarding GPUs and scouring the world for cards. As everyone knows, GPUs have been in short supply for the past two years, yet we came across a very real case: a traditional IDC provider holding a large number of GPU cards that were sitting idle.
This points to the current reality: intelligent computing is booming, yet inefficient card usage remains widespread across the market. In other words, having cards does not necessarily mean having computing power.
01
Intelligent computing is hot, but low card efficiency is a concern
So why is card usage inefficient?
"Because intelligent computing is not just about the cards. It is a system that coordinates software and hardware, involving multiple core capabilities such as computing, storage, and networking. When the software capabilities fall short, the cards themselves are held back," said Sha Kaibo, vice president of Tencent Cloud and senior cloud computing technical expert.
It is like buying a top-of-the-line sports car without knowing much about racing and without a professional team to tune the vehicle and plan the best racing line: you can only drive it on ordinary roads.
This is something no IDC provider wants to see, because it is not just a matter of idle resources but of a missed transformation opportunity. By putting these cards to good use, IDC providers can transform into AIDC (intelligent computing center) providers and develop new lines of business.
Of course, IDC providers are not alone in facing inefficient card usage; many large-model companies also urgently need to improve computing efficiency. This is especially true this year, as training parameter counts keep growing. Last year most models were still at the billion or tens-of-billions scale; this year the race has pushed it to hundreds of billions. Tencent's Hunyuan large model, for example, has expanded to the trillion-parameter scale.
Such enormous parameter counts in turn push the underlying compute clusters to ever larger scales. Some industry practitioners say that from this year on, 10,000 cards will be the entry ticket for intelligent computing clusters, and only clusters above that size will really matter.
The continuous expansion of cluster size poses ever higher challenges to the underlying AI infrastructure: how to network at ultra-large scale, sustain effective cluster-wide computing efficiency, keep training stable and available, and locate and diagnose faults quickly. It is like the difference between coordinating 1,000 people and coordinating 10,000: the two are not remotely comparable in difficulty.
A more immediate consequence of low computing efficiency is that model training costs rise further, and training cost is currently a sore point for domestic large-model companies.
On the one hand, large-model training is famously expensive. On the other, capital markets have grown more rational about domestic large-model companies this year: last year more than 200 of them received investment, while in the first half of this year only a handful of leaders such as Moonshot AI and Zhipu closed financing rounds.
Spending cannot go on forever, and improving card usage efficiency has become urgent. As one of the leading cloud vendors, Tencent Cloud has quietly made its move.
On September 5, at the Tencent Global Digital Ecosystem Conference, Tencent Cloud launched its AI Infra brand, Tencent Cloud Intelligent Computing, integrating standalone products such as the HCC high-performance computing cluster, the IHN high-performance network, high-performance cloud storage, the acceleration framework, containers, vector databases, and the intelligent computing suite, to help the industry break through technical bottlenecks and unleash AI productivity.
02
Intelligent computing starts with solving customer problems
In fact, even before launching the Tencent Cloud Intelligent Computing brand, Tencent Cloud had been iterating on and delivering intelligent computing technologies and products to improve performance and reduce the cost of use.
In April last year, Tencent Cloud officially released its new-generation HCC high-performance computing cluster for large-model training; in June last year, it publicly detailed its self-developed Xingmai high-performance computing network for the first time. It later released the AIGC cloud storage solution, and launched a proprietary-cloud intelligent computing suite built on mature public cloud practice, enabling enterprises to build high-performance proprietary intelligent computing clouds on their own hardware.
According to Sha Kaibo, the brand was established now for two reasons: on the demand side, the rise of large AI models has placed more advanced requirements on the entire cloud infrastructure; on the supply side, those same models have driven Tencent Cloud to evolve many of its infrastructure capabilities. The brand exists to let more customers understand what Tencent Cloud Intelligent Computing can do, and to deliver those capabilities in ways that better support customers' businesses.
Customers often encounter the following problems during large model training:
The first is how to improve training efficiency and reduce the failure rate.
Low training efficiency can stem from several factors. The first is long startup time: owing to a mix of software and hardware issues, getting a training run started can currently take as long as a month in the industry.
The second is that failures occur frequently during training. The failure rate of large-model training is not to be underestimated: by some statistics, GPUs fail at more than 120 times the rate of CPUs. Not long ago, when Meta released its Llama 3 405B large language model, it also published a telling result: the 405B model was trained for 54 days on a cluster of 16,384 NVIDIA H100 80G GPUs, and over those 54 days the cluster suffered 419 unexpected component failures, an average of one failure roughly every three hours.
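As a rough sanity check, that "one failure every three hours" figure follows directly from Meta's reported numbers; a minimal sketch of the arithmetic:

```python
# Rough mean-time-between-failures estimate for the Llama 3 405B run,
# using only the figures reported by Meta (54 days, 419 failures).
TRAIN_DAYS = 54
FAILURES = 419

train_hours = TRAIN_DAYS * 24          # 1,296 hours of wall-clock training
mtbf_hours = train_hours / FAILURES    # mean time between failures

print(f"MTBF: {mtbf_hours:.1f} hours")  # ~3.1 hours, i.e. one failure every ~3 hours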
By integrating software and hardware capabilities, Tencent Cloud's intelligent computing cluster can go from machine deployment to the start of training in just one day. As for failures, the daily failure count of Tencent Cloud's cluster has been brought down to 0.16, about one third of the industry level.
This achievement traces back to its network and storage products, acceleration framework, vector database, and intelligent computing suite. According to Sha Kaibo, Tencent Cloud's self-developed Xingmai network automatically senses traffic and topology for scheduling, raising network throughput, and when failures occur it locates and handles the problem links to reduce training interruptions. In a 10,000-card cluster, network faults can be detected in 1 minute, located in 3 minutes, and resolved in 5 minutes; in a 1,000-card cluster, the share of time spent on communication is cut to 6%, half the industry level. Tencent Cloud's high-performance parallel file storage, CFS Turbo, supports concurrent reads and writes from a thousand cards.
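To see why fast fault handling matters at this scale, here is an illustrative back-of-the-envelope model. This is a sketch, not Tencent Cloud's methodology: the failure-rate figures echo the article, while the hours lost per failure are assumptions made up for illustration.

```python
# Illustrative estimate of how much wall-clock time a cluster loses to
# failures, comparing slow vs. fast fault handling. Recovery-time costs
# are assumed values; the failure rates follow the article's figures.

def effective_fraction(failures_per_day: float, hours_lost_per_failure: float) -> float:
    """Fraction of wall-clock time that remains useful training time."""
    hours_lost_per_day = failures_per_day * hours_lost_per_failure
    return max(0.0, 1.0 - hours_lost_per_day / 24.0)

# Industry-level cluster: ~0.48 failures/day (3x Tencent's 0.16); assume
# each failure costs ~2 hours (detection + diagnosis + checkpoint restart).
print(f"industry: {effective_fraction(0.48, 2.0):.1%}")   # ~96.0%

# Reported Tencent Cloud cluster: 0.16 failures/day; assume minute-level
# fault location keeps the cost near ~0.5 hours per failure.
print(f"tencent : {effective_fraction(0.16, 0.5):.1%}")   # ~99.7%
```

Small per-incident savings compound: over a 54-day run, the gap between the two scenarios amounts to roughly two full days of training time.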
On the acceleration framework side, Tencent Cloud's TACO also speeds up computing on the cloud. According to Tencent Cloud staff, under the same hardware, a system that could previously process 100 tokens per second can reach 200 or even 300 tokens per second with TACO, and the higher throughput does not introduce much extra latency.
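Claims like "100 to 300 tokens per second" are easy to verify with a generic measurement harness. The sketch below is not TACO's API (which the article does not show); `generate` is a hypothetical placeholder for whatever serving call returns a token sequence.

```python
# Generic throughput harness for comparing an inference stack before and
# after enabling an acceleration framework. `generate` stands in for any
# serving call that takes a prompt and returns the generated tokens.
import time

def measure_tokens_per_second(generate, prompts, runs: int = 3) -> float:
    """Average decoded tokens per second across several runs."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        total_tokens = sum(len(generate(p)) for p in prompts)  # tokens per prompt
        rates.append(total_tokens / (time.perf_counter() - start))
    return sum(rates) / len(rates)

# Usage sketch: run the same prompts through both backends, then compare.
# baseline_rate = measure_tokens_per_second(baseline_generate, prompts)
# taco_rate     = measure_tokens_per_second(accelerated_generate, prompts)
# print(f"speedup: {taco_rate / baseline_rate:.2f}x")
```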
The second is how to make training more compatible and deployment more flexible.
In the past, a training task typically ran on chips from a single vendor. Now, with chip supply tight and major chip makers stepping up their GPU investments, heterogeneous networking of cards from different models and vendors is becoming increasingly common in intelligent computing centers. At the same time, many industries have extremely strict data security and compliance requirements, so much training and inference can only run in local data centers.
To solve the problem of training across multiple card types, Tencent Cloud has adopted a "one cloud, multiple cores" architecture that can adapt to, manage, and schedule a variety of CPU and GPU chips, effectively reducing supply chain risk while meeting different workloads' needs for different kinds of computing power.
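The scheduling idea behind "one cloud, multiple cores" can be sketched roughly as follows. This is a hypothetical toy, not Tencent Cloud's implementation: the pool names, capacities, and placement policy are all made up for illustration.

```python
# Toy scheduler illustrating the "one cloud, multiple cores" idea:
# one control plane placing jobs across heterogeneous accelerator pools.
from dataclasses import dataclass

@dataclass
class Pool:
    name: str        # vendor/model of accelerator (illustrative names)
    free_cards: int  # currently idle cards
    supports: set    # workload types this pool is validated for

POOLS = [
    Pool("vendor_a_gpu", free_cards=512, supports={"train", "infer"}),
    Pool("vendor_b_npu", free_cards=2048, supports={"infer"}),
]

def place(job_type: str, cards_needed: int) -> str | None:
    """Pick the first pool that supports the workload and has capacity."""
    for pool in POOLS:
        if job_type in pool.supports and pool.free_cards >= cards_needed:
            pool.free_cards -= cards_needed
            return pool.name
    return None  # no compatible capacity: queue the job or spill elsewhere

print(place("infer", 1024))  # vendor_b_npu: inference can run on either pool
print(place("train", 256))   # vendor_a_gpu: training needs a validated pool
```

The point of the toy is the separation it illustrates: workloads declare what they need, and a single control plane decides which vendor's silicon actually serves them.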
For deployment, Tencent Cloud launched a proprietary-cloud intelligent computing suite that lets enterprises build high-performance proprietary intelligent computing clouds on their own hardware, meeting the need to train large models in a private computing environment. The suite mirrors the public cloud configuration, with the Xingmai network, AIGC cloud storage, and TACO all included in the packaged solution.
Today, according to Sha Kaibo, Tencent Cloud Intelligent Computing has served more than 90% of large-model enterprises, and those companies have seen their training costs come down. After one large-model customer adopted Tencent Cloud's complete computing solution, its costs fell by 20 million yuan in a single year.
Beyond large-model customers, some AI application customers are also using the solution. In the second half of last year, a community e-commerce company doing AI applications on the public cloud replaced overseas chips with domestic chips supplied by Tencent Cloud. With its key business metrics unchanged, the company completed the chip swap within 21 days: model adaptation took about two weeks and framework transformation about a week. And the IDC company mentioned at the start, working with Tencent Cloud, sold out almost all of its GPU resources within half a year.
03
Intelligent computing: searching for the cloud market's growth direction
From the customer's perspective, the hope is that cloud vendors will provide more high-performance AI infrastructure to improve the quality and efficiency of their businesses. From the cloud vendor's perspective, racing to improve intelligent computing capability is also about seizing new growth in the cloud market.
Previously, cloud vendors had locked onto several incremental directions: going overseas, lower-tier markets, and the digital transformation of traditional industries. Going overseas currently faces great uncertainty: in developed markets such as Europe and the United States there are trust issues, long-term usage is hard to scale, and data centers are costly to operate, so overseas operations are basically still losing money.
The main problem in lower-tier markets is lack of money: customers there have no strong demand for cloud computing, and accounts paying 1,000 yuan a year are everywhere. The biggest difficulty in the digital transformation of traditional industries is gaining insight into industry needs and mastering each industry's know-how; after all, what traditional enterprises ultimately want from digital transformation is visible improvement in business quality and efficiency.
Expanding in these incremental directions is like cutting a new path through dense forest: every step is full of challenges and uncertainty.
At the same time, cloud practitioners have long been mired in the quagmire of competing for the existing market, frustrated by stalled growth and fighting to poach customers from rivals, which makes them all the more eager to find new incremental markets. The emergence of large models has given cloud vendors new hope for that growth.
In this year's financial reports from several cloud vendors, AI's contribution to cloud revenue is striking. In the second quarter, AI drove Alibaba Cloud back to growth, with quarterly revenue up 6% to 26.549 billion yuan, AI-related product revenue growing at triple digits and the public cloud business at double digits; Tencent likewise said that, helped by growth in cloud services revenue among other factors, its enterprise services revenue achieved double-digit growth.
Many people predict that large AI models will become the biggest driver of future cloud market growth, and the only chance for public cloud services to return to a high-growth era. It is on this prediction that Tencent Cloud established its intelligent computing brand, and other cloud vendors are making similarly intensive moves.
Whether large models can truly drive cloud usage remains controversial. Although the domestic AI public cloud services market has grown, the share each cloud vendor captures amid fierce competition seems unable to satisfy their enormous appetites.
According to IDC's newly released "AI Cloud 2023" report, the domestic AI public cloud services market reached 12.6 billion yuan in 2023, up 58.2% year over year. The growth rate is encouraging, but do the math: once that 12.6 billion is split among the various cloud vendors, each is left with only a few hundred million to a few billion yuan. That does lift cloud vendors' revenue, but not by as much as it might appear.
Time will tell how much growth large models can bring to the cloud. For now, judging from cloud vendors' intensive moves, a fierce contest of intelligent computing strength has quietly begun; where the cloud market goes from here remains to be seen.