Analysis丨Will all large models eventually run on AMD cards?

Latest update: 2024-10-18


Preface:

The data center computing sector is going through an important shift. As some large model developers scale back their reliance on pre-training, demand for inference and model fine-tuning has increased significantly.

This change has had a profound impact on downstream model and application developers, and it is prompting upstream chip makers to quickly adjust their strategic direction.

However, competing on the differentiation of a single chip alone can no longer meet market demand.

Author | Fang Wensan


AMD's new products directly compete with Nvidia's Blackwell series chips

At the Advancing AI 2024 event held in San Francisco, AMD CEO Lisa Su launched a new generation of Ryzen processors, Instinct AI accelerators, EPYC server CPUs and other products.

These products cover the entire AI computing space and are clearly designed to compete with Nvidia’s Blackwell series of chips.

According to international media reports, Nvidia has held a commanding lead in the data center GPU market over the past few years, close to a monopoly, while AMD has long ranked second in market share.

Earlier this year, Nvidia released its flagship Blackwell chip, the B200. Nvidia claims that Blackwell-based systems deliver up to 30 times the large-model inference performance of the previous Hopper generation.

According to Nvidia's plan, the B200 will enter mass production and reach the market in the fourth quarter of this year, which is expected to further widen the gap with its competitors.

However, with the launch of the MI325X, AMD aims to compete effectively with Nvidia's Blackwell chip.

AMD's Instinct MI325X platform integrates eight MI325X GPUs and supports up to 2 TB of HBM3E memory.

Its theoretical peak performance reaches 20.8 PFLOPS at FP8 precision and 10.4 PFLOPS at FP16 precision.

In terms of system architecture, AMD uses Infinity Fabric interconnect technology to achieve a bandwidth of up to 896 GB/s, and the platform's total memory bandwidth is 48 TB/s.

In addition, the power consumption of each GPU has risen from 750 W on the previous generation to 1000 W.

Compared with the Nvidia H200 HGX, the Instinct MI325X has 1.8 times the memory capacity and about 1.3 times the memory bandwidth and computing power.

In inference performance, it also delivers about 1.4 times the performance of the H200 HGX.

On mainstream models such as Meta's Llama 2, the MI325X's single-GPU training efficiency surpasses the H200, while in a multi-GPU cluster the two remain at a comparable level.
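The platform totals and the ratios against the H200 can be cross-checked with a few lines of arithmetic. The sketch below assumes the platform figures divide evenly across the eight GPUs and that 2 TB means 2048 GB. The H200 per-GPU figures (141 GB and 4.8 TB/s) are taken from Nvidia's public specifications rather than from this article, so treat them as an assumption.

```python
# Platform-level MI325X figures quoted in the article (eight GPUs per platform).
GPUS_PER_PLATFORM = 8
mi325x_platform = {
    "memory_gb": 2 * 1024,        # "2 TB" of HBM3E, assuming 1 TB = 1024 GB
    "memory_bandwidth_tb_s": 48.0,
    "fp8_pflops": 20.8,
    "fp16_pflops": 10.4,
}

# Assumption: totals are split evenly across the GPUs in the platform.
mi325x_gpu = {name: total / GPUS_PER_PLATFORM
              for name, total in mi325x_platform.items()}

# Assumption: H200 per-GPU figures from Nvidia's public specifications,
# not from the article.
h200_gpu = {"memory_gb": 141.0, "memory_bandwidth_tb_s": 4.8}

capacity_ratio = mi325x_gpu["memory_gb"] / h200_gpu["memory_gb"]
bandwidth_ratio = (mi325x_gpu["memory_bandwidth_tb_s"]
                   / h200_gpu["memory_bandwidth_tb_s"])

for name, value in mi325x_gpu.items():
    print(f"MI325X per GPU, {name}: {value:g}")
print(f"Memory capacity vs H200:  {capacity_ratio:.2f}x")
print(f"Memory bandwidth vs H200: {bandwidth_ratio:.2f}x")
```

Under these assumptions each MI325X works out to 256 GB of memory, 6 TB/s of memory bandwidth, 2.6 PFLOPS at FP8 and 1.3 PFLOPS at FP16, and the capacity and bandwidth ratios land close to the 1.8 and 1.3 times quoted above.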

In addition, AMD highlighted its next-generation GPU accelerator, the MI355X, which belongs to the MI350 series.

This model is based on the next-generation CDNA 4 architecture and uses a 3nm process technology. Its memory capacity has been upgraded to 288GB HBM3E.

The MI355X supports the FP4 and FP6 data types, which further improves AI training and inference performance while maintaining computing accuracy.

AMD said that the next-generation CDNA 4 architecture is expected to deliver a 35-fold improvement in inference performance and a 7-fold increase in AI computing power compared to the CDNA 3 architecture, while memory capacity and bandwidth will also increase by about 50%.
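To give a sense of how coarse a 4-bit floating-point type is, the sketch below enumerates every value an FP4 number can hold. It assumes the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1) described in the OCP Microscaling Formats specification. The article does not say which FP4 encoding the MI355X uses, so this is an illustration, not a description of AMD's hardware. The helper names `decode_e2m1` and `quantize` are hypothetical.

```python
def decode_e2m1(bits):
    """Decode a 4-bit pattern (sign, 2 exponent bits, 1 mantissa bit) to a float.

    Assumes the E2M1 layout with an exponent bias of 1.
    """
    sign = -1.0 if (bits >> 3) & 1 else 1.0
    exponent = (bits >> 1) & 0b11
    mantissa = bits & 1
    if exponent == 0:
        # Subnormal: no implicit leading 1, so the only nonzero value is 0.5.
        magnitude = mantissa * 0.5
    else:
        # Normal: implicit leading 1, scaled by 2^(exponent - bias).
        magnitude = (1.0 + mantissa * 0.5) * 2.0 ** (exponent - 1)
    return sign * magnitude


def quantize(x):
    """Round x to the nearest representable E2M1 value."""
    all_values = [decode_e2m1(b) for b in range(16)]
    return min(all_values, key=lambda v: abs(v - x))


positive_values = sorted({decode_e2m1(b) for b in range(8)})
print(positive_values)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print(quantize(2.4))    # 2.0
print(quantize(5.2))    # 6.0
```

With only eight distinct magnitudes available, such formats are normally paired with per-block scaling factors, which is how accuracy can be largely preserved while throughput rises.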

Therefore, next year's Instinct MI355X AI GPU is expected to achieve a huge leap in performance.

According to the data AMD has released so far, the Instinct MI355X AI GPU reaches 2.3 PFLOPS in the FP16 data format, which is about 1.8 times that of the MI325X and on par with the Nvidia B200.

In the FP6 and FP4 formats, its computing power reaches 9.2 PFLOPS, nearly double that of the B200 in FP6 and equal to that of the B200 in FP4.

The MI355X is therefore regarded as AMD's real competitor to the B200.
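The generational ratio can be reproduced from the figures already quoted. This is a rough check that assumes the MI325X per-GPU FP16 figure is the eight-GPU platform total of 10.4 PFLOPS divided evenly by eight.

```python
# Per-GPU FP16 throughput of MI325X, derived from the eight-GPU platform total
# (assumption: the platform figure is split evenly across GPUs).
mi325x_fp16_pflops = 10.4 / 8

# MI355X figures as quoted in the article.
mi355x_fp16_pflops = 2.3
mi355x_fp4_pflops = 9.2

fp16_generation_gain = mi355x_fp16_pflops / mi325x_fp16_pflops
fp4_over_fp16 = mi355x_fp4_pflops / mi355x_fp16_pflops

print(f"MI355X vs MI325X at FP16: {fp16_generation_gain:.2f}x")
print(f"MI355X FP4 vs FP16:       {fp4_over_fp16:.1f}x")
```

The first ratio comes out at roughly 1.77, matching the rounded 1.8 times. The second is 4, which is consistent with throughput doubling each time the bit width is halved from FP16 to FP8 to FP4.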

Blackwell GPU production issues now resolved

Blackwell architecture products reportedly ran into problems in production, chiefly a low yield rate, which in turn affected shipment volume.

The Nvidia Blackwell GPU is one of the first products to adopt TSMC's CoWoS-L packaging technology. This technology builds high-density interconnections through LSI bridges and is compatible with a variety of high-performance chips.

However, differences in the coefficient of thermal expansion (CTE) between the GPU die, the RDL interposer, the LSI bridges and the substrate have made production more difficult.

Morgan Stanley pointed out that the yield problem with the Nvidia Blackwell GPU mainly occurred in the back-end packaging stage, which further aggravated the already tight supply of CoWoS packaging and HBM3e memory.

However, the agency believes that Nvidia has resolved these issues internally.

Morgan Stanley predicts that Nvidia will ship up to 450,000 Blackwell GPUs in the fourth quarter of this year, generating revenue of US$5 billion to US$10 billion.

At the same time, the agency stated that since Blackwell GPU orders are booked out until the second half of next year, the unfilled orders will instead drive growth in demand for Hopper GPUs.

Nvidia's Blackwell GPU production issues were seen as a potential opportunity for AMD. However, this opportunity did not materialize as expected.

AMD's UDNA challenges Nvidia's CUDA for ecosystem position

At the Advancing AI 2024 conference, AMD further revealed its AI chip roadmap, announcing that the MI350 series based on the CDNA 4 architecture will be released next year, while the MI400 series will adopt the following generation of the architecture, known as CDNA Next.

However, AMD still faces many challenges and needs to make a long-term effort to shake the leading position Nvidia holds through its CUDA ecosystem.

UDNA, AMD's plan to unify its RDNA gaming architecture and CDNA data center architecture, is a long-term project, and its potential still needs time to be verified.

The main challenge AMD faces in expanding its market share is that Nvidia has built a solid barrier in AI software development through its CUDA platform, attracting a large number of developers and locking them into Nvidia's ecosystem.

CUDA, Nvidia's proprietary parallel computing platform and programming model, has become the de facto standard for AI and high-performance computing workloads.

The challenge facing AMD is not only to focus on improving hardware performance, but also to build a software ecosystem that can attract the attention of developers and data scientists.

To this end, AMD has increased its investment in its ROCm (Radeon Open Compute) software stack and announced at a recent event that it has successfully doubled the inference and training performance of the AMD Instinct MI300X accelerator in widely used AI models.

Additionally, the company noted that more than one million models are now able to run seamlessly on the AMD Instinct platform, which is three times the number when MI300X was first launched.

Exiting the high-end graphics card market to focus on AI

For gamers, the optimizations that mainstream game makers build exclusively for Nvidia products are hard to resist.

In view of this, AMD's current move to temporarily withdraw from the high-end graphics card market is seen as a sensible business decision.

It is worth noting that this is not the first time AMD has yielded to Nvidia in the high-end market.

Between 2017 and 2019, AMD was effectively absent from the high-end graphics card field. During that period, Vega 64 and Radeon VII faded out of the market one after another, and AMD relied only on the RX 580 and RX 590 to stay competitive in the mid- and low-end markets.

AMD's current strategic focus is on the mid- and low-end markets, aiming to expand its user base and guide developers to optimize their products through a large user base.

In addition, AMD's strategic withdrawal can be seen as a prelude to its "all-in" in the field of AI.

Given AMD Chairman and CEO Dr. Lisa Su's firm judgment that the AI super cycle has just begun, the company regards AI as a key direction for future development and is willing to invest heavily in it.

Faced with the rapidly growing market for AI chips, AMD is undoubtedly unwilling to miss the opportunity.

Although its CPU business continues to prosper thanks to the success of the Zen architecture, the market performance of its discrete graphics cards, especially high-end gaming graphics cards, has been less than satisfactory.

Therefore, leveraging strengths and avoiding weaknesses while focusing on the development of AI products has become AMD's most reasonable choice at present.

Although Nvidia still maintains a significant advantage in AI training performance, the real profit in the AI field lies in inference workloads.

In short, training is the process of "educating" an AI model with a dataset, while inference is the process of the trained model making predictions on completely new data.

It is also worth noting that AMD's pricing strategy for its server accelerator products is quite aggressive, in stark contrast to Nvidia, whose net profit margin exceeds 50%.
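The distinction can be illustrated with a deliberately tiny example. The sketch below is not how large models are built; it fits a straight line to four made-up data points (training) and then applies the fitted line to an input it has never seen (inference). The function names `train` and `infer` and the data are hypothetical.

```python
def train(xs, ys):
    """Training: fit y = w * x + b to a dataset by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    w = covariance / variance
    b = mean_y - w * mean_x
    return w, b


def infer(model, x):
    """Inference: apply the already-trained model to a new input."""
    w, b = model
    return w * x + b


# Made-up training data that follows y = 2x + 1.
dataset_x = [1.0, 2.0, 3.0, 4.0]
dataset_y = [3.0, 5.0, 7.0, 9.0]

model = train(dataset_x, dataset_y)   # done once, computationally heavy at scale
prediction = infer(model, 10.0)       # done for every new request
print(model, prediction)              # (2.0, 1.0) 21.0
```

Training runs once per model, whereas inference runs for every user request, which is why inference volume grows with adoption.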

This strategy is exactly the same as the method AMD used to weaken Intel's position in the CPU market. Given the price-performance advantage, this strategy may prompt some customers to switch from Nvidia to AMD.

AMD's current priority is to win more market orders

A research report by TechInsights pointed out that Nvidia's data center GPU shipments in 2023 reached approximately 3.76 million units, a 97.7% market share, roughly the same as in 2022.

In comparison, AMD and Intel followed with shipments of 500,000 and 400,000 units respectively, accounting for 1.3% and 1% of the market share respectively.
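As a quick sanity check, the market shares can be converted back into unit counts. The sketch below assumes all three percentages refer to the same total, which is inferred from Nvidia's units and share as quoted.

```python
# Figures as quoted in the article.
nvidia_units = 3_760_000
shares = {"Nvidia": 0.977, "AMD": 0.013, "Intel": 0.010}

# Assumption: all three shares are fractions of one common market total.
total_units = nvidia_units / shares["Nvidia"]

implied_units = {vendor: share * total_units for vendor, share in shares.items()}

print(f"Implied market total: {total_units:,.0f} units")
for vendor, units in implied_units.items():
    print(f"{vendor}: {units:,.0f} units at {shares[vendor]:.1%}")
```

Under that assumption the market total is about 3.85 million units, and shares of 1.3% and 1% correspond to roughly 50,000 and 38,000 units. That is about a tenth of the 500,000 and 400,000 quoted, so the unit counts and the percentages do not appear to be mutually consistent, and one of the two sets of figures should be treated with caution.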

For AMD, its current main goal is not to surpass Nvidia, but to actively win more market orders.

As one of Nvidia's few competitors in the market, AMD has always been seen by the technology giants as one of their options for diversifying supply.

Currently, the proportion of AI chips in AMD's business continues to rise.

According to AMD's second-quarter financial report, the Instinct MI300X GPU contributed more than $1 billion in revenue in the quarter, and full-year sales are expected to exceed $4.5 billion, approximately 15% of the company's overall sales.

It is worth noting that many companies, including Microsoft, OpenAI, Meta, Cohere, Stability AI, Lepton AI and World Labs, have chosen AMD Instinct MI300X GPU in their generative AI solutions.

In addition to gaining support from technology giants, the cost-effectiveness of AMD products is also an important factor in its market share.

Nvidia's products have a high premium due to its technological leadership and high market share.

In comparison, AMD's MI300 series has surpassed Nvidia's H100 in performance and is more affordable.

This cost-effectiveness advantage gives AMD great development potential in the AI chip market.

Against this backdrop, Lisa Su raised AMD's forecast for its 2024 artificial intelligence chip revenue to more than US$4.5 billion, up from the April estimate of US$4 billion.

This further demonstrates AMD's strong growth momentum and broad development prospects in the AI chip market.

Similar to Nvidia, AMD also plans to launch new AI chip products every year.

Conclusion:

For AMD, although the recovery of Nvidia's production capacity may bring competitive pressure, fortunately, AMD has gradually found a competitive rhythm that suits it.

Looking ahead to 2025, the two chip giants will face off again in the GPU field. This will be a crucial year to test the comprehensive strength of both sides.

With the continuous advancement of technology and the widespread application of GPU in more fields, the two companies will continue to optimize GPU architecture, improve computing efficiency, and jointly promote the arrival of the era of intelligent interconnection.

