Has the TOPS competition for AI chips gone astray?
Source: Compiled from The Register by Semiconductor Industry Observer (ID: icbank).
For chipmakers, AI PCs have become a TOPS race — with Intel, AMD and Qualcomm all trying to outdo each other.
As we learned last week, AMD's next-generation Ryzen AI 300 series chips will have 50 NPU TOPS, while Intel's Lunar Lake parts will offer 48. Meanwhile, Qualcomm and Apple have previously announced that their NPUs will hit 45 and 38 TOPS, respectively.
Historically, this kind of marketing has worked very well: the bigger the number, the easier it is for us customers to understand. But, just like clock speeds and core counts, it’s never as simple as the marketers make it out to be. That’s certainly the case with TOPS.
One of the biggest problems is that TOPS (tera operations per second, that is, how many trillion operations a chip can execute each second) is missing a key piece of information: precision. This means that 50 TOPS at 16-bit precision is not the same as 50 TOPS at 8-bit or 4-bit precision.
Typically, when we talk about TOPS, it’s assumed to be INT8, meaning 8 bits of precision. However, as 6-bit and 4-bit lower-precision data types become more common, it can no longer be taken for granted. To their credit, Intel and AMD have done a better job clarifying precision, but it remains a potential point of confusion for consumers trying to make an informed decision.
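To make that concrete, here is a rough sketch (our own arithmetic, not any vendor's methodology) of normalizing TOPS claims to a common precision, under the simplifying assumption that peak throughput roughly doubles each time the data width halves; that holds for many NPUs, but not all of them.

```python
# Rough sketch: normalizing vendor TOPS claims to a common precision.
# Assumption (ours, not the vendors'): peak throughput roughly doubles
# each time the data width halves. Real hardware doesn't always scale this way.

def normalize_tops(claimed_tops: float, claimed_bits: int, target_bits: int = 8) -> float:
    """Scale a TOPS figure quoted at one precision to an estimate at another."""
    return claimed_tops * (claimed_bits / target_bits)

# A "50 TOPS" claim means very different things depending on the precision
# it was measured at.
for bits in (4, 8, 16):
    print(f"50 TOPS at INT{bits} is ~{normalize_tops(50, bits):.0f} INT8-equivalent TOPS")
```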
Even assuming that the claimed performance is measured at the same precision, TOPS is only one of many factors that affect AI performance. Just because two chips boast similar TOPS or TFLOPS figures doesn’t mean they can actually sustain them in practice.
Take Nvidia's A100 and L40S, for example, which are rated at 624 and 733 dense INT8 TOPS, respectively. Obviously the L40S should perform slightly better when running (that is, inferencing) AI models, right? It's not that simple. Technically the L40S is faster on paper, but its memory is much slower: 864GB/sec, versus the 1.55TB/sec of bandwidth on the 40GB A100.
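A quick way to see why (a roofline-style back-of-envelope calculation of our own, using the figures above): divide peak compute by memory bandwidth to find how many operations the chip must perform per byte it fetches before compute, rather than memory, becomes the bottleneck.

```python
# Back-of-envelope sketch: how many operations per byte of memory traffic a
# chip needs before its compute units, not its memory, become the limit.

def ops_per_byte_breakeven(peak_tops: float, bandwidth_gb_s: float) -> float:
    """Roofline-style break-even arithmetic intensity (ops per byte)."""
    return (peak_tops * 1e12) / (bandwidth_gb_s * 1e9)

a100 = ops_per_byte_breakeven(624, 1550)  # 40GB A100: 624 dense INT8 TOPS, 1.55 TB/s
l40s = ops_per_byte_breakeven(733, 864)   # L40S: 733 dense INT8 TOPS, 864 GB/s

print(f"A100 needs ~{a100:.0f} ops/byte, L40S needs ~{l40s:.0f} ops/byte")
# LLM decoding performs far fewer operations per byte than either figure,
# so the higher-bandwidth A100 often wins despite its lower TOPS rating.
```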
Memory bandwidth is as important to AI PCs as it is to powerful data center chips, and its impact on performance may be more noticeable than you think.
From the perspective of large language models, inference performance can be split into two phases: first-token latency and second-token latency.
For a chatbot, first token latency refers to how long it takes to think about your question before it can start answering. This step is usually compute-bound — meaning more TOPS is definitely better.
Meanwhile, second-token latency refers to the time it takes for each subsequent word of the chatbot’s response to appear on the screen. This step is heavily limited by memory bandwidth.
This phase will be more noticeable to the end user — you’ll feel the difference between a chatbot that can generate 5 words per second and one that can generate 20 words per second.
That’s why Apple’s M-series chips prove to be great little machines for running local LLMs. Their memory is packaged together with the SoC, which allows for lower latency and higher bandwidth. Even an older chip like the M1 Max is capable of running LLMs because it has 400GB/sec of memory bandwidth to work with.
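A crude estimate (ours, not an Apple benchmark) shows why bandwidth dominates here: during decoding, essentially all of the model's weights have to be streamed from memory for every token generated, so bandwidth puts a hard ceiling on tokens per second.

```python
# Bandwidth-limited ceiling on decode throughput: every generated token
# streams (roughly) the whole set of weights from memory. This ignores
# compute time, KV cache traffic and all other overheads.

def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens per second if memory bandwidth is the only limit."""
    return bandwidth_gb_s / model_size_gb

# M1 Max: 400 GB/s (figure cited above); a 4-bit 8B-parameter model is ~4 GB
# of weights (see the sizing discussion below).
print(f"~{max_tokens_per_sec(400, 4):.0f} tokens/sec ceiling on an M1 Max")
# Real-world numbers land well below this ceiling, but the scaling with
# bandwidth is why on-package memory helps so much.
```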
Now, we’re starting to see more chipmakers, like Intel, package memory with compute capabilities. Intel’s upcoming Lunar Lake processors will come with up to 32GB of LPDDR5x memory running at 8500MT/sec and supporting four 16-bit channels.
This should greatly improve performance when running LLMs on-device, but it may not be welcomed by right-to-repair advocates.
We can help reduce memory pressure by developing models that can run at lower precision - for example, quantizing them to 4-bit weights. This also has the benefit of reducing the amount of RAM required to fit the model into memory.
However, we either need smaller, more nimble models, or more memory to hold them. Somehow, in 2024, we are still shipping PCs with 8GB of RAM, and memory gets pretty tight if you want to run anything beyond the smallest models. In general, a 4-bit quantized model takes about 512MB per billion parameters; for a model like Llama3-8B, that's roughly 4GB of RAM just for the weights.
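Putting numbers on that (our own arithmetic, using the roughly 512MB-per-billion-parameters rule of thumb above, and ignoring KV cache and runtime overhead):

```python
# Approximate weight footprint at different quantization levels.
# Ignores KV cache, activations and runtime overhead, which add more on top.

def weight_memory_gib(params_billion: float, bits_per_weight: int) -> float:
    """Weight memory in GiB for a model of the given size and precision."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

for bits in (16, 8, 4):
    print(f"8B parameters at {bits}-bit: ~{weight_memory_gib(8, bits):.1f} GiB")
# 4-bit comes out around 3.7 GiB, which is why an 8B model is already a tight
# fit on an 8GB PC once the OS and other applications take their share.
```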
We can use smaller models, like Google's Gemma-2B, but more likely we will have multiple models running on our system at the same time. So what you can do with an AI PC depends not only on TOPS and memory bandwidth, but also on how much memory you have.
An idle model can be swapped out to disk after a period of inactivity, but reloading it into memory incurs a performance penalty when it's needed again, so you'll also want a very fast SSD.
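How painful that reload is depends almost entirely on the drive; a quick sketch, with the drive speeds below chosen purely for illustration:

```python
# Best-case time to pull a swapped-out model back into memory, assuming the
# drive sustains its sequential read rate (drive speeds here are illustrative).

def reload_seconds(model_size_gb: float, drive_gb_per_sec: float) -> float:
    """Time to read the model back from disk at a sustained rate."""
    return model_size_gb / drive_gb_per_sec

# ~4 GB of 4-bit weights over a fast NVMe drive vs an older SATA SSD.
print(f"NVMe (5 GB/s): ~{reload_seconds(4, 5.0):.1f} s")
print(f"SATA (0.5 GB/s): ~{reload_seconds(4, 0.5):.1f} s")
# A sub-second pause feels like the assistant never left; an eight-second
# stall very much does not.
```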
In the increasingly mobile world of computing, power is a major factor, but one that is not always clearly addressed.
Take two chips that each produce about 50 TOPS. If one consumes 10 watts and the other requires 5 watts, you will notice the difference in battery life, even though in theory they should perform about the same. Similarly, a chip that delivers 25 TOPS at 3 watts may end up using less energy overall than a 50 TOPS, 10-watt chip, even if it takes twice as long to finish the job.
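The arithmetic behind that claim is simple: battery drain tracks energy, which is power multiplied by time, not instantaneous power. A minimal sketch with made-up chips matching the numbers above:

```python
# Energy, not instantaneous power, is what empties the battery.

def joules(watts: float, seconds: float) -> float:
    """Energy consumed = power x time."""
    return watts * seconds

fast_chip = joules(watts=10, seconds=1.0)  # 50 TOPS part finishing a job in 1 s
slow_chip = joules(watts=3, seconds=2.0)   # 25 TOPS part taking twice as long

print(f"fast chip: {fast_chip:.0f} J, slow chip: {slow_chip:.0f} J")
# 10 J vs 6 J: the slower, lower-TOPS chip still wins on battery life for this job.
```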
In short, many factors matter just as much as, if not more than, how many TOPS your chip can deliver.
This isn’t to say TOPS aren’t important. They are. There’s a reason why Nvidia, AMD, and Intel are pushing the limits of their chips with each generation. More TOPS means you can solve bigger problems, or solve the same problems faster.
But as with most systems, a careful balance of memory, compute, I/O, and power consumption is critical to achieving the performance characteristics required of an AI PC. Unfortunately, communicating any of this is much harder than pointing to a big TOPS number—so it seems we’re doomed to a repeat of the GHz wars.
Reference Links
https://www.theregister.com/2024/06/11/ai_pc_tops/?td=rt-3a