A Qualcomm SoC that crushes the SA8295P is here, and HiPhi will be the first to launch it - EEWORLD

Is the SA8295P the ceiling for cockpit SoCs? Of course not. AMD's embedded processor lines can crush the SA8295P, and Qualcomm's own cockpit SoCs such as the SA8255P also surpass it in AI. The main reason is that the SA8295P is an early-2021 product whose design scope was fixed in 2020. What nobody expected was that Chinese carmakers would turn the cockpit into a fierce spec race, so the computing power of products positioned below the SA8295P kept climbing.

On September 19, 2023, it was announced that the Jiyue 01 would be the first car to launch with the Qualcomm Snapdragon 8295 smart-cockpit chip. The Snapdragon 8295 is billed as the most powerful in-vehicle infotainment chip, built on a 5nm process with 8 times the computing power of the 8155. In the AnTuTu in-vehicle performance ranking it scores nearly 700,000, almost twice the Snapdragon 8155. That same afternoon, HiPhi (Gaohe) officially unveiled its self-developed high-compute smart cockpit platform at its Wing Day event. The platform will be the first in the industry to use the Qualcomm QCS8550 chip. According to HiPhi's official comparison, it beats the SA8295 in every respect. Barring surprises, BYD's next generation will also use the QCS8550.

The biggest performance gap between the two chips is in AI computing power.

Image source: Qualcomm

This chart can be confusing and needs some explanation. The 96 TOPS figure is computing power at INT4 precision, which the SA8295P does not support. Even at INT8 precision, however, the QCS8550 delivers 48 TOPS, which is still far ahead.
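The relationship between the two figures can be checked with simple arithmetic. The snippet below is a minimal sketch, assuming peak throughput scales inversely with operand width (a common rule of thumb, not a Qualcomm specification); the function name `scale_tops` is hypothetical.

```python
def scale_tops(tops: float, from_bits: int, to_bits: int) -> float:
    """Rescale a peak TOPS figure between integer precisions.

    Assumption: peak throughput is inversely proportional to operand width,
    so doubling the width halves the number of operations per second.
    """
    return tops * from_bits / to_bits


int4_tops = 96.0  # figure quoted for the QCS8550 at INT4
int8_tops = scale_tops(int4_tops, from_bits=4, to_bits=8)
print(int8_tops)  # 48.0
```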

The CPU also crushes that of the SA8295P, reaching up to 300 kDMIPS. The GPU is an Adreno 740 with 3.6 TFLOPS of compute, also higher than the SA8295P. As for the manufacturing process, the QCS8550 is built on 4nm, while the SA8295P is still on 5nm.

What is the QCS8550?

Image source: Qualcomm

The figure above shows how Qualcomm positions the QCS8550/QCM8550. It is clearly not an automotive-grade chip, but that hardly matters. The AMD graphics chip in the Tesla Model S is not even industrial-grade, and no one has criticized it. The QCS8550 is at least industrial-grade rather than consumer-grade. The AMD Ryzen V1000 series parts used in the current Model 3/Y are likewise industrial-grade, not automotive-grade, and no one dares to fault Tesla. Moreover, the leading domestic (Chinese) new-energy carmakers have long used Qualcomm's non-automotive-grade modules in their cockpits; at least 30% of them do so.

Qualcomm QCS8550/QCM8550 parameters

Image source: Qualcomm

The "M" in QCM indicates the variant with a modem. A glance at the CPU configuration shows that this is a modified version of the Snapdragon 8 Gen 2 from the mobile phone field. The model number alone gives it away: the 8 Gen 2 is the SM8550.

Comparison between Snapdragon 8 Gen 3 and 8 Gen 2

Comparing the table above with the QCS8550 parameters makes it clear that the QCS8550 is the 8 Gen 2; the two are identical.

Achieving powerful AI computing power is not hard; achieving it at low cost is, and low-cost AI computing power is what Qualcomm does best. For a chip, hardware cost is roughly proportional to die size. Qualcomm's SoC dies are generally very small, usually under 120 square millimeters, while Nvidia Orin and Huawei MDC 610 exceed 400 square millimeters. Among cockpit SoCs, Qualcomm's AI computing power is unusually strong.

Can this 48 TOPS really run large models? It certainly cannot run models on the scale of ChatGPT (GPT-3); even a single H100 cannot. Running such a model smoothly takes at least 8 H100s and two $6,000 CPU chips.
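A rough memory estimate shows why a single H100 falls short. The sketch below assumes the model in question is GPT-3 with its published 175 billion parameters, stored as FP16 weights (2 bytes each), and the 80 GB H100 variant. It counts weights only and ignores activations and the KV cache, so it is a lower bound rather than a sizing guide.

```python
import math

PARAMS = 175e9          # GPT-3 parameter count (published figure)
BYTES_PER_PARAM = 2     # assumption: FP16 weights
H100_MEMORY_GB = 80     # assumption: 80 GB H100 variant

# Memory needed just to hold the weights, in gigabytes.
weights_gb = PARAMS * BYTES_PER_PARAM / 1e9

# Smallest number of GPUs whose combined memory can hold the weights.
min_gpus = math.ceil(weights_gb / H100_MEMORY_GB)

print(weights_gb)  # 350.0
print(min_gpus)    # 5
```

Five GPUs is the bare minimum for the weights alone. Once activations, the KV cache, and headroom are added, a standard eight-GPU server is a plausible practical floor, which is consistent with the figure quoted above.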

Qualcomm's AI computing power is so strong mainly because of its distinctive DSP architecture and VLIW instruction set, which trace back to ATI. As early as 2004, Qualcomm and ATI agreed to integrate ATI's 3D graphics technology, centered on ATI Imageon, into Qualcomm's next-generation mobile processors. ATI was later acquired by AMD, and ATI Imageon was renamed AMD Imageon. In 2009, Qualcomm bought AMD's mobile device assets for US$65 million, obtaining the intellectual property for AMD's vector graphics and 3D graphics technology and ending its licensing payments to AMD. Qualcomm then developed its own GPU brand, Adreno. Adreno has kept bearing fruit ever since and, after years of evolution, holds a dominant position in the mobile GPU market.

ATI's technology did more than lay the groundwork for Adreno; ATI also developed VLIW technology. Take the ATI Radeon HD 5800 as an example. Its GPU consists of 20 SIMD computing engines, each made up of 16 thread processors (TPs), and each TP is a 5-way VLIW processor. Although VLIW later disappeared from GPUs, it thrives in DSPs and has become even more powerful in the AI era, helping Qualcomm become the dominant force in mobile.

VLIW stands for Very Long Instruction Word.

Comparison of several instruction sets

A VLIW instruction resembles a bundle of several RISC instructions. The idea behind VLIW is to simplify the hardware as far as possible. The hardware only fetches and executes instructions and cares about nothing else. The hard part is pushed onto the compiler, which is responsible for instruction scheduling.

First, what is a compiler? When we write code line by line in a language such as C, C++, or Java, a compiler has to "translate" it into an executable program before it can run. Computers (mainly CPUs) only recognize the two digits 0 and 1. All source code must be compiled, that is, translated into a long sequence of 0s and 1s (with an intermediate step that generates assembly code). That binary code is the CPU's "mother tongue", which it executes for us at high speed.
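The translation step can be observed directly in Python, with one caveat: CPython compiles source into bytecode for its own virtual machine rather than into native CPU instructions, so this is only an analogy for what a C or C++ compiler does. The function `add_scaled` is a made-up example, and the exact opcode names printed vary between Python versions.

```python
import dis


def add_scaled(a, b):
    # One line of source code...
    return a + 2 * b


# ...becomes several lower-level instructions after compilation.
for ins in dis.get_instructions(add_scaled):
    print(ins.opname, ins.argrepr)
```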

In VLIW, the compiler packs multiple independent instructions into one very long instruction word. It works out how many cycles each instruction takes from the instruction's type, and groups instructions with similar cycle counts so they can be issued and executed together.

The biggest advantage of VLIW is parallel computing. For example, if the VLIW data bus is 1024 bits wide, it can fetch 256 4-bit values at once and compute on them in parallel (provided the hardware has 256 sets of ALUs, registers, and so on). A single instruction then completes 256 operations, as if there were 256 cores.

The disadvantages are just as obvious. If one of the 256 calculations stalls, the other 255 must stop and wait for it to finish. This is lockstep execution: every unit must keep exactly the same pace. A traditional superscalar CPU does not have this limitation, because it can execute out of order. Another disadvantage is that even if there are only 10 instructions to run, the other 246 slots still idle, which means high power consumption. This resembles recent variable-vector-length SIMD, except that SIMD merely fetches 256 4-bit values at a time, whereas VLIW relies entirely on software to achieve parallelism.

In 1994, Intel and HP signed an agreement to jointly develop processors for high-performance computing (HPC), which later became Itanium. They proposed EPIC (Explicitly Parallel Instruction Computing), which is based on VLIW. This proved too demanding for open software ecosystems and faded after 2000, but the combination of VLIW and DSP gradually rose.
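The packing step and its idle-slot cost can be illustrated with a toy scheduler. This is a minimal sketch, not how any real VLIW compiler (Qualcomm's included) works: the `Instr` type, the `pack_bundles` function, the register names, and the 4-slot width are all hypothetical, and the greedy in-order rule ignores instruction latencies entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str            # register written by this instruction
    srcs: tuple = ()     # registers read by this instruction


def pack_bundles(program, width):
    """Greedily pack instructions, in order, into bundles of `width` slots.

    A new bundle is started when the current one is full, or when the next
    instruction reads or writes a register written earlier in the same bundle.
    """
    bundles, current, written = [], [], set()
    for ins in program:
        depends = ins.dest in written or any(s in written for s in ins.srcs)
        if current and (len(current) == width or depends):
            bundles.append(current)
            current, written = [], set()
        current.append(ins)
        written.add(ins.dest)
    if current:
        bundles.append(current)
    return bundles


program = [
    Instr("load a", "r1"),
    Instr("load b", "r2"),
    Instr("add", "r3", ("r1", "r2")),
    Instr("load c", "r4"),
    Instr("mul", "r5", ("r3", "r4")),
]

WIDTH = 4
bundles = pack_bundles(program, WIDTH)
for bundle in bundles:
    slots = [ins.name for ins in bundle] + ["nop"] * (WIDTH - len(bundle))
    print(slots)

used = sum(len(bundle) for bundle in bundles)
print(f"{used} of {len(bundles) * WIDTH} slots do useful work")
```

With this input the five instructions spread over three bundles, so 7 of the 12 slots are filled with `nop`. Those empty slots are the idle hardware described above, and how many of them remain depends entirely on how well the compiler finds independent instructions.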

VLIW processor schematic

The biggest difference between a DSP and a traditional CPU or GPU is that the DSP uses a Harvard architecture, which splits memory into two spaces, one for programs and one for data. Each space has its own bus to the processor core, so both can be accessed simultaneously, and each memory is addressed and accessed independently. This arrangement doubles the processor's data throughput and, more importantly, supplies data and instructions to the core at the same time.

DSP chips commonly use 2- to 6-stage pipelines to cut instruction execution time and so raise processing power. Instruction execution overlaps fully, with different instructions active in every instruction cycle. A DSP behaves more like a systolic processor: data is loaded once, circulates for a long time, and efficiency is extremely high. The DSP's greatest strength is the zero-overhead loop. AI engines are typically zero-overhead loop structures, with none of the compare-and-branch overhead that normally controls a loop.
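The benefit of a zero-overhead loop can be sketched with a multiply-accumulate (MAC) kernel and a toy cycle model. The model is an assumption made for illustration only: one cycle per MAC, plus a hypothetical `branch_overhead` in cycles per iteration for the counter update, compare, and branch that a plain CPU loop would execute. Neither function reflects measured figures for any real DSP.

```python
def dot_mac(xs, ws):
    """Dot product written as a chain of multiply-accumulate steps."""
    acc = 0
    for x, w in zip(xs, ws):
        acc += x * w  # one MAC per element
    return acc


def cycle_estimate(n, branch_overhead):
    """Toy model: 1 cycle per MAC plus loop-control cycles per iteration."""
    return n * (1 + branch_overhead)


print(dot_mac([1, 2, 3], [4, 5, 6]))            # 32
print(cycle_estimate(1024, branch_overhead=2))  # 3072, software-managed loop
print(cycle_estimate(1024, branch_overhead=0))  # 1024, zero-overhead loop
```

Under these assumptions the loop body is so short that loop control would dominate the runtime, which is why hardware loop support matters most for tight MAC kernels like those in AI inference.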

However, a DSP is essentially designed like a CPU and is not in itself suited to parallel computing. It fits best with serial data-stream computations such as image compression or the fast Fourier transform (FFT). VLIW, by contrast, is an inherently parallel instruction format. Combining the two suits AI computing very well, because AI is parallel matrix computation that also arrives as a data stream.

Qualcomm's AI performance depends heavily on the compiler, but a compiler schedules statically and cannot adjust at run time, so some models may perform poorly on Qualcomm chips. Many cockpit developers have never used the computing power of Qualcomm's DSP, and few people in intelligent driving use DSPs because they are too hard to program. In the AI100, Qualcomm's only general-purpose AI accelerator, Qualcomm did not use its strongest DSP architecture but a conventional MAC array, mainly to broaden the range of applications as much as possible.