
After DSA, where will the chip architecture go?

Latest update time:2022-09-05


Editor's Note

Software hot spots keep emerging and iterating rapidly, while CPU performance has hit a bottleneck and Moore's Law has run its course. In 2017, Turing Award winners John Hennessy and David Patterson proclaimed a "golden age of computer architecture," and the solution they proposed was the domain-specific architecture (DSA).

Now that this "golden age" is roughly half over, this article analyzes, summarizes, and looks ahead at how architecture has developed during the period:

  • Is DSA successful? Is DSA a development trend in architecture?

  • After the golden age, what are the new development trends of architecture?


1 The "golden age of architecture" has not lived up to its billing

Software hot spots emerge one after another: from cloud computing, big data, and AI to blockchain, 5G/6G, autonomous driving, the metaverse, and more, a new technology application appears roughly every two years. Moreover, existing applications keep evolving and iterating rapidly. For example, in the six years from 2012 to 2018, demand for AI computing power grew by more than 300,000 times; in other words, AI computing power doubled roughly every three and a half months.

In 2017, Turing Award winners John Hennessy and David Patterson proclaimed the "golden age of computer architecture," pointing out that, given the performance bottleneck of general-purpose computing, architectures optimized for specific application scenarios must be developed. The solution they proposed was DSA (Domain Specific Architecture).

Through CPU plus (integrated or discrete) GPU/FPGA/DSA processors, software services are accelerated via heterogeneous computing in scenarios such as networking, storage, virtualization, security, databases, video and imaging, and deep learning. The biggest problem today is that acceleration in these fields has turned what was unified CPU computing into "isolated islands" of heterogeneous computing.

Let's review the current DSA cases, successful or not, in several major fields:

  • AI DSA. Google's TPU was the first true DSA chip, and many AI chip architectures are products of similar thinking: all hope to achieve some degree of programmability on top of ASIC-level performance. Unfortunately, AI scenarios sit at the application level, where algorithms iterate rapidly and come in many varieties, so AI DSAs have not landed very successfully. The essential reason is that the flexibility of the DSA architecture cannot meet the flexibility required by application-layer algorithms.

  • Compute-in-memory. Compute-in-memory is a very broad concept that includes near-memory computing, in-memory computing, and more; in-memory computing can itself be further subdivided. Whether aimed at AI or other fields, compute-in-memory is, strictly speaking, a microarchitecture and implementation technology; at the system-architecture level it falls into the DSA category. It therefore faces the same core problem as DSA: matching the flexibility of the chip architecture to the flexibility of the domain's algorithms.

  • Network DSA. With the PISA architecture, Intel's network DSA achieves software programmability at ASIC-level performance and can program most network protocols and functions. Network DSA is a relatively successful case of the DSA concept. The essential reason is that network tasks belong to the infrastructure layer, where flexibility requirements are comparatively low. It follows that the flexibility of a DSA architecture can meet the flexibility requirements of infrastructure-layer tasks.

In addition, we also need to consider the implementation of the final scene. From the perspective of final implementation scenarios and solutions:

  • Customers never need just one DSA. What customers need is a complete solution; a single DSA covers only a small part of the customer's scenario.

  • Moreover, in cloud and edge data centers, when a CSP invests hundreds of millions of dollars to rack tens of thousands of physical servers, it does not know which user a given server will eventually be sold to, or what applications that user will run on it; nor, after that user's server resources are reclaimed and resold, what the next user will run.

Therefore, real deployments demand a comprehensive, relatively general-purpose solution, one that also improves performance and reduces cost as much as possible on that general-purpose foundation.

Independent DSAs make the system ever more fragmented, which runs counter to the trend toward broad, integrated scenarios such as cloud and edge computing and cloud-network-edge-device convergence.

2 The value of DSA

2.1 Compared with ASIC, DSA has certain flexibility advantages

When the system complexity is low, the ultimate performance and efficiency of ASIC are very suitable. However, as the complexity of the system increases, the disadvantages of ASIC are gradually exposed:

  • ASIC iteration does not match system iteration. Complex systems change far more than simple ones, and an ASIC is a tightly coupled design that cannot keep up with rapid changes in system functions and business logic. As system complexity grows, the conflict between ASICs and system flexibility intensifies.

  • ASIC efficiency is not necessarily the best. An ASIC's functions are fixed, so covering more scenarios requires a superset of functions; as system functions grow and each user exercises a smaller share of them, that superset drags down the ASIC's performance and resource efficiency.

  • ASIC hardware development is hard. Hardware development is much harder than software development, and because the system is tightly coupled, an ASIC must bake business logic into the hardware circuit design, which raises the difficulty further.

  • ASICs are hard to scale. The development difficulty above makes very large ASIC designs impractical, which limits the size of the systems they can support.

  • With a fully hardwired ASIC, users have little say over functions and business logic. The user is merely a "User," not a "Developer," and the ASIC constrains the user's business innovation.

DSA can, to a certain extent, make functions and business logic programmable at ASIC-level performance, balancing performance with partial flexibility and significantly easing the ASIC problems above. DSA has therefore become the mainstream solution for domain acceleration.

The development of the network DSA is an example. Before SDN, network chips were mainly ASICs. The first step of SDN made the control plane programmable via OpenFlow; the second step made the data plane programmable via P4, with a DSA that supports P4 programmability: Barefoot's PISA architecture and Tofino chip.
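The data-plane programmability that P4 enables can be pictured as programmable match-action tables: the hardware supplies generic tables and action primitives, and the program (table entries plus actions) defines protocol behavior. The following is a toy Python model, not real P4; the table and action names are invented for illustration:

```python
# Toy model of a P4-style match-action pipeline. The "hardware" provides
# a generic table mechanism; installing entries and actions is what makes
# the data plane programmable rather than fixed-function.

class MatchActionTable:
    def __init__(self, default_action):
        self.entries = {}          # match key -> (action, params)
        self.default = default_action

    def add_entry(self, key, action, **params):
        self.entries[key] = (action, params)

    def apply(self, packet):
        action, params = self.entries.get(packet["dst_ip"], (self.default, {}))
        return action(packet, **params)

# Actions are small primitives the hardware executes at line rate.
def forward(packet, port):
    packet["egress_port"] = port
    return packet

def drop(packet):
    packet["egress_port"] = None
    return packet

# A hypothetical IPv4 forwarding table (real P4 would use longest-prefix
# match here; exact match keeps the sketch short).
ipv4_lpm = MatchActionTable(default_action=drop)
ipv4_lpm.add_entry("10.0.0.1", forward, port=1)
ipv4_lpm.add_entry("10.0.0.2", forward, port=2)

pkt = ipv4_lpm.apply({"dst_ip": "10.0.0.1"})
print(pkt["egress_port"])  # 1
```

Changing the network's behavior means installing different entries and actions, not respinning the chip, which is the essence of the programmable data plane.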

2.2 DSA is more suitable for basic implementation layer acceleration

A system is composed of layered components, which generally fall into three categories:

  • Infrastructure layer. Its underlying business logic is relatively the most stable.

  • Application layer. Because which applications a hardware platform will run is uncertain, and applications upgrade and iterate rapidly, the application layer changes the fastest.

  • Acceleratable parts of the application layer. Some performance-sensitive algorithms can be extracted from the application layer; their flexibility sits between the infrastructure layer and the application layer, and they change at a moderate pace.

As shown in the table above, each processor engine type has its advantages and disadvantages:

  • The CPU has the best flexibility. Data centers currently demand high flexibility, so servers are built mainly around CPU processors; but the CPU's performance is relatively the lowest.

  • The GPGPU improves performance over the CPU, and although its flexibility is somewhat reduced, it still satisfies many scenarios. AI algorithms, for example, are performance-sensitive and iterate quickly, so they suit the GPGPU architecture well.

  • ASICs have no place in complex systems such as cloud and edge due to their minimal flexibility.

  • DSA pulls back from the ASIC, adding some flexibility. Practice has shown that DSA does not suit fast-changing, high-flexibility application-level scenarios such as AI acceleration; it suits slow-changing, low-flexibility scenarios such as network acceleration.

2.3 DSA is the main support for computing power

On one hand, complex systems exhibit a clear "80/20 rule": roughly 80% of a system's computation belongs to the infrastructure layer and is suitable for DSA acceleration. On the other hand, a DSA can cover a large enough domain, and its degree of programmability can meet the flexibility requirements of domain-specific tasks in the system's infrastructure layer.

A DSA-architecture processor (engine) can thus achieve extreme performance and extreme cost-effectiveness while still meeting the system's flexibility requirements.

Let's analyze it from the perspective of the whole system, whose workload we can picture as a tower-defense game:

  • Applying the system's "80/20 rule" twice over: the DSA completes 80% of the computation, the GPU completes 80% of the remaining 20% (i.e., 16%), and the CPU completes the final 4%.

  • Hardware-software fusion can make the hardware more capable, giving it features usually found in software, and thus further raise the share of computation accelerated in hardware. The DSA can then complete 90% of the computation, leaving only the remaining 10% for the CPU and GPU.
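Read literally, the 80/16/4 split above is the 80/20 rule applied recursively; a minimal sketch makes the arithmetic explicit:

```python
# 80/20 rule applied twice: DSA takes 80% of all work, the GPU takes
# 80% of what remains, and the CPU takes the rest.
total = 100.0
dsa = 0.8 * total            # 80% of all computation -> DSA
gpu = 0.8 * (total - dsa)    # 80% of the remaining 20% -> GPU (16%)
cpu = total - dsa - gpu      # the final share -> CPU (4%)
print(dsa, gpu, cpu)  # 80.0 16.0 4.0
```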

In short, DSA-architecture processors are bound to be the backbone of computing power in the whole system.

3 Disadvantages of DSA

3.1 DSA is not suitable for application layer work

The most typical case of application layer DSA is none other than AI chips.

The AI field's extreme demand for computing power, together with industry giant Google's release of the TPU, the world's first DSA-architecture processor, instantly set off an industry-wide wave of AI chips.

But roughly five years on, even though Google commands world-class technology across the whole chain from chips and frameworks to ecosystem and services, and has almost unmatched industry influence, the TPU system as a whole can still hardly be called a success. Google's AI frameworks are gradually shifting back to the GPU; the current trend in AI platforms can be described as a swing back from DSA to GPU.

Other DSA-architecture AI chips have landed in even fewer scenarios and smaller volumes.

The essential reason is simple: the flexibility an AI DSA chip provides falls far short of the flexibility upper-layer AI algorithms require. Put more generally: DSA chips offer relatively low flexibility, application-layer algorithms demand relatively high flexibility, and the gap between the two is huge.

A system is layered and partitioned. In general, the lower the layer, the more deterministic it is and the better it suits DSA-architecture processing; the higher the layer, the more uncertain it is and the better it suits general-purpose processors such as CPUs; performance-sensitive application-layer algorithms suit GPUs and processors at a comparable level of instruction complexity (such as the Graphcore IPU).

If AI chips want to be implemented in the future, the chips and application algorithms need to meet each other halfway:

  • The chip adds more flexibility (typical case: Tenstorrent Wormhole, which improves flexibility while providing ultimate system scalability);

  • Over time, the flexibility of AI algorithms decreases and gradually settles into infrastructure layer tasks.

3.2 DSA makes the system more and more fragmented, running counter to the general trend of cloud-network-edge-device integration.

Our IT equipment and systems usually go through the following four stages of evolution, gradually moving toward cloud-network-edge-device integration:

  • The first stage is the island. All devices are independent and not networked, so there is no cloud, network, or edge to speak of.

  • The second stage is interconnection. That is our Internet, which connects devices together and allows communication between devices.

  • The third stage is collaboration. The common C/S architecture is a typical example: with the division between cloud and edge, tasks are assigned to whichever side does them best, and the parts collaborate to serve the whole system.

  • The fourth stage is fusion. Collaboration is, to a degree, static: the initial division of labor may not keep up with how the system develops. Fusion is dynamic and more adaptive: services are split into microservices, clients are split into thin clients plus services, and these microservices can run adaptively on the cloud, the network, the edge, or the terminal, or appear at several of these locations at once.

In the era of cloud-network-edge-device convergence, the computing, networking, storage, and other resources of all servers and devices join a single huge resource pool. Resources can be shared conveniently, and software can be deployed anywhere and migrated easily and adaptively.

However, DSA’s approach runs counter to all of this:

  • First, a domain-specific DSA chip is, architecturally, an island.

  • Second, even within the same domain, different companies ship different architecture implementations, and the computing power of these differently architected processors is hard to pool into one large macro computing pool.

As a key concept and solution of the "golden age of architecture," DSA means "a hundred flowers bloom, a hundred schools of thought contend." It also means ever more computing architectures, ever more fragmentation, and a landscape ever harder to integrate into a whole: increasingly a scattering of loose sand. From the perspective of technological development, the DSA stage is necessary; but as the technologies mature, they will inevitably converge: general-purpose, integrated, comprehensive processors can cover more users, more scenarios, and more iterations with fewer architectures.

3.3 DSA does not solve the fundamental contradiction of the architecture

Setting aside its performance disadvantage, the CPU is the best processor. Through the most basic instructions (addition, subtraction, multiplication, division) it decouples software from hardware: hardware engineers need not care about specific scenarios, only the processor's microarchitecture and implementation; software engineers need not care about hardware details, only their own programming. On these basic instruction sets, a huge software ecosystem has been built.

The GPGPU, meanwhile, remains successful. Its advantages over DSA:

  • A GPGPU can cover many domains, while a DSA covers only one. With the one-time cost of chips rising ever higher, covering more domains means more product volume and lower cost (amortized one-time cost makes up a high share of a large chip's cost).

  • NVIDIA GPGPU and CUDA build a complete ecosystem, while the DSA architecture tends to be fragmented, making it very difficult to build an ecosystem.

  • Hot applications carry the greatest value. The GPGPU's balance of performance and flexibility meets the performance and flexibility needs of many current hot applications, while a DSA struggles to provide the flexibility hot applications require (because hot applications must iterate quickly).

What is the most fundamental contradiction in architecture? Simply put, it is the tension between performance and flexibility, and how to balance or even reconcile the two. Usually they are inversely related: as one rises, the other falls. The essence of architectural optimization and innovation is to improve one as much as possible while holding the other constant.

DSA adds some flexibility on top of the ASIC; that is its value. But it also introduces problems. If the whole system needs only one domain-specific DSA accelerator card and no others (no GPU, for example), then the system is a typical CPU+DSA heterogeneous computer and the problem is not acute.

Real cloud and edge scenarios, however, are composites of many scenarios. They need not only AI acceleration but also virtualization, network, storage, security, database, and video/graphics acceleration. If every domain's acceleration requires its own physical accelerator card:

  • First, the server's physical space and interfaces cannot accommodate so many cards.

  • Second, the interaction among these cards is an added cost. With CPU performance already stretched, the CPU simply cannot carry the data-interaction processing between DSAs.

Viewed locally, a DSA brings a significant performance gain while preserving some flexibility. Viewed from the whole system, however, the extra data interactions between DSAs (much like the surge of east-west intranet traffic after a system is decomposed into microservices) must be processed by the CPU, which makes the CPU performance bottleneck even more severe.

4 Where does architecture go after the golden age?

4.1 From separation to integration

The DPU has now become an important processor category in the industry. The biggest difference between a DPU and a smart NIC is that a smart NIC accelerates a single domain, while a DPU is an integrated platform for multi-domain acceleration. The DPU's emergence shows computer architecture turning from DSA-style division back toward integration:

  • In the first stage, the CPU is a single general-purpose computing platform;

  • In the second stage, from integration to division, CPU+GPU/DSA forms a heterogeneous computing platform;

  • In the third stage, division begins turning back toward integration with a heterogeneous computing platform centered on the DPU;

  • In the fourth stage, from division to integration, the many heterogeneous parts are integrated and restructured into a more efficient hyper-heterogeneous converged computing platform.

4.2 From heterogeneous to hyper-heterogeneous

Leaving aside the physical collaboration of multiple chips and considering only the computing architecture: computing has evolved from serial to parallel and from homogeneous to heterogeneous, and will continue evolving toward hyper-heterogeneous:

  • The first stage is serial computing. Single-core CPUs and ASICs both belong to serial computing.

  • The second stage is homogeneous parallel computing. CPU multi-core parallelism is homogeneous parallel computing.

  • The third stage is heterogeneous parallel computing. CPU+GPU, CPU+DSA, and SoCs are heterogeneous parallel computing.

  • The fourth stage, still ahead, is hyper-heterogeneous parallel computing: integrating numerous CPUs, GPUs, and DSAs "organically" into a hyper-heterogeneous whole.
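One way to picture the "organic" integration of the fourth stage is a placement policy that routes each task to the least-flexible (most efficient) engine that can still run it. The sketch below is a toy illustration; the engine list and the numeric flexibility scale are invented:

```python
# Toy placement policy for a hyper-heterogeneous platform: engines are
# ordered from least to most flexible (and, correspondingly, from most
# to least efficient). A task lands on the first engine whose
# flexibility meets its requirement, falling back toward the CPU.
ENGINES = [("DSA", 1), ("GPU", 2), ("CPU", 3)]

def place(task_flexibility_needed):
    for name, flexibility in ENGINES:
        if flexibility >= task_flexibility_needed:
            return name
    return "CPU"  # the CPU can always run the task

print(place(1))  # stable infrastructure task (e.g., packet processing) -> DSA
print(place(2))  # moderately flexible task (e.g., AI kernel) -> GPU
print(place(3))  # fast-changing application logic -> CPU
```

This mirrors the layering argument of section 2.2: the lower and more deterministic the task, the further it can sink toward the DSA; the more uncertain, the closer it stays to the CPU.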

4.3 From software and hardware collaboration to software and hardware integration

Systems keep growing larger and decompose into many subsystems, each now at the scale of a traditional standalone system. The system thus becomes a macro-system, and the subsystems become systems.

Hardware-software co-design studies how a single system's software and hardware are divided, decoupled, and made to cooperate. Based on the system's characteristics, processor types such as CPU, GPU, and DSA are chosen; if hardware such as a DSA must be developed, the division of work between software and hardware and their interaction interfaces must be considered carefully.

Hardware-software fusion studies the "collaboration" of software and hardware across a macro-system. Each constituent system embodies hardware-software co-design at a different level, and these systems must likewise settle task division, data interaction, and cooperation, so that the macro-system as a whole presents: "hardware within software, software within hardware, software and hardware deeply fused."

4.4 From open to more open

The x86 architecture dominates PCs and data centers; ARM dominates mobile and is currently expanding into PCs and servers; RISC-V, developed at the University of California, Berkeley, has become an industry-standard open ISA. Ideally, once an open RISC-V ecosystem forms, there will be no cross-platform losses or risks, and everyone can focus on innovating in CPU microarchitecture and upper-layer software.

The advantages of RISC-V are reflected in:

  • Free. The instruction set architecture costs nothing to use: no licensing, no commercial constraints.

  • Open. Any vendor can design its own RISC-V CPU, building an open ecosystem of mutual prosperity and symbiosis.

  • Standard. The most critical value: RISC-V's openness makes its ecosystem stronger, and if RISC-V becomes a mainstream architecture, cross-platform costs disappear.

  • Simple and efficient. With no historical baggage, the ISA is more efficient.

If choosing open RISC-V for the CPU is a "choice" (x86 and ARM remain alternatives), then in the coming era of hyper-heterogeneous computing, openness is "the only option there is."

As a platform must support more and more processor types and architectures, programs inevitably become harder to run on it, processor resource utilization drops, and an ecosystem becomes harder to build. We therefore need:

  • On one hand, to shift gradually from hardware-defined software to software-defined hardware, with software natively supporting hardware acceleration.

  • On the other, to build an efficient, standard, open ecosystem (reducing the number of distinct architectures and letting them gradually converge).

  • Finally, open application development frameworks are needed for processors (engines) of different architecture types, and across different architectures of the same processor type.

(Full text ends)



*Disclaimer: This article is original to its author, and its content represents the author's personal views. Semiconductor Industry Watch reprints it only to convey a different perspective, not to endorse or support its views. For objections, please contact Semiconductor Industry Watch.

