A brief list of terms related to AI processors

Publisher: jiaohe1 | Last updated: 2019-07-03 | Source: 半导体行业观察 (Semiconductor Industry Observer)

In this article, we introduce the most common choices of core processor architecture used in AI systems from three perspectives: scalar, vector, and spatial. For each, we make some generalizations about its performance characteristics and the types of algorithms it is best suited to. In later articles, we will discuss in more depth how these architectures are implemented and how they perform on different types of AI workloads.




Flynn classification


Any discussion of processor architectures would be incomplete without the ever-popular "Flynn taxonomy", since its nomenclature is so common. Its original intent was to describe how a computer ingests instruction and data streams, and it arguably still makes the most sense in that context. Modern processors are often closer to one characteristic than another, so we frequently refer to them this way, but it would be a gross oversimplification to assume that any modern processor fits neatly into one of these types. What follows is a slightly more modern, more open-ended reading of the taxonomy.


SISD: Single Instruction Single Data


The simplest form of CPU fits into this category. Each cycle, the CPU fetches an instruction and a data element and processes them in order to modify a global state. This concept is fundamental to computer science, so most programming languages compile to a set of instructions targeting this architecture. Most modern CPUs also emulate SISD operation, even though very different concepts may be at work in the software and the hardware.
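As an illustrative sketch (the tiny instruction set here is invented, not any real ISA), an SISD machine can be modeled as a loop that consumes one instruction and one operand per cycle and updates a single global state:

```python
# Minimal SISD-style interpreter: one instruction and one data element
# per "cycle", each modifying a single global state in program order.
# The two-opcode instruction set is hypothetical, for illustration only.
def run_sisd(program, state=0):
    for op, operand in program:          # fetch one instruction per cycle
        if op == "ADD":
            state += operand             # update the single global state
        elif op == "MUL":
            state *= operand
        else:
            raise ValueError(f"unknown op: {op}")
    return state

result = run_sisd([("ADD", 2), ("MUL", 3), ("ADD", 4)])  # ((0+2)*3)+4 = 10
```

The essential property is that every instruction must complete, in order, before the next one begins.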


SIMD: Single Instruction Multiple Data


The simplest SIMD architecture is a vector processor: similar to a SISD architecture but with wider data types, so that each instruction operates on multiple consecutive data elements. Slightly more complex is thread-level parallelism, where a single instruction operates on multiple thread states, which is the more common programming model.
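A minimal sketch of the difference, using NumPy (assumed available) to stand in for vector hardware: the scalar version issues one add per element, while the vector version expresses the whole computation as a single wide operation:

```python
import numpy as np

# Scalar (SISD-style): one add per element, in sequence.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
scalar_sum = [x + y for x, y in zip(a, b)]

# Vector (SIMD-style): one expression covers all elements at once;
# NumPy dispatches it to vectorized machine code under the hood.
va = np.array(a)
vb = np.array(b)
vector_sum = va + vb

assert np.allclose(vector_sum, scalar_sum)
```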


MISD: Multiple Instruction Single Data


There is no universal consensus on what qualifies as MISD, so I will take some liberty here. Consider an architecture that can execute multiple arbitrary instructions in sequence on a single data input within a single cycle. This essentially requires multiplexing each output to the next input without storing intermediate results. Later, we will see the advantages of this kind of architecture.
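A rough sketch of this idea in Python: several arbitrary stages applied back-to-back to one data element, with each output wired directly into the next input instead of being written back to a register file (the stages themselves are arbitrary placeholders):

```python
# MISD-style fused pipeline sketch: multiple operations chained on a
# single data input, with no intermediate results stored in registers.
def fused_pipeline(stages, x):
    for stage in stages:
        x = stage(x)      # output forwarded straight into the next stage
    return x

stages = [lambda v: v + 1, lambda v: v * v, lambda v: v - 3]
y = fused_pipeline(stages, 4)   # ((4 + 1) ** 2) - 3 = 22
```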


MIMD: Multiple Instruction Multiple Data


Again, taking some liberty, I would say that a Very Long Instruction Word (VLIW) processor fits this category best. The purpose of such a processor is to expose a programming model that maps more precisely onto the processor's available resources. A VLIW instruction can issue work to all execution units simultaneously, which offers a large performance advantage through instruction-level parallelism (ILP), but it requires a compiler that understands the architecture and performs all of the scheduling optimizations. In practice, this has proven challenging.
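The idea can be sketched as a toy interpreter (the instruction-word format and register names are invented for illustration): each long instruction word carries one operation per functional unit, all slots read the old register values, and all results are written back at the end of the same cycle:

```python
# Toy VLIW sketch: each "long instruction word" holds one operation per
# functional unit, and all slots issue in the same cycle.
def run_vliw(program, regs):
    for word in program:                       # one cycle per instruction word
        results = {}
        for dst, fn, srcs in word:             # every slot reads OLD values
            results[dst] = fn(*(regs[s] for s in srcs))
        regs.update(results)                   # write back at end of cycle
    return regs

regs = {"r0": 3, "r1": 4, "r2": 0, "r3": 0}
program = [
    # slot 1: adder unit, slot 2: multiplier unit -- issued the same cycle
    [("r2", lambda a, b: a + b, ("r0", "r1")),
     ("r3", lambda a, b: a * b, ("r0", "r1"))],
]
run_vliw(program, regs)   # r2 = 3+4, r3 = 3*4
```

The hard part in real VLIW machines is exactly what this sketch glosses over: the compiler must find independent operations to fill every slot, every cycle.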


Scalar (CPUs): Mixed performance


The modern CPU is a very complex system designed to do a wide variety of tasks very well, and it has elements covering every category of Flynn's taxonomy. You can certainly program it as a SISD machine, and it will produce output as if the program had been executed in the order you wrote it. Yet each CISC instruction is typically translated into a chain of multiple RISC instructions executed on a single data element (MISD). The CPU also looks ahead at the instructions and data you give it and schedules independent work in parallel across many different execution units (MIMD). There are also many operations, such as those in the AVX instruction sets, that perform the same calculation on many parallel, aligned data elements (SIMD). Furthermore, with multiple cores running in parallel and multiple threads sharing the resources of a single core, almost every type of parallelism in Flynn's taxonomy can be achieved.


Code Optimizer


If a CPU ran in simple SISD mode, fetching each instruction and data element from memory one at a time, it would be very slow no matter how high its frequency. In modern processors, only a relatively small portion of the die area is dedicated to actually performing arithmetic and logic; the rest is dedicated to predicting what the program will do next and arranging instructions and data for efficient execution without violating any causal constraints. Perhaps the most relevant comparison between CPUs and other architectures is the handling of conditional branches: instead of waiting for a branch to resolve, the CPU predicts which direction it will take and then fully restores the processor state if the guess turns out wrong. Hundreds of tricks like this are etched into the silicon and tuned against a wide variety of workloads, providing a huge advantage when executing highly complex, arbitrary code.


Moore's Law Philosophy


In my first job, I was assigned to integrate a very expensive ASIC that was deemed necessary to decode satellite imagery in real time. I noticed that the design was a few years old, and some back-of-the-envelope calculations told me I could get nearly the same computing power from an Intel processor. I wrote the algorithm in C and demonstrated the system on a Pentium III CPU before the ASICs were even available. At that time, Dennard scaling was moving so fast that, for a short period, performance gains on general-purpose processors outstripped the need for specialized ones. Probably the biggest advantage of choosing a general-purpose processor is that it is easy to program, which makes it the preferred platform for algorithm development and system integration. It is possible to optimize an algorithm for a more specialized processor, but CPUs already do much of that work for you. In my particular case, the first version of the satellite used Reed-Solomon codes, but future designs were considering Turbo codes. Downlink sites that used the ASIC would have had to replace the entire system; our sites needed only a simple software update and routine CPU upgrades. So you can spend your time optimizing your code, or you can spend it innovating in your application. The corollary of Moore's Law is that, soon enough, it will be fast enough.


Vector (GPU and TPU): Simple and Parallel


In many ways, a vector processor is the simplest modern architecture: a very limited computational unit duplicated many times across the chip to perform the same operation on large amounts of data. These processors first became popular for graphics, hence the term GPU. In general, GPUs lack the predictive gymnastics that CPUs use to optimize complex arbitrary code, and they have a limited instruction set that supports only certain types of computation. Most advances in GPU performance have come from fundamental technology scaling of density, area, frequency, and memory bandwidth.


GPGPU


There has been a recent trend toward extending GPU instruction sets to support general-purpose computing. These GP instructions must be adapted to run on a SIMD architecture, which exposes advantages and disadvantages depending on the algorithm. Many algorithms programmed as a repetitive loop on the CPU are really just performing the same operation on each adjacent element of an array on every iteration. With some programmer effort, they can be parallelized, sometimes massively, on the GPU.
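A minimal sketch of that transformation, with NumPy (assumed available) standing in for the GPU: the per-element loop and the single data-parallel expression compute the same thing, but the latter maps directly onto SIMD lanes:

```python
import numpy as np

# A CPU-style loop applying the same operation to every element...
data = [0.5, 1.5, 2.5, 3.5]
looped = []
for x in data:
    looped.append(2.0 * x + 1.0)

# ...is really one data-parallel operation: every element can be
# computed by its own lane, with no dependence between iterations.
arr = np.array(data)
parallel = 2.0 * arr + 1.0

assert np.allclose(parallel, looped)
```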


It is worth noting that if any condition is applied to any element, then all branches must be executed for all elements. For deeply nested conditional code, this can mean an exponential increase in computation relative to a CPU. GPUs have very wide memory buses, which give excellent performance for streaming data; but if memory accesses are not aligned with the vector processing elements, each data element requires a separate request on the memory bus, whereas CPUs have sophisticated predictive caching mechanisms that largely compensate for this.
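The branching cost can be sketched with a predication-style select, using NumPy (assumed available): to evaluate a per-element `if`, a SIMD machine computes both branches for every element and then chooses with a mask, which `np.where` models here:

```python
import numpy as np

# Per-element `if` on SIMD hardware: BOTH branches are evaluated for
# every element, then a mask selects which result each lane keeps.
x = np.array([-2.0, -1.0, 1.0, 2.0])
pos_branch = x * 10.0        # computed for all elements, needed or not
neg_branch = -x              # also computed for all elements
result = np.where(x > 0, pos_branch, neg_branch)
```

With nested conditions, each level of nesting multiplies the branches that must all be evaluated, which is the blow-up described above.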


GPU memory itself is very fast but also very small, and it relies on data transfers over the PCIe bus. In general, GPGPU algorithm development is much more difficult than CPU development. However, this challenge has been partly addressed by discovering and optimizing efficient parallel algorithms that achieve the same results with uniform execution branches and aligned memory accesses. Such algorithms are typically less efficient in raw operation counts but execute faster on a parallel architecture.


AI Operation


Many popular algorithms in AI are based on linear algebra, and the massive scaling of parameter matrices has enabled great advances in the field. The parallelism of GPUs allows massive speedups of the most basic linear algebra, which suits AI researchers well as long as they stay within the confines of dense linear algebra, on matrices large enough to occupy most of the processing elements yet small enough to fit in the GPU's memory. The speedups are so great that much of the progress in deep learning to date has been made within exactly these constraints.
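The workhorse operation behind most of this is the dense matrix multiply. As a small NumPy sketch (shapes chosen arbitrarily for illustration), a fully connected layer is just y = xW + b, one large GPU-friendly operation per layer:

```python
import numpy as np

# A dense (fully connected) layer is a single matrix multiply plus a
# bias add; real networks just scale these dimensions up enormously.
rng = np.random.default_rng(0)
batch, d_in, d_out = 4, 8, 3
x = rng.standard_normal((batch, d_in))    # a batch of input vectors
W = rng.standard_normal((d_in, d_out))    # the layer's parameter matrix
b = rng.standard_normal(d_out)            # the layer's bias vector

y = x @ W + b                             # one dense, parallel operation
```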


The two main drivers of modern GPU development for AI are tensor compute units, which perform a full matrix operation in a single cycle, and multi-GPU interconnects for processing larger networks.


Today, we see a much larger divide between hardware architectures dedicated to graphics and hardware designed for AI. The simplest divergence is precision, with AI developing techniques based on low-precision floating-point and integer operations. More subtle are the shortcuts graphics processors take to render convincing approximations of complex scenes in real time, often using very specialized compute units. The similarities between the two kinds of architecture thus end at the highest levels of optimization.


Systolic Arrays


An ASIC or FPGA can be designed for any type of computing architecture, but here we focus on a specific class of architecture that differs somewhat from the other choices and is relevant to artificial intelligence. In a register-based architecture such as a CPU or GPU, each clock cycle loads a data element from a register, moves it to a processing element, waits for the operation to complete, and then stores the result back in a register for the next operation. In a spatial dataflow architecture, the operations are physically connected on the processor, so the next operation is performed as soon as a result is computed, without storing it in a register. When moderately complex units, each holding its own state in registers local to the processing element, are linked together this way, we call the result a "systolic array".
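A cycle-level sketch of an output-stationary systolic array in plain Python may make this concrete (the skewed input schedule is the standard textbook formulation, not any particular chip): values of A flow rightward, values of B flow downward, and each processing element keeps a local accumulator instead of writing intermediates back to a shared register file:

```python
# Cycle-level simulation of an n x n output-stationary systolic array
# computing C = A @ B. PE(i, j) accumulates C[i][j] locally; A values
# flow left-to-right and B values top-to-bottom, one PE per cycle.
def systolic_matmul(A, B):
    n = len(A)
    acc = [[0] * n for _ in range(n)]      # local accumulator in each PE
    a_reg = [[0] * n for _ in range(n)]    # A values flowing rightward
    b_reg = [[0] * n for _ in range(n)]    # B values flowing downward
    # Inputs are skewed: row i of A enters i cycles late, col j of B
    # enters j cycles late, so matching operands meet at PE(i, j).
    for t in range(3 * n - 2):             # cycles until the array drains
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                # Edge PEs read the skewed input streams; interior PEs
                # read their neighbor's value from the previous cycle.
                a_in = a_reg[i][j - 1] if j > 0 else (
                    A[i][t - i] if 0 <= t - i < n else 0)
                b_in = b_reg[i - 1][j] if i > 0 else (
                    B[t - j][j] if 0 <= t - j < n else 0)
                acc[i][j] += a_in * b_in   # multiply-accumulate in place
                new_a[i][j] = a_in         # latch for the neighbor to the right
                new_b[i][j] = b_in         # latch for the neighbor below
        a_reg, b_reg = new_a, new_b
    return acc

C = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])  # [[19, 22], [43, 50]]
```

Note that no partial product ever leaves its processing element: the only data movement is the nearest-neighbor flow of A and B values, which is what makes the layout so efficient for dense matrix multiplication.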
