SoC chip selection for intelligent driving domain controller

Publisher: BlissfulSpirit | Last updated: 2024-05-11 | Source: 贞光科技


The application of the ACPU in intelligent driving systems is not limited to deploying software modules. As NN (neural network) computing power grows, the ACPU must handle more sensor data, higher-resolution camera images, and more complex scenes and functions, so its own computing power keeps improving as well. Today the ACPU can support more preprocessing of high-resolution sensor data, pre- and post-processing for deep learning models, more complex perception-fusion functions, and tasks such as trajectory prediction and behavior planning. These functions depend on the ACPU's strong computing power and efficient processing speed. In addition, the ACPU ecosystem provides development tools such as functional-safety-certified libc and STL libraries, which greatly ease upper-level software development: they improve development efficiency while helping ensure software safety and reliability.


In summary, the selection of ACPU needs to focus on computing power, while also paying attention to the functional safety level of the device and operating system. In addition, the computing power of the ACPU should match the computing power of the NN to achieve optimal system performance.


2. Parallel Computing


2.1. DSP

A DSP (digital signal processor) chip is a microprocessor with a specialized architecture. Compared with a general-purpose CPU, it is better suited to computation-intensive processing.


Internally, a DSP chip usually adopts a Harvard architecture with separate program and data memories, makes extensive use of pipelining, contains a dedicated hardware multiplier, and provides special DSP instructions for quickly implementing a variety of digital signal processing algorithms.


DSP chips generally have the following main features:


  • Separate program and data spaces, allowing simultaneous access to instructions and data;


  • The chip has fast RAM, which is usually connected through a separate data bus;


  • There is a dedicated hardware multiplier that can complete one multiplication and one addition in one instruction cycle;


  • Hardware support for loops and jumps with low or no overhead;


  • Multiple hardware address generators that can operate in a single clock cycle;


  • Fast interrupt handling and hardware I/O support;


  • Pipelined operation, so that fetching, decoding, and executing of different instructions proceed in parallel.
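The dedicated hardware multiplier and low-overhead loops listed above exist precisely for inner loops like an FIR filter, whose core is one multiply-accumulate per tap. A minimal Python sketch of that pattern (illustrative only; real DSP code would use the vendor's intrinsics and fixed-point types):

```python
def fir_filter(x, coeffs):
    """Direct-form FIR filter: each output sample is a sum of
    multiply-accumulates over the most recent len(coeffs) inputs.
    The inner `acc += coeffs[k] * x[n - k]` is exactly the MAC that
    a DSP's hardware multiplier completes in one instruction cycle."""
    n_taps = len(coeffs)
    y = []
    for n in range(len(x)):
        acc = 0.0  # accumulator register
        for k in range(n_taps):
            if n - k >= 0:
                acc += coeffs[k] * x[n - k]  # one MAC per tap
        y.append(acc)
    return y
```

On a DSP, the tap loop would run with zero loop overhead and one MAC per cycle, which is why MAC count per cycle dominates the evaluation criteria below.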


Compared with general-purpose microprocessors, DSP chips are relatively weak at other general-purpose functions. A typical DSP structure is described below.

The DSP connects to external data memory through independent instruction and data buses, and is usually equipped with L1 and L2 caches to improve data access efficiency.


The internal structure is mainly divided into the program control unit (PCU), the address generation unit (AGU), and the data arithmetic and logic unit (DALU), plus a set of address registers and data registers. Each processing unit is an independent hardware module, and the modules work in parallel through the instruction pipeline to improve the DSP's processing power.


In DSP evaluation, computing speed is one of the most important performance indicators of a DSP chip; the following aspects are usually considered:


  • Data bit width and length;


  • The number of multiplications and accumulations in a single cycle;


  • Number of registers;


  • The number of instructions that can be processed simultaneously in a single cycle;


  • Richness of intrinsic (inline) instructions;


  • Peripheral SRAM size;


As DSPs are applied to imaging, audio, and machine learning, chip manufacturers have adapted them to these new scenarios. For example, TI's C71 DSP, in addition to the usual scalar and vector operations, adds a matrix multiplication accelerator (MMA), which further strengthens the DSP's special-purpose capabilities and makes it easier for developers to deploy NN models.
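To see why a matrix accelerator helps NN workloads, note that a naive matrix multiply is nothing but nested MACs. This illustrative Python sketch (not vendor code) computes the product and counts the MACs an MMA-style unit would offload:

```python
def naive_matmul(A, B):
    """C[i][j] = sum over k of A[i][k] * B[k][j].
    An M x K by K x N multiply costs exactly M * K * N
    multiply-accumulates -- the inner-loop work that a matrix
    multiplication accelerator executes in hardware."""
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    macs = 0
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i][k] * B[k][j]  # one MAC
                macs += 1
            C[i][j] = acc
    return C, macs
```

Even a small 2x2 multiply already needs 8 MACs; NN layers multiply matrices thousands of elements on a side, which is why dedicated MAC arrays pay off.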


Well-known DSP chip manufacturers include Texas Instruments and Analog Devices. A number of Chinese DSP chips are also entering the automotive market, including parts from Jinxin Electronics and Zhongke Haoxin: Jinxin Electronics has launched the 32-bit floating-point DSP chip AVP32F335 series, and Zhongke Haoxin is about to launch 32-bit floating-point RISC-V DSP chips such as the HXS320F280039C and HXS320F28379D.


2.2. GPU

A CPU contains many functional modules and is suited to complex computing scenarios: most of its transistors are devoted to control logic and storage, and only a small share to actual computation. A GPU's control logic is comparatively simple and it does not need a large cache; most of its transistors go to compute units, which greatly increases its computing speed and gives it powerful floating-point capability.

(Figure: schematic comparison of CPU and GPU architectures.) Current multi-core CPUs generally have 4 or 6 cores, exposing 8 or 12 hardware threads for computation. An ordinary GPU contains hundreds of cores, and a high-end GPU tens of thousands, which gives it a natural advantage in handling large numbers of repetitive operations and, more importantly, makes it usable for large-scale parallel data processing.


In terms of application, the GPU suits computing scenarios in which successive computation steps are independent of one another. Many compute-heavy problems have this property, such as graphics computation, cryptocurrency mining, and password cracking: the work decomposes into many identical small tasks, each handled by a single GPU core, and the GPU raises throughput by running many such tasks concurrently across its cores. The CPU is better suited to scenarios where successive computation steps are closely coupled and logically dependent.
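A toy Python sketch of that distinction (function names are illustrative, not from any GPU API): an element-wise operation whose iterations are independent maps naturally onto parallel cores, while a running sum, where each step needs the previous result, does not:

```python
# Data-parallel: every iteration is independent of the others, so each
# element could be assigned to its own GPU core and computed concurrently.
def scale_all(xs, k):
    return [k * x for x in xs]

# Sequentially dependent: step n cannot start until step n-1 finishes,
# so extra cores do not help -- this shape favors a fast CPU core.
def running_sum(xs):
    out, acc = [], 0
    for x in xs:
        acc += x  # depends on the previous iteration's accumulator
        out.append(acc)
    return out
```

The first pattern is what GPU programming models expose directly; the second requires algorithmic tricks (e.g. parallel prefix sums) before it parallelizes at all.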


Compared with CPU, GPU has several characteristics:


  • The computing resources are very abundant;


  • The control components occupy a very small area;


  • Large memory bandwidth;


  • Memory latency is high; whereas a CPU hides latency with multi-level caches, a GPU hides it by switching among many threads;


  • GPU processing requires data to be highly aligned;


  • Register resources are extremely rich;


The biggest difference between CPU and GPU is bandwidth. A CPU is like a Ferrari: very fast, but worse than a heavy truck at hauling goods. A GPU is like a heavy truck: slower, but it carries far more per trip. Some cargo can be batched for transport, say items that all come from one place, are the same size, and go to the same destination; that is a compute-intensive task. Other cargo cannot be batched: different destinations, different sizes, requiring multiple trips; that is a control-intensive task.

The CPU spends great effort on caching, branch prediction, and out-of-order execution, and uses a large number of registers to implement these functions, guaranteeing high speed; its clock frequency is generally much higher than a GPU's, so each trip is very fast, but all those registers and control structures take up a lot of die area. Given cost and a basic rule of semiconductors (a single die rarely exceeds about 800 square millimeters, beyond which yield drops rapidly), the number of CPU cores is very limited, and the cargo carried per trip is small.

The GPU does the opposite: it forgoes branch prediction and out-of-order execution, uses fast registers rather than large caches, and keeps each core simple with few transistors, so it can easily reach thousands of cores. Each trip carries a lot of cargo, just not quickly. Relatively speaking, the GPU is therefore better suited to computing tasks with few branches, large data volumes, and simple, repetitive computations.


2.3. Deep learning capability

In a broad sense, any chip that can run artificial intelligence algorithms can be called a deep learning chip. In the usual sense, however, a deep learning chip is one specially designed to accelerate deep learning algorithms.


Generally speaking, deep learning chips use OPS (operations per second) as the unit for theoretical peak deep learning computing power. The physical computing unit behind OPS is the multiply-accumulate operation (MAC), a special microprocessor operation: 1 MAC = 2 OPS (one multiplication plus one addition). The hardware circuit that implements it is called a multiplier-accumulator. The operation adds the product b*c to the value of accumulator a and stores the result back in a: a ← a + b*c.


The theoretical deep learning computing power depends on the computing precision, the number of MACs, and the operating frequency. For accelerators whose fixed-point and floating-point computing units share the same cores, a rough simplification is that the number of usable MACs at FP16 precision is half that at INT8, and at FP32 half again, and so on. For example, with 512 MAC units in the chip and an operating frequency of 1 GHz, the INT8 computing power is 512 * 2 * 1 GHz ≈ 1 TOPS (tera operations per second), the FP16 computing power is about 0.5 TOPS, and FP32 about 0.25 TOPS.
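The arithmetic above can be captured in a small helper. This is a sketch: `peak_tops` and its precision scaling table are assumptions that encode the simplification just described (shared fixed/floating-point cores); real accelerators vary.

```python
def peak_tops(num_macs, freq_ghz, precision="INT8"):
    """Theoretical peak compute: 1 MAC = 2 ops (one multiply + one add).
    Assumes FP16 halves the usable MACs relative to INT8 and FP32
    halves them again, per the shared-core simplification above."""
    scale = {"INT8": 1.0, "FP16": 0.5, "FP32": 0.25}[precision]
    ops_per_cycle = num_macs * 2 * scale
    # ops/cycle * GHz gives giga-ops/s; divide by 1000 for TOPS.
    return ops_per_cycle * freq_ghz / 1000.0
```

For the 512-MAC, 1 GHz example this gives 1.024 TOPS at INT8 (the "1 TOPS" in the text is rounded), 0.512 TOPS at FP16, and 0.256 TOPS at FP32.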
