Choosing the best multi-core architecture for compute-intensive applications

Publisher: rockstar7 | Last updated: 2024-07-26 | Source: eepw

From tiny, highly integrated systems-on-chip to large data centers, the multi-core revolution has spread like wildfire. So when you design your own system, how do you make the most of multi-core technology? Bear in mind that fully exploiting every bit of compute in a multi-core system is not easy.


Today's multi-core processors are more than just multiple processors placed on the same chip. Leading processor vendors have embedded many useful special-purpose features in their products, such as hashing, caching, inter-processor communication, interrupt management, and memory management. Used properly, these features let an AMP architecture run efficiently, but exploiting them requires dedicated optimization in software.


Multi-core processing architectures fall into two basic types: symmetric multiprocessing (SMP) and asymmetric multiprocessing (AMP). SMP treats every processor core equally and does not dedicate any particular core to a specific task; the operating system is fully responsible for distributing and coordinating work evenly across the cores. AMP is the opposite: cores are not treated equally, and specific tasks are assigned to run on specific cores. The benefit is less data shuffling for repetitive work, and therefore higher operating efficiency.

As an example, take a typical multi-core processor such as the Freescale T4240, which has 12 dual-threaded cores arranged in 3 clusters of 4 cores, each cluster sharing 2 MB of L2 cache. As you can probably tell, this is a fairly complicated system. So, should all the cores run a single OS domain that schedules every thread, or should the total compute capacity be divided into multiple independent OS domains, each handling different tasks? Which approach is better depends on the type of application. Is the application safe to parallelize? Is it data-intensive? Whether it can take advantage of the shared L2 cache is likely to be a deciding factor.

A standard CPU with a built-in GPU, such as an Intel Core i7, is another common hardware option. Such a system can run 8 hyper-threads on 4 cores and use the GPU for complex general-purpose computation. For typical compute-intensive applications, a hybrid CPU-GPU heterogeneous architecture adds complexity to the system, but the resulting performance gain is attractive enough to make it well worth trying.

Once we understand how to decompose an application, we have a basis for choosing the development method and language. With a multi-OS architecture, whether SMP or AMP, shared memory is usually used to pass data between OS domains. It is not the only way, but it is a common one: pass a command with some data to a given OS domain, and let an interrupt handler do the corresponding processing. But which API should you use?

There are several options here. The Multicore Association has published the MCAPI (Multicore Communications API) standard, shown in Figure 1. MCAPI is designed for multi-OS environments and can be layered on related specifications such as MRAPI (Multicore Resource API), which manages shared memory as a resource across multiple OS domains.


Figure 1: Basic multi-core software configuration

For this architecture, the other options are similar proprietary APIs. Whichever you choose, you want it to be easy to configure and maintain, so that it remains the best solution over long-term development. One important factor is the overhead of the interface itself. The cores in a system usually share memory, whose transfer speed is far higher than Ethernet. If one of your reasons for splitting the application into multiple OS domains is to prevent cache thrashing (multiple threads reading and writing the same cache line during execution and contending for it), then keeping the interface's own overhead low is especially important.

There are also many options for programming SMP architectures, where a single OS domain contains multiple CPUs of the same architecture. One option is to use the threading model provided by the operating system. In a standard threaded OS environment there are usually several languages and frameworks to choose from, such as OpenMP, OpenCL, and Cilk/Cilk++. Each programming environment has its own syntax; some are simpler, but the level of control they provide varies. Some require extensions beyond typical C syntax, and some do not support all architectures, so check carefully that your chosen language, compiler, and operating system work well together.

If you have the interest and the skill to push your programming to the limit and mobilize every "gate" in the system, consider GPGPU (general-purpose GPU) programming. Then you need to pay attention to three factors: language, drivers, and bandwidth. GPUs are specially designed to manipulate graphics at the pixel level, compute data vectors, and render complex 3D views at high frame rates, so they can perform complex calculations on small data sets very quickly.

Drivers are not a trivial matter for GPGPU and must be well supported by the operating system. Many GPU vendors do not release driver source code, treating it as part of their intellectual property, and they usually provide drivers only for popular operating systems; other operating systems may simply not be supported.

Next, consider the choice of GPGPU language. OpenCL is an open standard from the Khronos Group; CUDA is specific to Nvidia GPUs. Both take a similar approach to parallel programming, but their benchmark results differ, and performance varies across hardware environments. Because OpenCL is an open standard, it runs on most platforms, ships with a compiler, and can target mixed CPU-GPU systems without code modification, which is a notable advantage.

Finally, how much data the remote GPU must process, and over what type of bus it must travel, will also affect your decision. The more data-intensive the application, the closer the GPU should be to the CPU. If the two must communicate over the PCIe bus, bandwidth is shared with peripherals, which can cause a significant performance hit; the closer the GPU sits to the CPU, the smaller that impact.

For consumer electronics in particular, such as wearables, mobile handhelds, digital imaging devices, home gateways, and broadband access devices, an important challenge is processing ever-growing volumes of image, audio, and even human physiological data within a small, low-power operating environment. To develop good multi-core systems quickly for such environments, the choice of development platform is critical.

Wind River recently launched industry profiles for its latest VxWorks 7 real-time operating system. These profiles add a range of valuable capabilities to VxWorks 7, helping customers meet evolving market and technical requirements and seize the new opportunities created by the Internet of Things. For the consumer electronics segment in particular, covering small networked devices such as wearables, mobile handhelds, digital imaging devices, home gateways, and broadband access devices, the profile provides a fast-booting, small-footprint, low-power operating environment, with special emphasis on GPU support and 2D/3D graphical user interfaces, so that multi-core processors can be used to full advantage.

In short, there is no magic wand that turns everything to gold. You must study each architectural choice in depth, hardware, software, language, and compiler alike, to accurately evaluate each part's impact on overall performance and to optimize for your specific algorithm. A once-and-for-all answer does not exist in high-performance computing, at least not yet!


Figure 2: MCAPI is a message-passing application interface whose protocol and semantic specifications define the behavior that any implementation must follow.

