Leverage efficient programming techniques to take advantage of multi-core architectures

In the entire embedded field, "more cores" has become a design trend. Some hardware architectures can provide dozens of cores, and some architectures even have thousands of cores. However, multi-core design still has many challenges in software, and it is not easy to port applications between different architectures.

At the low end of the embedded space, single-core solutions still exist, and a system can still move up the capability and performance curve by using faster or higher-bandwidth processors. At the high end, multi-core is the inevitable direction of development, which is why double-precision floating-point workloads now thrive on supercomputers. Desktop and rack-mount systems, such as Nvidia's products, are democratizing this processing power.

Another issue that is often mentioned when discussing software and multi-core architectures is virtualization. Not all multi-core platforms support virtualization, but virtualization does provide better opportunities. Although virtualization makes hardware design more challenging, it generally simplifies software and application management.

SMP Server

Xeon Nehalem-EX is the top 8-core symmetric multiprocessing (SMP) platform offered by Intel. Multi-chip solutions such as 8-chip, 64-core systems usually use high-speed QuickPath point-to-point interconnect technology to link processors and peripheral controllers together (Figure 1). Engineers who have used AMD Opteron processors with HyperTransport links are very familiar with this architecture. In both cases, the simplest configuration is a single processor linked to a single peripheral controller via a single high-speed link.

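In addition to providing a distributed memory subsystem, Intel and AMD implement cache-coherent non-uniform memory access (ccNUMA). Each processor chip has its own memory controller as well as L1, L2, and L3 caches. Any chip can use the high-speed links to access memory attached to any other chip. Naturally, the farther the data is from the requester, the longer the access takes. These high-speed links are also used in consumer devices, but there only a single link to the I/O hub is needed. Servers, by contrast, generate significant traffic between processor chips when memory is shared. Chip-to-chip traffic and cache management are the keys to efficient operation.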

HT Assist is an important feature of AMD's latest Istanbul Opteron processor, which optimizes the memory request and response process to minimize the number of related transactions, thereby releasing a large amount of bandwidth for processing other services (Figure 2). HT Assist actually tracks the movement of data between cores and caches, allowing requests to be serviced by the nearest core that has the required data.

The worst case is that the data must be fetched from off-chip memory by the chip that owns that memory range; the best case is that the data is found in the cache of the chip running the thread that needs it; in between, the core gets the data from the cache of an adjacent chip. Virtualization and caching make the system more complex and make data latency harder to predict. This can be a problem in deterministic embedded applications, but not in most server applications, where speed matters more than fine-grained determinism.
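The three cases above can be folded into a rough average-latency estimate. The following is a minimal sketch only: the latency constants are illustrative placeholders rather than measurements of any Xeon or Opteron part, and the function name `average_latency_ns` is hypothetical.

```python
# Illustrative latencies in nanoseconds (assumed values, not vendor data).
LOCAL_CACHE_NS = 10      # best case: cache of the chip running the thread
REMOTE_CACHE_NS = 60     # middle case: cache of an adjacent chip
OFF_CHIP_MEMORY_NS = 150 # worst case: data fetched from off-chip memory


def average_latency_ns(p_local_cache, p_remote_cache):
    """Return the expected access latency for a mix of the three cases.

    p_local_cache and p_remote_cache are the fractions of accesses served
    by the local cache and by a neighbouring chip's cache. Whatever is
    left over is assumed to go to off-chip memory.
    """
    if p_local_cache < 0 or p_remote_cache < 0:
        raise ValueError("fractions must be non-negative")
    if p_local_cache + p_remote_cache > 1:
        raise ValueError("fractions must sum to at most 1")
    p_memory = 1.0 - p_local_cache - p_remote_cache
    return (p_local_cache * LOCAL_CACHE_NS
            + p_remote_cache * REMOTE_CACHE_NS
            + p_memory * OFF_CHIP_MEMORY_NS)
```

Even with made-up numbers, the model shows why keeping a thread's data in its own chip's cache dominates the result: every access that spills to another chip or to memory is weighted by a much larger constant.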

Programmers are using these platforms today because they greatly simplify programming tasks. Applications can also use more and more cores, provided that they can keep enough threads busy. Even so, using multi-core systems efficiently is not as easy as it may seem: cache sizes and the locality of reference within the application's working data set affect how well a particular algorithm will perform.
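One common way to reason about how far extra cores can help is Amdahl's law. The article does not name it; it is brought in here purely as an illustration, and it ignores the cache and locality effects mentioned above. The function name is hypothetical.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of the work runs in parallel.

    parallel_fraction is the share of run time (0.0 to 1.0) that can be
    spread across cores; the remainder stays serial.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)
```

Under this model, an application that is half serial cannot reach a 2x speedup no matter how many cores are added, which is one reason thread count alone does not guarantee good use of a multi-core system.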

AMP Application Processor

Symmetric multiprocessing (SMP) architectures are very useful for many embedded applications, but asymmetric multiprocessing (AMP) also has its place. AMP configurations can be seen in many places, from TI's OMAP (Open Multimedia Application Platform) to Freescale's P4080 QorIQ (Figure 3).

TI's OMAP 44xx platform integrates ARM Cortex-A9, PowerVR SGX 540 GPU, C64x DSP and image signal processor. Each core has a dedicated function, and the communication between processors is not symmetrical. OMAP only works in AMP mode, while the core of P4080 is an SMP system, but it can also divide the core into AMP mode. The 8-core chip can run like 8 independent cores, and can also be used together in many configurations (such as a pair of dual-core SMP subsystems, or four single-core subsystems).

The main difference between OMAP and P4080 at a high level is that OMAP functionality is fixed and the cores are optimized for their respective tasks. This will make programming much easier as applications can be partitioned to specific cores based on matching functionality.

The performance level of each subsystem is limited by the architecture, but the P4080 can adjust the partitioning scheme, although the partitioning is usually done at system startup. System designers can adjust the allocation of cores in the P4080, provided that there are enough cores. There are also QorIQ platforms on the market with fewer cores, so more economical chips can be selected.

IBM's Cell processor fills the gap in the middle. It combines a 64-bit Power core with 8 Synergistic Processing Elements (SPEs). All SPEs are identical (each has 256KB of local memory) and they work in isolation, unlike the shared-memory SMP systems discussed above. The SPEs have no cache and no support for virtual memory.

This approach has both advantages and disadvantages for hardware and software design. The advantage is that it simplifies the hardware implementation, but it complicates the software from many perspectives. For example, memory management is controlled by the application, as is communication between cores. Data must be moved into the local memory of the SPE before it can be operated on. It takes time to fully develop architectures such as Cell because they are different from more traditional SMP or AMP platforms. The software improvements made over the years on Cell-based platforms such as Sony's PlayStation 3 highlight the changes in programming techniques and experience.
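The requirement to stage data into local memory before operating on it leads to a block-at-a-time pattern. The sketch below models that pattern in plain Python; it is not Cell SDK code. The 256KB figure comes from the text, while the 4-byte item size, the decision to use only a quarter of the local memory as a data buffer, and all names are assumptions made for illustration.

```python
LOCAL_STORE_BYTES = 256 * 1024  # SPE local memory size given in the text
ITEM_BYTES = 4                  # assumed: 32-bit data items


def process_in_blocks(data, kernel, buffer_bytes=LOCAL_STORE_BYTES // 4):
    """Apply kernel to every item, one local-memory-sized block at a time.

    Each loop iteration stands in for: copy a block into local memory,
    compute on the local copy, copy the results back out.
    """
    items_per_block = buffer_bytes // ITEM_BYTES
    if items_per_block < 1:
        raise ValueError("buffer too small to hold a single item")
    results = []
    for start in range(0, len(data), items_per_block):
        local_buffer = list(data[start:start + items_per_block])  # copy in
        local_buffer = [kernel(x) for x in local_buffer]          # compute
        results.extend(local_buffer)                              # copy out
    return results
```

The application, not the hardware, decides the block size and when each transfer happens, which is the software burden the paragraph above describes.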

Specialized processors such as GPUs

Changing programming techniques is key to success with graphics processing units (GPUs). GPUs from companies such as ATI and Nvidia have hundreds of cores in a single chip, and these GPUs can be combined into multi-chip solutions to provide developers with thousands of cores. For example, four Nvidia Tesla T10s integrated into a 1U chassis can provide 960 cores (Figure 4).

Programming a Tesla or any other compatible Nvidia GPU chip is challenging, but architectures like Nvidia's CUDA or runtimes based on CUDA can make life easier. Part of the challenge comes from the single instruction, multiple thread (SIMT) architecture of Nvidia GPUs. Like many high-performance systems, these GPUs like to process arrays of data. This is a good choice for many applications, but not all, which is one of the reasons why GPUs are often paired with multi-core CPUs.

CUDA and another parallel programming framework, OpenCL (Open Computing Language), both match the GPU approach of using memory that is separate from the main processor's. This means that data must be moved from one place to another before it can be operated on. Both extend the C programming language but also restrict it. For example, recursion and function pointers are not supported. Some of these limitations derive from the SIMT approach.

Many applications use CUDA, but the performance gain over traditional SMP platforms varies widely, from 2x to 100x. The reason for this variation is that threads are most efficient when running in groups of 32. Branching does not affect performance, provided that the 32-thread group is within the same branch.
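The effect of branching on 32-thread groups can be shown with a simplified counting model. This is a sketch of the idea for a single two-way branch, written in plain Python rather than CUDA; it does not model real scheduling details, and the function name is hypothetical.

```python
GROUP_SIZE = 32  # threads that execute together, per the text


def branch_passes_per_group(branch_taken):
    """Count how many branch paths each 32-thread group has to execute.

    branch_taken holds one boolean per thread (True if the thread takes
    the branch). A group whose threads all agree executes one path. A
    group with mixed outcomes executes both paths, one after the other.
    """
    passes = []
    for start in range(0, len(branch_taken), GROUP_SIZE):
        group = branch_taken[start:start + GROUP_SIZE]
        passes.append(len(set(group)))
    return passes
```

Arranging data so that neighbouring threads take the same path keeps every group at one pass; alternating outcomes thread by thread doubles the work for that branch in every group.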

Specialized processors such as GPUs offer both graphics and multi-core processing. Another approach is to use many traditional cores, such as Intel's Larrabee (Figure 5). Larrabee uses x86-compatible cores that are optimized for vector processing.

In some ways, Larrabee is similar to IBM's Cell processor. The Larrabee core has only 32KB of L1 cache and 256KB of L2 cache to access. If the data is not in the cache, it must be requested from the memory controller or another cache in the system, and then the data is placed in the core's cache and then processed by the application.

A ring bus is used for communication between the cores and the controllers. IBM's Cell Element Interconnect Bus (EIB) is also a ring bus, connecting the SPEs, the memory controller, and the peripheral interface. From a programming perspective, there is a big difference between Larrabee's cache and Cell's SRAM. Indeed, to programmers, Larrabee looks more like a set of coherently cached x86 processors. Due to its GPU positioning, programmers can take full advantage of its support for DirectX and OpenGL.

Multi-core networking

Multi-core chips are also common components in network infrastructure. Handling 10Gbps networks is a big challenge for multi-core chips. Analyzing and processing data from line-speed network connections requires a lot of processing resources.

Netronome's NFP-3200 network traffic processor contains 40 1.4GHz cores, each of which can run 8 threads, for a total of 320 hardware-based threads per chip. This is the same order of magnitude as a GPU, but these processors are mainly used for packet processing.
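A back-of-the-envelope budget shows why line-rate processing is demanding. The core count, clock rate, and thread count below come from the text. The packet size is an assumption: minimum-size 64-byte Ethernet frames plus 20 bytes of preamble and inter-frame gap on the wire. The result ignores memory stalls and any work offloaded to other units, so treat it as an upper bound rather than a performance figure.

```python
LINK_BPS = 10e9        # 10Gbps line rate
CORES = 40             # NFP-3200 cores, from the text
CORE_HZ = 1.4e9        # core clock, from the text
THREADS_PER_CORE = 8   # hardware threads per core, from the text

# Assumed worst case: 64-byte frame + 8-byte preamble + 12-byte gap.
WIRE_BYTES_PER_PACKET = 64 + 20


def packets_per_second(link_bps=LINK_BPS, wire_bytes=WIRE_BYTES_PER_PACKET):
    """Packets arriving per second on a fully loaded link."""
    return link_bps / (wire_bytes * 8)


def cycles_per_packet(cores=CORES, core_hz=CORE_HZ):
    """Total core cycles available, chip-wide, for each arriving packet."""
    return cores * core_hz / packets_per_second()


HARDWARE_THREADS = CORES * THREADS_PER_CORE
```

Under these assumptions the link delivers about 14.9 million packets per second, leaving roughly 3,760 core cycles per packet across the whole chip.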

Like IBM's Cell, the NFP-3200 has a main CPU-type controller, in this case an ARM11 core. The 40 cores of the NFP-3200, also called microengines, are compatible with Intel's IXP28xx architecture, which is mainly used for network processing. This compatibility is important because a lot of code has been developed for this architecture. Older chips have fewer cores, so in a sense the NFP-3200 provides the same solution with more cores.

Of course, simply throwing more cores at the problem is only one way to go. Netronome has also made a number of improvements, such as enhanced micromodules that support TCP task offload. The interconnect is faster as well, running at up to 44Gbps between cores.

The Netronome chip also has a number of specialized processors, including encryption engines for handling various security protocols. Its PCI Express interface supports the I/O virtualization functions commonly used with x86 processors, so an x86 host can sit next to the NFP-3200 instead of being separated from it by another network link.

Compared with other multicore chips, programming the NFP-3200 is generally not a big problem because there is a lot of ready-made code for the IXP28xx series. In addition, Netronome provides libraries, which makes the creation of network processing applications more like stacking modules.

Cavium's Octeon II is a more traditional SMP multicore design with two to six 64-bit MIPS64 cores connected by a crossbar switch. Like the Netronome chip, the Octeon II is designed for networking and storage appliances.

The Octeon II also has a RAID 5/6 accelerator and a regular-expression hyper finite automata (HFA) engine for packet inspection. Programming the Octeon II is similar to programming most SMP systems. The Octeon II can run operating systems such as Linux.

Other multi-core architectures

Adopting a more radical multi-core architecture increases the programming effort, but it can open up opportunities for developers to take advantage of new architectures. IntellaSys's SeaFORTH 40C18 is one such example (Figure 6). Its native programming language is VentureForth. Instructions are 5 bits long, and four of them can be packed into a single 18-bit word (the last of the four slots is only 3 bits long). The 40C18 has 40 identical cores, each with 64 words of RAM and 64 words of ROM.
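The arithmetic behind the packed word is 5 + 5 + 5 + 3 = 18 bits. The sketch below packs and unpacks four slot values with those widths. It is not taken from the 40C18 documentation: the bit order (first slot in the top bits), the function names, and the treatment of the short slot as a plain 3-bit value are assumptions for illustration.

```python
SLOT_WIDTHS = (5, 5, 5, 3)  # three full slots plus one short slot = 18 bits


def pack_word(slots):
    """Pack four slot values into one 18-bit word, first slot highest."""
    if len(slots) != len(SLOT_WIDTHS):
        raise ValueError("expected exactly four slot values")
    word = 0
    for value, width in zip(slots, SLOT_WIDTHS):
        if not 0 <= value < (1 << width):
            raise ValueError("value does not fit in its slot")
        word = (word << width) | value
    return word


def unpack_word(word):
    """Split an 18-bit word back into its four slot values."""
    slots = []
    for width in reversed(SLOT_WIDTHS):
        slots.append(word & ((1 << width) - 1))
        word >>= width
    return list(reversed(slots))
```

Because the last slot has only 3 bits, it can represent at most 8 distinct values, so only a subset of instructions can occupy that position.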

There is obviously a big difference in programming the 40C18 compared to chips with more memory, such as Intel's Larrabee or IBM's Cell. The 40C18 core consumes less than 9mW, while the other two chips cannot function properly without a large heat sink or fan. The 40C18 is designed for embedded and even mobile applications.

Programming the 40C18 will be a different experience for most developers, and not just because Forth is the programming language. The small memory capacity of each core and the matrix interconnect change the programming approach. Cores often run small functions that pass data to one or more adjacent cores, so cooperative programming will be the trend.
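The style of small functions handing data to a neighbour can be modelled as a chain of stages. The sketch below uses Python generators to stand in for cores; it runs on one thread and says nothing about the 40C18's actual ports or timing. All names are hypothetical.

```python
def core(stage_fn, upstream):
    """Model one core: apply a small function to each word from a neighbour."""
    for word in upstream:
        yield stage_fn(word)


def build_chain(stage_fns, source):
    """Connect cores in a line so each one feeds the next."""
    stream = iter(source)
    for stage_fn in stage_fns:
        stream = core(stage_fn, stream)
    return stream
```

For example, `list(build_chain([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]))` pushes each word through an add stage and then a doubling stage. Each stage holds only one word at a time, which mirrors how little memory each core has to work with.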

Even external memory accesses require three cores to work together, which works well when there are many cores that can work together. The 40C18 also has a unique ability to send small programs consisting of 4 instructions as a single word to an adjacent core for execution, thus having enough space to perform block transfers.

XMOS's XS1-G4 is an interesting hybrid based on 32-bit integer Xcores. Each Xcore can handle a number of different threads, while there is a hardware-based event system to help XMOS's soft peripherals. Like the 40C18, the XS1-G4 can wait on I/O ports. The difference is that the XS1-G4 handles multiple threads, while the IntellaSys chip handles a single thread.

Developers can use XC, an extended version of the C language, to get the most out of the XMOS hardware. The C language extension provides a quick path to hardware support, including Xlinks. Xlinks connect the four cores in the chip and provide four off-chip links, so multiple chips can be connected. A switch is used inside the chip for the Xlink connection, but the hardware and software provide a unified interface for inter-processor communication.

Each core has 64KB of memory, which is more than the 40C18, but less than some of the higher-performance chips mentioned in this article. Again, this is more than enough for most application code, and allows programming using a more traditional threading approach. Most programming for XMOS chips will probably be done in traditional C or C++, rather than XC, which is more geared toward communications and peripheral handling.

The XS1-G4 won't challenge double-precision floating-point GPUs or other high-end systems, but its integer and fixed-point DSP support makes it suitable for many other audio and video processing functions. The linked XMOS chips are already used internally to drive several large-screen LCDs.

Multicore architectures will continue to grow at a rapid pace. Programming these cores efficiently and choosing the right products is not easy, but it will become more and more common, even for embedded developers. Legacy applications will continue to be ported to architectures that match existing hosts. When applications are redesigned or created from scratch, perhaps a better solution will emerge.
