
CXL opens a new era of high-performance computing

Latest update time: 2023-10-17

As the demand for data-processing capability in scientific research and industrial production continues to rise, high-performance computing (HPC) has become an important force driving progress in these fields. In this context, advances in computer technology, especially interconnect technology, are particularly critical. Compute Express Link (CXL), a new generation of high-speed interconnect, shows broad application potential in HPC thanks to its advantages in bandwidth, latency and scalability, and may fundamentally change how data centers and accelerators communicate, driving revolutionary progress in the field.


However, to fully realize the value of CXL, software must keep pace with the hardware. The key is therefore to develop a software framework for the hardware ecosystem that SoC (system-on-chip) designers require. This means building comprehensive CXL-aware software that simplifies application development, ensures full utilization of the hardware, and ultimately achieves system-level optimization.


Three CXL device types unlock new potential for HPC


CXL is mainly composed of three protocols: CXL.io, CXL.cache and CXL.mem. CXL.io is a low-latency alternative to PCIe communication, and the two are interchangeable in many situations; it presents an enhanced, non-coherent load/store interface to I/O devices. CXL.cache gives attached devices the ability to maintain coherent caches of system memory, using a request/response mechanism to create caches for low-latency I/O transactions. CXL.mem works in the opposite direction: it allows the host processor to directly access memory on devices attached via CXL, using a low-latency load/store instruction set.
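To make the CXL.mem path concrete: on Linux, a Type 3 memory expander is typically surfaced to software as a CPU-less NUMA node, after which ordinary pointer dereferences become CXL.mem reads and writes. The following is a minimal sketch under that assumption; the node id 1 is a placeholder (check `numactl --hardware` on the target system), and the CXL.io/CXL.cache paths live in drivers and are not shown.

```c
/* Minimal sketch: touching CXL.mem-attached memory with ordinary
 * load/store instructions. Assumes the Type 3 expander is exposed by
 * the OS as a CPU-less NUMA node; CXL_NODE is a placeholder id.
 * Build: cc demo.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

#define CXL_NODE 1              /* hypothetical node id of the expander */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma: no NUMA support on this system\n");
        return 1;
    }

    size_t len = 64 << 20;      /* 64 MiB */
    /* Bind the allocation to the expander's node; after this, plain
     * pointer accesses travel over CXL.mem. */
    char *buf = numa_alloc_onnode(len, CXL_NODE);
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }

    memset(buf, 0xA5, len);     /* stores go to device-attached memory */
    printf("first byte: 0x%02x\n", (unsigned char)buf[0]);

    numa_free(buf, len);
    return 0;
}
```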


These three protocols can be combined in different ways to support three different types of devices, which for convenience are referred to as Type 1, Type 2 and Type 3 respectively.


Type 1 devices combine the CXL.io and CXL.cache protocols, allowing devices without internal memory, such as smart NICs or accelerators, to directly control and coherently access regions of system memory.

Type 2 devices add device-attached memory on top of Type 1 and use all three protocols, allowing the system or the device to allocate regions in the other's memory, with coherency between the two maintained in hardware. Depending on which side drives the access, the region operates in host-bias or device-bias mode.

Type 3 devices are memory devices supported by the CXL.io and CXL.mem protocols, enabling byte-addressable access to DRAM, NVRAM, and other volatile and persistent memory. This architecture lets the host access added memory, including dedicated memory expanders, as if it were local memory.

The following image shows these three device types.


Figure 1: CXL device types, excerpted from the Compute Express Link Specification, Revision 3.0, Version 1.0
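As an illustration of how these device types surface to software, the sketch below lists whatever the Linux CXL driver stack has registered under /sys/bus/cxl/devices (memory devices, decoders, regions). The path and entry names are kernel-dependent assumptions, so treat this as illustrative rather than authoritative.

```c
/* Sketch: enumerating CXL devices from userspace. Assumes a Linux
 * kernel with the CXL driver stack, which registers endpoints,
 * decoders and regions under /sys/bus/cxl/devices. */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/bus/cxl/devices";
    DIR *d = opendir(path);
    if (!d) {
        perror(path);           /* no CXL driver or no devices present */
        return 1;
    }

    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.')
            continue;           /* skip "." and ".." */
        printf("%s\n", e->d_name);  /* e.g. mem0, decoder0.0, region0 */
    }
    closedir(d);
    return 0;
}
```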


Compared with traditional PCIe transactions, CXL.mem and CXL.cache offer lower latency: typical PCIe 5.0 latency is about 100 nanoseconds, while CXL 2.0 achieves roughly 20-40 nanoseconds. The latency reduction of CXL.mem enables memory expansion through Type 3 devices, meaning application threads can use memory beyond what the local system provides and avoid job failures caused by insufficient memory. CXL 2.0 extends this paradigm with support for switch-attached memory, a feature known as "memory pooling". This not only allows a system to add memory, but also lets it offer unused local memory to other systems, increasing utilization and reducing initial system cost.
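Latency figures like these can be checked directly on a system with an expander installed. A dependent pointer chase defeats the hardware prefetcher, so every hop pays one full memory round trip; the sketch below compares local DRAM against the expander, again assuming the expander appears as NUMA node 1 (a placeholder id).

```c
/* Sketch: measuring the load-latency gap between local DRAM and a
 * CXL memory expander exposed as a NUMA node (node ids are
 * placeholders; adjust for the target machine).
 * Build: cc -O2 chase.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STRIDE 64               /* one cache line per hop */

static double chase_ns(int node, size_t bytes, long hops)
{
    size_t n = bytes / STRIDE;
    char *mem = numa_alloc_onnode(bytes, node);
    if (!mem) { perror("numa_alloc_onnode"); exit(1); }

    /* Build a random cycle of cache-line-sized hops. */
    size_t *order = malloc(n * sizeof *order);
    for (size_t i = 0; i < n; i++) order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        *(void **)(mem + order[i] * STRIDE) =
            mem + order[(i + 1) % n] * STRIDE;

    void *p = mem;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < hops; i++)
        p = *(void **)p;                      /* dependent load chain */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    numa_free(mem, bytes);
    free(order);
    /* Keep `p` observable so the loop is not optimized away. */
    return p ? ns / hops : 0.0;
}

int main(void)
{
    if (numa_available() < 0) return 1;
    printf("local DRAM : %.1f ns/load\n", chase_ns(0, 256 << 20, 1 << 24));
    printf("CXL node   : %.1f ns/load\n", chase_ns(1, 256 << 20, 1 << 24));
    return 0;
}
```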


Memory sharing, introduced in the CXL 3.0 specification, allows multiple hosts to access a given CXL-attached memory allocation. The specification also defines fabric-attached memory expanders: devices that can contain various types of memory for pooling and sharing, and that can implement local memory tiering to present the performance characteristics of a host-optimized pool. This creates an interesting alternative to the SHMEM protocol defined by Cray Research, enabling extremely low-latency access to a shared memory pool by multiple hosts. Thanks to the native bus interconnect, this not only performs better than SHMEM library routines, but also offers a potentially much simpler programming model for parallel computing on the shared pool.
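To sketch what that simpler programming model could look like: assuming the shared fabric-attached region is exposed to each host as a devdax character device (/dev/dax0.0 is a hypothetical name) and that the CXL 3.0 fabric keeps the mapping coherent between hosts, ordinary mmap plus C11 atomics suffices for a release/acquire hand-off, with no message-passing library in the path.

```c
/* Sketch of SHMEM-style signaling over a shared CXL region. The
 * device path, the 2 MiB size, and cross-host coherency of the
 * mapping are all assumptions about the platform, not guarantees.
 * Build: cc -std=c11 share.c
 */
#include <fcntl.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int fd = open("/dev/dax0.0", O_RDWR);    /* hypothetical device */
    if (fd < 0) { perror("open"); return 1; }

    size_t len = 2 << 20;       /* devdax maps are hugepage-aligned */
    void *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* First cache line of the region carries the flag; the payload
     * starts one line later. Both hosts agree on this layout. */
    _Atomic unsigned *flag = base;
    unsigned *payload = (unsigned *)((char *)base + 64);

    if (argc > 1) {             /* "producer" role */
        *payload = 42;
        atomic_store_explicit(flag, 1, memory_order_release);
    } else {                    /* "consumer" role on another host */
        while (atomic_load_explicit(flag, memory_order_acquire) == 0)
            ;                   /* spin until the producer publishes */
        printf("got payload: %u\n", *payload);
    }

    munmap(base, len);
    close(fd);
    return 0;
}
```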


Another inherent value of CXL's latency reduction is its potential to enable device-to-device memory transactions, such as between multiple GPUs in one or more systems, without proprietary auxiliary buses or software layers to interconnect the devices. The immediate performance impact is easiest to demonstrate in small AI training scenarios. Integrating remote hardware directly into a shared memory system over fabric connections may open a new chapter in AI training, especially in the data center, because CXL introduces symmetric peer-to-peer communication capabilities that reduce the continued dependence on the CPU.


Overall, the CXL fabric creates the opportunity for server disaggregation, helping to overcome the limits that specific application workflows hit when resources are not part of the native system architecture. For example, when memory can be centralized in fabric-attached expanders, the need for independent (and siloed) memory in each system is reduced. The additional bandwidth provided by the dedicated memory bus and CXL helps resolve core memory-bandwidth bottlenecks, allowing individual servers to be designed and configured with more emphasis on performance than on capacity. Key to realizing this vision is the development of low-latency CXL switching and of flexible memory-tiering systems implemented in supporting software and expander hardware.
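Software-driven memory tiering of this kind is already expressible with standard Linux interfaces. Below is a minimal sketch, again treating node 1 as a placeholder for the expander, that demotes a single cold page from local DRAM to the expander via libnuma's move-pages wrapper; a real tiering system would pick victims from access statistics rather than by hand.

```c
/* Sketch: demoting one page from local DRAM to a CXL expander node.
 * CXL_NODE is a hypothetical id. Contents and the virtual address
 * are preserved; only the backing physical frame moves.
 * Build: cc tier.c -lnuma
 */
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CXL_NODE 1

int main(void)
{
    if (numa_available() < 0) return 1;

    long pagesz = sysconf(_SC_PAGESIZE);
    void *page;
    if (posix_memalign(&page, pagesz, pagesz)) return 1;
    memset(page, 0, pagesz);    /* fault it in on a local node */

    void *pages[1] = { page };
    int nodes[1]  = { CXL_NODE };
    int status[1] = { -1 };

    /* Ask the kernel to migrate the page to the expander node. */
    if (numa_move_pages(0 /* self */, 1, pages, nodes, status,
                        MPOL_MF_MOVE) != 0)
        perror("numa_move_pages");

    printf("page now on node %d\n", status[0]);
    free(page);
    return 0;
}
```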


Synopsys' new solution for CXL integrity and data encryption


Given the external switching introduced in CXL 2.0 and the enhanced fabrics of CXL 3.0, bus security becomes critical as data travels over cables outside the server. To protect data from unauthorized access or tampering, PCIe and CXL controllers can therefore use Integrity and Data Encryption (IDE) security IP modules, preserving security and privacy even if the link is physically accessed by outsiders.


Facing these growing security needs, Synopsys has launched an innovative solution that combines high-security CXL controllers with standards-compliant, customizable IDE security modules. Its purpose is to ensure that data is protected from tampering and physical attack during transmission within the SoC. More specifically, the scheme provides confidentiality, integrity and replay protection for FLITs in the case of the CXL.cache/CXL.mem protocols, and for transaction layer packets (TLPs) in the case of CXL.io. Notably, the module not only matches the controller's data-interface bus width and channel configuration, but is also carefully optimized for area, performance and latency, achieving near-zero-latency data transmission even in CXL.cache/CXL.mem skid mode.


CXL’s future outlook


After years of effort to standardize coherency protocols such as OpenCAPI and Gen-Z, the industry has converged on CXL. CXL controllers will use a streaming interface protocol called Credit Extensible Streams (CXS), which brings symmetric coherency to multi-processor architectures by encapsulating newer versions of CCIX. In its native form, this approach initially suffered higher latency from the added cost of coherency, especially for small write operations. CXS.B (the CXL-hosted version of CXS) neatly solves this challenge by providing dedicated pairs of streaming channels for symmetric communication between CPUs.


The development and adoption of CXL is profoundly changing high-performance computing, with clear progress and potential in reducing latency, sharing memory resources, and decoupling server functions. As the software frameworks mature, the technology is expected to usher in a new era of computing performance and efficiency for a growing hardware ecosystem. As a leading provider of PCIe and CXL physical layers, controllers, IDE, and verification IP, Synopsys draws on integration and verification experience from more than 1,800 design projects to significantly reduce risk and help SoC engineers accelerate their products' path to market.


*Disclaimer: This article is original to its author, and its content reflects the author's personal views. Semiconductor Industry Watch republishes it only to convey a different point of view; this does not mean Semiconductor Industry Watch agrees with or supports those views. If you have any objections, please contact Semiconductor Industry Watch.


Today is the 3557th issue of "Semiconductor Industry Observation" shared with you. Welcome to follow us.
