

One article to solve the problem | Zero-copy technology


Hello everyone, my name is Peter. Memory copying is a relatively time-consuming operation, and zero copy is a commonly used way to optimize it. Today's article is about zero-copy technology on Linux.


DMA and zero-copy technology

Note: Except for Direct I/O, all the disk file reads and writes discussed here go through the page cache.

1. Four data copies and four context switches

When serving a client request, many applications are essentially equivalent to making the following two system calls:

  1. File.read(file, buf, len);

  2. Socket.send(socket, buf, len);
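
To make the cost concrete, here is a minimal C sketch of this traditional read-then-write loop (the function name, buffer size, and error handling are illustrative assumptions, not code from the article):

```c
#include <unistd.h>

/* Traditional file-to-socket transfer: every chunk makes four hops:
 * disk -> page cache -> user buffer -> socket buffer -> NIC. */
ssize_t copy_file_to_socket(int file_fd, int socket_fd)
{
    char buf[4096];              /* user-space buffer: the extra hop */
    ssize_t n, total = 0;

    /* Each iteration costs four context switches:
     * read() enter/exit + write() enter/exit. */
    while ((n = read(file_fd, buf, sizeof(buf))) > 0) {
        if (write(socket_fd, buf, (size_t)n) != n)
            return -1;           /* short write treated as an error here */
        total += n;
    }
    return n < 0 ? -1 : total;
}
```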


The message middleware Kafka is a typical example of this scenario: it reads a batch of messages from disk and writes them, unmodified, to the network interface controller (NIC) for sending.

Without any optimization, the operating system performs four data copies and four context switches, as shown in the following figure:


Unoptimized, reading data from disk and then transmitting it through the NIC performs poorly:

4 copies:

  1. The CPU moves data from the disk to the page cache in kernel space;

  2. The CPU moves data from the page cache in kernel space to the buffer in user space;

  3. The CPU moves data from the buffer in user space to the socket buffer in kernel space;

  4. The CPU moves data from the socket buffer in kernel space to the network card;

4 context switches:

  1. When the read system call is made: user mode switches to kernel mode;

  2. When the read system call returns: kernel mode switches back to user mode;

  3. When the write system call is made: user mode switches to kernel mode;

  4. When the write system call returns: kernel mode switches back to user mode;

We can't help but complain:

  1. Having the CPU copy data within memory is acceptable, because memory is fast enough. Having the CPU shuttle data between memory and the disk or the network is not: the disk and the NIC are far slower than memory, and memory is far slower than the CPU, so the CPU ends up stalled on slow devices;

  2. 4 copies are too many, and 4 context switches are too frequent;

2. Four data copies with DMA

DMA technology is easy to understand: in essence, DMA is an independent chip on the motherboard. When transferring data between memory and an I/O device, the transfer is no longer driven by the CPU but by the DMA controller (DMAC). You can think of this chip as a co-processor.

DMAC is most valuable when the data to be transferred is very large and arrives very fast, or when it is very small and arrives very slowly.

For example, when a gigabit NIC or a hard disk transfers a large amount of data, the CPU would be overwhelmed if it handled the transfer itself, so we hand it to the DMAC. Conversely, when data arrives slowly, the DMAC can wait for the data and only then signal the CPU, instead of making the CPU idle-wait.

Note the "co-" in co-processor: the DMAC assists the CPU in completing the transfer. The CPU still controls the DMAC during the transfer, but the actual data copying is no longer performed by the CPU.

Originally, data flowing between any two components of a computer had to pass through the CPU, as shown in the following figure:

Now DMA handles data transfer between memory and the disk and between memory and the NIC, with the CPU acting only as the DMA controller's supervisor, as shown in the following figure:


However, DMA has its limits: it can only copy data exchanged between memory and a device. Copies within memory itself must still be performed by the CPU; for example, the CPU remains responsible for copying data between kernel-space and user-space buffers (a memory-to-memory copy), as shown in the following figure:


In the figure above, the read buffer is the page cache, and the socket buffer is the kernel's socket send buffer.

3. Zero-copy technology

3.1 What is zero copy technology?

Zero copy is an idea [3]: when the computer performs an operation, the CPU should not have to first copy data from one memory area to another specific area.

So the defining characteristic of zero copy is that the CPU does not itself write memory data out to other components; it only plays a management role. Note that zero copy does not mean no copying at all; it means the CPU no longer performs the data transfer end to end. If the data is not yet in memory, it must still be brought into memory somehow (without the CPU participating in the transfer), because only data in memory can be moved onward and read or computed on directly by the CPU.

There are many specific ways to implement zero-copy technology, such as:

  • sendfile

  • mmap

  • splice

  • Direct I/O
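
Of these, splice is listed above but not analyzed below. As a side note, here is a minimal, hedged C sketch of what a splice-based file-to-socket transfer can look like on Linux (the function name and chunking are illustrative assumptions; splice requires a pipe as the in-kernel intermediary):

```c
#define _GNU_SOURCE          /* splice() is Linux-specific */
#include <fcntl.h>
#include <unistd.h>

/* splice-based transfer: data moves between kernel buffers via a
 * pipe, never entering user space. */
ssize_t splice_file_to_socket(int file_fd, int socket_fd, size_t len)
{
    int pipefd[2];
    ssize_t total = 0;

    if (pipe(pipefd) < 0)
        return -1;

    while ((size_t)total < len) {
        /* file -> pipe (in-kernel buffer) */
        ssize_t n = splice(file_fd, NULL, pipefd[1], NULL,
                           len - (size_t)total, SPLICE_F_MOVE);
        if (n <= 0)
            break;
        /* pipe -> socket, draining everything just spliced in */
        ssize_t sent = 0;
        while (sent < n) {
            ssize_t m = splice(pipefd[0], NULL, socket_fd, NULL,
                               (size_t)(n - sent), SPLICE_F_MOVE);
            if (m <= 0) { total = -1; goto out; }
            sent += m;
        }
        total += n;
    }
out:
    close(pipefd[0]);
    close(pipefd[1]);
    return total;
}
```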


Different zero-copy technologies are suitable for different application scenarios. The following analyzes sendfile, mmap, and Direct I/O in turn.

Before diving in, here is a forward-looking summary of these technologies:

  • DMA review: DMA copies data between memory and other components; the CPU only manages the transfer and does not perform the copy itself.

  • Zero copy using the page cache:

    • sendfile: replaces one read/write system call pair, achieving zero copy through DMA and by passing file descriptors instead of copying data

    • mmap: replaces only the read system call; it maps kernel-space addresses into user-space addresses, so writes act directly on kernel space. Through DMA and address mapping, no data is copied between user space and kernel space, achieving zero copy

  • Direct I/O without the page cache: reads and writes go straight to the disk, bypassing the page cache; it is usually paired with a cache in user space. Data is exchanged with the disk/NIC directly via DMA, achieving zero copy

3.2 sendfile

The application scenario of sendfile is: the application reads file data from disk and then transmits it over the network without any computation or processing. A typical application in this scenario is a message queue.

Under traditional I/O, a data transfer in this scenario requires four CPU-managed copies and four context switches, as described in Section 1.

sendfile mainly uses two technologies:

  1. DMA technology;

  2. Pass file descriptors instead of copying data;

The following explains the role of each of these two techniques.

1. Using DMA technology

sendfile relies on DMA to take over two of the four CPU copies, as shown in the following figure:

Using DMA to eliminate two of the CPU copies


DMA copies data from the disk to the page cache (read buffer) in kernel space, and from the socket buffer in kernel space to the NIC.

2. Pass file descriptors instead of data copies

Passing a file descriptor can replace a data copy for two reasons:

  • The page cache and socket buffer are both in kernel space;

  • The data is not modified at any point during the transfer;

Use file descriptor transfer instead of kernel data copy


Note: Only when the NIC supports SG-DMA (Scatter-Gather Direct Memory Access) can the CPU copy within kernel space be avoided by passing file descriptors. In other words, this optimization depends on whether the machine's physical NIC supports it (Linux introduced the DMA scatter/gather feature in kernel version 2.4, so just make sure the kernel version is 2.4 or higher).

3. One system call instead of two system calls

sendfile corresponds to a single system call, whereas the traditional approach requires two: read and write.

Because of this, sendfile reduces the context switches between user mode and kernel mode from 4 to 2.

The sendfile system call requires only two context switches.


On the other hand, note the limitation of the sendfile system call: if the application needs to process the data read from disk, for example to encrypt or decrypt it, sendfile cannot be used at all, because the user thread never gets access to the transmitted data.
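
For reference, a minimal C sketch of the sendfile-based transfer described above (the function name and loop structure are illustrative assumptions):

```c
#include <sys/sendfile.h>
#include <sys/stat.h>

/* Zero-copy file-to-socket transfer: one system call per chunk,
 * and the data never enters a user-space buffer. */
ssize_t send_file_to_socket(int file_fd, int socket_fd)
{
    struct stat st;
    off_t offset = 0;

    if (fstat(file_fd, &st) < 0)
        return -1;

    while (offset < st.st_size) {
        /* sendfile copies directly between kernel buffers (or, with
         * SG-DMA support, merely passes buffer descriptors). */
        ssize_t n = sendfile(socket_fd, file_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n <= 0)
            return -1;
    }
    return (ssize_t)offset;
}
```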

3.3 mmap

The mmap technique is discussed separately in [4]; please read it there.
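
Still, for completeness, here is a minimal C sketch of the mmap + write combination summarized in Section 3.1 (the function name and simplifications are illustrative assumptions):

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* mmap replaces read(): the file's page-cache pages are mapped into
 * user space, so there is no page cache -> user buffer copy. */
ssize_t mmap_file_to_socket(int file_fd, int socket_fd)
{
    struct stat st;
    if (fstat(file_fd, &st) < 0)
        return -1;

    void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ,
                      MAP_SHARED, file_fd, 0);
    if (addr == MAP_FAILED)
        return -1;

    /* write() still copies from the mapped pages into the socket
     * buffer, but the read-side copy has been eliminated. */
    ssize_t n = write(socket_fd, addr, (size_t)st.st_size);

    munmap(addr, (size_t)st.st_size);
    return n;
}
```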

3.4 Direct I/O

Direct I/O means exactly what it says: the word "direct" distinguishes it from cached I/O, which goes through the page cache mechanism.

  • Cached file I/O: user-space reads and writes of a file do not interact with the disk directly; a caching layer, the page cache, is sandwiched in between;

  • Direct file I/O: user-space reads and writes interact with the disk directly, with no page cache layer in between;

"Direct" has another layer of semantics here: in all other technologies, data needs to be stored at least in kernel space, but in Direct I/O technology, data is stored directly in user space, bypassing the kernel.

The Direct I/O mode is shown in the following figure:

Direct I/O diagram


Here, user space exchanges data directly with the disk and the NIC via DMA.

Direct I/O reads and writes behave very distinctively:

  • Write: since the page cache is not used, if a write to a file returns success, the data has actually been written to disk (ignoring the disk's own cache);

  • Read: since the page cache is not used, every read actually comes from the disk, not from a file system cache.

In fact, even Direct I/O may still require the use of the operating system's fsync system call. Why?

This is because, although the file data itself bypasses any cache, the file metadata is still cached, including the inode cache and the dentry cache in the VFS.

On some operating systems, a write system call in Direct I/O mode guarantees that file data reaches the disk, but file metadata may not. On such systems you must also issue an fsync system call to ensure the metadata is written to disk; otherwise file anomalies and metadata corruption may occur. MySQL's O_DIRECT and O_DIRECT_NO_FSYNC options are a concrete example of this [9].
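
Putting the pieces together, here is a minimal C sketch of a Direct I/O write, showing both the buffer-alignment requirement of O_DIRECT and the fsync for metadata just discussed (the function name, path handling, and the 4 KiB alignment value are illustrative assumptions; the required alignment is device-dependent):

```c
#define _GNU_SOURCE            /* needed for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Direct I/O write: bypasses the page cache, so the buffer address,
 * file offset, and transfer length must all be aligned to the
 * device's logical block size (assumed 4096 here). */
int direct_write(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return -1;

    size_t aligned_len = (len + 4095) & ~(size_t)4095;  /* round up */
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, aligned_len) != 0) { /* aligned buffer */
        close(fd);
        return -1;
    }
    memset(buf, 0, aligned_len);
    memcpy(buf, data, len);

    ssize_t n = write(fd, buf, aligned_len);  /* DMA moves this to disk */

    if (n >= 0 && ftruncate(fd, (off_t)len) < 0)  /* trim zero padding */
        n = -1;

    /* File data bypassed the cache, but metadata (file size, inode)
     * may not be durable yet; fsync flushes it. */
    if (n >= 0 && fsync(fd) < 0)
        n = -1;

    free(buf);
    close(fd);
    return n < 0 ? -1 : 0;
}
```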

Advantages and disadvantages of Direct I/O:

(1) Advantages

  1. Direct I/O in Linux skips the kernel buffer used by cached I/O; data moves directly between the application's address space and the disk. Self-caching applications can thus bypass the complex system-level cache structure and manage reads and writes themselves, reducing the impact of system-level cache management on application data access.

  2. Like other zero-copy techniques, it avoids copying data from kernel space to user space. When the amount of data to transfer is large, moving it without the kernel address space's copy step greatly improves performance.

(2) Disadvantages

  1. Since transfers between devices are done via DMA, the memory pages of the user-space buffer must be pinned, so that their physical page frames are not swapped to disk or moved to a new address; otherwise DMA would fail to find the pages at the expected addresses during the copy, causing a page fault. The overhead of page pinning is no less than that of a CPU copy, so to avoid frequent pinning system calls, the application should allocate and register a persistent memory pool for its data buffers.

  2. If the accessed data is not in the application's cache, it is loaded directly from disk every time, which is very slow.

  3. Introducing direct I/O requires the application layer to manage caching itself, which adds extra system complexity;


Who uses Direct I/O?

An article from IBM [5] notes that self-caching applications may choose to use Direct I/O.

Self-caching applications

Some applications have their own data caching mechanism, e.g., they cache data within the application's address space. Such applications have no need for the operating system kernel's cache at all and are called self-caching applications.

For example, the application maintains a cache internally: on a read, it first checks its own cache, and only on a miss does it read from disk via Direct I/O. The cache still exists, but the application believes its own cache is more efficient than the operating system's.

Database management systems are a typical example of this class of application. Self-caching applications tend to work with logical representations of data rather than physical ones; when system memory runs low, they let the logical cache of the data be swapped out rather than the actual data on disk. They also understand the semantics of the data they operate on, so they can use more efficient cache replacement algorithms. And since a self-caching application may share a block of memory across multiple hosts, it needs a mechanism to efficiently invalidate cached data in user address space, to keep the cached data in the application's address space consistent.

On the other hand, Linux's current native asynchronous I/O library relies on files being opened in O_DIRECT mode, so the two are usually used together.

How to use Direct I/O?

A user application needs to implement its own cache in user space and serve reads and writes from that cache whenever possible. For performance, avoid issuing frequent reads and writes directly through Direct I/O.

4. Typical Cases

4.1 Kafka

As a message queue, Kafka's disk I/O involves two main operations:

  • A producer sends messages to Kafka, which persists them to disk in the form of logs;

  • A consumer pulls messages from Kafka, which reads a batch of log messages from disk and sends them out through the NIC;

The Kafka server uses the mmap mechanism [6] to receive messages from producers and persist them; built on sequential disk I/O, this provides efficient persistence. The Java class involved is java.nio.MappedByteBuffer.

The Kafka server uses the sendfile mechanism [7] when sending messages to consumers. This has two main advantages:

  • sendfile avoids CPU-driven data movement from kernel space to user space;

  • sendfile is built on the page cache, so if multiple consumers consume messages from the same topic simultaneously, the messages stay cached in the page cache and a single disk I/O can serve all of them.

Using mmap to persist received data and sendfile to read the persisted data and send it out is a common combination. But note that the roles cannot be swapped: sendfile cannot persist incoming data, and mmap cannot push data out without the CPU taking part in the transfer.

4.2 MySQL

The internals of MySQL are far more complicated than Kafka's, because a database that supports SQL queries is much more complex than a message queue.

For more information on MySQL's use of zero-copy techniques, please see my other article [8].

5. Conclusion

The introduction of DMA means that when data is copied between memory and another component, such as a disk or NIC, the CPU only needs to issue a control signal; the copy itself is carried out by the DMA controller.

Linux's zero-copy technology has many implementations, but by strategy they fall into the following categories:

  • Reduce or even avoid data copying between user space and kernel space: in some scenarios, the user process does not need to access or process the data during transmission, so data transfer between the page cache and the user process's buffer can be avoided entirely. The copying then happens wholly within the kernel, and in some cases even copying inside the kernel can be avoided. This class of implementation is usually exposed through new system calls, such as mmap(), sendfile(), and splice() in Linux.

  • Direct I/O bypassing the kernel: lets user-mode processes bypass the kernel and transfer data directly to the hardware, with the kernel doing only management and auxiliary work during the transfer. This method pursues the same goal as the first: avoiding data transfer between user space and kernel space. The difference is that the first completes the transfer in kernel mode, while this one bypasses the kernel and talks to the hardware directly; the effect is similar, but the principle is entirely different.

  • Optimizing transfers between the kernel buffer and the user buffer: this approach focuses on optimizing the CPU copy between the user process's buffer and the operating system's page cache. It continues the traditional communication pattern, but is more flexible.


[Reposted from https://spongecaptain.cool/SimpleClearFileIO/]

