Article count:1086 Read by:1552441

Account Entry

The relationship and evolution history of Linux kernel Page Cache and Buffer Cache

Latest update time:2021-10-08 10:21
    Reads:

When we persist data and write file contents to disk, we often use fsync operation, which will write the dirty page data (actual file contents and metadata information) associated with the file back to disk. The dirty page mentioned here is the page cache.


The buffer cache is a layer of cache built by the kernel to speed up access to the underlying storage medium. It caches some disk data. When there is a disk read request, it will first check whether there is corresponding data in the block cache. If so, it will directly return the corresponding data, thereby reducing disk access.


The two layers of cache each have their own cache targets. I am curious about the relationship between the two. This article mainly refers to several kernel materials, and the corresponding kernel source code versions mainly include: linux-0.11, linux-2.2.16, linux-2.4.0, linux-2.4.19, linux-2.6.18.

The roles of the two types of cache

Page Cache

Page Cache caches file contents in units of Page. File data cached in Page Cache can be read faster by users. At the same time, for buffered write operations, data can be returned immediately after being written to Page Cache without waiting for the data to be actually persisted to the disk, thereby improving the overall performance of upper-layer applications in reading and writing files.

Buffer Cache

The smallest data unit of the disk is sector. Each time the disk is read or written, the disk is operated in sectors. The sector size depends on the specific disk type. Some are 512Bytes and some are 4K Bytes. Whether the user wants to read 1 byte or 10 bytes, when accessing the disk, it must be read in sectors. If the disk is read naked, it means that the efficiency of data reading will be very low. Similarly, if the user wants to write (update) 1 byte of data to a certain location on the disk, he must also refresh a sector. In other words, before writing this byte, we need to read all the disk sector data where the 1 byte is located, modify the corresponding 1 byte data in the memory, and then write the entire modified sector data to the disk in one go. In order to reduce this kind of inefficient access and improve the disk access performance as much as possible, the kernel will build a layer of cache on the disk sector. It caches part of the sector data in the memory in units of integer multiples of the sector (block). When there is a data read request, it can directly read the corresponding data from the memory. When data is written, it can directly update the specified part of the data in the memory, and then write the updated data back to the corresponding disk sector asynchronously. This layer of cache is the block cache Buffer Cache.

The logical relationship between the two types of cache

From the kernel source code of linux-2.6.18, Page Cache and Buffer Cache are two manifestations of the same thing: for a Page, upward, it is a Page Cache of a certain File, and downward, it is also a group of Buffer Caches on a Device.


The file address space is divided into 4K (page size) units, and each 4K may correspond to a page (the possible meaning here is that only the cached part will correspond to the page, and the non-cached part will not correspond), and this 4K page is a Page Cache of this file. For a file on the disk, eventually, this 4K page cache needs to be mapped to a group of buffer caches corresponding to disk blocks. Assuming that the block is 1K, then each page cache will correspond to a group (4) of buffer caches, and each buffer cache has a corresponding buffer cache and device block mapping descriptor: buffer_head, which records the specific location of the block corresponding to this buffer cache on the disk.


The above figure only shows the relationship between Page Cache and Buffer Cache (buffer_head), and the corresponding blocks. From the perspective of File, if you want to write data to disk, the first step is to find which page of the page cache corresponds to the specific location of the file? Then you can write the data. To find the corresponding page, you need to rely on the i_mapping field in the inode structure:


This field is an address_space structure, and in fact, address_space is a radix tree. Simply put, radix tree is a multi-level index structure. If the size of a file is divided into pages, assuming that a file has N pages, and N is a 32-bit int, then this 32-bit N can be divided into several levels: level-0: [0 - 7bit], level-1: [8 - 15bit], level-2: [16 - 23bit], level-3: [24 - 31bit]. When searching for the existence of a page corresponding to a certain position in a file, take the position N where the page is located and search on the corresponding radix-tree. When searching, first use the level-0 part of N to search on the level-0 hierarchical index on the radix tree. If it does not exist, it is directly notified that it does not exist. If it exists, then further use the level-1 part of N to search on the corresponding level-1 under this level-0, and search one level at a time. Thus, we can see that, at most, by searching on 4 levels of index, we can find the page information corresponding to N. For a detailed description of radix-tree and address_space, please refer to the descriptions in [12] and [2]. Here, we borrow a picture from [12] and [2] to better illustrate the appearance of the radix-tree (address_space) structure:

The basic radix-tree mapping structure:


On the corresponding inode, the mapping relationship between the i_mapping field (address_space) and the page is:


The evolution of two types of cache

Although, in the current Linux Kernel code, Page Cache and Buffer Cache are actually unified, whether it is the file Page Cache or the Block Buffer Cache, they are finally unified on the Page. However, when reading the older code, we can see that the implementation of these two caches was originally completely separate. What is the reason that these two types of caches finally "come together"? The answers of everyone in [10] made me suddenly enlightened. I will try to sort out the origin of this evolution.

Phase 1: Buffer Cache Only

In the Linux-0.11 version code, we can see that the buffer cache is a completely independent implementation, and it is not even based on pages as memory units, but appears in the form of raw pointers. Each block sector corresponds to an independent buffer cache unit in the kernel, and this buffer cache unit is described by the buffer head:


Among them, when buffer_head is initialized, its internal b_data points to the original memory address:


Among them, b_data points to the specific buffer cache content, while b_dev and b_blocknr represent the device corresponding to this cache and the block number information on the device.

The kernel returns a buffer cache unit (buffer header) corresponding to a specified dev, blocknr sector to the caller through the getblk function. The upper layer reads and writes this buffer_header, which will eventually be mapped to the reading and writing of the corresponding (dev, blocknr) sector.


If a corresponding buffer cache unit (dev, blocknr) has been allocated in the kernel, it will be directly returned to the user through get_hash_table. If not, the corresponding buffer_header will be created first, added to the hash_table (inser_into_queues), and finally returned to the user.

The upper layer's reading and writing of files will be converted into reading and writing of the corresponding buffer_header:


When file_read is executed, the actual position of dev and blocknr is calculated by f_pos, and the corresponding buffer_head is obtained by bread. After that, the data in the buffer cache unit is backfilled to the target buf by put_fs_byte (data reading).

Similarly, when writing data to a file, the corresponding dev and blocknr position information is first calculated through f_pos, and then the corresponding buffer_head is obtained through bread, and the data is written to the buffer cache unit corresponding to the buffer_header.


From the implementation of file_read and file_write above, we can see that bread returns the target buffer_head, allowing the upper layer to only operate the buffer cache unit and no longer care about the bottom layer of the block.


Inside bread, the getblk function mentioned above is used to return the corresponding buffer_head and then perform data reading.

Phase 2: Page Cache and Buffer Cache coexist

In Linux-2.2, the cache for disk file access is still the Buffer Cache. Its access mode is similar to the access logic of Linux-0.11 above. However, at this time, the Buffer Cache has allocated memory based on pages, and the buffer_head has some information about the page:


At the same time, from the initialization of the buffer cache and the action of creating a new buffer cache unit when the buffer cache is insufficient, we can also see that the buffer cache is now completely based on page allocation.


When the buffer cache is insufficient, add buffer cache via grow_buffers:


And complete the initialization construction of buffer_head through create_buffers:


Taking the Linux-2.2.16 version code as an example, when executing disk file writing, the buffer_head information of the corresponding position will be obtained through xxx_getblk, and the corresponding data will be written into the buffer. After that, a step of update_vm_cache will be executed . As for why this step is executed, we will look at it later.


For the corresponding file reading, it is the same, first find the corresponding buffer_head through xxx_getblk, and then complete the corresponding data reading. (Through the while loop, take out the buffer_head of all target blocks at once, and then read all the data at once)


Finally, xxx_getblk uses the getblk interface to locate the specified buffer_head:


From the above description, we can see that the buffer cache at this time allocates memory based on pages, but it is completely independent of the Page Cache and has nothing to do with it.

In Linux-2.2 version, what is Page Cache used for?

(1). mmap for files:

From [10]:

page cache was used to cache pages of files mapped with mmap MAP_FILE among other things.

From [11]:

read() and write() are implemented using the buffer cache. The read() system call reads file data into a buffer cache buffer and then copies it to the application.The mmap() system call, however, has to use the page cache to store its data since the buffer cache memory is not managed by the VM system and thus not cannot be mapped into an application address space. Therefore the file data in the buffer cache is copied into page cache pages, which are then used to satisfy page faults on the application mappings.

For network-based filesytems:

From [1]:

Disk-based filesystems do not directly use the page cache for writing to a regular file. This is a heritage from older versions of Linux, in which the only disk cache was the buffer cache. However, network-based filesystems always use the page cache for writing to a regular file.

At this point, the relationship between Page Cache and Buffer Cache is shown in the following figure:


The Page Cache is only responsible for processing the mmap part, while the Buffer Cache is actually responsible for all IO access to the disk. From the above figure, we can also see one of the problems: write bypasses the Page Cache, which leads to a synchronization problem. When write occurs, the valid data is in the Buffer Cache, not in the Page Cache. This may cause inconsistencies in the file data accessed by mmap. To solve this problem, all writes based on the disk file system need to call the update_vm_cache() function, which will modify the Page Cache corresponding to the write-related Buffer Cache. From the code, we can see that in the above sysv_file_write, after calling copy_from_user, update_vm_cache will be called.

Similarly, it is this design of separation of Page Cache and Buffer Cache that results in the fact that for disk-based files, the same data may have one copy in the Page Cache and one copy in the Buffer Cache at the same time.

Phase 3: Integration of Page Cache and Buffer Cache

Due to the drawbacks of the above-mentioned Page Cache and Buffer Cache separation design, the implementation of Page Cache and Buffer Cache was integrated in Linux-2.4 version. The integrated Buffer Cache no longer exists in an independent form. The content of Buffer Cache exists directly in Page Cache. At the same time, the descriptor unit of Buffer Cache is retained: buffer_head


In the page structure, whether the buffers field is empty is used to determine whether the Page is associated with a set of Buffer Caches (in the subsequent evolution process, this judgment is changed to be determined by the private field).


Correspondingly, buffer_head adds the field b_page, which directly points to the corresponding page.


At this point, the relationship between the two has been integrated as shown in the following figure:


The PageCache (page) of a file can quickly determine the buffer_head information corresponding to the page through the buffers field, and then clarify the device, block and other information corresponding to the page.

Logically, when a write request for a file enters the kernel, generic_file_write is executed. At this layer, a new page is allocated as the page cache for the corresponding write through the inode's address_space structure mapping (here we assume that it is a new write and the data volume is only one page): __grab_cache_page. After the memory space page is allocated, prepare_write is used to complete the construction of the corresponding buffer_head.


prepare_write actually executes: __block_prepare_write, in which it allocates the corresponding buffer_head for the page (create_empty_buffers), calculates the specific location of the actual write on the device: blocknr, and then initializes the buffer_head (get_block)


Inside create_empty_buffers, through a series of operations such as create_buffers and set_bh_page, page and buffer_head are organized into a relationship of mutual association through buffers, b_page, etc. as shown in the previous figure.


Allocate a set of concatenated buffer_heads through create_buffers


Use set_bh_page to associate each buffer_head with the corresponding page and the specific location of the data


It is the above series of actions that bind Page Cache and Buffer Cache (buffer_head) to each other. On the top, when reading and writing files, they are processed in units of pages. On the bottom, when data is refreshed to the device, it can be processed in units of buffer_head (block).

In the subsequent linux-2.5 version, the bio structure was introduced to replace the buffer_head-based block device IO operations.

[Note] : The fusion of Page Cache and Buffer Cache here refers to the fusion of Page Cache and Buffer Cache at the file level. For cross-layer: Page Cache at the file level and raw device Buffer Cache, although they are unified into Page-based implementation, the Page Cache of the file and the Buffer Cache accessed by the block corresponding to the file at the raw device level are two completely independent pages. In this case, the data on a physical disk block still corresponds to two pages in the Linux kernel, one is the Page Cache (Page Cache) of the file accessed through the file layer, and the other is the Page Cache (Buffer Cache) accessed through the raw device layer.

Original: http://lday.me/2019/09/09/0023_linux_page_cache_and_buffer_cache/
Author: lday

References

[1]. Understanding the Linux Kernel
[2]. Professional Linux Kernel Architecture
[3]. The Art of Linux Kernel Design
[4]. Linux Kernel Development
[5]. A Heavily Commented Linux Kernel Source Code
[6]. Linux kernel source code scenario analysis
[7]. A unique approach to kernel design: A guide to the Linux kernel source code
[8]. Introduction to the file cache management mechanism of the Linux kernel
[9]. Linux kernel file cache mechanism
[10]. What is the major difference between the buffer cache and the page cache
[11]. UBC: An Efficient Unified I/O and Memory Caching Subsystem for NetBSD
[12]. Trees I: Radix trees

 
EEWorld WeChat Subscription

 
EEWorld WeChat Service Number

 
AutoDevelopers

About Us About Us Service Contact us Device Index Site Map Latest Updates Mobile Version

Site Related: TI Training

Room 1530, Zhongguancun MOOC Times Building,Block B, 18 Zhongguancun Street, Haidian District,Beijing, China Tel:(010)82350740 Postcode:100190

EEWORLD all rights reserved 京B2-20211791 京ICP备10001474号-1 电信业务审批[2006]字第258号函 京公网安备 11010802033920号 Copyright © 2005-2021 EEWORLD.com.cn, Inc. All rights reserved