【Summary】
【Reason for Writing】
【Problem Construction】
【Analysis 1】Overall Process
【Analysis 2】get_free_pages and mmap
【Analysis 3】CPU and TLB
【Analysis 4】CPU and L1 Cache
【Analysis 5】CPU and L2 Cache
【Conclusion】
【Summary】
Whether the architecture is ARM, PowerPC, MIPS, or x86, speeding up memory access is one of the most important ways for a CPU to improve its performance, and the cache exists for exactly this reason. Likewise, whether the operating system is Linux or Windows, making full use of CPU features is a good way to improve system performance; the page table switch performed during a process context switch is a perfect example of how Linux exploits the MMU. Performance is an eternal theme of software, and understanding the MMU is crucial for software developers. So how do memory and cache cooperate? What are the CPU's L1 and L2 caches? What are the advantages and disadvantages of fully associative and set associative caches, and how are they used? This article uses a common cache problem to analyze how Linux supports the MMU mechanism of the ARM architecture.
【Reason for Writing】
1 To make it easier for me to review these points in the future, and hopefully to provide some reference for readers who want to understand the cache.
2 Cache problems are quite common, and using the cache properly can improve software performance, so it is important to understand the relationship between the CPU and the cache.
【Problem Construction】
First, we construct a common cache problem:
Step 1: Allocate a block of memory in kernel mode:
virt_svc = __get_free_pages(GFP_KERNEL | GFP_DMA, order);
Notes:
1) Interfaces such as kmalloc/alloc_pages can also be used for the allocation;
2) The virt_svc obtained here is a kernel-space linear address.
Step 2: In kernel mode, convert the virt_svc obtained in Step 1 into a physical address: phy_addr = virt_to_phys(virt_svc);
Step 3: A user-space process maps the physical address obtained in Step 2 into its own address space through mmap.
fd = open("/dev/mem", O_RDWR|O_DSYNC, 0);
virt_user = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, phy_addr);
mmap maps the physical memory phy_addr allocated in Step 2 to the user-space address virt_user; the length of the mapped region is len.
Note:
1) The virt_user address is uncached.
Reason: open("/dev/mem", O_RDWR|O_DSYNC, 0) specifies the O_DSYNC flag, so the mem character driver clears the C bit of the page table entries when it builds the page table for virt_user. See the kernel code drivers/char/mem.c; this is not repeated here.
Step 4: Write 0x12121212 to the memory through the virt_svc obtained in Step 1; then write 0x34343434 through the virt_user obtained in Step 3. When the memory is then tracked through virt_user, its contents are very likely found to have been modified.
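For readability, below is a minimal sketch of the user-space side of the construction above (Steps 3 and 4). The values of phy_addr and len are placeholders: in the real scenario phy_addr is the value returned by virt_to_phys() in Step 2, and the kernel-side write through virt_svc happens in a driver.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    off_t  phy_addr = 0x30000000;   /* placeholder: value from virt_to_phys() in Step 2 */
    size_t len      = 4096;         /* placeholder mapping length */

    int fd = open("/dev/mem", O_RDWR | O_DSYNC, 0);
    if (fd < 0)
        return 1;

    /* virt_user is an uncached alias of the physical page that the kernel
     * already maps (cached, write-back) at virt_svc. */
    volatile uint32_t *virt_user = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                        MAP_SHARED, fd, phy_addr);
    if (virt_user == MAP_FAILED)
        return 1;

    virt_user[0] = 0x34343434;      /* Step 4: write through the uncached mapping */
    printf("virt_user[0] = 0x%08x\n", (unsigned)virt_user[0]);

    munmap((void *)virt_user, len);
    close(fd);
    return 0;
}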
The above constructs a very typical cache problem, and its cause is actually quite simple; here it only serves to introduce the topic of this article. In the following, we will introduce the CPU and cache in the ARM architecture step by step, in the form of an analysis of this problem.
【Analysis 1】Overall Process
To know how the data in memory gets changed, we must first understand how the CPU accesses data in memory. This article discusses the ARMv6 architecture. The following figure illustrates the process by which the CPU accesses memory, and the experiments are also based on it:
Notes:
1. When the CPU issues a virtual address request, the first step is address translation, which involves what we often call the TLB (translation lookaside buffer). The TLB is in fact also a cache, but it differs from the L1 and L2 caches: it is the area the MMU uses to cache page table entries. Readers are assumed to have some understanding of the paging mechanism, so it is not elaborated here; for more information, refer to the blog post "How does the Linux kernel implement the paging mechanism?".
In fact, not understanding this process does not hinder understanding of this article. You can simply assume that through it the CPU obtains the page table entry, and thus knows the physical base address corresponding to the VA as well as the C/B bits, which will be introduced later. The C/B bits are very important and directly determine how the L1 and L2 caches are accessed.
2. Whether the CPU accesses the L1 and L2 caches is determined by the C/B bits mentioned above. This is critical and will be discussed in the subsequent analysis. For now, assume that the CPU does access the cache; how it does so will be explained in detail later. In short, the CPU obtains the data from the cache.
3. If the C/B bits of the page table entry are configured as non-cacheable, or the cache access misses, the CPU issues a physical memory access directly.
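As a point of reference, the hardware view of these bits can be sketched as follows. This assumes the ARMv6 short-descriptor second-level small-page format (B at bit 2, C at bit 3, page base at bits [31:12]); the exact layout should be checked against the ARM Architecture Reference Manual for the part in use.

#include <stdint.h>
#include <stdio.h>

/* Sketch: extracting the C/B bits from an ARMv6 second-level small-page
 * descriptor. Bit 2 is B (bufferable), bit 3 is C (cacheable). */
#define PTE_B          (1u << 2)
#define PTE_C          (1u << 3)
#define PTE_BASE_MASK  0xFFFFF000u

int main(void)
{
    uint32_t pte = 0x8000045Eu;   /* example descriptor value */

    printf("page base = 0x%08x, C = %u, B = %u\n",
           pte & PTE_BASE_MASK,
           (pte & PTE_C) ? 1u : 0u,
           (pte & PTE_B) ? 1u : 0u);
    return 0;
}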
The following is an introduction to the cache-related software implementation in the Linux kernel in the form of problem analysis.
【Analysis 2】get_free_pages and mmap
1 __get_free_pages(GFP_KERNEL | GFP_DMA, order)
1. First of all, it is important to clarify where the memory used by the software comes from; this is crucial to analyzing the problem. __get_free_pages is a common way to request memory in Linux. Do you know where the memory it returns comes from and what characteristics it has?
Look at the two flags GFP_KERNEL | GFP_DMA. Linux usually defines zone types such as ZONE_DMA/ZONE_NORMAL/ZONE_HIGHMEM:
1) ZONE_DMA: mainly kept for compatibility with early devices that can only perform DMA within 0-16 MB; many platforms do not use it.
2) ZONE_HIGHMEM: generally used on 32-bit systems with more than 896 MB of memory, where the kernel's linear address range is not large enough to map everything. The S2 platform does not use it either.
3) ZONE_NORMAL: ordinary low memory, used by most platforms. This function actually requests page frames from ZONE_NORMAL, which corresponds to the kernel's low-memory area.
2. For how the page table entries of low memory are configured during Linux startup, refer to the blog post "How the Linux kernel implements the paging mechanism".
In short, the Linux kernel defines the initial page table entry attributes in mem_types:
prot_pte = L_PTE_PRESENT | L_PTE_YOUNG | L_PTE_DIRTY
In fact, during kernel startup, build_mem_type_table() further adjusts these attributes according to the ARM variant. On S2, the low-memory cache is configured as write-back (write-allocate):
#define L_PTE_MT_WRITEALLOC (_AT(pteval_t, 0x07) << 2)
The following conclusions are drawn from the above two points:
1) For the memory requested by __get_free_pages, the corresponding page table entries in the kernel linear address space have the write-back attribute.
Experiment: in kernel mode, try to change the page table entry attributes of the memory requested by __get_free_pages to uncached (implemented via ioremap_page_range):
Experimental conclusion: in theory this should solve the problem, but the experiment still fails. Why? Because the Linux kernel does not allow low memory whose page tables have already been established to be remapped.
However, there is another way to conduct this experiment: follow the classic Linux approach and map the low memory into the high-memory (vmalloc) area, modifying the page table entry attributes in the process; for the implementation, refer to the kernel path __dma_alloc_remap() -> ioremap_page_range(). Since the problem was eventually located, this experiment was not pursued further.
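For reference, a minimal sketch of this idea is shown below. It uses vmap() with pgprot_noncached() to create an uncached alias of the low-memory pages in the vmalloc area; this is only an illustrative sketch under the assumption that the pages came from __get_free_pages(), not the exact kernel path named above.

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/*
 * Sketch: create a second, non-cacheable mapping of 2^order low-memory pages
 * in the vmalloc area. virt_svc is assumed to come from __get_free_pages().
 */
static void *map_uncached_alias(unsigned long virt_svc, unsigned int order)
{
        unsigned int i, nr = 1u << order;
        struct page **pages;
        void *virt_nc;

        pages = kmalloc(nr * sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return NULL;

        for (i = 0; i < nr; i++)
                pages[i] = virt_to_page(virt_svc + i * PAGE_SIZE);

        /* Same physical pages, but mapped with the C bit cleared. */
        virt_nc = vmap(pages, nr, VM_MAP, pgprot_noncached(PAGE_KERNEL));

        kfree(pages);
        return virt_nc;        /* release later with vunmap() */
}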
2) ioremap function
#define ioremap(cookie,size)          __arm_ioremap((cookie), (size), MT_DEVICE)
#define ioremap_nocache(cookie,size)  __arm_ioremap((cookie), (size), MT_DEVICE)
#define ioremap_cached(cookie,size)   __arm_ioremap((cookie), (size), MT_DEVICE_CACHED)
Note the mtype used by the ioremap functions above: it is MT_DEVICE, the device physical address range dedicated to drivers, i.e. a range on the CPU address bus that does not correspond to real DRAM. The low memory discussed here, by contrast, has mtype = MT_MEMORY.
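As a brief illustration, ioremap is meant for device register windows rather than DRAM; a hedged example (the base address and register offset below are placeholders, not a real device) might look like:

#include <linux/io.h>

/* Hypothetical device register window; DEMO_DEV_PHYS and DEMO_DEV_CTRL are
 * placeholders, not a real device on any particular board. */
#define DEMO_DEV_PHYS   0x10009000
#define DEMO_DEV_CTRL   0x00

static void demo_poke_device(void)
{
        void __iomem *regs = ioremap(DEMO_DEV_PHYS, 0x1000);   /* MT_DEVICE mapping */

        if (!regs)
                return;

        writel(0x1, regs + DEMO_DEV_CTRL);   /* uncached device access */
        iounmap(regs);
}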
2 mmap system call.
A user process uses mmap to map the memory requested by __get_free_pages into user space. The core problem described in this article is that the data in this memory gets tampered with after the user mapping is in use. For the standard Linux /dev/mem, the call chain is mmap -> mmap_mem:
When the device is opened with open("/dev/mem", O_RDWR|O_DSYNC, 0), the O_DSYNC flag causes vm_page_prot to be set to uncached/unbuffered.
Although we did not use the standard /dev/mem but implemented a driver ourselves, the design idea is the same: the driver builds the user-space page table for the physical address through remap_pfn_range, as sketched below.
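A minimal sketch of such a driver mmap handler is given below. It assumes phy_addr holds the physical address obtained in Step 2 and that the user calls mmap() with offset 0; it applies pgprot_noncached so that, like /dev/mem opened with O_DSYNC, the user mapping is built with the C bit cleared.

#include <linux/fs.h>
#include <linux/mm.h>

/* Assumption: phy_addr holds the physical address from virt_to_phys(virt_svc). */
static unsigned long phy_addr;

static int demo_mmap(struct file *filp, struct vm_area_struct *vma)
{
        size_t size = vma->vm_end - vma->vm_start;

        /* Build an uncached user mapping (C bit cleared), just as /dev/mem
         * does when it is opened with O_DSYNC. */
        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

        if (remap_pfn_range(vma, vma->vm_start, phy_addr >> PAGE_SHIFT,
                            size, vma->vm_page_prot))
                return -EAGAIN;

        return 0;
}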
Experiment 2: perform cache-invalidate and cache-disable experiments before and after mmap to verify whether remap_pfn_range creates nocache page table entries.
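As a reference point, a kernel-side cache-maintenance step in such an experiment could be sketched as follows. It assumes the ARM-specific helpers dmac_flush_range() and outer_flush_range() available in ARMv6-era kernels; virt_svc, phy_addr and len are the values from the problem construction.

#include <asm/cacheflush.h>

/* Clean + invalidate the L1 D-cache lines and the outer (L2) cache lines
 * covering the buffer, so any dirty cached data reaches DRAM before the
 * uncached user mapping touches it. */
static void demo_flush_buffer(void *virt_svc, unsigned long phy_addr, size_t len)
{
        dmac_flush_range(virt_svc, (char *)virt_svc + len);   /* L1 clean + invalidate */
        outer_flush_range(phy_addr, phy_addr + len);          /* L2 clean + invalidate */
}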
Conclusion: There is no problem with the mmap process.
After the above experiments, two doubts remain:
1) there is a problem with the user-space remap process;
2) there is a problem with the experimental method itself, that is, with the cache operations. So let us continue to explore the relationship between the CPU and the L1 and L2 caches. Before that analysis, the TLB process is briefly introduced.
【Analysis 3】CPU and TLB
1 Although the TLB is not the direct cause of the problem discussed here, it is an important part of the MMU process. When several cache-related experiments still failed to locate the problem, I suspected the TLB mechanism and did a simple experiment to rule it out, so the experimental process is briefly recorded here.
Analysis process: after several cache-related experiments (detailed below), I began to suspect that it was not a cache problem. Could it be that: