【Summary】
【Reason for Writing】
【Problem Construction】
【Analysis 1】Overall Process
【Analysis 2】get_free_pages and mmap
【Analysis 3】CPU and TLB
【Analysis 4】CPU and L1 Cache
【Analysis 5】CPU and L2 Cache
【Conclusion】
【Summary】
Whether the architecture is ARM, PowerPC, MIPS, or x86, speeding up memory access is one of the most important ways for a CPU to improve its performance, and the cache exists for exactly this reason. Likewise, whether the operating system is Linux or Windows, making full use of CPU features is a good way to improve system performance; the page table switch performed during a process context switch is a perfect example of how Linux exploits the MMU. Performance is an eternal theme of software, and understanding the MMU is crucial for software developers. So how do memory and cache cooperate? What are the CPU's L1 and L2 caches? What are the advantages and disadvantages of fully associative and set associative caches, and how are they used? This article uses a common cache problem to analyze how Linux supports the MMU mechanism of the ARM architecture.
【Reason for Writing】
1 To make it easier for me to review these points in the future, and hopefully to provide some reference for readers who want to understand the cache.
2 Cache problems are quite common, and using the cache properly can improve software performance, so it is important to understand the relationship between the CPU and the cache.
【Problem Construction】
First, we construct a common cache problem:
Step 1: Allocate a block of memory in kernel mode:
virt_svc = __get_free_pages(GFP_KERNEL | GFP_DMA, order);
Notes:
1) Interfaces such as kmalloc/alloc_pages can also be used for the allocation;
2) The virt_svc obtained here is a kernel-space linear address.
Step 2: In kernel mode, convert the virt_svc obtained in Step 1 into a physical address: phy_addr = virt_to_phys(virt_svc);
Step 3: A user-space process maps the physical address obtained in Step 2 into its own address space through mmap.
fd = open("/dev/mem", O_RDWR|O_DSYNC, 0);
virt_user = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, phy_addr);
mmap maps the physical memory phy_addr allocated in Step 2 to the user-space address virt_user; the length of the mapped region is len.
Note:
1) The virt_user address is uncached.
Reason: open("/dev/mem", O_RDWR|O_DSYNC, 0) specifies the O_DSYNC flag, so the mem character driver clears the C bit of the page table entries when it builds the page table for virt_user. See the kernel code drivers/char/mem.c; this is not repeated here.
Step 4: Write 0x12121212 to the memory through the virt_svc obtained in Step 1; then write 0x34343434 through the virt_user obtained in Step 3. When the memory is then tracked through virt_user, its contents are very likely found to have been modified.
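For readability, below is a minimal sketch of the user-space side of the construction above (Steps 3 and 4). The values of phy_addr and len are placeholders: in the real scenario phy_addr is the value returned by virt_to_phys() in Step 2, and the kernel-side write through virt_svc happens in a driver.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    off_t  phy_addr = 0x30000000;   /* placeholder: value from virt_to_phys() in Step 2 */
    size_t len      = 4096;         /* placeholder mapping length */

    int fd = open("/dev/mem", O_RDWR | O_DSYNC, 0);
    if (fd < 0)
        return 1;

    /* virt_user is an uncached alias of the physical page that the kernel
     * already maps (cached, write-back) at virt_svc. */
    volatile uint32_t *virt_user = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                        MAP_SHARED, fd, phy_addr);
    if (virt_user == MAP_FAILED)
        return 1;

    virt_user[0] = 0x34343434;      /* Step 4: write through the uncached mapping */
    printf("virt_user[0] = 0x%08x\n", (unsigned)virt_user[0]);

    munmap((void *)virt_user, len);
    close(fd);
    return 0;
}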
The above constructs a very typical cache problem, and its cause is actually quite simple; here it only serves to introduce the topic of this article. In the following, we will introduce the CPU and cache in the ARM architecture step by step, in the form of an analysis of this problem.
【Analysis 1】Overall Process
To know how the data in memory gets changed, we must first understand how the CPU accesses data in memory. This article discusses the ARMv6 architecture. The following figure illustrates the process by which the CPU accesses memory, and the experiments are also based on it:
Notes:
1. When the CPU issues a virtual address request, the first step is address translation, which involves what we often call the TLB (translation lookaside buffer). The TLB is in fact also a cache, but it differs from the L1 and L2 caches: it is the area the MMU uses to cache page table entries. Readers are assumed to have some understanding of the paging mechanism, so it is not elaborated here; for more information, refer to the blog post "How does the Linux kernel implement the paging mechanism?".
In fact, not understanding this process does not hinder understanding of this article. You can simply assume that through it the CPU obtains the page table entry, and thus knows the physical base address corresponding to the VA as well as the C/B bits, which will be introduced later. The C/B bits are very important and directly determine how the L1 and L2 caches are accessed.
2. Whether the CPU accesses the L1 and L2 caches is determined by the C/B bits mentioned above. This is critical and will be discussed in the subsequent analysis. For now, assume that the CPU does access the cache; how it does so will be explained in detail later. In short, the CPU obtains the data from the cache.
3. If the C/B bits of the page table entry are configured as non-cacheable, or the cache access misses, the CPU issues a physical memory access directly.
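As a point of reference, the hardware view of these bits can be sketched as follows. This assumes the ARMv6 short-descriptor second-level small-page format (B at bit 2, C at bit 3, page base at bits [31:12]); the exact layout should be checked against the ARM Architecture Reference Manual for the part in use.

#include <stdint.h>
#include <stdio.h>

/* Sketch: extracting the C/B bits from an ARMv6 second-level small-page
 * descriptor. Bit 2 is B (bufferable), bit 3 is C (cacheable). */
#define PTE_B          (1u << 2)
#define PTE_C          (1u << 3)
#define PTE_BASE_MASK  0xFFFFF000u

int main(void)
{
    uint32_t pte = 0x8000045Eu;   /* example descriptor value */

    printf("page base = 0x%08x, C = %u, B = %u\n",
           pte & PTE_BASE_MASK,
           (pte & PTE_C) ? 1u : 0u,
           (pte & PTE_B) ? 1u : 0u);
    return 0;
}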
The following is an introduction to the cache-related software implementation in the Linux kernel in the form of problem analysis.
【Analysis 2】get_free_pages and mmap
1 __get_free_pages(GFP_KERNEL | GFP_DMA, order)
1. First of all, it is important to clarify where the memory used by the software comes from; this is crucial to analyzing the problem. __get_free_pages is a common way to request memory in Linux. Do you know where the memory it returns comes from and what characteristics it has?
Look at the two flags GFP_KERNEL | GFP_DMA. Linux usually defines zone types such as ZONE_DMA/ZONE_NORMAL/ZONE_HIGHMEM:
1) ZONE_DMA: mainly kept for compatibility with early devices that can only perform DMA within 0-16 MB; many platforms do not use it.
2) ZONE_HIGHMEM: generally used on 32-bit systems with more than 896 MB of memory, where the kernel's linear address range is not large enough to map everything. The S2 platform does not use it either.
3) ZONE_NORMAL: ordinary low memory, used by most platforms. This function actually requests page frames from ZONE_NORMAL, which corresponds to the kernel's low-memory area.
2. For how the page table entries of low memory are configured during Linux startup, refer to the blog post "How the Linux kernel implements the paging mechanism".
In short, the Linux kernel defines the initial page table entry attributes in mem_types:
prot_pte = L_PTE_PRESENT | L_PTE_YOUNG | L_PTE_DIRTY
In fact, during kernel startup, build_mem_type_table() further adjusts these attributes according to the ARM variant. On S2, the low-memory cache is configured as write-back (write-allocate):
#define L_PTE_MT_WRITEALLOC (_AT(pteval_t, 0x07) << 2)
The following conclusions are drawn from the above two points:
1) For the memory requested by __get_free_pages, the corresponding page table entries in the kernel linear address space have the write-back attribute.
Experiment: in kernel mode, try to change the page table entry attributes of the memory requested by __get_free_pages to uncached (implemented via ioremap_page_range):
Experimental conclusion: in theory this should solve the problem, but the experiment still fails. Why? Because the Linux kernel does not allow low memory whose page tables have already been established to be remapped.
However, there is another way to conduct this experiment: follow the classic Linux approach and map the low memory into the high-memory (vmalloc) area, modifying the page table entry attributes in the process; for the implementation, refer to the kernel path __dma_alloc_remap() -> ioremap_page_range(). Since the problem was eventually located, this experiment was not pursued further.
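For reference, a minimal sketch of this idea is shown below. It uses vmap() with pgprot_noncached() to create an uncached alias of the low-memory pages in the vmalloc area; this is only an illustrative sketch under the assumption that the pages came from __get_free_pages(), not the exact kernel path named above.

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/*
 * Sketch: create a second, non-cacheable mapping of 2^order low-memory pages
 * in the vmalloc area. virt_svc is assumed to come from __get_free_pages().
 */
static void *map_uncached_alias(unsigned long virt_svc, unsigned int order)
{
        unsigned int i, nr = 1u << order;
        struct page **pages;
        void *virt_nc;

        pages = kmalloc(nr * sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return NULL;

        for (i = 0; i < nr; i++)
                pages[i] = virt_to_page(virt_svc + i * PAGE_SIZE);

        /* Same physical pages, but mapped with the C bit cleared. */
        virt_nc = vmap(pages, nr, VM_MAP, pgprot_noncached(PAGE_KERNEL));

        kfree(pages);
        return virt_nc;        /* release later with vunmap() */
}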
2) ioremap function
#define ioremap(cookie,size)          __arm_ioremap((cookie), (size), MT_DEVICE)
#define ioremap_nocache(cookie,size)  __arm_ioremap((cookie), (size), MT_DEVICE)
#define ioremap_cached(cookie,size)   __arm_ioremap((cookie), (size), MT_DEVICE_CACHED)
Note the mtype used by the ioremap functions above: it is MT_DEVICE, the device physical address range dedicated to drivers, i.e. a range on the CPU address bus that does not correspond to real DRAM. The low memory discussed here, by contrast, has mtype = MT_MEMORY.
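As a brief illustration, ioremap is meant for device register windows rather than DRAM; a hedged example (the base address and register offset below are placeholders, not a real device) might look like:

#include <linux/io.h>

/* Hypothetical device register window; DEMO_DEV_PHYS and DEMO_DEV_CTRL are
 * placeholders, not a real device on any particular board. */
#define DEMO_DEV_PHYS   0x10009000
#define DEMO_DEV_CTRL   0x00

static void demo_poke_device(void)
{
        void __iomem *regs = ioremap(DEMO_DEV_PHYS, 0x1000);   /* MT_DEVICE mapping */

        if (!regs)
                return;

        writel(0x1, regs + DEMO_DEV_CTRL);   /* uncached device access */
        iounmap(regs);
}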
2 mmap system call.
A user process uses mmap to map the memory requested by __get_free_pages into user space. The core problem described in this article is that the data in this memory gets tampered with after the user mapping is in use. For the standard Linux /dev/mem, the call chain is mmap -> mmap_mem:
When the device is opened with open("/dev/mem", O_RDWR|O_DSYNC, 0), the O_DSYNC flag causes vm_page_prot to be set to uncached/unbuffered.
Although we did not use the standard /dev/mem but implemented a driver ourselves, the design idea is the same: the driver builds the user-space page table for the physical address through remap_pfn_range, as sketched below.
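A minimal sketch of such a driver mmap handler is given below. It assumes phy_addr holds the physical address obtained in Step 2 and that the user calls mmap() with offset 0; it applies pgprot_noncached so that, like /dev/mem opened with O_DSYNC, the user mapping is built with the C bit cleared.

#include <linux/fs.h>
#include <linux/mm.h>

/* Assumption: phy_addr holds the physical address from virt_to_phys(virt_svc). */
static unsigned long phy_addr;

static int demo_mmap(struct file *filp, struct vm_area_struct *vma)
{
        size_t size = vma->vm_end - vma->vm_start;

        /* Build an uncached user mapping (C bit cleared), just as /dev/mem
         * does when it is opened with O_DSYNC. */
        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

        if (remap_pfn_range(vma, vma->vm_start, phy_addr >> PAGE_SHIFT,
                            size, vma->vm_page_prot))
                return -EAGAIN;

        return 0;
}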
Experiment 2: perform cache-invalidate and cache-disable experiments before and after mmap to verify whether remap_pfn_range creates nocache page table entries.
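As a reference point, a kernel-side cache-maintenance step in such an experiment could be sketched as follows. It assumes the ARM-specific helpers dmac_flush_range() and outer_flush_range() available in ARMv6-era kernels; virt_svc, phy_addr and len are the values from the problem construction.

#include <asm/cacheflush.h>

/* Clean + invalidate the L1 D-cache lines and the outer (L2) cache lines
 * covering the buffer, so any dirty cached data reaches DRAM before the
 * uncached user mapping touches it. */
static void demo_flush_buffer(void *virt_svc, unsigned long phy_addr, size_t len)
{
        dmac_flush_range(virt_svc, (char *)virt_svc + len);   /* L1 clean + invalidate */
        outer_flush_range(phy_addr, phy_addr + len);          /* L2 clean + invalidate */
}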
Conclusion: There is no problem with the mmap process.
After the above experiments, two doubts remain:
1) there is a problem with the user-space remap process;
2) there is a problem with the experimental method itself, that is, with the cache operations. So let us continue to explore the relationship between the CPU and the L1 and L2 caches. Before that analysis, the TLB process is briefly introduced.
【Analysis 3】CPU and TLB
1 Although the TLB is not the direct cause of the problem discussed here, it is an important part of the MMU process. When several cache-related experiments still failed to locate the problem, I suspected the TLB mechanism and did a simple experiment to rule it out, so the experimental process is briefly recorded here.
Analysis process: after several cache-related experiments (detailed below), I began to suspect that it was not a cache problem. Could it be that: