Cache
The ARM920T has a 16K data cache and a 16K instruction cache. The two caches have essentially the same structure; the data cache just has some additional mechanisms for writing data back to memory. Below we take the data cache as the example when introducing the basic principles of caches. We already know that the unit of storage in a cache is the cache line. On the ARM920T a cache line is 32 bytes, so the 16K cache consists of 512 cache lines. To understand the basic principles of caches, we start from the question of how to design one.
The simplest design idea is to divide the VA space into 32-byte units: any 32 consecutive bytes starting at a VA aligned to a 32-byte boundary (such as 0x00-0x1f, 0x20-0x3f, 0x40-0x5f, etc.) can be cached in any of the 512 cache lines. But then how do we know which VA the 32 bytes in a cache line came from? The VA must also be saved in the cache. Since the starting address of the 32 bytes is aligned to a 32-byte boundary, its last 5 bits are all 0, so only VA[31:5] needs to be saved. This is called the VA Tag. The Tag is the part of the VA that identifies the data in the cache line, telling us which VA the 32 bytes of data came from. A cache designed this way is called a fully associative cache, as shown in the following figure:
Figure 17. Fully Associative Cache
Given a VA, how do we find the corresponding data in the cache? First, compare VA[31:5] against the Tag of each cache line to find the matching line; then use VA[4:0] to select which of the 32 bytes in that line to access. Since there are 512 cache lines, if the VA is not cached at all we only find out after 512 comparisons. This is the worst case, and also a common one. Next we improve the cache design to solve this problem.
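To make the lookup concrete, here is a minimal C sketch of a fully associative lookup; the struct and names are illustrative rather than taken from the hardware, and the loop stands in for the parallel tag comparison that real hardware performs:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 512
#define LINE_SIZE 32

struct cache_line {
    bool     valid;              /* does this line hold data at all? */
    uint32_t tag;                /* VA[31:5] of the cached 32 bytes  */
    uint8_t  data[LINE_SIZE];
};

static struct cache_line cache[NUM_LINES];

/* Fully associative: the VA may be in any line, so in the worst case
 * (a miss) all 512 tags must be compared before we know. */
static bool cache_lookup(uint32_t va, uint8_t *out)
{
    uint32_t tag    = va >> 5;   /* VA[31:5] */
    uint32_t offset = va & 0x1f; /* VA[4:0]  */

    for (int i = 0; i < NUM_LINES; i++) {
        if (cache[i].valid && cache[i].tag == tag) {
            *out = cache[i].data[offset]; /* cache hit */
            return true;
        }
    }
    return false;                         /* cache miss */
}
```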
The defining characteristic of a fully associative cache is that any VA can be cached in any cache line, so when looking up a VA, all 512 cache lines may have to be searched. If instead each VA is restricted to exactly one cache line, the lookup becomes much faster: check the single cache line where this VA would be cached and compare the Tags. If they match, it is a cache hit; if not, it is a cache miss and physical memory can be accessed directly, with no other cache lines to examine. This design is called a direct-mapped cache, as shown in the following figure:
Figure 18. Direct-mapped Cache
Addresses 0~31 are cached in the 1st cache line, addresses 32~63 in the 2nd cache line, and so on, up to addresses 16352~16383 in the 512th cache line. The next address is 16384 (16K), and we wrap around to the beginning: addresses 16K~16K+31 are cached in the 1st cache line, addresses 16K+32~16K+63 in the 2nd cache line, and so on. We wrap around again at 32K: 32K~32K+31 go to the 1st cache line, 32K+32~32K+63 to the 2nd cache line, and so on. The rule should now be clear: given a VA, the remainder after dividing it by 16K determines which cache line it is cached in, and the quotient serves as the VA Tag, distinguishing whether the cached data comes from the 0, 16K, or 32K region. How are this quotient and remainder represented? VA[31:14] is the quotient of dividing by 16K and VA[13:0] is the remainder, which is why the Tag in the figure above is VA[31:14]. The remainder VA[13:0] is a byte offset within the 16K cache; since the cache is organized in 32-byte lines, the high bits VA[13:5] of the remainder select the cache line and the low bits VA[4:0] select the byte within the line. As a check, VA[13:5] is 9 bits, which indexes exactly 512 cache lines.
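The division by 16K is nothing more than bit slicing, since 16K = 2^14. Here is a small C sketch of the decomposition (the variable names are ours, and the printed values are easy to verify by hand):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t va = 16392;                  /* 16K + 8, from the text   */

    uint32_t tag    = va >> 14;           /* VA[31:14]: va / 16K      */
    uint32_t line   = (va >> 5) & 0x1ff;  /* VA[13:5]: cache line no. */
    uint32_t offset = va & 0x1f;          /* VA[4:0]: byte in line    */

    /* For va = 16392: tag = 1, line = 0, offset = 8. */
    printf("tag=%u line=%u offset=%u\n", tag, line, offset);
    return 0;
}
```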
Although a direct-mapped cache is fast to search, it has a drawback. Addresses 0~31, 16K~16K+31, and 32K~32K+31 are all cached in the first cache line. Suppose our program first accesses address 30: the data at addresses 0~31 is loaded from memory into the first cache line so that the next access will be faster. But then the program accesses address 32770, so the data at addresses 32K~32K+31 is loaded from memory into the first cache line, replacing the data from addresses 0~31. Then the program accesses address 16392... Used this way, the cache provides no speedup at all. This problem is called cache thrashing. A fully associative cache does not have this problem, because any VA can be cached in any cache line: several VAs accessed in turn can sit in different cache lines without conflict.
Fully associative and direct-mapped caches each have advantages and disadvantages: a fully associative cache is slow to search but has no thrashing problem, while a direct-mapped cache is exactly the opposite. To get better performance, real CPU cache designs compromise between the two, dividing all cache lines into several groups with n cache lines in each group; this is called an n-way set associative cache. The ARM920T uses a 64-way set associative cache, as shown in the following figure:
Figure 19. 64-way set-associative cache
With the previous two cache designs as background, this one should be easy to understand. The 512 cache lines are divided into 8 groups of 64 cache lines each. Addresses 0~31, 256~287, 512~543, and so on can be cached in any of the 64 cache lines of the first group; addresses 32~63, 288~319, 544~575, and so on can be cached in any of the 64 cache lines of the second group; and so forth. Why is a set-associative cache a compromise between fully associative and direct-mapped? If the groups grow until all cache lines are in one group, it becomes a fully associative cache; if the groups shrink until each group has only one cache line, it becomes a direct-mapped cache. As an exercise, work out why the VA Tag is VA[31:8] and why the group number is given by VA[7:5].
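As a sketch of the answer, here is the same illustrative lookup model as before, adapted to the 64-way organization; the loop again stands in for 64 parallel tag comparisons in hardware:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_GROUPS 8                    /* 512 lines / 64 ways */
#define NUM_WAYS   64
#define LINE_SIZE  32

struct cache_line {
    bool     valid;
    uint32_t tag;                       /* VA[31:8] */
    uint8_t  data[LINE_SIZE];
};

static struct cache_line cache[NUM_GROUPS][NUM_WAYS];

/* Set associative: the group number selects one group, and only the
 * 64 lines in that group need their tags compared. */
static bool cache_lookup(uint32_t va, uint8_t *out)
{
    uint32_t tag    = va >> 8;          /* VA[31:8]               */
    uint32_t group  = (va >> 5) & 0x7;  /* VA[7:5]: 1 of 8 groups */
    uint32_t offset = va & 0x1f;        /* VA[4:0]                */

    for (int way = 0; way < NUM_WAYS; way++) {
        if (cache[group][way].valid && cache[group][way].tag == tag) {
            *out = cache[group][way].data[offset];
            return true;                /* cache hit  */
        }
    }
    return false;                       /* cache miss */
}
```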
So why is the performance of a set-associative cache better than that of a direct-mapped cache? On the one hand, a set-associative cache spreads the conflicts on one cache line across 64 cache lines, a 64-fold positive effect. On the other hand, more VAs now map to the same group: in a direct-mapped cache, 4G/512 VAs conflict with each other on the same cache line, while in this set-associative cache, 4G/8 VAs conflict with each other in the same group of 64 cache lines. By this count the set-associative cache has a 64-fold negative effect. Don't the two effects cancel out exactly? I won't prove it mathematically, since that is not the focus of this section; readers can grasp it through a common-sense example. Consider two buildings with the same number of floors: one has a single elevator and three households per floor, the other has two elevators and six households per floor, and the average number of people per household is the same. In which building do you wait less time for the elevator?
Next, let's look at how the cache writes data back to memory. There are two write-back modes:
Write Back: When the data in the Cache Line is modified by the CPU core, it is not immediately written back to the memory. The data in the Cache Line and the memory will be temporarily inconsistent. There is a Dirty bit in the Cache Line to mark this situation. When a Cache Line is to be replaced by data from other VAs, if it is not Dirty, it will be replaced directly. If it is Dirty, it will be written back to the memory before replacement.
Write Through: Whenever the CPU core modifies the data in the cache line, it is immediately written back to the memory. The data in the cache line and the memory are always consistent. If multiple CPUs or devices access the memory at the same time, such as using dual-port RAM, it is very important that the data in the cache is consistent with the memory. In this case, the relevant memory pages are usually configured in Write Through mode.
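The difference between the two modes is visible in the store path. Here is a simplified sketch in the same illustrative cache model as above, with a Dirty bit added; write_memory() is our stand-in for the slow bus write:

```c
#include <stdbool.h>
#include <stdint.h>

struct cache_line {
    bool     valid;
    bool     dirty;                 /* meaningful only in Write Back mode */
    uint32_t tag;
    uint8_t  data[32];
};

static uint8_t memory[1u << 16];    /* stand-in physical memory */

static void write_memory(uint32_t pa, uint8_t byte)
{
    memory[pa] = byte;              /* the slow part in real hardware */
}

static void store_byte(struct cache_line *line, uint32_t pa,
                       uint8_t byte, bool write_through)
{
    line->data[pa & 0x1f] = byte;
    if (write_through)
        write_memory(pa, byte);     /* memory always kept consistent   */
    else
        line->dirty = true;         /* written back later, on eviction */
}

/* When a line is about to be replaced, a Dirty line must first be
 * written back; in Write Through mode a line is never Dirty. */
static void evict(struct cache_line *line, uint32_t line_base_pa)
{
    if (line->valid && line->dirty) {
        for (uint32_t i = 0; i < 32; i++)
            write_memory(line_base_pa + i, line->data[i]);
        line->dirty = false;
    }
    line->valid = false;
}
```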
By reading and writing the relevant registers of CP15, the following operations can be performed on the Cache:
Clean: Write the data in the Cache Line back to the memory and clear the Dirty bit. It is used at certain synchronization points in the program to ensure that the data in the Cache Line and the memory are consistent.
Invalidate: There is an Invalid bit in the Cache Line to indicate invalidity. If this bit is set to 1, the data will be read from the memory again when it is accessed next time even if the VA Tag matches. For example, when switching processes, it is necessary to declare that the data cached in the cache of the previous process is invalid.
Lock: Lock the data at a certain address in the cache to ensure that it is not replaced. In a real-time system, this can ensure that the data at a certain address can be accessed within a certain time.
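As a taste of what these operations look like in code, here is a sketch using GCC inline assembly, with the opcodes as given in the ARM920T manual; the wrapper names are ours, and the code must run in a privileged mode:

```c
#include <stdint.h>

/* Invalidate both the I cache and the D cache
 * (CRn=c7, CRm=c7, opcode2=0; the register value is ignored). */
static inline void cache_invalidate_all(void)
{
    uint32_t zero = 0;
    __asm__ volatile("mcr p15, 0, %0, c7, c7, 0" : : "r"(zero) : "memory");
}

/* Clean (write back) the D cache line containing the given VA
 * (CRn=c7, CRm=c10, opcode2=1). */
static inline void dcache_clean_line(void *va)
{
    __asm__ volatile("mcr p15, 0, %0, c7, c10, 1" : : "r"(va) : "memory");
}

/* Drain the write buffer (CRn=c7, CRm=c10, opcode2=4). */
static inline void drain_write_buffer(void)
{
    uint32_t zero = 0;
    __asm__ volatile("mcr p15, 0, %0, c7, c10, 4" : : "r"(zero) : "memory");
}
```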
The cache is searched by VA, but data is written back to memory by PA. It would be inefficient to walk the page tables again when writing data back, so each cache line actually also stores PA[31:5] (the PA Tag). The complete cache structure is shown in the following figure:
Figure 20. PA Tag
Finally, let's solve a problem we left behind: What do the C and B bits in the page descriptor mean?
Table 2. Meaning of the C and B bits in the page descriptor
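C bit   B bit   Meaning
0       0       Cache disabled, Write Buffer disabled
0       1       Cache disabled, Write Buffer enabled
1       0       Cache enabled, Write Through mode
1       1       Cache enabled, Write Back mode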
When the C bit is 1, it means that Cache is allowed. In this case, the B bit is used to indicate Write Through or Write Back. Some pages do not allow Cache, and the C bit is set to 0. In this case, the B bit can be used to select whether to allow the use of Write Buffer. Write Buffer is also a simple cache. When the CPU core executes a write instruction, it can pass the data to the Write Buffer, and then the Write Buffer is responsible for writing it back to the memory. At this time, the CPU can execute subsequent instructions without having to wait for the slower operation of writing back to the memory to complete. Think about it, since there is a Write Buffer, why is there no Read Buffer?
ARM920T's CP15 coprocessor
The MMU and Cache of the ARM920T are both integrated in the CP15 coprocessor, and the two are closely related. This section first gives an overview of how the MMU, Cache, and CPU core work together; the following two sections explain the details of the MMU and the Cache respectively. Samsung's S3C2410 is a very common chip built around the ARM920T, so when a specific chip is needed as an example, we use the S3C2410.
The following is a list of the CP15 coprocessor registers (from the [S3C2410 User Manual]). Like the CPU core's registers r0 through r15, the coprocessor registers are numbered 0 through 15, and 4 bits in the instruction encode the register number. Some coprocessor registers have shadow registers: reading or writing the same register number with different options actually accesses different registers. The function of each register will be explained in detail when it is used later.
Table 1. CP15 coprocessor register list
The CP15 coprocessor is operated through two coprocessor instructions, mcr and mrc. The mnemonics are best read from back to front: mcr transfers data from r (a CPU core register) to c (a coprocessor register), while mrc transfers data from c (a coprocessor register) to r (a CPU core register). All operations on the CP15 coprocessor are carried out by exchanging data between CPU core registers and CP15 registers. The following figure shows the coprocessor instruction format (excerpted from the [S3C2410 User Manual]).
Figure 8. Coprocessor instruction format
As in other ARM instructions, Cond is the condition code. Bit 20 is the L bit, indicating whether the instruction reads or writes: L=1 means Load, reading from outside into the CPU core, i.e. the mrc instruction; L=0 means Store, i.e. the mcr instruction. Bits [11:8] are the coprocessor number; CP15's number is 15, so all four bits are 1. CRn is the CP15 register number and Rd is the CPU core register number, each occupying 4 bits. For the CP15 coprocessor, opcode1 is always 0, while opcode2 and CRm are options whose meaning depends on the particular register.
Although the register numbers and related instructions of the coprocessor are introduced here, readers only need to understand that the coprocessor is operated in this way. Our focus is on explaining the basic concepts of MMU and Cache. For specific instructions on how to write various operations, please refer to the [S3C2410 User Manual].
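As one concrete example (a sketch using GCC inline assembly; the function name is ours), reading CP15 register 0, the ID register, into a CPU core register looks like this:

```c
#include <stdint.h>

/* mrc: move from coprocessor register c0 (the ID register) to a CPU
 * core register; opcode1=0, CRm=c0, opcode2=0. Privileged mode only. */
static inline uint32_t read_cp15_id(void)
{
    uint32_t id;
    __asm__ volatile("mrc p15, 0, %0, c0, c0, 0" : "=r"(id));
    return id;
}
```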
How does the MMU map a VA to a PA? From Figure 4 "Process address space is independent", it looks as if there were a single VA-to-PA table, and the PA could be found by looking up the VA. In fact it is not that simple: there is usually a multi-level table lookup. For the ARM architecture it is a two-level lookup; some 64-bit architectures require more levels. See the figure below.
Figure 9. Translation Table Walk
First, the 32-bit VA is divided into three segments: the first two segments, [31:20] and [19:12], serve as indexes for the two table lookups, and the third segment, [11:0], is the offset within the page. The steps of the table lookup are as follows:
1. The TTB register of the CP15 coprocessor (look in Table 1 "CP15 coprocessor register list": which register is it?) holds the base address of the first-level page table (Translation Table). This base address is a PA; in other words, the page table sits at this address directly in physical memory.
2. Using the content of TTB as the base address and VA[31:20] as the index, find an entry in the first-level table (think: how many entries does this table have?). This entry holds the base address of a second-level page table (Coarse Page Table), which is again a physical address, so the second-level page table also sits directly in physical memory.
3. Using VA[19:12] as the index, find an entry in the second-level page table (think: how many entries does this table have?). This entry holds the base address of a physical page. We said earlier that virtual memory management is page-based, with each virtual page mapped to a physical page frame; that is confirmed here, since the translation is looked up per page.
4. Knowing the base address of the physical page, add the offset VA[11:0] to access the data at the corresponding address (think: how many bytes is a page?).
This process is called a Translation Table Walk; the word "walk" is quite vivid. Walking from TTB to the first-level page table, then to the second-level page table, then to the physical page, a single address translation actually takes three accesses to physical memory. Note that the walk is done entirely in hardware: every time the CPU issues an address, the MMU automatically performs the four steps above, with no instructions needed to drive it. The precondition is that the operating system keeps the page tables correct: filling in the relevant page table entries whenever memory is allocated, clearing them whenever memory is freed, and allocating or freeing whole page tables when necessary.
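In software terms, the walk is just bit slicing plus two dependent memory reads. Below is an illustrative C model of the walk; it ignores descriptor type and permission bits, read_phys() is our stand-in for a raw physical-memory read, and the masks follow the ARM first-level/coarse-table layout:

```c
#include <stdint.h>

/* Stand-in for reading a 32-bit word from physical memory; assumes an
 * identity mapping so a PA can be dereferenced directly. Illustrative. */
static uint32_t read_phys(uint32_t pa)
{
    return *(volatile uint32_t *)(uintptr_t)pa;
}

static uint32_t translate(uint32_t ttb, uint32_t va)
{
    /* Step 2: VA[31:20] indexes the first-level table
     * (12 index bits, so 4096 entries of 4 bytes each). */
    uint32_t l1 = read_phys((ttb & 0xffffc000u) | ((va >> 20) << 2));

    /* Step 3: VA[19:12] indexes the coarse second-level table
     * (8 index bits, so 256 entries of 4 bytes each). */
    uint32_t l2 = read_phys((l1 & 0xfffffc00u) |
                            (((va >> 12) & 0xffu) << 2));

    /* Step 4: the entry holds the 4K page base; VA[11:0] is the
     * offset within the page (so a page is 4096 bytes). */
    return (l2 & 0xfffff000u) | (va & 0xfffu);
}
```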
With the above basic concepts, let's look at the hardware operation sequence when the CPU accesses memory (excerpted from the [ARM Reference Manual]).
Figure 10. Hardware operation sequence when the CPU accesses memory
Let's take the CPU reading memory as an example to explain the steps in the figure. Each step has a corresponding label in the figure.
1. The CPU core (the "ARM" box in the figure) issues the VA of the data to be read, and the TLB (Translation Lookaside Buffer) receives the address. The TLB is a small, fast cache inside the MMU that holds the page table entries of recently translated VAs. If the TLB holds the entry for the current VA, there is no need to do a Translation Table Walk; otherwise the entry is read from physical memory and saved in the TLB. The TLB thus reduces the number of accesses to physical memory.
2. A page table entry stores not only the base address of the physical page but also permission bits and flags saying whether caching is allowed. The MMU first checks the permission bits; if the access is not permitted, an exception is raised to the CPU core. It then checks whether caching is allowed; if so, the access goes through the Cache. The "C, B bits" in the figure can be understood as select lines; the role of these two bits is explained in detail later.
3. If caching is not allowed, the PA is issued directly and the data is read from physical memory into the CPU core.
4. If caching is allowed, the VA is used as the index to search the Cache for the requested data. If the data is already cached (a Cache Hit), it is returned to the CPU core directly. If it is not cached (a Cache Miss), the PA is issued, the data is read from physical memory, cached in the Cache, and returned to the CPU core. Note that the Cache does not fetch only the bytes the CPU core asked for; it fetches a whole block of adjacent bytes, called a Cache Line. The ARM920T's cache line is 32 bytes: for example, if the CPU core wants the 4 bytes at addresses 0x134-0x137, the Cache fetches all 32 bytes at 0x120-0x13f (aligned to a 32-byte boundary).
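The 32-byte alignment in that example is just masking off the low 5 bits of the address; a quick check in C:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t va = 0x134;
    uint32_t line_start = va & ~0x1fu;   /* align down to 32 bytes */
    uint32_t line_end   = line_start + 31;

    printf("0x%x-0x%x\n", line_start, line_end);  /* prints 0x120-0x13f */
    return 0;
}
```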