This blog post introduces the basic ideas of virtualization, how virtualization is done on the ARM platform, and the hardware features that ARM provides for it.
Introduction to Virtualization Technology
Virtualization Technology
Virtualization is first of all a concept: broadly, anything used to simulate something else can be called virtualization. Some restaurants even use tofu to imitate the taste of meat, and in that loose sense this too is virtualization. Here, however, we are concerned with virtualization in the computer field, which we define as: simulating a single physical device as multiple isolated virtual devices while preserving the efficiency of those virtual devices. The definition itself encodes the two requirements of virtualization: isolation and efficiency. The hypervisor we often talk about, also called the VMM (virtual machine monitor) in some books, is software that runs directly on the physical hardware. Its job is to manage the hardware so that the physical resources (CPU, memory, peripherals, and so on) can be shared among different virtual machines. Because the hypervisor deals with the physical hardware directly, it must run in privileged mode. Before virtualization extensions existed, the guest OS and guest applications could only run in de-privileged mode, as shown in the figure below.
Popek and Goldberg, in a classic paper on virtualization, divide the instructions relevant to privileged-mode execution into two categories:
Sensitive instructions: instructions that attempt to change the configuration of system resources, or whose results depend on the state of the system.
Privileged instructions: instructions that trap (raise an exception that transfers control to the exception vector) when executed in non-privileged mode, but execute normally in privileged mode.
Popek and Goldberg's requirement for building a hypervisor is that the sensitive instructions be a subset of the privileged instructions. An architecture that meets this criterion is said to be classically virtualizable. Virtualization is still possible when the requirement is not met (via binary translation, introduced later), but it is much easier to implement when it is. The following introduces the existing virtualization techniques:
Full virtualization (pure virtualization): Full virtualization requires that the hardware architecture be virtualizable (i.e., conform to the classical virtualization model). When a trap enters the hypervisor, the hypervisor emulates the execution of the sensitive instruction; this technique is also called trap-and-emulate. When a guest OS wants to access a privileged resource (a physical peripheral), a trap is generated that wakes up the hypervisor; the hypervisor emulates the access and then returns to the next instruction of the guest OS to continue execution. In the figure below, the red arrow indicates a trap. Each privileged instruction requires many instructions to emulate, so the trap-and-emulate overhead is high and has a significant impact on system performance.
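The trap-and-emulate loop described above can be sketched as a toy model. This is only an illustration of the control flow, not a real hypervisor; the instruction names (`write_ttbr`, `mask_irq`) and state fields are hypothetical.

```python
# Toy trap-and-emulate model: the "CPU" runs guest instructions in
# de-privileged mode; sensitive instructions raise a Trap, which the
# hypervisor catches, emulates against per-VM virtual state, and then
# resumes the guest at the next instruction.

class Trap(Exception):
    def __init__(self, insn):
        self.insn = insn

SENSITIVE = {"write_ttbr", "mask_irq"}   # hypothetical sensitive instructions

def guest_execute(insn, guest_regs):
    if insn[0] in SENSITIVE:             # de-privileged mode: must trap
        raise Trap(insn)
    op, dst, val = insn                  # ordinary instruction runs natively
    guest_regs[dst] = val

def hypervisor_emulate(insn, vm_state):
    op = insn[0]
    if op == "write_ttbr":               # emulate against virtual state,
        vm_state["ttbr"] = insn[1]       # never the real hardware register
    elif op == "mask_irq":
        vm_state["irq_masked"] = True

def run(program, guest_regs, vm_state):
    for insn in program:
        try:
            guest_execute(insn, guest_regs)
        except Trap as t:                # the "red arrow" in the figure
            hypervisor_emulate(t.insn, vm_state)
            # fall through: resume at the next guest instruction

regs, vm = {}, {}
run([("mov", "r0", 1), ("write_ttbr", 0x8000), ("mask_irq",)], regs, vm)
```

Note how the one-line `write_ttbr` costs a trap plus a whole emulation routine, which is exactly where the overhead discussed above comes from.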
Binary rewriting: Binary rewriting is used when the hardware architecture cannot be virtualized (does not conform to the classical virtualization model). It comes in static and dynamic variants. Static binary rewriting scans the ELF file and replaces every sensitive instruction with a trap instruction (a system-call-like instruction), or simulates the sensitive instruction with a sequence of non-sensitive instructions. Dynamic binary rewriting handles sensitive instructions the same way, but analyzes instructions one by one at run time. The dynamic approach is in fact worse, because every instruction, sensitive or not, must be analyzed to find out, which is very time-consuming. The static approach performs better at run time, but inexplicable errors often occur, because the run-time state is complex and static patching cannot anticipate every situation. The process is shown in the figure below.
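The static variant can be sketched over a symbolic instruction list rather than a real ELF file. The opcodes here are hypothetical; the point is only the ahead-of-time replacement of sensitive instructions with traps.

```python
# Static binary rewriting, sketched: scan the code once ahead of time
# and replace each sensitive instruction with a trap instruction that
# carries enough information for the hypervisor to emulate the original.
# A dynamic rewriter would run the same check per instruction at run time.

SENSITIVE = {"msr", "cps"}               # assumed sensitive opcodes

def rewrite_static(code):
    out = []
    for insn in code:
        if insn[0] in SENSITIVE:
            out.append(("trap", insn))   # trap carries the original insn
        else:
            out.append(insn)             # non-sensitive: left untouched
    return out

original = [("add", "r0", "r1"), ("msr", "cpsr", "r0"), ("b", "loop")]
patched = rewrite_static(original)
# patched == [("add", "r0", "r1"), ("trap", ("msr", "cpsr", "r0")), ("b", "loop")]
```

The sketch also shows why static rewriting is fragile: it only sees the instructions it can find by scanning, so code reached through computed jumps or generated at run time can escape the rewrite.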
Para-virtualization: In many books this technique is translated as "semi-virtualization", which is inaccurate. Para-virtualization is a long-standing technique that virtualizes only some of the peripherals, enough to provide an execution environment for certain specialized software, rather than running every piece of software that could run on a physical machine. Readers with questions about this can consult Section 1.3 of "System Virtualization: Principles and Implementation", written by the Intel Open Source Technology Center and the Parallel Processing Institute of Fudan University; a full explanation also requires understanding the various virtualization holes. Simply put, para-virtualization modifies the source code of the guest OS (at the API level) so that the guest OS avoids the instructions that are hard to virtualize (the virtualization holes). An operating system normally uses all the facilities the processor provides: privilege levels, address spaces, control registers, and so on. The first problem para-virtualization must solve is how to enter the VMM. The typical approach is to modify the relevant guest OS code so that the OS voluntarily gives up its privilege level and runs at the next lower level; when the guest OS then tries to execute a privileged instruction, a protection exception is triggered, giving the VMM an interception point at which to emulate (the hypercall method, introduced below, can also be used). Since the kernel code has to be modified anyway, para-virtualization can go further and optimize I/O: instead of emulating real-world devices, whose many registers are costly to emulate, it can define highly optimized custom I/O protocols that reach speeds close to a physical machine.
In fact, the para-virtualization used by OKL4 modifies the API the hypervisor presents to the guest OS (making it different from the underlying hardware's), and at the same time modifies the guest OS source code, replacing the sensitive instructions with hypercalls (calls into the hypervisor). The figure below shows that under full virtualization the hardware API and the hypervisor API are identical, while under para-virtualization they differ.
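The source-level substitution can be sketched as follows. The hypercall number `set_page_table_base` and the `Hypervisor` class are illustrative inventions; a real hypercall would be a trapping instruction (such as `hvc` on ARM), not a Python method call.

```python
# Para-virtualization sketch: instead of executing a sensitive
# instruction and relying on a trap, the patched guest OS invokes an
# explicit, well-defined API exposed by the hypervisor (a hypercall).

class Hypervisor:
    def __init__(self):
        self.vm_state = {"ttbr": None}   # per-VM virtual machine state

    def hypercall(self, nr, *args):      # the single guest -> VMM entry point
        if nr == "set_page_table_base":
            self.vm_state["ttbr"] = args[0]
            return 0
        return -1                        # unknown hypercall number

hv = Hypervisor()

# A native guest would execute a privileged instruction here, e.g.:
#     write_ttbr(0x4000)
# The para-virtualized guest is patched at the source level to do:
ret = hv.hypercall("set_page_table_base", 0x4000)
```

Because the request arrives as a high-level operation rather than a raw register write, the hypervisor can service it in one step instead of decoding and emulating an instruction, which is the performance argument made below.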
Comparison of virtualization technologies
Pure virtualization and binary rewriting
Neither full virtualization nor binary rewriting modifies the machine API, so any guest OS can run unmodified in the virtualized environment. However, because every privileged instruction causes a trap, executing privileged instructions in a virtual environment is much more expensive than on native hardware. Back when neither x86 nor ARM met the requirements of classical virtualization, VMware used binary rewriting to implement virtualization on x86; after optimization the performance overhead was below 10%, but the technique was very complex. That implementation complexity increases the amount of code running in privileged mode, which enlarges the attack surface and raises the probability of bugs in the hypervisor, reducing the security and isolation of the whole system.
Para-virtualization
Although "para-virtualization" sounds like a new term, it was coined by the Denali virtual machine monitor in 2002, while the underlying design concept appeared as early as the IBM CMS system in 1970, where the DIAG instruction was used to call into the hypervisor. Many research systems still use this approach, for example Mach, Xen and L4.
Para-virtualization can deliver better performance than full virtualization because the guest uses the hypervisor's APIs directly instead of going through the trap -> decode -> emulate path. Of course, as I mentioned in previous posts, it has a drawback: the guest OS source code must be modified to use the new API. This is not only a heavy task, but for non-open-source operating systems other methods must be adopted, unless the vendors of those operating systems are willing to cooperate.
Virtual memory in virtualization environment
Why discuss memory management separately? Because this part is genuinely complicated. What we discussed above mainly concerned CPU operation, such as instruction handling and switching between modes. For memory, we first review virtual memory management without a guest OS, and then look at what changes once a guest OS is introduced.
Virtual memory management covers a lot of ground. Here we will not discuss memory allocation algorithms or how to reduce the page-fault rate, but only how a virtual address is translated to a physical address. The ARM architecture uses the MMU plus the TLB to translate a VA (virtual address) to a PA (physical address), and the page-table walk is performed automatically by the hardware (when there is no page fault). With virtualization, however, this translation becomes more complicated: the guest page table no longer maps VA to PA, but only guest VA to guest PA, and the hypervisor completes the translation from guest PA to the actual physical address. The process is shown in the figure below.
The diagram is clear, but implementing it is hard: there is only one page-table base register, so the hardware cannot tell whether it is translating guest VA to guest PA or VA to PA. Without hardware support, this can only be implemented with shadow page tables. The idea of a shadow page table is to collapse the two-step translation (guest VA -> guest PA -> PA) into a single step, with a hash table used to keep the shadow entries in sync with the guest's. While the shadow page table is being constructed, every guest access to its page table must trap, and the hypervisor translates the guest PA into the actual physical address. Readers who want to understand this in depth should study KVM's old shadow-page-table implementation (thanks to x86 hardware support, KVM has since abandoned shadow page tables). We cannot explore shadow page tables fully here, but knowing what they are, we can reason about their performance. It is necessarily poor: every access to the guest page table traps, and every modification of the guest page table must be synchronized into the shadow page table. Hashing speeds this up, but the gap to a native environment remains large (in one NOVA experiment, page-table accesses alone lost about 23% performance), and the implementation is very complex. Both Intel and ARM now provide hardware support for this, with the hardware performing the two-stage page-table translation described here. Thanks to locality of reference, when every access hits the TLB, two-stage translation costs about the same as single-stage translation; but on a TLB miss, walking the two-stage tables is considerably more expensive, even though the walk is done in hardware.
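The two-stage translation itself can be modelled with page-granule lookup tables. The addresses, mappings, and 4 KiB page size here are all illustrative; real hardware walks multi-level tables rather than flat dictionaries.

```python
# Two-stage address translation, modelled at page granularity:
# stage 1 (the guest's page table) maps guest VA page -> guest PA page,
# stage 2 (hypervisor-controlled) maps guest PA page -> host PA page.

PAGE = 0x1000                            # assume 4 KiB pages

stage1 = {0x0000_0000: 0x0001_0000}      # guest VA page -> guest PA page
stage2 = {0x0001_0000: 0x8004_0000}      # guest PA page -> host PA page

def translate(va):
    page, off = va & ~(PAGE - 1), va & (PAGE - 1)
    gpa_page = stage1[page]              # stage 1: what the guest OS set up
    pa_page = stage2[gpa_page]           # stage 2: what the hypervisor set up
    return pa_page | off                 # page offset passes through unchanged

pa = translate(0x0000_0123)              # -> 0x8004_0123
```

The guest freely edits `stage1` without trapping; isolation comes from the hypervisor being the only party that can edit `stage2`, which is exactly what the shadow-page-table approach had to emulate in software.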
For example, with 64-bit Linux, KVM uses a 4-level page table: translating a VA to a PA takes 4 page-table reads plus the final data access, i.e., 5 memory accesses. After introducing the second stage, this becomes 5 * 5 = 25 memory accesses. Readers can think about why the relationship is multiplicative: every guest-physical address produced during the stage-1 walk must itself be translated through the full stage-2 walk before it can be used.
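The multiplicative relationship can be written out as a small formula, assuming an n-level guest table and an m-level stage-2 table:

```python
# Memory-reference count for a nested (two-stage) page walk.
# Single stage: n table reads plus the final data access = n + 1 references.
# Two stages: each of those n + 1 guest-physical references must first be
# translated by a full m-level stage-2 walk, so the counts multiply.

def single_stage_refs(levels):
    return levels + 1                    # table reads + final data access

def two_stage_refs(guest_levels, host_levels):
    # each guest-physical access costs a stage-2 walk plus the access itself
    return single_stage_refs(guest_levels) * single_stage_refs(host_levels)

# 4-level guest table + 4-level stage-2 table, as in the KVM example:
refs = two_stage_refs(4, 4)              # 5 * 5 = 25
```

Of this total, 24 references are page-table reads and only the last one is the actual data access, which is why a stage-2-aware TLB matters so much on miss-heavy workloads.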