
Linux process management: CFS load balancing

Latest update time:2022-01-25

What is load balancing?

The earlier articles on scheduling covered the default scheduling policy on a single CPU. To reduce interference between CPUs, each CPU has its own task queue (runqueue). At run time, some CPUs may end up very busy while others sit nearly idle, so load balancing is needed.

The process of transferring tasks from a CPU with a heavier load to a CPU with a relatively lighter load is called load balancing.

Before understanding load balancing, it is necessary to understand the topological relationship between CPUs on the SoC.

The internal structure of a multi-core SoC is complex. The kernel uses the CPU topology to describe the architecture of an SoC, and it uses scheduling domains to describe the hierarchical relationship between CPUs. In a low-level scheduling domain, load balancing between CPUs is relatively cheap; in a higher-level scheduling domain, the overhead is greater.

For example, in a 4-core SoC, every two cores form a cluster and share an L2 cache. Each cluster can then be treated as an MC scheduling domain; each MC scheduling domain has two scheduling groups, and each scheduling group contains a single CPU. The entire SoC can be treated as a higher-level DIE scheduling domain with two scheduling groups: cluster0 belongs to one group and cluster1 to the other. A task migrated across clusters can no longer use the L2 cache contents it had warmed up in its old cluster, which is expensive. The overhead of load balancing in the SoC-level DIE scheduling domain is therefore higher.
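The 4-core example can be modeled in a few lines. The following is only an illustrative sketch, not kernel code: the class names SchedGroup and SchedDomain and the helper build_domains() are hypothetical stand-ins for the kernel's struct sched_group and struct sched_domain, and the cluster layout is the one assumed in the example above.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass
class SchedGroup:
    cpus: FrozenSet[int]              # CPUs covered by this group


@dataclass
class SchedDomain:
    name: str                         # "MC" or "DIE"
    span: FrozenSet[int]              # every CPU covered by this domain
    groups: List[SchedGroup]          # groups that partition the span
    parent: Optional["SchedDomain"] = None


def build_domains(cpu: int, clusters: List[List[int]]) -> SchedDomain:
    """Return the base (MC) domain of `cpu`; its parent is the DIE domain."""
    own_cluster = next(cl for cl in clusters if cpu in cl)
    # DIE level: spans the whole SoC, one group per cluster.
    die = SchedDomain(
        name="DIE",
        span=frozenset(c for cl in clusters for c in cl),
        groups=[SchedGroup(frozenset(cl)) for cl in clusters],
    )
    # MC level: spans the CPU's own cluster, one single-CPU group per core.
    return SchedDomain(
        name="MC",
        span=frozenset(own_cluster),
        groups=[SchedGroup(frozenset([c])) for c in own_cluster],
        parent=die,
    )


def describe(cpu: int, clusters: List[List[int]]) -> List[str]:
    """Walk from the base domain up to the top domain and describe each level."""
    lines = []
    sd = build_domains(cpu, clusters)
    while sd is not None:
        groups = [sorted(g.cpus) for g in sd.groups]
        lines.append(f"{sd.name}: span={sorted(sd.span)} groups={groups}")
        sd = sd.parent
    return lines


if __name__ == "__main__":
    # Two clusters of two cores each, as in the example.
    for line in describe(2, [[0, 1], [2, 3]]):
        print(line)
```

For CPU 2 this yields an MC domain spanning CPUs 2 and 3 with two single-CPU groups, and a DIE domain spanning all four CPUs with one group per cluster.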

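The scheduling domains and scheduling groups of each CPU can be inspected under /proc/sys/kernel/sched_domain. This is a procfs directory that requires scheduler debugging support; on kernels 5.13 and later the equivalent information is under /sys/kernel/debug/sched/domains.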

  • The main members of the scheduling domain sched_domain are as follows:
Member | Description
--- | ---
parent and child | Sched domains form a hierarchy, and parent and child link adjacent levels of it. For the base domain, child is NULL; for the top domain, parent is NULL.
groups | A scheduling domain contains several scheduling groups. These groups form a circular linked list, and the groups member is the head of that list.
min_interval and max_interval | Balancing has a cost, so the balance status of a scheduling domain cannot be checked all the time. These two parameters bound the time interval between checks of the sched domain's balance status.
balance_interval | The time interval between balancing attempts for the sched domain.
busy_factor | Normally, balance_interval defines the balancing interval. If the CPU is busy, the interval should be longer, so it becomes busy_factor x balance_interval.
imbalance_pct | Load balancing is performed only when the imbalance in the scheduling domain reaches a certain level. imbalance_pct defines that imbalance watermark.
level | The level of the sched domain within the scheduling domain hierarchy.
span_weight | The number of CPUs in this sched domain.
span | The set of CPUs (a cpumask) that this sched domain covers.
  • The main members of the scheduling group sched_group are as follows:
Member | Description
--- | ---
next | All sched groups in a sched domain form a circular linked list, and next points to the next node in that list.
group_weight | The number of CPUs in this scheduling group.
sgc | The computing power (CPU capacity) information of this scheduling group, held in a struct sched_group_capacity.
cpumask | The CPUs that this scheduling group contains.
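How the interval and watermark members of sched_domain interact can be shown with a little arithmetic. This is a simplified sketch under stated assumptions: the function names and the sample values are hypothetical, intervals are treated as plain milliseconds, and the kernel's real logic has further details (such as conversion to jiffies) that are left out.

```python
def effective_interval_ms(balance_interval, busy_factor, cpu_busy,
                          min_interval, max_interval):
    """Time between balance attempts for one sched domain, in milliseconds."""
    # balance_interval is kept within [min_interval, max_interval].
    interval = max(min_interval, min(balance_interval, max_interval))
    # A busy CPU balances less often: stretch the interval by busy_factor.
    if cpu_busy:
        interval *= busy_factor
    return interval


def is_imbalanced(busiest_load, local_load, imbalance_pct):
    """True when the busiest group exceeds the local group by the watermark."""
    # imbalance_pct is a percentage: 125 means "more than 25% above local".
    return 100 * busiest_load > imbalance_pct * local_load


if __name__ == "__main__":
    # Hypothetical values, chosen only to illustrate the arithmetic.
    print(effective_interval_ms(8, 16, False, 4, 32))
    print(effective_interval_ms(8, 16, True, 4, 32))
    print(is_imbalanced(130, 100, 125))
    print(is_imbalanced(120, 100, 125))
```

With these sample values, an idle CPU checks the domain every 8 ms and a busy one every 128 ms, and a group carrying a load of 130 against a local load of 100 crosses a 125% watermark while a load of 120 does not.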

CPU topology example

To reduce lock contention, each CPU has its own copy of the sched domains (here an MC domain and a DIE domain) and of the sched groups. The sched domains form a hierarchy, and the sched groups within a domain form a circular linked list. You can view the CPU topology information under /sys/devices/system/cpu/cpuX/topology.

In this example, the sched domain hierarchy has two levels: the base domain is the MC (multi-core) domain and the top domain is the DIE domain. The top DIE domain covers all CPUs in the system, the MC domain of the small-core cluster covers all CPUs in the small-core cluster, and the MC domain of the large-core cluster covers all CPUs in the large-core cluster.
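The topology directory mentioned above can be dumped with a short script. This is a minimal sketch: the helper name read_topology() is hypothetical, and since the set of files in that directory varies with kernel version and architecture, the script lists whatever is present rather than assuming specific file names.

```python
from pathlib import Path


def read_topology(cpu: int, sysfs_root: str = "/sys/devices/system/cpu") -> dict:
    """Return {file name: contents} for one CPU's topology directory."""
    topo_dir = Path(sysfs_root) / f"cpu{cpu}" / "topology"
    info = {}
    if not topo_dir.is_dir():
        return info  # not running on Linux, or the CPU does not exist
    for entry in sorted(topo_dir.iterdir()):
        try:
            info[entry.name] = entry.read_text().strip()
        except OSError:
            continue  # skip entries that cannot be read as files
    return info


if __name__ == "__main__":
    for name, value in read_topology(0).items():
        print(f"{name}: {value}")
```

Passing a different sysfs_root makes it possible to try the function against a copied or mocked directory tree.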

From the device tree (DTS) and the CPU topology subsystem, the kernel builds the sched domain hierarchy that the balancing algorithms work on. The call path is: kernel_init() -> kernel_init_freeable() -> smp_prepare_cpus() -> init_cpu_topology() -> parse_dt_topology()

Load balancing software architecture

In the architecture diagram, the left side is divided into CPU load tracking and task load tracking.

  • CPU load tracking: tracks the load of each CPU and aggregates the loads of all CPUs in a cluster, which makes it easy to calculate the load imbalance between clusters.
  • Task load tracking: determines whether a task suits the computing power of its current CPU and, if balancing is needed, how much task load has to be migrated between CPUs to reach balance.
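The two kinds of tracking described above come down to simple arithmetic, sketched below. The helper cluster_load() and the sample numbers are hypothetical. The margin check is modeled on the kernel's fits_capacity() macro, which requires roughly 20% headroom (1280/1024 = 1.25); whether a given kernel version uses exactly this check for a given decision is an assumption.

```python
def cluster_load(cpu_loads, cluster_cpus):
    """Aggregate the per-CPU loads of one cluster into a single number."""
    return sum(cpu_loads[cpu] for cpu in cluster_cpus)


def fits_capacity(task_util, cpu_capacity):
    """True if the task fits on the CPU with about 20% headroom to spare."""
    # Integer form of: task_util * 1.25 < cpu_capacity
    return task_util * 1280 < cpu_capacity * 1024


if __name__ == "__main__":
    # Hypothetical per-CPU loads for a 4-core, 2-cluster system.
    loads = {0: 300, 1: 100, 2: 50, 3: 0}
    print(cluster_load(loads, [0, 1]), cluster_load(loads, [2, 3]))
    # Capacities use the kernel's 0..1024 scale; 512 stands for a small core.
    print(fits_capacity(400, 512), fits_capacity(450, 512))
```

Here cluster 0 carries a load of 400 against 50 on cluster 1, and a task with utilization 400 still fits a CPU of capacity 512 while one with utilization 450 does not.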

The right side shows the sched domain hierarchy built from the DTS and the CPU topology subsystem.

With the infrastructure on both sides, when will load balancing be triggered? This is mainly related to scheduling events. When scheduling events such as task wakeup, task creation, and tick arrival occur, the imbalance of the current system can be checked and tasks can be migrated as appropriate to keep the system load in a balanced state.

When to do load balancing?

There are two types of load balancer for CFS tasks. The periodic balancer runs on busy CPUs and balances CFS tasks among them. The idle balancer serves idle CPUs and moves tasks from busy CPUs onto idle ones; it comes in two forms, nohz idle balancing and new idle balancing. This gives three kinds of load balancing:

  1. Periodic load balancing (or tick load balancing) checks the balance status of the system periodically from the tick. It finds the most heavily loaded domain, group, and CPU, and pulls runnable tasks from that CPU to the current CPU to keep the system load balanced.
  2. Nohz idle load balancing applies when other CPUs have entered the idle state while this CPU has too much work to do, so idle CPUs must be woken with an IPI to perform load balancing. It is also driven by the tick on the busy CPU. When the idle load balancer needs to be kicked, an IPI is sent through the GIC to a selected idle CPU, which then performs load balancing on behalf of all idle CPUs in the system.
  3. New idle load balancing is the easiest to understand. When a CPU has no task left to run and is about to enter the idle state, it checks whether other CPUs need help and pulls tasks from a busy CPU, keeping the load of the whole system balanced.

The basic process of load balancing

When load balancing runs on a CPU, it always starts from the base domain and checks the balance between the sched groups of that domain. If there is an imbalance, tasks are migrated within the cluster the CPU belongs to, so that the task load stays balanced across the CPU cores of that cluster. The same check is then repeated in the parent domain.

load_balance() is the core function for load balancing. It operates on one scheduling domain (sched domain) at a time, including the scheduling groups of that domain, and proceeds as follows:

  1. Find the busiest sched group in the domain
  2. Select the busiest CPU runqueue in the busiest group, and this CPU becomes the source of the task migration.
  3. Select the tasks to migrate from that runqueue (the choice is based mainly on task load, and the most heavily loaded task is given priority).
  4. Migrate the selected tasks to the runqueue of the destination (dst) CPU.
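The four steps above can be simulated on toy data. This is a rough sketch, not the kernel's load_balance(): the data layout (a dict of runqueues and a list of groups) and the default imbalance_pct of 125 are assumptions made for illustration. It moves at most one task per call, and it ignores CPU capacity, task affinity, and the fact that the kernel limits how much load it moves.

```python
def load_balance(dst_cpu, groups, runqueues, imbalance_pct=125):
    """Pull one task to dst_cpu if its sched domain is imbalanced.

    groups:    list of CPU lists, the sched groups of one domain
    runqueues: {cpu: {task name: task load}}
    Returns the name of the migrated task, or None if nothing moved.
    """
    def rq_load(cpu):
        return sum(runqueues[cpu].values())

    def group_load(group):
        return sum(rq_load(cpu) for cpu in group)

    local = next(g for g in groups if dst_cpu in g)
    others = [g for g in groups if dst_cpu not in g]
    if not others:
        return None

    # Step 1: find the busiest sched group in the domain.
    busiest = max(others, key=group_load)
    if 100 * group_load(busiest) <= imbalance_pct * group_load(local):
        return None  # below the imbalance watermark, nothing to do

    # Step 2: the busiest runqueue in that group is the migration source.
    src_cpu = max(busiest, key=rq_load)
    if not runqueues[src_cpu]:
        return None

    # Step 3: pick the task with the heaviest load.
    task = max(runqueues[src_cpu], key=runqueues[src_cpu].get)

    # Step 4: move it to the destination runqueue.
    runqueues[dst_cpu][task] = runqueues[src_cpu].pop(task)
    return task


if __name__ == "__main__":
    rqs = {0: {"a": 300, "b": 200}, 1: {"c": 100}, 2: {}, 3: {"d": 50}}
    print(load_balance(2, [[0, 1], [2, 3]], rqs))
    print(rqs)
```

In this run CPU 2 balances the DIE-level domain: the group with CPUs 0 and 1 is the busiest, CPU 0 is its busiest runqueue, and task "a" is pulled over to CPU 2.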

