Understand DPDK in one article
1. The situation and trends of network IO
As everyday users we can feel that network speeds keep improving, and network technology itself has evolved from 1GE to 10GE, 25GE, 40GE, and 100GE. It follows that the network IO capability of a single machine must keep pace with this development.
-
Traditional telecommunications field
At the IP layer and below, equipment such as routers, switches, firewalls, and base stations has traditionally used hardware solutions: some based on dedicated network processors (NP), some on FPGAs, and some on ASICs. The disadvantages of hardware are obvious, though: bugs are hard to fix, and debugging and maintenance are painful. Meanwhile, network technology keeps evolving, for example the succession of mobile generations from 2G/3G/4G to 5G, and implementing that business logic in hardware makes fast iteration very difficult. The challenge facing the traditional field is therefore an urgent need for a high-performance network IO development framework implemented in software.
-
The evolution of cloud
Private clouds have become a trend, sharing hardware through network function virtualization (NFV). NFV means implementing traditional or new network functions on standard servers and standard switches, which creates an urgent need for a high-performance network IO development framework based on commodity operating systems and standard servers.
-
Soaring stand-alone performance
Network cards have developed from 1G to 100G, CPUs from single-core to multi-core to multi-socket, and the capability of a single server keeps reaching new heights. But software has not kept pace, and single-machine processing capability lags behind the hardware. How do we develop high-throughput services that keep up, with single-machine concurrency in the millions or tens of millions? Even for businesses that do not need high QPS and are mainly CPU-intensive, applications such as big data analysis and artificial intelligence now need to move large amounts of data between distributed servers to complete their computation. This is the point that should concern us most as Internet backend developers.
2. Linux + x86 network IO bottleneck
A few years ago, I wrote an article titled "Network Card Working Principle and Tuning under High Concurrency", which described the Linux packet receive and transmit path. From experience, on a C1 server (8 cores), handling every 10K packets per second costs about 1% of a CPU in soft interrupts, which puts the single-machine ceiling at roughly 1 million PPS (packets per second). TGW (the Netfilter-based version) reaches about 1 million PPS, and AliLVS only reaches 1.5 million PPS after optimization, on fairly good server hardware. Suppose we want to saturate a 10GE NIC with 64-byte packets: that requires about 20 million PPS (note: the actual ceiling of a 10G Ethernet NIC is 14.88 million PPS, because the minimum frame occupies 84 bytes on the wire; see "Bandwidth, Packets Per Second, and Other Network Performance Metrics"), and 100G means 200 million PPS. At 20 million PPS the per-packet budget is only 50 nanoseconds. A cache miss, whether in the TLB, the data cache, or the instruction cache, costs about 65 nanoseconds to fetch from memory, and cross-node access on a NUMA system costs about 40 nanoseconds more. So even with no business logic at all, merely receiving and sending packets is hard. We have to control the cache hit rate, understand the machine architecture, and avoid cross-node communication.
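As a quick sanity check on those figures (back-of-the-envelope arithmetic, not a measurement): a minimum 64-byte frame occupies 64B + 7B preamble + 1B start-of-frame delimiter + 12B inter-frame gap = 84B on the wire, so a 10G link carries at most 10,000,000,000 / (84 * 8) ≈ 14.88 million frames per second; at the round figure of 20 million PPS the per-packet budget is 1s / 20,000,000 = 50ns, so a single ~65ns cache miss already blows the budget.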
I hope these numbers give a first-hand sense of how big the challenge is, and of the gap between the ideal and the reality that has to be bridged. The problems are as follows:
-
The traditional way of sending and receiving packets relies on hard interrupts for notification, and each hard interrupt costs roughly 100 microseconds, not counting the cache misses caused by the context switch.
-
Data must be copied from kernel space to user space, which burns a lot of CPU and causes global lock contention.
-
There is system call overhead for sending and receiving packets.
-
The kernel runs across all cores and must stay globally consistent; even with lock-free techniques, the performance cost of bus locking and memory barriers cannot be avoided.
-
The path from the NIC to the business process is too long, and parts of it, such as the netfilter framework, are not always needed; all of this adds overhead and invites cache misses.
3. Basic principles of DPDK
From the analysis above we can see that the bottleneck lies in the kernel: the way IO is implemented, the kernel's own overhead, and the uncontrollable path the data takes through it. To solve the problem, the kernel has to be bypassed. The mainstream solutions therefore take NIC IO out of the kernel and send and receive packets directly in user space, eliminating the kernel bottleneck.
The Linux community also provides a bypass mechanism called Netmap. According to official data, a 10G network card can reach 14 million PPS, but Netmap is not widely used. There are several reasons for this:
-
Netmap requires driver support, which means the NIC vendor has to accept and maintain this solution.
-
Netmap still relies on the interrupt notification mechanism, which does not completely solve the bottleneck.
-
Netmap is essentially a handful of system calls that let user space send and receive packets directly. The functionality is too primitive, there is no network development framework built on top of it, and the community is small.
Now let's look at DPDK, which has been under development for more than ten years. Led initially by Intel and later joined by major vendors such as Huawei, Cisco, and AWS, the core players are all in this circle; the community is complete and the ecosystem forms a closed loop. Early on it was used mainly for applications at layer 3 and below in the traditional telecom field, with Huawei, China Telecom, and China Mobile among the early adopters, and switches, routers, and gateways as the main scenarios. As upper-layer services demanded more and DPDK itself matured, higher-level applications are gradually emerging.
DPDK bypass principle:
On the left is the original path: NIC -> driver -> protocol stack -> socket interface -> business.
On the right is the DPDK path, which bypasses the kernel via UIO (Userspace I/O): NIC -> DPDK poll mode driver -> DPDK base libraries -> business.
The advantages of user space are convenience of development, use, and maintenance, plus good flexibility. In addition, a crash does not affect the kernel, so robustness is high.
CPU architectures supported by DPDK: x86, ARM, PowerPC (PPC)
NICs supported by DPDK are listed at https://core.dpdk.org/supported/; we mainly use the Intel 82599 (fiber ports) and Intel X540 (copper ports).
4. UIO, the cornerstone of DPDK
To let drivers run in user space, Linux provides the UIO mechanism: interrupts are sensed by read()ing a device file, and the NIC's registers and memory are accessed through mmap().
UIO principle:
There are several steps to develop a user-mode driver:
-
Develop a small UIO module that runs in the kernel, because hard interrupts can only be handled in the kernel
-
Read interrupt via /dev/uioX
-
Share memory with the device via mmap (see the sketch after this list)
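To make the three steps concrete, here is a minimal user-space sketch of the UIO pattern (illustrative only; the device name /dev/uio0 and the 4KB mapping size are assumptions, and error handling is reduced to the bare minimum):
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/uio0", O_RDWR);   /* device file exposed by the kernel-side UIO module */
    if (fd < 0)
        return 1;

    /* Step 3: map the device's first memory region (map0) into our address space;
     * offset 0 selects map0, the 4096-byte size is just an example. */
    volatile uint8_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED)
        return 1;

    for (;;) {
        uint32_t irq_count;
        /* Step 2: read() blocks until the kernel module signals an interrupt;
         * the value returned is the cumulative interrupt count. */
        if (read(fd, &irq_count, sizeof(irq_count)) == sizeof(irq_count))
            printf("interrupt #%u, device registers mapped at %p\n", irq_count, (void *)regs);
    }
}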
5. DPDK Core Optimization: PMD
DPDK's UIO driver masks the interrupts issued by the hardware and instead polls actively in user space. This mode is called PMD (Poll Mode Driver).
UIO bypasses the kernel, and active polling removes hard interrupts, so DPDK can send and receive packets entirely in user space. This brings zero copy and freedom from system calls, and synchronous processing reduces the cache misses caused by context switches.
A core running a PMD sits at 100% user-space CPU, because it spins in a receive loop like the sketch below.
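A minimal sketch of such a busy-poll loop using the standard rte_eth_rx_burst()/rte_eth_tx_burst() calls (the port/queue numbers, burst size, and the omitted business logic are assumptions for illustration; EAL and port initialization are not shown):
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void poll_loop(uint16_t port_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {                                   /* never sleeps, which is why the core shows 100% CPU */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);
        if (nb_rx == 0)
            continue;                            /* nothing arrived, poll again immediately */

        /* ... business logic on bufs[0..nb_rx-1] goes here ... */

        uint16_t nb_tx = rte_eth_tx_burst(port_id, 0, bufs, nb_rx);
        while (nb_tx < nb_rx)                    /* free any mbufs the NIC did not accept */
            rte_pktmbuf_free(bufs[nb_tx++]);
    }
}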
When the network is idle the CPU spins for long stretches doing nothing, which wastes energy, so DPDK introduced the Interrupt DPDK mode.
Interrupt DPDK:
Its principle is very similar to NAPI: when there are no packets to process it goes to sleep and switches to interrupt notification. It can also share a CPU core with other processes, with the DPDK process getting higher scheduling priority.
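A rough sketch of the idea (assuming the ethdev rx-interrupt APIs rte_eth_dev_rx_intr_enable/disable and rte_epoll_wait; registering the queue's interrupt event with the epoll instance, and all error handling, are omitted):
#include <rte_ethdev.h>
#include <rte_interrupts.h>
#include <rte_mbuf.h>

static void adaptive_loop(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *bufs[32];
    struct rte_epoll_event ev;

    for (;;) {
        uint16_t n = rte_eth_rx_burst(port_id, queue_id, bufs, 32);
        if (n > 0) {
            /* ... process the burst, stay in polling mode ... */
            continue;
        }
        /* Queue is empty: arm the rx interrupt and sleep until the NIC signals new packets. */
        rte_eth_dev_rx_intr_enable(port_id, queue_id);
        rte_epoll_wait(RTE_EPOLL_PER_THREAD, &ev, 1, -1);
        rte_eth_dev_rx_intr_disable(port_id, queue_id);   /* back to pure polling */
    }
}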
6. High-performance code implementation of DPDK
-
Use HugePage to reduce TLB Miss
By default, Linux uses 4KB pages. The smaller the page and the larger the memory, the bigger the page tables get and the more memory they consume. The CPU's TLB (Translation Lookaside Buffer) is expensive, so it can generally hold only hundreds to thousands of page-table entries. If a process wants to use 64GB of memory, that is 64GB / 4KB ≈ 16 million pages, whose page-table entries alone take roughly 16,000,000 * 4B ≈ 62MB. With 2MB HugePages, only 64GB / 2MB = 32,768 pages are needed, which is simply not on the same order of magnitude.
DPDK uses HugePages, supporting 2MB and 1GB page sizes on x86-64, which geometrically shrinks the number of page-table entries and hence TLB misses. It also provides basic libraries such as the memory pool (Mempool), MBuf, lock-free ring (Ring), and Bitmap. In our practice, frequent allocation and release on the data plane must go through a memory pool; rte_malloc should not be used directly, since DPDK's general-purpose allocator is quite crude, not as good as ptmalloc.
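As an illustration, a minimal sketch of creating an mbuf pool on the local NUMA socket and allocating from it (the pool name, element count, and cache size are arbitrary example values):
#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

static struct rte_mempool *make_pool(void)
{
    /* 8191 mbufs, a 256-element per-lcore cache, default-sized data rooms,
     * allocated from hugepage memory on this core's NUMA socket. */
    return rte_pktmbuf_pool_create("mbuf_pool", 8191, 256, 0,
                                   RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
}

static void use_pool(struct rte_mempool *pool)
{
    struct rte_mbuf *m = rte_pktmbuf_alloc(pool);   /* O(1), no system call, no global lock */
    if (m != NULL)
        rte_pktmbuf_free(m);                        /* goes back to the per-lcore cache */
}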
-
SNA (Shared-nothing Architecture)
Decentralize the software architecture and avoid global sharing as much as possible: global sharing brings global contention and destroys horizontal scalability. On a NUMA system, do not use memory remotely across nodes; see the sketch below.
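A small sketch of the shared-nothing idea: each worker core owns its own statistics and allocates its scratch memory on its own NUMA socket, so nothing is shared or locked on the fast path (the structure layout and sizes are made up for illustration):
#include <stdint.h>
#include <rte_lcore.h>
#include <rte_malloc.h>
#include <rte_memory.h>

struct worker_ctx {
    uint64_t rx_pkts;   /* owned and written by exactly one core */
    uint64_t tx_pkts;
    void    *scratch;
} __rte_cache_aligned;  /* one cache line per worker, no false sharing */

static struct worker_ctx ctx[RTE_MAX_LCORE];

static int worker_main(void *arg)
{
    struct worker_ctx *c = &ctx[rte_lcore_id()];
    (void)arg;

    /* Allocate this worker's memory on its own NUMA node, never remotely. */
    c->scratch = rte_malloc_socket("scratch", 4096, 0, rte_socket_id());
    /* ... the per-core packet loop updates only c->rx_pkts / c->tx_pkts ... */
    return 0;
}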
-
SIMD (Single Instruction Multiple Data)
From the earliest MMX/SSE to the more recent AVX2, SIMD capability keeps growing. DPDK batches multiple packets and then uses vector programming to handle the whole batch in one loop iteration; memcpy, for example, is sped up with SIMD.
SIMD is more common in game backends, but other businesses with similar batch-processing scenarios that need more performance can also check whether it applies; see the copy sketch below.
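This is not DPDK's actual rte_memcpy, just a minimal AVX2 copy loop to show the flavor of SIMD processing (real implementations also handle alignment, small sizes, and CPU feature detection):
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

static void copy_avx2(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t i = 0;
    for (; i + 32 <= len; i += 32) {          /* 32 bytes per iteration */
        __m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
        _mm256_storeu_si256((__m256i *)(dst + i), v);
    }
    for (; i < len; i++)                      /* scalar tail */
        dst[i] = src[i];
}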
-
Not using slow API
What counts as a slow API needs to be redefined here. Take gettimeofday: on 64-bit it does not have to trap into the kernel thanks to the vDSO, it is just a memory access, and it can run tens of millions of times per second. But remember that at 10GE we also need to process tens of millions of packets per second, so even gettimeofday becomes a slow API. DPDK provides cycle-counter interfaces instead, such as rte_get_tsc_cycles, implemented on top of HPET or the TSC.
On x86-64, the RDTSC instruction reads the counter directly from the CPU, returning the result in two 32-bit halves (EDX:EAX). A common implementation is:
static inline uint64_t
rte_rdtsc(void)
{
uint32_t lo, hi;
__asm__ __volatile__ (
"rdtsc" : "=a"(lo), "=d"(hi)
);
return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}
This is correct, but not perfect: it takes two bit operations, a shift and an OR, to assemble the result. Let's see how DPDK implements it:
static inline uint64_t
rte_rdtsc(void)
{
union {
uint64_t tsc_64;
struct {
uint32_t lo_32;
uint32_t hi_32;
};
} tsc;
asm volatile("rdtsc" :
"=a" (tsc.lo_32),
"=d" (tsc.hi_32));
return tsc.tsc_64;
}
It cleverly uses a C union so the two 32-bit halves and the 64-bit value share memory, and the result is read out directly without extra operations. There are, however, some problems you must face and solve when using the TSC:
-
CPU affinity, to avoid inaccuracy caused by the thread hopping between cores
-
Memory barriers, to avoid inaccuracy caused by out-of-order execution
-
Disabling frequency scaling and Intel Turbo Boost to pin the CPU frequency, avoiding inaccuracy caused by frequency changes
-
Compilation and execution optimization
-
branch prediction
Modern CPUs improve parallelism with pipelines and superscalar execution. To push parallelism further they perform branch prediction: when a branch is encountered, the CPU guesses which side will be taken, then fetches and decodes its instructions and reads registers in advance; if the guess turns out wrong, all of that work is thrown away. When writing business code we sometimes know very well whether a branch is almost always true or false, so we can intervene manually to generate more compact code and hint the CPU toward a higher branch-prediction success rate. A common definition of likely/unlikely:
#pragma once
#if !__GLIBC_PREREQ(2, 3)
# if !defined(__builtin_expect)
# define __builtin_expect(x, expected_value) (x)
# endif
#endif
#if !defined(likely)
#define likely(x) (__builtin_expect(!!(x), 1))
#endif
#if !defined(unlikely)
#define unlikely(x) (__builtin_expect(!!(x), 0))
#endif
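A typical use on a hot path (parse_header, drop_packet, and process_packet are hypothetical placeholders):
static void handle(struct pkt *pkt)
{
    /* Parsing rarely fails, so hint the compiler: it keeps the hot path compact
     * and lays the error handling out of line. */
    if (unlikely(parse_header(pkt) < 0)) {
        drop_packet(pkt);        /* cold path */
        return;
    }
    process_packet(pkt);         /* hot path, usually falls straight through */
}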
-
CPU Cache prefetch
The cost of a cache miss is very high, about 65 nanoseconds to fetch from memory, so actively pushing soon-to-be-accessed data into the CPU cache is a worthwhile optimization. A typical scenario is traversing a linked list: the next node lives at an effectively random address, so the hardware cannot prefetch it automatically, but while we are processing the current node we can push the next one into the cache with a prefetch instruction.
API documentation: https://doc.dpdk.org/api/rte__prefetch_8h.html
static inline void rte_prefetch0(const volatile void *p)
{
asm volatile ("prefetcht0 %[p]" : : [p] "m" (*(const volatile char *)p));
}
#if !defined(prefetch)
#define prefetch(x) __builtin_prefetch(x)
#endif
…etc
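A sketch of the linked-list case described above (the node layout and do_work() are hypothetical):
#include <rte_prefetch.h>

struct node {
    struct node *next;
    /* ... payload ... */
};

void do_work(struct node *n);            /* hypothetical per-node processing */

static void walk(struct node *head)
{
    for (struct node *n = head; n != NULL; n = n->next) {
        if (n->next != NULL)
            rte_prefetch0(n->next);      /* start pulling the next node into L1 now */
        do_work(n);                      /* by the time this returns, n->next is likely cached */
    }
}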
-
memory alignment
Memory alignment has 2 benefits:
-
Avoid structure members that straddle a cache line and therefore need two reads merged into a register, which hurts performance. Sort structure members from largest to smallest so they stay aligned; see "Data alignment: Straighten up and fly right".
#define __rte_packed __attribute__((__packed__))
-
Avoid false sharing when multiple threads write adjacent data, which causes cache misses: align such structures to the cache line.
#ifndef CACHE_LINE_SIZE
#define CACHE_LINE_SIZE 64
#endif
#ifndef aligned
#define aligned(a) __attribute__((__aligned__(a)))
#endif
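A sketch applying both points: members ordered from large to small so nothing straddles a cache line, and per-core statistics padded to a full cache line so two writers never share one (field names and sizes are illustrative):
#include <stdint.h>

#ifndef CACHE_LINE_SIZE
#define CACHE_LINE_SIZE 64
#endif

struct flow_key {                        /* members ordered from large to small */
    uint64_t bytes;
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
    uint8_t  proto;
};

struct per_core_stats {
    uint64_t rx;
    uint64_t tx;
} __attribute__((__aligned__(CACHE_LINE_SIZE)));   /* each core's counters own a whole line */

static struct per_core_stats stats[16];            /* stats[i] written only by core i: no false sharing */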
-
constant optimization
Operations on constants are completed at compile time. C++11, for example, introduces constexpr; with GCC you can use __builtin_constant_p to test whether a value is a compile-time constant and, if so, compute the result during compilation. Take network/host byte-order conversion as an example:
#define rte_bswap32(x) ((uint32_t)(__builtin_constant_p(x) ? \
rte_constant_bswap32(x) : \
rte_arch_bswap32(x)))
where rte_constant_bswap32 is implemented as:
#define RTE_STATIC_BSWAP32(v) \
((((uint32_t)(v) & UINT32_C(0x000000ff)) << 24) | \
(((uint32_t)(v) & UINT32_C(0x0000ff00)) << 8) | \
(((uint32_t)(v) & UINT32_C(0x00ff0000)) >> 8) | \
(((uint32_t)(v) & UINT32_C(0xff000000)) >> 24))
-
Use CPU instructions
Modern CPUs provide many instructions that implement common functions directly, such as endianness conversion: x86 supports it with the bswap instruction.
static inline uint64_t rte_arch_bswap64(uint64_t _x)
{
register uint64_t x = _x;
asm volatile ("bswap %[x]"
: [x] "+r" (x)
);
return x;
}
This is also how glibc approaches it: optimize constants first, then use CPU instructions, and only fall back to plain code last. These are, after all, top programmers, with their own exacting standards for language, compiler, and implementation. So understand the wheel before you reinvent it.
Google's open-source cpu_features library can report which features the current CPU supports, allowing code paths optimized for a specific CPU. High-performance programming never ends; the understanding of hardware, kernels, compilers, and languages has to stay deep and keep pace with the times.
7. DPDK Ecology
For Internet backend development, what the DPDK framework itself provides is fairly bare. For example, to use DPDK you must implement basics such as ARP and the IP layer yourself, which makes it hard to get started; and higher-level services additionally need a user-space transport protocol. Using DPDK directly is not recommended.
Currently, the application-layer development project with a complete ecosystem and a strong community (backed by first-tier vendors) is FD.io (The Fast Data Project). It includes VPP, open-sourced by Cisco, with fairly complete protocol support: ARP, VLAN, multipath, IPv4/v6, MPLS, and so on. For user-space transport, UDP/TCP support comes from TLDK. From project positioning to community support, it is a comparatively dependable framework.
Seastar is also very powerful and flexible: you can switch between the kernel stack and DPDK at will, and it has its own Seastar Native TCP/IP stack. However, we have not yet seen large projects built on Seastar, and there may be many pits still to be filled.
Source: https://cloud.tencent.com/developer/article/1198333 (copyright belongs to the original author)