Deep understanding of Linux networking
Like CPU, memory, and I/O, the network is a core subsystem of Linux.
A network connects different computers and network devices. From the operating system's point of view it is essentially a form of inter-process communication; in particular, communication between processes on different systems has to go through the network.
Network model
Multiple servers are connected through network devices such as network cards, switches, and routers to form an interconnected network.
Because network equipment is heterogeneous and network protocols are complex, the International Organization for Standardization (ISO) defined the seven-layer OSI network model. That model is quite complex, however, and the de facto standard used in practice is the more pragmatic TCP/IP model.
In the early days of computer networking, vendors introduced different, incompatible network architectures and standards. To unify them, ISO published the OSI (Open Systems Interconnection) reference model.
Layering breaks a complex networking problem into manageable pieces. To transmit data between different devices we also need a common definition of the data format on the wire, which is what a network protocol provides.
To solve the compatibility problems of heterogeneous devices and to decouple the complex processing of network packets, the OSI model divides the network stack into seven layers: application, presentation, session, transport, network, data link, and physical. Each layer is responsible for a different function:
• Application layer, responsible for providing a unified interface for applications.
• Presentation layer, responsible for converting data into a format compatible with the receiving system.
• Session layer, responsible for maintaining communication sessions between computers.
• Transport layer, responsible for adding a transport header to the data to form a packet.
• Network layer, responsible for routing and forwarding data.
• Data link layer, responsible for MAC addressing, error detection, and error correction.
• Physical layer, responsible for transmitting data frames over the physical network.
However, the OSI model is still too complex and does not describe a directly implementable approach. In practice, Linux therefore uses another, more pragmatic four-layer model: the TCP/IP network model.
The TCP/IP model divides the network stack into four layers: the application layer, transport layer, network layer, and network interface layer. Among them:
• Application layer, responsible for providing a set of applications to users, such as HTTP, FTP, DNS, etc.
• Transport layer, responsible for end-to-end communication, such as TCP, UDP, etc.
• Network layer, responsible for the encapsulation, addressing, and routing of network packets, such as IP, ICMP, etc.
• Network interface layer, responsible for transmitting network packets over the physical network, such as MAC addressing, error detection, and sending frames through the network card.
The relationship between the TCP/IP and OSI models is as follows:
Although Linux implements its network protocol stack according to the TCP/IP model, in everyday learning and discussion we are still used to describing things in terms of the OSI seven-layer model.
For example, "layer 7" and "layer 4" load balancing refer to the application layer and transport layer of the OSI model (which correspond to layer 4 and layer 3 of the TCP/IP model, respectively).
Linux network stack
With the TCP/IP model in place, data being transmitted travels down the protocol stack and is processed layer by layer: each layer encapsulates the data handed down from the layer above with its own protocol header and then passes it to the layer below.
Of course, how a packet is handled at each layer depends on the protocol used at that layer. For example, at the application layer, an application that provides a REST API can encapsulate the JSON data it needs to transmit in HTTP and pass it down to the TCP layer.
Encapsulation itself is very simple: it just adds fixed-format metadata before and/or after the original payload; the payload itself is not modified.
Taking a packet sent over TCP as an example, the figure below shows how the application data is encapsulated at each layer (a packet-capture sketch after the list shows the same headers on the wire).
Specifically:
• The transport layer adds a TCP header in front of the application data;
• the network layer adds an IP header in front of the TCP segment;
• the network interface layer adds a frame header and a frame trailer before and after the IP packet.
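To see these headers on the wire, you can capture a few packets with tcpdump. This is only a minimal sketch: the interface name ens33 and the port-22 filter are assumptions, so adjust them to whatever traffic your machine actually has.
# -i ens33 captures on interface ens33 (adjust to your interface)
# -e also prints the link-layer (Ethernet) header
# -nn shows numeric addresses and ports
# -c 3 stops after 3 packets
$ tcpdump -i ens33 -e -nn -c 3 'tcp port 22'
Each captured line then shows the Ethernet addresses first, then the IP addresses, and finally the TCP ports and flags, which roughly mirrors the frame header, IP header, and TCP header added at each layer.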
These added headers and trailers increase the size of the packet, but physical links cannot carry packets of arbitrary size.
The maximum transmission unit (MTU) configured on a network interface specifies the maximum IP packet size. On Ethernet, which we use most often, the default MTU is 1500 bytes (this is also the Linux default).
Running ifconfig on a Linux system shows the MTU of each network card; in the output below the values differ, for example 1450 and 1500.
[root@dev ~]# ifconfig
cni0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet 10.244.0.1 netmask 255.255.255.0 broadcast 10.244.0.255
inet6 fe80::6435:53ff:fea0:638b prefixlen 64 scopeid 0x20<link>
ether 66:35:53:a0:63:8b txqueuelen 1000 (Ethernet)
RX packets 124 bytes 12884 (12.5 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 122 bytes 29636 (28.9 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 172.17.0.1 netmask 255.255.0.0 broadcast 172.17.255.255
ether 02:42:12:9c:9e:91 txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.2.129 netmask 255.255.255.0 broadcast 192.168.2.255
inet6 fe80::a923:989b:b165:8e3b prefixlen 64 scopeid 0x20<link>
ether 00:0c:29:d9:5e:32 txqueuelen 1000 (Ethernet)
RX packets 131 bytes 13435 (13.1 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 73 bytes 17977 (17.5 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Once a packet exceeds the MTU, it is fragmented at the network layer so that each resulting IP packet is no larger than the MTU.
Obviously, the larger the MTU, the less fragmentation is needed and, naturally, the better the network throughput.
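To check whether a path really supports a given MTU, or to change an interface's MTU, you can combine ping and ip. This is a sketch under the assumptions already used above (interface ens33, peer 192.168.2.129); 1472 is the 1500-byte MTU minus the 20-byte IP header and the 8-byte ICMP header.
# send a 1472-byte ICMP payload with fragmentation forbidden (-M do);
# the ping only succeeds if the path MTU is at least 1500 bytes
$ ping -c 3 -M do -s 1472 192.168.2.129
# temporarily lower the MTU of ens33 (e.g. to 1450 for a VXLAN overlay)
$ ip link set dev ens33 mtu 1450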
Now that we understand the TCP/IP model and how packets are encapsulated, it is easy to see that the network stack in the Linux kernel closely mirrors the four-layer structure of TCP/IP.
The figure below is a schematic diagram of the generic IP network stack in Linux:
Looking at this network stack from top to bottom, you can see that:
• The top-level applications interact with the socket interface through system calls;
• below the sockets are the transport layer, network layer, and network interface layer we mentioned earlier;
• the lowest layer is the network card driver and the physical network card device.
The network card (NIC) is the basic device for sending and receiving network packets.
During system startup, the NIC is registered with the kernel through its driver, and during packet transmission and reception the kernel and the NIC interact through interrupts.
Because processing network packets is complex, the NIC's hard interrupt only handles the essential work of reading or sending the data, while most of the protocol-stack logic runs in soft interrupts.
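You can watch this division of labor between hard and soft interrupts directly in /proc; a small sketch, assuming an interface named ens33 (interrupt names depend on the NIC driver):
# hard interrupts raised by the NIC: one line per IRQ, with per-CPU counts
$ grep ens33 /proc/interrupts
# soft interrupt counters for packet reception (NET_RX) and transmission (NET_TX)
$ grep -E 'NET_RX|NET_TX' /proc/softirqs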
Linux network packet sending and receiving process
After understanding the Linux network stack, let's take a look at how Linux sends and receives network packets.
PS: The following content uses physical network cards as examples. Linux also supports numerous virtual network devices, and their network sending and receiving processes will be somewhat different.
Network packet reception process
Let’s first look at the receiving process of network packets.
1. When a network frame arrives at the network card, the card places it into its receive queue (ring buffer) via DMA, then raises a hard interrupt to tell the interrupt handler that a packet has arrived.
2. The network card interrupt handler allocates the kernel data structure for the frame (sk_buff), copies the frame into the sk_buff buffer, and then notifies the kernel through a soft interrupt that a new frame has been received.
3. The kernel protocol stack takes the frame out of the buffer and processes it layer by layer, from bottom to top. For example, the link layer checks the validity of the frame, determines the upper-layer protocol type (IPv4 or IPv6), strips the frame header and trailer, and hands the rest to the network layer.
4. The network layer extracts the IP header and decides where the packet should go next, for example whether to deliver it locally or forward it. Once it confirms that the packet is destined for this host, it determines the upper-layer protocol type (TCP or UDP), removes the IP header, and hands the packet to the transport layer.
5. The transport layer extracts the TCP or UDP header and, using the four-tuple <source IP, source port, destination IP, destination port> as the key, finds the corresponding socket and copies the data into that socket's receive buffer.
6. Finally, the application can read the newly received data through the socket interface.
The specific process is shown in the figure below: the left half of the figure shows the reception flow, and the pink arrows mark the path a packet takes through the stack.
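A few commands let you observe the pieces of this receive path; a sketch, again assuming the interface ens33 (ring-buffer sizes and statistic names vary by driver):
# size of the NIC's RX/TX ring buffers, the target of the DMA in step 1
$ ethtool -g ens33
# per-CPU softnet statistics: packets processed, drops, and time-squeeze counts
# from the soft-interrupt processing in steps 2 and 3
$ cat /proc/net/softnet_stat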
Network packet sending process
The sending process is the right half of the same figure, and it is easy to see that it is exactly the reverse of the receiving direction.
First, the application calls a socket API (for example sendmsg) to send a network packet.
Since this is a system call, it traps into the socket layer in kernel mode, which places the data in the socket's send buffer.
Next, the network protocol stack takes the packet out of the socket send buffer and processes it top-down through the TCP/IP stack.
For example, the transport layer and the network layer add the TCP header and the IP header respectively, perform a route lookup to determine the next-hop IP address, and fragment the packet according to the MTU.
The fragments are then handed to the network interface layer, which performs physical-address resolution to find the next hop's MAC address, adds the frame header and trailer, and places the frame in the transmit queue. Once that is done, a soft interrupt notifies the driver that there are new frames in the transmit queue to send.
Finally, the driver reads the frames from the transmit queue via DMA and sends them out through the physical network card.
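The transmit side can be inspected in a similar way; a sketch, still assuming ens33 (tc and ip are both part of iproute2):
# the queueing discipline (qdisc) holding outgoing frames, with statistics
# such as sent bytes/packets, drops, and current backlog
$ tc -s qdisc show dev ens33
# per-interface RX/TX counters, a superset of what ifconfig prints
$ ip -s link show ens33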
The headers added by the various protocols ensure that the data can reach its destination correctly as it passes through the devices at each layer.
When the packet arrives, the receiving end strips off these header fields again, protocol by protocol, to recover the original data.
The figure below shows a client and a server communicating over these network protocols:
Linux builds its network protocol stack according to the TCP/IP model. The TCP/IP model consists of four layers: the application, transport, network, and network interface layers; these are also the core of the Linux network stack.
When an application sends data through the socket interface, the data is processed top-down through the protocol stack before finally being handed to the network card for transmission; when receiving, the data is processed bottom-up through the stack before being delivered to the application.
Common network-related commands
The first step in analyzing a network problem is usually to inspect the configuration and state of the network interfaces. You can use the ifconfig or ip command to do so.
ifconfig belongs to the net-tools package and ip to iproute2; iproute2 is the successor to net-tools, and both are normally installed by default in most distributions.
Taking the network interface ens33 as an example, you can run the following two commands to view its configuration and state:
[root@dev ~]# ifconfig ens33
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.2.129 netmask 255.255.255.0 broadcast 192.168.2.255
inet6 fe80::a923:989b:b165:8e3b prefixlen 64 scopeid 0x20<link>
ether 00:0c:29:d9:5e:32 txqueuelen 1000 (Ethernet)
RX packets 249 bytes 22199 (21.6 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 106 bytes 22636 (22.1 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
[root@dev ~]#
[root@dev ~]# ip -s addr show ens33
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 00:0c:29:d9:5e:32 brd ff:ff:ff:ff:ff:ff
inet 192.168.2.129/24 brd 192.168.2.255 scope global noprefixroute ens33
valid_lft forever preferred_lft forever
inet6 fe80::a923:989b:b165:8e3b/64 scope link noprefixroute
valid_lft forever preferred_lft forever
RX: bytes packets errors dropped overrun mcast
24877 279 0 0 0 0
TX: bytes packets errors dropped carrier collsns
24616 123 0 0 0 0
As you can see, ifconfig and ip report essentially the same metrics, just in slightly different formats: the interface state flags, the MTU, the IP address, subnet and MAC address, and the packet send/receive statistics.
A few fields deserve particular attention:
First, the interface state flags. RUNNING in the ifconfig output, or LOWER_UP in the ip output, means the physical link is up, i.e. the network card is connected to a switch or router. If you do not see them, the network cable has usually been unplugged.
Second, the MTU. The default MTU is 1500; depending on the network architecture (for example whether an overlay network such as VXLAN is used), you may need to raise or lower it.
Third, the interface's IP address, subnet, and MAC address. These are required for the network to work at all, so make sure they are configured correctly.
Fourth, the byte, packet, error, and drop counts for sending and receiving. When the errors, dropped, overruns, carrier, or collisions counters in the TX (transmit) and RX (receive) sections are non-zero, there is usually a network problem. Specifically (a counter-checking sketch follows this list):
• errors is the number of packets with errors, such as checksum errors or frame-alignment errors;
• dropped is the number of dropped packets, i.e. packets that had already reached the ring buffer but were dropped, for example because of insufficient memory;
• overruns is the number of overrun packets, i.e. drops that occur because network I/O is so fast that packets in the ring buffer cannot be processed in time (the queue is full);
• carrier is the number of packets with carrier errors, caused for example by a duplex mismatch or a faulty physical cable;
• collisions is the number of collision packets.
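Driver-level counters often give more detail than ifconfig does; a sketch using ethtool, assuming the interface ens33 (the exact statistic names depend on the NIC driver):
# dump all NIC/driver statistics and keep only error- and drop-related counters
$ ethtool -S ens33 | grep -iE 'err|drop'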
Socket information
The socket interface is the boundary between the kernel and applications in network programming. All data and control operations of the TCP/IP stack go through the socket interface; unlike the OSI model, the TCP/IP stack itself defines no further protocols above the transport layer.
In Linux, the standard interface that stands in for everything above the transport layer is the socket. It implements all functionality above the transport layer, so you can think of the socket as the window the TCP/IP stack exposes to applications.
ifconfig and ip only show per-interface packet statistics, but in real performance problems we must also look at the statistics inside the protocol stack. You can use netstat or ss to inspect sockets, the network stack, network interfaces, and the routing table.
I personally recommend ss for querying connection information, because it performs better (it is faster) than netstat.
For example, you can run the following commands to query socket information:
# head -n 4 shows only the first 4 lines
# -l shows only listening sockets
# -n shows numeric addresses and ports (instead of names)
# -p shows process information
[root@dev ~]# netstat -nlp | head -n 4
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 952/sshd
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN 11/master
# -l shows only listening sockets
# -t shows only TCP sockets
# -n shows numeric addresses and ports (instead of names)
# -p shows process information
$ ss -ltnp | head -n 4
The output of netstat and ss is similar; both show the socket state, receive queue, send queue, local address, remote address, process PID, and process name.
The receive queue (Recv-Q) and send queue (Send-Q) deserve special attention; they should normally be 0.
When they are not 0, packets are piling up somewhere. Note, too, that their meaning differs depending on the socket state.
When a socket is in the Established state,
• Recv-Q is the number of bytes in the socket buffer that the application has not yet read (the length of the receive queue);
• Send-Q is the number of bytes that have not yet been acknowledged by the remote host (the length of the send queue).
When a socket is in the Listening state,
• Recv-Q is the current length of the full-connection (accept) queue;
• Send-Q is the maximum length of the full-connection (accept) queue.
A so-called full connection is one for which the server has received the client's ACK and completed the TCP three-way handshake; the connection is then moved to the full-connection queue.
The sockets in this queue still have to be taken off by the accept() system call before the server can actually start processing the client's requests.
Corresponding to the full-connection queue there is also a half-connection queue. A half connection is one that has not yet completed the TCP three-way handshake; the connection is only halfway established.
After the server receives a SYN packet from the client, it puts the connection into the half-connection queue and then sends a SYN+ACK packet back to the client.
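A sketch of how you might check these two queues in practice; the commands and /proc paths below are standard, and the limits they report are tunables that differ between systems:
# for listening sockets, Send-Q is the configured accept-queue limit
# and Recv-Q is its current occupancy
$ ss -ltn
# system-wide cap on the full-connection (accept) queue
$ cat /proc/sys/net/core/somaxconn
# maximum size of the half-connection (SYN) queue
$ cat /proc/sys/net/ipv4/tcp_max_syn_backlog
# how often the accept queue has overflowed
$ netstat -s | grep -i listen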
Connection statistics
Similarly, you can use netstat or ss to view protocol stack information:
[root@dev ~]# netstat -s
...
Tcp:
1898 active connections openings
1502 passive connection openings
24 failed connection attempts
1304 connection resets received
178 connections established
133459 segments received
133428 segments send out
22 segments retransmited
0 bad segments received.
1400 resets sent
...
[root@dev ~]# ss -s
Total: 1700 (kernel 2499)
TCP: 340 (estab 178, closed 144, orphaned 0, synrecv 0, timewait 134/0), ports 0
Transport Total IP IPv6
* 2499 - -
RAW 1 0 1
UDP 5 3 2
TCP 196 179 17
INET 202 182 20
FRAG 0 0 0
These protocol-stack statistics are quite intuitive.
ss only shows brief summaries such as the number of established, closed, and orphaned sockets, whereas netstat provides more detailed protocol-stack information: active and passive connection openings, failed attempts, resets, the number of TCP segments sent and received, and so on.
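When debugging, retransmissions and resets are often the most telling of these counters; a small sketch that filters the netstat summary for them and watches how they grow:
# extract retransmission- and reset-related counters from the protocol-stack summary
$ netstat -s | grep -iE 'retrans|reset'
# re-run it every second and highlight changes
$ watch -d -n 1 "netstat -s | grep -iE 'retrans|reset'"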
Connectivity and latency
ping is usually used to test the connectivity and latency of a remote host; it is based on the ICMP protocol.
For example, by executing the following command, you can test the connectivity and latency from the local machine to the IP address 192.168.2.129:
[root@dev ~]# ping -c3 192.168.2.129
PING 192.168.2.129 (192.168.2.129) 56(84) bytes of data.
64 bytes from 192.168.2.129: icmp_seq=1 ttl=64 time=0.026 ms
64 bytes from 192.168.2.129: icmp_seq=2 ttl=64 time=0.016 ms
64 bytes from 192.168.2.129: icmp_seq=3 ttl=64 time=0.015 ms
--- 192.168.2.129 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.015/0.019/0.026/0.005 ms
The output of ping can be divided into two parts.
1. The first part is the information for each ICMP request: the ICMP sequence number (icmp_seq), the TTL (time to live, i.e. hop count), and the round-trip time.
2. The second part is a summary of the three ICMP requests.
For example, the output above shows that 3 packets were sent and 3 replies were received with no packet loss, which means the test host can reach 192.168.2.129; the average round-trip time (RTT) is 0.019 ms, i.e. the average time from sending the ICMP request to receiving the host's reply.