How to troubleshoot network packet loss problems
To understand a topic, first build a quick conceptual model of it. With that model in place, you can fill in the details, which makes it easier to grasp the essence of the topic.
What is bandwidth?
Bandwidth is the network's capacity to send data. It is affected by how fast the network card can copy packets into the kernel buffer, and by how fast packets can be moved from the kernel buffer into the network card's buffer. It is also affected by the receive window and the congestion window: if the peer's receiving capacity shrinks, throughput cannot increase.
As the network path gets longer, conditions become much more complicated. Packets may pass through multiple routers, or be exchanged over lines between different operators. Traffic between operators is extremely heavy, which can cause your packets to be lost or retransmitted. When deploying service nodes, if you are able to design the links, avoid this kind of cross-operator exchange where possible and optimize link selection across the whole transmission path. This is also one of the principles behind CDN global acceleration.
A CDN deploys many nodes around the world, and the service operator carefully orchestrates link selection between those nodes. This keeps the links across the whole path optimized and makes your packets less likely to be lost or retransmitted.
How network packets are sent and received
We need to understand how a network packet travels from the application out to the network.
Typically, an application initiates a network request, and the request data is written to the kernel's socket buffer. The kernel adds a TCP or UDP header to that data and hands it to the IP layer, which adds an IP header. The packet is then filtered through a series of firewall rules that decide whether it is dropped or continues on toward the network card. Finally, at the link layer, the packet is placed in the ring buffer on the network card, and the network card sends it out to the network. Packet loss can occur at every one of these steps.
Once we understand how packets are sent and received and have this conceptual model in mind, troubleshooting packet loss becomes much easier.
How to measure network quality
When monitoring application services, network quality is measured in much the same way as the health of other hardware resources.
As a general rule, we usually first look at the performance of network indicators at the system level, then look at the specific process that caused the abnormal performance, and then locate the problem code.
For the network specifically, how do we judge its quality at the system level, and which tools should we use?
At the system level, the network has several important indicators. MB/s is how many megabytes the network card sends or receives per second, and Mbps is how many megabits per second. Bandwidth is usually quoted in Mbps. To convert Mbps to MB/s, divide by 8: 100 Mbps of bandwidth is 12.5 MB/s.
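The unit conversion can be written down as a small sketch. The function names are made up for illustration, and the 10 MB/s observation is an invented example value.

```python
def mbps_to_mbytes_per_sec(mbps: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbps / 8


def utilization(observed_mbytes_per_sec: float, link_mbps: float) -> float:
    """Fraction of the link's nominal capacity that is currently in use."""
    return observed_mbytes_per_sec / mbps_to_mbytes_per_sec(link_mbps)


# A 100 Mbps link carries at most 12.5 MB/s.
print(mbps_to_mbytes_per_sec(100))
# An observed 10 MB/s on that link means it is 80% utilized.
print(utilization(10.0, 100))
```

A utilization close to 1.0 suggests that bandwidth itself may be the bottleneck.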
When choosing a server node, look beyond bandwidth: pps, the number of packets sent and received per second, is also limited.
When we run into network performance problems, we can first check whether these two indicators have reached a bottleneck on the machine. If the bandwidth is only 100 Mbps, use a tool to check the node's actual throughput. When it approaches that value, bandwidth has very likely become the bottleneck, and the machine's quota may need to be upgraded.
sar
# Use sar to report network interface activity once per second, 5 times in a row
sar -n DEV 1 5
- IFACE: the name of the network interface
- rxpck/s, txpck/s: the number of packets received or sent per second
- rxkB/s, txkB/s: the amount of data received or sent per second, in kB/s
- rxcmp/s, txcmp/s: the number of compressed packets received or sent per second
- rxmcst/s: the number of multicast packets received per second (multicast is point-to-multipoint communication)
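To make the per-second fields concrete, here is a minimal sketch that derives similar rates from two snapshots of /proc/net/dev on Linux. It is an illustration of the arithmetic, not how sar is implemented. The helper names and the sample snapshots are invented, and 1 kB is assumed to be 1024 bytes.

```python
def parse_proc_net_dev(text):
    """Parse /proc/net/dev content into {iface: cumulative counters}."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue
        counters[name.strip()] = {
            "rx_bytes": int(fields[0]),
            "rx_packets": int(fields[1]),
            "tx_bytes": int(fields[8]),
            "tx_packets": int(fields[9]),
        }
    return counters


def rates(before, after, interval_s):
    """Turn two cumulative snapshots into per-second rates."""
    result = {}
    for iface, new in after.items():
        old = before.get(iface)
        if old is None:
            continue
        result[iface] = {
            "rxpck/s": (new["rx_packets"] - old["rx_packets"]) / interval_s,
            "txpck/s": (new["tx_packets"] - old["tx_packets"]) / interval_s,
            "rxkB/s": (new["rx_bytes"] - old["rx_bytes"]) / 1024 / interval_s,
            "txkB/s": (new["tx_bytes"] - old["tx_bytes"]) / 1024 / interval_s,
        }
    return result


# Invented sample snapshots taken one second apart (not real measurements).
HEADER = (
    "Inter-|   Receive                                                |  Transmit\n"
    " face |bytes    packets errs drop fifo frame compressed multicast"
    "|bytes    packets errs drop fifo colls carrier compressed\n"
)
SAMPLE_T0 = HEADER + "  eth0: 1000000 1000 0 0 0 0 0 0 500000 800 0 0 0 0 0 0\n"
SAMPLE_T1 = HEADER + "  eth0: 2024000 2000 0 0 0 0 0 0 1012000 1300 0 0 0 0 0 0\n"

print(rates(parse_proc_net_dev(SAMPLE_T0), parse_proc_net_dev(SAMPLE_T1), 1.0))
```

On a real machine the two snapshots would come from reading /proc/net/dev twice with a sleep in between.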
After looking at the network at the system level, we can examine the problem more closely from the perspective of individual processes.
iftop
# https://www.tecmint.com/iftop-linux-network-bandwidth-monitoring-tool/
yum -y install libpcap libpcap-devel ncurses ncurses-devel
yum install epel-release
yum install -y iftop
iftop -P
iftop lists the throughput of each connection on the system, so you can find out which IP consumes the most traffic. More often than not, the problem is not that the system's network has hit a bottleneck, but that a process cannot keep up with the packets it has to handle.
nethogs
yum install nethogs
# Show the bandwidth used by each process
nethogs ens33
nethogs lists the traffic sent and received by each process, so you can see which process consumes the most traffic and narrow the problem down to it.
Go's trace tool (go tool trace) can analyze latency caused by network scheduling. It can also indirectly show that your program is scheduling network work frequently in a particular piece of code. Frequent scheduling may consume bandwidth and show up as a slight increase in latency. In this way the trace tool can also help us indirectly find the code that causes a network performance problem.
One of the more important questions in network performance is how to find where packets are being lost. Following the figure above [network packet transmission process], analyze from top to bottom, starting at the application layer. When an application calls listen on a socket, the three-way handshake involves two queues. First, when the server receives a SYN packet from a client, it places the connection in the semi-connection queue (the SYN queue), which holds connections that have sent a SYN but have not yet completed the three-way handshake, and replies to the client with a SYN+ACK. When the client receives the SYN+ACK, it replies to the server with an ACK. At that point the kernel moves the connection into the full connection queue (the accept queue). When the server calls accept, it takes a connection out of the full connection queue. If either of these two queues is full, packet loss may occur.
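The listen and accept calls in this description can be seen in a minimal loopback sketch. The function name and the backlog value of 128 are illustrative choices. On Linux the backlog passed to listen is additionally capped by the net.core.somaxconn kernel setting.

```python
import socket


def demo_accept_queue(backlog=128):
    """Open a listener, connect to it, then accept the queued connection."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.settimeout(2.0)          # avoid blocking forever in accept()
    server.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
    server.listen(backlog)          # backlog bounds the full connection queue
    port = server.getsockname()[1]

    # The kernel completes the three-way handshake on its own; the
    # connection then waits in the full connection queue until accept().
    client = socket.create_connection(("127.0.0.1", port), timeout=2.0)

    conn, peer = server.accept()    # take the connection out of the queue
    try:
        return peer[0], conn.getpeername() == client.getsockname()
    finally:
        conn.close()
        client.close()
        server.close()


print(demo_accept_queue())
```

Note that the client's connect call returns before the server calls accept, which is exactly why a slow server lets the full connection queue build up.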
First, the semi-connection queue. Its size is determined by kernel parameters and can be adjusted. A connection is established only after the three-way handshake completes, so under high concurrency this queue can easily fill up and cause packet loss. For this reason the kernel provides the tcp_syncookies parameter, which enables the SYN cookie mechanism. When the semi-connection queue overflows, the kernel does not simply discard the new SYN; instead it replies with a SYN+ACK that carries a SYN cookie. When the client's ACK arrives, the server verifies the cookie and then establishes the connection. This keeps the service available even when the semi-connection queue overflows.
How do we determine whether packet loss is caused by semi-connection queue overflow?
Search the kernel log with dmesg for TCP drop messages to find this kind of packet loss. dmesg shows the kernel's log, from which we can learn about some kernel behavior.
dmesg | grep "TCP: drop open request from"
Next, let's look at how to inspect the full connection queue. The ss command shows the size of the full connection queue for each listening service.
ss -lnt
# -l show listening sockets
# -n do not resolve service names
# -t show TCP sockets only
For a listening socket, Recv-Q is the current length of the full connection queue, that is, the number of TCP connections that have completed the three-way handshake and are waiting for the server to accept() them. Send-Q is the maximum length of the full connection queue. For example, a Send-Q of 128 for the TCP service listening on port 9000 means its full connection queue can hold at most 128 connections. Recv-Q is normally 0. If it stays above 0 for a long time, your service is accepting connections too slowly, which can fill the queue and cause new connections to be dropped. In that case you should speed up how quickly your service handles connections.
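As a rough sketch of how this check could be automated, the snippet below parses text in the shape of ss -lnt output and reports listening sockets whose Recv-Q is above 0. It reads Recv-Q as the current queue length and Send-Q as the maximum, which is how ss reports listening sockets. The function name and the sample output are invented for illustration.

```python
def busy_listeners(ss_output):
    """Return listening sockets whose full connection queue is not empty."""
    busy = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue  # skip the header line and non-listening sockets
        recv_q, send_q, local = int(fields[1]), int(fields[2]), fields[3]
        if recv_q > 0:
            busy.append({
                "local": local,
                "queued": recv_q,       # connections waiting for accept()
                "max_queue": send_q,    # maximum full connection queue length
                "fill": recv_q / send_q if send_q else None,
            })
    return busy


# Invented sample in the shape of `ss -lnt` output (not real measurements).
SAMPLE = """\
State   Recv-Q  Send-Q   Local Address:Port   Peer Address:Port
LISTEN  0       128            0.0.0.0:9000        0.0.0.0:*
LISTEN  100     128            0.0.0.0:8080        0.0.0.0:*
"""

for entry in busy_listeners(SAMPLE):
    print(entry)
```

A single non-zero reading is not a problem by itself; what matters is a Recv-Q that stays above 0 across repeated samples.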
For connections in the ESTAB state, ss is no longer describing a listening service but an established connection. Recv-Q is the number of bytes received but not yet read by the application, and Send-Q is the number of bytes sent but not yet acknowledged. These two indicators show whether the application is slow to read its data or the peer is slow to process what it receives. Both values are normally 0. If one of them is not, check whether the problem is on the client side or the server side.
When the full connection queue is full, the kernel by default discards the packet, but you can choose a different behavior. If tcp_abort_on_overflow is set to 1, the kernel sends a reset (RST) packet straight back to the client, which aborts the handshake and the connection.
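The kernel parameters that govern the two queues can be read from /proc/sys on Linux. The sketch below is a hypothetical helper for doing so. tcp_syncookies and tcp_abort_on_overflow come from the text above; tcp_max_syn_backlog and somaxconn are added here as the usual size limits for the semi-connection and full connection queues.

```python
from pathlib import Path

# Sysctl names written as paths relative to /proc/sys.
QUEUE_SYSCTLS = [
    "net/ipv4/tcp_syncookies",
    "net/ipv4/tcp_max_syn_backlog",
    "net/core/somaxconn",
    "net/ipv4/tcp_abort_on_overflow",
]


def parse_sysctl_value(text):
    """Parse the first integer in a sysctl file, or None if there is none."""
    parts = text.split()
    try:
        return int(parts[0]) if parts else None
    except ValueError:
        return None


def read_sysctls(names=QUEUE_SYSCTLS, root="/proc/sys"):
    """Read each sysctl; a missing or unreadable entry maps to None."""
    values = {}
    for name in names:
        try:
            values[name] = parse_sysctl_value((Path(root) / name).read_text())
        except OSError:
            values[name] = None
    return values


for name, value in read_sysctls().items():
    print(f"{name.replace('/', '.')} = {value}")
```

On systems without /proc/sys every value prints as None, so the sketch is safe to run anywhere.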
Below the application layer, the packet reaches the transport layer, where a firewall may be involved. If the firewall is enabled, it relies on the connection tracking table, nf_conntrack: Linux creates a connection record for every connection whose packets pass through the kernel network stack. When the server handles too many connections, the connection tracking table fills up and the server discards packets belonging to new connections. So packet loss is sometimes caused by a connection tracking table that is too small.
How to check the size of the connection tracking table
# Show the maximum number of entries in the nf_conntrack table
cat /proc/sys/net/netfilter/nf_conntrack_max
# Show the current number of entries in the nf_conntrack table
cat /proc/sys/net/netfilter/nf_conntrack_count
These files show the maximum size of the connection tracking table (nf_conntrack_max) and its current number of entries (nf_conntrack_count). When packets are being lost, compare the two to see whether the connection tracking table is full.
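Comparing the two values can be scripted. The following is a minimal sketch that reads both files and reports how full the table is. The function names and the 80% warning threshold are arbitrary choices for illustration.

```python
from pathlib import Path


def read_int(path):
    """Read an integer from a file, or return None if that is not possible."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def conntrack_usage(count, maximum):
    """Fraction of the connection tracking table in use, or None if unknown."""
    if not maximum:
        return None
    return count / maximum


def report(base="/proc/sys/net/netfilter", warn_at=0.8):
    count = read_int(f"{base}/nf_conntrack_count")
    maximum = read_int(f"{base}/nf_conntrack_max")
    if count is None or maximum is None:
        return "nf_conntrack values are not available on this system"
    usage = conntrack_usage(count, maximum)
    if usage is None:
        return "nf_conntrack_max is 0, usage cannot be computed"
    status = "WARNING: table is nearly full" if usage >= warn_at else "ok"
    return f"{count}/{maximum} entries ({usage:.0%}) {status}"


print(report())
```

When the table does overflow, the kernel log typically contains the message "nf_conntrack: table full, dropping packet", which can be searched for with dmesg.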
Below the transport layer are the network layer and the physical layer, and for these we need to look at the network card. The netstat command shows the packets received and dropped by each network card on the machine.
If the RX-DRP counter is greater than 0, the network card has dropped packets. The counters are cumulative since boot, so when analyzing, check whether the value has risen over a given period of time.
The RX-OVR counter records packets discarded because the network card's ring buffer was full.
Network card packet loss can be analyzed through netstat.
# netstat reports network packet loss and ring buffer overflows
netstat -i
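Because these counters are cumulative since boot, what matters is how much they grow between two readings. Below is a minimal sketch that compares two captures of netstat -i output. The function names and the sample text are invented, and the column layout is assumed to match the common net-tools format with an Iface header row.

```python
def parse_netstat_i(text):
    """Parse `netstat -i` output into {iface: {column: value}}."""
    header = None
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Iface":
            header = fields  # column names such as RX-DRP and RX-OVR
            continue
        if header is None or len(fields) != len(header):
            continue  # skip the title line and anything malformed
        row = dict(zip(header, fields))
        table[row["Iface"]] = row
    return table


def drop_deltas(before, after, columns=("RX-DRP", "RX-OVR")):
    """Growth of the drop counters between two snapshots."""
    deltas = {}
    for iface, row in after.items():
        if iface not in before:
            continue
        deltas[iface] = {c: int(row[c]) - int(before[iface][c]) for c in columns}
    return deltas


# Invented sample captures (not real measurements).
COLUMNS = "Iface MTU RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg\n"
T0 = ("Kernel Interface table\n" + COLUMNS
      + "eth0 1500 100000 0 12 0 90000 0 0 0 BMRU\n"
      + "lo 65536 500 0 0 0 500 0 0 0 LRU\n")
T1 = ("Kernel Interface table\n" + COLUMNS
      + "eth0 1500 180000 0 40 3 150000 0 0 0 BMRU\n"
      + "lo 65536 900 0 0 0 900 0 0 0 LRU\n")

print(drop_deltas(parse_netstat_i(T0), parse_netstat_i(T1)))
```

A non-zero delta during the period when the problem occurs points at the network card; a large but unchanging total only says that drops happened at some point since boot.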
netstat can also report packet loss at the network protocol layer, using netstat -s.
MTU
When application-layer data passes through the network layer, it is split into packets according to size before being sent.
When a TCP segment is handed to the network layer and the network layer finds that the packet would be larger than the MTU, the packet is fragmented. A network card can also be configured so that packets larger than the MTU are simply discarded. This is a packet loss problem that comes up often in practice.
When checking a link, this can be hard to troubleshoot if the path is long; on a shorter path it is easier to see the MTU at every hop. Check the MTU configured on each hop, since each network card may have a different MTU. A mismatch can cause packet loss, because whether a packet is forwarded depends on the MTU configured along the path: a packet larger than a hop's MTU may be dropped. If a packet exceeds the MTU of a network card and fragmentation is not allowed, the packet is lost.
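The effect of the MTU can be worked through with a small model of IPv4 fragmentation. This is a simplified sketch, not a description of any particular network stack. It assumes 20-byte IPv4 and TCP headers with no options, and the function names are made up.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def tcp_mss(mtu):
    """Largest TCP payload that fits in one unfragmented packet."""
    return mtu - IPV4_HEADER - TCP_HEADER


def ipv4_fragments(ip_payload_len, mtu, dont_fragment=False):
    """Payload size of each fragment, or None if the packet is dropped."""
    max_payload = mtu - IPV4_HEADER
    if ip_payload_len <= max_payload:
        return [ip_payload_len]  # fits as it is
    if dont_fragment:
        # Too big and fragmentation is not allowed: the packet is
        # discarded (routers normally answer with an ICMP
        # "fragmentation needed" message).
        return None
    sizes = []
    remaining = ip_payload_len
    while remaining > 0:
        size = min(remaining, max_payload)
        if size < remaining:
            size -= size % 8  # every fragment except the last is a multiple of 8 bytes
        sizes.append(size)
        remaining -= size
    return sizes


print(tcp_mss(1500))
print(ipv4_fragments(4000, 1500))
print(ipv4_fragments(1480, 1400))
print(ipv4_fragments(1480, 1400, dont_fragment=True))
```

The last two calls model a packet sized for a 1500-byte MTU arriving at a hop with a 1400-byte MTU: it is either split in two or, with fragmentation disallowed, dropped.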