Detailed explanation of epoll usage (essence)


epoll and select

Compared with select, the biggest advantage of epoll is that its efficiency does not drop as the number of monitored fds grows. The kernel implementation of select works by polling: the more fds it has to poll, the more time each call takes.

Moreover, there is this statement in the linux/posix_types.h header file:

#define __FD_SETSIZE 1024

This means select can monitor at most 1024 fds at the same time. The limit can be raised by modifying the header file and recompiling the kernel, but that does not cure the root cause.

1. I/O multiplexing: select

Compared with blocking and non-blocking I/O, the advantage of I/O multiplexing is that it can monitor many sockets without consuming too many resources. When the user process calls select, it blocks until one or more of the monitored sockets has data ready. The drawbacks of select are: a single process can only monitor a limited number of file descriptors; the fd set that select() maintains has to be copied between user space and the kernel, and the cost of that copy grows linearly with the number of descriptors; and because of network latency only part of the TCP connections are active at any given moment, yet select() still scans every socket linearly, which wastes a certain amount of overhead. Its benefit is that it is cross-platform.


2. epoll

epoll's ET must work on non-blocking sockets, and LT can also work on blocking sockets.

All I/O multiplexing operations are synchronous, including select/poll .

Blocking/non-blocking is relative to synchronous I/O and has nothing to do with asynchronous I/O.

select/poll/epoll itself is synchronous and can block or not block.

(Blocking/non-blocking and synchronous/asynchronous are different dimensions: blocking describes how the call itself behaves, while synchronous/asynchronous describes how it cooperates with the outside world.)

skater:

Blocking I/O, non-blocking I/O, and multiplexing built on non-blocking I/O are all synchronous calls: when read is called, the kernel copies the data from kernel space into the application's buffer (for epoll, supposedly via mmap), and the process has to wait for that copy to finish, which is what makes it synchronous. If the kernel's copy is not efficient, the read call waits correspondingly longer during this synchronous step.


epoll events:
EPOLLIN: the corresponding file descriptor is readable (including a normal close of the peer socket);
EPOLLOUT: the corresponding file descriptor is writable;
EPOLLPRI: the corresponding file descriptor has urgent data to read (i.e. out-of-band data has arrived);
EPOLLERR: an error occurred on the corresponding file descriptor;
EPOLLHUP: the corresponding file descriptor was hung up;

The core of epoll's efficiency: 1. user space and the kernel share memory via mmap; 2. data arrival is delivered through an event notification mechanism rather than polling.

epoll interface

The interface of epoll is very simple, with a total of three functions:

(1)epoll_create system call

The prototype of epoll_create in the C library is as follows.

int epoll_create(int size);

epoll_create returns a handle, and subsequent uses of epoll will be identified by this handle. The parameter size tells epoll the approximate number of events to be processed. When epoll is no longer used, you must call close to close this handle.

Note: the size parameter only tells the kernel the approximate number of events this epoll object will handle; it is not an upper limit. In recent Linux kernels the size parameter is ignored (it merely has to be greater than 0).
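A minimal sketch of creating and releasing an epoll instance, assuming nothing beyond the calls described above (on modern kernels epoll_create1(0) does the same job):

#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void)
{
    int epfd = epoll_create(256);   /* size is only a hint and must be > 0 */
    if (epfd == -1) {
        perror("epoll_create");
        return 1;
    }

    /* ... register fds with epoll_ctl() and loop on epoll_wait() here ... */

    close(epfd);                    /* the epoll handle must be closed like any other fd */
    return 0;
}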


(2)epoll_ctl system call

The prototype of epoll_ctl in the C library is as follows.

int epoll_ctl(int epfd, int op, int fd, struct epoll_event* event);

epoll_ctl adds, modifies or deletes the events of interest on the epoll object. It returns 0 on success and -1 on failure, in which case errno identifies the error. Any event that epoll_wait is expected to report must first be registered with epoll through epoll_ctl.

parameter:

epfd: The handle returned by epoll_create,

The meaning of op is as follows:

EPOLL_CTL_ADD: Register new fd into epfd;

EPOLL_CTL_MOD: Modify the listening events of registered fd;

EPOLL_CTL_DEL: Delete an fd from epfd;

fd: The socket handle fd that needs to be monitored,

event: A structure that tells the kernel what to monitor. The structure of struct epoll_event is as follows:

struct epoll_event {
    __uint32_t events;   /* Epoll events */
    epoll_data_t data;   /* User data variable */
};

The events field holds the events to be monitored (events of interest):

EPOLLIN: Indicates that the corresponding file descriptor can be read (including the peer SOCKET being closed normally);

EPOLLOUT: Indicates that the corresponding file descriptor can be written;

EPOLLPRI: Indicates that the corresponding file descriptor has urgent data to read (this should indicate the arrival of out-of-band data);

EPOLLERR: Indicates that an error occurred in the corresponding file descriptor;

EPOLLHUP: Indicates that the corresponding file descriptor is hung up;

EPOLLET: Set EPOLL to Edge Triggered mode, which is relative to Level Triggered.

EPOLLONESHOT: monitor the event only once. After the event has been delivered, if you still need to monitor the socket you must re-arm it in the epoll set.
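A small sketch of how an EPOLLONESHOT descriptor is typically re-armed after its event has been handled; the helper name is illustrative, not from the original text:

#include <stdio.h>
#include <sys/epoll.h>

/* After a worker has finished handling a one-shot event, the fd must be
 * re-armed with EPOLL_CTL_MOD or epoll will never report it again. */
static void rearm_oneshot(int epfd, int fd)
{
    struct epoll_event ev;
    ev.data.fd = fd;
    ev.events  = EPOLLIN | EPOLLET | EPOLLONESHOT;  /* the interest set must be restated in full */
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev) == -1)
        perror("epoll_ctl(EPOLL_CTL_MOD)");
}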


The data member is an epoll_data union, defined as follows:

typedef union epoll_data {

void *ptr;

int fd;

uint32_t u32;

uint64_t u64;

} epoll_data_t;

As you can see, how the data member is used depends on the application. For example, the ngx_epoll_module module only uses the ptr member of the union, as a pointer to an ngx_connection_t connection. In our own projects we also generally use the ptr member, because it can point to the address of any structure.

(3) epoll_wait system call

The prototype of epoll_wait in the C library is as follows:

int epoll_wait(int epfd,struct epoll_event* events,int maxevents,int timeout);

epoll_wait collects the events that have occurred among those being monitored. If no event has occurred, it waits at most timeout milliseconds before returning. The return value is the number of events that occurred; 0 means no event occurred during this call, and -1 means an error occurred, in which case errno identifies the error type.

epfd: descriptor of epoll.

events: a pre-allocated array of epoll_event structures. epoll copies the events that have occurred into this array (events must not be a null pointer; the kernel only copies data into it and does not allocate user-space memory for us, which keeps the call efficient).

maxevents: Indicates the maximum number of events that can be returned this time. Usually the maxevents parameter is equal to the size of the pre-allocated events array.

timeout: the maximum time to wait, in milliseconds, when no event has been detected. If timeout is 0, epoll_wait returns immediately even when the ready list (rdllist) is empty, without waiting.

epoll has two working modes: LT (level trigger) mode and ET (edge ​​trigger) mode.

By default epoll works in LT mode, which can handle both blocking and non-blocking sockets; EPOLLET in the list above switches an event to ET mode. ET mode is more efficient than LT mode, but it only supports non-blocking sockets.

(Level trigger LT: when a read/write event occurs on a monitored file descriptor, epoll_wait() notifies the handler to read or write. If the data is not fully read or written this time (for example because the read/write buffer is too small), the next call to epoll_wait() will notify you again about the file descriptor that was not fully read or written last time.

Edge trigger ET: when a read/write event occurs on a monitored file descriptor, epoll_wait() notifies the handler to read or write. If the data is not fully read or written this time (for example because the read/write buffer is too small), the next call to epoll_wait() will not notify you again; it notifies only once, until the next readable/writable event occurs on that file descriptor.

So with level triggering, if the system has a large number of ready file descriptors that you do not actually need to read or write, they are returned on every call, which greatly reduces the efficiency of retrieving the ready descriptors you do care about; with edge triggering you are not flooded with ready descriptors you do not care about, so the performance difference is obvious.)

How to use epoll

1. Include a header file #include <sys/epoll.h>

2. Call epoll_create(int maxfds) to create an epoll handle, where maxfds is a hint about how many handles your epoll will monitor. The function returns a new epoll handle, and all subsequent operations go through this handle. After use, remember to close() the epoll handle you created.

3. Then, in your network main loop, call epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout) on every iteration to query all network interfaces and find out which can be read and which can be written. The basic call looks like:

nfds = epoll_wait(kdpfd, events, maxevents, -1);

Here kdpfd is the handle created with epoll_create, and events is the array of struct epoll_event that receives the result. When epoll_wait succeeds, all the read/write events are stored in this events array. maxevents is the number of socket handles that currently need to be monitored (the capacity of the array).

The last timeout: is the timeout of epoll_wait,

When it is 0, it means returning immediately.

When it is -1, it means to wait until an event returns.

When it is any positive integer, it means waiting for such a long time. If there is no event, return.

Generally, if the network main loop is a separate thread, you can use -1 to wait, which can ensure some efficiency. If it is in the same thread as the main logic, you can use 0 to ensure the efficiency of the main loop.

After epoll_wait returns there should be a loop that iterates over all returned events.

epoll registers a simple file system inside the Linux kernel (file systems are generally implemented with what data structure? a B+ tree). It splits the single select/poll call into three parts:

1) Call epoll_create() to create an epoll object (allocating resources for this handle in the epoll file system)

2) Call epoll_ctl to add the connected sockets (possibly millions of them) to the epoll object

3) Call epoll_wait to collect the connections on which events have occurred

epoll program framework

Almost all epoll programs use the following framework:

pseudocode:

listenfd is a global variable, the fd of the socket that the server listens to.

A simple test of epoll_wait's return value

void test(int epollfd)
{
  struct epoll_event events[MAX_EVENT_NUMBER];
  int number, i, sockfd;

  while (1)
  {
    number = epoll_wait(epollfd, events, MAX_EVENT_NUMBER, -1);
    printf("number : %2d\n\n", number);
    for (i = 0; i < number; i++)
    {
      sockfd = events[i].data.fd;

      if (sockfd == listenfd)
      {/*用户上线*/

      }
      else if (events[i].events & EPOLLIN)
      {/*有数据可读*/

      }
      else if (events[i].events & EPOLLOUT)
      {/*有数据可写*/

      }
      else
      {/*出错*/

      }
    }
  }
}

Testing shows that the value number returned by epoll_wait never exceeds MAX_EVENT_NUMBER.

During the test the number of connected clients was far larger than MAX_EVENT_NUMBER, from which we can infer: each epoll_wait() call returns the number of active clients and fills events[MAX_EVENT_NUMBER] with the information of those active clients.

It follows that, for the same number of active clients, the larger events[MAX_EVENT_NUMBER] is, the fewer times epoll_wait() has to be called, but a larger array also consumes more memory.

So MAX_EVENT_NUMBER should be chosen as a balance between efficiency and memory usage.

Sample code

for( ; ; )
{
nfds = epoll_wait(epfd,events,20,500);
for(i=0;i<nfds;++i)
{
if(events[i].data.fd==listenfd) //a new connection has arrived on the listening socket listenfd
{
connfd = accept(listenfd,(sockaddr *)&clientaddr, &clilen); //accept the connection
ev.data.fd=connfd; //store the new connection's fd in ev
ev.events=EPOLLIN|EPOLLET; //watch EPOLLIN on this connection, edge-triggered
epoll_ctl(epfd,EPOLL_CTL_ADD,connfd,&ev); //add ev to epoll's interest list
}
else if( events[i].events&EPOLLIN ) //data received, read the socket
{
sockfd = events[i].data.fd;
n = read(sockfd, line, MAXLINE); //read
ev.data.ptr = md; //md is a user-defined structure carrying the data to send back
ev.events=EPOLLOUT|EPOLLET;
epoll_ctl(epfd,EPOLL_CTL_MOD,sockfd,&ev);//switch the interest to EPOLLOUT and send on a later iteration: the essence of asynchronous handling
//epfd is the fd returned by epoll_create()
}
else if(events[i].events&EPOLLOUT) //data is ready to send, write the socket
{
struct myepoll_data* md = (myepoll_data*)events[i].data.ptr; //fetch the data
sockfd = md->fd;
send( sockfd, md->ptr, strlen((char*)md->ptr), 0 ); //send the data

ev.data.fd=sockfd;
ev.events=EPOLLIN|EPOLLET;
epoll_ctl(epfd,EPOLL_CTL_MOD,sockfd,&ev); //switch back to EPOLLIN to receive data on a later iteration
}
else
{
//other handling
}
}
}

General process

 struct epoll_event ev, event_list[EVENT_MAX_COUNT];//ev is used to register events, event_list receives the events to handle

listenfd = socket(AF_INET, SOCK_STREAM, 0);
if(0 != bind(listenfd, (struct sockaddr *)&serveraddr, sizeof(serveraddr)))
if(0 != listen(listenfd, LISTENQ)) //LISTENQ is a macro: #define LISTENQ 20

ev.data.fd = listenfd; //the file descriptor associated with the event to handle
ev.events = EPOLLIN | EPOLLET; //the event types to handle: EPOLLIN means the fd is readable, EPOLLET means notify only on state changes


epfd = epoll_create(256); //create the epoll file descriptor used to handle accept
//register the epoll event
epoll_ctl(epfd, EPOLL_CTL_ADD, listenfd, &ev); //epfd: epoll instance, EPOLL_CTL_ADD: add, listenfd: the socket, ev: the event (listen on listenfd)


nfds = epoll_wait(epfd, event_list, EVENT_MAX_COUNT, TIMEOUT_MS); //wait for epoll events

1. First get familiar with the three interfaces of epoll

int epoll_create(int size);

Create epoll related data structures, the most important of which is

1. Red-black tree, used to store file handles and events that need to be monitored

2. Ready linked list, used to store triggered file handles and events

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

Used to set, modify, or delete monitored file handles and events

int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

Blocks for at most the given timeout; if a monitored event is triggered on one of the file handles, epoll_wait returns, the triggered events are written into the events parameter, and the number of triggered events is returned as the return value.

2. How to use these three interfaces to write a server

  • First use epoll_create to create epoll related data structures

  • Secondly, create the TCP listening socket acceptfd, bind it to (ip:port), start listening, register it in epoll with epoll_ctl, and monitor the EPOLLIN (readable) event on acceptfd.

  • Call epoll_wait to start blocking waiting

  • If a client connects, the EPOLLIN event on acceptfd fires. When epoll_wait returns, the information about the triggered event is available as a struct epoll_event object. We can check whether that event's fd equals acceptfd; if so, a new connection has arrived, so we accept it to obtain clientfd, register clientfd with epoll_ctl, and monitor the EPOLLIN event on clientfd. At this point the connection between the client and the server is established.

           typedef union epoll_data {
void *ptr;
int fd; //either fd or ptr can be used to keep track of the file handle associated with the event
uint32_t u32;
uint64_t u64;
} epoll_data_t;

struct epoll_event {
uint32_t events; /* Epoll events */
epoll_data_t data; /* User data variable */
};
  • When the user types a command on the client, the EPOLLIN event on the server's clientfd fires. epoll_wait returns the triggered event, the server reads the data from clientfd's kernel buffer, parses the protocol, executes the command and obtains the result. To return this result to the client, register the EPOLLOUT event of clientfd with epoll_ctl. At this point clientfd has both EPOLLIN and EPOLLOUT registered, which is not actually necessary: the client is only waiting for the result and will not send another command, so the EPOLLOUT interest should be removed again with epoll_ctl once it is no longer needed.

  • If the clientfd kernel buffer is writable, epoll_wait will return at this time and return the epoll_out event. At this time, the returned result data will be written to clientfd and returned to the client.

Example source code

The original text contains quite a few errors:

The following headers need to be added and the errors fixed:

//'/0' -> '\0'
//bzero() replaced with memset (note the parameters differ: bzero sets the first n bytes to 0, memset sets the first n bytes to the value c)
//local_addr changed from char* to string
#include <cstdlib>
#include <stdlib.h> //atoi
#include <string.h> //memset
#include <string> //std:cout 等

Corrected source code (C++, Linux):

#include <iostream>
#include <sys/socket.h>
#include <sys/epoll.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>

//'/0' -> '\0'
//bzero() replaced with memset (note the parameters differ: bzero sets the first n bytes to 0, memset sets the first n bytes to the value c)
//local_addr changed from char* to string
#include <cstdlib>
#include <stdlib.h> //atoi
#include <string.h> //memset
#include <string> //std:cout 等



using namespace std;

#define MAXLINE 255 //read/write buffer size
#define OPEN_MAX 100
#define LISTENQ 20 //second argument of listen: size of the queue of not-yet-accepted TCP connections (on Linux > 2.6 it is the queue before accept)
#define SERV_PORT 5000
#define INFTIM 1000

#define TIMEOUT_MS 500
#define EVENT_MAX_COUNT 20

void setnonblocking(int sock)
{
int opts;
opts = fcntl(sock, F_GETFL);
if(opts < 0)
{
perror("fcntl(sock,GETFL)");
exit(1);
}
opts = opts | O_NONBLOCK;
if(fcntl(sock, F_SETFL, opts) < 0)
{
perror("fcntl(sock,SETFL,opts)");
exit(1);
}
}

int main(int argc, char *argv[])
{
int i, maxi, listenfd, connfd, sockfd, epfd, nfds, portnumber;
ssize_t n;
char line_buff[MAXLINE];



if ( 2 == argc )
{
if( (portnumber = atoi(argv[1])) < 0 )
{
fprintf(stderr, "Usage:%s portnumber/r/n", argv[0]);
//fprintf()函数根据指定的format(格式)(格式)发送信息(参数)到由stream(流)指定的文件
//printf 将内容发送到Default的输出设备,通常为本机的显示器,fprintf需要指定输出设备,可以为文件,设备。
//stderr
return 1;
}
}
else
{
fprintf(stderr, "Usage:%s portnumber/r/n", argv[0]);
return 1;
}

//declare epoll_event variables: ev is used to register events, the array receives the events to handle
struct epoll_event ev, event_list[EVENT_MAX_COUNT];

//create the epoll file descriptor used to handle accept
epfd = epoll_create(256); //create an epoll fd: the kernel allocates space to record which of the watched socket fds have events. size is the maximum number of socket fds this epoll fd is expected to watch; choose whatever fits your memory.

struct sockaddr_in clientaddr;
socklen_t clilenaddrLen;
struct sockaddr_in serveraddr;

listenfd = socket(AF_INET, SOCK_STREAM, 0);//in Unix/Linux "everything is a file": create the (socket) file, its id is listenfd
if (listenfd < 0)
{
printf("socket error,errno %d:%s\r\n",errno,strerror(errno));
}
//set the socket to non-blocking mode
//setnonblocking(listenfd);

//the file descriptor associated with the event to handle
ev.data.fd = listenfd;

//the event types to handle
ev.events = EPOLLIN | EPOLLET; //EPOLLIN: the fd is readable; EPOLLET: notify only on state changes
//ev.events=EPOLLIN;

//register the epoll event
epoll_ctl(epfd, EPOLL_CTL_ADD, listenfd, &ev); //epfd: epoll instance, EPOLL_CTL_ADD: add, listenfd: the socket, ev: the event (listen on listenfd)


memset(&serveraddr, 0, sizeof(serveraddr));
serveraddr.sin_family = AF_INET;
serveraddr.sin_addr.s_addr=htonl(INADDR_ANY); /*INADDR_ANY is 0.0.0.0, i.e. all of this machine's IP addresses*/
serveraddr.sin_port = htons(portnumber);

if(0 != bind(listenfd, (struct sockaddr *)&serveraddr, sizeof(serveraddr)))
{
printf("bind error,errno %d:%s\r\n",errno,strerror(errno));
}

if(0 != listen(listenfd, LISTENQ)) //LISTENQ is defined as a macro above
{
printf("listen error,errno %d:%s\r\n",errno,strerror(errno));
}

maxi = 0;

for ( ; ; )
{

//wait for epoll events
nfds = epoll_wait(epfd, event_list, EVENT_MAX_COUNT, TIMEOUT_MS); //epoll_wait(int epfd, struct epoll_event *event_list, int maxevents, int timeout), returns the number of events to handle

//handle all events that occurred
for(i = 0; i < nfds; ++i)
{
if(event_list[i].data.fd == listenfd) //a new client has connected to the bound port: establish the new connection
{
clilenaddrLen = sizeof(struct sockaddr_in);//addrLen must be initialized before calling accept(), e.g. addrLen = sizeof(clientaddr) or sizeof(struct sockaddr_in)
connfd = accept(listenfd, (struct sockaddr *)&clientaddr, &clilenaddrLen);//(see https://blog.csdn.net/David_xtd/article/details/7087843 for details on accept)
if(connfd < 0)
{
//perror("connfd<0:connfd= %d",connfd);
printf("connfd<0,accept error,errno %d:%s\r\n",errno,strerror(errno));
exit(1);
}

//setnonblocking(connfd);

char *str = inet_ntoa(clientaddr.sin_addr);//convert the 32-bit network-byte-order binary IP address into dotted-decimal form

cout << "accept a connection from " << str << endl;

//the file descriptor used for the read operation
ev.data.fd = connfd;

//the read event to register
ev.events = EPOLLIN | EPOLLET;
//ev.events=EPOLLIN;

//register ev
epoll_ctl(epfd, EPOLL_CTL_ADD, connfd, &ev); //add the accepted fd to epoll (one more object to monitor)
}
else if(event_list[i].events & EPOLLIN) //an already-connected client has sent data: read it
{
cout << "EPOLLIN" << endl;
if ( (sockfd = event_list[i].data.fd) < 0)
continue;


if ( (n = read(sockfd, line_buff, MAXLINE)) < 0) //if the fd holds less data than requested, does read block?
{
//when read() or write() returns -1, errno generally has to be checked
if (errno == ECONNRESET)//the client forcibly closed the socket while the server was still trying to read
{
close(sockfd);
event_list[i].data.fd = -1;
}
else
std::cout << "readline error" << std::endl;
continue; //do not fall through and use a negative n
}
else if (n == 0) //n == 0 means the client has closed the connection
{
close(sockfd);
event_list[i].data.fd = -1;
continue;
}

line_buff[n] = '\0';
cout << "read " << line_buff << endl;

//the file descriptor used for the write operation
ev.data.fd = sockfd;

//the write event to register
ev.events = EPOLLOUT | EPOLLET; //EPOLLOUT: the corresponding file descriptor is writable

//change the event monitored on sockfd to EPOLLOUT
//epoll_ctl(epfd,EPOLL_CTL_MOD,sockfd,&ev);

}
else if(event_list[i].events & EPOLLOUT) //there is data to send: write the socket
{
sockfd = event_list[i].data.fd;
write(sockfd, line_buff, n);

//the file descriptor used for the read operation
ev.data.fd = sockfd;

//the read event to register
ev.events = EPOLLIN | EPOLLET;

//change the event monitored on sockfd back to EPOLLIN
epoll_ctl(epfd, EPOLL_CTL_MOD, sockfd, &ev);
}
}
}
return 0;
}

Compilation command

Compile under linux: g++ epoll.cpp -o epoll

Simple test from command line

curl 192.168.0.250:5000 -d "phone=123456789&name=Hwei"

related information

How to dynamically change the number of listeners?

If the backlog is specified as a constant in the source code, increasing it requires recompiling the server program. Instead, we can give it a default value but allow that value to be overridden by a command line option or an environment variable.

void Listen(int fd, int backlog)
{
char *ptr;

if((ptr = getenv("LISTENQ")) != NULL)
backlog = atoi(ptr);

if(listen(fd, backlog) < 0)
printf("listen error\n");
}

How to handle the situation when the queue is full?

When a client SYN arrives, if the queue is full, TCP ignores the segment, that is, no RST is sent.

The reasoning is that the queue being full is a temporary condition. If the client TCP receives no RST, it will retransmit the SYN, and the request can be handled once the queue has room. If the server TCP responded with an RST immediately, the client's connect call would return an error right away, forcing the application to deal with the situation instead of letting TCP retransmit the SYN. Moreover, the client cannot tell from an RST whether "the queue is full" or "the port is not listening".

SYN flood attack

A SYN flood sends a large number of SYNs to a target server to fill up the incomplete-connection queue of one or more TCP ports. The source IP address of each SYN is set to a random value (IP spoofing), which prevents the attacked server from learning the attacker's real IP address. With the incomplete-connection queue filled by forged SYNs, legitimate SYNs cannot be queued, and service to legitimate users is denied.

Defense methods:

  1. Methods for the server host. Increase the connection buffer queue length and shorten the timeout for connection requests occupying the buffer queue. This method is the simplest and is used by many operating systems, but it also has the weakest defense performance.

  2. Router filtering method. Since DDoS attacks, including SYN-Flood, all use address masquerading technology, using rules on the router to filter out packets considered to be address masquerading will effectively curb attack traffic.

  3. Methods for firewalls. Use a firewall-based gateway to test the legitimacy of SYN requests before they connect to the real server. It is a commonly adopted defense mechanism specifically against SYN-Flood attacks.

SYN: Synchronize Sequence Numbers


Their meanings are:
SYN means establish a connection,
FIN means close a connection,
ACK means acknowledge,
PSH means there is DATA to transfer,
RST means reset the connection.

SYN (synchronous establishment of connection)

ACK(acknowledgement confirmation)

PSH (push)

FIN (finish)

RST (reset)

URG (urgent)

Sequence number

Acknowledge number

Three handshakes:

In the TCP/IP protocol suite, TCP provides a reliable connection service and uses a three-way handshake to establish a connection.
First handshake: to establish the connection, the client sends a SYN packet (syn=j) to the server and enters the SYN_SEND state, waiting for the server's confirmation;
Second handshake: the server receives the SYN packet and must acknowledge the client's SYN (ack=j+1) while also sending its own SYN packet (syn=k), i.e. a SYN+ACK packet; the server then enters the SYN_RECV state;
Third handshake: the client receives the server's SYN+ACK packet and sends an acknowledgement packet ACK (ack=k+1) to the server. Once this packet is sent, the client and server enter the ESTABLISHED state, the three-way handshake is complete, and the two sides begin transferring data.

The first handshake: Host A sends a bit code of syn=1 and randomly generates a data packet with seq number=1234567 to the server. Host B knows that SYN=1, and A requests to establish a connection;

Second handshake: Host B needs to confirm the connection information after receiving the request, and sends ack number=(host A’s seq+1), syn=1, ack=1 to A, and randomly generates a packet with seq=7654321;

The third handshake: After receiving it, host A checks whether the ack number is correct, that is, the seq number + 1 sent for the first time, and whether the bit code ack is 1. If it is correct, host A will send another ack number = (host B's seq+1), ack=1. After receiving it, host B confirms the seq value and ack=1, and the connection is successfully established.

Example code two

Multi-process Epoll:


#include <sys/types.h>
#include <sys/socket.h>
#include <sys/epoll.h>
#include <netdb.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/wait.h>
#define PROCESS_NUM 10
static int
create_and_bind (char *port)
{
int fd = socket(PF_INET, SOCK_STREAM, 0);
struct sockaddr_in serveraddr;
serveraddr.sin_family = AF_INET;
serveraddr.sin_addr.s_addr = htonl(INADDR_ANY);
serveraddr.sin_port = htons(atoi(port));
bind(fd, (struct sockaddr*)&serveraddr, sizeof(serveraddr));
return fd;
}
static int
make_socket_non_blocking (int sfd)
{
int flags, s;

flags = fcntl (sfd, F_GETFL, 0);
if (flags == -1)
{
perror ("fcntl");
return -1;
}

flags |= O_NONBLOCK;
s = fcntl (sfd, F_SETFL, flags);
if (s == -1)
{
perror ("fcntl");
return -1;
}

return 0;
}

#define MAXEVENTS 64

int
main (int argc, char *argv[])
{
int sfd, s;
int efd;
struct epoll_event event;
struct epoll_event *events;

sfd = create_and_bind("1234");
if (sfd == -1)
abort ();

s = make_socket_non_blocking (sfd);
if (s == -1)
abort ();

s = listen(sfd, SOMAXCONN);
if (s == -1)
{
perror ("listen");
abort ();
}

efd = epoll_create(MAXEVENTS);
if (efd == -1)
{
perror("epoll_create");
abort();
}

event.data.fd = sfd;
//event.events = EPOLLIN | EPOLLET;
event.events = EPOLLIN;
s = epoll_ctl(efd, EPOLL_CTL_ADD, sfd, &event);
if (s == -1)
{
perror("epoll_ctl");
abort();
}

/* Buffer where events are returned */
events = calloc(MAXEVENTS, sizeof event);
int k;
for(k = 0; k < PROCESS_NUM; k++)
{
int pid = fork();
if(pid == 0)
{

/* The event loop */
while (1)
{
int n, i;
n = epoll_wait(efd, events, MAXEVENTS, -1);
printf("process %d return from epoll_wait!\n", getpid());
/* sleep here is very important!*/
//sleep(2);
for (i = 0; i < n; i++)
{
if ((events[i].events & EPOLLERR) || (events[i].events & EPOLLHUP)
|| (!(events[i].events & EPOLLIN)))
{
/* An error has occurred on this fd, or the socket is not
ready for reading (why were we notified then?) */

fprintf (stderr, "epoll error\n");
close (events[i].data.fd);
continue;
}
else if (sfd == events[i].data.fd)
{
/* We have a notification on the listening socket, which
means one or more incoming connections. */

struct sockaddr in_addr;
socklen_t in_len;
int infd;
char hbuf[NI_MAXHOST], sbuf[NI_MAXSERV];

in_len = sizeof in_addr;
infd = accept(sfd, &in_addr, &in_len);
if (infd == -1)
{
printf("process %d accept failed!\n", getpid());
break;
}
printf("process %d accept successed!\n", getpid());

/* Make the incoming socket non-blocking and add it to the
list of fds to monitor. */

close(infd);
}
}
}
}
}
int status;
wait(&status);
free (events);
close (sfd);
return EXIT_SUCCESS;
}

Create test code for 2000+ links


#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include<arpa/inet.h>
#include <fcntl.h>
#include <errno.h>

const int MAXLINE = 5;
int count = 1;

static int make_socket_non_blocking(int fd)
{
int flags, s;

flags = fcntl (fd, F_GETFL, 0);
if (flags == -1)
{
perror ("fcntl");
return -1;
}

flags |= O_NONBLOCK;
s = fcntl (fd, F_SETFL, flags);
if (s == -1)
{
perror ("fcntl");
return -1;
}

return 0;
}

void sockconn()
{
int sockfd;
struct sockaddr_in server_addr;
struct hostent *host;
char buf[100];
unsigned int value = 1;

host = gethostbyname("127.0.0.1");
sockfd = socket(AF_INET, SOCK_STREAM, 0);
if (sockfd == -1) {
perror("socket error\r\n");
return;
}

//setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &value, sizeof(value));

//make_socket_non_blocking(sockfd);

bzero(&server_addr, sizeof(server_addr));

server_addr.sin_family = AF_INET;
server_addr.sin_port = htons(8080);
server_addr.sin_addr = *((struct in_addr*) host->h_addr);

int cn = connect(sockfd, (struct sockaddr *) &server_addr,
sizeof(server_addr));
if (cn == -1) {
printf("connect error errno=%d\r\n", errno);
return;

}
// char *buf = "h";
sprintf(buf, "%d", count);
count++;
write(sockfd, buf, strlen(buf));
close(sockfd);

printf("client send %s\r\n", buf);

return;
}

int main(void) {

int i;
for (i = 0; i < 2000; i++)
{
sockconn();
}

return 0;
}

About the two working modes of ET and LT

Level trigger LT:

LT is similar to select and poll: when a read/write event occurs on a monitored file descriptor, epoll_wait() notifies the handler to read or write. If the data is not fully read or written this time (for example because the read/write buffer is too small), the next call to epoll_wait() will notify you again about the file descriptor that was not fully read or written last time.

Edge trigger ET:

When a read/write event occurs on a monitored file descriptor, epoll_wait() notifies the handler to read or write. If the data is not fully read or written this time (for example because the read/write buffer is too small), the next call to epoll_wait() will not notify you again; it notifies only once, until the next readable/writable event occurs on that file descriptor.

Level trigger: keeps triggering as long as there is data in the buffer.

Edge trigger: triggers only at the moment data arrives in the buffer.

So with level triggering, if the system has a large number of ready file descriptors that you do not actually need to read or write (fds that have data but that you are not going to process), they are returned on every call, which greatly reduces the efficiency of retrieving the ready descriptors you do care about; with edge triggering you are not flooded with ready descriptors you do not care about, so the performance difference is obvious.



4. About the two working modes of ET and LT

It can be concluded that:

ET mode notifies only when the status changes, and a "status change" here does not include the case where unprocessed data is still sitting in the buffer. In other words, to use ET mode you must keep reading/writing until the call returns an error (EAGAIN). Many people report that in ET mode they receive only part of the data and are then never notified again; this is usually the reason. LT mode, by contrast, keeps notifying as long as there is unprocessed data.

Things to note when reading and writing data in epoll

Calling the read/write functions on a non-blocking socket may return EAGAIN or EWOULDBLOCK (note: EAGAIN is the same as EWOULDBLOCK).


Literally, it means:


EAGAIN: Try again

EWOULDBLOCK: If this is a blocking socket, the operation will be blocked

perror output: Resource temporarily unavailable

Summarize:

This error indicates that resources are temporarily insufficient. There may be no data in the read buffer during read, or the write buffer may be full during write.

In this case, if the socket is blocked, read/write will be blocked. If it is a non-blocking socket, read/write returns -1 immediately, and errno is set to EAGAIN.

So for blocking sockets, read/write returns -1, which means there is an error in the network. But for non-blocking sockets, read/write returns -1, which does not necessarily mean that the network is really wrong. It may be that the Resource is temporarily unavailable. At this time you should try again until the Resource is available.

This article mainly talks about the precautions for using the read and write data interface under the epoll model (not entirely for epoll)

1、read write

The function prototype is as follows:

#include <unistd.h>
ssize_t read(int filedes, void *buf, size_t nbytes);
ssize_t write(int filedes, const void *buf, size_t nbytes);

read returns the number of bytes actually read, which may well be less than the nbytes requested. The cases are:

① Return value greater than 0: the read is normal and the value is the number of bytes actually read.

② Return value equal to 0: end of file on filedes has been reached (for a socket, the peer has closed the connection); logically, read has finished reading the data.

③ Return value less than 0 (-1): a read error, which when handling network requests may be caused by a network problem. Note that when -1 is returned with errno set to EAGAIN or EWOULDBLOCK, it means the kernel's read buffer for this fd is empty.

The number of bytes actually written returned by write is normally equal to the requested nbytes; if it is not, something abnormal happened. In particular, when -1 is returned with errno set to EAGAIN or EWOULDBLOCK, it means the kernel's write buffer for this fd is full. Note that EAGAIN is equivalent to EWOULDBLOCK.

In short, this error indicates that the resources are temporarily insufficient. There may be no data in the read buffer during read, or the write buffer may be full during write. In this case, if the socket is blocked, read/write will be blocked. If it is a non-blocking socket, read/write returns -1 immediately, and errno is set to EAGAIN.

So for blocking sockets, read/write returns -1, which means there is an error in the network. But for non-blocking sockets, read/write returns -1, which does not necessarily mean that the network is really wrong. The buffer may be empty or full, in which case you should try again until Resource is available.

In summary, for non-blocking sockets, the correct read and write operations are:

LT mode

Reading: Ignore the error of errno = EAGAIN and continue reading next time;

Write: Ignore the error of errno = EAGAIN and continue writing next time.

For the LT mode of select and epoll, this reading and writing method is no problem. But for epoll's ET mode, this method still has loopholes.
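A minimal sketch of the LT-mode style described above, assuming a handler that is called once per EPOLLIN event (the function name and buffer handling are illustrative, not from the original):

#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* LT mode: one read per event is enough; on EAGAIN simply do nothing and
 * wait for the next epoll_wait() notification. */
static void on_readable_lt(int fd, char *buf, size_t bufsize)
{
    ssize_t n = read(fd, buf, bufsize);
    if (n > 0) {
        /* process n bytes; any data left in the kernel buffer will raise
         * another EPOLLIN on the next epoll_wait() call */
    } else if (n == 0) {
        close(fd);                                   /* peer closed the connection */
    } else if (errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR) {
        close(fd);                                   /* real error */
    }
    /* EAGAIN / EWOULDBLOCK: nothing to do this round */
}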

Let's look at the two epoll event modes, LT (level trigger) and ET (edge trigger). An epoll event is triggered when the read/write status of a file descriptor changes. The difference between the two modes is: in level-trigger mode, as long as a socket is readable/writable, every epoll_wait call returns that socket; in edge-trigger mode, epoll_wait returns the socket only when it changes from unreadable to readable or from unwritable to writable. The following two diagrams illustrate this:

Read data from socket:


Write data to socket:


So in epoll's ET mode, the correct reading and writing methods are:

Read: as long as the fd is readable, keep reading until read returns 0 or errno == EAGAIN.

Write: as long as the fd is writable, keep writing until all the data has been sent or errno == EAGAIN.

In other words, for ET mode we effectively wrap read and write ourselves so that each behaves like an "atomic operation" that either completely drains the read buffer or completely writes the data into the send buffer. This can be implemented by wrapping read and write in a while loop. For select or the LT mode a single read or write per event is enough, because the main loop keeps running and the event will be reported again the next time; but it can also be implemented with a while loop around read and write, and logically, reading all the data in one go is better for data integrity.

Let's explain this "atomic operation" read and write

int n = 0;
while(1)
{
nread = read(fd, buf + n, BUFSIZ - 1 - n); //when reading, the receive buffer supplied by the user process has a fixed size, generally larger than the data
if(nread < 0)
{
if(errno == EAGAIN || errno == EWOULDBLOCK)
{
break; //the kernel read buffer has been drained: in ET mode, stop and wait for the next event
}
else
{
break; //or return;
}
}
else if(nread == 0)
{
break; //or return, because we read the EOF
}
else
{
n += nread;
}
}



int data_size = strlen(buf);
int n = 0;
while(data_size > 0)
{
nwrite = write(fd, buf + n, data_size);//when writing, the amount left to send keeps changing
if(nwrite < 0)
{
if(errno == EAGAIN || errno == EWOULDBLOCK)
{
continue; //the kernel write buffer is full; in a real program, wait for EPOLLOUT instead of spinning here
}
else
{
break;//or return;
}

}
else
{
n += nwrite; //a short write is not an error: advance the offset and keep writing
data_size -= nwrite;
}
}

For the correct accept, there are two issues to consider when accepting:

(1) Problems with blocked listening socket and accept in LT mode or ET mode

accept takes one connection at a time out of the TCP queue of connections that have completed the three-way handshake. Consider this situation: the TCP connection is aborted by the client, i.e. before the server calls accept the client sends an RST that terminates the connection, so the newly established connection is removed from the ready queue. If the listening socket is in blocking mode, the server then blocks inside the accept call until some other client establishes a new connection, and while it is blocked there, the other descriptors in the ready queue go unprocessed.

The solution is to make the listening socket non-blocking. Then, when a client aborts a connection before the server calls accept, accept can return -1 immediately. Berkeley-derived implementations handle this event inside the kernel and never report it to epoll; other implementations set errno to ECONNABORTED or EPROTO, errors which we should simply ignore.

(2) Problems with accept in ET mode

Consider this situation: several connections arrive at the same time, so the server's TCP ready queue suddenly accumulates multiple ready connections. Because this is edge-trigger mode, epoll notifies only once and accept handles only one connection, so the remaining connections in the TCP ready queue never get processed.

The solution is to make the listening socket non-blocking and call accept in a while loop, leaving the loop only after every connection in the TCP ready queue has been handled. How do we know they have all been handled? accept returns -1 with errno set to EAGAIN.

Based on the above two situations, the server should use non-blocking accept. The correct way to use accept in ET mode is:

while ((conn_sock = accept(listenfd, (struct sockaddr *) &remote,
&addrlen)) > 0) { /* addrlen is a socklen_t */
handle_client(conn_sock);
}
if (conn_sock == -1) {
if (errno != EAGAIN && errno != ECONNABORTED
&& errno != EPROTO && errno != EINTR)
perror("accept");
}

An interview question for Tencent backend development:

Using the Linux epoll model in level-trigger mode, a writable socket keeps triggering writable events continuously. How do you deal with that?

  • The first and most common way:

When you need to write data to the socket, add the socket to epoll and wait for the writable event. After receiving the writable event, call write or send to send data. When all data is written, move the socket out of epoll.

The disadvantage of this method is that even if a small amount of data is sent, the socket must be added to epoll and removed from epoll after writing, which has a certain operating cost.

  • An improved way:

At first the socket is not in epoll. When you need to write data to the socket, call write or send directly. If the call returns EAGAIN, add the socket to epoll, let epoll drive the writing, and remove the socket from epoll once all the data has been sent.

The advantage of this method is that when there is not much data, epoll event processing can be avoided and efficiency can be improved.
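A sketch of the improved approach under the assumptions above; send_or_arm, the pending-buffer bookkeeping left to the caller, and the choice of EPOLL_CTL_ADD (the socket is assumed not to be registered yet, otherwise EPOLL_CTL_MOD would be used) are illustrative, not from the original text:

#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Write directly; only register EPOLLOUT after hitting EAGAIN. */
static int send_or_arm(int epfd, int fd, const char *buf, size_t len)
{
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = write(fd, buf + sent, len - sent);
        if (n > 0) {
            sent += (size_t)n;
            continue;
        }
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            /* kernel send buffer is full: remember the unsent tail somewhere
             * and ask epoll to report when the socket becomes writable */
            struct epoll_event ev;
            ev.data.fd = fd;
            ev.events  = EPOLLOUT;
            epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
            return (int)sent;          /* caller keeps buf+sent..buf+len pending */
        }
        if (n < 0 && errno == EINTR)
            continue;                  /* interrupted: retry */
        return -1;                     /* real error */
    }
    return (int)len;                   /* everything sent, EPOLLOUT never registered */
}

/* After the EPOLLOUT event later fires and the pending data has been flushed:
 *     epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
 */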

Multiplexing

As stated at the beginning of the article, all three are I/O multiplexing mechanisms, and the definition of multiplexing is briefly introduced. So how can we understand multiplexing more intuitively? Here is a picture:


For a web server such as Nginx there are many incoming connections; epoll monitors them all and, like a switchboard, dispatches whichever connection has data to the corresponding handler code.

Generally speaking, I/O multiplexing is required in the following situations:

  • When the client handles multiple descriptors (typically interactive input and network sockets)

  • If a server needs to handle both TCP and UDP, I/O multiplexing is generally used.

  • If a TCP server has to handle both listening sockets and connected sockets

Why can epoll support millions of connections?

  1. In the server's processing loop, the important operation is using epoll_ctl to modify the epoll_event registered for clientfd. This first looks up the epoll_event corresponding to the fd in the red-black tree and then modifies it. A red-black tree is a typical balanced binary tree whose lookup cost is log2(n): for 1 million file handles only about 20 comparisons are needed, which is very fast, so millions of connections are supported without strain.

  2. In addition, epoll registers the callback function on fd. When the callback function monitors that an event occurs, it prepares the relevant data and puts it into the ready list. This action is very fast and the cost is very small.

Processing of socket read and write return values

When calling the socket reading and writing functions read() and write(), there will be a return value. If the return value is not handled correctly, it may introduce some problems

The following points are summarized

1. When the return value of read() or write() is greater than 0, it is the number of bytes actually read from or written to the buffer.

2. When read() returns 0, the peer has closed the socket. The socket must be closed on this side as well, otherwise it leaks; checking with the netstat command, sockets stuck in the CLOSE_WAIT state indicate such a leak.

When write() returns 0, it means the write buffer is currently full; this is normal, just write again later.

3. When read() or write() returns -1, it is generally necessary to judge errno

If errno == EINTR, the call was interrupted by a signal and can simply be ignored (retried).

If errno == EAGAIN or EWOULDBLOCK: for a non-blocking socket this can simply be ignored; for a blocking socket it usually means the read or write timed out without completing. The timeout here refers to the socket's SO_RCVTIMEO and SO_SNDTIMEO options, so when using blocking sockets do not set the timeout too small, otherwise a return of -1 leaves you unable to tell whether the connection is really broken or it is just normal network jitter. In general, when a blocking socket returns -1, the connection should be closed and re-established.

4. In addition, a non-blocking connect may return -1, and errno must be checked. If errno == EINPROGRESS, the connection is still in progress; otherwise the connection has failed and needs to be closed and retried. After select (or epoll) reports the socket writable, you must call getsockopt(c->fd, SOL_SOCKET, SO_ERROR, &err, &errlen) to check whether there is a pending error on the socket: err == 0 means the connection succeeded, otherwise the socket should be closed and reconnected.
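A minimal sketch of the non-blocking connect pattern described above (the helper names and the epoll/select bookkeeping are assumptions; only the EINPROGRESS and SO_ERROR handling is the point here):

#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Start a non-blocking connect; returns the fd, or -1 on immediate failure. */
static int start_connect(const struct sockaddr_in *addr)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    if (connect(fd, (const struct sockaddr *)addr, sizeof(*addr)) == 0)
        return fd;                  /* connected immediately */
    if (errno == EINPROGRESS)
        return fd;                  /* in progress: wait for writability, then check SO_ERROR */
    close(fd);                      /* real failure */
    return -1;
}

/* Called once select/epoll reports the fd writable. */
static int finish_connect(int fd)
{
    int err = 0;
    socklen_t errlen = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0 || err != 0) {
        close(fd);                  /* connect failed: close and reconnect */
        return -1;
    }
    return 0;                       /* err == 0: connection established */
}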

5. When using epoll there are two modes, ET and LT. In ET mode the socket must be read or written until -1 is returned. That is fine for non-blocking sockets, but for a blocking socket, as noted in point 3, the call only returns after the timeout expires. So never use blocking sockets in ET mode. Why is LT mode not a problem? In LT mode we normally call read or write only once per event; if the read or write is not finished, we simply try again the next time the event is reported, and since a readable or writable event has been reported, a single call to read or write is guaranteed to return normally.

nread == -1 with errno == EAGAIN means all the data has been read; at that point EPOLLOUT can be set.

Network status query command

sar、iostat、lsof

Problem record

client

1、Cannot assign requested address


Generally speaking, it is because the client frequently connects to the server, and each connection ends in a short period of time, resulting in a lot of TIME_WAIT, so that the available port numbers are used up, so new connections cannot bind the port, that is, "Cannot assign requested address". It's a client-side problem, not a server-side problem. Through netstat, I did see a lot of connections in the TIME_WAIT state.

The client frequently establishes connections and releases ports slowly, resulting in no available ports when establishing new connections.

netstat -a|grep TIME_WAIT
tcp 0 0 e100069210180.zmf:49477 e100069202104.zmf.tbs:websm TIME_WAIT
tcp 0 0 e100069210180.zmf:49481 e100069202104.zmf.tbs:websm TIME_WAIT
tcp 0 0 e100069210180.zmf:49469 e100069202104.zmf.tbs:websm TIME_WAIT
……

Solution

Execute the command to modify the following kernel parameters (root privileges are required)

  • Reduce the waiting time after port release, the default is 60s, change it to 15~30s:

sysctl -w net.ipv4.tcp_fin_timeout=30

  • Enable TCP timestamps via /proc/sys/net/ipv4/tcp_timestamps (the tcp_tw_reuse and tcp_tw_recycle settings below only take effect when timestamps are enabled), so that TIME_WAIT ports can be released for new connections:

sysctl -w net.ipv4.tcp_timestamps=1

  • Modify the tcp/ip protocol configuration to quickly recycle socket resources. The default is 0, change it to 1:

sysctl -w net.ipv4.tcp_tw_recycle=1

  • Allow port reuse:

sysctl -w net.ipv4.tcp_tw_reuse=1

2. With about 28,000 connections an error is reported (ports may be exhausted)

If there are no connections in the TIME_WAIT state, it is possible that the ports are exhausted, especially with long-lived connections. Check the open port range:

[root@VM_0_8_centos usr]# sysctl -a |grep port_range
net.ipv4.ip_local_port_range = 32768 60999
sysctl: reading key "net.ipv6.conf.all.stable_secret"
sysctl: reading key "net.ipv6.conf.default.stable_secret"
sysctl: reading key "net.ipv6.conf.eth0.stable_secret"
sysctl: reading key "net.ipv6.conf.lo.stable_secret"
[root@VM_0_8_centos usr]#

60999 - 32768 = 28,231, exactly 28,000, long connections use up all the ports. Modify the port open range:

vi /etc/sysctl.conf
net.ipv4.ip_local_port_range = 10000 65535

Execute sysctl -p to take effect

Server

Reasons the number of connections stops growing:

1. Ports are used up (on the client side.

Check the port range with sysctl -a | grep port_range, which returns:
net.ipv4.ip_local_port_range = 32768 60999, so the available ports are 60999 - 32768 ≈ 28,000)

2. The file fd is used up.

3. The memory is exhausted

4. Network

5. Configuration: fs.file-max = 1048576 # the maximum number of file fds,

fd = open(); fds start from 3, because 0 is stdin (standard input), 1 is stdout (standard output), and 2 is stderr (standard error output)

Epoll’s hard-to-solve problems

int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout); Let's start with this API. It can monitor many fds, but there is only one timeout. Consider this scenario: you care about the descriptor fd == 10 and want to be notified when data arrives on it, but if no data arrives you want to be notified after 20 seconds. What would you do?

Suppose further that you care about 100 fds and want each of them to time out 20 seconds after it joined the listening queue.

If you use epoll_wait alone, do you have to track the remaining timeout of every fd yourself?

Libevent solves this problem.
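A rough sketch of the bookkeeping that epoll itself forces on you, which is what libevent automates; the per-fd deadline table and helper names are illustrative assumptions:

#include <stdint.h>
#include <sys/epoll.h>
#include <time.h>

#define MAX_FDS 1024

/* Hypothetical per-fd deadline table: deadline_ms[fd] == 0 means "no timer". */
static int64_t deadline_ms[MAX_FDS];

static int64_t now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* The single epoll_wait timeout has to be the distance to the nearest deadline. */
static int next_timeout(void)
{
    int64_t now = now_ms(), best = -1;
    for (int fd = 0; fd < MAX_FDS; fd++) {
        if (deadline_ms[fd] == 0)
            continue;
        int64_t left = deadline_ms[fd] - now;
        if (left < 0)
            left = 0;
        if (best < 0 || left < best)
            best = left;
    }
    return (int)best;   /* -1: no timers, block indefinitely */
}

/* usage: n = epoll_wait(epfd, events, maxevents, next_timeout());
 * after it returns, every fd with deadline_ms[fd] <= now_ms() is treated as timed out. */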

Configuration and debugging

net.ipv4.tcp_syncookies = 1 means turning on SYN Cookies. When the SYN waiting queue overflows, enable cookies to handle it, which can prevent a small number of SYN attacks. The default is 0, which means closed;

net.ipv4.tcp_tw_reuse = 1 means to enable reuse. Allow TIME-WAIT sockets to be reused for new TCP connections. The default is 0, which means closed;

net.ipv4.tcp_tw_recycle = 1 means to enable fast recycling of TIME-WAIT sockets in TCP connections. The default is 0, which means it is closed.

net.ipv4.tcp_fin_timeout modifies the default TIMEOUT time


Why would enabling syncookies optimize TIME_WAIT? syncookies only let the server get around the SYN queue limit; they have nothing to do with optimizing TIME_WAIT.

In addition, timestamps must be turned on for reuse and recycle to take effect.

How to modify fin_timeout? How about turning it up or down?


//Test Data

Single thread, peak connection handling; under the current test conditions it appears to top out at about 2.9K new connections per second.

CONNECT: 2.9K/s

QPS: 276,000

50,000 links.

QPS: 19,000

How many connections do the client and server support?

client

Now we can draw a more accurate conclusion. A client with a single IP is limited by the ip_local_port_range parameter, and also by the 65,535 port ceiling. However, a single Linux host can be configured with multiple IPs, and with several IPs the theoretical maximum is multiplied accordingly.

Multiple network cards are not necessary. Even if there is only one network card, multiple IPs can be configured. This is what k8s does. In k8s, multiple pods can be deployed on a physical machine. But each pod will be assigned an independent IP, so there is no need to worry about deploying too many pods on the physical machine and affecting the number of TCP connections in the pods you use. The moment the IP is given to you, your pod is isolated from other applications.

Server side

If a TCP connection does not send data, it consumes about 3.3K of memory. If there is data to send, you need to allocate a sending buffer area for each TCP. The size is affected by your parameter net.ipv4.tcp_wmem configuration. By default, the minimum is 4K. If the sending is completed, the memory consumed by the buffer area will be recycled.

Assuming that you only maintain the connection and do not send data, then the maximum number of connections that your server can establish = your memory/3.3K. If it is 4GB of memory, then the approximate number of acceptable TCP connections is about 1 million.

In this example, the premise we consider is to hold all server-side connections in one process. In actual projects, in order to facilitate sending and receiving data, many network IO models will also create another thread or coroutine for the TCP connection. Taking the lightest golang as an example, a coroutine stack also requires 2KB of memory overhead.

in conclusion

  • TCP connection client: The TCP connection that can be established by each IP is theoretically limited by the ip_local_port_range parameter and is also limited by 65535. But you can increase your ability to establish connections by configuring multiple IPs.

  • TCP connected server machine: Although the theoretical value of each listening port is large, this number has no practical significance. The maximum number of concurrencies depends on your memory size. Each static TCP connection requires approximately 3.3K of memory.

The difference between select, poll, and epoll (Sogou interview)

(1)select==>Time complexity O(n)

select only knows that some I/O event happened, but not on which streams (there may be one, several, or even all of them), so we have to poll all the streams indiscriminately to find the ones that can read or write data and then operate on them. select therefore has O(n) indiscriminate polling complexity: the more streams handled at once, the longer each polling pass takes.

(2)poll==>Time complexity O(n)

Poll is essentially the same as select. It copies the array passed in by the user to the kernel space, and then queries the device status corresponding to each fd. However, it does not have a limit on the maximum number of connections because it is stored based on a linked list.

(3)epoll==>Time complexity O(1)

Epoll can be understood as event poll. Different from busy polling and indiscriminate polling, epoll will notify us of which I/O event occurred in which stream. So we say that epoll is actually event-driven (each event is associated with fd). At this time, our operations on these streams are meaningful. (The complexity is reduced to O(1))

Select, poll, and epoll are all I/O multiplexing mechanisms. I/O multiplexing means using one mechanism to monitor multiple descriptors; once a descriptor becomes ready (usually read-ready or write-ready), the program is notified to perform the corresponding read or write. But select, poll, and epoll are all essentially synchronous I/O, because the process itself still has to do the reading and writing after the events become ready, and that read/write blocks the process. With asynchronous I/O the process does not do the reading and writing itself; the asynchronous I/O implementation is responsible for copying the data from the kernel into user space.

Both epoll and select can provide multi-channel I/O multiplexing solutions. All of them can be supported in the current Linux kernel. Among them, epoll is unique to Linux, while select should be stipulated by POSIX and implemented in general operating systems.

select:

Select essentially performs the next step of processing by setting or checking the data structure storing the fd flag. The disadvantages of this are:

1. The number of fds that can be monitored by a single process is limited, that is, the size of the listening port is limited.

Generally speaking, this number has a lot to do with system memory. The specific number can be viewed by cat /proc/sys/fs/file-max. The default for 32-bit machines is 1024. The default for 64-bit machines is 2048.

2. When scanning the socket, linear scanning is used, that is, the polling method is used, which is less efficient:

When there are many sockets, each select() must complete the scheduling by traversing FD_SETSIZE Sockets. No matter which Socket is active, it will be traversed once. This wastes a lot of CPU time. If you can register a callback function for the socket and automatically complete the relevant operations when they are active, you can avoid polling. This is what epoll and kqueue do.

3. It is necessary to maintain a data structure to store a large number of FDs, which will cause high copy overhead when transferring the structure between user space and kernel space.

poll:

Poll is essentially the same as select: it copies the user-supplied array into kernel space and then queries the device status of each fd. If a device is ready it adds an item to the device's wait queue and continues the traversal; if after traversing all fds no ready device is found, the current process is suspended until a device becomes ready or the call times out, after which it is woken up and traverses the fds all over again. This process involves many unnecessary traversals.

It has no limit on the maximum number of connections because it is stored based on a linked list, but it also has a disadvantage:

1. A large number of fd arrays are copied as a whole between user mode and kernel address space, regardless of whether such copying is meaningful.

2. Another feature of poll is "horizontal triggering". If an fd is reported but is not processed, the fd will be reported again the next time poll is performed.

epoll:

epoll has two trigger modes, EPOLLLT and EPOLLET. LT is the default mode; ET is the "high-speed" mode. In LT mode, as long as the fd still has data to read, every epoll_wait call returns its event to remind the user program to act on it. In ET (edge-trigger) mode it prompts only once and will not prompt again until new data flows in, regardless of whether readable data is still sitting in the fd. Therefore in ET mode, when reading an fd its buffer must be drained, i.e. read until the return value is less than the requested size or an EAGAIN error occurs. Another characteristic is that epoll uses the "event" readiness notification style: the fd is registered with epoll_ctl, and once the fd becomes ready the kernel uses a callback-like mechanism to activate it, so that epoll_wait receives the notification.

Why does epoll have EPOLLET trigger mode?

With EPOLLLT, once the system has a large number of ready file descriptors that you do not currently need to read or write, they are returned on every epoll_wait call, which greatly reduces the efficiency with which the handler retrieves the ready descriptors it actually cares about. With EPOLLET (edge-triggered) mode, when a readable or writable event occurs on a monitored descriptor, epoll_wait() notifies the handler once; if not all of the data is read or written this time (for example, because the read/write buffer is too small), the next epoll_wait() call will not notify you again. You are notified only once, until the next read/write event occurs on that descriptor. This mode is more efficient than level triggering, and the system is not flooded with a large number of ready descriptors you do not care about. (A sketch of the drain-until-EAGAIN read loop that ET requires follows.)
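A minimal sketch of the drain-until-EAGAIN pattern required in ET mode, assuming fd is non-blocking as ET demands:

#include <errno.h>
#include <unistd.h>

/* With EPOLLET, read until the buffer is drained: stop only when read()
 * returns 0 (peer closed) or fails with EAGAIN/EWOULDBLOCK. */
void drain_fd(int fd)
{
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            /* process n bytes of data here */
            continue;
        }
        if (n == 0) {
            /* peer closed the connection */
            break;
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;      /* kernel buffer is empty; wait for the next event */
        if (errno == EINTR)
            continue;   /* interrupted by a signal, retry */
        break;          /* real error: handle/close the fd here */
    }
}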

Advantages of epoll:

1. There is no hard limit on concurrent connections: the upper limit of FDs that can be opened is far greater than 1024 (roughly 100,000 fds can be monitored with 1 GB of memory);

2. Efficiency improves because epoll does not poll: performance does not decline as the number of FDs grows, since the callback is invoked only for active, ready FDs;

That is, epoll's biggest advantage is that it only cares about your "active" connections and is unaffected by the total number of connections, so in a real network environment epoll is far more efficient than select and poll.

3. Reduced memory copying: epoll is commonly described as using mmap()-style memory sharing between kernel and user space to speed up message passing, i.e. it avoids the per-call copy overhead of select and poll.

Summary of the differences between select, poll, and epoll:

1. Support the maximum number of connections that can be opened by a process

select

The maximum number of fds a single process can pass to select is defined by the FD_SETSIZE macro, whose size is 32 integers: on a 32-bit machine that is 32*32 = 1024 bits, and on a 64-bit machine FD_SETSIZE is 32*64 = 2048 bits. We can of course modify this and recompile the kernel, but performance may be affected, which would require further testing.

poll

Poll is essentially the same as select, but it does not have a limit on the maximum number of connections because it is stored based on a linked list.

epoll

Although there is an upper limit on the number of connections, it is very large. A machine with 1G memory can open about 100,000 connections, and a machine with 2G memory can open about 200,000 connections.

2. IO efficiency problems caused by the sharp increase in FD

select

Because every call linearly traverses all connections, performance degrades linearly as the number of FDs grows and traversal becomes slow.

poll

Same as above

epoll

Because the kernel implementation of epoll is based on a callback on each fd, only active sockets invoke their callback. When few sockets are active, epoll does not suffer the linear performance degradation of the previous two; but when almost all sockets are active, performance problems may still appear.

3. Message delivery method

select

The kernel needs to pass the result to user space, which requires a copy by the kernel.

poll

Same as above

epoll

epoll is commonly described as sharing a piece of memory between the kernel and user space, so ready events are handed over without copying the whole fd set each time.

Summarize:

In summary, when choosing among select, poll, and epoll, consider the specific use case and the characteristics of each.

1. On the surface epoll performs best, but when the number of connections is small and the connections are very active, select and poll may outperform epoll, since epoll's notification mechanism involves many callbacks.

2. select is inefficient because it must poll on every call. But inefficiency is relative and can be mitigated by good design, depending on the situation.

Below, the three I/O multiplexing mechanisms are compared in more detail, drawing on material from the Internet and from books:

1. Select implementation

The calling process of select is as follows:


(1) Use copy_from_user to copy fd_set from user space to kernel space

(2) Register callback function __pollwait

(3) Traverse all fd and call its corresponding poll method (for socket, this poll method is sock_poll, sock_poll will call tcp_poll, udp_poll or datagram_poll according to the situation)

(4) Taking tcp_poll as an example, its core implementation is __pollwait, which is the callback function registered above.

(5) The main job of __pollwait is to hang current (the current process) onto the device's wait queue. Different devices have different wait queues; for tcp_poll the wait queue is sk->sk_sleep (note that hanging the process on the wait queue does not mean it has gone to sleep). When the device receives data (a network device) or finishes filling in file data (a disk device), it wakes up the processes sleeping on its wait queue, and at that point current is woken.

(6) When the poll method returns, it will return a mask describing whether the read and write operations are ready, and assign a value to fd_set based on this mask.

(7) If all fds have been traversed and no readable/writable mask was returned, schedule_timeout is called to put the process that called select (i.e. current) to sleep. When a device driver finds its resources readable or writable, it wakes up the processes sleeping on its wait queue. If nothing wakes the process within the timeout (specified via schedule_timeout), it is woken anyway, regains the CPU, and traverses the fds again to check whether any have become ready.

(8) Copy fd_set from kernel space to user space.

Summarize:

Several major disadvantages of select:

(1) Every call to select copies the fd set from user mode to kernel mode, which is expensive when there are many fds.

(2) Every call to select also traverses all the fds passed in, inside the kernel, which is likewise expensive when there are many fds.

(3) The number of file descriptors select supports is too small; the default is 1024.

2. poll implementation

The implementation of poll is very similar to select's; the difference is how the fd set is described. poll uses the pollfd structure instead of select's fd_set, and everything else is similar: managing multiple descriptors still means polling and acting on each descriptor's status. poll has no limit on the maximum number of file descriptors, but it shares a drawback with select: the array containing a large number of file descriptors is copied wholesale between user space and the kernel's address space, whether or not those descriptors are ready, and this overhead grows linearly with the number of file descriptors.

3. epoll

Since epoll is an improvement over select and poll, it should avoid the three shortcomings above. How does it do so? First, look at the difference in calling interfaces: select and poll each provide a single function, select or poll, while epoll provides three: epoll_create, epoll_ctl and epoll_wait. epoll_create creates an epoll handle, epoll_ctl registers the event types to monitor, and epoll_wait waits for events to occur.
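A minimal sketch tying the three calls together; listen_fd is assumed to be a non-blocking listening socket created elsewhere, and error handling is largely omitted.

#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 64

/* Skeleton of the epoll_create / epoll_ctl / epoll_wait workflow. */
void epoll_loop(int listen_fd)
{
    int epfd = epoll_create(256);        /* size is only a hint on modern kernels */
    if (epfd < 0)
        return;

    struct epoll_event ev = {0};
    ev.events = EPOLLIN;                 /* add | EPOLLET for edge-triggered mode */
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);   /* fd is copied in once, here */

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        /* Returns only the ready fds; no need to scan the whole set. */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                /* accept() the new connection and EPOLL_CTL_ADD it here */
            } else if (events[i].events & EPOLLIN) {
                /* read from events[i].data.fd here */
            }
        }
    }
    /* close(epfd) when the loop is done */
}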

For the first shortcoming, epoll's solution lies in epoll_ctl. Each time a new event is registered on the epoll handle (EPOLL_CTL_ADD in epoll_ctl), that fd is copied into the kernel once, rather than being copied repeatedly during epoll_wait. epoll guarantees that each fd is copied only once over the whole process.

For the second shortcoming, epoll does not add current to the device wait queue of every fd on every call the way select and poll do. Instead, current is hung on the wait queue only once, during epoll_ctl (this one time is unavoidable), and a callback function is specified for each fd. When the device becomes ready and wakes the waiters on its queue, this callback is invoked, and it adds the ready fd to a ready list. The job of epoll_wait is then simply to check whether the ready list contains anything (it uses schedule_timeout() to sleep for a while and then check, similar to step 7 of the select implementation).

Regarding the third shortcoming, epoll has no such limit. The upper limit of FDs it supports is the maximum number of files that can be opened, which is generally far greater than 2048; on a machine with 1 GB of memory it is about 100,000. The exact system-wide number can be checked with cat /proc/sys/fs/file-max and depends largely on system memory.
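Besides the system-wide limit above, a program can check its own per-process fd limit; a minimal sketch:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    /* RLIMIT_NOFILE is the per-process cap on open file descriptors,
     * which is also the practical upper bound on fds epoll can watch. */
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
        printf("soft limit: %llu, hard limit: %llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
    return 0;
}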

Summarize:

(1) select and poll must repeatedly poll the entire fd set until a device is ready, possibly alternating between sleeping and waking several times. epoll also calls epoll_wait repeatedly and may alternate between sleeping and waking, but when a device becomes ready it invokes the callback, puts the ready fd on the ready list, and wakes the process sleeping in epoll_wait. Although both alternate between sleeping and waking, select and poll must traverse the entire fd set while "awake", whereas epoll only needs to check whether the ready list is empty, which saves a great deal of CPU time. This is the performance gain brought by the callback mechanism.

(2) Every call to select or poll copies the fd set from user mode to kernel mode and hangs current on each device wait queue, whereas epoll copies each fd only once and hangs current only once (at the start of epoll_wait; note that the wait queue here is not a device wait queue but epoll's own internal wait queue). This also saves considerable overhead.
