What is the system call mechanism? Illustrated with the Linux 0.12 source code

Last updated: 2024-04-04

Kernel state and user state
What is a system call
How system calls are implemented

Library function write
Library function extended assembly macro
int 0x80 interrupt calls the corresponding interrupt processing function
Retrieve system call function table
Finally execute sys_write

Kernel mode and user mode data interaction

Kernel state and user state

When early engineers wrote programs for an operating system, a program could access addresses belonging to other programs, or even addresses occupied by the operating system itself. It was easy to bring down the whole operating system by accident, so programmers of that era had to be extremely careful.

A computer's core resources generally include memory, I/O ports, and special machine instructions. These resources must be protected, with clear rules about which programs may access them and which may not.

This is why the concept of a "privilege level" was introduced, with hardware manufacturers providing support for it directly in hardware. The most common approach is to classify the CPU's instruction set by permission level, which controls what the code running on the CPU is allowed to do.

For example, Intel CPUs divide the operating permissions of the instruction set into four levels, from highest to lowest: Ring0, Ring1, Ring2, and Ring3. Ring0 has the highest privilege and can use every CPU instruction. Ring3 has the lowest privilege and can use only a subset of them; for instance, it cannot use instructions that operate on hardware resources, such as I/O operations and memory allocation. In addition, while the CPU is in Ring3 it cannot access the Ring0 address space, including its code and data.

A CPU instruction set is the set of instructions a CPU uses to perform computation and control the computer system; it is the medium through which software directs the hardware. Common instruction sets include x86, ARM, MIPS, Alpha, and RISC-V.

So how does the CPU record this privilege level information?

We take the 80386 CPU as an example. As mentioned earlier, the CPU has several segment registers (CS, DS, SS, ES, FS, GS). These segment registers hold segment selectors.

A segment selector contains a requested privilege level (RPL) field; for the selector held in CS, the same two bits give the current privilege level (CPL). The selector is used to find the corresponding entry in the global descriptor table (GDT) or the local descriptor table (LDT). Before the access goes ahead, a privilege check is performed: each of these entries contains a DPL field (the privilege level required to access the segment), and access to a data segment is allowed only if DPL >= max{CPL, RPL}.
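
The check can be written out in a few lines of C. This is only a sketch of the rule stated above, not the processor's actual logic, and the function names are made up for illustration. Keep in mind that a numerically smaller level is more privileged.

```c
#include <stdbool.h>

/* Privilege levels run from 0 (Ring0, most privileged) to 3 (Ring3). */
static unsigned max_level(unsigned a, unsigned b)
{
    return a > b ? a : b;
}

/* Hypothetical helper: the data-segment rule from the text,
 * access is allowed only if DPL >= max{CPL, RPL}. */
bool data_access_allowed(unsigned cpl, unsigned rpl, unsigned dpl)
{
    return dpl >= max_level(cpl, rpl);
}
```

Under this rule a Ring3 program (CPL = 3, RPL = 3) is refused a Ring0 data segment (DPL = 0), while Ring0 code (CPL = 0, RPL = 0) may access a segment whose DPL is 3.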

CPL is special: it tracks the DPL in the descriptor of the code segment the CPU is currently executing, so it always equals the CPU's current privilege level.

Kernel mode and user mode are operating-system-level concepts and are not inherently tied to the CPU hardware. But since the hardware already provides a privilege mechanism, Linux has no need to "reinvent the wheel" and uses it directly: of the hardware levels, it uses Ring3 as user mode and Ring0 as kernel mode.

So some people may ask: why does Linux use only these two levels, Ring0 and Ring3?

Because the privilege management the CPU provides is not fine-grained enough to be useful in practice. For example, on Intel CPUs, Ring2 and Ring3 make no difference to the operating system in terms of security, while system code running in Ring1 would need to invoke Ring0 privileged instructions frequently, and switching privilege levels that often is too costly. It is better for the operating system to merge Ring2 into Ring3 and Ring1 into Ring0.

On the other hand, not every processor supports four privilege levels the way x86 does.

Let’s take a look at the Linux system architecture diagram:

We can find that the Linux system as a whole is divided into user mode and kernel mode.

Kernel mode

Kernel mode sits at the core of the operating system, at privilege level Ring0, and has the operating system's highest authority. It can control all hardware resources, manipulate all core data, and access any address in memory. The kernel manages these core resources centrally to reduce conflicts over access to limited resources. Any program fault that occurs in the kernel is catastrophic and will crash the entire operating system.

User mode

User mode, where the programs we usually write run, sits at privilege level Ring3 and has lower permissions. Programs at this level cannot control the hardware directly, nor can they access memory addresses directly. In this mode, even if a program crashes, other programs are unaffected and the system can recover.

What is a system call

When the computer starts, the CPU is in the Ring0 state and can execute every instruction. The boot program loads the operating system from the disk sectors into memory, thereby starting the operating system (note that this article uses Linux 0.12 as its example operating system).

In other words, when Linux 0.12 starts, it runs in kernel mode with the highest privilege. At the same time, memory is partitioned and a portion (the kernel area) is reserved for the kernel; only the kernel can use it, while the main memory area is used by other application software. If you are interested in this part, see my earlier article Linux0.12 Kernel Source Code Interpretation (6)-main.c.

Once the operating system has started, the CPU switches to Ring3 and the operating system enters user mode. From then on, application code runs in user mode with the lowest privilege. Normally, every program we write runs in user mode.

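Take special note: the CPU privilege level has nothing to do with the operating system's users! Some people confuse it with Linux user permissions. Whether root, administrator, guest, or ordinary user, they are all users. All user code executes in user mode at Ring3 and all kernel code executes in kernel mode at Ring0, regardless of the identity or permissions of the Linux user.

Because the programs we write run in user mode, they cannot access memory and I/O ports directly, so on their own they can barely interact with the outside world. But accessing disks and writing files are everyday necessities. What can be done?

The answer is to execute a system call: the operating system switches to kernel mode and the kernel performs the operation on the program's behalf (the big brother does the job for the little brother); when the operation is finished, the operating system switches back to user mode. This keeps management centralized and reduces conflicts over access to limited resources.

System calls are a set of interfaces that the operating system provides specifically so that processes running in user mode can interact with hardware devices. They are one of the ways user mode actively requests a switch into kernel mode.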

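How system calls are implemented

Next, let's go through the Linux 0.12 source code to see how system calls are implemented.

Library function write

This article uses a common library function, write, as the running example to make things easier to follow. Let's get started: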

//  lib/write.c

#define __LIBRARY__
#include <unistd.h> // header file

_syscall3(int,write,int,fd,const char *,buf,off_t,count) // defines the implementation of write: fd - file descriptor; buf - pointer to the write buffer; count - number of bytes to write

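The write.c file mainly defines the implementation of write. The _syscall3(*,write,*) macro expands into a function whose job is to write count bytes of data from the buffer buf to the file designated by the file descriptor fd.

Note the #define __LIBRARY__ macro definition: the immediate reason for defining it here is to pull in the inline assembly code in unistd.h.

Library function extended assembly macro

Because _syscall3 is defined in /include/unistd.h, let's look at the source: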

//  /include/unistd.h


#ifdef __LIBRARY__ // everything below is included only if __LIBRARY__ was defined beforehand

...

#define __NR_write 4 // system call number, used as the index into the system call function table

...

// Inline-assembly macro that defines a system call function taking 3 arguments.
// %0 - eax(__res), %1 - eax(__NR_name), %2 - ebx(a), %3 - ecx(b), %4 - edx(c).
// "int $0x80" raises system interrupt 0x80; the return value comes back in eax (__res).
// Inputs: the system call number __NR_name plus the 3 arguments.
// If the return value is >= 0 it is returned directly; otherwise errno is set and -1 is returned.
#define _syscall3(type,name,atype,a,btype,b,ctype,c) \
type name(atype a,btype b,ctype c) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
 : "=a" (__res) \
 : "0" (__NR_##name),"b" ((long)(a)),"c" ((long)(b)),"d" ((long)(c))); \
if (__res>=0) \
 return (type) __res; \
errno=-__res; \
return -1; \
}

#endif /* __LIBRARY__ */

...

int write(int fildes, const char * buf, off_t count); // prototype of the write system call

...

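Only when lib/write.c first defines #define __LIBRARY__ can the system call numbers and the inline-assembly macro _syscall3() in /include/unistd.h be seen. Otherwise the file that includes unistd.h does not need to make system calls itself, and the system-call-related macro definitions in unistd.h are simply skipped, which is quite elegant.

In fact, we can expand the write function in write.c by hand: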

int write(int fd,const char* buf,off_t count)
{
long __res;
__asm__ volatile ( "int $0x80"
: "=a" (__res)
: "0" (__NR_write), "b" ((long)(fd)), "c" ((long)(buf)), "d" ((long)(count)));
if (__res>=0)
return (int) __res;
errno=-__res;
return -1;
}

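This makes the role of #define __LIBRARY__ easier to understand.

The "int $0x80" above invokes system interrupt 0x80. In essence, a system call is still implemented through an interrupt (0x80)! Interrupts really are everywhere in an operating system. If you are not familiar with interrupts, see the author's earlier article Illustrated Computer Interrupts.

Also, because a program in user mode cannot operate hardware resources directly, it has to make a system call and switch to kernel mode. In other words, a user program that calls the library function write ends up making a system call.

The system call itself amounts to raising the int 0x80 interrupt, with the three arguments fd, buf, and count placed in the ebx, ecx, and edx registers respectively.

Then there is #define __NR_write 4, which defines the system call number. __NR_write is placed in the eax register; when the call returns, the return value is taken from eax and stored in __res, which links the user stack and the kernel stack. The role of __NR_write is explained below.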
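
The register convention can be tried out without any inline assembly. The following is a minimal simulation in portable C, not kernel code: the struct demo_regs, the function demo_kernel_entry, and the choice of EINVAL for bad arguments are all assumptions made for illustration, with demo_kernel_entry standing in for what happens after int $0x80.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_NR_WRITE 4   /* same value as __NR_write in the article */

/* Simulated register file: eax carries the call number in and the result out. */
struct demo_regs { intptr_t eax, ebx, ecx, edx; };

/* Hypothetical stand-in for the kernel side of the call. */
static void demo_kernel_entry(struct demo_regs *r)
{
    if (r->eax != DEMO_NR_WRITE || r->ebx < 0 || r->edx < 0) {
        r->eax = -EINVAL;         /* errors come back as negative values */
        return;
    }
    r->eax = r->edx;              /* pretend every byte was written */
}

/* Mirrors the shape of the expanded write(): load registers, trap, decode eax. */
long demo_write(int fd, const char *buf, long count)
{
    struct demo_regs r;
    r.eax = DEMO_NR_WRITE;        /* system call number */
    r.ebx = fd;                   /* first argument */
    r.ecx = (intptr_t)buf;        /* second argument */
    r.edx = count;                /* third argument */
    demo_kernel_entry(&r);        /* stands in for int $0x80 */
    if (r.eax >= 0)
        return (long)r.eax;
    errno = (int)-r.eax;          /* negative result becomes errno */
    return -1;
}
```

Calling demo_write(1, "hi", 2) returns 2, while demo_write(-1, "hi", 2) returns -1 and leaves EINVAL in errno, which is the same success and failure shape that _syscall3 produces.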

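int 0x80 interrupt calls the corresponding interrupt processing function

Let's look at the flowchart of how an interrupt invokes its handler:

When an interrupt occurs, the CPU obtains the interrupt vector number, uses IDTR to locate the interrupt descriptor table (IDT), and fetches the corresponding interrupt descriptor. It then executes the interrupt handler at the entry address stored in that descriptor.

Early in Linux 0.12 startup, main.c calls sched_init() to initialize the scheduler. Its source: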

//     /kernel/sched.c

...

void sched_init(void)
{
 ...
 set_system_gate(0x80,&system_call); // set up the system call interrupt gate
}

...

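set_system_gate was covered in the earlier article Linux0.12 Kernel Source Code Interpretation (7)-Trap Gate Initialization, so it is not repeated here.

Note that a process uses different stacks in user mode and kernel mode, called the user stack and the kernel stack; each handles function calls at its own privilege level. So when the system call interrupt int 0x80 moves execution from user mode to kernel mode, the CPU switches from the user stack to the kernel stack, and when the system call returns it switches back to the user stack to continue the user-mode function calls (this is also known as saving and restoring the context of the interrupted process).

The key point is that the CPU automatically locates the current process's TSS through the TR register, then uses the ss0 and esp0 values stored there to find the kernel stack and complete the switch from user stack to kernel stack. This is just an overview; we will cover it in detail when we get to processes.

Taken as a whole, set_system_gate(0x80,&system_call) sets up the system call interrupt gate, binding interrupt 0x80 to the function system_call. In other words, system_call is the interrupt handler for 0x80.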
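
The binding of a vector number to a handler can be pictured as a table of function pointers. The sketch below is a simplification in plain C: a real IDT entry is a gate descriptor with a selector, an offset, and privilege bits, and every name here (demo_idt, demo_set_gate, demo_interrupt, demo_system_call) is hypothetical.

```c
#include <stddef.h>

#define DEMO_IDT_ENTRIES 256              /* x86 has 256 interrupt vectors */

typedef void (*demo_handler)(void);

static demo_handler demo_idt[DEMO_IDT_ENTRIES];  /* stand-in for the IDT */
static int demo_calls;                           /* counts handler runs */

/* Stand-in for system_call: just records that it ran. */
void demo_system_call(void)
{
    demo_calls++;
}

/* Plays the role of set_system_gate: bind a vector to a handler. */
void demo_set_gate(unsigned vector, demo_handler h)
{
    if (vector < DEMO_IDT_ENTRIES)
        demo_idt[vector] = h;
}

/* Plays the role of the CPU: look the vector up and run its handler.
 * Returns 0 on success, -1 if no handler is installed. */
int demo_interrupt(unsigned vector)
{
    if (vector >= DEMO_IDT_ENTRIES || demo_idt[vector] == NULL)
        return -1;
    demo_idt[vector]();
    return 0;
}
```

After demo_set_gate(0x80, demo_system_call), a call to demo_interrupt(0x80) runs the handler, while a vector with no gate installed is rejected.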

Retrieve system call function table

Next, let's look at the source of the system_call function:

//    /kernel/sys_call.s

...

// int 0x80
_system_call:
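 push %ds      # push onto the stack to save the original segment register values
 push %es
 push %fs
 pushl %eax  # save the original value of eax
 pushl %edx
 pushl %ecx  # push %ebx,%ecx,%edx as parameters
 pushl %ebx  # to the system call; ebx, ecx, edx hold the arguments of the corresponding C function
 movl $0x10,%edx  # point ds, es at the kernel data segment
 mov %dx,%ds
 mov %dx,%es
 movl $0x17,%edx  # point fs at the current local data segment (the data segment descriptor in the LDT)
 mov %dx,%fs
 cmpl _NR_syscalls,%eax  # check whether eax exceeds the largest system call number; jump if it is out of range
 jae bad_sys_call
 call _sys_call_table(,%eax,4)   # indirect call to the C function for this system call
 pushl %eax                      # push the system call's return value onto the stack

...

ret_from_sys_call:  # runs after the system call has finished, to return to user mode
 movl _current,%eax  # load the pointer to the current task (process) structure into eax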
 cmpl _task,%eax   # task[0] cannot have signals
 ...

In call _sys_call_table(,%eax,4), the eax register holds the system call number __NR_write. _sys_call_table is an array of type int (*)() defined in sys.h. It stores the addresses of all the system call functions and is also called the system call function table, so __NR_write also serves as the index into the system call function table.

So why is %eax multiplied by 4? Because each entry of sys_call_table[] is a 4-byte pointer, so the address of the handler to call = [_sys_call_table + %eax * 4].
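
The scaled address calculation is the same arithmetic C performs for array indexing. Here is a small sketch with hypothetical helper names; note that sizeof(fn_ptr) is 4 only on a 32-bit target such as the 80386, so the code uses sizeof rather than a hard-coded 4.

```c
#include <stddef.h>

typedef int (*fn_ptr)(void);   /* a function pointer type, as in the kernel table */

/* Byte offset of entry nr from the start of a table of fn_ptr.
 * On the 32-bit 80386 this is nr * 4, matching _sys_call_table(,%eax,4);
 * on a 64-bit build the entry size would be 8 instead. */
size_t entry_offset(size_t nr)
{
    return nr * sizeof(fn_ptr);
}

/* Address of entry nr computed by hand: base + nr * entry size. */
fn_ptr *entry_address(fn_ptr *table, size_t nr)
{
    return (fn_ptr *)((char *)table + entry_offset(nr));
}
```

For any table t, entry_address(t, 4) is the same address as &t[4], which is exactly the slot that system call number 4 (write) selects.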

Let's look at the definition of sys_call_table:

//    /include/linux/sys.h

...
extern int sys_write();
...

fn_ptr sys_call_table[] = { sys_setup, sys_exit, sys_fork, sys_read,
sys_write, sys_open, sys_close, sys_waitpid, sys_creat, sys_link,
sys_unlink, sys_execve, sys_chdir, sys_time, sys_mknod, sys_chmod,
sys_chown, sys_break, sys_stat, sys_lseek, sys_getpid, sys_mount,
sys_umount, sys_setuid, sys_getuid, sys_stime, sys_ptrace, sys_alarm,
sys_fstat, sys_pause, sys_utime, sys_stty, sys_gtty, sys_access,
sys_nice, sys_ftime, sys_sync, sys_kill, sys_rename, sys_mkdir,
sys_rmdir, sys_dup, sys_pipe, sys_times, sys_prof, sys_brk, sys_setgid,
sys_getgid, sys_signal, sys_geteuid, sys_getegid, sys_acct, sys_phys,
sys_lock, sys_ioctl, sys_fcntl, sys_mpx, sys_setpgid, sys_ulimit,
sys_uname, sys_umask, sys_chroot, sys_ustat, sys_dup2, sys_getppid,
sys_getpgrp, sys_setsid, sys_sigaction, sys_sgetmask, sys_ssetmask,
sys_setreuid,sys_setregid, sys_sigsuspend, sys_sigpending, sys_sethostname,
sys_setrlimit, sys_getrlimit, sys_getrusage, sys_gettimeofday, 
sys_settimeofday, sys_getgroups, sys_setgroups, sys_select, sys_symlink,
sys_lstat, sys_readlink, sys_uselib };

// total number of system calls. Note: this is an improvement over Linux 0.11; adding a system call no longer requires adjusting this count by hand!
int NR_syscalls = sizeof(sys_call_table)/sizeof(fn_ptr);

From this we can see that call _sys_call_table(,%eax,4) calls the kernel function that corresponds to the system call number, which here is sys_write.
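
The bounds check, the indirect call, and the self-sizing table can be combined into one small sketch. Everything below is hypothetical and uses a two-entry table of toy functions; returning -ENOSYS for an out-of-range number is an assumption made for the example.

```c
#include <errno.h>

typedef long (*demo_fn)(long, long, long);

/* Toy handlers standing in for real system call functions. */
static long demo_exit(long a, long b, long c)  { (void)b; (void)c; return a; }
static long demo_write(long a, long b, long c) { (void)a; (void)b; return c; }

/* Hypothetical two-entry table; the real one is sys_call_table in sys.h. */
static demo_fn demo_table[] = { demo_exit, demo_write };

/* Entry count derived from the table itself, in the style of NR_syscalls,
 * so adding an entry never requires editing a separate constant. */
static const unsigned long demo_nr_syscalls =
    sizeof(demo_table) / sizeof(demo_table[0]);

long demo_dispatch(unsigned long nr, long a, long b, long c)
{
    if (nr >= demo_nr_syscalls)      /* unsigned compare, like cmpl followed by jae */
        return -ENOSYS;              /* assumed error value for a bad number */
    return demo_table[nr](a, b, c);  /* indirect call through the table */
}
```

Because the comparison is unsigned, a negative number passed in by mistake is treated as a huge index and rejected instead of reading memory before the table.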

Finally execute sys_write

sys_write is in read_write.c under fs:

//   /fs/read_write.c

// write-file system call
int sys_write(unsigned int fd,char * buf,int count)
{
 struct file * file;
 struct m_inode * inode;

  // validate the function arguments
 if (fd>=NR_OPEN || count <0 || !(file=current->filp[fd]))
  return -EINVAL;
 if (!count)
  return 0;
  // get the i-node of the file
 inode=file->f_inode;
  // if it is a pipe and the file is open for writing, perform a pipe write
 if (inode->i_pipe)
  return (file->f_mode&2)?write_pipe(inode,buf,count):-EIO;
  // if it is a character device file, perform a character device write
 if (S_ISCHR(inode->i_mode))
  return rw_char(WRITE,inode->i_zone[0],buf,count,&file->f_pos);
  // if it is a block device file, perform a block device write
 if (S_ISBLK(inode->i_mode))
  return block_write(inode->i_zone[0],&file->f_pos,buf,count);
  // if it is a regular file, perform a file write
 if (S_ISREG(inode->i_mode))
  return file_write(inode,file,buf,count);
 printk("(Write)inode->i_mode=%06o\n\r",inode->i_mode);
 return -EINVAL;
}

At this point, the library function write has made a system call and has finally reached the sys_write function.

Let’s review the entire system call process through the following figure:

Kernel mode and user mode data interaction

By now we understand the system call flow, but one question remains: how is data exchanged between kernel mode and user mode?

Looking back at the system call flow, we can see that registers play an indispensable role. Linus uses a similar approach for data exchange in Linux 0.12.

Continuing with sys_write as the example, let's look at what happens inside file_write(inode,file,buf,count);

//   /fs/file_dev.c

// File write function - writes user data into the file based on the i-node and file structure information
int file_write(struct m_inode * inode, struct file * filp, char * buf, int count)
{
 off_t pos;
 int block,c;
 struct buffer_head * bh;
 char * p;
 int i=0;

/*
 * ok, append may not work when many processes are writing at the same time
 * but so what. That way leads to madness anyway.
 */
 // if the append flag is set, move the current position to the end of the file
 if (filp->f_flags & O_APPEND)
  pos = inode->i_size;
 else
  pos = filp->f_pos;
  // i is the number of bytes written so far; count is the number of bytes to write
 while (i<count) {
    // get the data block number for this file position, creating the block if it does not exist
  if (!(block = create_block(inode,pos/BLOCK_SIZE)))
   break;
  if (!(bh=bread(inode->i_dev,block)))
   break;
  c = pos % BLOCK_SIZE;
  p = c + bh->b_data; // position in the block where writing starts
  bh->b_dirt = 1; // mark the buffer as needing write-back to disk
  c = BLOCK_SIZE-c; // number of bytes that still fit in this block
  if (c > count-i) c = count-i;
  pos += c;
  if (pos > inode->i_size) {
   inode->i_size = pos;
   inode->i_dirt = 1;
  }
  i += c;
  while (c-->0)
   *(p++) = get_fs_byte(buf++); // copy one byte from user space into kernel space
  brelse(bh);
 }
  // the loop exits once all the data has been written or a problem occurs during the write
 inode->i_mtime = CURRENT_TIME;
 if (!(filp->f_flags & O_APPEND)) {
  filp->f_pos = pos;
  inode->i_ctime = CURRENT_TIME;
 }
 return (i?i:-1);
}
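
The block arithmetic inside the while loop is easier to follow in isolation. The sketch below reproduces only the position and length calculations, with no buffers or i-nodes; the helper names are hypothetical, and BLOCK_SIZE is assumed to be 1024 since the excerpt above does not show its definition.

```c
#define BLOCK_SIZE 1024   /* assumed value; not shown in the excerpt above */

/* Bytes the loop would copy in its next iteration, given the current file
 * position and the number of bytes still to be written. */
int next_chunk(long pos, int remaining)
{
    int offset = (int)(pos % BLOCK_SIZE);   /* where in the block writing starts */
    int room = BLOCK_SIZE - offset;         /* bytes that still fit in this block */
    return room > remaining ? remaining : room;
}

/* Number of loop iterations (blocks touched) for a write of count bytes at pos. */
int blocks_touched(long pos, int count)
{
    int n = 0;
    while (count > 0) {
        int c = next_chunk(pos, count);
        pos += c;
        count -= c;
        n++;
    }
    return n;
}
```

For example, writing 100 bytes at position 1000 first copies the 24 bytes that fit in the current block and then the remaining 76 into the next one, so two blocks are touched.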

We won't go into the details here; we will come back to them after covering disks and file systems. For now, let's focus on the get_fs_byte function and look at its source code:

//  include/asm/segment.h
 
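 // Read the byte at the given address in the fs segment.
 // Parameter: addr - the memory address.
 // %0 - (the returned byte _v); %1 - (memory address addr).
 // Returns: the byte at fs:[addr].
 // The third line of the function declares a register variable _v, which is kept in a register for efficient access.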
extern inline unsigned char get_fs_byte(const char * addr)
{
 unsigned register char _v;

 __asm__ ("movb %%fs:%1,%0":"=r" (_v):"m" (*addr));
 return _v;
}

 // Store one byte at the given address in the fs segment.
 // Parameters: val - the byte value; addr - the memory address.
 // %0 - register (byte value val); %1 - (memory address addr).
extern inline void put_fs_byte(char val,char *addr)
{
__asm__ ("movb %0,%%fs:%1"::"r" (val),"m" (*addr));
}

The get_fs_byte function copies one byte of data from user space to kernel space, while put_fs_byte copies one byte of data from kernel space to user space.
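
The idea of reaching user data only through dedicated accessors can be imitated in ordinary C. This is a loose simulation under stated assumptions: the array user_space plays the part of the segment that FS points to, and the sim_ functions are hypothetical stand-ins for the single movb instructions shown above.

```c
#include <stddef.h>

#define USER_SPACE_SIZE 64

/* Hypothetical stand-in for the user data segment that FS points to. */
static unsigned char user_space[USER_SPACE_SIZE];

/* Simulated get_fs_byte: read one byte at a user-space offset.
 * Out-of-range offsets read as 0 in this sketch. */
unsigned char sim_get_fs_byte(size_t addr)
{
    return addr < USER_SPACE_SIZE ? user_space[addr] : 0;
}

/* Simulated put_fs_byte: store one byte at a user-space offset.
 * Out-of-range offsets are ignored in this sketch. */
void sim_put_fs_byte(unsigned char val, size_t addr)
{
    if (addr < USER_SPACE_SIZE)
        user_space[addr] = val;
}

/* Copy n bytes from user space into a kernel buffer one byte at a time,
 * in the same way as the inner loop of file_write. */
void sim_copy_from_user(unsigned char *kbuf, size_t uaddr, size_t n)
{
    while (n-- > 0)
        *kbuf++ = sim_get_fs_byte(uaddr++);
}
```

The kernel-side buffer never holds a pointer into user_space; every byte crosses the boundary through an accessor, which is the discipline the fs-based functions enforce.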

Throughout a system call, the DS and ES segment registers point to the kernel data space, while the FS segment register points to the user data space. Some people may ask why.

Recall this fragment of _system_call in /kernel/sys_call.s:

_system_call:
...
 movl $0x10,%edx  # point ds, es at the kernel data segment
 mov %dx,%ds
 mov %dx,%es
 movl $0x17,%edx  # point fs at the current local data segment (the data segment descriptor in the LDT)
 mov %dx,%fs
...

0x10 is the selector of the kernel data segment descriptor in the global descriptor table (GDT), and 0x17 is the selector of the task's data segment descriptor in the local descriptor table (LDT).

Therefore, Linux uses the FS register to copy data between kernel data space and user data space. When the process returns from the interrupt, the saved register values are popped back off the kernel stack, which is fast and efficient.
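
The two selector values 0x10 and 0x17 can be taken apart by hand to see why one refers to the GDT and the other to the LDT. The sketch below assumes the standard x86 selector layout (bits 0-1 RPL, bit 2 table indicator, bits 3-15 index); the struct and function names are made up for illustration.

```c
/* Fields of an x86 segment selector. */
struct selector_fields {
    unsigned index;   /* descriptor index within the table */
    unsigned ti;      /* table indicator: 0 = GDT, 1 = LDT */
    unsigned rpl;     /* requested privilege level, 0..3 */
};

/* Hypothetical helper: split a 16-bit selector into its three fields. */
struct selector_fields decode_selector(unsigned short sel)
{
    struct selector_fields f;
    f.rpl   = sel & 0x3;          /* bits 0-1 */
    f.ti    = (sel >> 2) & 0x1;   /* bit 2 */
    f.index = sel >> 3;           /* bits 3-15 */
    return f;
}
```

Decoding 0x10 gives index 2 in the GDT with RPL 0, and decoding 0x17 gives index 2 in the LDT with RPL 3, which matches the kernel data segment and the task's own data segment respectively.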

References:

"Operating System Concepts"

"Linux Kernel Complete Annotation 5.0"

why-do-x86-cpus-only-use-two-out-of-four-rings
