Implementation of a high-precision delay scheme under X86-Linux (10us error)
High-precision delay on Linux: most methods found online only achieve a delay accuracy of about 50us. Today we will look at how Mr. Dong solved the problem and improved the delay accuracy to 10us.
A friend was recently developing an EtherCAT master for a project, which required a high-precision delay mechanism: with a 1000us cycle, the error must not exceed 1% (10us).
Since the project's hardware platform is an Intel x86 processor, anyone familiar with Linux knows this is hard to achieve. When evaluating the plan I was a bit hasty and went straight for the PREEMPT_RT patch + kernel hrtimer + signal notification approach. The verification results were very satisfactory at the time, so I excitedly told my boss the plan was feasible, little knowing that I had dug myself a huge hole...
When the actual project started, it turned out that this solution would not work at all, for two reasons:
- A signal can only be delivered to a process, and the current porting scheme cannot guarantee that the notified process has no other threads. With signals sent at such a high frequency, the other threads would basically be killed. (Additional note: this refers specifically to the kernel driver notifying the application layer; in user space there are dedicated functions that can notify individual threads. Further research showed the problem can also be worked around by setting each thread's sigmask, but that does not change the conclusion that the scheme is unworkable.)
- This is the main reason. Although the EtherCAT synchronization period can be fixed when the program starts, the running period needs to be adjusted dynamically during actual operation, with an adjustment range within 5us. At that point the overhead of dynamically re-arming the hrtimer becomes impossible to ignore. In other words, what we need is a delay mechanism, not a timer.
So this plan was dropped.
Since signals don't work, I had to analyze other approaches. To sum up, I roughly tried the following:
1. Settle on a sleep call: I tried usleep, nanosleep, clock_nanosleep, cond_timedwait, select, etc., and finally settled on clock_nanosleep. The reason for choosing it was not that it supports ns-level precision: testing showed that the accuracy of all the calls above is almost the same once the cycle is below 10000us, and the error mainly comes from the overhead of context switching. The main reason is that it supports an option called TIMER_ABSTIME, i.e. absolute time. Here is a simple example of why absolute time matters:
while (1)
{
    do_work();
    sleep(1);
    do_post();
}
Assume the loop above. Our goal is for do_post() to run once per 1s period, but in practice the period can never be exactly 1s, because sleep() can only delay by a relative time: the actual period of the loop is the overhead of do_work() plus the sleep(1) time. That overhead becomes impossible to ignore in our scenario. The advantage of clock_nanosleep is that, first, it lets you choose the clock source, and second, it supports absolute-time wake-up. So before each do_work() I set the absolute time at which clock_nanosleep should next wake up; the actual sleep then has the overhead of do_work() subtracted from it automatically, which is essentially the concept of an alarm clock.
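As an illustration, here is a minimal sketch of such an absolute-time periodic loop using clock_nanosleep with TIMER_ABSTIME. The 1000us period, the use of CLOCK_MONOTONIC, and the do_work()/do_post() placeholders are assumptions for the example, not the project's actual code:

#include <time.h>

#define PERIOD_NS    (1000 * 1000)   /* 1000us cycle */
#define NSEC_PER_SEC 1000000000L

extern void do_work(void);           /* placeholders from the example above */
extern void do_post(void);

static void timespec_add_ns(struct timespec *ts, long ns)
{
    ts->tv_nsec += ns;
    while (ts->tv_nsec >= NSEC_PER_SEC) {
        ts->tv_nsec -= NSEC_PER_SEC;
        ts->tv_sec += 1;
    }
}

void periodic_loop(void)
{
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);   /* reference point */

    for (;;) {
        timespec_add_ns(&next, PERIOD_NS);   /* next absolute wake-up time */
        do_work();
        /* Sleep until an absolute point in time: the cost of do_work()
         * is absorbed instead of being added on top of the period. */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        do_post();
    }
}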
2. Switch to real-time threads: change the threads of important tasks into real-time threads, set the scheduling policy to FIFO, and set the priority to the highest, to reduce the chance of being preempted.
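A minimal sketch of this step, assuming the change is applied to the calling thread itself (the function name is illustrative):

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int make_thread_realtime(void)
{
    struct sched_param sp;
    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);  /* usually 99 */

    /* Requires root or CAP_SYS_NICE, otherwise the call fails. */
    int ret = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    if (ret != 0)
        fprintf(stderr, "pthread_setschedparam failed: %d\n", ret);
    return ret;
}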
3. Set thread affinity: plan the CPU layout for all threads in the application and bind the few heavily loaded task threads to different CPU cores according to their load, reducing the overhead of migrating between CPUs.
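A minimal sketch of pinning a thread to a given core using the GNU pthread_setaffinity_np extension (the choice of core is an assumption; in practice it follows the load measurements mentioned above):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

int pin_thread_to_cpu(pthread_t thread, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* Returns an error number if cpu is out of range or the call fails. */
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}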
4. Reduce unnecessary sleep calls: many tasks used sleep, so I used the strace command to measure the proportion of sleep-type system calls in the whole application, and it was as high as 98%. The overhead of such high-frequency sleep + wake-up cannot be ignored, so I changed the sleep-based polling to blocking on a semaphore: the semaphore wait in the pthread library is implemented with futex, which makes the cost of waking a thread much smaller. Sleeps elsewhere were also optimized as much as possible. The effect was quite noticeable, reducing the error by almost 20us.
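A minimal sketch of the idea, assuming a simple producer/consumer pair with a POSIX semaphore (sem_wait blocks in the kernel via futex, so an idle waiter is cheap); the function names are illustrative:

#include <semaphore.h>

static sem_t work_ready;

extern void handle_event(void);      /* placeholder for the real task */

void init_notify(void)
{
    sem_init(&work_ready, 0, 0);     /* process-private, initial count 0 */
}

void consumer_thread(void)
{
    for (;;) {
        if (sem_wait(&work_ready) != 0)
            continue;                /* retry if interrupted by a signal */
        handle_event();
    }
}

void producer_notify(void)
{
    sem_post(&work_ready);           /* wake exactly one waiter */
}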
5. A trick: strip the smallest possible set of tasks out of the existing application and reduce the influence of all external tasks.
After the above five points, the error at a 1000us period was brought down from roughly ±100us at the beginning to ±40us. But this was still not enough...
At the end of my rope, I started a long round of Googling and Baidu-ing...
During this period, some strange phenomena were also discovered, such as in the figure below.
The figures were generated with Python from the data of the packet-capture tool, so there is no doubt about their validity. The vertical axis is the actual time taken by each cycle. Some very interesting phenomena can be observed:
1. At regular intervals, large error jitter appears in concentrated bursts.
2. The error is not normally distributed, but clusters frequently around ±30us.
3. Whenever a large error occurs, an error in the opposite direction is certain to occur in the next cycle, with roughly the same magnitude (this cannot be seen from the chart, but was found through other analysis).
In short: if the execution time of one cycle is 980us, then the execution time of the next cycle will be around 1020us.
Points 1 and 2 could be eliminated with the optimization measures above, but I found no effective method for point 3. My guess is that the kernel is aware of this error and intends to compensate for it; if any expert knows the principle behind this, please share.
I also tried manual intervention to deal with this third strange phenomenon, for example setting a threshold: when the actual execution error of a cycle exceeded the threshold, I manually subtracted that error when setting the wake-up time for the next cycle. But the result was astonishing: it was actually even worse...
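For reference, a rough sketch of the compensation idea I tried (which, as noted, made things worse in practice); the threshold value and names are illustrative:

#define PERIOD_NS    (1000 * 1000)   /* 1000us nominal period */
#define THRESHOLD_NS (20 * 1000)     /* only compensate errors above 20us */

long next_period_ns(long measured_cycle_ns)
{
    long error = measured_cycle_ns - PERIOD_NS;   /* positive: ran long */
    if (error > THRESHOLD_NS || error < -THRESHOLD_NS)
        return PERIOD_NS - error;    /* pull the next wake-up the other way */
    return PERIOD_NS;
}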
After more than 200 rounds of parameter tuning and being stuck on this problem for over a week, I no longer remember what search keywords I typed, but I stumbled on a Dell document that finally solved the problem.
A targeted search then revealed the whole story:
It turns out that, to save energy, Intel CPUs have multiple power states, known as C-states.
While a program is running, the CPU is in the C0 state, but once the operating system goes idle, the CPU uses the HLT instruction to switch into C1 or C1E mode. If the OS has to wake up from this mode, the context-switch overhead increases!
Logically, this option can be disabled in the BIOS, but the pitfall lies in the kernel version: relatively new Linux kernels enable this state by default and ignore the BIOS setting! This is very confusing!
Further targeted searching turned up tests by other users: the 2.6 kernel does not enable this by default, but the 3.2 kernel does. Comparison tests on the same hardware show that between the two kernel versions the context-switch overhead can differ by a factor of 10: about 4us on the former versus 40-60us on the latter.
1. Permanent modification: modify the Linux boot parameters by changing the GRUB_CMDLINE_LINUX_DEFAULT option in /etc/default/grub to the following:
intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll
Then run the update-grub command to make the parameters take effect, and reboot.
2. Dynamic modification: you can write a value to the /dev/cpu_dma_latency file to limit the context-switch overhead allowed by C1/C1E mode. I chose to write 0, which disables the deep states entirely. You can also write another value: it represents the acceptable wake-up latency in microseconds, so writing 1 sets the tolerated cost to 1us. Of course, the value has a valid range, which can be found in /sys/devices/system/cpu/cpuX/cpuidle/stateY/latency, where X is the core number and Y the corresponding idle state.
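A minimal sketch of this dynamic method in C (the function name is illustrative). Note that the kernel only honours the request while the file descriptor stays open, so the application should keep it open for its whole lifetime:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

int lock_cpu_latency(int32_t max_latency_us)
{
    int fd = open("/dev/cpu_dma_latency", O_WRONLY);
    if (fd < 0)
        return -1;
    /* Write the allowed latency in microseconds; 0 disables deep C-states. */
    if (write(fd, &max_latency_us, sizeof(max_latency_us)) < 0) {
        close(fd);
        return -1;
    }
    return fd;   /* caller keeps this fd open to hold the setting */
}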
At this point, the performance problem was finally solved. The current stable test performance is shown in the figure below:
High-precision delay achieved under X86-Linux: an accurate 1000us delay with 10us precision.
Thank you for your attention; the next issue will be even more exciting.
-- END --