Watchdogs come in two kinds: hardware and software. A hardware watchdog uses a timer circuit whose timeout output is connected to the reset pin of the system. The program clears the timer at intervals shorter than the timeout period (commonly known as "feeding the dog"), so as long as the program runs normally the timer never overflows and no reset signal is generated. If the program fails and the watchdog is not cleared within the timeout period, the timer overflows, generates a reset signal, and restarts the system. A software watchdog works on the same principle, except that the external timer circuit is replaced by a timer inside the processor. This simplifies the hardware design but is less reliable than a hardware timer: for example, a failure of the internal timer itself cannot be detected. There are also schemes in which two timers monitor each other, but they increase system overhead and still cannot cover every case, such as a failure of the interrupt system that stops the timer interrupts altogether.
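The behaviour described above can be illustrated with a small host-side model. This is only a sketch under stated assumptions: SimWatchdog and the sim_* functions are hypothetical names invented for the illustration, the "timer" is a plain down-counter driven by a loop, and nothing here touches a real watchdog peripheral.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a hardware watchdog: a down-counter that
 * latches a reset request when it reaches zero. Not a device driver. */
typedef struct {
    uint32_t reload;   /* timeout period in ticks */
    uint32_t counter;  /* ticks left before the timer expires */
    bool     reset;    /* set once the timer has expired */
} SimWatchdog;

static void sim_wdt_init(SimWatchdog *w, uint32_t timeout_ticks)
{
    w->reload  = timeout_ticks;
    w->counter = timeout_ticks;
    w->reset   = false;
}

/* "Feeding the dog": reload the counter to the full timeout. */
static void sim_wdt_feed(SimWatchdog *w)
{
    w->counter = w->reload;
}

/* Advance the timer by one tick; latch reset when it runs out. */
static void sim_wdt_tick(SimWatchdog *w)
{
    if (w->reset)
        return;
    if (w->counter > 0)
        w->counter--;
    if (w->counter == 0)
        w->reset = true;
}

/* Run for `ticks` ticks, feeding every `feed_period` ticks
 * (feed_period == 0 models a hung program that never feeds).
 * Returns true if a reset was generated. */
static bool sim_run(SimWatchdog *w, uint32_t ticks, uint32_t feed_period)
{
    for (uint32_t t = 1; t <= ticks; t++) {
        sim_wdt_tick(w);
        if (feed_period != 0 && t % feed_period == 0)
            sim_wdt_feed(w);
    }
    return w->reset;
}
```

With a timeout of 20 ticks, feeding every 15 ticks never lets the counter reach zero, while a program that stops feeding triggers the reset after 20 ticks.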
A watchdog is not a means of fixing problems in the system. Faults found during debugging should be traced to errors in the design and corrected there. The purpose of a watchdog is to return the system to a normal working state automatically, without human intervention, when it hangs because of a latent program error or harsh environmental interference. A watchdog cannot completely avoid the losses caused by a fault, since there is always some downtime between the detection of the fault and the completion of the reset and recovery. In addition, some systems need to save their working context before the reset and restore it after the restart, which adds further software and hardware cost.
Figure 1: (a) Schematic diagram of a multi-tasking system watchdog; (b) corresponding watchdog reset logic diagram
The watchdog in a single-task system works as described above and is easy to implement. A multi-task system is slightly more complicated. If every task feeds the watchdog directly, as it would in a single-task system (Figure 1(a)), then the watchdog timer does not overflow as long as at least one task is still working normally and feeding the dog regularly. The timer overflows and resets the system only when all tasks have failed, as shown in Figure 1(b).
What is usually wanted, however, is a reset as soon as any one task fails, or, when only a few key tasks are selected for monitoring, as soon as any one of those fails, as shown in Figure 2(a). The corresponding watchdog reset logic is shown in Figure 2(b).
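The difference between the two reset conditions can be written down as two predicates. This is a sketch for illustration only: the function names and the failed[] array are hypothetical, and the figures themselves are not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>

/* failed[i] is true when task i has stopped feeding the dog. */

/* Every task feeds the hardware watchdog directly: the timer only
 * overflows when ALL tasks have failed. */
static bool reset_if_all_failed(const bool failed[], size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!failed[i])
            return false;   /* one healthy task keeps the dog fed */
    return n > 0;
}

/* Desired behaviour: reset as soon as ANY monitored task has failed. */
static bool reset_if_any_failed(const bool failed[], size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (failed[i])
            return true;
    return false;
}
```

With one failed task out of three, the first predicate stays false (the fault goes unnoticed) while the second requests a reset.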
To achieve this in a multi-task system, a monitoring task, TaskMonitor, is created with a priority higher than that of the monitored task group Task1, Task2, ..., Taskn. While Task1~Taskn are working normally, TaskMonitor clears the hardware watchdog timer periodically. If some task Task_x in the monitored group fails, TaskMonitor stops clearing the watchdog timer, so the system restarts automatically when a monitored task fails. If TaskMonitor itself fails, it likewise cannot clear the watchdog timer in time, and the watchdog again resets and restarts the system. The remaining problem is how the monitoring task can effectively monitor the monitored task group.
Figure 2: (a) Schematic diagram of a multi-tasking system watchdog; (b) Correct watchdog reset logic diagram
Define a set of structures in TaskMonitor to simulate a group of watchdog timers, one per monitored task:
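typedef struct
{
    UINT32 CurCnt;    /* incremented each time the task "feeds the dog" */
    UINT32 LastCnt;   /* value of CurCnt saved by TaskMonitor before its delay */
    BOOL   RunState;  /* whether this watchdog is currently in effect */
    int    taskID;    /* number of the monitored task */
} STRUCT_WATCH_DOG;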
The structure includes the monitored task number taskID, the variables CurCnt and LastCnt used to simulate "feeding the dog" (see below for specific meanings), and the watchdog state flag RunState used to control whether the current task is monitored.
Each monitored task Task1~Taskn calls the custom function CreateWatchDog(int taskid) to create its watchdog, and must then "feed the dog" within each monitoring period by calling ResetWatchDog(int taskid). Feeding the dog simply adds 1 to the CurCnt variable in the task's watchdog structure. TaskMonitor spends most of its time in a delay. Assuming that the hardware watchdog timeout is 2 seconds, the monitoring task can delay for 1.5 seconds and then check the created watchdogs one by one. Before the delay, the current value of CurCnt is saved to LastCnt; after the delay, CurCnt is compared with LastCnt. If the two differ, the task has fed the dog and is considered normal. Note that if CurCnt and LastCnt are too narrow (too few bytes) and the dog is fed very frequently, CurCnt can wrap all the way around during one delay and end up equal to LastCnt again, which would be mistaken for a failure.
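The wrap-around hazard mentioned above is easy to reproduce on a host. The sketch below is an illustration, not part of the original design: the two helper functions are hypothetical, and they simply count `feeds` increments on an 8-bit and a 32-bit counter and report whether the result equals the saved value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if a monitor comparing CurCnt with LastCnt would see
 * "no feeding" even though the task fed `feeds` times in one delay,
 * when the counters are only 8 bits wide. */
static bool looks_stalled_u8(uint8_t start, uint32_t feeds)
{
    uint8_t last = start;   /* snapshot taken before the delay */
    uint8_t cur  = start;
    for (uint32_t i = 0; i < feeds; i++)
        cur++;              /* wraps modulo 256 */
    return cur == last;
}

/* The same check with 32-bit counters, as in STRUCT_WATCH_DOG. */
static bool looks_stalled_u32(uint32_t start, uint32_t feeds)
{
    uint32_t last = start;
    uint32_t cur  = start;
    for (uint32_t i = 0; i < feeds; i++)
        cur++;              /* wraps only after 2^32 feeds */
    return cur == last;
}
```

Exactly 256 feeds within one delay make the 8-bit counter look untouched, whereas the 32-bit counter would need 2^32 feeds in 1.5 seconds, which is why a UINT32 is a comfortable choice here.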
If CurCnt equals LastCnt for any watchdog, the corresponding task is considered not to have fed the dog: a fault has been detected and the system needs to be restarted. In that case TaskMonitor stops clearing the hardware watchdog timer, or simply delays for a long time, such as 10 seconds, which is more than enough for the hardware watchdog to restart the system. Conversely, while the system is normal, Task1~Taskn regularly feed the dog to TaskMonitor, TaskMonitor regularly feeds the hardware watchdog, and the system is not reset. In addition, a monitored task can suspend its own watchdog by calling PauseWatchDog(int taskid), which simply operates on the RunState flag in the STRUCT_WATCH_DOG structure. This flag indicates whether the watchdog is in effect.
The maximum number of tasks that can be monitored in this way is determined by the number of STRUCT_WATCH_DOG structures allocated. The program should keep a variable that records the number of watchdogs currently created. Determining whether the monitored tasks Task1~Taskn have all fed the dog then takes only n comparisons of CurCnt against LastCnt.
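One possible shape for the task-side functions is sketched below. The article names CreateWatchDog, ResetWatchDog and PauseWatchDog but does not show their bodies, so everything else is an assumption made for the illustration: the int return values (0 on success, -1 on failure), the capacity MAX_WATCH_DOGS, the table g_dogs, the counter g_dogCount and the helper FindWatchDog. UINT32 and BOOL are redefined here only so that the sketch compiles on a host; on vxWorks they come from the system headers.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t UINT32;   /* host stand-ins for the vxWorks types */
typedef bool     BOOL;

typedef struct {
    UINT32 CurCnt, LastCnt;
    BOOL   RunState;
    int    taskID;
} STRUCT_WATCH_DOG;

#define MAX_WATCH_DOGS 16                      /* assumed capacity */

static STRUCT_WATCH_DOG g_dogs[MAX_WATCH_DOGS];
static int g_dogCount = 0;                     /* watchdogs created so far */

/* Hypothetical helper: look up the watchdog belonging to a task. */
static STRUCT_WATCH_DOG *FindWatchDog(int taskid)
{
    for (int i = 0; i < g_dogCount; i++)
        if (g_dogs[i].taskID == taskid)
            return &g_dogs[i];
    return 0;
}

/* Create a watchdog for the task; fails if the table is full or the
 * task already has one. */
int CreateWatchDog(int taskid)
{
    if (g_dogCount >= MAX_WATCH_DOGS || FindWatchDog(taskid) != 0)
        return -1;
    STRUCT_WATCH_DOG *d = &g_dogs[g_dogCount++];
    d->CurCnt   = 0;
    d->LastCnt  = 0;
    d->RunState = true;     /* monitored from now on */
    d->taskID   = taskid;
    return 0;
}

/* "Feed the dog": add 1 to CurCnt. */
int ResetWatchDog(int taskid)
{
    STRUCT_WATCH_DOG *d = FindWatchDog(taskid);
    if (d == 0)
        return -1;
    d->CurCnt++;
    return 0;
}

/* Take the task out of monitoring by clearing RunState. */
int PauseWatchDog(int taskid)
{
    STRUCT_WATCH_DOG *d = FindWatchDog(taskid);
    if (d == 0)
        return -1;
    d->RunState = false;
    return 0;
}
```

In a real multi-task system, CreateWatchDog would presumably need protection against concurrent callers (for example a mutex around the table update); that is left out here to keep the sketch short.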
Figure 3: System reset logic diagram.
The hardware watchdog monitors the TaskMonitor task, and TaskMonitor in turn monitors the tasks Task1~Taskn, forming a chain. The resulting fault logic is shown in Figure 3. The failures of Task1~Taskn and of TaskMonitor are combined in an OR relationship, so the failure of any one of them causes the hardware watchdog to reset the system.
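The chain can be sketched as a single check that TaskMonitor performs after each delay. This is a host-side illustration under assumptions: MonitorCycle, FeedHardwareWatchDog and g_hwFeeds are hypothetical names, the hardware feed is a stub that only counts calls, and the structure uses host types in place of UINT32 and BOOL.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t CurCnt, LastCnt;
    bool     RunState;
    int      taskID;
} STRUCT_WATCH_DOG;        /* same fields as in the article, host types */

static int g_hwFeeds = 0;  /* stub: counts feeds of the hardware watchdog */

/* Hypothetical stand-in for clearing the real hardware watchdog timer. */
static void FeedHardwareWatchDog(void)
{
    g_hwFeeds++;
}

/* One check, run by TaskMonitor after its delay.
 * Feeds the hardware watchdog only if every active software watchdog
 * was fed since the previous check, then saves CurCnt to LastCnt for
 * the next cycle. Returns false when a stalled task was found. */
bool MonitorCycle(STRUCT_WATCH_DOG dogs[], int n)
{
    bool healthy = true;
    for (int i = 0; i < n; i++) {
        if (dogs[i].RunState && dogs[i].CurCnt == dogs[i].LastCnt)
            healthy = false;            /* this task did not feed */
        dogs[i].LastCnt = dogs[i].CurCnt;
    }
    if (healthy)
        FeedHardwareWatchDog();
    return healthy;
}

/* In the real task the loop might look roughly like this (vxWorks calls,
 * shown as a comment because they are not available on a host):
 *
 *   for (;;) {
 *       taskDelay(3 * sysClkRateGet() / 2);      // 1.5 s
 *       if (!MonitorCycle(dogs, count))
 *           taskDelay(10 * sysClkRateGet());     // let the hardware dog bite
 *   }
 */
```

Because the hardware feed is gated on every active entry, one stalled task is enough to withhold it, and a stalled TaskMonitor withholds it as well, which gives the OR relationship described above.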
Adding the TaskMonitor task makes watchdog monitoring possible in a multi-task system, but how much execution time this task consumes is also important. TaskMonitor delays for 1.5 seconds in each monitoring cycle and otherwise only saves the current count values and decides whether to feed the dog, so it occupies very little CPU time. In one experiment on a 50 MHz CPU (S3C4510) running a port of the vxWorks operating system, monitoring 10 tasks with the cache disabled, each monitoring cycle took 220~240 microseconds, which is under 0.02% of the 1.5-second cycle. The task therefore spends almost all of its time in the delayed state.
A monitored task may contain statements that receive a message or wait for a semaphore, and such waits are often indefinite. These statements need to be modified so that the task keeps feeding the dog while it waits. For example, in vxWorks, an indefinite semaphore acquisition
semTake(semID, WAIT_FOREVER); // WAIT_FOREVER means wait indefinitely
is decomposed into
do
{
    ResetWatchDog(taskid); // "Feed the dog"; taskid is the ID this task passed to CreateWatchDog
} while (semTake(semID, sysClkRateGet()) != OK); // Wait up to 1 s for the semaphore (sysClkRateGet() returns ticks per second)
The semaphore is thus acquired through a series of bounded 1-second waits rather than a single indefinite one, which ensures that the dog is fed in time.
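The same decomposition applies to any blocking call that accepts a timeout, such as a message receive. A generic host-side sketch follows; it is an assumption-laden illustration rather than vxWorks code: TimedWaitFn, FeedFn and WaitWithFeeding are hypothetical names, and the two Demo* functions are stubs that stand in for a 1-second timed wait and for ResetWatchDog.

```c
#include <stdbool.h>

/* Hypothetical callback types: a wait bounded by a short timeout
 * (true = resource obtained) and a "feed the dog" function. */
typedef bool (*TimedWaitFn)(void *ctx);
typedef void (*FeedFn)(int taskid);

/* Replace one indefinite wait with repeated bounded waits,
 * feeding the dog before each attempt. */
bool WaitWithFeeding(TimedWaitFn wait, void *ctx, FeedFn feed, int taskid)
{
    for (;;) {
        feed(taskid);
        if (wait(ctx))
            return true;
    }
}

/* Demo stubs, for illustration only. */
static int g_feedCount = 0;

static void DemoFeed(int taskid)
{
    (void)taskid;
    g_feedCount++;
}

/* Pretends the resource becomes available on the third attempt. */
static bool DemoWaitSucceedsOnThird(void *ctx)
{
    int *calls = (int *)ctx;
    return ++(*calls) >= 3;
}
```

One caveat worth keeping in mind: a loop of this kind treats every unsuccessful return as a timeout. If the underlying call can also fail immediately for other reasons (for example an invalid object ID), the loop would spin while still feeding the dog, so production code would likely need to tell a timeout apart from other errors.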
Another point to check is whether the system contains tasks with a higher priority than TaskMonitor that run for long stretches. If so, TaskMonitor may not be scheduled for a long time, and the hardware watchdog will reset the system even though nothing has failed. With good task partitioning and priority assignment, high-priority tasks should never run for that long.