
Watchdogs

About external watchdogs

Software that is critical for vehicle function requires the highest level of availability and reliability, which makes it important to detect and react to unexpected behaviors. A watchdog is a mechanism that monitors whether a process performs recurring actions at defined intervals or within a specified period of time. If the action does not occur, then the watchdog triggers a predefined reaction intended to restore the system to a working state. Watchdogs use a combination of hardware and software mechanisms to enable systems to detect and react to a malfunction.

Failure to interact with (or "pet") the watchdog correctly or on schedule indicates that a fault has occurred.

An external watchdog can watch for and respond to malfunctions even when the system is unresponsive.

Faults, errors, and failures

As you learn about how watchdogs can alert you to potential failures of your operating system (OS), remember the following terms:

Error : An observable deviation from correct operation caused by a fault, which is evidence that the fault occurred

Fault : A temporal (functional, deadline) or logical (progress, correctness) defect, demonstrated by one or more errors, that can cause a failure

Failure : Termination of the system's intended behavior due to one or more unmitigated faults

Fault handling time intervals

By configuring a watchdog to check for certain operations at specific times, you can detect faults that your OS could not otherwise detect and respond before they cause a failure.

Temporal fault detection using a watchdog

FDTI + FRTI = FHTI, and the fault is handled safely only when FHTI < FTTI (not when FHTI >= FTTI)

To ensure predictable functionality in your OS, the system must detect, react to, and handle a fault before it causes a failure that might lead to a hazardous event. The risk of a hazardous event increases if the FHTI exceeds the FTTI.

Remember the following time intervals as you consider watchdog configurations for your OS:

Fault Detection Time Interval (FDTI) : The time between fault occurrence and detection

Fault Reaction Time Interval (FRTI) : The time between fault detection and the return of the system to a normal state

Fault Handling Time Interval (FHTI) : The total time between fault occurrence and the return of the system to a normal state

Fault Tolerant Time Interval (FTTI) : The time between fault occurrence and a potential hazardous event
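
For example, with hypothetical values of FDTI = 20 ms and FRTI = 30 ms, the FHTI is 20 ms + 30 ms = 50 ms, so the system avoids a hazardous event only if the FTTI is longer than 50 ms.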

Watchdog types

Depending on the system design, watchdog failure or timeout signals might trigger notifications and responses, such as a reset or interrupt request (IRQ).

The following are examples of watchdog types that you might consider for the function you intend to watch:

Timeout watchdog : A timer that the system must reload within the configured time limit, which detects inactivity

Windowed watchdog : A timer that accepts reloads only within a specific time window, which confirms ongoing operation and detects an error when the signal does not arrive within that interval (see the sketch after this list)

Question and answer watchdog : A timer that requires valid responses to specific queries presented upon reload, such as math functions or bit operations that the controlling software must answer correctly

Multi-stage watchdog : Cascading actions that trigger upon each timer stage timeout; for example, a first stage might raise an interrupt, which then triggers a second stage to reset the system
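
For illustration, the following minimal user-space sketch of the windowed type (hypothetical, with placeholder window bounds, and not tied to any specific driver or hardware) rejects reloads that arrive before the window opens or after it closes:

    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    #define WINDOW_OPEN_MS  50   /* hypothetical earliest valid reload time */
    #define WINDOW_CLOSE_MS 100  /* hypothetical latest valid reload time */

    static struct timespec last_pet;

    static long elapsed_ms(void)
    {
        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        return (now.tv_sec - last_pet.tv_sec) * 1000
             + (now.tv_nsec - last_pet.tv_nsec) / 1000000;
    }

    /* A reload is valid only inside the window; a reload that arrives
       too early or too late counts as an error, like a missing reload. */
    static bool pet_watchdog(void)
    {
        long ms = elapsed_ms();
        if (ms < WINDOW_OPEN_MS || ms > WINDOW_CLOSE_MS)
            return false;
        clock_gettime(CLOCK_MONOTONIC, &last_pet);
        return true;
    }

    int main(void)
    {
        clock_gettime(CLOCK_MONOTONIC, &last_pet);

        /* A reload immediately after start falls before the window opens. */
        printf("early reload accepted: %d\n", pet_watchdog());

        /* A reload after 70 ms falls inside the 50-100 ms window. */
        struct timespec wait = { 0, 70 * 1000000L };
        nanosleep(&wait, NULL);
        printf("in-window reload accepted: %d\n", pet_watchdog());
        return 0;
    }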

Error detection using the Linux kernel

The Linux kernel has two error detection and mitigation mechanisms:

Non-Fatal errors (also known as “oops” errors) : An explicit BUG() in the kernel code that normally results in continued execution rather than halting the system

Fatal errors (also known as “panic” errors) : An explicit panic() call that always halts the system

Prioritizing the detection of unexpected conditions over recovery, Automotive SIG developers chose to enable "panic on oops" in kernel-automotive, so that kernel errors that are not fatal also cause a kernel panic.
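
On a running system, this setting is exposed through the kernel.panic_on_oops sysctl; a value of 1 means that an oops escalates to a panic. A minimal check, assuming the standard procfs path:

    #include <stdio.h>

    int main(void)
    {
        /* kernel.panic_on_oops is exposed through procfs; a value of 1
           means a non-fatal kernel error (oops) escalates to a panic. */
        FILE *f = fopen("/proc/sys/kernel/panic_on_oops", "r");
        int value;

        if (f == NULL || fscanf(f, "%d", &value) != 1) {
            perror("panic_on_oops");
            return 1;
        }
        printf("kernel.panic_on_oops = %d\n", value);
        fclose(f);
        return 0;
    }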

Tasks executing in User mode might perform an invalid operation. For example, an application might divide by zero or access an invalid virtual address. Such operations cause the hardware to raise an exception, which the kernel handles through the associated exception handler (see the sketch after the following list).

When the exception handler runs, it checks whether the error that triggered the exception came from a task executing in Kernel mode or User mode:

  • In Kernel mode, the kernel handles the exception using "panic on oops".
  • In User mode, the kernel sends a signal to terminate the task that had an uncorrectable or fatal error.
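
The following minimal sketch illustrates the User-mode path: a division by zero raises a hardware exception, and the kernel delivers a signal (SIGFPE in this case) to the offending task instead of halting the system:

    #include <signal.h>
    #include <unistd.h>

    static void on_sigfpe(int sig)
    {
        /* The kernel's exception handler delivered SIGFPE to this task;
           without a handler, the default action terminates the process. */
        (void)sig;
        write(STDERR_FILENO, "caught SIGFPE\n", 14);
        _exit(1);
    }

    int main(void)
    {
        signal(SIGFPE, on_sigfpe);

        volatile int zero = 0;
        /* The hardware raises a divide exception in User mode; the kernel
           converts it into a signal rather than panicking. */
        return 1 / zero;
    }
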
External watchdog monitoring example (i6300esb)

In this scenario, consider the sequence of events required to set up an application that you want to monitor with an i6300esb watchdog as the application runs on the kernel and the hardware platform.

This example illustrates how you might implement a timeout watchdog monitor by using the i6300esb watchdog feature of the Linux kernel. You might also consider other valid implementations.

Example i6300esb watchdog

The following events characterize the expected operations performed by the application on the i6300esb watchdog device through the Kernel Watchdog Driver API:

  1. The application invokes the open system call on the watchdog device node in the file system (for example, open("/dev/watchdog0", O_WRONLY);).

  2. Based on the path passed to open, the kernel performs the following tasks:

    1. Allocate a unique file descriptor (fd) associated with the watchdog device.

    2. Unlock the watchdog registers for write access.

    3. Reload the external watchdog to stop any previous counting.

    4. Start the watchdog using the default timeout set at startup.

  3. The kernel returns the allocated unique fd to the application.

  4. The application uses the returned fd to set a new watchdog timeout by invoking ioctl(fd, WDIOC_SETTIMEOUT, &timeout);, where:

    1. fd is the watchdog file descriptor returned by open

    2. WDIOC_SETTIMEOUT is the ioctl command that corresponds to the operation of setting a watchdog timeout

    3. &timeout is the user space address of the variable that contains the timeout value written to the watchdog

  5. Using the fd passed by the application, the kernel performs the following tasks:

    1. Shift the timeout value left by 9 bits.

    2. Unlock the watchdog registers for write access.

    3. Write the shifted value to ESB_TIMER1_REG.

    4. Unlock the watchdog registers for write access.

    5. Write the same shifted value to ESB_TIMER2_REG.

    6. Unlock the watchdog registers for write access.

    7. Reload the external watchdog to stop any previous counting.

  6. If successful, the kernel returns 0 to the application, and the application starts its periodic operations; if unsuccessful, the kernel returns -1.

  7. After the operations complete, the application "pets" the watchdog by writing to the i6300esb watchdog device.

  8. The kernel unlocks the watchdog registers for write access and reloads the watchdog timer.

  9. The write operation returns a nonnegative number to the application to signal a successful watchdog reset.

The application then starts the next cycle of periodic operations.
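
Putting the application side of this sequence together, the following minimal sketch follows steps 1 through 9, assuming the /dev/watchdog0 device node and a hypothetical 30-second timeout:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/watchdog.h>

    int main(void)
    {
        int timeout = 30;  /* hypothetical timeout in seconds */

        /* Step 1: open the watchdog device; the kernel unlocks the
           registers and starts the timer with its default timeout. */
        int fd = open("/dev/watchdog0", O_WRONLY);
        if (fd < 0) {
            perror("open /dev/watchdog0");
            return 1;
        }

        /* Step 4: set a new timeout; the kernel returns 0 on success. */
        if (ioctl(fd, WDIOC_SETTIMEOUT, &timeout) < 0) {
            perror("WDIOC_SETTIMEOUT");
            return 1;
        }

        for (;;) {
            /* The application's periodic operations run here; sleeping
               for half the timeout stands in for that work in this sketch. */
            sleep(timeout / 2);

            /* Steps 7-9: "pet" the watchdog; write returns a nonnegative
               number on a successful reset. */
            if (write(fd, "\0", 1) < 0) {
                perror("watchdog pet");
                return 1;
            }
        }
    }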

In the context of this scenario, we can evaluate the following error detection mechanisms:

Error occurs in Kernel mode : With "panic on oops" enabled, this error triggers a kernel panic that halts the system, including the application. Because the watchdog is no longer reset, the i6300esb watchdog times out and triggers.

Fatal or unrecoverable error occurs in User mode : The kernel's exception handler sends a signal that terminates the application. Because the terminated application no longer resets the watchdog, the i6300esb watchdog times out and triggers.

Application slows down or hangs due to unforeseen software or hardware failures : If the resulting delay exceeds the allocated FDTI, the application does not reset the watchdog on time, and the i6300esb watchdog triggers.

Using systemd as a process monitor

AutoSD uses systemd as its init system or service manager, which makes it the parent of all processes in the system. The systemd service unit files start all system services and applications. These service files can also configure the actions that systemd takes when it detects a failure, such as restarting the service or executing recovery or notification services.

Using the system clock, systemd can also set timers. If a managed service does not respond in time, systemd can perform a predefined action, similar to a watchdog.
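
For example, a hypothetical service unit (all names here are placeholders) might combine a systemd software watchdog with a restart action:

    [Unit]
    Description=Example monitored application

    [Service]
    Type=notify
    ExecStart=/usr/bin/example-app
    # The service must call sd_notify(0, "WATCHDOG=1") within this
    # interval, or systemd considers the service failed.
    WatchdogSec=10s
    # Predefined action when the service fails or misses its deadline.
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

In this arrangement, the monitored application pets systemd periodically by calling sd_notify(0, "WATCHDOG=1") from <systemd/sd-daemon.h>, and systemd restarts the service when the deadline is missed.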

When used as a process monitor, systemd has the following limitations:

  • systemd cannot act as an external watchdog. It can only use standard Linux timers.
  • Because systemd signals its actions and any service state changes through its UNIX-domain socket using the D-Bus protocol, any reaction to a state change depends on that communication channel working.
  • A kernel panic also kills systemd and stops its process monitoring.