Watchdog – independent monitoring of VM status in the cloud.
As virtual environments become increasingly critical to business operations, ensuring high availability and reliability of services is becoming paramount. Apache CloudStack, as a mature cloud infrastructure management platform, offers a range of mechanisms supporting these objectives. One of the less obvious, yet extremely useful tools, is the Watchdog function running directly inside the virtual machine.
Watchdog is a mechanism that monitors the operating system from within the VM itself. Classical high availability (HA) solutions are implemented at the hypervisor level and respond to failures of the entire virtual machine. Watchdog, by contrast, detects problems inside the operating system – such as a hung init process, a kernel lockup, or other critical errors that leave the system unresponsive.
Enabling Watchdog in a VM allows for faster detection and reaction to failures that may not result in complete machine shutdown, but prevent it from functioning properly. This makes it possible to automatically restart the operating system without requiring administrator intervention or waiting for HA mechanisms to respond at the infrastructure level. This significantly reduces service downtime and increases its reliability.
An additional advantage is the ability to tailor Watchdog's behavior to the specifics of applications running in the VM. Administrators can configure monitoring of specific processes or services, enabling more precise response to issues. Combined with logging and alerting tools, Watchdog becomes a valuable element of a proactive production environment maintenance strategy.
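As an illustration of process-level monitoring, here is a minimal sketch of a custom health check. It assumes the watchdog daemon's `test-binary` option, which runs an external program and treats a non-zero exit status as a failure. The function name `service_alive` and any pidfile path you pass to it are hypothetical.

```shell
# Hypothetical health check: succeeds only if the PID stored in the given
# pidfile belongs to a running process.
service_alive() {
    pidfile=$1
    # A missing or unreadable pidfile counts as a failure.
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    case $pid in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric content
    esac
    # Signal 0 sends nothing; it only tests whether the process can be signalled.
    # Run as root (as the watchdog daemon does), this means "the process exists".
    kill -0 "$pid" 2>/dev/null
}
```

Saved as an executable script that ends with a call such as `service_alive /run/myapp.pid` (a made-up path), a check like this could be referenced from /etc/watchdog.conf with a `test-binary` line.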
Implementing Watchdog in virtual machines managed by Apache CloudStack does not require any hypervisor configuration changes, making this solution particularly attractive in shared environments or those managed by external providers. This allows VM users to independently increase the resilience of their systems to failures without compromising the principles of responsibility separation in cloud environments.
Watchdog works through the virtual device /dev/watchdog, which the hypervisor exposes to the VM. While the operating system is healthy, the watchdog service writes to this device at regular intervals. If the writes stop, the hypervisor learns that a problem has occurred inside the VM and is able to respond to it.
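The keep-alive principle can be sketched in a few lines of shell. This is only an illustration of what the watchdog service does for you, assuming the usual Linux watchdog device behaviour: any write counts as a keep-alive, and writing the character `V` just before closing asks the driver to disarm the timer, where the driver supports it. The function name `feed_watchdog` and its parameters are hypothetical.

```shell
# Hypothetical helper: send a fixed number of keep-alives to a watchdog device.
# WARNING: pointing this at the real /dev/watchdog arms the timer. If the
# writes stop without the final 'V', the VM will be reset.
feed_watchdog() (
    device=$1; beats=$2; interval=$3
    i=0
    {
        while [ "$i" -lt "$beats" ]; do
            printf '.' >&3          # any write acts as a keep-alive
            sleep "$interval"
            i=$((i + 1))
        done
        printf 'V' >&3              # "magic close": request disarm on close
    } 3>"$device"                   # the device stays open for the whole loop
)
```

For a dry run, pass an ordinary file instead of the device, for example `feed_watchdog /tmp/fake-watchdog 3 1`, and inspect what was written.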
Example: running Watchdog on Ubuntu
To run Watchdog on Ubuntu (e.g., version 22.04), follow these steps:
1. Installing the watchdog package
Run the following in the VM terminal:
Shell:
sudo apt update
sudo apt install watchdog
2. Enabling the watchdog service
After installation, activate the service:
Shell:
sudo systemctl enable watchdog
sudo systemctl start watchdog
3. Configuring the /etc/watchdog.conf file
Edit the configuration file to customize Watchdog's behavior:
Shell:
sudo nano /etc/watchdog.conf
Example options to uncomment or add:
watchdog-device = /dev/watchdog
max-load-1 = 24
file = /var/log/syslog
pidfile = /var/run/watchdog.pid
interval = 10
4. Checking operation
After starting the service, you can check its status:
Shell:
systemctl status watchdog
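Beyond the service status, it can help to confirm that the device node exists and to see which settings are actually active. The sketch below is one possible way to do that. The helper names `is_char_device` and `conf_value` are hypothetical, and the parser assumes simple `key = value` lines whose values contain no `=` character.

```shell
# Succeeds only if the given path is a character device node.
is_char_device() {
    [ -c "$1" ]
}

# Print the value of the first uncommented "key = value" line in a config file.
# Usage: conf_value KEY FILE
conf_value() {
    awk -F'=' -v key="$1" '
        /^[ \t]*#/ { next }                      # skip commented-out options
        {
            k = $1; gsub(/[ \t]/, "", k)         # strip whitespace around the key
            if (k == key) {
                v = $2; gsub(/^[ \t]+|[ \t]+$/, "", v)
                print v; exit
            }
        }' "$2"
}
```

A possible check on the VM: `is_char_device /dev/watchdog && systemctl is-active --quiet watchdog && conf_value interval /etc/watchdog.conf`, which prints the configured interval only when the device is present and the service is running.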
5. Testing the response
You can simulate a failure, for example by stopping a critical process or overloading the system, to check whether Watchdog responds according to the configuration. Do this only on a test VM, because a successful test ends with the VM being reset.
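One further way to provoke an unresponsive system is to crash the kernel deliberately through the Linux SysRq interface. This is a different technique from the two mentioned above, shown here only as a guarded sketch. It assumes root privileges and a kernel that stays hung after a panic rather than rebooting on its own. The function name `simulate_hang` and the `CONFIRM_RESET` variable are hypothetical.

```shell
# Hypothetical test helper: crash the kernel so the watchdog has to step in.
# It refuses to do anything unless CONFIRM_RESET=yes is set explicitly.
simulate_hang() {
    if [ "${CONFIRM_RESET:-}" != "yes" ]; then
        echo "refusing: set CONFIRM_RESET=yes to crash this VM" >&2
        return 1
    fi
    # SysRq command 'c' triggers an immediate kernel crash (must run as root).
    # All unsaved data on the VM is lost.
    echo c > /proc/sysrq-trigger
}
```

Called without the confirmation variable, the function only prints a refusal and returns a non-zero status, which makes it harder to trigger by accident on a production VM.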
WebDisk Cloud Computing provides comprehensive monitoring of physical and virtual infrastructure status. All VMs are covered by full HA, but there is always a chance that external mechanisms will incorrectly interpret your VM's state (running but internally suspended) as functioning properly. Watchdog gives you a chance to restore your VM to full functionality. Unfortunately, its response to a detected problem is a VM reset, so it is not enabled in our VM templates by default. However, every VM comes with the /dev/watchdog device – it's your decision whether to use this mechanism!