Over the last few days I received several Nagios notifications with a wrong host alias. The bad part: the host alias is also used in the subject. So at first sight it looks like there is a problem on a possibly business-critical machine, when it is actually only a service on a test server. This creates confusion and leads to errors.
The affected service, which was in a warning state, was "Disk Space /" on SERVER21. The host alias for SERVER21 is SERVER21-DEVL and the IP is 192.168.1.21. Instead, the notification looked like this:
Subject: ** PROBLEM Service Alert: SERVER31-UAT/Disk Space / is WARNING **
Service: Disk Space /
As you can see, the notification mail uses the host alias rather than the real server name in the subject and in the mail body, and that alias is wrong: SERVER31-UAT instead of SERVER21-DEVL. Only the IP address is correct. But where does this wrong entry come from? The host definition of SERVER21 is correct:
After doing some grep-research, I came across the file retention.dat in the Nagios var folder. Here I found that multiple hosts have the wrong alias:
That's the source! Nagios notifications take the current state of a host or service from this file (retention.dat) and also use the other values stored in it. I deleted retention.dat completely and restarted Nagios. A new retention.dat is created, but Nagios will re-check all your hosts and services, and scheduled downtimes, comments, etc. will be lost.
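Instead of grepping by hand, the stale aliases can also be listed with a small script. The following is only a minimal sketch: it assumes retention.dat consists of blocks of the form "host {" ... "}" with one key=value pair per line, and that each host block carries host_name and alias entries. The sample data, the EXPECTED mapping and both function names are made up for illustration (in particular, SERVER31 and its alias are an assumption); in practice the expected aliases would come from your own host definitions.

```python
def parse_host_blocks(text):
    """Yield one dict of key=value pairs per 'host { ... }' block."""
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "host {":
            current = {}                      # start collecting a host block
        elif line == "}":
            if current is not None:
                yield current                 # end of a host block
            current = None                    # other block types are ignored
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key] = value


def find_alias_mismatches(text, expected):
    """Return (host_name, alias_in_file, expected_alias) for each stale alias."""
    mismatches = []
    for host in parse_host_blocks(text):
        name = host.get("host_name")
        alias = host.get("alias")
        if name in expected and alias is not None and alias != expected[name]:
            mismatches.append((name, alias, expected[name]))
    return mismatches


# Hypothetical excerpt, shaped like the assumed retention.dat layout.
SAMPLE = """\
info {
created=0
}
host {
host_name=SERVER21
alias=SERVER31-UAT
}
host {
host_name=SERVER31
alias=SERVER31-UAT
}
service {
host_name=SERVER21
service_description=Disk Space /
}
"""

# Hypothetical mapping of host names to the aliases from the host definitions.
EXPECTED = {"SERVER21": "SERVER21-DEVL", "SERVER31": "SERVER31-UAT"}

print(find_alias_mismatches(SAMPLE, EXPECTED))
# → [('SERVER21', 'SERVER31-UAT', 'SERVER21-DEVL')]
```

To run this against a real file, read it with open(path).read() and pass the text in place of SAMPLE.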
It may also work if you stop Nagios, correct the entries in retention.dat manually and then start Nagios again, but I haven't tested that.
A Nagios restart is necessary in any case. If you only change retention.dat, Nagios will overwrite the values again, as they seem to be held in RAM.
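Since deleting retention.dat throws away scheduled downtimes and comments, it may be worth moving the file aside rather than removing it outright. This is a hedged sketch, not a tested procedure: the function name is hypothetical, and it assumes Nagios has already been stopped, because a running Nagios appears to rewrite the file from memory. How you stop and start Nagios depends on your installation and is left out here.

```python
import shutil
import time
from pathlib import Path


def set_aside_retention_file(path):
    """Move the retention file to a timestamped backup in the same directory.

    Returns the backup path, or None if there was no file to move.
    Call this only while Nagios is stopped.
    """
    source = Path(path)
    if not source.exists():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = source.with_name(f"{source.name}.{stamp}.bak")
    shutil.move(str(source), str(backup))
    return backup
```

A call would look like set_aside_retention_file("/usr/local/nagios/var/retention.dat"), where the path is an assumption; check the state_retention_file setting in your nagios.cfg for the real location. After starting Nagios again, a fresh retention.dat is created and the backup can be consulted for the lost downtimes and comments.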
This problem is described in more detail in this Nagios user mailing list thread: Macro values don't seem to be consistent.