
New version 20180411 of check_esxi_hardware released
Wednesday - Apr 11th 2018

A new version (20180411) of the monitoring plugin check_esxi_hardware is available. This version contains an additional check that returns an UNKNOWN state if, for any reason, the CIM EnumerateInstances function doesn't return anything useful (error message "ThreadPool --- Failed to enqueue request").

Thanks to Peter Newman for the code contribution.

 

Unable to download attachment from Gmail, how to get it anyway (by using Takeout)
Monday - Apr 9th 2018

For backup and archive purposes I usually create a config backup of each of my monitoring setups. A couple of years back, in 2012, I simply sent myself the tar.gz file to my Gmail mailbox. Now that I had to look up the usage of a monitoring plugin I used back then, it turned out that this wasn't the best decision. Gmail decided that the attachment contains a virus (probably because the monitoring plugins in the compressed archive are executable, but I'm not certain) and didn't allow me to download the file.

Gmail attachment disabled

Luckily there's a way around it. Unfortunately it takes some time, but if you need that data from the attachment, you still gotta do it, right? (Imagine it's a bitcoin wallet from 2012! ^_^)

Google offers an "archive" service called Takeout, where you can download your data. You can access this service at https://takeout.google.com/. In this case I only needed the archive of my Gmail mailbox, so I only selected Gmail. After clicking next, the "split size" can be chosen. If your mailbox is larger than the given size, it will be split into several files (with the suffix .mbox).

Now I had to wait... The archive won't be ready immediately. A day later I got an e-mail telling me that my archive is ready for download:

Google data archive is ready

So I went on and downloaded archive file 1 of 2.
As mentioned above, the file is an mbox file, which is basically your whole mailbox in text format. The mutt command is perfectly capable of reading an mbox file. There's no need to install a GUI client (like Thunderbird) just to extract the attachment I need.

$ mutt -f /tmp/All\ mail\ Including\ Spam\ and\ Trash-001.mbox
Reading /tmp/All mail Including Spam and Trash-001.mbox... 359578 (86%)

Once mutt loaded all e-mails, the following overview was presented to me:

Opening mailbox with mutt

I wasn't happy with this; I need to immediately see the year of a mail to quickly find the one I am looking for. So I adapted my ~/.muttrc and added the year to the date format:

$ cat ~/.muttrc
set index_format="%4C %Z %{%Y %b %d} %-15.15L (%4l) %s"

After opening the mailbox again with mutt, this time I saw the year; however, the mails were not sorted the way I wanted them.

Displaying year in mutt index

To solve this, I added another line (set sort) in muttrc to sort by date:

$ cat ~/.muttrc
set index_format="%4C %Z %{%Y %b %d} %-15.15L (%4l) %s"
set sort = date

I opened the mailbox again and this time the oldest mails of the mailbox were shown first, and I used "PageDown" to scroll down to the date I needed. And finally, there was my mail:

Found Gmail archive mail in mbox mutt 
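
Side note: instead of paging through the whole mailbox, mutt's "limit" function (press "l" in the index) can narrow the view to a date range, for example (mutt expects DD/MM/YYYY dates in such patterns):

~d 01/01/2012-31/12/2012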

By selecting the mail with the cursor and hitting "Enter", the mail is opened:

Opening a mail in mutt 

The attachment can be shown by pressing "v":

View attachment in mutt 

I moved the cursor down to the tar.gz file and pressed "s" to save the file in /tmp:

Save attachment in mutt 

And finally I was able to open the file:

Open tar gz archive  

 

Monitor dual storage (raid) controller on IBM x3650 M4
Friday - Apr 6th 2018

I wanted to monitor the current RAID status on an IBM x3650 M4 server, simply by using check_raid. I've been using this plugin for years and it supports most software and hardware RAID controllers. I've never had any problems with it (once I installed the required CLI tools for each hardware controller) - until today.

Due to a very strange hardware setup, inherited from an ex-colleague, the server turns out to have two different RAID controllers active. 12 physical drives are attached to one controller, 2 physical drives to another.

Once I installed the megacli command (from http://hwraid.le-vert.net/), the plugin correctly identified the physical drives behind /dev/sda:

# /usr/lib/nagios/plugins/check_raid -l
megacli
1 active plugins

# /usr/lib/nagios/plugins/check_raid
WARNING: megacli:[Volumes(1): DISK0.0:Optimal,WriteCache:DISABLED; Devices(12): 11,08,01,03,09,10,04,06,12,07,02,05=Online]

To disable the warning on the disabled WriteCache:

# /usr/lib/nagios/plugins/check_raid --cache-fail=OK
OK: megacli:[Volumes(1): DISK0.0:Optimal,WriteCache:DISABLED; Devices(12): 11,08,01,03,09,10,04,06,12,07,02,05=Online]

But where are the other two physical drives? From my experience with hardware RAID controllers I was pretty sure that megacli is able to detect multiple controllers and retrieve the drive information from all of them.
A manual verification using megacli still only returned 12 drives:

# megacli -CfgDsply -aall |grep Physical
Physical Disk Information:
Physical Disk: 0
Physical Sector Size:  512
Physical Disk: 1
Physical Sector Size:  512
Physical Disk: 2
Physical Sector Size:  512
Physical Disk: 3
Physical Sector Size:  512
Physical Disk: 4
Physical Sector Size:  512
Physical Disk: 5
Physical Sector Size:  512
Physical Disk Information:
Physical Disk: 0
Physical Sector Size:  512
Physical Disk: 1
Physical Sector Size:  512
Physical Disk: 2
Physical Sector Size:  512
Physical Disk: 3
Physical Sector Size:  512
Physical Disk: 4
Physical Sector Size:  512
Physical Disk: 5
Physical Sector Size:  512

Thankfully a colleague, who had recently been working on that particular server, took a screenshot of the storage controller menu during the boot process:

Two different storage controllers in the same server 

As it turns out, there are two different storage controllers built into that server. One is a MegaRAID controller (ServeRAID M5210) and the other is an MPT controller. No wonder megacli wasn't able to find the drives.

I tried again with "mpt-status" (http://hwraid.le-vert.net/wiki/LSIFusionMPT), but this didn't show any config:

# apt-get install mpt-status

# /usr/sbin/mpt-status -p
Checking for SCSI ID:0
ioctl: No such device

I removed mpt-status again and went on to try the command "sas2ircu" for newer MPT cards (http://hwraid.le-vert.net/wiki/LSIFusionMPTSAS2).
Finally I got some output:

# apt-get install sas2ircu

# sas2ircu LIST
LSI Corporation SAS2 IR Configuration Utility.
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2009-2013 LSI Corporation. All rights reserved.


         Adapter      Vendor  Device                       SubSys  SubSys
 Index    Type          ID      ID    Pci Address          Ven ID  Dev ID
 -----  ------------  ------  ------  -----------------    ------  ------
   0     SAS2004     1000h    70h   00h:0ah:00h:00h      1014h   040eh
SAS2IRCU: Utility Completed Successfully.

And, hurray, check_raid was now able to read the information from both controllers:

# /usr/lib/nagios/plugins/check_raid -l
megacli
sas2ircu
2 active plugins

# /usr/lib/nagios/plugins/check_raid --cache-fail=OK
OK: megacli:[Volumes(1): DISK0.0:Optimal,WriteCache:DISABLED; Devices(12): 11,08,01,03,09,10,04,06,12,07,02,05=Online]; sas2ircu:[ctrl #0: 1 Vols: Optimal: 2 Drives: Optimal (OPT)::]
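
The plugin itself runs locally on the server, so to poll it from the monitoring host it needs to be exposed through an agent. A hypothetical NRPE command definition (nrpe.cfg) could look like this (adjust to whatever agent is actually in use):

command[check_raid]=/usr/lib/nagios/plugins/check_raid --cache-fail=OK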


 

Automatic SLA reporting from Icinga and push into Confluence page
Wednesday - Apr 4th 2018

Back in 2010 I created automatic availability reporting from a Nagios installation (see How to create automatic PDF from Nagios Availability Reports?). The idea was pretty simple: in a monthly interval (generally running on the 1st of the month), simply create a PDF from the availability report (using the previous month's data) and send it by mail.

Methods (and knowledge) have changed since, and I was asked to create an automatic SLA reporting using the statistics from Icinga 2. Initially, sending the report by e-mail would have been enough, but when I came across Confluence's REST API, the goal became to add the reporting directly into Confluence.

Note: Icinga's new interface icingaweb2 does not support availability reporting as of April 2018. We're still using icinga2-classicui for this purpose.

The script I created is split into several steps. Let's go through them.

Step one: Define your (more or less fixed) base variables

At the beginning of the script, I defined some base variables which will be used later in the script.

# Basic variable definitions
yearlastmonth=$(dateutils.dadd today -1mo -f '%Y')
monthlastmonth=$(dateutils.dadd today -1mo -f '%m')
availurl="http://icinga.example.com/cgi-bin/icinga2-classicui/avail.cgi"
icingauser="icingaadmin"
icingapass="password"
wikiuser="slareporter"
wikipass="dd2ddAADw2"

You might have noticed that I'm using dateutils.dadd instead of date to determine the year and month of the previous month. Take a look at my article "Bash: date -d 1 month ago still shows the same month" to see why.
The availurl variable contains the address of your Nagios' or Icinga's avail.cgi.
The two credentials are used to log in to avail.cgi and to the Confluence Wiki.
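
A quick sanity check of the two date variables (assuming the script runs on April 1st 2018):

$ dateutils.dadd today -1mo -f '%Y'
2018
$ dateutils.dadd today -1mo -f '%m'
03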

Step two: Create a PDF from the availability report

By using wkhtmltopdf, the availability report seen in the Nagios or Icinga2-ClassicUI interface can be saved as a PDF (including formatting, colors, etc.). The command is pretty simple:

xvfb-run -a -s "-screen 0 640x480x16" wkhtmltopdf --username $icingauser --password $icingapass "${availurl}?show_log_entries=&host=internet&service=HTTP+www.example.com&timeperiod=lastmonth" /tmp/${yearlastmonth}${monthlastmonth}-www.example.com.pdf

xvfb-run is used to run wkhtmltopdf in a non-interactive way. Otherwise wkhtmltopdf would complain about a missing X display.

Of course the important parameters in the requested URL are: host=internet (which is the host object), service=HTTP+www.example.com (the service object we want the report from) and timeperiod=lastmonth (get the statistics for the previous month).

Because it is now April 2018, the PDF document is saved as /tmp/201803-www.example.com.pdf.

Step three: Upload the PDF to the relevant Confluence page

The upload of a file/attachment is pretty easy, compared to changing the content of a page (more on that later):

curl -s -S -u "${wikiuser}:${wikipass}" -X POST -H "X-Atlassian-Token: no-check" -F "file=@/tmp/${yearlastmonth}${monthlastmonth}-www.example.com.pdf" -F "comment=${yearlastmonth}-${monthlastmonth} www.example.com" "https://wiki.example.com/confluence/rest/api/content/12345678/child/attachment"| python -mjson.tool

Obviously the generated PDF is uploaded using -F "file=@/tmp/${yearlastmonth}${monthlastmonth}-www.example.com.pdf". 
Don't forget to adjust the Confluence host address (here wiki.example.com) and the page ID (here 12345678). You can find the page ID either in the address or in the "Page information" of the relevant page.

After successful upload, the PDF will appear as attachment on that Confluence page.
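
To double-check the upload (optional), the same REST endpoint can be queried with a GET request; a quick sketch:

curl -s -u "${wikiuser}:${wikipass}" "https://wiki.example.com/confluence/rest/api/content/12345678/child/attachment" | python -mjson.tool | grep '"title"'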

Step four: Get the availability percentage

As you might know, the availability report (the one we created the PDF of) can also be displayed in other formats: csv, json, xml.
With a JSON parser like jshon, the value of the field "percent_known_time_ok" (inside the "services" array) can be extracted directly:

availpercent=$(curl -s -u "${icingauser}:${icingapass}" "${availurl}?show_log_entries=&hostservice=internet^HTTP+www.example.com&assumeinitialstates=yes&assumestateretention=yes&assumestatesduringnotrunning=yes&includesoftstates=no&initialassumedhoststate=0&initialassumedservicestate=0&timeperiod=lastmonth&backtrack=8&jsonoutput" | jshon -e avail -e service_availability -e services -a -e percent_known_time_ok | awk '{printf("%.3f\n", $1)}')

If you want to see the structure of the json output, simply click on the "export to json" button in the user interface.
I'm using the awk command at the end to format the value with a maximum of 3 decimals. E.g. 99.335654 is rounded to 99.336.
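
A quick illustration of that formatting:

$ echo "99.335654" | awk '{printf("%.3f\n", $1)}'
99.336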

But under some circumstances it's possible that the JSON output cannot be handled by jshon ("too big integer"). This happened on a service where recurring downtimes were set at the beginning of the month but removed again at the end of the month. This caused a miscalculation in the report and created a huge field value ("time_critical_unscheduled": 18446744073709548794). I opened an issue on the GitHub project for jshon to address this. In the meantime I created the following workaround:

# In some cases, we could hit a json parsing error due to a too big integer. In such a case we try the csv output.
if [[ $? -gt 0 ]] || [[ -z $availpercent ]]
  then availpercent=$(curl -s -u "${icingauser}:${icingapass}" "${availurl}?show_log_entries=&hostservice=internet^HTTP+www.example.com&timeperiod=lastmonth&rpttimeperiod=24x7&assumeinitialstates=yes&assumestateretention=yes&assumestatesduringnotrunning=yes&includesoftstates=no&initialassumedservicestate=6&backtrack=8&content_type=csv" | grep "internet" | awk -F';' '{print $11}' | sed "s/'//g" | sed "s/%//g")
fi

In case the previous command failed or the previously defined variable $availpercent is empty, the csv output of the same service is fetched instead. The parsing is of course different; here I'm interested in the 11th column (which is percent_known_time_ok).

Step five: Retrieve the Confluence page's information and content

Here's a very important piece of information: if you want to change the content of a Confluence page, you need to:

  • Retrieve the full content (body.storage)
  • Retrieve the current page version number and other information
  • Change the full content by adding your changes
  • Increase the version number
  • Submit the full content (old + your change), including the new version number, page ID, space key and page title

Let's do this slowly:

# Get current version number and content from wiki page
wikiversion=$(curl -s -u "${wikiuser}:${wikipass}" "https://wiki.example.com/confluence/rest/api/content/12345678?expand=version" | python -mjson.tool | jshon -e version -e number)
wikicontent=$(curl -s -u "${wikiuser}:${wikipass}" "https://wiki.example.com/confluence/rest/api/content/12345678?expand=body.storage" | python -mjson.tool | jshon -e body -e storage -e value)

Here again I'm using jshon to get the values of the fields and save them into variables "wikiversion" and "wikicontent".

Note: The value saved in $wikiversion is a number; the value in $wikicontent is a string that already starts and ends with double quotes.
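
To illustrate what these look like (hypothetical values):

$ echo $wikiversion
7
$ echo "$wikicontent"
"<table><tbody><tr><td>Website www.example.com<\/td>...<\/tr><\/tbody><\/table>"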

Step six: Make the changes

The Wiki page I prepared simply contained a table and I wanted to add a new row for the previous month at the end of the table.

Icinga SLA reporting into Confluence 

This means I have to add the new row right before the code marking the end of the table. I chose sed for this:

# Change content (add new row at bottom of table)
newcontent=$(echo $wikicontent | sed "s#<\\\\/tbody>#<tr><td>Website www.example.com<\\\/td><td>https://www.example.com<\\\/td><td>${yearlastmonth}-${monthlastmonth}<\\\/td><td>${availpercent}%<\\\/td><\\\/tr><\\\/tbody>#")

Note the crazy amount of backslashes. This is because the escaped backslashes need to remain in the final submit (compare with the value of $wikicontent).

We only need to increase the version number of the Wiki page:

# Increment version number
newversion=$(( $wikiversion + 1 ))

Step seven: Upload the changes

OK, now we're finally ready to upload the change to Confluence:

# Update Wiki page
curl -s -u "${wikiuser}:${wikipass}" -X PUT -H 'Content-Type: application/json' -d "{\"id\":\"12345678\",\"type\":\"page\",\"title\":\"SLA Reporting www.example.com\",\"space\":{\"key\":\"SysServices\"},\"body\":{\"storage\":{\"value\":$newcontent,\"representation\":\"storage\"}},\"version\":{\"number\":$newversion}}" https://wiki.example.com/confluence/rest/api/content/12345678 | python -mjson.tool

Note that $newcontent was not put into additional double quotes. As mentioned before, the original value ($wikicontent) already starts and ends with double quotes.
$newversion was also not put into (double) quotes because it's a number, not a string.

Step eight (final step): Automate it

I went one step further: instead of having a huge script with hundreds of lines for each service we want SLA reporting for, I added some parameters at the beginning:

# Get user-given variables (dynamic)
while getopts "T:U:W:P:H:S:" Input;
do
       case ${Input} in
       T)      title=${OPTARG};;
       U)      url=${OPTARG};;
       W)      wikiid=${OPTARG};;
       P)      wikipagetitle=${OPTARG};;
       H)      icingahost=${OPTARG};;
       S)      icingaservice=${OPTARG};;
       *)      echo "Wrong option given."
               exit 1
               ;;
       esac
done

# Before we do anything, check if we have all information
if [[ -z $title ]]; then echo "Missing title, use -T"; exit 1
elif [[ -z $url ]]; then echo "Missing URL, use -U"; exit 1
elif [[ -z $wikiid ]]; then echo "Missing Wiki page ID, use -W"; exit 1
elif [[ -z $wikipagetitle ]]; then echo "Missing Wiki page title, use -P"; exit 1
elif [[ -z $icingahost ]]; then echo "Missing Icinga host name of this SLA, use -H"; exit 1
elif [[ -z $icingaservice ]]; then echo "Missing Icinga service name of this SLA, use -S"; exit 1
fi
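
Inside the script, these parameters then replace the previously hardcoded values. As a rough sketch, the wkhtmltopdf call from step two becomes something like:

xvfb-run -a -s "-screen 0 640x480x16" wkhtmltopdf --username $icingauser --password $icingapass "${availurl}?show_log_entries=&host=${icingahost}&service=${icingaservice}&timeperiod=lastmonth" /tmp/${yearlastmonth}${monthlastmonth}-${url}.pdf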

This way I can launch the script for many services, each with its own Wiki page (if necessary):

# crontab -l
# SLA Reportings
00 02 1 * * /root/scripts/icinga-sla-reporting.sh -T "Website www.example.com" -U "www.example.com" -W 12345678 -P "SLA Reporting www.example.com" -H internet -S "HTTP+www.example.com" >/dev/null
01 02 1 * * /root/scripts/icinga-sla-reporting.sh -T "Rest API api.example.com" -U "api.example.com" -W 12312399 -P "SLA Reporting api.example.com" -H internet -S "HTTP+api.example.com" >/dev/null


 

ArchiCAD 17 fails to install on Mac OS X 10.9 Mavericks
Tuesday - Apr 3rd 2018

When I tried to install ArchiCAD 17 on a Mac running OS X 10.9 (Mavericks), the setup wheel just turned and turned, but nothing ever happened (even after 20 minutes of waiting). No error, no warning, nothing.

I tried to run the setup several times, with reboots in between, but always with the same problem. 

Eventually I checked the system logs (/var/log/system.log) and found the reason:

Apr  3 12:00:57 CAD-STATION-2.local ArchiCAD Installer[234]: OS X 10.7.3 or higher was found
Apr  3 12:00:57 CAD-STATION-2.local ArchiCAD Installer[234]: Suitable Java was not fount, need to install it
Apr  3 12:00:57 CAD-STATION-2.local ArchiCAD Installer[234]: JVM folder was found
Apr  3 12:01:02 CAD-STATION-2.local ArchiCAD Installer[234]: Java 7 Update 13.pkg installation has failed with error code 1: installer: Package name is Java 7 Update 13
    installer: Certificate used to sign package is not trusted. Use -allowUntrusted to override.

Obviously the setup needs an already installed Java on the machine. The setup DMG (or CD) contains a subfolder called "Java" with a pkg file in it. I executed it manually and got a certificate warning, the same one I saw in the log. But when doing the installation manually, I was able to click on "trust certificate" and continue. After this, I was able to launch the ArchiCAD installer again. This time it worked:

Apr  3 12:10:08 CAD-STATION-2.local ArchiCAD Installer[277]: OS X 10.7.3 or higher was found
Apr  3 12:10:08 CAD-STATION-2.local ArchiCAD Installer[277]: ArchiCAD Installer has lauched with java: /Library/Internet Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/bin/java
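
As a side note, the bundled Java package can presumably also be installed non-interactively from the command line, using the flag mentioned in the log (a sketch; the volume path is hypothetical):

sudo installer -allowUntrusted -pkg "/Volumes/ArchiCAD 17/Java/Java 7 Update 13.pkg" -target /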

Note: Yes, I'm well aware of the outdated versions (both ArchiCAD and Mac OS X). But for compatibility reasons these versions are mandatory.

 

Version 20180329 of check_esxi_hardware available (pywbem version lookup)
Saturday - Mar 31st 2018

A new version (20180329) of the monitoring plugin check_esxi_hardware is available.

This version contains an improvement and is based on issue #26, in which GitHub user storm49152 suggested another method of determining the version of the pywbem Python module.

The idea to use pywbem's internal __version__ attribute is great, but some tests revealed that the "original" 0.7 version of pywbem didn't include it. To not break compatibility, a try clause was added: the code first tries to determine pywbem's version using the internal __version__ attribute and, if that fails, falls back to the external pkg_resources module.

 

Bash: date -d 1 month ago still shows the same month
Thursday - Mar 29th 2018

I'm currently working on an automated way of getting availability stats into a Confluence page (basically my article from 2010 How to create automatic PDF from Nagios Availability Reports?, but more advanced).

The idea is, so far, to run on the 1st of the month and get the statistics from the previous month (lastmonth). While doing some tests, I came across this:

root@monitoring:~# date
Thu Mar 29 14:00:48 CEST 2018

root@monitoring:~# date -d "1 month ago"
Thu Mar  1 13:00:54 CET 2018

Can you see it? Today is March 29th, but according to date, one month ago was the 1st of March. That's definitely not correct. The reason is that GNU date simply decrements the month number, which would result in February 29th; since 2018 is not a leap year, that date doesn't exist and rolls over to March 1st. In this particular case the result happens to be the same as "4 weeks ago":

root@monitoring:~# date -d "4 weeks ago"
Thu Mar  1 13:03:00 CET 2018

This can be verified manually by checking the calendar:

root@monitoring:~# cal
     March 2018       
Su Mo Tu We Th Fr Sa  
             1  2  3  
 4  5  6  7  8  9 10  
11 12 13 14 15 16 17  
18 19 20 21 22 23 24  
25 26 27 28 29 30 31 

Take the 29th, move up the cursor 4 times and there you arrive on the 1st of March.

Of course I'm not the first one this happens to. I came across a Stack Exchange article where the second response mentions dateadd from the "dateutils" package.
Note: Interestingly, the OP selected the first answer ("The usual wisdom is use the 15 of this month. Then subtract 1 month") as the solution, which to me is just another workaround, not a real fix.

dateutils can be installed in all major distributions as it has a dedicated package:

root@monitoring:~# apt-get install dateutils

Let's try this newly installed command (dateutils.dadd):

root@monitoring:~# dateutils.dadd today -1mo
2018-02-28

root@monitoring:~# dateutils.dadd today -1mo -f '%Y%m'
201802

That looks much better (and correct)!

 

New version of check_couchdb_replication allows check of all replications
Monday - Mar 26th 2018

I'm glad to release a new version (20180326) of the monitoring plugin check_couchdb_replication, which monitors CouchDB replications.

Two bugs were fixed and one important enhancement was added. Let's talk about the features, because nobody wants to talk bugs (if you do, check https://github.com/Napsty/check_couchdb_replication/issues?q=is%3Aissue+is%3Aclosed ^^).

The enhancement (or new feature) allows checking all discovered replications at once. Instead of defining a check on a certain replication (doc_id), the parameter "-r" can now be used as "-r ALL". This tells the plugin to go through all discovered replications and check all of them at once. You can of course still continue to monitor a single replication ID, too.

Example:

# ./check_couchdb_replication.sh -H mycouchdb.example.com -u admin -p mysecretpass -r ALL
COUCHDB REPLICATION CRITICAL - 2 replications not running ("doc_id":"claudioreptest" "state":"crashing" "info":"unauthorized: unauthorized to access or create database http://admin:*****@localhost:5984/db99/","doc_id":"claudioreptest333" "error_count":1 "info":"Replication `c7d010d31ab268968f22f4d71c5766bf+continuous+create_target` specified by document `claudioreptest333` already started,)
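
For comparison, a single replication can still be checked by passing its doc_id (taken here from the output above):

# ./check_couchdb_replication.sh -H mycouchdb.example.com -u admin -p mysecretpass -r claudioreptest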

Enjoy.


 

Bash: Why is a multi-line output saved as one line in a variable?
Monday - Mar 26th 2018

Let's start off with the response to this question (Why is a multi-line output saved as one line in a variable?): It's not!

As I'm currently improving the monitoring plugin check_couchdb_replication, I came across a little problem.

I wanted to check the number of lines of a curl output. By simply using the curl command directly, I got 29 lines in return:

$ curl -q -s localhost:5984/_scheduler/docs/_replicator -u admin:secret  | wc -l
29

But when I saved the command's output in a variable, the whole output seems to have merged into one line:

$ output=$(curl http://localhost:5984/_scheduler/docs/_replicator -u admin:secret)
$ echo $output | wc -l
1

I came across an article on StackOverflow where the solution was presented, which by the way is surprisingly easy:

$ echo "$output" | wc -l
29

By simply putting the variable into double-quotes, the correct number of lines is shown again. But why is that? In the same article linked above, the explanation from Jonathan Leffler is really good:

the difference is that (1) the double-quoted version of the variable (echo "$VARIABLE") preserves internal spacing of the value exactly as it is represented in the variable (newlines, tabs, multiple blanks and all), whereas (2) the unquoted version (echo $VARIABLE) replaces each sequence of one or more blanks, tabs and newlines with a single space. Thus (1) preserves the shape of the input variable, whereas (2) creates a potentially very long single line of output.
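
A minimal demonstration of the difference:

$ output=$(printf "line1\nline2\nline3")
$ echo $output | wc -l
1
$ echo "$output" | wc -l
3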

Quite crazy that after more than 10 years of bashing I only came across this today.

 

Windows: Monitoring of files or directories and alert when older than certain age
Wednesday - Mar 21st 2018

On a Windows server, a service was hanging and nobody noticed it. The application team found out that this service, when working correctly, always creates certain temporary folders which disappear after a few minutes. This can be monitored, of course!

As the Windows servers have NSClient installed, I can use check_nrpe from the Icinga server to check for the folders. So I created a folder "claudiotest" in the temp folder of the application:

Windows monitor file age

Basic check: Does such a folder exist?
Note that I used an asterisk wildcard in the path in order to simulate the temporary folders of the application; they all start with the same name but have a different ending.

$ /usr/lib/nagios/plugins/check_nrpe -H windowsserver -c check_files -a "file=C:\Program Files\Application\tmp\claudio*"
OK: All 1 files are ok|

Indeed, there was one file found (my folder "claudiotest").

What if I search for another name?

$ /usr/lib/nagios/plugins/check_nrpe -H windowsserver -c check_files -a "file=C:\Program Files\Application\tmp\claudiooo*"
No files found|

No surprise, nothing was found with that name.

Advanced check: Check if file age is older than 15min (=900s).

So here I had to add filters to limit my search result. I only wanted to have results matching the filename (C:\Program Files\Application\tmp\claudio*) and an age older than 15 minutes:

$ /usr/lib/nagios/plugins/check_nrpe -H windowsserver -c check_files -a "file=C:\Program Files\Application\tmp\claudio*" "filter=age>900"
OK: All 1 files are ok|

So far so good, but it should not be OK, it should WARN that the application is probably hanging. For this the "warn" argument must be used:

$ /usr/lib/nagios/plugins/check_nrpe -H windowsserver -c check_files -a "file=C:\Program Files\Application\tmp\claudio*" "filter=age>900" "warn=count>0"
WARNING: 1/1 files (claudiotest)|'count'=1;0;0

This means: As soon as the check found at least one file matching the filename and the age is older than 15min, it will return a warning.

But I faced one more issue. When no such directories exist (which can happen), I got an UNKNOWN return code (3):

$ /usr/lib/nagios/plugins/check_nrpe -H windowsserver -c check_files -a "file=C:\Program Files\Application\tmp\claudiooo*" "filter=age>900" "warn=count>0"; echo $?
No files found|'count'=0;0;0
3

This means that in Icinga this would show up as an UNKNOWN alert, which should not be the case. But this can be solved with the parameter "empty-state", which defines the return code to use when nothing matches the filter (no result):

 $ /usr/lib/nagios/plugins/check_nrpe -H windowsserver -c check_files -a "file=C:\Program Files\Application\tmp\claudiooo*" "filter=age>900" "empty-state=ok" "warn=count>0"; echo $?
No files found|'count'=0;0;0
0

This time, the return code was OK (0).

And the final check:

$ /usr/lib/nagios/plugins/check_nrpe -H windowsserver -c check_files -a "file=C:\Program Files\Application\tmp\claudio*" "filter=age>900" "empty-state=ok" "warn=count>0"
WARNING: 1/1 files (claudiotest)|'count'=1;0;0

Solved!
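
If a critical threshold is also desired, a "crit" argument can presumably be added in the same way (an untested sketch):

$ /usr/lib/nagios/plugins/check_nrpe -H windowsserver -c check_files -a "file=C:\Program Files\Application\tmp\claudio*" "filter=age>900" "empty-state=ok" "warn=count>0" "crit=count>3"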

 

