
apt upgrade exits with error: package pkgname contains empty filename
Wednesday - Jan 10th 2018 - by - (0 comments)

A Debian Jessie server couldn't run apt-get upgrade anymore and exited with the following error message:

root@debian ~ # apt-get upgrade
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Calculating upgrade... Done
The following packages will be upgraded:
  apache2 apache2-bin apache2-data apache2-mpm-event apache2-mpm-prefork [...]
131 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Need to get 0 B/57.6 MB of archives.
After this operation, 196 kB disk space will be freed.
Do you want to continue? [Y/n] y
Extracting templates from packages: 100%
Preconfiguring packages ...
dpkg: unrecoverable fatal error, aborting:
 files list file for package `psmisc' contains empty filename
E: Sub-process /usr/bin/dpkg returned an error code (2)

From this error alone, it's not very clear what failed. It requires some additional knowledge about how apt and dpkg manage packages. Dpkg (which apt uses under the hood) keeps information about the files of a package in the /var/lib/dpkg/info directory. Inside this path there are several files for each package, for example for psmisc:

root@debian ~ # ls /var/lib/dpkg/info/psmisc*
psmisc.list      psmisc.md5sums   psmisc.postinst  psmisc.postrm   

psmisc.list is the above-mentioned "files list file" for the package psmisc. Let's take a look at it:

root@debian ~ # cat /var/lib/dpkg/info/psmisc.list
????d?0bi
              ?@4
                  ??
0`??D?????????????????????????????????X<X<Zb??b|?
[...]

Whoa... this can't be right! This file is definitely corrupt. I compared it with another Debian Jessie installation, where the file looks quite different, of course:

root@jessie ~ # cat /var/lib/dpkg/info/psmisc.list
/.
/usr
/usr/share
/usr/share/menu
/usr/share/menu/psmisc
/usr/share/man
/usr/share/man/man1
/usr/share/man/man1/prtstat.1.gz
/usr/share/man/man1/peekfd.1.gz
/usr/share/man/man1/pstree.1.gz
/usr/share/man/man1/killall.1.gz
/usr/share/man/man1/fuser.1.gz
/usr/share/locale
/usr/share/locale/hr
/usr/share/locale/hr/LC_MESSAGES
/usr/share/locale/hr/LC_MESSAGES/psmisc.mo
/usr/share/locale/el
/usr/share/locale/el/LC_MESSAGES
/usr/share/locale/el/LC_MESSAGES/psmisc.mo
/usr/share/locale/fr
/usr/share/locale/fr/LC_MESSAGES
/usr/share/locale/fr/LC_MESSAGES/psmisc.mo
/usr/share/locale/ca
/usr/share/locale/ca/LC_MESSAGES
/usr/share/locale/ca/LC_MESSAGES/psmisc.mo
/usr/share/locale/sv
/usr/share/locale/sv/LC_MESSAGES
/usr/share/locale/sv/LC_MESSAGES/psmisc.mo
/usr/share/locale/eo
/usr/share/locale/eo/LC_MESSAGES
/usr/share/locale/eo/LC_MESSAGES/psmisc.mo
/usr/share/locale/ja
/usr/share/locale/ja/LC_MESSAGES
/usr/share/locale/ja/LC_MESSAGES/psmisc.mo
/usr/share/locale/nb
/usr/share/locale/nb/LC_MESSAGES
/usr/share/locale/nb/LC_MESSAGES/psmisc.mo
/usr/share/locale/cs
/usr/share/locale/cs/LC_MESSAGES
/usr/share/locale/cs/LC_MESSAGES/psmisc.mo
/usr/share/locale/id
/usr/share/locale/id/LC_MESSAGES
/usr/share/locale/id/LC_MESSAGES/psmisc.mo
/usr/share/locale/zh_TW
/usr/share/locale/zh_TW/LC_MESSAGES
/usr/share/locale/zh_TW/LC_MESSAGES/psmisc.mo
/usr/share/locale/ru
/usr/share/locale/ru/LC_MESSAGES
/usr/share/locale/ru/LC_MESSAGES/psmisc.mo
/usr/share/locale/vi
/usr/share/locale/vi/LC_MESSAGES
/usr/share/locale/vi/LC_MESSAGES/psmisc.mo
/usr/share/locale/uk
/usr/share/locale/uk/LC_MESSAGES
/usr/share/locale/uk/LC_MESSAGES/psmisc.mo
/usr/share/locale/sr
/usr/share/locale/sr/LC_MESSAGES
/usr/share/locale/sr/LC_MESSAGES/psmisc.mo
/usr/share/locale/da
/usr/share/locale/da/LC_MESSAGES
/usr/share/locale/da/LC_MESSAGES/psmisc.mo
/usr/share/locale/pt
/usr/share/locale/pt/LC_MESSAGES
/usr/share/locale/pt/LC_MESSAGES/psmisc.mo
/usr/share/locale/fi
/usr/share/locale/fi/LC_MESSAGES
/usr/share/locale/fi/LC_MESSAGES/psmisc.mo
/usr/share/locale/ro
/usr/share/locale/ro/LC_MESSAGES
/usr/share/locale/ro/LC_MESSAGES/psmisc.mo
/usr/share/locale/eu
/usr/share/locale/eu/LC_MESSAGES
/usr/share/locale/eu/LC_MESSAGES/psmisc.mo
/usr/share/locale/hu
/usr/share/locale/hu/LC_MESSAGES
/usr/share/locale/hu/LC_MESSAGES/psmisc.mo
/usr/share/locale/de
/usr/share/locale/de/LC_MESSAGES
/usr/share/locale/de/LC_MESSAGES/psmisc.mo
/usr/share/locale/zh_CN
/usr/share/locale/zh_CN/LC_MESSAGES
/usr/share/locale/zh_CN/LC_MESSAGES/psmisc.mo
/usr/share/locale/pt_BR
/usr/share/locale/pt_BR/LC_MESSAGES
/usr/share/locale/pt_BR/LC_MESSAGES/psmisc.mo
/usr/share/locale/pl
/usr/share/locale/pl/LC_MESSAGES
/usr/share/locale/pl/LC_MESSAGES/psmisc.mo
/usr/share/locale/nl
/usr/share/locale/nl/LC_MESSAGES
/usr/share/locale/nl/LC_MESSAGES/psmisc.mo
/usr/share/locale/it
/usr/share/locale/it/LC_MESSAGES
/usr/share/locale/it/LC_MESSAGES/psmisc.mo
/usr/share/locale/bg
/usr/share/locale/bg/LC_MESSAGES
/usr/share/locale/bg/LC_MESSAGES/psmisc.mo
/usr/share/pixmaps
/usr/share/pixmaps/pstree16.xpm
/usr/share/pixmaps/pstree32.xpm
/usr/share/doc
/usr/share/doc/psmisc
/usr/share/doc/psmisc/README
/usr/share/doc/psmisc/README.Debian
/usr/share/doc/psmisc/changelog.Debian.gz
/usr/share/doc/psmisc/changelog.gz
/usr/share/doc/psmisc/copyright
/usr/bin
/usr/bin/pstree
/usr/bin/prtstat
/usr/bin/peekfd
/usr/bin/killall
/bin
/bin/fuser
/usr/share/man/man1/pstree.x11.1.gz
/usr/bin/pstree.x11

Once I replaced the corrupt file on host "debian" with the correct file from host "jessie", apt-get upgrade ran like a charm again.
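
If you run into the same situation, copying the intact list file from a healthy machine is the quickest fix. A minimal sketch, assuming the same hostnames as above and that both hosts run the identical psmisc package version:

# on the healthy host "jessie": copy the intact files list to the broken host "debian"
# (verify first with "dpkg -l psmisc" that both hosts have the same package version)
scp /var/lib/dpkg/info/psmisc.list root@debian:/var/lib/dpkg/info/psmisc.list
# then re-run the upgrade on "debian"
apt-get upgrade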

 

New monitoring plugin check_couchdb_replication to monitor CouchDB replications
Monday - Jan 8th 2018 - by - (0 comments)

It's 2018 now, which means it's been almost 10 years since my first (public) monitoring plugin (check_mysql_slavestatus). Today it's time for a new one: check_couchdb_replication, a monitoring plugin to monitor the status of CouchDB replications.

As I'm not a CouchDB expert, the plugin might not be 100% correct yet. Mainly the dynamic detection of replications might be off, but the replication status check itself (which is what it's all about) should work and should be ready for production monitoring.

The documentation of check_couchdb_replication contains configuration examples for Icinga 1/Nagios and Icinga 2. Here are some examples of running the plugin on the command line.

Check the status of replication "rep_db1":

# ./check_couchdb_replication.sh -H mycouchdb.example.com -u admin -p mysecretpass -r "rep_db1"
COUCHDB REPLICATION OK - Replication rep_db1 is "running"

If there is no such replication with this doc_id (rep_db1), the plugin will return:

# ./check_couchdb_replication.sh -H mycouchdb.example.com -u admin -p mysecretpass -r "rep_db3"
COUCHDB REPLICATION CRITICAL - Replication for rep_db3 not found

If you're not sure or you forgot the name (doc_id) of a replication, run the plugin with the -d (detect) parameter:

# ./check_couchdb_replication.sh -H mycouchdb.example.com -d
COUCHDB AVAILABLE REPLICATIONS: "rep_db1" "rep_db2"

In the background, the plugin uses the "/_active_tasks" endpoint to find the available replications.
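
If you want to see the raw data the plugin works with, you can query that endpoint yourself. A minimal sketch, assuming the same host and credentials as above and CouchDB's default port 5984; replication tasks show up with "type": "replication" and a "doc_id" field:

# list the active tasks (replications, compactions, indexing) on the CouchDB node
curl -s -u admin:mysecretpass "http://mycouchdb.example.com:5984/_active_tasks"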

Feedback and contributions are welcome as always.

 

New version of check_es_system plugin fixes authentication problems
Friday - Jan 5th 2018 - by - (0 comments)

Happy new year everyone!

There's a new version of check_es_system.sh available, a monitoring plugin for Nagios, Icinga and derivatives to monitor ElasticSearch usage (disk and memory).

Basically everything stays the same; the new version contains two bugfixes:

  • Fix if statement for authentication (Thanks to Github user deric)
  • Fix authentication when wrong credentials were used (see issue #2)


 

Install newer ninja-build (1.8.2) on Ubuntu 14.04 Trusty
Friday - Dec 22nd 2017 - by - (1 comments)

The developer tool ninja-build can be installed from the official Ubuntu repositories. If you have installed the ninja-build package, you can see it's in version 1.3.4:

/usr/bin/ninja --version
1.3.4

But sometimes developers require (or want) a newer version. With ninja-build that's very easy to do because pre-compiled binaries are ready to download and use.

In this case I installed the current version of ninja-build 1.8.2 on an Ubuntu 14.04 Trusty machine.

Download the current release:

wget https://github.com/ninja-build/ninja/releases/download/v1.8.2/ninja-linux.zip

This file contains the ninja binary. I unzipped it into /usr/local/bin:

sudo unzip ninja-linux.zip -d /usr/local/bin/

Tell Ubuntu to use this new path for the "ninja" command:

sudo update-alternatives --install /usr/bin/ninja ninja /usr/local/bin/ninja 1 --force
update-alternatives: using /usr/local/bin/ninja to provide /usr/bin/ninja (ninja) in auto mode

Test version output:

/usr/bin/ninja --version
1.8.2
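
If you want to double-check which binary the alternatives system points to, the following quick sanity check should do (output omitted here):

# show the registered alternatives for ninja and which one is active
update-alternatives --display ninja
# confirm which ninja binary is found first in the PATH
which ninja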

No need to compile from source in this case. Quick and efficient.

 

Install/Upgrade cmake 3.10.1 in Ubuntu 14.04 using alternatives
Friday - Dec 22nd 2017 - by - (0 comments)

In a previous article, I described how it's possible to Install and use cmake 3.4.1 in Ubuntu 14.04 using alternatives.

Since then, a couple of new versions have been released, and the same procedure can still be used to install cmake 3.10.1.

Download and compile:

wget http://www.cmake.org/files/v3.10/cmake-3.10.1.tar.gz
tar -xvzf cmake-3.10.1.tar.gz
cd cmake-3.10.1/
./configure
make

The install target places cmake by default into /usr/local/bin/cmake; shared files are installed into /usr/local/share/cmake-3.10.
To install (copy) the binary and libraries to that destination, run:

sudo make install

If you haven't previously set up an alternative cmake installation, run the following command to tell Ubuntu that the cmake command is now provided by the alternative installation:

sudo update-alternatives --install /usr/bin/cmake cmake /usr/local/bin/cmake 1 --force
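
To see whether a cmake alternative is already registered (and therefore whether the command above is needed at all), you can ask update-alternatives first:

# list the registered alternatives for cmake, if any
update-alternatives --display cmake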

If you already have a custom cmake version installed (in my case I still had the 3.4.1 version active), the update-alternatives command is not necessary. The make install command will replace the existing binary in /usr/local/bin/cmake. This can be verified using:

cmake --version
cmake version 3.10.1

CMake suite maintained and supported by Kitware (kitware.com/cmake).

 

ElasticSearch cluster stays red, stuck unassigned shards not being assigned
Tuesday - Dec 19th 2017 - by - (0 comments)

Yesterday our ELK's ElasticSearch ran out of disk space and stopped working. After I deleted some older indexes and even grew the file system a bit, the ElasticSearch cluster status still showed red:

ElasticSearch Cluster Red Alert

But why? To make sure all shards were being handled correctly, I restarted one ES node and let it assign and re-index all the indexes. But it got stuck with 16 shards left unassigned.
That's when I realized something was off, and I found these two blog articles which helped me understand what was going on:
- https://thoughts.t37.net/how-to-fix-your-elasticsearch-cluster-stuck-in-initializing-shards-mode-ce196e20ba95
- https://www.datadoghq.com/blog/elasticsearch-unassigned-shards/

I manually verified which shards were left unassigned:

claudio@tux ~ $ curl -q -s "http://es01.exampe.com:9200/_cat/shards" | egrep "(UNASSIGNED|INIT)"
docker-2017.12.18      1 p UNASSIGNED                                
docker-2017.12.18      1 r UNASSIGNED                                
docker-2017.12.18      3 p UNASSIGNED                                
docker-2017.12.18      3 r UNASSIGNED                                
docker-2017.12.18      0 p UNASSIGNED                                
docker-2017.12.18      0 r UNASSIGNED                                
filebeat-2017.12.18    4 p UNASSIGNED                                
filebeat-2017.12.18    4 r UNASSIGNED                                
application-2017.12.18 4 p UNASSIGNED                                
application-2017.12.18 4 r UNASSIGNED                                
application-2017.12.18 0 p UNASSIGNED                                
application-2017.12.18 0 r UNASSIGNED                                
logstash-2017.12.18    1 p UNASSIGNED                                
logstash-2017.12.18    1 r UNASSIGNED                                
logstash-2017.12.18    0 p UNASSIGNED                                
logstash-2017.12.18    0 r UNASSIGNED  

Yep, here they are. A total of 16 shards (as mentioned by the monitoring) were not assigned.

I followed the hints from the articles above, however the syntax has changed since then. Both articles describe the "allocate" command, but in ElasticSearch 6.x this command does not exist anymore.
Instead there are now separate commands for replica and primary shards. From the documentation (https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-reroute.html):

 allocate_replica
    Allocate an unassigned replica shard to a node. Accepts the index and shard for index name and shard number, and node to allocate the shard to. Takes allocation deciders into account.



As a manual override, two commands to forcefully allocate primary shards are available:

allocate_stale_primary
    Allocate a primary shard to a node that holds a stale copy. Accepts the index and shard for index name and shard number, and node to allocate the shard to. Using this command may lead to data loss for the provided shard id. If a node which has the good copy of the data rejoins the cluster later on, that data will be overwritten with the data of the stale copy that was forcefully allocated with this command. To ensure that these implications are well-understood, this command requires the special field accept_data_loss to be explicitly set to true for it to work.
allocate_empty_primary
    Allocate an empty primary shard to a node. Accepts the index and shard for index name and shard number, and node to allocate the shard to. Using this command leads to a complete loss of all data that was indexed into this shard, if it was previously started. If a node which has a copy of the data rejoins the cluster later on, that data will be deleted! To ensure that these implications are well-understood, this command requires the special field accept_data_loss to be explicitly set to true for it to work.

So I created the following command to parse all unassigned shards and run the corresponding allocate command, depending on whether the shard is a primary or a replica shard (with echo to verify the command uses the correct variable values):

claudio@tux ~ $ curl -q -s "http://es01.exampe.com:9200/_cat/shards" | egrep "UNASSIGNED" | while read index shard type state; do if [ $type = "r" ]; then echo curl -X POST "http://es01.exampe.com:9200/_cluster/reroute" -d "{ \"commands\" : [ { \"allocate_replica\": { \"index\": \"$index\", \"shard\": $shard, \"node\": \"es01\" } } ] }"; elif [ $type = "p" ]; then echo curl -X POST "http://es01.exampe.com:9200/_cluster/reroute" -d "{ \"commands\" : [ { \"allocate_stale_primary\": { \"index\": \"$index\", \"shard\": $shard, \"node\": \"es02\", \"accept_data_loss\": true } } ] }"; fi; done
curl -X POST http://es01.exampe.com:9200/_cluster/reroute -d { "commands" : [ { "allocate_stale_primary": { "index": "docker-2017.12.18", "shard": 1, "node": "es02", "accept_data_loss": true } } ] }
curl -X POST http://es01.exampe.com:9200/_cluster/reroute -d { "commands" : [ { "allocate_replica": { "index": "docker-2017.12.18", "shard": 1, "node": "es01" } } ] }
curl -X POST http://es01.exampe.com:9200/_cluster/reroute -d { "commands" : [ { "allocate_stale_primary": { "index": "docker-2017.12.18", "shard": 3, "node": "es02", "accept_data_loss": true } } ] }
curl -X POST http://es01.exampe.com:9200/_cluster/reroute -d { "commands" : [ { "allocate_replica": { "index": "docker-2017.12.18", "shard": 3, "node": "es01" } } ] }
curl -X POST http://es01.exampe.com:9200/_cluster/reroute -d { "commands" : [ { "allocate_stale_primary": { "index": "docker-2017.12.18", "shard": 0, "node": "es02", "accept_data_loss": true } } ] }
curl -X POST http://es01.exampe.com:9200/_cluster/reroute -d { "commands" : [ { "allocate_replica": { "index": "docker-2017.12.18", "shard": 0, "node": "es01" } } ] }
curl -X POST http://es01.exampe.com:9200/_cluster/reroute -d { "commands" : [ { "allocate_stale_primary": { "index": "filebeat-2017.12.18", "shard": 4, "node": "es02", "accept_data_loss": true } } ] }
curl -X POST http://es01.exampe.com:9200/_cluster/reroute -d { "commands" : [ { "allocate_replica": { "index": "filebeat-2017.12.18", "shard": 4, "node": "es01" } } ] }
curl -X POST http://es01.exampe.com:9200/_cluster/reroute -d { "commands" : [ { "allocate_stale_primary": { "index": "application-2017.12.18", "shard": 4, "node": "es02", "accept_data_loss": true } } ] }
curl -X POST http://es01.exampe.com:9200/_cluster/reroute -d { "commands" : [ { "allocate_replica": { "index": "application-2017.12.18", "shard": 4, "node": "es01" } } ] }
curl -X POST http://es01.exampe.com:9200/_cluster/reroute -d { "commands" : [ { "allocate_stale_primary": { "index": "application-2017.12.18", "shard": 0, "node": "es02", "accept_data_loss": true } } ] }
curl -X POST http://es01.exampe.com:9200/_cluster/reroute -d { "commands" : [ { "allocate_replica": { "index": "application-2017.12.18", "shard": 0, "node": "es01" } } ] }
curl -X POST http://es01.exampe.com:9200/_cluster/reroute -d { "commands" : [ { "allocate_stale_primary": { "index": "logstash-2017.12.18", "shard": 1, "node": "es02", "accept_data_loss": true } } ] }
curl -X POST http://es01.exampe.com:9200/_cluster/reroute -d { "commands" : [ { "allocate_replica": { "index": "logstash-2017.12.18", "shard": 1, "node": "es01" } } ] }
curl -X POST http://es01.exampe.com:9200/_cluster/reroute -d { "commands" : [ { "allocate_stale_primary": { "index": "logstash-2017.12.18", "shard": 0, "node": "es02", "accept_data_loss": true } } ] }
curl -X POST http://es01.exampe.com:9200/_cluster/reroute -d { "commands" : [ { "allocate_replica": { "index": "logstash-2017.12.18", "shard": 0, "node": "es01" } } ] }

But when I ran the command without the "echo", I got a ton of errors back. Here's a snippet from the huge error message:

"index":"logstash-2017.11.24","allocation_id":{"id":"eTNR1rY2TSqVhbzng-gTqA"}},{"state":"STARTED","primary":true,"node":"0o0eQXxcSJuWIFG2ohjwUg","relocating_node":null,"shard":2,"index":"logstash-2017.11.24","allocation_id":{"id":"v4BjD0FAR2SCbEWmWXv5QQ"}},{"state":"STARTED","primary":true,"node":"0o0eQXxcSJuWIFG2ohjwUg","relocating_node":null,"shard":4,"index":"logstash-2017.11.24","allocation_id":{"id":"L9uG4CIXS8-QAs8_0UAXWA"}},{"state":"STARTED","primary":true,"node":"0o0eQXxcSJuWIFG2ohjwUg","relocating_node":null,"shard":3,"index":"logstash-2017.11.24","allocation_id":{"id":"0xS1BcwSQpqn9JpjL6tJlg"}},{"state":"STARTED","primary":false,"node":"0o0eQXxcSJuWIFG2ohjwUg","relocating_node":null,"shard":0,"index":"logstash-2017.11.24","allocation_id":{"id":"QWO_lYpIRL6U8gSjTNL8pw"}}]}}}}{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"[allocate_replica] trying to allocate a replica shard [logstash-2017.12.18][0], while corresponding primary shard is still unassigned"}],"type":"illegal_argument_exception","reason":"[allocate_replica] trying to allocate a replica shard [logstash-2017.12.18][0], while corresponding primary shard is still unassigned"},"status":400}

The important part being:

trying to allocate a replica shard [logstash-2017.12.18][0], while corresponding primary shard is still unassigned

Makes sense. I tried to allocate a replica shard but obviously the primary shard needs to be allocated first. I changed the while loop to only run on primary shards:

claudio@tux ~ $ curl -q -s "http://es01.exampe.com:9200/_cat/shards" | egrep "UNASSIGNED" | while read index shard type state; do if [ $type = "p" ]; then curl -X POST "http://es01.exampe.com:9200/_cluster/reroute" -d "{ \"commands\" : [ { \"allocate_stale_primary\": { \"index\": \"$index\", \"shard\": $shard, \"node\": \"es01\", \"accept_data_loss\": true } } ] }"; fi; done
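
For readability, the same primary-shard loop can also be written as a short script. This is just a sketch using the same placeholder host and node names as above; note that newer ElasticSearch versions also expect an explicit Content-Type header on requests with a body:

#!/bin/bash
# force-allocate every unassigned primary shard on data node es01
ES="http://es01.exampe.com:9200"
curl -q -s "$ES/_cat/shards" | grep UNASSIGNED | while read -r index shard type state; do
  if [ "$type" = "p" ]; then
    curl -s -X POST "$ES/_cluster/reroute" -H 'Content-Type: application/json' \
      -d "{ \"commands\" : [ { \"allocate_stale_primary\": { \"index\": \"$index\", \"shard\": $shard, \"node\": \"es01\", \"accept_data_loss\": true } } ] }"
  fi
done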

This time it seemed to work. I verified the unassigned shards again:

claudio@tux ~ $ curl -q -s "http://es01.exampe.com:9200/_cat/shards" | egrep "UNASSIGNED"
logstash-2017.12.18    0 r UNASSIGNED                                  
filebeat-2017.12.19    1 r UNASSIGNED                                  
filebeat-2017.12.19    3 r UNASSIGNED                                  
docker-2017.12.18      3 r UNASSIGNED                                  
application-2017.12.18 4 r UNASSIGNED                                  
application-2017.12.18 0 r UNASSIGNED

Hey, far fewer now. And it seems that some of the replica shards were automatically assigned, too.
And now the curl command to force the allocation of the replica shards:

claudio@tux ~ $ curl -q -s "http://es01.exampe.com:9200/_cat/shards" | egrep "UNASSIGNED" | while read index shard type state; do if [ $type = "r" ]; then curl -X POST "http://es01.exampe.com:9200/_cluster/reroute" -d "{ \"commands\" : [ { \"allocate_replica\": { \"index\": \"$index\", \"shard\": $shard, \"node\": \"es02\" } } ] }"; fi; done

Note: I set data node es01 for primary shards and es02 for replica shards. You don't want to have both primary and replica shards on the same node. Don't forget about that.
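
To double-check where the copies of a single index ended up, the _cat API can also be filtered by index name; the ?v parameter adds column headers, including the node column:

# show all shards of one index including their state and the node they live on
curl -q -s "http://es01.exampe.com:9200/_cat/shards/logstash-2017.12.18?v"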

I checked the current status again and some of the allocated shards were now being re-indexed (but no unassigned shards were found anymore):

claudio@tux ~ $ curl -q -s "http://es01.exampe.com:9200/_cat/shards" | egrep "(UNASSIGNED|INIT)"
application-2017.12.18 4 r INITIALIZING                  10.161.206.52 es02
application-2017.12.18 0 r INITIALIZING                  10.161.206.52 es02
logstash-2017.12.18    1 r INITIALIZING                  10.161.206.52 es02
logstash-2017.12.18    0 r INITIALIZING                  10.161.206.52 es02

It took a couple of minutes until, eventually, all indexes were finished and the cluster returned to green:

claudio@tux ~ $ curl -q -s "http://es01.exampe.com:9200/_cat/shards" | egrep "(UNASSIGNED|INIT)"; date
logstash-2017.12.18    0 r INITIALIZING                  10.161.206.52 es02
Tue Dec 19 13:52:55 CET 2017

claudio@tux ~ $ curl -q -s "http://es01.exampe.com:9200/_cat/shards" | egrep "(UNASSIGNED|INIT)"; date
Tue Dec 19 13:54:50 CET 2017
claudio@tux ~ $

ElasticSearch Cluster Green Monitoring

 

Monitor Windows disk space usage on a drive without letter
Monday - Dec 18th 2017 - by - (0 comments)

Monitoring a drive on Windows is pretty easy, as a drive usually has a drive letter assigned (for example C:).
Here I'm using NSClient++ running as an agent on the Windows host, while on the monitoring server I use check_nt to query the agent:

nagios@monitoring:~# /usr/lib/nagios/plugins/check_nt -H sqldevserver -p 1248 -v USEDDISKSPACE -l "D" -w 100G -c 99
D:\ - total: 99.87 Gb - used: 10.93 Gb (11%) - free 88.94 Gb (89%) | 'D:\ Used Space'=10.93Gb;99.87;98.87;0.00;99.87

But what about drives that appear in Disk Management without an assigned drive letter and are instead mounted as a folder?
In this example we have the classic C: drive for the Windows OS and an additional D: data partition. But as you can see in the Disk Management UI, Disk 2 (named SQL_Data_DEV001) and Disk 3 (named SQL_Log_DEV001) have no drive letter assigned.

 Windows Disk Management Drives without letter

Instead they're mounted as a subfolder inside D:
- Drive 2 is mounted on D:\SQL_Data
- Drive 3 is mounted on D:\SQL_Log

Windows drives mounted as subfolder 

Unfortunately, with NSClient, check_nt and the USEDDISKSPACE variable this won't work, because a drive letter is a must. From the check_nt manpage:

 USEDDISKSPACE =
  Size and percentage of disk use.
  Request a -l parameter containing the drive letter only.

But NSClient++ also speaks NRPE, and its internal checks are more modern. To check a drive with NSClient++ as the agent and check_nrpe from the monitoring server:

nagios@monitoring:~# /usr/lib/nagios/plugins/check_nrpe -H sqldevserver -c check_drivesize -a "drive=D:"
OK All 1 drive(s) are ok|'D: used'=10.94061GB;79.89838;89.88568;0;99.87298 'D: used %'=10%;79;89;0;100

And here comes the good news: the NRPE command check_drivesize (configured internally within the NSClient++ agent, no need to define this command anywhere yourself) also supports mounted volumes. From the NSClient++ documentation:

To check the size of a mounted volume (c:\volume_test)[...]

According to the documentation, only the mount-path is needed. Let's try that:

nagios@monitoring:~# /usr/lib/nagios/plugins/check_nrpe -H sqldevserver -c check_drivesize -a "drive=D:\SQL_Data"
OK All 1 drive(s) are ok|'D:\SQL_Data used'=302.15924GB;319.8976;359.8848;0;399.872 'D:\SQL_Data used %'=75%;79;89;0;100

Indeed, the returned values are different from those of the D: drive!
And this is how you can monitor Windows drives/partitions without a drive letter.
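
If you have several such mounted folders, you can simply loop over them from the monitoring server. A small sketch using the folder names from this example:

# check every mounted folder on the SQL server in one go
for mount in 'D:\SQL_Data' 'D:\SQL_Log'; do
  /usr/lib/nagios/plugins/check_nrpe -H sqldevserver -c check_drivesize -a "drive=$mount"
done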

 

Recover a crashed MySQL or MariaDB InnoDB database from ibd files
Wednesday - Dec 13th 2017 - by - (0 comments)

It happened. Defective power supply. Zap. Dark.

The database (MariaDB 10.0) running on that particular server (Debian Jessie) suffered data corruption and data loss. A million ext4 file system errors (I'm exaggerating, but in 10+ years on Linux I have never seen so many filesystem errors) were detected and most of them repaired. But for the database all hope was lost. Simply too much corruption had occurred to recover the databases after a start of MariaDB.

At least the daily database dumps could be used to restore all databases. But that also meant that all changes since the dump was taken were lost. One particular database contained a lot of changes exactly in that time range - a lot of effort had been put into this one database. So I was looking for a way to get the data from the moment of the crash back online.

This particular database (wordpress) is not heavily used, so there was a chance that the InnoDB files (ibd) were still usable. Most of the Internet's howtos on recovering a crashed MySQL/MariaDB database simply point you to a restore from a previously backed-up database dump. Some articles I found hinted that the files can simply be put into /var/lib/mysql/wordpress/ while the database server is stopped and that a restart should recover them. That's just bullcrap, as a mysqlcheck reveals:

root@ /var/lib/mysql # mysqlcheck --all-databases -p
Enter password:
mysql.column_stats                                 OK
mysql.columns_priv                                 OK
mysql.db                                           OK
mysql.event                                        OK
mysql.func                                         OK
mysql.gtid_slave_pos
Error    : Table 'mysql.gtid_slave_pos' doesn't exist in engine
status   : Operation failed
mysql.help_category                                OK
mysql.help_keyword                                 OK
mysql.help_relation                                OK
mysql.help_topic                                   OK
mysql.host                                         OK
mysql.index_stats                                  OK
mysql.innodb_index_stats
Error    : Table 'mysql.innodb_index_stats' doesn't exist in engine
status   : Operation failed
mysql.innodb_table_stats
Error    : Table 'mysql.innodb_table_stats' doesn't exist in engine
status   : Operation failed
mysql.plugin                                       OK
mysql.proc                                         OK
mysql.procs_priv                                   OK
mysql.proxies_priv                                 OK
mysql.roles_mapping                                OK
mysql.servers                                      OK
mysql.table_stats                                  OK
mysql.tables_priv                                  OK
mysql.time_zone                                    OK
mysql.time_zone_leap_second                        OK
mysql.time_zone_name                               OK
mysql.time_zone_transition                         OK
mysql.time_zone_transition_type                    OK
mysql.user                                         OK
wordpress.np_commentmeta
Error    : Table 'wordpress.np_commentmeta' doesn't exist in engine
status   : Operation failed
wordpress.np_comments
Error    : Table 'wordpress.np_comments' doesn't exist in engine
status   : Operation failed
wordpress.np_layerslider
Error    : Table 'wordpress.np_layerslider' doesn't exist in engine
status   : Operation failed
wordpress.np_layerslider_revisions
Error    : Table 'wordpress.np_layerslider_revisions' doesn't exist in engine
status   : Operation failed
wordpress.np_links
Error    : Table 'wordpress.np_links' doesn't exist in engine
status   : Operation failed
wordpress.np_options
Error    : Table 'wordpress.np_options' doesn't exist in engine
status   : Operation failed
wordpress.np_postmeta
Error    : Table 'wordpress.np_postmeta' doesn't exist in engine
status   : Operation failed
wordpress.np_posts
Error    : Table 'wordpress.np_posts' doesn't exist in engine
status   : Operation failed
wordpress.np_term_relationships
Error    : Table 'wordpress.np_term_relationships' doesn't exist in engine
status   : Operation failed
wordpress.np_term_taxonomy
Error    : Table 'wordpress.np_term_taxonomy' doesn't exist in engine
status   : Operation failed
wordpress.np_termmeta
Error    : Table 'wordpress.np_termmeta' doesn't exist in engine
status   : Operation failed
wordpress.np_terms
Error    : Table 'wordpress.np_terms' doesn't exist in engine
status   : Operation failed
wordpress.np_usermeta
Error    : Table 'wordpress.np_usermeta' doesn't exist in engine
status   : Operation failed
wordpress.np_users
Error    : Table 'wordpress.np_users' doesn't exist in engine
status   : Operation failed
wordpress.np_yoast_seo_links
Error    : Table 'wordpress.np_yoast_seo_links' doesn't exist in engine
status   : Operation failed
wordpress.np_yoast_seo_meta
Error    : Table 'wordpress.np_yoast_seo_meta' doesn't exist in engine
status   : Operation failed

But eventually I came across a very interesting answer on dba.stackexchange.com. The answer came from user "carment", and this is seriously one of the best and most important howtos I have ever found on a MySQL/MariaDB topic: she described how to restore the database from the frm and ibd files only.

First of all you need the command "mysqlfrm", which is part of the package mysql-utilities. Install that package:

root@ ~ # apt-get install mysql-utilities

I placed the database's files from the time of the crash into /tmp/wordpress:

root@ ~ # ll /tmp/wordpress
total 23668
-rw-rw---- 1 mysql mysql       61 Dec  7 10:51 db.opt
-rw-rw---- 1 mysql mysql     3033 Dec  7 10:55 np_commentmeta.frm
-rw-rw---- 1 mysql mysql   131072 Dec 12 10:58 np_commentmeta.ibd
-rw-rw---- 1 mysql mysql     6685 Dec  7 10:55 np_comments.frm
-rw-rw---- 1 mysql mysql   180224 Dec 12 11:02 np_comments.ibd
-rw-rw---- 1 mysql mysql     2047 Dec  7 11:07 np_layerslider.frm
-rw-rw---- 1 mysql mysql    98304 Dec  7 11:07 np_layerslider.ibd
-rw-rw---- 1 mysql mysql     1041 Dec  7 11:07 np_layerslider_revisions.frm
-rw-rw---- 1 mysql mysql    98304 Dec  7 11:07 np_layerslider_revisions.ibd
-rw-rw---- 1 mysql mysql     8105 Dec  7 10:55 np_links.frm
-rw-rw---- 1 mysql mysql   114688 Dec  7 10:55 np_links.ibd
-rw-rw---- 1 mysql mysql     2365 Dec  7 10:55 np_options.frm
-rw-rw---- 1 mysql mysql   507904 Dec 12 17:41 np_options.ibd
-rw-rw---- 1 mysql mysql     3030 Dec  7 10:55 np_postmeta.frm
-rw-rw---- 1 mysql mysql  9437184 Dec 12 17:42 np_postmeta.ibd
-rw-rw---- 1 mysql mysql     7223 Dec  7 10:55 np_posts.frm
-rw-rw---- 1 mysql mysql 12582912 Dec 12 17:42 np_posts.ibd
-rw-rw---- 1 mysql mysql     3030 Dec  7 10:55 np_termmeta.frm
-rw-rw---- 1 mysql mysql   131072 Dec  7 10:55 np_termmeta.ibd
-rw-rw---- 1 mysql mysql     1496 Dec  7 10:55 np_term_relationships.frm
-rw-rw---- 1 mysql mysql   114688 Dec 12 17:03 np_term_relationships.ibd
-rw-rw---- 1 mysql mysql     3592 Dec  7 10:55 np_terms.frm
-rw-rw---- 1 mysql mysql   131072 Dec 12 17:07 np_terms.ibd
-rw-rw---- 1 mysql mysql     2209 Dec  7 10:55 np_term_taxonomy.frm
-rw-rw---- 1 mysql mysql   131072 Dec 12 17:07 np_term_taxonomy.ibd
-rw-rw---- 1 mysql mysql     3031 Dec  7 10:55 np_usermeta.frm
-rw-rw---- 1 mysql mysql   131072 Dec 12 17:07 np_usermeta.ibd
-rw-rw---- 1 mysql mysql     6965 Dec  7 10:55 np_users.frm
-rw-rw---- 1 mysql mysql   147456 Dec  7 17:51 np_users.ibd
-rw-rw---- 1 mysql mysql     2585 Dec  7 11:08 np_yoast_seo_links.frm
-rw-rw---- 1 mysql mysql   114688 Dec 12 17:17 np_yoast_seo_links.ibd
-rw-rw---- 1 mysql mysql     1015 Dec  7 11:08 np_yoast_seo_meta.frm
-rw-rw---- 1 mysql mysql    98304 Dec 12 17:41 np_yoast_seo_meta.ibd

As you can see, the table files are all there. But at that point I didn't know whether there was any data corruption in the files.
The frm files contain the structure of the table; the ibd files contain the data itself.

The mysqlfrm command spawns a "parallel" MySQL instance and reads the information from the frm file (given by the path at the end of the command). Before you do that, make sure you're in a folder the mysql user can write to (e.g. /tmp), because it is advised not to run mysqlfrm's parallel MySQL instance as the root user (that's why I added --user to the command):

root@ ~ # cd /tmp/

Now run mysqlfrm and make sure you don't use the "real" MySQL port of the already running server (here I chose 3307):

root@ /tmp # mysqlfrm --user=mysql --server=root:maria@localhost --port=3307 /tmp/wordpress/np_commentmeta.frm -vvv
# Source on localhost: ... connected.
# Checking read access to .frm files
# Creating a temporary datadir = /tmp/c1205314-ceba-464c-b1d9-65e6240dbf21
# Spawning server with --user=mysql.
# Starting the spawned server on port 3307 ...
# Cloning the MySQL server located at /usr.
# Configuring new instance...
# Locating mysql tools...
# Location of files:
#                       mysqld: /usr/sbin/mysqld
#                   mysqladmin: /usr/bin/mysqladmin
#      mysql_system_tables.sql: /usr/share/mysql/mysql_system_tables.sql
# mysql_system_tables_data.sql: /usr/share/mysql/mysql_system_tables_data.sql
# mysql_test_data_timezone.sql: /usr/share/mysql/mysql_test_data_timezone.sql
#         fill_help_tables.sql: /usr/share/mysql/fill_help_tables.sql
# Setting up empty database and mysql tables...
171213  9:03:28 [Note] /usr/sbin/mysqld (mysqld 10.0.30-MariaDB-0+deb8u2) starting as process 3211 ...
171213  9:03:28 [Note] InnoDB: Using mutexes to ref count buffer pool pages
171213  9:03:28 [Note] InnoDB: The InnoDB memory heap is disabled
171213  9:03:28 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
171213  9:03:28 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
171213  9:03:28 [Note] InnoDB: Compressed tables use zlib 1.2.8
171213  9:03:28 [Note] InnoDB: Using Linux native AIO
171213  9:03:28 [Note] InnoDB: Using CPU crc32 instructions
171213  9:03:28 [Note] InnoDB: Initializing buffer pool, size = 128.0M
171213  9:03:28 [Note] InnoDB: Completed initialization of buffer pool
171213  9:03:28 [Note] InnoDB: The first specified data file ./ibdata1 did not exist: a new database to be created!
171213  9:03:28 [Note] InnoDB: Setting file ./ibdata1 size to 12 MB
171213  9:03:28 [Note] InnoDB: Database physically writes the file full: wait...
171213  9:03:28 [Note] InnoDB: Setting log file ./ib_logfile101 size to 48 MB
171213  9:03:28 [Note] InnoDB: Setting log file ./ib_logfile1 size to 48 MB
171213  9:03:29 [Note] InnoDB: Renaming log file ./ib_logfile101 to ./ib_logfile0
171213  9:03:29 [Warning] InnoDB: New log files created, LSN=45781
171213  9:03:29 [Note] InnoDB: Doublewrite buffer not found: creating new
171213  9:03:29 [Note] InnoDB: Doublewrite buffer created
171213  9:03:29 [Note] InnoDB: 128 rollback segment(s) are active.
171213  9:03:29 [Warning] InnoDB: Creating foreign key constraint system tables.
171213  9:03:29 [Note] InnoDB: Foreign key constraint system tables created
171213  9:03:29 [Note] InnoDB: Creating tablespace and datafile system tables.
171213  9:03:29 [Note] InnoDB: Tablespace and datafile system tables created.
171213  9:03:29 [Note] InnoDB: Waiting for purge to start
171213  9:03:29 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.35-80.0 started; log sequence number 0
171213  9:03:29 [Note] Plugin 'FEEDBACK' is disabled.
171213  9:03:30 [Note] InnoDB: FTS optimize thread exiting.
171213  9:03:30 [Note] InnoDB: Starting shutdown...
171213  9:03:30 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
171213  9:03:32 [Note] InnoDB: Shutdown completed; log sequence number 1616697
# Starting new instance of the server...
# Startup command for new server:
/usr/sbin/mysqld --no-defaults --datadir=/tmp/c1205314-ceba-464c-b1d9-65e6240dbf21 --tmpdir=/tmp/c1205314-ceba-464c-b1d9-65e6240dbf21 --pid-file=/tmp/c1205314-ceba-464c-b1d9-65e6240dbf21/clone.pid --port=3307 --server-id=101 --basedir=/usr --socket=/tmp/c1205314-ceba-464c-b1d9-65e6240dbf21/mysql.sock --user=mysql
# Testing connection to new instance...
171213  9:03:32 [Note] /usr/sbin/mysqld (mysqld 10.0.30-MariaDB-0+deb8u2) starting as process 3237 ...
171213  9:03:32 [Note] InnoDB: Using mutexes to ref count buffer pool pages
171213  9:03:32 [Note] InnoDB: The InnoDB memory heap is disabled
171213  9:03:32 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
171213  9:03:32 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
171213  9:03:32 [Note] InnoDB: Compressed tables use zlib 1.2.8
171213  9:03:32 [Note] InnoDB: Using Linux native AIO
171213  9:03:32 [Note] InnoDB: Using CPU crc32 instructions
171213  9:03:32 [Note] InnoDB: Initializing buffer pool, size = 128.0M
171213  9:03:32 [Note] InnoDB: Completed initialization of buffer pool
171213  9:03:32 [Note] InnoDB: Highest supported file format is Barracuda.
171213  9:03:32 [Note] InnoDB: 128 rollback segment(s) are active.
171213  9:03:32 [Note] InnoDB: Waiting for purge to start
171213  9:03:32 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.35-80.0 started; log sequence number 1616697
171213  9:03:32 [Note] Plugin 'FEEDBACK' is disabled.
171213  9:03:32 [Note] Server socket created on IP: '::'.
171213  9:03:32 [ERROR] Native table 'performance_schema'.'cond_instances' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_waits_current' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_waits_history' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_waits_history_long' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_waits_summary_by_host_by_event_name' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_waits_summary_by_instance' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_waits_summary_by_thread_by_event_name' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_waits_summary_by_user_by_event_name' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_waits_summary_by_account_by_event_name' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_waits_summary_global_by_event_name' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'file_instances' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'file_summary_by_event_name' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'file_summary_by_instance' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'host_cache' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'mutex_instances' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'objects_summary_global_by_type' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'performance_timers' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'rwlock_instances' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'setup_actors' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'setup_consumers' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'setup_instruments' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'setup_objects' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'setup_timers' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'table_io_waits_summary_by_index_usage' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'table_io_waits_summary_by_table' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'table_lock_waits_summary_by_table' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'threads' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_stages_current' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_stages_history' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_stages_history_long' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_stages_summary_by_thread_by_event_name' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_stages_summary_by_account_by_event_name' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_stages_summary_by_user_by_event_name' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_stages_summary_by_host_by_event_name' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_stages_summary_global_by_event_name' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_statements_current' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_statements_history' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_statements_history_long' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_statements_summary_by_thread_by_event_name' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_statements_summary_by_account_by_event_name' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_statements_summary_by_user_by_event_name' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_statements_summary_by_host_by_event_name' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_statements_summary_global_by_event_name' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'events_statements_summary_by_digest' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'users' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'accounts' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'hosts' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'socket_instances' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'socket_summary_by_instance' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'socket_summary_by_event_name' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'session_connect_attrs' has the wrong structure
171213  9:03:32 [ERROR] Native table 'performance_schema'.'session_account_connect_attrs' has the wrong structure
171213  9:03:32 [Note] /usr/sbin/mysqld: ready for connections.
Version: '10.0.30-MariaDB-0+deb8u2'  socket: '/tmp/c1205314-ceba-464c-b1d9-65e6240dbf21/mysql.sock'  port: 3307  (Debian)
# trying again...
# Success!
# Setting the root password...
# Connection Information:
#  -uroot -proot --socket=/tmp/c1205314-ceba-464c-b1d9-65e6240dbf21/mysql.sock
#...done.
# Connecting to spawned server
done.
# Reading .frm files
#
# Reading the np_commentmeta.frm file.
# Changing engine for .frm file /tmp/c1205314-ceba-464c-b1d9-65e6240dbf21/wordpress_temp/np_commentmeta.frm:
# Skipping to header at : 2
# General Data from .frm file:
{'IO_SIZE': 86,
 'MYSQL_VERSION_ID': 100030,
 'avg_row_length': 0,
 'charset_low': 0,
 'create_options': 9,
 'db_create_pack': 2,
 'default_charset': 224,
 'default_part_eng': 0,
 'extra_size': 16,
 'frm_file_ver': 5,
 'frm_version': 10,
 'key_block_size': 0,
 'key_info_length': 87,
 'key_length': 1483,
 'legacy_db_type': 'INNODB',
 'length': 3033,
 'max_rows': 0,
 'min_rows': 0,
 'rec_length': 1051,
 'row_type': 0,
 'table_charset': 224,
 'tmp_key_length': 1483}
# Engine string: InnoDB
# Server version in file: 1.0.30
#
# CREATE statement for /tmp/wordpress/np_commentmeta.frm:
#

CREATE TABLE `wordpress`.`np_commentmeta` (
  `meta_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `comment_id` bigint(20) unsigned NOT NULL DEFAULT '0',
  `meta_key` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `meta_value` longtext COLLATE utf8mb4_unicode_ci,
  PRIMARY KEY (`meta_id`),
  KEY `comment_id` (`comment_id`),
  KEY `meta_key` (`meta_key`(191))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

# Shutting down spawned server
# Removing the temporary datadir
171213  9:03:33 [Note] /usr/sbin/mysqld: Normal shutdown

171213  9:03:33 [Note] Event Scheduler: Purging the queue. 0 events
171213  9:03:33 [Note] InnoDB: FTS optimize thread exiting.
171213  9:03:33 [Note] InnoDB: Starting shutdown...
#...done.
171213  9:03:34 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
171213  9:03:35 [Note] InnoDB: Shutdown completed; log sequence number 1616707
171213  9:03:35 [Note] /usr/sbin/mysqld: Shutdown complete

That's obviously a lot of information, but I used -vvv (debug mode) to see what's actually happening in the background. At the end of the output, if the command was successful, it prints the table's CREATE statement. This statement can now be used on the real MySQL server instance, right after creating the new database:

MariaDB [(none)]> CREATE DATABASE wordpress;
Query OK, 1 row affected (0.00 sec)

MariaDB [(none)]> CREATE TABLE `wordpress`.`np_commentmeta` (
    ->   `meta_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
    ->   `comment_id` bigint(20) unsigned NOT NULL DEFAULT '0',
    ->   `meta_key` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
    ->   `meta_value` longtext COLLATE utf8mb4_unicode_ci,
    ->   PRIMARY KEY (`meta_id`),
    ->   KEY `comment_id` (`comment_id`),
    ->   KEY `meta_key` (`meta_key`(191))
    -> ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

Query OK, 0 rows affected (0.14 sec)

MariaDB [(none)]> use wordpress;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [wordpress]> show tables;
+----------------------------+
| Tables_in_wordpress |
+----------------------------+
| np_commentmeta             |
+----------------------------+
1 row in set (0.00 sec)

So far so good - the table was created. But what about the data? And this is where the tricky part comes in: first you need to tell MySQL that this particular table (np_commentmeta) should discard its tablespace:

MariaDB [wordpress]> ALTER TABLE np_commentmeta DISCARD TABLESPACE;
Query OK, 0 rows affected (0.01 sec)

This basically tells MySQL to forget the .ibd file of that table.

Now on the filesystem, I copied the ibd file of that table from /tmp/wordpress right into the data directory of MySQL:

root@ /tmp # cp -p /tmp/wordpress/np_commentmeta.ibd /var/lib/mysql/wordpress/

The following command does the magic trick: It will import the table's tablespace again (using the now copied ibd file):

MariaDB [wordpress]> ALTER TABLE np_commentmeta IMPORT TABLESPACE;
Query OK, 0 rows affected, 1 warning (0.21 sec)

There's 1 warning shown, but it can be ignored (phew!).

Now let's check if we got some data back:

MariaDB [wordpress]> select * from np_commentmeta;
+---------+------------+-----------------------+------------+
| meta_id | comment_id | meta_key              | meta_value |
+---------+------------+-----------------------+------------+
|       1 |          1 | _wp_trash_meta_status | 1          |
|       2 |          1 | _wp_trash_meta_time   | 1513072669 |
|       3 |          2 | _wp_trash_meta_status | 1          |
|       4 |          2 | _wp_trash_meta_time   | 1513072672 |
+---------+------------+-----------------------+------------+
4 rows in set (0.00 sec)

Hurray! The data is back!

I continued these steps for all tables of the wordpress database and was able to successfully recover the whole database with the status right before the crash.
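
Since the same discard/copy/import sequence repeats for every table, a short loop can save some typing once all CREATE statements have been applied. This is only a sketch using the paths and database name from above (mysql prompts for the password on every call; cp -p preserves the mysql ownership of the files):

# for each recovered .ibd file: discard the tablespace, copy the file into the
# MySQL data directory, then import the tablespace again
for ibd in /tmp/wordpress/*.ibd; do
  table=$(basename "$ibd" .ibd)
  mysql -p -e "ALTER TABLE wordpress.${table} DISCARD TABLESPACE;"
  cp -p "$ibd" /var/lib/mysql/wordpress/
  mysql -p -e "ALTER TABLE wordpress.${table} IMPORT TABLESPACE;"
done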

Note: According to user comments on the mentioned stackexchange link, this works in MySQL 5.6 and also MySQL 5.7 (so probably > 5.6). In my case it was, as mentioned, a MariaDB 10.0.

 

Create separate measurement tables in InfluxDB for Icinga 2 NRPE checks
Tuesday - Dec 12th 2017 - by - (0 comments)

In a previous article I wrote how Icinga 2 performance graphs can be created using InfluxDB and Grafana. At the end of the article I mentioned a special note concerning NRPE checks:

Note: For NRPE checks you will have to adapt the graphs because these performance data are stored in the "nrpe" measurements table. 

My monitoring architecture relies heavily on remotely executed checks using check_nrpe, therefore almost all system-related information (cpu, memory, network io, diskspace, etc.) was collected in one and the same measurement table, nrpe:

root@inf-mon02-t:~# influx
Visit https://enterprise.influxdata.com to register for updates, InfluxDB server management, and monitoring.
Connected to http://localhost:8086 version 0.10.0
InfluxDB shell 0.10.0
> USE icinga2
Using database icinga2
> SHOW MEASUREMENTS
name: measurements
------------------
name
apt
disk
hostalive
http
icinga
load
ping4
ping6
procs
ssh
swap
users

At the beginning of this year, in January 2017, I had some problems with PNP4Nagios and NRPE checks. I was unable to control the graphs' behavior for certain remotely executed checks, because PNP4Nagios interpreted all the checks as the same plugin: check_nrpe. With a workaround (applying a special variable containing the NRPE check command) I was able to create separate PNP4Nagios templates for each individual remote NRPE check command (see the article Creating custom PNP4Nagios template in Icinga 2 for NRPE checks for more details).
Where am I going with this? The same workaround can also be applied to the InfluxdbWriter object!

First I modified the apply rule which adds the remote disk usage checks (you guessed it, using check_nrpe) on the Linux hosts:

apply Service "Diskspace " for (partition_name => config in host.vars.partitions) {
  import "generic-service"

  vars += config
  if (!vars.warn) { vars.warn = "15%" }
  if (!vars.crit) { vars.crit = "5%" }
  if (!vars.iwarn) { vars.iwarn = "15%" }
  if (!vars.icrit) { vars.icrit = "5%" }
  if (!vars.service) { vars.service = "generic-service" }

  import vars.service

  display_name = "Diskspace " + partition_name
  check_command = "nrpe"
  vars.nrpe_command = "check_disk"
  vars.nrpe_arguments = [ vars.warn, vars.crit, partition_name, vars.iwarn, vars.icrit ]
  vars.influx_append = "_$nrpe_command$"

  assign where host.address && host.vars.os == "Linux"
  ignore where host.vars.applyignore.partitions == true
}

Note: For more information about such advanced Icinga2 configurations using apply rules, take a look at Icinga 2: Advanced usage of arrays/dictionaries for monitoring of partition.

Take a look at the following line:

  vars.influx_append = "_$nrpe_command$"

Here I define a new variable, influx_append. It is a string starting with an underscore (_) followed by the value of the variable nrpe_command, which is check_disk, as you can see two lines above it. Whenever this applied disk usage check runs, the service object now also contains the variable influx_append. This can then be used in the InfluxdbWriter.

The InfluxdbWriter feature object needs to be modified in such a way that the measurement table to use/create contains the value of the influx_append variable. This is how I've done it:

root@inf-mon02-t:~# cat /etc/icinga2/features-enabled/influxdb.conf
/**
 * The InfluxdbWriter type writes check result metrics and
 * performance data to an InfluxDB HTTP API
 */

library "perfdata"

object InfluxdbWriter "influxdb" {
  //host = "127.0.0.1"
  //port = 8086
  //database = "icinga2"
  //flush_threshold = 1024
  //flush_interval = 10s
  //host_template = {
  //  measurement = "$host.check_command$"
  //  tags = {
  //    hostname = "$host.name$"
  //  }
  //}
  service_template = {
  //  measurement = "$service.check_command$"
    measurement = "$service.check_command$$influx_append$"
    tags = {
      hostname = "$host.name$"
      service = "$service.name$"
    }
  }
}

As you can see, I kept the defaults but un-commented the service_template part. The original measurement definition is still there (commented out). I slightly modified it:

    measurement = "$service.check_command$$influx_append$"

So the measurement name to be used now has additional content appended. The nice thing is: this doesn't change anything for the locally executed checks like http or ldap, because the variable influx_append is empty unless it comes from the NRPE disk usage check. On the other hand, as soon as a disk usage check through check_nrpe is executed, the variable contains information and the measurement is appended like this: measurement = nrpe_check_disk.

After a restart of Icinga 2, the following can be seen in the debug logs (you must enable debug level in /etc/icinga2/features-enabled/mainlog.conf):

[2017-12-12 14:13:16 +0100] debug/InfluxdbWriter: Add to metric list: 'nrpe_check_disk,hostname=remoteserver01,service=Diskspace\ /var,metric=/var value=387973120 1513084396'.
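
To watch for these entries while testing, you can simply follow the log and filter for the InfluxdbWriter. A sketch, assuming the default Debian location of the Icinga 2 main log:

# follow the Icinga 2 log and only show InfluxdbWriter entries
tail -f /var/log/icinga2/icinga2.log | grep InfluxdbWriter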

Inside the InfluxDB this can be verified now:


root@inf-mon02-t:~# influx
Visit https://enterprise.influxdata.com to register for updates, InfluxDB server management, and monitoring.
Connected to http://localhost:8086 version 0.10.0
InfluxDB shell 0.10.0
> use icinga2
Using database icinga2
> show measurements
name: measurements
------------------
name
apt
disk
dns
hostalive
http
icinga
ldap
load
nrpe
nrpe_check_disk
ping4
ping6
procs
ssh
swap
users

Indeed, the measurement table nrpe_check_disk was created! Let's check the content:

> select * from nrpe_check_disk
name: nrpe_check_disk
---------------------
time            hostname        metric  service         value
1513084394000000000     remoteserver01    /var    Diskspace /var  3.9845888e+08
1513084395000000000     remoteserver02    /       Diskspace /     2.524971008e+09
1513084395000000000     remoteserver01    /tmp    Diskspace /tmp  1.048576e+06
1513084396000000000     remoteserver02    /var    Diskspace /var  3.8797312e+08
1513084396000000000     remoteserver02    /tmp    Diskspace /tmp  1.048576e+06
1513084451000000000     remoteserver01    /var    Diskspace /var  3.9845888e+08
1513084452000000000     remoteserver02    /       Diskspace /     2.524971008e+09
1513084452000000000     remoteserver01    /tmp    Diskspace /tmp  1.048576e+06
1513084454000000000     remoteserver02    /tmp    Diskspace /tmp  1.048576e+06
1513084454000000000     remoteserver02    /var    Diskspace /var  3.8797312e+08
1513084508000000000     remoteserver01    /var    Diskspace /var  3.9845888e+08
1513084510000000000     remoteserver02    /       Diskspace /     2.524971008e+09
1513084510000000000     remoteserver01    /tmp    Diskspace /tmp  1.048576e+06
1513084512000000000     remoteserver02    /var    Diskspace /var  3.8797312e+08
1513084512000000000     remoteserver02    /tmp    Diskspace /tmp  1.048576e+06

Success! Now I have my own measurement table for this type of remote check. This makes queries much easier than having all the remote NRPE checks in one measurement table.
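
For example, a per-host query against the new measurement can now be run directly from the shell. A sketch using the -database and -execute options of the influx client:

# show a few disk values for one particular host only
influx -database icinga2 -execute "SELECT * FROM nrpe_check_disk WHERE hostname = 'remoteserver01' LIMIT 5"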

 

ElasticSearch stopped to assign shards due to low disk space
Tuesday - Dec 12th 2017 - by - (0 comments)

Our monitoring informed us about a yellow ElasticSearch status on our internal ELK cluster:

ElasticSearch monitoring warning

I manually checked the cluster health and indeed, 80 unassigned shards were shown:

# curl "http://es01.example.com:9200/_cluster/health?pretty&human"
{
  "cluster_name" : "escluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 1031,
  "active_shards" : 1982,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 80,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue" : "0s",
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent" : "96.1%",
  "active_shards_percent_as_number" : 96.12027158098934
}

In the ElasticSearch logs such entries can be found:

[2017-12-11T13:42:30,362][INFO ][o.e.c.r.a.DiskThresholdMonitor] [es01] low disk watermark [85%] exceeded on [t3GAvhY1SS2xZkt4U389jw][es02][/var/lib/elasticsearch/nodes/0] free: 222gb[12.2%], replicas will not be assigned to this node

A quick check on the disk space of the ES01 node revealed that 89% are currently used:

DISK OK - free space: /var/lib/elasticsearch 222407 MB (11% inode=99%):

That's OK for our monitoring (which starts warning at 90%), but ElasticSearch itself also runs its own internal monitoring. If the disk usage is 85% or higher, it stops allocating shards to that node.
From the ElasticSearch documentation:

cluster.routing.allocation.disk.watermark.low
Controls the low watermark for disk usage. It defaults to 85%, meaning ES will not allocate new shards to nodes once they have more than 85% disk used.
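
Before changing anything, it's worth asking ElasticSearch itself how full it considers each data node; the _cat/allocation endpoint shows the disk usage per node as ElasticSearch sees it:

# show disk usage per data node as ElasticSearch sees it
curl -s "http://es01.example.com:9200/_cat/allocation?v"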

As this is quite a big partition of 1.8TB, I decided to increase the watermark.low to 95%. I modified /etc/elasticsearch/elasticsearch.yml and added at the end:

# tail /etc/elasticsearch/elasticsearch.yml
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true

# 20171211 Claudio: Set higher disk threshold (default is 85%)
cluster.routing.allocation.disk.watermark.low: "95%"
# 20171211 Claudio: Do not relocate shards to another node (default is true)
cluster.routing.allocation.disk.include_relocations: false

As you can see, I also set cluster.routing.allocation.disk.include_relocations to false. In our setup we have a two-node Elasticsearch cluster, both nodes with exactly the same size. They should end up with about the same disk usage, so it doesn't make sense for ElasticSearch to start moving shards from one node to the other when almost no disk space is left anymore (i.e. when the watermark.high value is hit).

After the config modifications, Elasticsearch was restarted:

# systemctl restart elasticsearch

It took quite some time for reindexing, but then all shards were assigned again.

Note: This can and should be done online, without having to restart ElasticSearch:

# curl -X PUT "http://es01.example.com:9200/_cluster/settings" -d '{ "transient": { "cluster.routing.allocation.disk.watermark.low": "95%", "cluster.routing.allocation.disk.include_relocations": "false" } }'

 

