
ELK stack not sending notifications anymore because of DNS cache
Tuesday - Nov 20th 2018 - by - (0 comments)

In our primary ELK stack we enabled XPack to send notifications to a Slack channel:

# Slack config for data team
      url: https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXX
        from: watcher

But one day these notifications suddenly stopped. As you can see from the config, the xpack.notification is supposed to connect to https://hooks.slack.com.  

When we checked the firewall logs, we saw that ElasticSearch always connected to the same IP address, yet our DNS resolution check pointed to another (new) IP address. In other words, Slack had changed the public IP for hooks.slack.com, but ElasticSearch, which uses the local Java settings, wasn't aware of that change. This is because, by default, the JVM caches DNS lookups forever when a security manager is set (see DNS cache settings).

To change this, I checked $JAVA_HOME/jre/lib/security/java.security and the defaults were the following:

# The Java-level namelookup cache policy for successful lookups:
# any negative value: caching forever
# any positive value: the number of seconds to cache an address for
# zero: do not cache
# default value is forever (FOREVER). For security reasons, this
# caching is made forever when a security manager is set. When a security
# manager is not set, the default behavior in this implementation
# is to cache for 30 seconds.
# NOTE: setting this to anything other than the default value can have
#       serious security implications. Do not set it unless
#       you are sure you are not exposed to DNS spoofing attack.

# The Java-level namelookup cache policy for failed lookups:
# any negative value: cache forever
# any positive value: the number of seconds to cache negative lookup results
# zero: do not cache
# In some Microsoft Windows networking environments that employ
# the WINS name service in addition to DNS, name service lookups
# that fail may take a noticeably long time to return (approx. 5 seconds).
# For this reason the default caching policy is to maintain these
# results for 10 seconds.

I changed it so that successful lookups are cached for 5 minutes (300s), while failed resolutions are not cached at all:

# grep ttl $JAVA_HOME/jre/lib/security/java.security

After this a restart of ElasticSearch was needed.
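
The two properties behind this change are networkaddress.cache.ttl (successful lookups) and networkaddress.cache.negative.ttl (failed lookups). Below is a minimal sketch of how the edit could be scripted. The helper name set_dns_ttl is hypothetical and not part of any Java or Elasticsearch tooling; it simply rewrites (or appends) the two lines in whatever java.security file you hand it.

```shell
# Hypothetical helper: set the JVM DNS cache TTLs in a java.security file.
# $1 = path to java.security, $2 = TTL for successful lookups (seconds),
# $3 = TTL for failed lookups (seconds)
set_dns_ttl() {
  file="$1"
  ttl="$2"
  negative_ttl="$3"
  for pair in "networkaddress.cache.ttl=$ttl" \
              "networkaddress.cache.negative.ttl=$negative_ttl"; do
    key="${pair%%=*}"
    # Escape the dots so they match literally in the regular expression
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -q "^#*[[:space:]]*${pattern}=" "$file"; then
      # Property exists (possibly commented out): replace the whole line
      tmp=$(mktemp)
      sed "s/^#*[[:space:]]*${pattern}=.*/${pair}/" "$file" > "$tmp" && cat "$tmp" > "$file"
      rm -f "$tmp"
    else
      # Property is missing: append it
      printf '%s\n' "$pair" >> "$file"
    fi
  done
}
```

With the values from this article the call would be: set_dns_ttl "$JAVA_HOME/jre/lib/security/java.security" 300 0. Take a copy of the file first; the sketch does not create a backup.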


Mounting NFS export suddenly does not work anymore - blame systemd
Thursday - Nov 15th 2018 - by - (0 comments)

Mounting NFS exports has been something which "just worked" since... I can't even remember since when. 

But behold the great systemd times, when things which used to work stop working.

root@client:~# mount nfsserver:/export /mnt
Job for rpc-statd.service failed because the control process exited with error code. See "systemctl status rpc-statd.service" and "journalctl -xe" for details.
mount.nfs: rpc.statd is not running but is required for remote locking.
mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
mount.nfs: an incorrect mount option was specified

I came across a similar problem on Unix Stackexchange which solved my problem:

root@client:~# systemctl enable rpcbind.service
Synchronizing state of rpcbind.service with SysV init with /lib/systemd/systemd-sysv-install...
Executing /lib/systemd/systemd-sysv-install enable rpcbind
root@client:~# systemctl start rpcbind.service
root@client:~# systemctl restart rpcbind.service

root@client:~# mount nfsserver:/export /mnt

root@client:~# mount | grep nfs
nfsserver:/export on /mnt type nfs (rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=,mountvers=3,mountport=635,mountproto=udp,local_lock=none,addr=

And yes, the nfs-common package was already installed.


Icinga2-classicui is gone after installing Icinga2 2.10
Wednesday - Nov 14th 2018 - by - (0 comments)

Farewell my beloved icinga2-classicui.

# apt-get upgrade
Removing icinga2-classicui (2.9.0-1.xenial) ...
disabling Apache2 configuration ...
apache2_invoke postrm: Disable configuration icinga2-classicui
(Reading database ... 39711 files and directories currently installed.)
Preparing to unpack .../icinga2_2.10.1-1.xenial_amd64.deb ...
Unpacking icinga2 (2.10.1-1.xenial) over (2.9.0-1.xenial) ...
Preparing to unpack .../icinga2-ido-mysql_2.10.1-1.xenial_amd64.deb ...
Unpacking icinga2-ido-mysql (2.10.1-1.xenial) over (2.9.0-1.xenial) ...
Processing triggers for libc-bin (2.23-0ubuntu10) ...
dpkg: libicinga2: dependency problems, but removing anyway as you requested:
 icinga2-bin depends on libicinga2 (= 2.9.0-1.xenial).

(Reading database ... 39713 files and directories currently installed.)
Removing libicinga2 (2.9.0-1.xenial) ...
(Reading database ... 39711 files and directories currently installed.)
Preparing to unpack .../icinga2-bin_2.10.1-1.xenial_amd64.deb ...
Unpacking icinga2-bin (2.10.1-1.xenial) over (2.9.0-1.xenial) ...
Preparing to unpack .../icinga2-common_2.10.1-1.xenial_all.deb ...
Unpacking icinga2-common (2.10.1-1.xenial) over (2.9.0-1.xenial) ...

# dpkg -l|grep classic
rc  icinga2-classicui     2.9.0-1.xenial     all  host and network monitoring system - classic UI

Icinga 2 Classic UI was removed 

You've probably been the most viewed interface of my browser history in the last few years.

But you're not dead yet as I still need you for SLA calculations...


Creating an InfluxDB asynchronous replication using subscription service
Tuesday - Nov 13th 2018 - by - (0 comments)

Today a full disk corrupted my InfluxDB 0.8 metrics database for Icinga2 monitoring and I was unable to recover the data.
I found quite a few issue reports with the same errors I saw in the logs, but all of them were fixed in more recent versions. Luckily this database is not in production yet, so this is probably (and forcibly) the day to move to a newer InfluxDB version.

InfluxDB 0.8 came from the Ubuntu repositories itself. And it featured a cluster setup! Unfortunately newer versions do not support a cluster setup anymore, unless you buy the license for the Enterprise Edition of InfluxDB. That was the reason why I thought I'd just stay on 0.8. But, as mentioned, a lot of bugfixes and improvements happened since then. 

However, I didn't want to give up on the cluster, or at least on having a standby InfluxDB so that the graphs can still be shown even if the primary monitoring server is down. This is when I came across subscriptions.

According to the documentation, the receiving InfluxDB copies data to known subscribers. Imagine this like a mail server sending a newsletter to registered subscribers (this comparison really helps, doesn't it? You're welcome!). In order to do that, the subscriber service needs to be enabled in the InfluxDB config (usually /etc/influxdb/influxdb.conf):

[subscriber]
  # Determines whether the subscriber service is enabled.
  enabled = true

Restart InfluxDB after this setting change.

On the master server you need to define the known subscribers:

root@influx01:/# influx
Connected to http://localhost:8086 version 1.6.4
InfluxDB shell version: 1.6.4
> CREATE SUBSCRIPTION "icinga-replication" ON "icinga"."autogen" DESTINATIONS ALL ''

So what does it do?

Obviously a new subscription with the unique ID "icinga-replication" is created. It covers the database "icinga" and its retention policy "autogen".
As destination, transport over HTTP was chosen; the endpoint is InfluxDB running on host influx02.

By showing the subscriptions (SHOW SUBSCRIPTIONS), this can be confirmed:

name: icinga
retention_policy name               mode destinations
---------------- ----               ---- ------------
autogen          icinga-replication ALL  []

From now on every record written into the influx01 instance is copied to influx02 and will show up with entries like this in the log on influx02:

influx02 influxd[5045]: [httpd] - icinga [13/Nov/2018:14:23:41 +0100] "POST /write?consistency=&db=icinga&precision=ns&rp=autogen HTTP/1.1" 204 0 "-" "InfluxDBClient" 564b790a-e747-11e8-8b0c-000000000000 13708

But what about authentication?

When authentication is enabled (which should always be the case on a database), the following error message appears in the log file on the master:

influx01 influxd[7232]: ts=2018-11-13T10:18:11.142719Z lvl=info msg="{\"error\":\"unable to parse authentication credentials\"}\n" log_id=0Bk8uDzG000 service=subscriber

In this case, the subscription needs to be added with credentials in the URL string:

> CREATE SUBSCRIPTION "icinga-replication" ON "icinga"."autogen" DESTINATIONS ALL 'http://dbuser:password@'

Note: I used the same subscription ID again (icinga-replication). In order to do so, the existing subscription must be removed:

> DROP SUBSCRIPTION "icinga-replication" ON "icinga"."autogen"
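
The drop-and-recreate step can be sketched as a small generator for the two InfluxQL statements. The function name subscription_statements is a hypothetical helper of mine, and the URL in the usage note is a made-up placeholder, not the endpoint used in this setup.

```shell
# Hypothetical helper: print the InfluxQL statements that drop and
# re-create a subscription.
# $1 = subscription name, $2 = database, $3 = retention policy,
# $4 = destination URL (may contain user:password@ for authentication)
subscription_statements() {
  name="$1"
  db="$2"
  rp="$3"
  url="$4"
  # Remove the existing subscription first, the name must be unique
  printf 'DROP SUBSCRIPTION "%s" ON "%s"."%s"\n' "$name" "$db" "$rp"
  # Re-create it with the (new) destination
  printf 'CREATE SUBSCRIPTION "%s" ON "%s"."%s" DESTINATIONS ALL '\''%s'\''\n' \
    "$name" "$db" "$rp" "$url"
}
```

For example, subscription_statements icinga-replication icinga autogen 'http://dbuser:password@influx02.example.com:8086' prints both statements, which can then be pasted into the influx shell. If the subscription does not exist yet, expect the DROP statement to fail; in that case only the CREATE line is needed.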

And what about data prior to the subscription "replication"?

The subscription service only copies incoming data. This means that older data is not copied over to the subscribers. In this case you need to transfer the data using dump/restore or even by syncing the data (/var/lib/influxdb/data) and wal (/var/lib/influxdb/wal) directories.
In my case I stopped InfluxDB on both influx01 and influx02, rsynced the content of /var/lib/influxdb/data and /var/lib/influxdb/wal from influx01 to influx02 and then started InfluxDB again. First on host influx02, then on influx01. Now data is in sync (until a network issue or similar happens).

As this graphing database is not production critical I can live with that situation.


MariaDB Galera Cluster: When the data is in sync but the query results differ
Friday - Nov 9th 2018 - by - (0 comments)

On Fridays there's always a bit more time for the curious cases which require some deeper research.

A developer contacted me with the suspicion that the data inside a clustered MariaDB (using Galera) was not in sync. Sometimes a certain query returned X results, sometimes Y results.

Technically this sounded plausible because the application connects to a HAProxy listening on port 3306. The requests are then balanced across multiple Galera nodes in round-robin style. So when the query lands on node1, X results are shown. But when the query lands on node2, Y results are shown. That could indeed be a data sync problem.

But it wasn't. The data across the cluster was exactly the same on all nodes. The same amount of rows inside the tables, the same data.

On node1:

MariaDB [app]> select count(*) from docs;
| count(*) |
|     3916 |
1 row in set (0.00 sec)

MariaDB [app]> select count(*) from pages;
| count(*) |
|      446 |
1 row in set (0.01 sec)

On node2:

MariaDB [app]> select count(*) from docs;
| count(*) |
|     3916 |
1 row in set (0.00 sec)

MariaDB [app]> select count(*) from pages;
| count(*) |
|      446 |
1 row in set (0.01 sec)

Even a dump created on both nodes showed the exact same size:

root@node1:/tmp# mysqldump app > /tmp/app.sql
root@node1:/tmp# du -ks /tmp/app.sql
3416    /tmp/app.sql

root@node2:/tmp# mysqldump app > /tmp/app.sql
root@node2:/tmp# du -ks /tmp/app.sql
3416    /tmp/app.sql
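
Strictly speaking, an identical size does not prove identical content. A slightly stronger check is to compare checksums of the two dumps. This is a minimal sketch; same_content is a hypothetical helper name, and it assumes both dump files have been copied to the same machine. When the nodes run different server versions, the comment header of the dump differs, so the dumps would have to be created with mysqldump --skip-comments for this comparison to be meaningful.

```shell
# Hypothetical helper: succeed if two files have the same checksum.
# Returns 0 if the content matches, 1 if it differs, 2 if a file is unreadable.
same_content() {
  # cksum on stdin prints "<checksum> <size>" without a file name
  sum_a=$(cksum < "$1") || return 2
  sum_b=$(cksum < "$2") || return 2
  [ "$sum_a" = "$sum_b" ]
}
```

Usage would look like: same_content /tmp/app-node1.sql /tmp/app-node2.sql && echo "dumps are identical" (both file names are placeholders). Rows may also come out in a different order on each node, so a mismatch here is a reason to look closer, not proof of corruption.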

Could the two nodes interpret the query differently? The query itself is indeed a bit special, joining two tables together and using several internal MySQL functions:

SELECT DISTINCT d.*, p.page, p.pagestatus, DATE_FORMAT(creationdate,"%d.%m.%Y %H:%i:%s") as dcreationdate, DATE_FORMAT(updatedate,"%d.%m.%Y %H:%i:%s") as dupdatedate, DATE_FORMAT(lastpubdate,"%d.%m.%Y %H:%i:%s") as dlastpubdate, DATE_FORMAT(modified,"%d.%m.%Y %H:%i:%s") as dmodified FROM docs d LEFT JOIN (SELECT * FROM pages WHERE id IN (SELECT MAX(id) FROM pages GROUP BY wwid)) p ON (p.wwid = d.wwid) WHERE d.id in (SELECT MAX(id) FROM docs WHERE channelid = 2 GROUP BY ldid) AND d.printpublicationdate = '2018-06-22' ORDER BY updatedate DESC;

Agreed, not your typical select query... And this query did indeed show two different result sets from node1 and node2:

node1: 62 rows in set (0.02 sec)
node2: 82 rows in set (0.01 sec)

I checked for differences on the two nodes and indeed found a different patch level of MariaDB: node1 had 10.0.33 installed, node2 was running a newer version 10.0.36. Could this really be the source of the difference?

As this problem was identified on a staging server, I updated all cluster nodes to the current MariaDB patch level 10.0.37 (I wouldn't have done that on a Friday afternoon on PROD, no way!).
And by "magic", both nodes reported the same results once they ran the same version:

node1: 82 rows in set (0.02 sec)
node2: 82 rows in set (0.01 sec)

There were two important lessons learned here:

1) The data was still in sync and therefore not corrupted.

2) Always make sure a Galera cluster uses the same version across all nodes.

I'm not sure which exact version (10.0.34, 35 or 36) fixed this, but it could be related to bug MDEV-15035 which was fixed in 10.0.35.
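
The second lesson is easy to turn into a recurring check. Here is a minimal sketch, assuming the version of each node is collected as "node version" pairs; check_same_version is a hypothetical helper name, and the mysql call in the comment is just one way to gather the input.

```shell
# Hypothetical helper: read "node version" pairs on stdin and fail
# if any node reports a different version than the first one.
#
# Possible way to feed it (host names are placeholders):
#   for n in node1 node2; do
#     printf '%s %s\n' "$n" "$(mysql -h "$n" -NBe 'SELECT VERSION()')"
#   done | check_same_version
check_same_version() {
  awk '
    NR == 1 { first = $2 }          # remember the version of the first node
    $2 != first {                   # report every node that deviates
      printf "version mismatch: %s runs %s, expected %s\n", $1, $2, first
      bad = 1
    }
    END { exit (bad ? 1 : 0) }      # non-zero exit code on any mismatch
  '
}
```

With the versions from this article, the input "node1 10.0.33" and "node2 10.0.36" would report node2 as a mismatch and return a non-zero exit code, which makes the function usable in a monitoring check or a deployment script.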


Monitoring plugin check_rancher2 1.0.0 available, leaving beta phase
Friday - Nov 9th 2018 - by - (0 comments)

This week the Open Source Monitoring Conference (OSMC) took place in Nuremberg, Germany. I was invited as a first-time speaker and in my presentation ("It's all about the... containers!") I introduced check_rancher2.


As the word is now officially out, let's leave the beta phase and call it version 1.0.0. I've been running the plugin successfully for the last couple of weeks in both Staging and Production environments and it already helped a lot.

As always, feedback and contributions are welcome. The plugin is developed in the public repo at https://github.com/Napsty/check_rancher2.


Different proxy_pass upstream depending on client ip address in Nginx
Friday - Oct 26th 2018 - by - (0 comments)

Sometimes there is a special situation when you need to show a different website to certain website users. Whether the clients come from internal IPs, from specific countries (using GeoIP lookups) or are bots identified by their user agent, there are many use cases for such a need. This article shows the configuration in an Nginx web server, used here as a reverse proxy.

In this scenario certain client IPs needed to be (reverse) proxied to a different backend server (upstream) than the default server on which the "normal" web application is running.

The first idea how to implement the solution is (most likely) an if condition like this:

  location / {

    include /etc/nginx/proxy.conf;
    proxy_set_header X-Forwarded-Proto https;

    if ($remote_addr ~ "(|(") {
      proxy_pass https://different-upstream.example.com;
    }

    proxy_pass http://default-upstream.example.com:8080;

  }

The above config checks for the internal client IP addresses (saved in the global variable $remote_addr). For requests from these clients, the reverse proxy upstream is set to https://different-upstream.example.com. For all other clients, the default upstream (http://default-upstream.example.com:8080) is used.

Although this solution works, it is technically not advised for several reasons:

  • There is no else condition which clearly identifies an alternative action. Just leaving the second proxy_pass after the if condition becomes the "default".
  • If is evil. If you're an experienced Nginx administrator, you know about that already!

So if you care about a well working Nginx and a better solution, take a look at the http_geo_module and the http_map_module. Their purpose is to create a map based on a condition with a defined target. This sounds complicated, but it actually isn't. The following example will show you that.

First, above your server { } configuration, define the upstreams:

# Upstream definitions
upstream default {
  server default-upstream.example.com:8080;
}

upstream different {
  server different-upstream.example.com:443;
}

Note: The "different" upstream uses https. It is mandatory to define port 443 in this case; otherwise the default port 80 is used.

Now comes the magic: The geo-map itself:

geo $remote_addr $backend {
  default http://default;
  https://different;
  https://different;
}

The map config explained:

  • geo: Tells Nginx to create a new "geo-map"...
  • $remote_addr: ... based on the client ip address ($remote_addr)...
  • $backend: ... and save the following value into the new variable $backend
  • default: This is a reserved keyword and specifies the default entry if nothing of the map is matched (in this scenario this means all client ip addresses which are not listed)

As you can see inside the map itself, default points to value "http://default;" which refers to the upstream called "default".
On the other hand, the entries with the two internal ip addresses point to value "https://different;", which, of course, is the upstream called "different".
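
To keep the list of special client addresses easy to maintain, the geo block can also be generated instead of edited by hand. This is a minimal sketch: print_geo_map is a hypothetical helper name, and the addresses used in the usage note (192.0.2.10 and 198.51.100.0/24) are documentation examples, not the internal addresses of this setup. The upstream names "default" and "different" are the ones defined above.

```shell
# Hypothetical helper: print an Nginx geo block which sends the given
# client addresses or ranges to the "different" upstream and everyone
# else to the "default" upstream.
# Arguments: one or more IP addresses or CIDR ranges
print_geo_map() {
  printf 'geo $remote_addr $backend {\n'
  printf '  default http://default;\n'
  for net in "$@"; do
    # One line per address or range, all pointing to the same upstream
    printf '  %s https://different;\n' "$net"
  done
  printf '}\n'
}
```

Calling print_geo_map 192.0.2.10 198.51.100.0/24 prints a complete geo block with one entry per address or range. The output could be written to a file included from the http { } context and validated with nginx -t before reloading.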

Within the server { } configuration the map can be called, for example inside the location /:

server {

  location / {

    include /etc/nginx/proxy.conf;
    proxy_set_header X-Forwarded-Proto https;

    proxy_pass $backend;

  }

}

Nginx now needs to access the $backend variable to determine the proxy_pass upstream. This calls the "map" entry from before which defines the wanted upstream server.

Note: It might also work to just define the full upstream URL inside the map, without having to define the upstreams first. But I didn't try that.

TL;DR: There's always a way around if, usually using a map. It's not that difficult to use, is easier to maintain than to add ip addresses to the if condition and most importantly ensures a well working Nginx config!

Update October 30th 2018: While fixed IP addresses worked with the "map" module, IP ranges did not work. For this reason I switched to the "geo" map, which was made for working with IP addresses and ranges.


Permission denied when writing on NTFS mount, even as root
Thursday - Oct 18th 2018 - by - (0 comments)

To create an offsite backup, I plugged an external hard drive via USB into my home NAS server. The external hdd has one partition and is formatted with NTFS (to allow creating backups from Windows hosts, too).

I mounted the partition to /mnt2 and wanted to sync the data from NAS, but it failed:

# rsync -rtuP /mnt/data/Movies/ /mnt2/Movies/
sending incremental file list
  1,956,669,762 100%  108.73MB/s    0:00:17 (xfr#1, to-chk=1040/1042)
    436,338,688  20%  104.03MB/s    0:00:16  ^C
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(638) [sender=3.1.1]
rsync: mkstemp "/mnt2/Movies/.Test1.mp4.mwuEtR" failed: Permission denied (13)
rsync: mkstemp "/mnt2/Movies/.Test2.mp4.ialsVx" failed: Permission denied (13)
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at io.c(504) [generator=3.1.1]

The permissions looked correct; at least root should have been able to write:

# ls -l /mnt2
total 20M
drwx------ 1 root root    0 Jan  2  2018 Family
drwsr-sr-x 1 root root 232K Oct 14 20:00 Movies
drwx------ 1 root root  24K Dec 30  2017 Movies-Kids
drwx------ 1 root root    0 May  6  2017 Pictures

But when I tried to manually create a file, permission denied again:

# touch /mnt2/bla
touch: cannot touch ‘bla’: Permission denied

I checked dmesg and saw the following:

[12867539.697380] ntfs: (device sde1): ntfs_setattr(): Changes in user/group/mode are not supported yet, ignoring.
[12867539.697386] ntfs: (device sde1): ntfs_setattr(): Changes in user/group/mode are not supported yet, ignoring.
[12867539.697392] ntfs: (device sde1): ntfs_setattr(): Changes in user/group/mode are not supported yet, ignoring.

I checked how the partition was mounted:

# mount | grep sde1
/dev/sde1 on /mnt2 type ntfs (rw,relatime,uid=0,gid=0,fmask=0177,dmask=077,nls=utf8,errors=continue,mft_zone_multiplier=1)

"rw is there so it should work", would be my first guess. But I remembered that NTFS mounts are a little bit special on Linux.

In order to "really" mount an NTFS drive and write to it, one needs the ntfs-3g package, which uses fuse in the background.
Note: I wrote a similar article but for MAC OS X back in 2011: How to read and write an NTFS external disk on a MAC OS X.

I installed the package which installed fuse as a dependency:

# apt-get install ntfs-3g
Reading package lists... Done
Building dependency tree      
Reading state information... Done
The following extra packages will be installed:
The following NEW packages will be installed:
  fuse ntfs-3g

Now I just needed to unmount the external hdd and mount it with ntfs-3g:

# mount -t ntfs-3g /dev/sde1 /mnt2

Checking mount again, the partition is now mounted as type fuseblk:

# mount | grep sde1
/dev/sde1 on /mnt2 type fuseblk (rw,relatime,user_id=0,group_id=0,allow_other,blksize=4096)

And voilà, I can now write to the NTFS partition:

# touch /mnt2/bla && stat /mnt2/bla
  File: ‘/mnt2/bla’
  Size: 0             Blocks: 0          IO Block: 4096   regular empty file
Device: 841h/2113d    Inode: 16406       Links: 1
Access: (0777/-rwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2018-10-18 20:29:45.914775500 +0200
Modify: 2018-10-18 20:29:45.914775500 +0200
Change: 2018-10-18 20:29:45.914775500 +0200
 Birth: -
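
As seen above, the "rw" flag alone says nothing about whether writing will actually work; the filesystem type in the mount output is the better hint (ntfs for the kernel driver that refused to create files here, fuseblk for ntfs-3g). A minimal sketch to extract that type, with mount_fstype being a hypothetical helper name:

```shell
# Hypothetical helper: read `mount` output on stdin and print the
# filesystem type of the given mount point.
# $1 = mount point, e.g. /mnt2
mount_fstype() {
  # mount prints: <device> on <mount point> type <fstype> (<options>)
  awk -v mp="$1" '$2 == "on" && $3 == mp && $4 == "type" { print $5 }'
}
```

Used as mount | mount_fstype /mnt2, this prints ntfs before the remount and fuseblk afterwards, which could serve as a guard in a backup script before rsync is started.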


check_esxi_hardware now supports python3!
Tuesday - Oct 2nd 2018 - by - (0 comments)

It has been a long time since python3 was released, yet the monitoring plugin check_esxi_hardware was not compatible with python3. Until yesterday. 

The initial reason for the delay (see issue #13) was the python module pywbem, which at first was only a module for python2. Since a new team took over the maintenance of pywbem, there was life yet again in pywbem and it was also ported to python3.

The second reason for the delay was: life. I had already prepared a python3-compatible version a while ago; it just needed some more fine-tuning and testing. This is now completed, and check_esxi_hardware works on both python2 and python3. Same code, same plugin. It was very important to me to still be able to run the new version of the plugin in whatever environment.


Grub2 install fails on USB drive with error: appears to contain a ufs1 filesystem
Friday - Sep 28th 2018 - by - (0 comments)

In my previous article (How to compare speed of USB flash pen drives) I briefly mentioned I had to reinstall the OS of my NAS-server on a USB flash drive. When I did so, the last step in the Debian installer (install grub2) failed, but without a clear error message. Because I was in a hurry back then, I installed LILO as bootloader. This worked and the NAS booted correctly.

Now it was time to investigate and on the running Debian OS I tried to install grub2:

root@nas:~# apt-get install grub2
Reading package lists... Done
Building dependency tree      
Reading state information... Done
The following NEW packages will be installed:
0 upgraded, 1 newly installed, 0 to remove and 6 not upgraded.
Need to get 2,476 B of archives.
After this operation, 16.4 kB of additional disk space will be used.
Get:1 http://ftp.ch.debian.org/debian stretch/main amd64 grub2 amd64 2.02~beta3-5 [2,476 B]
Fetched 2,476 B in 0s (38.7 kB/s)
Selecting previously unselected package grub2.
(Reading database ... 28456 files and directories currently installed.)
Preparing to unpack .../grub2_2.02~beta3-5_amd64.deb ...
Unpacking grub2 (2.02~beta3-5) ...
Setting up grub2 (2.02~beta3-5) ...

So the installation of the package itself worked. What about the grub install?

root@nas:~# grub-install /dev/sde
Installing for i386-pc platform.
grub-install: error: hostdisk//dev/sde appears to contain a ufs1 filesystem which isn't known to reserve space for DOS-style boot.  Installing GRUB there could result in FILESYSTEM DESTRUCTION if valuable data is overwritten by grub-setup (--skip-fs-probe disables this check, use at your own risk).

When I was searching for this error I came across a Linux Mint bug (grub-install fails on drive that previously had ufs2 installed). This matches the USB drive I'm using in the NAS server, which was previously used as a simple USB pen drive. So I tried to use the --skip-fs-probe parameter:

root@nas:~# grub-install /dev/sde --skip-fs-probe
Installing for i386-pc platform.
grub-install: warning: Attempting to install GRUB to a disk with multiple partition labels.  This is not supported yet..
grub-install: warning: Embedding is not possible.  GRUB can only be installed in this setup by using blocklists.  However, blocklists are UNRELIABLE and their use is discouraged..
grub-install: error: will not proceed with blocklists.

Now I was at the same point as in the mentioned bug report.

I decided to wipe sectors 1 to 2047 of the flash drive (seek=1 skips sector 0, which holds the MBR and the partition table):

root@nas:~# dd if=/dev/zero of=/dev/sde bs=512 seek=1 count=2047
2047+0 records in
2047+0 records out
1048064 bytes (1.0 MB, 1.0 MiB) copied, 0.111457 s, 9.4 MB/s
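
The numbers in that dd call are worth a quick sanity check. This is a minimal sketch of the arithmetic; dd_range is a hypothetical helper which only calculates and never touches a disk.

```shell
# Hypothetical helper: calculate which byte range a
# "dd bs=<bs> seek=<seek> count=<count>" call overwrites on the target.
# $1 = block size in bytes, $2 = seek (blocks skipped), $3 = count (blocks written)
dd_range() {
  bs="$1"
  seek="$2"
  count="$3"
  first=$(( bs * seek ))          # first byte which gets overwritten
  total=$(( bs * count ))         # number of bytes written
  last=$(( first + total - 1 ))   # last byte which gets overwritten
  printf 'first=%s last=%s total=%s\n' "$first" "$last" "$total"
}
```

dd_range 512 1 2047 prints first=512 last=1048575 total=1048064. The total matches the byte count reported by dd, and the wiped range ends right before the 1 MiB boundary (sector 2048), which is where partitioning tools commonly place the first partition.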

Now there shouldn't be anything left to cause a grub-install hiccup. Let's try it again:

root@nas:~# grub-install /dev/sde
Installing for i386-pc platform.
Installation finished. No error reported.

Hurray! Oh and wow, the NAS boots so much faster with grub2 than with LILO (haven't used LILO since 2005...)

