Header RSS Feed
If you only want to see the articles of a certain category, please click on the desired category below:
ALL Android Backup BSD Database Hacks Hardware Internet Linux Mail MySQL Monitoring Network Personal PHP Proxy Shell Solaris Unix Virtualization VMware Windows Wyse

Regex != regex in sed (or: replacing digits in sed)
Friday - Dec 14th 2018 - by - (0 comments)

This is supposed to be a quick reminder to myself, the next time I run into such a problem: regular expressions are not exactly the same in sed!

On my previous article "How to manually clean up Zoneminder events" I wrote a shell script in which I wanted to remove a certain part of a path:


should become:


Simple, right? Just use sed replace and remove ".448512/" out of the string.

But see for yourself:

$ echo "/var/cache/zoneminder/events/5/18/12/14/.448512/06/45/12" | sed "s/\.\d+\///g"

The old path is still shown. Nothing was replaced. My first thought was of course that I've made a mistake in my regular expression, but on all the regex checkers online confirmed my regex was correct. For example on https://regexr.com/:

Regex match dot and digit 

I was able to break it down that it must have something to do with the regular expression for the number (\d+) because simply replacing the dot character works:

$ echo "/var/cache/zoneminder/events/5/18/12/14/.448512/06/45/12" | sed "s/\.//g"

And then I received the final hint from a friend: Some typical regex don't work in sed! Excerpt from sed's documentation:

*    Matches a sequence of zero or more instances of matches for the preceding regular expression, which must be an ordinary character, a special character preceded by \, a ., a grouped regexp (see below), or a bracket expression. As a GNU extension, a postfixed regular expression can also be followed by *; for example, a** is equivalent to a*. POSIX 1003.1-2001 says that * stands for itself when it appears at the start of a regular expression or subexpression, but many nonGNU implementations do not support this and portable scripts should instead use \* in these contexts.

\+   As *, but matches one or more. It is a GNU extension. 


‘[a-zA-Z0-9]’  In the C locale, this matches any ASCII letters or digits.

So first of all the plus-sign (+) must be escaped. And second to match a digit, \d doesn't work, it must be used in [0-9] style!

With these adjustments, sed now finally does the replace part:

$ echo "/var/cache/zoneminder/events/5/18/12/14/.448512/06/45/12" | sed "s/\.[0-9]\+\///g"

Dang it, I am sure that I ran into this at least once already in my Linux career. Hence this post to not lose much time the next time this happens again.


How to manually clean up Zoneminder events
Friday - Dec 14th 2018 - by - (0 comments)

Zoneminder is a great tool to build a surveillance system, combining all kinds of ip cameras in one dashboard and use it to manage recordings. 

But sometimes Zoneminder can be a bit of a pain, especially when the disk is getting filled. With the high resolutions of todays IP cameras this can happen pretty quickly. Although Zoneminder has an internal "filter" to automatically purge old events when the disk threshold hits a certain limit.

Zoneminder Purge Filter 

However this filter only works if there's actually still some space left available. The filter searches the oldest N events in the database, deleting them from the database and also on the filesystem. But when there's no disk space available at all, the database is likely to be frozen/unavailable. Ergo no clean up anymore. And in this situation you're stuck with a non-working Zoneminder.

This happened to me twice already on my Zoneminder installation (side note: I have to admit my current disk size of 120GB dedicated for Zoneminder is rather small) so I built a clean up script and this is what this post is about.

Step one: We don't want to delete the archived events!

When you archive a footage, this usually means you want to keep it. The cleanup script needs to respect that. But it needs to know about the archived events first. This can be done by getting the relevant information from the database:

root@zoneminder:~# mysql -N -b -r -e "select Id from zm.Events where Archived = 1;"
| 135933 |
| 136590 |
| 154831 |
| 160832 |
| 162647 |
| 162649 |
| 167562 |

Step two: Find all events except the archived ones

We can use the find command and the ID's from step one to find all events except the archived ones:

root@zoneminder:~# find /var/cache/zoneminder/events/ -mindepth 2 -type l ! -name ".135933" ! -name ".136590" ! -name ".154831" ! -name ".160832" ! -name ".162647" ! -name ".162649" ! -name ".167562" -exec ls -la {} + > /tmp/xxx

find will now look in the path /var/cache/zoneminder/events/ for symlinks (type l), except for the given names (excluded with exclamation mark). The output of the full path and other information will be saved in /tmp/xxx.

The output file /tmp/xxx will now look something like that:

root@zoneminder:~# tail /tmp/xxx
lrwxrwxrwx 1 www-data www-data 8 Dec 14 06:45 /var/cache/zoneminder/events/5/18/12/14/.448512 -> 06/45/12
lrwxrwxrwx 1 www-data www-data 8 Dec 14 06:51 /var/cache/zoneminder/events/5/18/12/14/.448517 -> 06/51/29
lrwxrwxrwx 1 www-data www-data 8 Dec 14 06:51 /var/cache/zoneminder/events/5/18/12/14/.448518 -> 06/51/34
lrwxrwxrwx 1 www-data www-data 8 Dec 14 07:02 /var/cache/zoneminder/events/5/18/12/14/.448533 -> 07/02/28
lrwxrwxrwx 1 www-data www-data 8 Dec 14 07:44 /var/cache/zoneminder/events/5/18/12/14/.448546 -> 07/44/56
lrwxrwxrwx 1 www-data www-data 8 Dec 14 07:47 /var/cache/zoneminder/events/5/18/12/14/.448548 -> 07/47/09
lrwxrwxrwx 1 www-data www-data 8 Dec 14 08:22 /var/cache/zoneminder/events/5/18/12/14/.448551 -> 08/22/17
lrwxrwxrwx 1 www-data www-data 8 Dec 14 08:26 /var/cache/zoneminder/events/5/18/12/14/.448552 -> 08/26/13
lrwxrwxrwx 1 www-data www-data 8 Dec 14 08:27 /var/cache/zoneminder/events/5/18/12/14/.448555 -> 08/27/30
lrwxrwxrwx 1 www-data www-data 8 Dec 14 08:28 /var/cache/zoneminder/events/5/18/12/14/.448557 -> 08/28/19

Step three: Get the event id and real path

Each path in /tmp/xxx contains two important information: The event id and the real path.

/var/cache/zoneminder/events/5/18/12/14/.448512 -> 06/45/12

In this case .448512 is the symlink of the event pointing to the subfolders 06/45/12.
The name of the symlink also contains the event id (448512).
By removing the symlink and adding the subfolders into the path, we get the real path where the footage is stored:


Step four: Delete the footage and the info in the database

Now that the real path is known, it can be deleted:

root@zoneminder:~# rm -rf /var/cache/zoneminder/events/5/18/12/14/06/45/12

We should also delete the symlink:

root@zoneminder:~# rm -f /var/cache/zoneminder/events/5/18/12/14/.448512

And now that there's some disk space available again, the MySQL database should accept writes again. Therefore we can delete this event (448512) from the relevant tables:

mysql> DELETE FROM zm.Events where Id = 448512;
mysql> DELETE FROM zm.Frames where EventId = 448512;
mysql> DELETE FROM zm.Stats where EventId = 448512;

Step five: Automate it with a script

As I mentioned at the beginning, I wrote a script to automate these tasks. It's called zoneminder-event-cleanup.sh and you can download it here:

Using the script is very simple.

1) Download it

# wget https://www.claudiokuenzler.com/downloads/scripts/zoneminder-event-cleanup.sh

2) Give execute permissions:

# chmod 755 zoneminder-event-cleanup.sh

3) Open the script with an editor and adjust the user variables:

# User variables
olderthan=2 # Defines the minimum age in days of the events to be deleted
zmcache=/var/cache/zoneminder/events # Defines the path where zm stores events
mysqlhost=localhost # Defines the MySQL host for the zm database
mysqldb=zm # Defines the MySQL database name used by zm
mysqluser=zmuser # Defines a MySQL user to connect to the database
mysqlpass=secret # Defines the password for the MySQL user

4) Run the script

root@zoneminder:~# ./zoneminder-cleanup.sh
Deleting 447900
Deleting 447901
Deleting 447902
Deleting 447903
Deleting 447904
Deleting 447905
Deleting 447906
Deleting 447907
Deleting 447908
Deleting 447909
Deleting 447911
Deleting 447912
Deleting 447913
Deleting 447914
Deleting 447915

The script is also available on Github: https://github.com/Napsty/scripts/blob/master/zoneminder/zoneminder-event-cleanup.sh


Monitoring plugin check_lxc 0.6.3 released
Wednesday - Dec 5th 2018 - by - (0 comments)

The monitoring plugin check_lxc, which allows you to monitor resource usage of LXC containers, received a few updates in the past days.

The current up to date version is 0.6.3 and contains the following changes:

- Swap check fixed (https://github.com/Napsty/check_lxc/pull/4)

- Autostart check adjusted to allow both 'YES' and 1 (https://github.com/Napsty/check_lxc/pull/5)

- Return an unknown state and informative output when lxc-cgroup doesn't return any data (https://github.com/Napsty/check_lxc/issues/9)

- General bash code improvements (https://github.com/Napsty/check_lxc/pull/6)

Thanks a lot to contributor Julien for the PR's!


Logstash gelf output plugin gone after udpate
Wednesday - Nov 28th 2018 - by - (0 comments)

Today I was informed that some log data doesn't appear anymore in our ELK stack since yesterday afternoon. What changed yesterday afternoon? This particular log collector machien running Logstash was updated.

# zgrep logstash /var/log/apt/history.log.1.gz -B 3
Start-Date: 2018-11-27  15:52:08
Commandline: /usr/bin/apt-get -y -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold dist-upgrade
Install: [...]
Upgrade: [...] logstash:amd64 (5.6.10-1, 5.6.13-1), [...]

By checking the Logstash logs, the reason was pretty well explained:

[2018-11-28T08:13:51,925][ERROR][logstash.plugins.registry] Problems loading a plugin with {:type=>"output", :name=>"gelf", :path=>"logstash/outputs/gelf", :error_message=>"NameError", :error_class=>NameError, :error_backtrace=>["/usr/share/logstash/logstash-core/lib/logstash/plugins/registry.rb:226:in `namespace_lookup'", "/usr/share/logstash/logstash-core/lib/logstash/plugins/registry.rb:162:in `legacy_lookup'", "/usr/share/logstash/logstash-core/lib/logstash/plugins/registry.rb:138:in `lookup'", "/usr/share/logstash/logstash-core/lib/logstash/plugins/registry.rb:180:in `lookup_pipeline_plugin'", "/usr/share/logstash/logstash-core/lib/logstash/plugin.rb:140:in `lookup'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:103:in `plugin'", "(eval):12:in `initialize'", "org/jruby/RubyKernel.java:1079:in `eval'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:75:in `initialize'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:165:in `initialize'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:296:in `create_pipeline'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:95:in `register_pipeline'", "/usr/share/logstash/logstash-core/lib/logstash/runner.rb:313:in `execute'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:67:in `run'", "/usr/share/logstash/logstash-core/lib/logstash/runner.rb:204:in `run'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:132:in `run'", "/usr/share/logstash/lib/bootstrap/environment.rb:71:in `(root)'"]}
[2018-11-28T08:13:51,955][ERROR][logstash.agent           ] Cannot create pipeline {:reason=>"Couldn't find any output plugin named 'gelf'. Are you sure this is correct? Trying to load the gelf output plugin resulted in this error: Problems loading the requested plugin named gelf of type output. Error: NameError NameError"}

The important part:

:reason=>"Couldn't find any output plugin named 'gelf'.

Let's check the currently installed Logstash plugins:

# /usr/share/logstash/bin/logstash-plugin list --verbose | grep gelf
logstash-input-gelf (3.1.1)

Hmm.. Only the input-gelf plugin is shown, but not the output-gelf plugin. Let's install it manually then:

# /usr/share/logstash/bin/logstash-plugin install logstash-output-gelf
Validating logstash-output-gelf
Installing logstash-output-gelf
Installation successful

A Logstash restart is not needed after this. The logs happily showed up in Kibana again.

Gelf logs reappear in Kibana 


How to create a new admin user in a clustered CouchDB
Wednesday - Nov 21st 2018 - by - (0 comments)

CouchDB comes with a nice and (mostly) intuitive user interface called Fauxton. It can be accessed when adding "_utils" to the CouchDB URL, for example https://mycouchdb.example.com/_utils.

For a standalone CouchDB you can check the whole config, create new admin users, etc in the UI, however when you run a clustered CouchDB, these pages do not open:

CouchDB Config Disabled in Cluster

Config disabled: It seems that you are running a cluster with 2 nodes. For CouchDB 2.0 we recommend using a configuration management tools like Chef, Ansible, Puppet or Salt (in no particular order) to configure your nodes in a cluster.
We highly recommend against configuring nodes in your cluster using the HTTP API and suggest using a configuration management tool for all configurations.

Technically speaking this "blockage" makes sense because it prevents you to make wrong configuration changes in the wrong places.

Note: With some trickery in the URI you can still access the config on a per-node-level. For example: http://localhost:5984/_utils/#_config/couchdb@couchdb01.example.com will show the config for the cluster node couchdb01.example.com. But use with caution!!

In order to create a new server admin user in the cluster, one needs to:

1) Figure out the cluster members

2) Add this new server admin to each cluster member

Finding the members of the cluster is pretty easy, just go to /_membership on your CouchDB:

$ curl -s "http://localhost:5984/_membership" -u admin:XXX

This means we need to add the new server admin into the nodes couchdb@couchdb01.example.com and couchdb@couchdb02.example.com.

To create the new server admin, we can use curl with the PUT method to create the new user. Important here is the URI which contains the cluster node's local configuration:

$ curl -s -X PUT "http://localhost:5984/_node/couchdb@couchdb01.example.com/_config/admins/newadminuser" -d '"secret"' -u admin:XXX

This creates a new local server admin called "newadminuser" on node "couchdb@couchdb01.example.com" with a password "secret".

The user should now appear in the list of admin users:

$ curl -s "http://localhost:5984/_node/couchdb@couchdb01.example.com/_config/admins/" -u admin:XXX

All server admins are now shown. The newly created "newadminuser" and the default "admin" user which I used to create the new user.

Now don't forget to create the new user on the other cluster node(s), too. This data will not be synced in the cluster as it is part of the local (node) configuration.

$ curl -s -X PUT "http://localhost:5984/_node/couchdb@couchdb02.example.com/_config/admins/newadminuser" -d '"secret"' -u admin:XXX

Now your new server admin can be used on the cluster.


ELK stack not sending notifications anymore because of DNS cache
Tuesday - Nov 20th 2018 - by - (0 comments)

In our primary ELK stack we enabled XPack to send notifications to a Slack channel:

# Slack config for data team
      url: https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXX
        from: watcher

But one day these notifications suddenly stopped. As you can see from the config, the xpack.notification is supposed to connect to https://hooks.slack.com.  

When we checked the firewal logs we saw that ElasticSearch always connected to the same IP address, yet our DNS resolution check pointed towards another (new) IP address. Means: Slack has changed the public IP for hooks.slack.com. But ElasticSearch, which uses the local Java settings, wasn't aware of that change. This is because, by default, DNS is cached forever in the JVM (see DNS cache settings).

To change this, I checked $JAVA_HOME/jre/lib/security/java.security and the defaults were the following:

# The Java-level namelookup cache policy for successful lookups:
# any negative value: caching forever
# any positive value: the number of seconds to cache an address for
# zero: do not cache
# default value is forever (FOREVER). For security reasons, this
# caching is made forever when a security manager is set. When a security
# manager is not set, the default behavior in this implementation
# is to cache for 30 seconds.
# NOTE: setting this to anything other than the default value can have
#       serious security implications. Do not set it unless
#       you are sure you are not exposed to DNS spoofing attack.

# The Java-level namelookup cache policy for failed lookups:
# any negative value: cache forever
# any positive value: the number of seconds to cache negative lookup results
# zero: do not cache
# In some Microsoft Windows networking environments that employ
# the WINS name service in addition to DNS, name service lookups
# that fail may take a noticeably long time to return (approx. 5 seconds).
# For this reason the default caching policy is to maintain these
# results for 10 seconds.

And changed it to use an internal DNS cache of 5 minutes (300s) but failed resolutions should not be cached at all:

# grep ttl $JAVA_HOME/jre/lib/security/java.security

After this a restart of ElasticSearch was needed.


Mounting NFS export suddenly does not work anymore - blame systemd
Thursday - Nov 15th 2018 - by - (0 comments)

Mounting NFS exports has been something which "just worked" since... I can't even remember since when. 

But behold the great SystemD times when things which used to work stop working.

root@client:~# mount nfsserver:/export /mnt
Job for rpc-statd.service failed because the control process exited with error code. See "systemctl status rpc-statd.service" and "journalctl -xe" for details.
mount.nfs: rpc.statd is not running but is required for remote locking.
mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
mount.nfs: an incorrect mount option was specified

I came across a similar problem on Unix Stackexchange which solved my problem:

root@client:~# systemctl enable rpcbind.service
Synchronizing state of rpcbind.service with SysV init with /lib/systemd/systemd-sysv-install...
Executing /lib/systemd/systemd-sysv-install enable rpcbind
root@client:~# systemctl start rpcbind.service
root@client:~# systemctl restart rpcbind.service

root@client:~# mount nfsserver:/export /mnt

root@client:~# mount | grep nfs
nfsserver:/export on /mnt type nfs (rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=,mountvers=3,mountport=635,mountproto=udp,local_lock=none,addr=

And yes, nfs-common package was already installed.


Icinga2-classicui is gone after installing Icinga2 2.10
Wednesday - Nov 14th 2018 - by - (0 comments)

Farewell my beloved icinga2-classicui.

# apt-get upgrade
Removing icinga2-classicui (2.9.0-1.xenial) ...
disabling Apache2 configuration ...
apache2_invoke postrm: Disable configuration icinga2-classicui
(Reading database ... 39711 files and directories currently installed.)
Preparing to unpack .../icinga2_2.10.1-1.xenial_amd64.deb ...
Unpacking icinga2 (2.10.1-1.xenial) over (2.9.0-1.xenial) ...
Preparing to unpack .../icinga2-ido-mysql_2.10.1-1.xenial_amd64.deb ...
Unpacking icinga2-ido-mysql (2.10.1-1.xenial) over (2.9.0-1.xenial) ...
Processing triggers for libc-bin (2.23-0ubuntu10) ...
dpkg: libicinga2: dependency problems, but removing anyway as you requested:
 icinga2-bin depends on libicinga2 (= 2.9.0-1.xenial).

(Reading database ... 39713 files and directories currently installed.)
Removing libicinga2 (2.9.0-1.xenial) ...
(Reading database ... 39711 files and directories currently installed.)
Preparing to unpack .../icinga2-bin_2.10.1-1.xenial_amd64.deb ...
Unpacking icinga2-bin (2.10.1-1.xenial) over (2.9.0-1.xenial) ...
Preparing to unpack .../icinga2-common_2.10.1-1.xenial_all.deb ...
Unpacking icinga2-common (2.10.1-1.xenial) over (2.9.0-1.xenial) ...

# dpkg -l|grep classic
rc  icinga2-classicui     2.9.0-1.xenial     all  host and network monitoring system - classic UI

Icinga 2 Classic UI was removed 

You've probably been the most viewed interface of my browser history in the last few years.

But you're not dead yet as I still need you for SLA calculations...


Creating an InfluxDB asynchronous replication using subscription service
Tuesday - Nov 13th 2018 - by - (0 comments)

Today a full disk corrupted my InfluxDB 0.8 metrics database for Icinga2 monitoring and I was unable to recover the data.
I found quite some issues with the same errors I found in the logs but all of them were fixed in more recent versions. Luckily this database is not in production yet so this is probably (and forcibly) the day to use a newer InfluxDB version.

InfluxDB 0.8 came from the Ubuntu repositories itself. And it featured a cluster setup! Unfortunately newer versions do not support a cluster setup anymore, unless you buy the license for the Enterprise Edition of InfluxDB. That was the reason why I thought I'd just stay on 0.8. But, as mentioned, a lot of bugfixes and improvements happened since then. 

However I didn't want to give up on the cluster or to have at least a standby InfluxDB that the graphs can still be shown, even if the primary monitoring server is down. This is when I came across subscriptions.

According to the documentation, the receiving InfluxDB copies data to known subscribers. Imagine this like a mail server sending a newsletter to registered subscribers (this comparison really helps, doesn't it? You're welcome!). In order to do that, the subscriber service needs to be enabled in the InfluxDB config (usually /etc/influxdb/influxdb.conf):

  # Determines whether the subscriber service is enabled.
  enabled = true

Restart InfluxDB after this setting change.

On the master server you need to define the known subscribers:

root@influx01:/# influx
Connected to http://localhost:8086 version 1.6.4
InfluxDB shell version: 1.6.4
> CREATE SUBSCRIPTION "icinga-replication" ON "icinga"."autogen" DESTINATIONS ALL ''

So what does it do?

Obviously a new subscription with the unique ID "icinga-replication" is created. It covers the database "icinga" and sets a retention policy of "autogen".
As destination the transport over http was chosen and endpoint is, which is InfluxDB running on host influx02.

By showing the subscriptions (SHOW SUBSCRIPTIONS), this can be confirmed:

name: icinga
retention_policy name               mode destinations
---------------- ----               ---- ------------
autogen          icinga-replication ALL  []

From now on every record written into the influx01 instance is copied to to influx02 and will show up with such entries on influx02:

influx02 influxd[5045]: [httpd] - icinga [13/Nov/2018:14:23:41 +0100] "POST /write?consistency=&db=icinga&precision=ns&rp=autogen HTTP/1.1" 204 0 "-" "InfluxDBClient" 564b790a-e747-11e8-8b0c-000000000000 13708

But what about authentication?

When authentication is enabled (which should always be the case on a database), the following error message appears in the log file on the master:

influx01 influxd[7232]: ts=2018-11-13T10:18:11.142719Z lvl=info msg="{\"error\":\"unable to parse authentication credentials\"}\n" log_id=0Bk8uDzG000 service=subscriber

In this case, the subscription needs to be added with credentials in the URL string:

> CREATE SUBSCRIPTION "icinga-replication" ON "icinga"."autogen" DESTINATIONS ALL 'http://dbuser:password@'

Note: I used the same subscription ID again (icinga-replication). In order to do so, the existing subscription must be removed:

> DROP SUBSCRIPTION "icinga-replication" ON "icinga"."autogen"

And what about data prior to the subscription "replication"?

The subscription service only copies incoming data. This means that older data is not copied over to the subscribers. In this case you need to transfer the data using dump/restore or even by syncing the data (/var/lib/influxdb/data) and wal (/var/lib/influxdb/wal) directories.
In my case I stopped InfluxDB on both influx01 and influx02, rsynced the content of /var/lib/influxdb/data and /var/lib/influxdb/wal from influx01 to influx02 and then started InfluxDB again. First on host influx02, then on influx01. Now data is in sync (until a network issue or similar happens).

As this graphing database is not production critical I can live with that situation.


MariaDB Galera Cluster: When the data is in sync but the query results differ
Friday - Nov 9th 2018 - by - (0 comments)

On Friday's there's always a bit more time for the curious cases which require some deeper research.

A developer contacted me with the suspicion that the data inside a clustered MariaDB (using Galera) was not in sync. Sometimes a certain query resulted X results, sometimes Y results.

Technically this sounded plausible because the application connects to which is a HAProxy listening on 3306. The requests are then balanced across multiple Galera nodes in round-robin style. So when the query lands on node1, X results are shown. But when the query lands on node2, Y results are shown. That could indeed be a data sync problem.

But it wasn't. The data across the cluster was exactly the same on all nodes. The same amount of rows inside the tables, the same data.

On node1:

MariaDB [app]> select count(*) from docs;
| count(*) |
|     3916 |
1 row in set (0.00 sec)

MariaDB [app]> select count(*) from pages;
| count(*) |
|      446 |
1 row in set (0.01 sec)

On node2:

MariaDB [app]> select count(*) from docs;
| count(*) |
|     3916 |
1 row in set (0.00 sec)

MariaDB [app]> select count(*) from pages;
| count(*) |
|      446 |
1 row in set (0.01 sec)

Even a dump created on both nodes showed the exact same size:

root@node1:/tmp# mysqldump app > /tmp/app.sql
root@node1:/tmp# du -ks /tmp/app.sql
3416    /tmp/app.sql

root@node2:/tmp# mysqldump app > /tmp/app.sql
root@node2:/tmp# du -ks /tmp/app.sql
3416    /tmp/app.sql

Could the two nodes interpret the query differently? The query itself is indeed a bit special, joining two tables together and using several internal MySQL functions:

SELECT DISTINCT d.*, p.page, p.pagestatus, DATE_FORMAT(creationdate,"%d.%m.%Y %H:%i:%s") as dcreationdate, DATE_FORMAT(updatedate,"%d.%m.%Y %H:%i:%s") as dupdatedate, DATE_FORMAT(lastpubdate,"%d.%m.%Y %H:%i:%s") as dlastpubdate, DATE_FORMAT(modified,"%d.%m.%Y %H:%i:%s") as dmodified FROM docs d LEFT JOIN (SELECT * FROM pages WHERE id IN (SELECT MAX(id) FROM pages GROUP BY wwid)) p ON (p.wwid = d.wwid) WHERE d.id in (SELECT MAX(id) FROM docs WHERE channelid = 2 GROUP BY ldid) AND d.printpublicationdate = '2018-06-22' ORDER BY updatedate DESC;

Agreed, not your typical select query... And this query did indeed show two different result sets from node1 and node2:

node1: 62 rows in set (0.02 sec)
node2: 82 rows in set (0.01 sec)

I checked for differences on the two nodes and indeed found a different patch level of MariaDB: node1 had 10.0.33 installed, node2 was running a newer version 10.0.36. Could this really be the source of the difference?

As this problem was identified on a staging server, I updated all cluster nodes to the current MariaDB patch level 10.0.37 (I wouldn't have done that on a Friday afternoon on PROD, no way!).
And by "magic", both nodes reported the same results once they ran the same version:

node1: 82 rows in set (0.02 sec)
node2: 82 rows in set (0.01 sec)

There were two important lessons learned here:

1) The data was still in sync and therefore not corrupted.

2) Always make sure a Galera cluster uses the same version across all nodes.

I'm not sure which exact version (10.0.34, 35 or 36) fixed this, but it could be related to bug MDEV-15035 which was fixed in 10.0.35.


Go to Homepage home
Linux Howtos how to's
Monitoring Plugins monitoring plugins
Links links

Valid HTML 4.01 Transitional
Valid CSS!
[Valid RSS]

6975 Days
until Death of Computers