
Hardware Error: Parity error during data load. Or: Clean me!!!
Thursday - May 24th 2018

For a couple of months I had been wondering about the following error messages, which appeared every five minutes on my NAS, an HP Proliant N40L Microserver running Debian 7 Wheezy:

[Hardware Error]: CPU:0 (10:6:3) MC2_STATUS[-|CE|-|-|AddrV|CECC]: 0x940040000000018a
[Hardware Error]: MC2_ADDR: 0x00000000d3b42540
[Hardware Error]: MC2 Error: : SNP error during data copyback.
[Hardware Error]: cache level: L2, tx: GEN, mem-tx: SNP
[Hardware Error]: Corrected error, no action required.

I came across several articles discussing these messages, but none offered a real solution to the problem. Some even said these logged error messages could simply be ignored...

A couple of days ago I upgraded the NAS server from Debian Wheezy to Jessie (as a mid-way upgrade to Stretch) and realized, after the successful OS upgrade, that the log entries now appeared ALL THE TIME. I couldn't even use the terminal anymore because it was flooded with these messages:

[ 1026.904428] [Hardware Error]: CPU:0 (10:6:3) MC2_STATUS[-|CE|-|-|AddrV|CECC]: 0x940040000000018a
[ 1026.910229] [Hardware Error]: MC2_ADDR: 0x00000000d3b42540
[ 1026.915945] [Hardware Error]: MC2 Error: : SNP error during data copyback.
[ 1026.921690] [Hardware Error]: cache level: L2, tx: GEN, mem-tx: SNP
[ 1027.182836] [Hardware Error]: Corrected error, no action required.
[ 1027.188553] [Hardware Error]: CPU:0 (10:6:3) MC2_STATUS[-|CE|-|-|AddrV|CECC]: 0x940040000000018a
[ 1027.194345] [Hardware Error]: MC2_ADDR: 0x0000000001af2540
[ 1027.200132] [Hardware Error]: MC2 Error: : SNP error during data copyback.
[ 1027.205915] [Hardware Error]: cache level: L2, tx: GEN, mem-tx: SNP
[ 1027.338890] [Hardware Error]: Corrected error, no action required.
[ 1027.344632] [Hardware Error]: CPU:0 (10:6:3) MC1_STATUS[-|CE|-|-|AddrV]: 0x9400000000000151
[ 1027.350428] [Hardware Error]: MC1_ADDR: 0x0000ffff81012550
[ 1027.356222] [Hardware Error]: MC1 Error: Parity error during data load.
[ 1027.361997] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
[ 1027.430924] [Hardware Error]: Corrected error, no action required.
[ 1027.436645] [Hardware Error]: CPU:0 (10:6:3) MC1_STATUS[-|CE|-|-|AddrV]: 0x9400000000000151
[ 1027.442419] [Hardware Error]: MC1_ADDR: 0x0000ffff810b2550
[ 1027.448216] [Hardware Error]: MC1 Error: Parity error during data load.
[ 1027.453960] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
[ 1027.939102] [Hardware Error]: Corrected error, no action required.

Damn. It was time to dig into that problem again. This time I got luckier and came across a forum thread.

The most interesting post in that thread was:

"It is most likely a CPU fan dust bunny. That's the signal from the kernel to clean those out."

As simple as this sounds, it made sense. The Microserver has been running day and night since it became my NAS server in December 2012 (see article Building a home file server with HP Proliant N40L), which adds up to more than 5 years of total run time. As you might be aware, the motherboard of this Microserver sits under the drive cage and is not easily accessible, and therefore not easily cleanable either.

I gave it a shot, shut down the server, removed the cables from the motherboard and pulled it out.

Dust on the heat sink causing hardware errors in kernel log 

There it is. A thick layer of dust sitting on the CPU's heat sink.

I cleaned the motherboard (vacuumed the dust off), re-attached the cables and pushed the motherboard back into position. Moment of truth: I booted the server.

In syslog you can easily see when I turned the server off (15:28) and booted it again (15:42):

May 24 15:28:04 nas kernel: [77872.129490] [Hardware Error]: CPU:0 (10:6:3) MC1_STATUS[-|CE|-|-|AddrV]: 0x9400000000000151
May 24 15:28:04 nas kernel: [77872.135237] [Hardware Error]: MC1_ADDR: 0x0000ffff810b2550
May 24 15:28:04 nas kernel: [77872.140955] [Hardware Error]: MC1 Error: Parity error during data load.
May 24 15:28:04 nas kernel: [77872.146656] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
May 24 15:28:04 nas kernel: [77872.263866] [Hardware Error]: Corrected error, no action required.
May 24 15:28:04 nas kernel: [77872.269509] [Hardware Error]: CPU:0 (10:6:3) MC2_STATUS[-|CE|-|-|AddrV|CECC]: 0x940040000000018a
May 24 15:28:04 nas kernel: [77872.275283] [Hardware Error]: MC2_ADDR: 0x0000000001af2540
May 24 15:28:04 nas kernel: [77872.280990] [Hardware Error]: MC2 Error: : SNP error during data copyback.
May 24 15:28:04 nas kernel: [77872.286694] [Hardware Error]: cache level: L2, tx: GEN, mem-tx: SNP
May 24 15:28:04 nas kernel: [77872.323890] [Hardware Error]: Corrected error, no action required.
May 24 15:28:04 nas kernel: [77872.329552] [Hardware Error]: CPU:0 (10:6:3) MC1_STATUS[-|CE|-|-|AddrV]: 0x9400000000000151
May 24 15:28:04 nas kernel: [77872.335294] [Hardware Error]: MC1_ADDR: 0x0000ffff810b2550
May 24 15:28:04 nas kernel: [77872.341013] [Hardware Error]: MC1 Error: Parity error during data load.
May 24 15:28:04 nas kernel: [77872.346716] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
May 24 15:28:04 nas kernel: [77872.371793] [Hardware Error]: Corrected error, no action required.
May 24 15:28:04 nas kernel: [77872.377085] [Hardware Error]: CPU:0 (10:6:3) MC1_STATUS[-|CE|-|-|AddrV]: 0x9400000000000151
May 24 15:28:04 nas kernel: [77872.382397] [Hardware Error]: MC1_ADDR: 0x0000ffff810b2540
May 24 15:28:04 nas kernel: [77872.387718] [Hardware Error]: MC1 Error: Parity error during data load.
May 24 15:28:04 nas kernel: [77872.393030] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
May 24 15:42:13 nas kernel: [    0.000000] Initializing cgroup subsys cpuset
May 24 15:42:13 nas kernel: [    0.000000] Initializing cgroup subsys cpu
May 24 15:42:13 nas kernel: [    0.000000] Initializing cgroup subsys cpuacct
May 24 15:42:13 nas kernel: [    0.000000] Linux version 3.16.0-6-amd64 (debian-kernel@lists.debian.org) (gcc version 4.9.2 (Debian 4.9.2-10+deb8u1) ) #1 SMP Debian 3.16.56-1+deb8u1 (2018-05-08)
May 24 15:42:13 nas kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.16.0-6-amd64 root=UUID=e00b8ddf-5247-4b9f-834c-d557df90f575 ro quiet
May 24 15:42:13 nas kernel: [    0.000000] e820: BIOS-provided physical RAM map:

Then I waited. From the logs above (which flooded my terminal) you can see that the hardware errors had already appeared after 1026 seconds of uptime.

Now, after 1200 seconds of uptime, still no hardware errors:

root@nas:~# uptime
 16:03:00 up 20 min,  1 user,  load average: 0.04, 0.15, 0.09

root@nas:~# echo $((20 * 60 ))
1200

root@nas:~# dmesg | tail
[   10.257700] RPC: Registered named UNIX socket transport module.
[   10.257706] RPC: Registered udp transport module.
[   10.257709] RPC: Registered tcp transport module.
[   10.257711] RPC: Registered tcp NFSv4.1 backchannel transport module.
[   10.272263] FS-Cache: Loaded
[   10.321299] FS-Cache: Netfs 'nfs' registered for caching
[   10.376030] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
[   11.809469] tg3 0000:02:00.0 eth0: Link is up at 1000 Mbps, full duplex
[   11.809478] tg3 0000:02:00.0 eth0: Flow control is on for TX and on for RX
[   11.809506] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

Even now, after 41 minutes (= 2460 seconds) of uptime, still no errors:

root@nas:~# uptime && dmesg |tail
 16:23:49 up 41 min,  1 user,  load average: 0.02, 0.03, 0.01
[   10.257700] RPC: Registered named UNIX socket transport module.
[   10.257706] RPC: Registered udp transport module.
[   10.257709] RPC: Registered tcp transport module.
[   10.257711] RPC: Registered tcp NFSv4.1 backchannel transport module.
[   10.272263] FS-Cache: Loaded
[   10.321299] FS-Cache: Netfs 'nfs' registered for caching
[   10.376030] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
[   11.809469] tg3 0000:02:00.0 eth0: Link is up at 1000 Mbps, full duplex
[   11.809478] tg3 0000:02:00.0 eth0: Flow control is on for TX and on for RX
[   11.809506] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

These error messages really turned out to be a warning from the OS to clean the server. Who would have thought that when looking at those hardware error messages...
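
To keep an eye on this going forward, a simple check (my own addition, not part of the original troubleshooting) is to count the hardware error lines currently in the kernel ring buffer; a number greater than zero would be a hint to take another look at the heat sink:

root@nas:~# dmesg | grep -c 'Hardware Error'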

 

How to build a generic Icinga2 service graph in Grafana using InfluxDB
Friday - May 11th 2018

In the past weeks I've spent quite some time (whenever I had some) slowly kicking off the new monitoring architecture: a dual-master Icinga2 installation, InfluxDB as the graphing database and Grafana as the graph display front end. See the previous articles "Icinga2 graphing with InfluxDB and Grafana" and "Create separate measurement tables in InfluxDB for Icinga 2 NRPE checks" for further information.

I was quite happy so far with the dashboard I created in Grafana, based on the Icinga2 Grafana dashboard template:

Grafana Icinga2 Linux Dashboard 

But I was missing dynamic graphs. We currently have around 850 host objects in our "old" Icinga2 monitoring, and not all hosts are the same: some have additional database checks, some have HTTP checks, some run on Windows, and others have very specific application checks. It's difficult to represent all these services in the main dashboard with Grafana's fixed graph panels.

Eventually I came across a question from user TryTryAgain on serverfault, which was basically about creating a dynamic action_url pointing to Grafana. The question itself was irrelevant to me, but something interesting caught my eye:

" I'd like this to work: action_url = "http://grafana-server.example/grafana/dashboard/db/generic-check?var-device=$HOSTNAME$&var-check=$SERVICEDESC$&var-checkmetric=$SERVICECHECKCOMMAND$&var-datatype=perfdata&var-value=value" "

So the user prepared a new template "generic-check" and used dynamic variables to display the graph for one exact service. That's a great idea!

As I couldn't find a pre-existing template for such a generic graph, I went on to create one myself and had to dig deeper into InfluxDB's queries and schemas...

 

1. Prepare the graph

I added a single graph, first with a static data query:

SELECT mean("value") FROM hostalive WHERE ("hostname" =~ /mytesthost/) AND $timeFilter GROUP BY time($__interval) fill(previous)

This graph query will be adjusted in the following steps, as dynamic variables are added.

 

2. Variable $hostname

I decided I wanted the generic service template to start with the actual host object; this is usually the most important marker (from which host object do I need the graph?). So I created the $hostname variable in the template's templating variables:

$hostname = SHOW TAG VALUES FROM "hostalive" WITH KEY = "hostname"

This query looks up the "hostalive" measurement and returns all unique values of the tag key "hostname". Remember, InfluxDB can basically be explained as a key-value store (like Redis).
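
To double-check what this templating query returns, it can also be run directly against InfluxDB with the influx CLI. Note that the database name "icinga2" is an assumption here, based on the default of Icinga2's InfluxdbWriter; adjust it to your setup:

influx -database icinga2 -execute 'SHOW TAG VALUES FROM "hostalive" WITH KEY = "hostname"'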

The $hostname variable was pretty easy and immediately showed all the hosts already prepared in the new Icinga2 architecture:

Grafana variable hostname 

To use the dynamic variable $hostname in the graph, the graph's query needs to be adjusted:

SELECT mean("value") FROM hostalive WHERE ("hostname" =~ /^$hostname$/) AND $timeFilter GROUP BY time($__interval) fill(previous)

 

3. Variable $check

Now it gets slightly more complicated. With the host object selected via $hostname, Grafana needs to find out for which checks it can display graphs. I decided the best way would be to look at the different measurement tables, which I did with:

$check = SHOW measurements

But I wasn't happy with that because it just showed all measurement tables, even irrelevant ones like "http" for a non-webserver.

Luckily the show measurements query also allows a WHERE clause:

$check = SHOW measurements WHERE "hostname" =~ /^$hostname$/

This way InfluxDB only shows the measurement tables in which the selected host object already has some data entries.
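
Again, this can be verified directly on the InfluxDB server (same assumption about the database name as above, with "mytesthost" standing in for a real host name):

influx -database icinga2 -execute 'SHOW MEASUREMENTS WHERE "hostname" =~ /mytesthost/'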

Grafana Dynamic Variable Check 

To use the dynamic variable $check in the graph, the graph's query needs to be adjusted:

SELECT mean("value") FROM /^$check$/ WHERE ("hostname" =~ /^$hostname$/) AND $timeFilter GROUP BY time($__interval) fill(previous)

 

4. Variable $service

At first I thought my template was almost complete with $check defined. It worked, for example, for "ssh" checks, which are (normally) a single service check per host object. But a disk check is a good counter-example: you usually run disk usage checks (check_disk) on several partitions of the same host object and therefore have multiple service objects in Icinga 2. In such a case, the query for the selected $check returns multiple results, and the graph would simply lump all the data (value) together, whether a value came from partition "/" or from "/tmp". This is wrong.

So I needed to create another variable, $service, which represents the existing services for the selected $check:

$service = SHOW TAG VALUES FROM $check WITH KEY = "service" WHERE "hostname" =~ /^$hostname$/

In the following example, a host object with several tcp checks gives the following selection:

Grafana dynamic variable service 

 

5. Variable $metric

But a check can return multiple values! A check_http, for example, usually returns two sets of performance data: the size of the response and the response time. To get the graph we actually want, for example the response time of an HTTP check, another flexible variable $metric was defined:

$metric = SHOW TAG VALUES FROM "$check" WITH KEY = "metric" WHERE "hostname" =~ /^$hostname$/

The new variable now allows selecting the relevant data:

Grafana dynamic variable metric

To use the dynamic variable $metric in the graph, the graph's query needs to be adjusted:

SELECT mean("value") FROM /^$check$/ WHERE ("hostname" =~ /^$hostname$/ AND "metric" =~ /^$metric$/) AND $timeFilter GROUP BY time($__interval) fill(previous)

 

6. Variable $aggregation

I thought I was done, and the first few tests looked promising. Until I came across a host with MySQL running on it: the graphs for MySQL (metric: connections) just kept growing:

Grafana MySQL Connections wrong 

This is a misinterpretation by the graph, because the MySQL connections value is a so-called counter (connection = connection + 1). But with a single generic graph, how is Grafana supposed to know what kind of data it gets?

The solution is to provide yet another dynamic variable $aggregation. With this variable, Grafana can be told how to display the data. I created a custom variable for this purpose with two values:

Grafana dynamic variable aggregation

$aggregation = mean("value"),derivative(mean("value"))

To use the dynamic variable $aggregation in the graph, the graph's query needs to be adjusted:

SELECT $aggregation FROM /^$check$/ WHERE ("hostname" =~ /^$hostname$/ AND "metric" =~ /^$metric$/) AND $timeFilter GROUP BY time($__interval) fill(previous)

The graph for the counter data now dynamically adapts:

Grafana MySQL connections correct 
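
As a side note, and not part of the setup described here: for counters that can reset to zero (for example after a MySQL restart), InfluxDB also offers non_negative_derivative(), which could be added as a third option to the same custom variable:

$aggregation = mean("value"),derivative(mean("value")),non_negative_derivative(mean("value"))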

 

The full picture

Grafana Icinga2 Dynamic Generic Service Graph 

The template "Generic Service" can be downloaded here as json export.

 

Next steps

Now that I can dynamically display a graph for a generic service (by manually selecting the values from the drop-down fields or by passing the variables in the URL), I will try to use this somehow for the "action_url" in Icinga 2 or integrate it in a different way.

 

Open issues:

The graphs for disk/partition usage are not showing up because of an error in the query. I haven't found out yet why this happens (probably a conflict between $service and $metric, but I'm not sure); for now I can live with it.

Grafana generic graph failing for disk

 

Monitoring plugin check_smart version 5.10 released
Tuesday - May 8th 2018

A new version (5.10) of the monitoring plugin check_smart was released today.

This version introduces a new parameter "-s" (or "--selftest"). With this switch, an additional check is run against the drive's SMART self-test log.

A typical example of such a self-test log entry:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       10%     10926         962363536
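
Such a self-test log can be displayed manually with smartctl (again using /dev/sda as example):

# smartctl -l selftest /dev/sda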

Remember: You can always launch a SMART self-test by running the following command (example on drive /dev/sda):

# smartctl -t long /dev/sda

At first I thought this self-test check should be part of the main check, but there might be situations where a drive reported errors in the past that have since been resolved (e.g. sectors were meanwhile re-allocated). By keeping the check behind an additional parameter "-s", which can easily be added to existing Nagios/Icinga/NRPE configurations, the new version doesn't break compatibility.
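
A possible invocation with the new switch could look like this (device path and interface type are just examples, adjust them to your existing command definitions):

# ./check_smart.pl -d /dev/sda -i ata -s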

 

Lessons learned: Do not put stateful Minio into a Docker container
Monday - May 7th 2018

On a Big Data playground we've built (Spark, R, Minio) on top of a Rancher-managed Docker environment, we also put the Minio object storage into Docker containers. We were sceptical at first (don't create stateful containers!), but Rancher offers Minio directly from its Cattle catalog:

 Minio in Rancher Cattle Catalog

We decided to go for it - it's a playground environment, and in the worst case we would simply start from scratch again.

At first everything seemed to work fine. The containers started up and Minio was launched in each container using the internal Docker IP addresses (four containers, spread across four physical hosts). On each physical host, three volumes were created and volume-mounted into the Minio containers. Inside the containers:

# docker inspect -f '{{ .Mounts }}' 5ca3f3177e27
[{volume minio_minio-scheduler-setting_3_bc048e5e-2cbd-4f5d-8ab7-ef835e7424af_2b6e4 /var/lib/docker/volumes/minio_minio-scheduler-setting_3_bc048e5e-2cbd-4f5d-8ab7-ef835e7424af_2b6e4/_data /opt/scheduler local rw true } {volume minio_minio-data-0_3_bc048e5e-2cbd-4f5d-8ab7-ef835e7424af_76e19 /var/lib/docker/volumes/minio_minio-data-0_3_bc048e5e-2cbd-4f5d-8ab7-ef835e7424af_76e19/_data /data/disk0 local rw true } {volume rancher-cni /var/lib/docker/volumes/rancher-cni/_data /.r local ro false } {volume 42d6f216267ced36e186b3e082ae6b3c7ad53085432326220c583ab022842826 /var/lib/docker/volumes/42d6f216267ced36e186b3e082ae6b3c7ad53085432326220c583ab022842826/_data /data local  true }]

Minio itself was launched inside the container in the following way:

bash-4.3# ps auxf
PID   USER     TIME   COMMAND
    1 root       0:00 s6-svscan -t0 /var/run/s6/services
   38 root       0:00 s6-supervise s6-fdholderd
  323 root       0:00 s6-supervise minio
  326 root       0:00 sh ./run
  330 minio      0:00 sh -c /opt/minio/bin/minio server http://10.42.84.239/data/disk0 http://10.42.224.2/data/disk0
  332 minio      0:27 /opt/minio/bin/minio server http://10.42.84.239/data/disk0 http://10.42.224.2/data/disk0 http:/
  430 root       0:00 bash
  436 root       0:00 ps auxf

The command line of this process (PID 330) reveals more:

bash-4.3# cat /proc/330/cmdline
sh-c/opt/minio/bin/minio server http://10.42.84.239/data/disk0 http://10.42.224.2/data/disk0 http://10.42.55.111/data/disk0 http://10.42.116.171/data/disk0

And, no surprise, here we can find the four IP addresses of the four Minio containers.

We launched some basic tests to make sure the setup would hold:

  • Upscaling: We tried to scale from 4 to 5 minio-server containers. The 5th server didn't come up, most likely because it is not made aware of the other minio-server containers (this only happens during the setup through the Catalog) and it would require the correct start command with all the other container IPs. Even if the 5th server had come up, the existing containers would not suddenly be aware of a new Minio server; they, too, would need a changed Minio start command. TL;DR: Minio servers deployed through the catalog cannot be horizontally scaled, they're fixed once deployed.
  • Same IP: As we can see in the command line, the IP addresses are fixed in the Minio start command. We deleted a container, it was re-created by Rancher and it kept the same IP address.
  • Volumes: We wanted to see if the data is lost if a container is stopped or deleted. Even after deletion of a minio-server container, the volumes still exist on the Rancher host and when the container is re-created, the same volume is re-attached to the new container. Therefore no data loss.

So far so good, but one thing caught my eye: the Minio server service, automatically created when deploying from the catalog, was set to "Scale" and not to "Global". The latter, Global, is usually my first choice for running services across hosts, and Minio, by definition, is such a service.

When we had to take down physical host #1 for maintenance, the problems started. Due to the "Scale" setting of the Minio server service, another Minio container was launched on one of the three remaining physical hosts. Therefore, on Docker host #2 we had two Minio containers running. Even though the newly created container on host #2 used the same IP address, it had to create a new data volume on the host, so the two Minio containers on that host now used twice the disk space as before.

When host #1 was up again, we tried to place the Minio container back on host #1. Unfortunately, containers in Rancher cannot be moved around like a VM in VMware. So we deleted the second Minio container on host #2 and hoped Rancher would automatically create a new Minio container (scale: 4, remember) on host #1 again. It didn't; the new Minio container was again launched on host #2. I had to scale down and then up again, but this worked: the new Minio container was finally created on host #1 again. To my shock, however, it was assigned a new IP address instead of re-using the previous one (probably because of my scaling down). This broke the communication with the other Minio containers, because they were not aware of the IP change of one of their Minio partner nodes.

Another problem: on node #2 the duplicated volumes were still on the system, using disk space:

# docker volume ls | grep minio
local               minio_minio-data-0_1_fb18fef0-a34b-4f78-9628-ca66abe4b057_a745e
local               minio_minio-data-0_2_c416a26e-b724-49af-bd37-751813a0a586_10db2
local               minio_minio-scheduler-setting_1_fb18fef0-a34b-4f78-9628-ca66abe4b057_919c9
local               minio_minio-scheduler-setting_2_c416a26e-b724-49af-bd37-751813a0a586_fd842

At first we had some "hope" that the service would recover by itself once each container was restarted - but the following error message shown in the Minio user interface pretty much said nay ("Multiple disk failures, unable to write data."):

Minio as Docker container: Multiple disk failures

Luckily we were still in playground mode, before any actual/important data had been created and stored in the Minio object store. I decided to create static LXC containers for Minio and use the docker0 bridge to connect the (stateless) application Docker containers to Minio.

I was able to attach the LXC container "minio01" to the docker0 bridge, and the container was able to communicate with the Docker containers on the same physical host. But, and this is the showstopper, it was not able to communicate with the Docker containers running on other hosts (in the same Rancher environment). The reason is that Rancher uses some magic (technically speaking: an IPSEC tunnel between the Docker hosts) to "extend" the Docker-internal network (10.42.0.0/16) onto each Docker host. According to the Rancher documentation, it is possible to "join" this IPSEC-tunneled network by manually running a Docker container with a specific label:

For any containers launched through the Docker CLI, an extra label --label io.rancher.container.network=true can be used to launch the container into the managed network. Without this label, containers launched from the Docker CLI will be using the bridge network.
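
For a container launched from the Docker CLI, this would look roughly like the following sketch (image and command are only placeholders, the label is the relevant part):

# docker run -d --label io.rancher.container.network=true ubuntu:16.04 sleep infinity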

But of course this only applies to containers started and managed by Docker, not to Linux Containers (LXC, or "system containers" as they're nowadays called).

The Docker containers have one thing in common though: they use the Docker host as router/gateway. This means they are able to connect to the physical network using iptables NAT rules. So I created another virtual bridge called virbr0 alongside docker0, attached the host's physical primary interface to it and assigned the host's IP address to the virbr0 interface:

 # The primary network interface
iface enp15s0 inet manual

# Virtual bridge
auto virbr0
iface virbr0 inet static
        address 10.150.1.10
        netmask 255.255.255.128
        network 10.150.1.0
        broadcast 10.150.1.127
        gateway 10.150.1.1
        # bridge control
        bridge_ports enp15s0
        bridge_fd 0
        pre-up brctl addbr virbr0
        pre-down brctl delbr virbr0
        # dns-* options are implemented by the resolvconf package, if installed
        dns-nameservers 8.8.8.8 1.1.1.1
        dns-search playground.local

The LXC container now uses this virtual bridge virbr0 as its network link and the host's own IP address as its gateway:

# cat /var/lib/lxc/minio01/config
# Template used to create this container: /usr/share/lxc/templates/lxc-ubuntu
# Parameters passed to the template: --release xenial
# Template script checksum (SHA-1): 4d7c613c3c0a0efef4b23917f44888df507e662b
# For additional config options, please look at lxc.container.conf(5)

# Uncomment the following line to support nesting containers:
#lxc.include = /usr/share/lxc/config/nesting.conf
# (Be aware this has security implications)


# Common configuration
lxc.include = /usr/share/lxc/config/ubuntu.common.conf

# Container specific configuration
lxc.rootfs = /var/lib/docker/lxc/minio01
lxc.rootfs.backend = dir
lxc.utsname = minio01
lxc.arch = amd64
lxc.start.auto = 1

# Network configuration
lxc.network.type = veth
lxc.network.link = virbr0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:10:ee:90
lxc.network.ipv4 = 10.150.1.50/25
lxc.network.ipv4.gateway = 10.150.1.10

Now the Docker containers and the LXC container have one thing in common: they use the same physical host as gateway. This means they're using the same routing table and are therefore able to communicate with each other.
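
A quick way to sanity-check this from one of the Docker hosts (the container name is a placeholder, the IP is the LXC container's address from the config above):

# docker exec -it <app-container> ping -c 3 10.150.1.50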

But how do I tell the Docker application containers where Minio is located now? In Rancher this is possible by using an "External Service", in which you can set external IP addresses outside of the Docker container range:

Rancher External Service

And this external service can now be used in the internal Rancher load balancer:

Rancher LB using external Service 

Although Minio isn't running inside the Rancher-managed Docker environment anymore, it still uses the same physical network connections (gateways) of the same hosts and therefore doesn't lose any performance. On the other hand, we feel much more comfortable and "safe" with the persistent data living in an LXC container.

 

Rsnapshot error: /bin/rm: cannot remove ... Directory not empty
Friday - May 4th 2018

For the last two days I have been getting the following error message from the nightly rsnapshot run:

/bin/rm: cannot remove ‘/backup/rsnapshot/daily.14/
p01/mnt/containerdata/admin/[...]/Auth’:
Directory not empty
----------------------------------------------------------------------------
rsnapshot encountered an error! The program was invoked with these options:
/usr/bin/rsnapshot daily
----------------------------------------------------------------------------
ERROR: Warning! /bin/rm failed.
ERROR: Error! rm_rf("/backup/rsnapshot/daily.14/")

But even when I manually tried to delete the folder, it didn't work:

# rm -rfv /backup/rsnapshot/daily.14/p01/mnt/containerdata/admin/[...]/Auth
rm: cannot remove ‘/backup/rsnapshot/daily.14/p01/mnt/containerdata/admin/[...]/Auth’: Directory not empty

Why would that happen? Let's check out the contents of that particular path:

# ls -la /backup/rsnapshot/daily.14/p01/mnt/containerdata/admin/[...]/Auth
ls: /backup/rsnapshot/daily.14/p01/mnt/containerdata/admin/[...]/Auth: Input/output error
ls: cannot open directory /backup/rsnapshot/daily.14/p01/mnt/containerdata/admin/[...]/Auth: Input/output error

Oops, an I/O error! That doesn't sound good at all. I unmounted the /backup path and ran a forced file system check:

# fsck.ext4 -f /dev/vgdata/lvbackup
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Inode 2900778 has a bad extended attribute block 524288.  Clear<y>? yes
Inode 12765722 has a bad extended attribute block 524288.  Clear<y>? yes
Inode 42844666 has a bad extended attribute block 524288.  Clear<y>? yes
Pass 2: Checking directory structure
Entry '.' in /rsnapshot/daily.5/p01/mnt/containerdata/[...] (89989522) is a link to '.'  Clear<y>? yes
Entry '..' in /rsnapshot/daily.5/p01/mnt/containerdata/[...] (89989522) is duplicate '..' entry.
Fix<y>? yes
Entry '.' in /rsnapshot/daily.7/p01/mnt/containerdata/[...] (71123012) is a link to '.'  Clear<y>? yes
Entry '..' in /rsnapshot/daily.7/p01/mnt/containerdata/[...] (71123012) is duplicate '..' entry.
Fix<y>? yes
Entry '.' in /rsnapshot/daily.8/p01/mnt/containerdata/[...] (70746560) is a link to '.'  Clear<y>? yes
Entry '..' in /rsnapshot/daily.8/p01/mnt/containerdata/[...] (70746560) is duplicate '..' entry.
Fix<y>? yes
Pass 3: Checking directory connectivity
'..' in /rsnapshot/daily.8/p01/mnt/containerdata/[...] (70746560) is <The NULL inode> (0), should be /rsnapshot/daily.8/p01/mnt/containerdata/[...] (70746549).
Fix<y>? yes
'..' in /rsnapshot/daily.7/p01/mnt/containerdata/[...] (71123012) is <The NULL inode> (0), should be /rsnapshot/daily.7/p01/mnt/containerdata/[...] (71123004).
Fix<y>? yes
'..' in /rsnapshot/daily.5/p01/mnt/containerdata/[...] (89989522) is <The NULL inode> (0), should be /rsnapshot/daily.5/p01/mnt/containerdata/[...] (89989521).
Fix<y>? yes
Pass 4: Checking reference counts
Inode 70746549 ref count is 51, should be 50.  Fix<y>? yes
Inode 70746560 ref count is 2, should be 1.  Fix<y>? yes
Inode 71123004 ref count is 7, should be 6.  Fix<y>? yes
Inode 71123012 ref count is 2, should be 1.  Fix<y>? yes
Inode 89989521 ref count is 6, should be 5.  Fix<y>? yes
Inode 89989522 ref count is 2, should be 1.  Fix<y>? yes
Pass 5: Checking group summary information

/dev/vgdata/lvbackup: ***** FILE SYSTEM WAS MODIFIED *****
/dev/vgdata/lvbackup: 11160908/170393600 files (0.2% non-contiguous), 610481642/681574400 blocks

 Note: To protect the privacy of the customer's data, I removed the full paths and directory names from the output.

Once the file system check had completed, I was able to list the directory again:

# ls -la /backup/rsnapshot/daily.14/p01/mnt/containerdata/admin/[...]/Auth/
total 52
drwxr-xr-x  3 1002 1002 4096 Dec  2 13:04 .
drwxr-xr-x  3 1002 1002 4096 May  3 02:05 ..
drwxr-xr-x  2 1002 1002 4096 Dec  2 13:05 Access
[...]

And of course rm was now working, too:

# rm -rfv /backup/rsnapshot/daily.14/p01/mnt/containerdata/admin/[...]/Auth
removed directory: ‘/backup/rsnapshot/daily.14/p01/mnt/containerdata/admin/[...]/Auth/Access’
[...]
removed directory: ‘/backup/rsnapshot/daily.14/p01/mnt/containerdata/admin/[...]/Auth’

TL;DR: When rsnapshot fails with such an error, it's worth checking the file system's health.

 

Monitoring plugin check_smart version 5.9 released
Sunday - Apr 29th 2018

A new version (5.9) of check_smart.pl, the monitoring plugin that checks the SMART attributes of hard drives and solid state drives, was released.

With this release it is now possible to use the interface type "auto". However, this automatic lookup only seems to work on ATA devices (see https://github.com/Napsty/check_smart/pull/20 for more).
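
A possible usage example with the new interface type (the device path is just an example):

# ./check_smart.pl -d /dev/sda -i auto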

 

Galera node on Ubuntu 14.04 Trusty not joining after update to MariaDB 10.0.34
Wednesday - Apr 25th 2018

After updating an Ubuntu 14.04 Trusty server, including the MariaDB and Galera packages, I noticed that the node wasn't able to join the Galera cluster anymore.

According to the apt history log, the versions were updated like this:

mariadb-galera-server:amd64 (10.0.31+maria-1~trusty, 10.0.34+maria-1~trusty),  mariadb-client-core-10.0:amd64 (10.0.32+maria-1~trusty, 10.0.34+maria-1~trusty), mariadb-common:amd64 (10.0.32+maria-1~trusty, 10.0.34+maria-1~trusty), mariadb-galera-server-10.0:amd64 (10.0.31+maria-1~trusty, 10.0.34+maria-1~trusty), galera-3:amd64 (25.3.20-trusty, 25.3.22-trusty),  mariadb-client-10.0:amd64 (10.0.32+maria-1~trusty, 10.0.34+maria-1~trusty)

I know for a fact that, before the upgrade, an automatic join after a MariaDB/MySQL start worked. But after the upgrade and an automatic MariaDB start, the Galera node wasn't able to join the Galera cluster anymore: it appeared in "wsrep_incoming_addresses" but then quickly disappeared again. I found the following log entries when I manually launched the mysqld_safe daemon:

# mysqld_safe --skip-grant-tables &

Apr 25 22:06:09 galera-node-2 mysqld_safe: WSREP: Recovered position 32c39115-8d32-11e5-9122-b6a2b3e41530:724808778
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] /usr/sbin/mysqld (mysqld 10.0.34-MariaDB-1~trusty-wsrep) starting as process 29529 ...
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: Read nil XID from storage engines, skipping position init
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: wsrep_load(): Galera 25.3.22(r3764) by Codership Oy loaded successfully.
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: CRC-32C: using hardware acceleration.
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: Found saved state: 32c39115-8d32-11e5-9122-b6a2b3e41530:-1, safe_to_bootstrap: 0
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 10.55.55.86; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S;
Apr 25 22:06:09 galera-node-2 mysqld: pc.checksum = false; pc
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: GCache history reset: 32c39115-8d32-11e5-9122-b6a2b3e41530:0 -> 32c39115-8d32-11e5-9122-b6a2b3e41530:724808778
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: Assign initial position for certification: 724808778, protocol version: -1
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: wsrep_sst_grab()
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: Start replication
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: Setting initial position to 32c39115-8d32-11e5-9122-b6a2b3e41530:724808778
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: protonet asio version 0
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: Using CRC-32C for message checksums.
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: backend: asio
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: gcomm thread scheduling priority set to other:0
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: restore pc from disk failed
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: GMCast version 0
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: (185196d1, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: (185196d1, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: EVS version 0
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: gcomm: connecting to group 'GALERACLUSTER', peer '10.55.55.85:,10.55.55.86:,10.55.55.87:'
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: (185196d1, 'tcp://0.0.0.0:4567') connection established to 185196d1 tcp://10.55.55.86:4567
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Warning] WSREP: (185196d1, 'tcp://0.0.0.0:4567') address 'tcp://10.55.55.86:4567' points to own listening address, blacklisting
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: (185196d1, 'tcp://0.0.0.0:4567') connection established to acfd2d2f tcp://10.55.55.87:4567
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: (185196d1, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: (185196d1, 'tcp://0.0.0.0:4567') connection established to d607ef0f tcp://10.55.55.85:4567
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: declaring acfd2d2f at tcp://10.55.55.87:4567 stable
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: declaring d607ef0f at tcp://10.55.55.85:4567 stable
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: Node acfd2d2f state prim
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: view(view_id(PRIM,185196d1,12) memb {
Apr 25 22:06:09 galera-node-2 mysqld: #011185196d1,0
Apr 25 22:06:09 galera-node-2 mysqld: #011acfd2d2f,0
Apr 25 22:06:09 galera-node-2 mysqld: #011d607ef0f,0
Apr 25 22:06:09 galera-node-2 mysqld: } joined {
Apr 25 22:06:09 galera-node-2 mysqld: } left {
Apr 25 22:06:09 galera-node-2 mysqld: } partitioned {
Apr 25 22:06:09 galera-node-2 mysqld: })
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: save pc into disk
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: gcomm: connected
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: Opened channel 'GALERACLUSTER'
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: Waiting for SST to complete.
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 18ea8d6b-48c4-11e8-8a41-0fef3fa593ed
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: STATE EXCHANGE: sent state msg: 18ea8d6b-48c4-11e8-8a41-0fef3fa593ed
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: STATE EXCHANGE: got state msg: 18ea8d6b-48c4-11e8-8a41-0fef3fa593ed from 0 (mysql02)
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: STATE EXCHANGE: got state msg: 18ea8d6b-48c4-11e8-8a41-0fef3fa593ed from 1 (mysql03)
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: STATE EXCHANGE: got state msg: 18ea8d6b-48c4-11e8-8a41-0fef3fa593ed from 2 (mysql01)
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: Quorum results:
Apr 25 22:06:10 galera-node-2 mysqld: #011version    = 4,
Apr 25 22:06:10 galera-node-2 mysqld: #011component  = PRIMARY,
Apr 25 22:06:10 galera-node-2 mysqld: #011conf_id    = 10,
Apr 25 22:06:10 galera-node-2 mysqld: #011members    = 2/3 (joined/total),
Apr 25 22:06:10 galera-node-2 mysqld: #011act_id     = 724815064,
Apr 25 22:06:10 galera-node-2 mysqld: #011last_appl. = -1,
Apr 25 22:06:10 galera-node-2 mysqld: #011protocols  = 0/7/3 (gcs/repl/appl),
Apr 25 22:06:10 galera-node-2 mysqld: #011group UUID = 32c39115-8d32-11e5-9122-b6a2b3e41530
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: Flow-control interval: [28, 28]
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: Trying to continue unpaused monitor
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 724815064)
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: State transfer required:
Apr 25 22:06:10 galera-node-2 mysqld: #011Group state: 32c39115-8d32-11e5-9122-b6a2b3e41530:724815064
Apr 25 22:06:10 galera-node-2 mysqld: #011Local state: 32c39115-8d32-11e5-9122-b6a2b3e41530:724808778
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: New cluster view: global state: 32c39115-8d32-11e5-9122-b6a2b3e41530:724815064, view# 11: Primary, number of nodes: 3, my index: 0, protocol version 3
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Warning] WSREP: Gap in state sequence. Need state transfer.
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'joiner' --address '10.55.55.86' --datadir '/var/lib/mysql/'   --parent '29529' --binlog '/var/log/mysql/mariadb-bin' '
Apr 25 22:06:10 galera-node-2 mysqld: /usr//bin/wsrep_sst_xtrabackup: line 397: setup_ports: command not found
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [ERROR] WSREP: Failed to read 'ready ' from: wsrep_sst_xtrabackup --role 'joiner' --address '10.55.55.86' --datadir '/var/lib/mysql/'   --parent '29529' --binlog '/var/log/mysql/mariadb-bin'
Apr 25 22:06:10 galera-node-2 mysqld: #011Read: '(null)'
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup --role 'joiner' --address '10.55.55.86' --datadir '/var/lib/mysql/'   --parent '29529' --binlog '/var/log/mysql/mariadb-bin' : 2 (No such file or directory)
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [ERROR] WSREP: Failed to prepare for 'xtrabackup' SST. Unrecoverable.
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [ERROR] Aborting
Apr 25 22:06:10 galera-node-2 mysqld:
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Closing send monitor...
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Closed send monitor.
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: gcomm: terminating thread
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: gcomm: joining thread
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: gcomm: closing backend
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: view(view_id(NON_PRIM,185196d1,12) memb {
Apr 25 22:06:12 galera-node-2 mysqld: #011185196d1,0
Apr 25 22:06:12 galera-node-2 mysqld: } joined {
Apr 25 22:06:12 galera-node-2 mysqld: } left {
Apr 25 22:06:12 galera-node-2 mysqld: } partitioned {
Apr 25 22:06:12 galera-node-2 mysqld: #011acfd2d2f,0
Apr 25 22:06:12 galera-node-2 mysqld: #011d607ef0f,0
Apr 25 22:06:12 galera-node-2 mysqld: })
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: (185196d1, 'tcp://0.0.0.0:4567') connection to peer 185196d1 with addr tcp://10.55.55.86:4567 timed out, no messages seen in PT3S
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: (185196d1, 'tcp://0.0.0.0:4567') turning message relay requesting off
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: view((empty))
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: gcomm: closed
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Flow-control interval: [16, 16]
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Trying to continue unpaused monitor
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Received NON-PRIMARY.
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Shifting PRIMARY -> OPEN (TO: 724815071)
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Received self-leave message.
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Flow-control interval: [0, 0]
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Trying to continue unpaused monitor
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Received SELF-LEAVE. Closing connection.
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 724815071)
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: RECV thread exiting 0: Success
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: recv_thread() joined.
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Closing replication queue.
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Closing slave action queue.
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Service disconnected.
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: rollbacker thread exiting
Apr 25 22:06:13 galera-node-2 mysqld: 180425 22:06:13 [Note] WSREP: Some threads may fail to exit.
Apr 25 22:06:13 galera-node-2 mysqld: 180425 22:06:13 [Note] /usr/sbin/mysqld: Shutdown complete
Apr 25 22:06:13 galera-node-2 mysqld:
Apr 25 22:06:18 galera-node-2 mysqld: Error in my_thread_global_end(): 1 threads didn't exit
Apr 25 22:06:18 galera-node-2 mysqld_safe: mysqld from pid file /var/run/mysqld/mysqld.pid ended

After the second or third time reading through the logs, one particular line caught my eye:

Apr 25 22:06:10 galera-node-2 mysqld: /usr//bin/wsrep_sst_xtrabackup: line 397: setup_ports: command not found

Huh? So this file must have changed with the update. On a still working (and not yet updated) node:

root@galera-node-1:~# ll /usr//bin/wsrep_sst_xtrabackup
-rwxr-xr-x 1 root root 21703 Jun  1  2017 /usr//bin/wsrep_sst_xtrabackup

On the updated node 2:

root@galera-node-2:~# ll /usr//bin/wsrep_sst_xtrabackup
-rwxr-xr-x 1 root root 20658 Feb  2 08:20 /usr//bin/wsrep_sst_xtrabackup
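
A quick way to see how much the two versions actually differ (a sketch, assuming SSH access from node 2 to node 1):

root@galera-node-2:~# diff <(ssh galera-node-1 cat /usr/bin/wsrep_sst_xtrabackup) /usr/bin/wsrep_sst_xtrabackup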

Checking the mentioned line 397:

391 if [[ ! ${WSREP_SST_OPT_ROLE} == 'joiner' && ! ${WSREP_SST_OPT_ROLE} == 'donor' ]];then
392     wsrep_log_error "Invalid role ${WSREP_SST_OPT_ROLE}"
393     exit 22
394 fi
395
396 read_cnf
397 setup_ports
398 get_stream
399 get_transfer
400
401 INNOEXTRA=""
402 INNOAPPLY="${INNOBACKUPEX_BIN} ${WSREP_SST_OPT_CONF} --apply-log \$rebuildcmd \${DATA} &>\${DATA}/innobackup.prepare.log"
403 INNOBACKUP="${INNOBACKUPEX_BIN} ${WSREP_SST_OPT_CONF} \$INNOEXTRA --galera-info --stream=\$sfmt \${TMPDIR} 2>\${DATA}/innobackup.backup.log"

Comparing with line 397 in the old version of the script on node-1:

387 check_extra()
388 {
389     local use_socket=1
390     if [[ $uextra -eq 1 ]];then
391         if $MY_PRINT_DEFAULTS -c $WSREP_SST_OPT_CONF mysqld | tr '_' '-' | grep -- "--thread-handlin    g=" | grep -q 'pool-of-threads';then
392             local eport=$($MY_PRINT_DEFAULTS -c $WSREP_SST_OPT_CONF mysqld | tr '_' '-' | grep -- "-    -extra-port=" | cut -d= -f2)
393             if [[ -n $eport ]];then
394                 # Xtrabackup works only locally.
395                 # Hence, setting host to 127.0.0.1 unconditionally.
396                 wsrep_log_info "SST through extra_port $eport"
397                 INNOEXTRA+=" --host=127.0.0.1 --port=$eport "
398                 use_socket=0
399             else
400                 wsrep_log_error "Extra port $eport null, failing"
401                 exit 1
402             fi
403         else
404             wsrep_log_info "Thread pool not set, ignore the option use_extra"
405         fi
406     fi
407     if [[ $use_socket -eq 1 ]] && [[ -n "${WSREP_SST_OPT_SOCKET}" ]];then
408         INNOEXTRA+=" --socket=${WSREP_SST_OPT_SOCKET}"
409     fi
410 }

Erm, yes, that's obviously quite a big difference. The sizes of the two scripts (see above) differ as well.
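
One way to get the old version back is to simply copy it over from the still working node (a sketch, assuming root SSH access between the nodes; keeping a backup of the new file first doesn't hurt):

root@galera-node-2:~# cp /usr/bin/wsrep_sst_xtrabackup /usr/bin/wsrep_sst_xtrabackup.new
root@galera-node-2:~# scp galera-node-1:/usr/bin/wsrep_sst_xtrabackup /usr/bin/wsrep_sst_xtrabackup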

After replacing the new version of the script with the old one, the node was able to join the cluster again on the next MariaDB start:

root@galera-node-2:~# service mysql start

root@galera-node-2:~# tail -f /var/log/syslog
Apr 25 22:14:07 galera-node-2 mysqld_safe: WSREP: Recovered position 32c39115-8d32-11e5-9122-b6a2b3e41530:724808778
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] /usr/sbin/mysqld (mysqld 10.0.34-MariaDB-1~trusty-wsrep) starting as process 30772 ...
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: Read nil XID from storage engines, skipping position init
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: wsrep_load(): Galera 25.3.22(r3764) by Codership Oy loaded successfully.
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: CRC-32C: using hardware acceleration.
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: Found saved state: 32c39115-8d32-11e5-9122-b6a2b3e41530:-1, safe_to_bootstrap: 0
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 10.55.55.86; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S;
Apr 25 22:14:07 galera-node-2 mysqld: pc.checksum = false; pc
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: GCache history reset: 32c39115-8d32-11e5-9122-b6a2b3e41530:0 -> 32c39115-8d32-11e5-9122-b6a2b3e41530:724808778
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: Assign initial position for certification: 724808778, protocol version: -1
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: wsrep_sst_grab()
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: Start replication
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: Setting initial position to 32c39115-8d32-11e5-9122-b6a2b3e41530:724808778
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: protonet asio version 0
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: Using CRC-32C for message checksums.
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: backend: asio
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: gcomm thread scheduling priority set to other:0
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: restore pc from disk failed
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: GMCast version 0
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: (3599655f, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: (3599655f, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: EVS version 0
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: gcomm: connecting to group 'GALERACLUSTER', peer '10.55.55.85:,10.55.55.86:,10.55.55.87:'
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: (3599655f, 'tcp://0.0.0.0:4567') connection established to 3599655f tcp://10.55.55.86:4567
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Warning] WSREP: (3599655f, 'tcp://0.0.0.0:4567') address 'tcp://10.55.55.86:4567' points to own listening address, blacklisting
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: (3599655f, 'tcp://0.0.0.0:4567') connection established to acfd2d2f tcp://10.55.55.87:4567
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: (3599655f, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: (3599655f, 'tcp://0.0.0.0:4567') connection established to d607ef0f tcp://10.55.55.85:4567
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: declaring acfd2d2f at tcp://10.55.55.87:4567 stable
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: declaring d607ef0f at tcp://10.55.55.85:4567 stable
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Node acfd2d2f state prim
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: view(view_id(PRIM,3599655f,14) memb {
Apr 25 22:14:08 galera-node-2 mysqld: #0113599655f,0
Apr 25 22:14:08 galera-node-2 mysqld: #011acfd2d2f,0
Apr 25 22:14:08 galera-node-2 mysqld: #011d607ef0f,0
Apr 25 22:14:08 galera-node-2 mysqld: } joined {
Apr 25 22:14:08 galera-node-2 mysqld: } left {
Apr 25 22:14:08 galera-node-2 mysqld: } partitioned {
Apr 25 22:14:08 galera-node-2 mysqld: })
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: save pc into disk
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: gcomm: connected
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Opened channel 'GALERACLUSTER'
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Waiting for SST to complete.
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 35e60dd7-48c5-11e8-ab36-b74cbcd15212
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: STATE EXCHANGE: sent state msg: 35e60dd7-48c5-11e8-ab36-b74cbcd15212
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: STATE EXCHANGE: got state msg: 35e60dd7-48c5-11e8-ab36-b74cbcd15212 from 0 (mysql02)
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: STATE EXCHANGE: got state msg: 35e60dd7-48c5-11e8-ab36-b74cbcd15212 from 1 (mysql03)
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: STATE EXCHANGE: got state msg: 35e60dd7-48c5-11e8-ab36-b74cbcd15212 from 2 (mysql01)
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Quorum results:
Apr 25 22:14:08 galera-node-2 mysqld: #011version    = 4,
Apr 25 22:14:08 galera-node-2 mysqld: #011component  = PRIMARY,
Apr 25 22:14:08 galera-node-2 mysqld: #011conf_id    = 12,
Apr 25 22:14:08 galera-node-2 mysqld: #011members    = 2/3 (joined/total),
Apr 25 22:14:08 galera-node-2 mysqld: #011act_id     = 724818998,
Apr 25 22:14:08 galera-node-2 mysqld: #011last_appl. = -1,
Apr 25 22:14:08 galera-node-2 mysqld: #011protocols  = 0/7/3 (gcs/repl/appl),
Apr 25 22:14:08 galera-node-2 mysqld: #011group UUID = 32c39115-8d32-11e5-9122-b6a2b3e41530
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Flow-control interval: [28, 28]
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Trying to continue unpaused monitor
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 724818998)
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: State transfer required:
Apr 25 22:14:08 galera-node-2 mysqld: #011Group state: 32c39115-8d32-11e5-9122-b6a2b3e41530:724818998
Apr 25 22:14:08 galera-node-2 mysqld: #011Local state: 32c39115-8d32-11e5-9122-b6a2b3e41530:724808778
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: New cluster view: global state: 32c39115-8d32-11e5-9122-b6a2b3e41530:724818998, view# 13: Primary, number of nodes: 3, my index: 0, protocol version 3
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Warning] WSREP: Gap in state sequence. Need state transfer.
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'joiner' --address '10.55.55.86' --datadir '/var/lib/mysql/'   --parent '30772' --binlog '/var/log/mysql/mariadb-bin' '
Apr 25 22:14:08 galera-node-2 mysqld: WSREP_SST: [INFO] Streaming with tar (20180425 22:14:08.550)
Apr 25 22:14:08 galera-node-2 mysqld: WSREP_SST: [INFO] Using socat as streamer (20180425 22:14:08.552)
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Prepared SST request: xtrabackup|10.55.55.86:4444/xtrabackup_sst
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: REPL Protocols: 7 (3, 2)
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Assign initial position for certification: 724818998, protocol version: 3
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Service thread queue flushed.
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: IST receiver addr using tcp://10.55.55.86:4568
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Prepared IST receiver, listening at: tcp://10.55.55.86:4568
Apr 25 22:14:08 galera-node-2 mysqld: WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | tar xfi - --recursive-unlink -h; RC=( ${PIPESTATUS[@]} ) (20180425 22:14:08.561)
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Member 0.0 (mysql02) requested state transfer from '*any*'. Selected 2.0 (mysql01)(SYNCED) as donor.
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 724818998)
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Requesting state transfer: success, donor: 2
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: GCache history reset: 32c39115-8d32-11e5-9122-b6a2b3e41530:0 -> 32c39115-8d32-11e5-9122-b6a2b3e41530:724818998
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: 2.0 (mysql01): State transfer to 0.0 (mysql02) complete.
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Member 2.0 (mysql01) synced with group.
Apr 25 22:14:08 galera-node-2 mysqld: WSREP_SST: [INFO] xtrabackup_ist received from donor: Running IST (20180425 22:14:08.676)
Apr 25 22:14:08 galera-node-2 mysqld: WSREP_SST: [INFO] Total time on joiner: 0 seconds (20180425 22:14:08.678)
Apr 25 22:14:08 galera-node-2 mysqld: WSREP_SST: [INFO] Removing the sst_in_progress file (20180425 22:14:08.680)
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: SST complete, seqno: 724808778
Apr 25 22:14:08 galera-node-2 mysqld: 2018-04-25 22:14:08 7f623bbb97c0 InnoDB: Warning: Using innodb_locks_unsafe_for_binlog is DEPRECATED. This option may be removed in future releases. Please use READ COMMITTED transaction isolation level instead, see http://dev.mysql.com/doc/refman/5.6/en/set-transaction.html.
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] InnoDB: Using mutexes to ref count buffer pool pages
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] InnoDB: The InnoDB memory heap is disabled
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] InnoDB: Compressed tables use zlib 1.2.8
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] InnoDB: Using Linux native AIO
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] InnoDB: Using CPU crc32 instructions
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] InnoDB: Initializing buffer pool, size = 2.0G
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] InnoDB: Completed initialization of buffer pool
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] InnoDB: Highest supported file format is Barracuda.
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] InnoDB: 128 rollback segment(s) are active.
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] InnoDB: Waiting for purge to start
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.36-83.0 started; log sequence number 1013752338227
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] Plugin 'FEEDBACK' is disabled.
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] Server socket created on IP: '0.0.0.0'.
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] WSREP: Signalling provider to continue.
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] WSREP: SST received: 32c39115-8d32-11e5-9122-b6a2b3e41530:724808778
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] WSREP: Receiving IST: 10220 writesets, seqnos 724808778-724818998
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] WSREP: Receiving IST...  0.0% (    0/10220 events) complete.
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] Reading of all Master_info entries succeded
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] Added new Master_info '' to hash table
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] /usr/sbin/mysqld: ready for connections.
Apr 25 22:14:09 galera-node-2 mysqld: Version: '10.0.34-MariaDB-1~trusty-wsrep'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  mariadb.org binary distribution, wsrep_25.21.rc3fc46e
Apr 25 22:14:11 galera-node-2 mysqld: 180425 22:14:11 [Note] WSREP: (3599655f, 'tcp://0.0.0.0:4567') connection to peer 3599655f with addr tcp://10.55.55.86:4567 timed out, no messages seen in PT3S
Apr 25 22:14:11 galera-node-2 mysqld: 180425 22:14:11 [Note] WSREP: (3599655f, 'tcp://0.0.0.0:4567') turning message relay requesting off
Apr 25 22:14:19 galera-node-2 mysqld: 180425 22:14:19 [Note] WSREP: Receiving IST... 60.0% ( 6128/10220 events) complete.
Apr 25 22:14:26 galera-node-2 mysqld: 180425 22:14:26 [Note] WSREP: Receiving IST...100.0% (10220/10220 events) complete.
Apr 25 22:14:26 galera-node-2 mysqld: 180425 22:14:26 [Note] WSREP: IST received: 32c39115-8d32-11e5-9122-b6a2b3e41530:724818998
Apr 25 22:14:26 galera-node-2 mysqld: 180425 22:14:26 [Note] WSREP: 0.0 (mysql02): State transfer from 2.0 (mysql01) complete.
Apr 25 22:14:26 galera-node-2 mysqld: 180425 22:14:26 [Note] WSREP: Shifting JOINER -> JOINED (TO: 724819126)
Apr 25 22:14:26 galera-node-2 mysqld: 180425 22:14:26 [Note] WSREP: Member 0.0 (mysql02) synced with group.
Apr 25 22:14:26 galera-node-2 mysqld: 180425 22:14:26 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 724819126)
Apr 25 22:14:26 galera-node-2 mysqld: 180425 22:14:26 [Note] WSREP: Synchronized with group, ready for connections
Apr 25 22:14:26 galera-node-2 mysqld: 180425 22:14:26 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

There seems to be a bug already opened for this issue (https://jira.mariadb.org/browse/MDEV-15254?attachmentViewMode=list), but that one is about MariaDB 10.1.x and not 10.0.x as in my case on Ubuntu 14.04, so I opened https://jira.mariadb.org/browse/MDEV-16032.
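
Despite the warning, the node joined the cluster again and synced via IST without problems. As a quick sanity check after such a restart (not related to the bug itself), the usual wsrep status variables can be queried; with all three nodes up, wsrep_cluster_size should report 3 and wsrep_local_state_comment should show Synced:

# mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_cluster_size','wsrep_local_state_comment','wsrep_ready')"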

 

New version 20180411 of check_esxi_hardware released
Wednesday - Apr 11th 2018 - by - (0 comments)

A new version (20180411) of the monitoring plugin check_esxi_hardware is available. This version contains an additional check which returns an UNKNOWN state if, for any reason, the CIM EnumerateInstances function doesn't return anything useful (error message "ThreadPool --- Failed to enqueue request").

Thanks to Peter Newman for the code contribution.
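
For reference, a basic invocation of the plugin typically looks like this (hostname and credentials below are placeholders, adjust to your environment):

./check_esxi_hardware.py -H esxi01.example.com -U root -P mypassword

If the CIM EnumerateInstances call fails with the mentioned ThreadPool error, this version now exits with an UNKNOWN state instead of a normal check result.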

 

Unable to download attachment from Gmail, how to get it anyway (by using Takeout)
Monday - Apr 9th 2018 - by - (0 comments)

For backup and archive purposes I usually create a config backup of each of my monitoring setups. A couple of years back, in 2012, I simply sent the tar.gz file to my Gmail mailbox. Now that I had to look up the usage of a monitoring plugin I used back then, it turned out that this wasn't the best decision: Gmail now decided that the attachment contains a virus (probably because the monitoring plugins in the compressed archive are executable, but I'm not certain) and didn't allow me to download the file.

Gmail attachment disabled

Luckily there's a way around it. Unfortunately it takes some time, but if you need the data in that attachment, you still gotta do it, right? (Imagine it's a Bitcoin wallet from 2012! ^_^)

Google offers an "archive" service, called Takeout, where you can download your data. You can access this service at https://takeout.google.com/. In this case I only needed the archive of my Gmail mailbox, so I only selected Gmail. After clicking Next, the "split size" can be chosen. If your mailbox is larger than the chosen size, it will be split into several files (with the suffix .mbox).

Now I had to wait... The archive isn't ready immediately. A day later I got an e-mail telling me that my archive was ready for download:

Google data archive is ready

So I went on and downloaded archive file 1 of 2.
As mentioned above, the file is an mbox file, which is basically your whole mailbox in plain text format. The mutt command is perfectly capable of reading such an mbox file, so there's no need to install a GUI client (like Thunderbird) just to extract the attachment I need.

$ mutt -f /tmp/All\ mail\ Including\ Spam\ and\ Trash-001.mbox
Reading /tmp/All mail Including Spam and Trash-001.mbox... 359578 (86%)

Once mutt loaded all e-mails, the following overview was presented to me:

Opening mailbox with mutt

I wasn't happy with this; I needed to see the year of each mail at a glance to quickly find the one I was looking for. So I adapted my ~/.muttrc and added the year to the date format:

$ cat ~/.muttrc
set index_format="%4C %Z %{%Y %b %d} %-15.15L (%4l) %s"

After opening the mailbox again with mutt, I now saw the year; however, the mails were not sorted the way I wanted them.

Displaying year in mutt index

To solve this, I added another line (set sort) in muttrc to sort by date:

$ cat ~/.muttrc
set index_format="%4C %Z %{%Y %b %d} %-15.15L (%4l) %s"
set sort = date

I opened the mailbox again and this time the oldest mails of the mailbox were shown first, so I used "PageDown" to scroll down to the date I needed. And finally, there was my mail:

Found Gmail archive mail in mbox mutt 

By selecting the mail with the cursor and hitting "Enter", the mail is opened:

Opening a mail in mutt 

The attachment can be shown by pressing "v":

View attachment in mutt 

I moved the cursor down to the tar.gz file and pressed "s" to save the file in /tmp:

Save attachment in mutt 

And finally I was able to open the file:

Open tar gz archive  
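
For completeness, this is roughly how the saved attachment can then be handled in the shell (the file name is just an example here, yours will differ):

$ tar -tzf /tmp/monitoring-backup.tar.gz        # list the contents first
$ mkdir /tmp/restore && tar -xzf /tmp/monitoring-backup.tar.gz -C /tmp/restore   # then extract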

 

Monitor dual storage (raid) controller on IBM x3650 M4
Friday - Apr 6th 2018 - by - (0 comments)

I wanted to monitor the current RAID status on an IBM x3650 M4 server, simply by using check_raid. I've been using this plugin for years and it supports most software and hardware RAID controllers. I've never had any problems with it (once I had installed the required CLI tools for each hardware controller) - until today.

Due to a very strange hardware setup, inherited from an ex-colleague, the server turned out to have two different RAID controllers active: 12 physical drives are attached to one controller, 2 physical drives to the other.

Once I installed the megacli command (from http://hwraid.le-vert.net/), the plugin correctly identified the physical drives behind /dev/sda:

# /usr/lib/nagios/plugins/check_raid -l
megacli
1 active plugins

# /usr/lib/nagios/plugins/check_raid
WARNING: megacli:[Volumes(1): DISK0.0:Optimal,WriteCache:DISABLED; Devices(12): 11,08,01,03,09,10,04,06,12,07,02,05=Online]

To disable the warning on the disabled WriteCache:

# /usr/lib/nagios/plugins/check_raid --cache-fail=OK
OK: megacli:[Volumes(1): DISK0.0:Optimal,WriteCache:DISABLED; Devices(12): 11,08,01,03,09,10,04,06,12,07,02,05=Online]
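
If you run check_raid through NRPE, for example, the same option simply goes into the command definition. A minimal sketch, assuming the usual config location (adjust path and layout to your setup):

# /etc/nagios/nrpe.cfg (or a snippet in nrpe.d/)
command[check_raid]=/usr/lib/nagios/plugins/check_raid --cache-fail=OK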

But where are the other two physical drives? From my experience with hardware RAID controllers I was pretty sure that megacli is able to detect multiple controllers and retrieve the drive information from all of them.
A manual verification using megacli still only returned 12 drives:

# megacli -CfgDsply -aall |grep Physical
Physical Disk Information:
Physical Disk: 0
Physical Sector Size:  512
Physical Disk: 1
Physical Sector Size:  512
Physical Disk: 2
Physical Sector Size:  512
Physical Disk: 3
Physical Sector Size:  512
Physical Disk: 4
Physical Sector Size:  512
Physical Disk: 5
Physical Sector Size:  512
Physical Disk Information:
Physical Disk: 0
Physical Sector Size:  512
Physical Disk: 1
Physical Sector Size:  512
Physical Disk: 2
Physical Sector Size:  512
Physical Disk: 3
Physical Sector Size:  512
Physical Disk: 4
Physical Sector Size:  512
Physical Disk: 5
Physical Sector Size:  512
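
By the way, megacli can also report how many adapters it actually sees. The following side check (I'm only showing the command, as the exact output varies by megacli version) should confirm that megacli only knows about a single controller here:

# megacli -adpCount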

Thankfully a colleague, who was recently working on that particular server, took a screenshot of the storage controller menu during the boot process:

Two different storage controllers in the same server 

As it turns out, there are two different storage controllers built into that server. One is a MegaRAID controller (ServeRAID M5210) and the other is an MPT controller. No wonder megacli wasn't able to find the remaining drives.

I tried again with "mpt-status" (http://hwraid.le-vert.net/wiki/LSIFusionMPT), but this didn't show any config:

# apt-get install mpt-status

# /usr/sbin/mpt-status -p
Checking for SCSI ID:0
ioctl: No such device

I removed mpt-status again and went on to try the command "sas2ircu" for newer MPT cards (http://hwraid.le-vert.net/wiki/LSIFusionMPTSAS2).
Finally I got some output:

# apt-get install sas2ircu

# sas2ircu LIST
LSI Corporation SAS2 IR Configuration Utility.
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2009-2013 LSI Corporation. All rights reserved.


         Adapter      Vendor  Device                       SubSys  SubSys
 Index    Type          ID      ID    Pci Address          Ven ID  Dev ID
 -----  ------------  ------  ------  -----------------    ------  ------
   0     SAS2004     1000h    70h   00h:0ah:00h:00h      1014h   040eh
SAS2IRCU: Utility Completed Successfully.
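
With the adapter index known (0 in this case), the volume and drive details behind the second controller can also be queried directly with sas2ircu (output not shown here):

# sas2ircu 0 DISPLAY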

And, hooray, check_raid was now able to read the information from both controllers:

# /usr/lib/nagios/plugins/check_raid -l
megacli
sas2ircu
2 active plugins

# /usr/lib/nagios/plugins/check_raid --cache-fail=OK
OK: megacli:[Volumes(1): DISK0.0:Optimal,WriteCache:DISABLED; Devices(12): 11,08,01,03,09,10,04,06,12,07,02,05=Online]; sas2ircu:[ctrl #0: 1 Vols: Optimal: 2 Drives: Optimal (OPT)::]


 

