
Install Linux Mint 18.3 on software raid (dmraid) device
Saturday - May 26th 2018 - by - (0 comments)

When I revamped my computer (bought in 2011) a few days ago, I replaced the two internal 750GB hard drives with three SSDs.

  • Drive 1 (/dev/sda) 500GB Samsung 850 Evo
  • Drive 2 (/dev/sdb) 500GB Samsung 850 Evo
  • Drive 3 (/dev/sdc) 120GB ADATA SSD S510

I wanted to use both Samsung drives as a raid-1 for my main installation, Linux Mint 18.3.

The standalone SSD would be used for a Windows 7 installation for dual booting.

When I launched the Linux Mint 18.3 installer, I couldn't find any option to create a software raid. So I created the arrays manually (mdadm...) and restarted the installer. At the end of the installation the installer asks to reboot. That's what I did, only to end up in the grub loader, which couldn't find any operating system. Great :-/

After some trial and error, I finally found a way that works. If you want to install Linux Mint 18.3 on a software raid, follow these steps. Make sure you are using the correct device names; in my case they were /dev/sda and /dev/sdb.
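
If you are unsure which device names your drives were given, a quick lsblk overview helps (just an illustration, run it on your own system):

mint ~ # lsblk -o NAME,SIZE,MODEL,TYPE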

1) Create the partitions on /dev/sda

I chose a very simple approach here with two partitions. The main partition fills up almost the whole disk, leaving only 4GB for the second partition (swap).

mint ~ # sfdisk -l /dev/sda
Disk /dev/sda: 465.8 GiB, 500107862016 bytes, 976773168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x4f1db047

Device     Boot     Start       End   Sectors   Size Id Type
/dev/sda1  *         2048 968384511 968382464 461.8G 83 Linux
/dev/sda2       968384512 976773119   8388608     4G 82 Linux swap / Solaris
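
If you are starting from blank disks and want to reproduce this layout, a rough sketch using an sfdisk script could look like this (the sizes match my 500GB drive and are only an example; adjust them to your disk):

mint ~ # sfdisk /dev/sda <<'EOF'
label: dos
start=2048, size=968382464, type=83, bootable
start=968384512, size=8388608, type=82
EOF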

2) Copy the partition table from SDA to SDB

The following command dumps (-d) the partition table from /dev/sda and inserts it into /dev/sdb:

mint ~ # sfdisk -d /dev/sda | sfdisk /dev/sdb
Checking that no-one is using this disk right now ... OK

Disk /dev/sdb: 465.8 GiB, 500107862016 bytes, 976773168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x16a9c579

Old situation:

>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Created a new DOS disklabel with disk identifier 0x4f1db047.
Created a new partition 1 of type 'Linux' and of size 461.8 GiB.
/dev/sdb2: Created a new partition 2 of type 'Linux swap / Solaris' and of size 4 GiB.
/dev/sdb3:
New situation:

Device     Boot     Start       End   Sectors   Size Id Type
/dev/sdb1  *         2048 968384511 968382464 461.8G 83 Linux
/dev/sdb2       968384512 976773119   8388608     4G 82 Linux swap / Solaris

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.

3) Create the raid devices

First I created /dev/md0, which will hold the Linux Mint installation:

mint ~ # mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm: /dev/sda1 appears to contain an ext2fs file system
       size=484191232K  mtime=Fri May 25 15:31:47 2018
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: /dev/sdb1 appears to be part of a raid array:
       level=raid1 devices=2 ctime=Fri May 25 12:50:09 2018
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

Then create /dev/md1 which will be used as swap partition:

mint ~ # mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: /dev/sdb2 appears to be part of a raid array:
       level=raid1 devices=2 ctime=Fri May 25 12:50:23 2018
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
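
If you want to double-check the freshly created arrays at this point, mdadm --detail shows the full picture:

mint ~ # mdadm --detail /dev/md0
mint ~ # mdadm --detail /dev/md1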

4) Wait for the sync to be completed

mint ~ # cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0]
      4190208 blocks super 1.2 [2/2] [UU]
          resync=DELAYED
     
md0 : active raid1 sdb1[1] sda1[0]
      484060160 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  0.7% (3421824/484060160) finish=39.7min speed=201283K/sec
      bitmap: 4/4 pages [16KB], 65536KB chunk

unused devices: <none>

Yes, patience you must have.
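
To follow the progress without retyping the command over and over, watch does the job:

mint ~ # watch -n 10 cat /proc/mdstat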

mint ~ # cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0]
      4190208 blocks super 1.2 [2/2] [UU]
     
md0 : active raid1 sdb1[1] sda1[0]
      484060160 blocks super 1.2 [2/2] [UU]
      bitmap: 0/4 pages [0KB], 65536KB chunk

unused devices: <none>

5) Format the raid device /dev/md0

I will be using an ext4 filesystem, so:

mint ~ # mkfs.ext4 /dev/md0
mke2fs 1.42.13 (17-May-2015)
Discarding device blocks: done                           
Creating filesystem with 121015040 4k blocks and 30261248 inodes
Filesystem UUID: 8f662d46-4759-4b81-b879-eb60dd643f41
Superblock backups stored on blocks:
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
    4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
    102400000

Allocating group tables: done                           
Writing inode tables: done                           
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
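
Note that I did not format the swap array /dev/md1 here; selecting it as swap in the installer (step 7) should take care of that. If you prefer to prepare it manually anyway, mkswap would do it:

mint ~ # mkswap /dev/md1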

6) Launch the installer

But launch it from the command line:

mint ~ # ubiquity -b

7) In the installer...

When the installer asks about the installation type, select "Something else":

Linux Mint Install Something Else 

The raid devices /dev/md0 (with ext4 as type) and /dev/md1 (with swap as type) should be shown in the list:

Linux Mint 18.3 partitions 

Double-click on the /dev/md0 row and select to use the partition as ext4, mounted as /.

Double-click on the /dev/md1 row and select to use the partition as swap.

Important: Make sure the other swap partitions (on devices SDA and SDB) are set to "do not use this partition". Otherwise the installer fails and somehow crashes...

Make sure the /dev/md0 row is selected, then click on "Install now":

Linux Mint 

8) At the end of the installation...

Very important: DO NOT click on "Restart Now". Click on "Continue Testing" instead. Otherwise you will run into the same failing boot I described at the beginning of this article.

Linux Mint installation completed 

9) Prepare the Linux Mint installation to chroot into

Launch a terminal window and mount /dev/md0:

mint ~ # mount /dev/md0 /mnt

Also make sure you bind-mount the dev, sys and proc file systems into /mnt:

mint ~ # for i in /dev /dev/pts /sys /proc; do mount --bind $i /mnt/$i; done

In case the resolv.conf inside the Linux Mint installation is empty, enter a nameserver manually:

mint ~ # cat /mnt/etc/resolv.conf
mint ~ # echo "nameserver 1.1.1.1" > /mnt/etc/resolv.conf

Now chroot into your Linux Mint installation, mounted as /mnt:

mint ~ # chroot /mnt

10) Fix grub in the terminal

Now install the package mdadm into the Linux Mint installation. I will show the full output here:

mint / # apt-get install mdadm
Reading package lists... Done
Building dependency tree      
Reading state information... Done
Suggested packages:
  default-mta | mail-transport-agent dracut-core
The following NEW packages will be installed:
  mdadm
0 upgraded, 1 newly installed, 0 to remove and 326 not upgraded.
Need to get 394 kB of archives.
After this operation, 1,208 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 mdadm amd64 3.3-2ubuntu7.6 [394 kB]
Fetched 394 kB in 0s (1,491 kB/s)
Preconfiguring packages ...
Selecting previously unselected package mdadm.
(Reading database ... 199757 files and directories currently installed.)
Preparing to unpack .../mdadm_3.3-2ubuntu7.6_amd64.deb ...
Unpacking mdadm (3.3-2ubuntu7.6) ...
Processing triggers for systemd (229-4ubuntu21) ...
Processing triggers for ureadahead (0.100.0-19) ...
Processing triggers for doc-base (0.10.7) ...
Processing 4 added doc-base files...
Registering documents with scrollkeeper...
Processing triggers for man-db (2.7.5-1) ...
Setting up mdadm (3.3-2ubuntu7.6) ...
Generating mdadm.conf... done.
update-initramfs: deferring update (trigger activated)
Generating grub configuration file ...
Warning: Setting GRUB_TIMEOUT to a non-zero value when GRUB_HIDDEN_TIMEOUT is set is no longer supported.
Found linux image: /boot/vmlinuz-4.10.0-38-generic
Found initrd image: /boot/initrd.img-4.10.0-38-generic
Found memtest86+ image: /boot/memtest86+.elf
Found memtest86+ image: /boot/memtest86+.bin
ERROR: isw: Could not find disk /dev/sdd in the metadata
ERROR: isw: Could not find disk /dev/sdd in the metadata
ERROR: isw: Could not find disk /dev/sdd in the metadata
ERROR: isw: Could not find disk /dev/sdd in the metadata
ERROR: isw: Could not find disk /dev/sdd in the metadata
ERROR: isw: Could not find disk /dev/sdd in the metadata
ERROR: isw: Could not find disk /dev/sdd in the metadata
ERROR: isw: Could not find disk /dev/sdd in the metadata
File descriptor 3 (pipe:[1799227]) leaked on lvs invocation. Parent PID 29155: /bin/sh
  /run/lvm/lvmetad.socket: connect failed: No such file or directory
  WARNING: Failed to connect to lvmetad. Falling back to internal scanning.
Found Windows 7 (loader) on /dev/sdd1
done
Running in chroot, ignoring request.
update-rc.d: warning: start and stop actions are no longer supported; falling back to defaults
Processing triggers for systemd (229-4ubuntu21) ...
Processing triggers for ureadahead (0.100.0-19) ...
Processing triggers for initramfs-tools (0.122ubuntu8.9) ...
update-initramfs: Generating /boot/initrd.img-4.10.0-38-generic
Warning: No support for locale: en_US.utf8

You can ignore the errors about drive /dev/sdd (it was an additional USB drive, nothing to do with the installation).

Very important here: When mdadm was installed into the Linux Mint installation, a new kernel initramfs was generated, the grub config was created and mdadm.conf was written. Obviously this step (installing mdadm into the target system and thereby making it aware of running on a Linux software raid) was skipped by the installer...
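
Should mdadm already be present in the target system, the same files can be (re)generated manually. Roughly (a sketch; the mkconf helper is the usual Debian/Ubuntu path):

mint / # /usr/share/mdadm/mkconf > /etc/mdadm/mdadm.conf
mint / # update-initramfs -u
mint / # update-grub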

11) Verification

After mdadm was installed, a bunch of necessary files were created. Let's start with grub:

mint / # ll /boot/grub/
total 2368
drwxr-xr-x 2 root root    4096 May 26 08:08 ./
drwxr-xr-x 3 root root    4096 May 26 08:08 ../
-rw-r--r-- 1 root root     712 Nov 24  2017 gfxblacklist.txt
-r--r--r-- 1 root root    9734 May 26 08:08 grub.cfg
-rw-r--r-- 1 root root 2398585 Nov 24  2017 unicode.pf2

grub.cfg was only created once mdadm was installed. No wonder booting was not possible without this manual fix.

What does it contain?

mint / # cat /boot/grub/grub.cfg
#
# DO NOT EDIT THIS FILE
#
# It is automatically generated by grub-mkconfig using templates
# from /etc/grub.d and settings from /etc/default/grub
#
[...]
insmod part_msdos
insmod part_msdos
insmod diskfilter
insmod mdraid1x
insmod ext2
set root='mduuid/650864013e2d41cf1f2acfeafc5c2bd7'
if [ x$feature_platform_search_hint = xy ]; then
  search --no-floppy --fs-uuid --set=root --hint='mduuid/650864013e2d41cf1f2acfeafc5c2bd7'  8f662d46-4759-4b81-b879-eb60dd643f41
else
  search --no-floppy --fs-uuid --set=root 8f662d46-4759-4b81-b879-eb60dd643f41
fi
    font="/usr/share/grub/unicode.pf2"
fi
[...]
menuentry 'Linux Mint 18.3 Cinnamon 64-bit' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-8f662d46-4759-4b81-b879-eb60dd643f41' {
    recordfail
    load_video
    gfxmode $linux_gfx_mode
    insmod gzio
    if [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi
    insmod part_msdos
    insmod part_msdos
    insmod diskfilter
    insmod mdraid1x
    insmod ext2
    set root='mduuid/650864013e2d41cf1f2acfeafc5c2bd7'
    if [ x$feature_platform_search_hint = xy ]; then
      search --no-floppy --fs-uuid --set=root --hint='mduuid/650864013e2d41cf1f2acfeafc5c2bd7'  8f662d46-4759-4b81-b879-eb60dd643f41
    else
      search --no-floppy --fs-uuid --set=root 8f662d46-4759-4b81-b879-eb60dd643f41
    fi
        linux    /boot/vmlinuz-4.10.0-38-generic root=UUID=8f662d46-4759-4b81-b879-eb60dd643f41 ro  quiet splash $vt_handoff
    initrd    /boot/initrd.img-4.10.0-38-generic
}
submenu 'Advanced options for Linux Mint 18.3 Cinnamon 64-bit' $menuentry_id_option 'gnulinux-advanced-8f662d46-4759-4b81-b879-eb60dd643f41' {
    menuentry 'Linux Mint 18.3 Cinnamon 64-bit, with Linux 4.10.0-38-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.10.0-38-generic-advanced-8f662d46-4759-4b81-b879-eb60dd643f41' {
        recordfail
        load_video
        gfxmode $linux_gfx_mode
        insmod gzio
        if [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi
        insmod part_msdos
        insmod part_msdos
        insmod diskfilter
        insmod mdraid1x
        insmod ext2
        set root='mduuid/650864013e2d41cf1f2acfeafc5c2bd7'
        if [ x$feature_platform_search_hint = xy ]; then
          search --no-floppy --fs-uuid --set=root --hint='mduuid/650864013e2d41cf1f2acfeafc5c2bd7'  8f662d46-4759-4b81-b879-eb60dd643f41
        else
          search --no-floppy --fs-uuid --set=root 8f662d46-4759-4b81-b879-eb60dd643f41
        fi
        echo    'Loading Linux 4.10.0-38-generic ...'
            linux    /boot/vmlinuz-4.10.0-38-generic root=UUID=8f662d46-4759-4b81-b879-eb60dd643f41 ro  quiet splash $vt_handoff
        echo    'Loading initial ramdisk ...'
        initrd    /boot/initrd.img-4.10.0-38-generic
    }
    menuentry 'Linux Mint 18.3 Cinnamon 64-bit, with Linux 4.10.0-38-generic (upstart)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.10.0-38-generic-init-upstart-8f662d46-4759-4b81-b879-eb60dd643f41' {
        recordfail
        load_video
        gfxmode $linux_gfx_mode
        insmod gzio
        if [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi
        insmod part_msdos
        insmod part_msdos
        insmod diskfilter
        insmod mdraid1x
        insmod ext2
        set root='mduuid/650864013e2d41cf1f2acfeafc5c2bd7'
        if [ x$feature_platform_search_hint = xy ]; then
          search --no-floppy --fs-uuid --set=root --hint='mduuid/650864013e2d41cf1f2acfeafc5c2bd7'  8f662d46-4759-4b81-b879-eb60dd643f41
        else
          search --no-floppy --fs-uuid --set=root 8f662d46-4759-4b81-b879-eb60dd643f41
        fi
        echo    'Loading Linux 4.10.0-38-generic ...'
            linux    /boot/vmlinuz-4.10.0-38-generic root=UUID=8f662d46-4759-4b81-b879-eb60dd643f41 ro  quiet splash $vt_handoff init=/sbin/upstart
        echo    'Loading initial ramdisk ...'
        initrd    /boot/initrd.img-4.10.0-38-generic
    }
    menuentry 'Linux Mint 18.3 Cinnamon 64-bit, with Linux 4.10.0-38-generic (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.10.0-38-generic-recovery-8f662d46-4759-4b81-b879-eb60dd643f41' {
        recordfail
        load_video
        insmod gzio
        if [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi
        insmod part_msdos
        insmod part_msdos
        insmod diskfilter
        insmod mdraid1x
        insmod ext2
        set root='mduuid/650864013e2d41cf1f2acfeafc5c2bd7'
        if [ x$feature_platform_search_hint = xy ]; then
          search --no-floppy --fs-uuid --set=root --hint='mduuid/650864013e2d41cf1f2acfeafc5c2bd7'  8f662d46-4759-4b81-b879-eb60dd643f41
        else
          search --no-floppy --fs-uuid --set=root 8f662d46-4759-4b81-b879-eb60dd643f41
        fi
        echo    'Loading Linux 4.10.0-38-generic ...'
            linux    /boot/vmlinuz-4.10.0-38-generic root=UUID=8f662d46-4759-4b81-b879-eb60dd643f41 ro recovery nomodeset
        echo    'Loading initial ramdisk ...'
        initrd    /boot/initrd.img-4.10.0-38-generic
    }
}

### END /etc/grub.d/10_linux ###
[...]

As root device (set root), a UUID using mdraid (mduuid/650864013e2d41cf1f2acfeafc5c2bd7) is used. Let's double-check that against the entries in mdadm.conf:

mint / # cat /etc/mdadm/mdadm.conf
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default (built-in), scan all partitions (/proc/partitions) and all
# containers for MD superblocks. alternatively, specify devices to scan, using
# wildcards if desired.
#DEVICE partitions containers

# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes

# automatically tag new arrays as belonging to the local system
HOMEHOST <system>

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays
ARRAY /dev/md/0  metadata=1.2 UUID=65086401:3e2d41cf:1f2acfea:fc5c2bd7 name=mint:0
ARRAY /dev/md/1  metadata=1.2 UUID=ad052a0a:eb1f9198:ec842848:215b650b name=mint:1
ARRAY metadata=imsm UUID=0b24ad7f:9b251541:a98a3748:f6333faa
ARRAY /dev/md/RAID1 container=0b24ad7f:9b251541:a98a3748:f6333faa member=0 UUID=aaa62640:f0d57fc8:6c097c8f:547b9c8f

# This file was auto-generated on Sat, 26 May 2018 08:08:42 +0200
# by mkconf $Id$

The UUID for /dev/md/0 looks familiar ;-). It's the same UUID as used in the grub config. So far so good.

Let's check /etc/fstab, too:

mint / # cat /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
# / was on /dev/md0 during installation
UUID=8f662d46-4759-4b81-b879-eb60dd643f41 /               ext4    errors=remount-ro 0       1
# swap was on /dev/md1 during installation
UUID=b8277371-01fb-4aa3-bec1-9c0a4295deea none            swap    sw              0       0

Here other UUIDs are used (because these are filesystem UUIDs, not mdraid UUIDs). We can verify them by checking /dev/disk/by-uuid:

mint / # ls -la /dev/disk/by-uuid/
total 0
drwxr-xr-x 2 root root 180 May 26 08:05 .
drwxr-xr-x 6 root root 120 May 25 18:59 ..
lrwxrwxrwx 1 root root  10 May 26 08:05 101EC9371EC9171E -> ../../sdd2
lrwxrwxrwx 1 root root  10 May 26 08:05 1EF881D5F881AB99 -> ../../sdc1
lrwxrwxrwx 1 root root   9 May 26 08:05 2017-11-24-13-25-42-00 -> ../../sr0
lrwxrwxrwx 1 root root   9 May 26 08:08 8f662d46-4759-4b81-b879-eb60dd643f41 -> ../../md0
lrwxrwxrwx 1 root root   9 May 26 08:08 b8277371-01fb-4aa3-bec1-9c0a4295deea -> ../../md1
lrwxrwxrwx 1 root root  10 May 26 08:05 CEAABB7CAABB601F -> ../../sdd1
lrwxrwxrwx 1 root root  10 May 26 08:05 E646DD2A46DCFBED -> ../../sdd3

Both UUIDs used in fstab (for the root filesystem and for swap) are there.

12) Grub install on the physical drives

Now that everything looks in order, we can install grub to /dev/sda and /dev/sdb.

mint / # grub-install /dev/sda
Installing for i386-pc platform.
Installation finished. No error reported.

mint / # grub-install /dev/sdb
Installing for i386-pc platform.
Installation finished. No error reported.

Very good, no errors.

Now I exited the chroot environment and rebooted the machine.

mint / # exit

mint ~ # reboot

13) Booting

And finally, success: Linux Mint 18.3 now boots from the software raid-1 /dev/md0 device.

Grub 

Linux Mint finished booting on software raid

 

Hardware Error: Parity error during data load. Or: Clean me!!!
Thursday - May 24th 2018 - by - (0 comments)

For a couple of months I've been wondering about the following error messages, appearing on my NAS (an HP Proliant N40L Microserver running Debian 7 Wheezy) every five minutes:

[Hardware Error]: CPU:0 (10:6:3) MC2_STATUS[-|CE|-|-|AddrV|CECC]: 0x940040000000018a
[Hardware Error]: MC2_ADDR: 0x00000000d3b42540
[Hardware Error]: MC2 Error: : SNP error during data copyback.
[Hardware Error]: cache level: L2, tx: GEN, mem-tx: SNP
[Hardware Error]: Corrected error, no action required.

I came across a couple of articles about these messages, but none offered a real solution to the problem. Some even said these logged error messages could simply be ignored...

A couple of days ago I upgraded the NAS server from Debian Wheezy to Jessie (as a mid-way upgrade to Stretch) and realized after the successful OS upgrade that the log entries now appeared ALL THE TIME. I couldn't even use the terminal anymore because it was flooded by these messages:

[ 1026.904428] [Hardware Error]: CPU:0 (10:6:3) MC2_STATUS[-|CE|-|-|AddrV|CECC]: 0x940040000000018a
[ 1026.910229] [Hardware Error]: MC2_ADDR: 0x00000000d3b42540
[ 1026.915945] [Hardware Error]: MC2 Error: : SNP error during data copyback.
[ 1026.921690] [Hardware Error]: cache level: L2, tx: GEN, mem-tx: SNP
[ 1027.182836] [Hardware Error]: Corrected error, no action required.
[ 1027.188553] [Hardware Error]: CPU:0 (10:6:3) MC2_STATUS[-|CE|-|-|AddrV|CECC]: 0x940040000000018a
[ 1027.194345] [Hardware Error]: MC2_ADDR: 0x0000000001af2540
[ 1027.200132] [Hardware Error]: MC2 Error: : SNP error during data copyback.
[ 1027.205915] [Hardware Error]: cache level: L2, tx: GEN, mem-tx: SNP
[ 1027.338890] [Hardware Error]: Corrected error, no action required.
[ 1027.344632] [Hardware Error]: CPU:0 (10:6:3) MC1_STATUS[-|CE|-|-|AddrV]: 0x9400000000000151
[ 1027.350428] [Hardware Error]: MC1_ADDR: 0x0000ffff81012550
[ 1027.356222] [Hardware Error]: MC1 Error: Parity error during data load.
[ 1027.361997] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
[ 1027.430924] [Hardware Error]: Corrected error, no action required.
[ 1027.436645] [Hardware Error]: CPU:0 (10:6:3) MC1_STATUS[-|CE|-|-|AddrV]: 0x9400000000000151
[ 1027.442419] [Hardware Error]: MC1_ADDR: 0x0000ffff810b2550
[ 1027.448216] [Hardware Error]: MC1 Error: Parity error during data load.
[ 1027.453960] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
[ 1027.939102] [Hardware Error]: Corrected error, no action required.

Damn. It's time to dig into that problem again. This time I got luckier and came across a forum thread.

The most interesting comment posted there was:

"It is most likely a CPU fan dust bunny. That's the signal from the kernel to clean those out."

As easy as this sounds, it made sense. The Microserver has been running day and night since it became my NAS server in December 2012 (see the article Building a home file server with HP Proliant N40L). That's more than 5 years of total run time. As you might be aware, the motherboard of this Microserver sits under the drive cage and is not easily accessible, and therefore not easily cleanable either.

I gave it a shot, shut down the server, removed the cables from the motherboard and pulled it out.

Dust on the heat sink causing hardware errors in kernel log 

There it is. A thick layer of dust sitting on the CPU's heat sink.

I cleaned the motherboard (vacuumed the dust off), re-attached the cables and pushed the motherboard back into position. Moment of truth. I booted the server.

Checking syslog, you can easily see when I turned the server off (15:28) and booted it again (15:42):

May 24 15:28:04 nas kernel: [77872.129490] [Hardware Error]: CPU:0 (10:6:3) MC1_STATUS[-|CE|-|-|AddrV]: 0x9400000000000151
May 24 15:28:04 nas kernel: [77872.135237] [Hardware Error]: MC1_ADDR: 0x0000ffff810b2550
May 24 15:28:04 nas kernel: [77872.140955] [Hardware Error]: MC1 Error: Parity error during data load.
May 24 15:28:04 nas kernel: [77872.146656] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
May 24 15:28:04 nas kernel: [77872.263866] [Hardware Error]: Corrected error, no action required.
May 24 15:28:04 nas kernel: [77872.269509] [Hardware Error]: CPU:0 (10:6:3) MC2_STATUS[-|CE|-|-|AddrV|CECC]: 0x940040000000018a
May 24 15:28:04 nas kernel: [77872.275283] [Hardware Error]: MC2_ADDR: 0x0000000001af2540
May 24 15:28:04 nas kernel: [77872.280990] [Hardware Error]: MC2 Error: : SNP error during data copyback.
May 24 15:28:04 nas kernel: [77872.286694] [Hardware Error]: cache level: L2, tx: GEN, mem-tx: SNP
May 24 15:28:04 nas kernel: [77872.323890] [Hardware Error]: Corrected error, no action required.
May 24 15:28:04 nas kernel: [77872.329552] [Hardware Error]: CPU:0 (10:6:3) MC1_STATUS[-|CE|-|-|AddrV]: 0x9400000000000151
May 24 15:28:04 nas kernel: [77872.335294] [Hardware Error]: MC1_ADDR: 0x0000ffff810b2550
May 24 15:28:04 nas kernel: [77872.341013] [Hardware Error]: MC1 Error: Parity error during data load.
May 24 15:28:04 nas kernel: [77872.346716] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
May 24 15:28:04 nas kernel: [77872.371793] [Hardware Error]: Corrected error, no action required.
May 24 15:28:04 nas kernel: [77872.377085] [Hardware Error]: CPU:0 (10:6:3) MC1_STATUS[-|CE|-|-|AddrV]: 0x9400000000000151
May 24 15:28:04 nas kernel: [77872.382397] [Hardware Error]: MC1_ADDR: 0x0000ffff810b2540
May 24 15:28:04 nas kernel: [77872.387718] [Hardware Error]: MC1 Error: Parity error during data load.
May 24 15:28:04 nas kernel: [77872.393030] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
May 24 15:42:13 nas kernel: [    0.000000] Initializing cgroup subsys cpuset
May 24 15:42:13 nas kernel: [    0.000000] Initializing cgroup subsys cpu
May 24 15:42:13 nas kernel: [    0.000000] Initializing cgroup subsys cpuacct
May 24 15:42:13 nas kernel: [    0.000000] Linux version 3.16.0-6-amd64 (debian-kernel@lists.debian.org) (gcc version 4.9.2 (Debian 4.9.2-10+deb8u1) ) #1 SMP Debian 3.16.56-1+deb8u1 (2018-05-08)
May 24 15:42:13 nas kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.16.0-6-amd64 root=UUID=e00b8ddf-5247-4b9f-834c-d557df90f575 ro quiet
May 24 15:42:13 nas kernel: [    0.000000] e820: BIOS-provided physical RAM map:

Then I waited. From the logs above (which flooded my terminal) you can see that, before the cleaning, the hardware errors already appeared after 1026 seconds of uptime.

Now, after 1200 seconds of uptime, still no hardware errors:

root@nas:~# uptime
 16:03:00 up 20 min,  1 user,  load average: 0.04, 0.15, 0.09

root@nas:~# echo $((20 * 60 ))
1200

root@nas:~# dmesg | tail
[   10.257700] RPC: Registered named UNIX socket transport module.
[   10.257706] RPC: Registered udp transport module.
[   10.257709] RPC: Registered tcp transport module.
[   10.257711] RPC: Registered tcp NFSv4.1 backchannel transport module.
[   10.272263] FS-Cache: Loaded
[   10.321299] FS-Cache: Netfs 'nfs' registered for caching
[   10.376030] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
[   11.809469] tg3 0000:02:00.0 eth0: Link is up at 1000 Mbps, full duplex
[   11.809478] tg3 0000:02:00.0 eth0: Flow control is on for TX and on for RX
[   11.809506] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

Even after now 41 minutes (=2460 seconds) of uptime, still no errors:

root@nas:~# uptime && dmesg |tail
 16:23:49 up 41 min,  1 user,  load average: 0.02, 0.03, 0.01
[   10.257700] RPC: Registered named UNIX socket transport module.
[   10.257706] RPC: Registered udp transport module.
[   10.257709] RPC: Registered tcp transport module.
[   10.257711] RPC: Registered tcp NFSv4.1 backchannel transport module.
[   10.272263] FS-Cache: Loaded
[   10.321299] FS-Cache: Netfs 'nfs' registered for caching
[   10.376030] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
[   11.809469] tg3 0000:02:00.0 eth0: Link is up at 1000 Mbps, full duplex
[   11.809478] tg3 0000:02:00.0 eth0: Flow control is on for TX and on for RX
[   11.809506] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

These error messages really turned out to be a warning from the OS to clean the server. Who would have thought that, looking at these hardware error messages...

 

How to build a generic Icinga2 service graph in Grafana using InfluxDB
Friday - May 11th 2018 - by - (0 comments)

In the past weeks I've spent quite some time (whenever I had some) slowly kicking off the new monitoring architecture, using a dual-master Icinga2 installation, InfluxDB as the graphing database and Grafana as the graph display front end. See the previous articles "Icinga2 graphing with InfluxDB and Grafana" and "Create separate measurement tables in InfluxDB for Icinga 2 NRPE checks" for further information.

I was quite happy so far with the dashboard I created in Grafana, based on the Icinga2 Grafana dashboard template:

Grafana Icinga2 Linux Dashboard 

But I was missing dynamic graphs. We currently have around 850 host objects in our "old" Icinga2 monitoring and not all hosts are the same: some have additional database checks, some have HTTP checks, some run on Windows, others have very specific application checks. It's difficult to represent all these services in the main dashboard with the (fixed) graph panels in Grafana.

Eventually I came across a question from user TryTryAgain on serverfault, which was basically about creating a dynamic action_url pointing to Grafana. The question itself was irrelevant to me, but something interesting caught my eye:

" I'd like this to work: action_url = "http://grafana-server.example/grafana/dashboard/db/generic-check?var-device=$HOSTNAME$&var-check=$SERVICEDESC$&var-checkmetric=$SERVICECHECKCOMMAND$&var-datatype=perfdata&var-value=value" "

So the user prepared a new dashboard template "generic-check" and uses dynamic variables to display the graph for exactly one service. That's a great idea!

As I couldn't find a pre-existing template for such a generic graph, I went on to create it. And had to dig deeper into InfluxDB's queries and schemas...

 

1. Prepare the graph

I added a single graph, first with a static data query:

SELECT mean("value") FROM hostalive WHERE ("hostname" =~ /mytesthost/) AND $timeFilter GROUP BY time($__interval) fill(previous)

This query will be adjusted in the next steps, as I add dynamic variables.

 

2. Variable $hostname

I decided that the generic service template should start with the actual host object. This is usually the most important selector (from which host object do I need the graph?). I created the $hostname variable in the dashboard's templating variables:

$hostname = SHOW TAG VALUES FROM "hostalive" WITH KEY = "hostname"

This query looks at the "hostalive" measurement and shows all unique values of the tag key "hostname". Remember, InfluxDB can basically be explained as a key-value store (like Redis).
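
Such schema queries can also be tried outside of Grafana with the influx CLI (the database name icinga2 is an assumption here; use whatever database your Icinga2 InfluxdbWriter writes into):

$ influx -database 'icinga2' -execute 'SHOW TAG VALUES FROM "hostalive" WITH KEY = "hostname"'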

This one was pretty easy and immediately showed all the hosts already prepared in the new Icinga2 architecture:

Grafana variable hostname 

To use the dynamic variable $hostname in the graph, the graph's query needs to be adjusted:

SELECT mean("value") FROM hostalive WHERE ("hostname" =~ /^$hostname$/) AND $timeFilter GROUP BY time($__interval) fill(previous)

 

3. Variable $check

Now it gets slightly more complicated. With the host object selected via $hostname, Grafana needs to look up for which services it can display graphs. I decided the best way would be to look at the different measurement tables. I did this with:

$check = SHOW measurements

But I wasn't happy with that because it just showed all measurement tables, even irrelevant ones like "http" for a non-webserver.

Luckily the show measurements query also allows a WHERE clause:

$check = SHOW measurements WHERE "hostname" =~ /^$hostname$/

This way InfluxDB only shows measurement tables in which the selected host object already has some data entries.

Grafana Dynamic Variable Check 

To use the dynamic variable $check in the graph, the graph's query needs to be adjusted:

SELECT mean("value") FROM /^$check$/ WHERE ("hostname" =~ /^$hostname$/) AND $timeFilter GROUP BY time($__interval) fill(previous)

 

4. Variable $service

At first I thought my template was almost complete with the defined $check. It worked, for example, for "ssh" checks, which is (normally) a single service check per host object. But a very good counter-example is a disk check: you usually run disk usage checks (check_disk) on several partitions of the same host object and therefore have multiple service objects in Icinga 2. In such a case, the query for the selected $check returns multiple results, and the graph would simply lump all the data (value) together, whether a value came from partition "/" or "/tmp". This is wrong.

So I needed to create another variable $service which represents the already existing data for the selected $check:

$service = SHOW TAG VALUES FROM $check WITH KEY = "service" WHERE "hostname" =~ /^$hostname$/

In the following example, a host object with several tcp checks gives the following selection:

Grafana dynamic variable service 

 

5. Variable $metric

But a check can return multiple values! For example, check_http usually returns two sets of performance data: the size of the response and the response time. To get the graph we actually want, for example the response time of an http check, another flexible variable $metric was defined:

$metric = SHOW TAG VALUES FROM "$check" WITH KEY = "metric" WHERE "hostname" =~ /^$hostname$/

The new variable now allows selecting the relevant data:

Grafana dynamic variable metric

To use the dynamic variable $metric in the graph, the graph's query needs to be adjusted:

SELECT mean("value") FROM /^$check$/ WHERE ("hostname" =~ /^$hostname$/ AND "metric" =~ /^$metric$/) AND $timeFilter GROUP BY time($__interval) fill(previous)

 

6. Variable $aggregation

I thought I was done, and the first few tests looked promising. Until I came across a host running MySQL. The graph for MySQL (metric: connections) just kept growing:

Grafana MySQL Connections wrong 

This is a misinterpretation by the graph, because MySQL's connections value is a so-called counter (connection = connection + 1). And because this is a single generic graph, how should Grafana know what kind of data it gets?

The solution is to provide yet another dynamic variable $aggregation. With this variable, Grafana can be told how to display the data. I created a custom variable for this purpose with two values:

Grafana dynamic variable aggregation

$aggregation = mean("value"),derivative(mean("value"))

To use the dynamic variable $aggregation in the graph, the graph's query needs to be adjusted:

SELECT $aggregation FROM /^$check$/ WHERE ("hostname" =~ /^$hostname$/ AND "metric" =~ /^$metric$/) AND $timeFilter GROUP BY time($__interval) fill(previous)

The graph for the counter data now dynamically adapts:

Grafana MySQL connections correct 

 

The full picture

Grafana Icinga2 Dynamic Generic Service Graph 

The template "Generic Service" can be downloaded here as json export.

 

Next steps

Now that I'm able to dynamically display a graph for a generic service (by manually selecting values from the drop-down fields or by passing the variables in the URL), I will try to use this for the "action_url" in Icinga 2 or integrate it differently.
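
A first rough, untested sketch of what that could look like in an Icinga 2 service template (the Grafana host and dashboard path are placeholders, not my real setup; the remaining variables could be passed the same way):

template Service "grafana-generic-graph" {
  // runtime macros fill in the dashboard's templating variables
  action_url = "https://grafana.example.com/dashboard/db/generic-service?var-hostname=$host.name$&var-check=$service.check_command$"
}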

 

Open issues:

The graphs for disk/partition usage are not showing up due to an error in the query. I haven't found out yet why this happens (probably due to a conflict between $service and $metric, not sure) but for now I can live with it.

Grafana generic graph failing for disk

 

Monitoring plugin check_smart version 5.10 released
Tuesday - May 8th 2018 - by - (0 comments)

A new version (5.10) of the monitoring plugin check_smart was released today.

This version introduces a new parameter "-s" (or "--selftest"). By using this switch, an additional check is run against the drive's SMART self-test log.

A typical example of such a self-test log entry:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       10%     10926         962363536

Remember: You can always launch a SMART self-test by running the following command (example on drive /dev/sda):

# smartctl -t long /dev/sda

At first I thought this self-test check should be part of the main check, but there may be situations where a drive reported errors in the past which have since been resolved (e.g. sectors meanwhile re-allocated). Making it an additional parameter "-s", which can easily be added to existing Nagios/Icinga/NRPE configurations, doesn't break compatibility.
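
A typical invocation with the new switch could look like this (device and interface type are just examples, adjust them to your drive):

# ./check_smart.pl -d /dev/sda -i ata -s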

 

Lessons learned: Do not put stateful Minio into a Docker container
Monday - May 7th 2018 - by - (0 comments)

On a Big Data playground (Spark, R, Minio) we've built on top of a Rancher-managed Docker environment, we also put the Minio object storage into Docker containers. We were sceptical at first (don't create stateful containers!), but Rancher offers Minio directly from its Cattle catalog:

 Minio in Rancher Cattle Catalog

We decided to go for it - it's a playground environment and in the worst case we'll begin from scratch again.

At first everything seemed to work fine. The containers started up and, using their internal Docker IP addresses, Minio was launched in each container (four containers, spread across four physical hosts). On each physical host, three volumes were created and mounted into the Minio containers. Inside the containers:

# docker inspect -f '{{ .Mounts }}' 5ca3f3177e27
[{volume minio_minio-scheduler-setting_3_bc048e5e-2cbd-4f5d-8ab7-ef835e7424af_2b6e4 /var/lib/docker/volumes/minio_minio-scheduler-setting_3_bc048e5e-2cbd-4f5d-8ab7-ef835e7424af_2b6e4/_data /opt/scheduler local rw true } {volume minio_minio-data-0_3_bc048e5e-2cbd-4f5d-8ab7-ef835e7424af_76e19 /var/lib/docker/volumes/minio_minio-data-0_3_bc048e5e-2cbd-4f5d-8ab7-ef835e7424af_76e19/_data /data/disk0 local rw true } {volume rancher-cni /var/lib/docker/volumes/rancher-cni/_data /.r local ro false } {volume 42d6f216267ced36e186b3e082ae6b3c7ad53085432326220c583ab022842826 /var/lib/docker/volumes/42d6f216267ced36e186b3e082ae6b3c7ad53085432326220c583ab022842826/_data /data local  true }]

Minio itself was launched inside the container in the following way:

bash-4.3# ps auxf
PID   USER     TIME   COMMAND
    1 root       0:00 s6-svscan -t0 /var/run/s6/services
   38 root       0:00 s6-supervise s6-fdholderd
  323 root       0:00 s6-supervise minio
  326 root       0:00 sh ./run
  330 minio      0:00 sh -c /opt/minio/bin/minio server http://10.42.84.239/data/disk0 http://10.42.224.2/data/disk0
  332 minio      0:27 /opt/minio/bin/minio server http://10.42.84.239/data/disk0 http://10.42.224.2/data/disk0 http:/
  430 root       0:00 bash
  436 root       0:00 ps auxf

The command line of this process (PID 330) reveals more:

bash-4.3# cat /proc/330/cmdline
sh-c/opt/minio/bin/minio server http://10.42.84.239/data/disk0 http://10.42.224.2/data/disk0 http://10.42.55.111/data/disk0 http://10.42.116.171/data/disk0

And, no surprise, here we can find the four IP addresses of the four Minio containers.

We launched some basic tests to make sure the setup would hold:

  • Upscaling: We tried to scale from 4 to 5 minio-server containers. The 5th server didn't come up. This is most likely because it is not aware of the other minio-server containers (that awareness is only established during the setup through the Catalog) and it requires the correct start command with all the other container IPs. Even if the 5th server had come up, there is no way the other containers would suddenly be aware of a new Minio server; they, too, would need a changed Minio start command. TL;DR: Minio servers (deployed through the catalog) cannot be horizontally scaled; they're fixed once deployed.
  • Same IP: As we can see in the command line, the IP addresses are fixed in the Minio start command. We deleted a container, it was re-created by Rancher and it kept the same IP address.
  • Volumes: We wanted to see if the data is lost if a container is stopped or deleted. Even after deletion of a minio-server container, the volumes still exist on the Rancher host and when the container is re-created, the same volume is re-attached to the new container. Therefore no data loss.

So far so good, but one thing caught my eye: the minio-server service, automatically created by deploying from the catalog, was set to "Scale" and not to "Global". The latter, global, is usually my first choice for services running across hosts. And Minio, by definition, is such a service.

When we had to take down physical host #1 for maintenance, the problems started. Due to the "Scaling" of the Minio server service, another Minio container was launched on one of the three remaining physical hosts. Therefore on Docker host #2 we had two Minio containers running. Even though the newly created container on host #2 used the same IP address, it had to create a new data volume on the host. The two Minio containers therefore now used double the amount of disk space compared to before.

When host #1 was up again, we tried to place the Minio container on host #1 again. Unfortunately, in Rancher containers cannot be moved like a VM in VMware. So we deleted the second Minio container on host #2 and hoped it would automatically create a new Minio container (scale: 4, remember) on host #1 again. It didn't. The new Minio container was again launched on host #2. I had to downscale and then upscale again, but this worked: the new Minio container was finally created on host #1 again. But to my shock a new IP address was suddenly assigned instead of using the previous one (probably due to my downscaling). This broke communication with the other Minio containers, because they were not aware of an IP change of one of their Minio partner nodes.

Another problem: on host #2 the duplicated volumes were still on the system, using disk space:

# docker volume ls | grep minio
local               minio_minio-data-0_1_fb18fef0-a34b-4f78-9628-ca66abe4b057_a745e
local               minio_minio-data-0_2_c416a26e-b724-49af-bd37-751813a0a586_10db2
local               minio_minio-scheduler-setting_1_fb18fef0-a34b-4f78-9628-ca66abe4b057_919c9
local               minio_minio-scheduler-setting_2_c416a26e-b724-49af-bd37-751813a0a586_fd842

At first we had some "hope" that the service would itself recover again, when each container was restarted - but the following error message shown in the Minio user interface was pretty much saying nay (Multiple disk failures, unable to write data.):

Minio as Docker container: Multiple disk failures

Luckily we were still in playground mode before some actual/important data was created and stored in the Minio object store. I decided to create static LXC containers for Minio and use the docker0 bridge in order to connect the (stateless) application Docker containers to Minio.
I was able to attach the LXC container "minio01" to the docker0 bridge and the container was able to communicate with the Docker containers on the same physical host. But, and this is the show stopper, it was not able to communicate with the Docker containers running on other hosts (but in the same Rancher environment). The reason for this is that Rancher uses some magic (technically speaking: an IPsec tunnel between the Docker hosts) to "extend" the Docker-internal network (10.42.0.0/16) onto each Docker host. According to the Rancher documentation, it is possible to "join" this IPsec-tunneled network by manually running a Docker container with a specific label:

For any containers launched through the Docker CLI, an extra label --label io.rancher.container.network=true can be used to launch the container into the managed network. Without this label, containers launched from the Docker CLI will be using the bridge network.

But of course this only applies to containers started and managed by Docker, not the Linux Containers LXC (System Containers as they're nowadays called).
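
Just for reference, for a container launched with the plain Docker CLI this would look roughly like the following (which, again, does not help for an LXC container):

# docker run -d --label io.rancher.container.network=true <some-image>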

The Docker containers have one thing in common though: they use the Docker host as router/gateway. This means they are able to connect to the physical network using iptables NAT rules. So I created another virtual bridge called virbr0 alongside docker0, attached the physical primary interface of the Docker host to it and assigned the host's IP address to the virbr0 interface.

 # The primary network interface
iface enp15s0 inet manual

# Virtual bridge
auto virbr0
iface virbr0 inet static
        address 10.150.1.10
        netmask 255.255.255.128
        network 10.150.1.0
        broadcast 10.150.1.127
        gateway 10.150.1.1
        # bridge control
        bridge_ports enp15s0
        bridge_fd 0
        pre-up brctl addbr virbr0
        pre-down brctl delbr virbr0
        # dns-* options are implemented by the resolvconf package, if installed
        dns-nameservers 8.8.8.8 1.1.1.1
        dns-search playground.local

The LXC container now uses this virtual bridge virbr0 as its network link and the host's own IP address as gateway:

# cat /var/lib/lxc/minio01/config
# Template used to create this container: /usr/share/lxc/templates/lxc-ubuntu
# Parameters passed to the template: --release xenial
# Template script checksum (SHA-1): 4d7c613c3c0a0efef4b23917f44888df507e662b
# For additional config options, please look at lxc.container.conf(5)

# Uncomment the following line to support nesting containers:
#lxc.include = /usr/share/lxc/config/nesting.conf
# (Be aware this has security implications)


# Common configuration
lxc.include = /usr/share/lxc/config/ubuntu.common.conf

# Container specific configuration
lxc.rootfs = /var/lib/docker/lxc/minio01
lxc.rootfs.backend = dir
lxc.utsname = minio01
lxc.arch = amd64
lxc.start.auto = 1

# Network configuration
lxc.network.type = veth
lxc.network.link = virbr0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:10:ee:90
lxc.network.ipv4 = 10.150.1.50/25
lxc.network.ipv4.gateway = 10.150.1.10

Now the Docker containers and the LXC container have one thing in common: they use the same physical host as gateway. This means they're using the same routing table and are therefore able to communicate with each other.
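
A quick sanity check from the Docker host illustrates this (the MASQUERADE rules are typically created by Docker; 10.150.1.50 is the LXC container from above):

# iptables -t nat -L POSTROUTING -n | grep -i masquerade
# ping -c 3 10.150.1.50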

But how do I tell the Docker application containers where Minio is located now? In Rancher this is possible by using an "External Service", in which you can set external IP addresses outside of the Docker container range:

Rancher External Service

And this external service can now be used in the internal Rancher load balancer:

Rancher LB using external Service 

Although Minio isn't running inside the Rancher-managed Docker environment anymore, it still uses the same physical network connections (gateways) of the same hosts and therefore doesn't lose any performance. On the other hand, we feel much more comfortable and "safe" with the persistent data living in an LXC container.

 

Rsnapshot error: /bin/rm: cannot remove ... Directory not empty
Friday - May 4th 2018 - by - (0 comments)

For the last two days I got the following error message from the nightly rsnapshot run:

/bin/rm: cannot remove ‘/backup/rsnapshot/daily.14/
p01/mnt/containerdata/admin/[...]/Auth’:
Directory not empty
----------------------------------------------------------------------------
rsnapshot encountered an error! The program was invoked with these options:
/usr/bin/rsnapshot daily
----------------------------------------------------------------------------
ERROR: Warning! /bin/rm failed.
ERROR: Error! rm_rf("/backup/rsnapshot/daily.14/")

But even when I manually tried to delete the folder, it didn't work:

# rm -rfv /backup/rsnapshot/daily.14/p01/mnt/containerdata/admin/[...]/Auth
rm: cannot remove ‘/backup/rsnapshot/daily.14/p01/mnt/containerdata/admin/[...]/Auth’: Directory not empty

Why would that happen? Let's check out the contents of that particular path:

# ls -la /backup/rsnapshot/daily.14/p01/mnt/containerdata/admin/[...]/Auth
ls: /backup/rsnapshot/daily.14/p01/mnt/containerdata/admin/[...]/Auth: Input/output error
ls: cannot open directory /backup/rsnapshot/daily.14/p01/mnt/containerdata/admin/[...]/Auth: Input/output error

Oops, an I/O error! That doesn't sound good at all. I unmounted the /backup filesystem and ran a forced file system check.
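
The unmount itself was nothing special (assuming no process keeps the mount busy anymore):

# umount /backup

Then the forced check: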

# fsck.ext4 -f /dev/vgdata/lvbackup
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Inode 2900778 has a bad extended attribute block 524288.  Clear<y>? yes
Inode 12765722 has a bad extended attribute block 524288.  Clear<y>? yes
Inode 42844666 has a bad extended attribute block 524288.  Clear<y>? yes
Pass 2: Checking directory structure
Entry '.' in /rsnapshot/daily.5/p01/mnt/containerdata/[...] (89989522) is a link to '.'  Clear<y>? yes
Entry '..' in /rsnapshot/daily.5/p01/mnt/containerdata/[...] (89989522) is duplicate '..' entry.
Fix<y>? yes
Entry '.' in /rsnapshot/daily.7/p01/mnt/containerdata/[...] (71123012) is a link to '.'  Clear<y>? yes
Entry '..' in /rsnapshot/daily.7/p01/mnt/containerdata/[...] (71123012) is duplicate '..' entry.
Fix<y>? yes
Entry '.' in /rsnapshot/daily.8/p01/mnt/containerdata/[...] (70746560) is a link to '.'  Clear<y>? yes
Entry '..' in /rsnapshot/daily.8/p01/mnt/containerdata/[...] (70746560) is duplicate '..' entry.
Fix<y>? yes
Pass 3: Checking directory connectivity
'..' in /rsnapshot/daily.8/p01/mnt/containerdata/[...] (70746560) is <The NULL inode> (0), should be /rsnapshot/daily.8/p01/mnt/containerdata/[...] (70746549).
Fix<y>? yes
'..' in /rsnapshot/daily.7/p01/mnt/containerdata/[...] (71123012) is <The NULL inode> (0), should be /rsnapshot/daily.7/p01/mnt/containerdata/[...] (71123004).
Fix<y>? yes
'..' in /rsnapshot/daily.5/p01/mnt/containerdata/[...] (89989522) is <The NULL inode> (0), should be /rsnapshot/daily.5/p01/mnt/containerdata/[...] (89989521).
Fix<y>? yes
Pass 4: Checking reference counts
Inode 70746549 ref count is 51, should be 50.  Fix<y>? yes
Inode 70746560 ref count is 2, should be 1.  Fix<y>? yes
Inode 71123004 ref count is 7, should be 6.  Fix<y>? yes
Inode 71123012 ref count is 2, should be 1.  Fix<y>? yes
Inode 89989521 ref count is 6, should be 5.  Fix<y>? yes
Inode 89989522 ref count is 2, should be 1.  Fix<y>? yes
Pass 5: Checking group summary information

/dev/vgdata/lvbackup: ***** FILE SYSTEM WAS MODIFIED *****
/dev/vgdata/lvbackup: 11160908/170393600 files (0.2% non-contiguous), 610481642/681574400 blocks

 Note: To protect the privacy of the customer's data, I removed the full paths and directory names from the output.

Once the file system check passed, I was able to check the directory:

# ls -la /backup/rsnapshot/daily.14/p01/mnt/containerdata/admin/[...]/Auth/
total 52
drwxr-xr-x  3 1002 1002 4096 Dec  2 13:04 .
drwxr-xr-x  3 1002 1002 4096 May  3 02:05 ..
drwxr-xr-x  2 1002 1002 4096 Dec  2 13:05 Access
[...]

And of course rm was now working, too:

# rm -rfv /backup/rsnapshot/daily.14/p01/mnt/containerdata/admin/[...]/Auth
removed directory: ‘/backup/rsnapshot/daily.14/p01/mnt/containerdata/admin/[...]/Auth/Access’
[...]
removed directory: ‘/backup/rsnapshot/daily.14/p01/mnt/containerdata/admin/[...]/Auth’

TL;DR: When rsnapshot fails with such an error, it's worth checking the file system's health.

 

Monitoring plugin check_smart version 5.9 released
Sunday - Apr 29th 2018 - by - (0 comments)

A new version (5.9) of the monitoring plugin to check hard and solid state drives' SMART table, check_smart.pl, was released.

With this release it is now possible to use the interface type "auto". However, this automatic lookup only seems to work on ATA devices (see https://github.com/Napsty/check_smart/pull/20 for more).

 

Galera node on Ubuntu 14.04 Trusty not joining after update to MariaDB 10.0.34
Wednesday - Apr 25th 2018 - by - (0 comments)

After updating an Ubuntu 14.04 Trusty server, including the MariaDB and Galera packages, I noticed the node wasn't able to join the Galera cluster anymore.

According to the apt history log, the versions were updated like this:

mariadb-galera-server:amd64 (10.0.31+maria-1~trusty, 10.0.34+maria-1~trusty),  mariadb-client-core-10.0:amd64 (10.0.32+maria-1~trusty, 10.0.34+maria-1~trusty), mariadb-common:amd64 (10.0.32+maria-1~trusty, 10.0.34+maria-1~trusty), mariadb-galera-server-10.0:amd64 (10.0.31+maria-1~trusty, 10.0.34+maria-1~trusty), galera-3:amd64 (25.3.20-trusty, 25.3.22-trusty),  mariadb-client-10.0:amd64 (10.0.32+maria-1~trusty, 10.0.34+maria-1~trusty)

I know for a fact that an automatic join after a MariaDB/MySQL start used to work. But after the upgrade and an automatic MariaDB start, the Galera node was not able to join the Galera cluster anymore. It briefly appeared in "wsrep_incoming_addresses" but then quickly disappeared again. These are the log entries I found when I manually launched the mysqld_safe daemon:

# mysqld_safe --skip-grant-tables &

Apr 25 22:06:09 galera-node-2 mysqld_safe: WSREP: Recovered position 32c39115-8d32-11e5-9122-b6a2b3e41530:724808778
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] /usr/sbin/mysqld (mysqld 10.0.34-MariaDB-1~trusty-wsrep) starting as process 29529 ...
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: Read nil XID from storage engines, skipping position init
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: wsrep_load(): Galera 25.3.22(r3764) by Codership Oy loaded successfully.
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: CRC-32C: using hardware acceleration.
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: Found saved state: 32c39115-8d32-11e5-9122-b6a2b3e41530:-1, safe_to_bootstrap: 0
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 10.55.55.86; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S;
Apr 25 22:06:09 galera-node-2 mysqld: pc.checksum = false; pc
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: GCache history reset: 32c39115-8d32-11e5-9122-b6a2b3e41530:0 -> 32c39115-8d32-11e5-9122-b6a2b3e41530:724808778
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: Assign initial position for certification: 724808778, protocol version: -1
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: wsrep_sst_grab()
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: Start replication
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: Setting initial position to 32c39115-8d32-11e5-9122-b6a2b3e41530:724808778
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: protonet asio version 0
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: Using CRC-32C for message checksums.
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: backend: asio
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: gcomm thread scheduling priority set to other:0
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: restore pc from disk failed
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: GMCast version 0
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: (185196d1, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: (185196d1, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: EVS version 0
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: gcomm: connecting to group 'GALERACLUSTER', peer '10.55.55.85:,10.55.55.86:,10.55.55.87:'
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: (185196d1, 'tcp://0.0.0.0:4567') connection established to 185196d1 tcp://10.55.55.86:4567
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Warning] WSREP: (185196d1, 'tcp://0.0.0.0:4567') address 'tcp://10.55.55.86:4567' points to own listening address, blacklisting
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: (185196d1, 'tcp://0.0.0.0:4567') connection established to acfd2d2f tcp://10.55.55.87:4567
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: (185196d1, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: (185196d1, 'tcp://0.0.0.0:4567') connection established to d607ef0f tcp://10.55.55.85:4567
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: declaring acfd2d2f at tcp://10.55.55.87:4567 stable
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: declaring d607ef0f at tcp://10.55.55.85:4567 stable
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: Node acfd2d2f state prim
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: view(view_id(PRIM,185196d1,12) memb {
Apr 25 22:06:09 galera-node-2 mysqld: #011185196d1,0
Apr 25 22:06:09 galera-node-2 mysqld: #011acfd2d2f,0
Apr 25 22:06:09 galera-node-2 mysqld: #011d607ef0f,0
Apr 25 22:06:09 galera-node-2 mysqld: } joined {
Apr 25 22:06:09 galera-node-2 mysqld: } left {
Apr 25 22:06:09 galera-node-2 mysqld: } partitioned {
Apr 25 22:06:09 galera-node-2 mysqld: })
Apr 25 22:06:09 galera-node-2 mysqld: 180425 22:06:09 [Note] WSREP: save pc into disk
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: gcomm: connected
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: Opened channel 'GALERACLUSTER'
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: Waiting for SST to complete.
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 18ea8d6b-48c4-11e8-8a41-0fef3fa593ed
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: STATE EXCHANGE: sent state msg: 18ea8d6b-48c4-11e8-8a41-0fef3fa593ed
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: STATE EXCHANGE: got state msg: 18ea8d6b-48c4-11e8-8a41-0fef3fa593ed from 0 (mysql02)
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: STATE EXCHANGE: got state msg: 18ea8d6b-48c4-11e8-8a41-0fef3fa593ed from 1 (mysql03)
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: STATE EXCHANGE: got state msg: 18ea8d6b-48c4-11e8-8a41-0fef3fa593ed from 2 (mysql01)
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: Quorum results:
Apr 25 22:06:10 galera-node-2 mysqld: #011version    = 4,
Apr 25 22:06:10 galera-node-2 mysqld: #011component  = PRIMARY,
Apr 25 22:06:10 galera-node-2 mysqld: #011conf_id    = 10,
Apr 25 22:06:10 galera-node-2 mysqld: #011members    = 2/3 (joined/total),
Apr 25 22:06:10 galera-node-2 mysqld: #011act_id     = 724815064,
Apr 25 22:06:10 galera-node-2 mysqld: #011last_appl. = -1,
Apr 25 22:06:10 galera-node-2 mysqld: #011protocols  = 0/7/3 (gcs/repl/appl),
Apr 25 22:06:10 galera-node-2 mysqld: #011group UUID = 32c39115-8d32-11e5-9122-b6a2b3e41530
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: Flow-control interval: [28, 28]
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: Trying to continue unpaused monitor
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 724815064)
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: State transfer required:
Apr 25 22:06:10 galera-node-2 mysqld: #011Group state: 32c39115-8d32-11e5-9122-b6a2b3e41530:724815064
Apr 25 22:06:10 galera-node-2 mysqld: #011Local state: 32c39115-8d32-11e5-9122-b6a2b3e41530:724808778
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: New cluster view: global state: 32c39115-8d32-11e5-9122-b6a2b3e41530:724815064, view# 11: Primary, number of nodes: 3, my index: 0, protocol version 3
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Warning] WSREP: Gap in state sequence. Need state transfer.
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'joiner' --address '10.55.55.86' --datadir '/var/lib/mysql/'   --parent '29529' --binlog '/var/log/mysql/mariadb-bin' '
Apr 25 22:06:10 galera-node-2 mysqld: /usr//bin/wsrep_sst_xtrabackup: line 397: setup_ports: command not found
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [ERROR] WSREP: Failed to read 'ready ' from: wsrep_sst_xtrabackup --role 'joiner' --address '10.55.55.86' --datadir '/var/lib/mysql/'   --parent '29529' --binlog '/var/log/mysql/mariadb-bin'
Apr 25 22:06:10 galera-node-2 mysqld: #011Read: '(null)'
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup --role 'joiner' --address '10.55.55.86' --datadir '/var/lib/mysql/'   --parent '29529' --binlog '/var/log/mysql/mariadb-bin' : 2 (No such file or directory)
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [ERROR] WSREP: Failed to prepare for 'xtrabackup' SST. Unrecoverable.
Apr 25 22:06:10 galera-node-2 mysqld: 180425 22:06:10 [ERROR] Aborting
Apr 25 22:06:10 galera-node-2 mysqld:
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Closing send monitor...
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Closed send monitor.
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: gcomm: terminating thread
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: gcomm: joining thread
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: gcomm: closing backend
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: view(view_id(NON_PRIM,185196d1,12) memb {
Apr 25 22:06:12 galera-node-2 mysqld: #011185196d1,0
Apr 25 22:06:12 galera-node-2 mysqld: } joined {
Apr 25 22:06:12 galera-node-2 mysqld: } left {
Apr 25 22:06:12 galera-node-2 mysqld: } partitioned {
Apr 25 22:06:12 galera-node-2 mysqld: #011acfd2d2f,0
Apr 25 22:06:12 galera-node-2 mysqld: #011d607ef0f,0
Apr 25 22:06:12 galera-node-2 mysqld: })
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: (185196d1, 'tcp://0.0.0.0:4567') connection to peer 185196d1 with addr tcp://10.55.55.86:4567 timed out, no messages seen in PT3S
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: (185196d1, 'tcp://0.0.0.0:4567') turning message relay requesting off
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: view((empty))
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: gcomm: closed
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Flow-control interval: [16, 16]
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Trying to continue unpaused monitor
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Received NON-PRIMARY.
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Shifting PRIMARY -> OPEN (TO: 724815071)
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Received self-leave message.
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Flow-control interval: [0, 0]
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Trying to continue unpaused monitor
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Received SELF-LEAVE. Closing connection.
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 724815071)
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: RECV thread exiting 0: Success
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: recv_thread() joined.
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Closing replication queue.
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Closing slave action queue.
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: Service disconnected.
Apr 25 22:06:12 galera-node-2 mysqld: 180425 22:06:12 [Note] WSREP: rollbacker thread exiting
Apr 25 22:06:13 galera-node-2 mysqld: 180425 22:06:13 [Note] WSREP: Some threads may fail to exit.
Apr 25 22:06:13 galera-node-2 mysqld: 180425 22:06:13 [Note] /usr/sbin/mysqld: Shutdown complete
Apr 25 22:06:13 galera-node-2 mysqld:
Apr 25 22:06:18 galera-node-2 mysqld: Error in my_thread_global_end(): 1 threads didn't exit
Apr 25 22:06:18 galera-node-2 mysqld_safe: mysqld from pid file /var/run/mysqld/mysqld.pid ended

After reading through the logs for the second or third time, one particular line caught my eye:

Apr 25 22:06:10 galera-node-2 mysqld: /usr//bin/wsrep_sst_xtrabackup: line 397: setup_ports: command not found

Huh? So this file was changed by the update. On a still working (and not yet updated) node:

root@galera-node-1:~# ll /usr//bin/wsrep_sst_xtrabackup
-rwxr-xr-x 1 root root 21703 Jun  1  2017 /usr//bin/wsrep_sst_xtrabackup

On the updated node 2:

root@galera-node-2:~# ll /usr//bin/wsrep_sst_xtrabackup
-rwxr-xr-x 1 root root 20658 Feb  2 08:20 /usr//bin/wsrep_sst_xtrabackup

Checking the mentioned line 397:

391 if [[ ! ${WSREP_SST_OPT_ROLE} == 'joiner' && ! ${WSREP_SST_OPT_ROLE} == 'donor' ]];then
392     wsrep_log_error "Invalid role ${WSREP_SST_OPT_ROLE}"
393     exit 22
394 fi
395
396 read_cnf
397 setup_ports
398 get_stream
399 get_transfer
400
401 INNOEXTRA=""
402 INNOAPPLY="${INNOBACKUPEX_BIN} ${WSREP_SST_OPT_CONF} --apply-log \$rebuildcmd \${DATA} &>\${DATA}/innobackup.prepare.log"
403 INNOBACKUP="${INNOBACKUPEX_BIN} ${WSREP_SST_OPT_CONF} \$INNOEXTRA --galera-info --stream=\$sfmt \${TMPDIR} 2>\${DATA}/innobackup.backup.log"

Comparing with line 397 in the old version of the script on node-1:

387 check_extra()
388 {
389     local use_socket=1
390     if [[ $uextra -eq 1 ]];then
391         if $MY_PRINT_DEFAULTS -c $WSREP_SST_OPT_CONF mysqld | tr '_' '-' | grep -- "--thread-handling=" | grep -q 'pool-of-threads';then
392             local eport=$($MY_PRINT_DEFAULTS -c $WSREP_SST_OPT_CONF mysqld | tr '_' '-' | grep -- "--extra-port=" | cut -d= -f2)
393             if [[ -n $eport ]];then
394                 # Xtrabackup works only locally.
395                 # Hence, setting host to 127.0.0.1 unconditionally.
396                 wsrep_log_info "SST through extra_port $eport"
397                 INNOEXTRA+=" --host=127.0.0.1 --port=$eport "
398                 use_socket=0
399             else
400                 wsrep_log_error "Extra port $eport null, failing"
401                 exit 1
402             fi
403         else
404             wsrep_log_info "Thread pool not set, ignore the option use_extra"
405         fi
406     fi
407     if [[ $use_socket -eq 1 ]] && [[ -n "${WSREP_SST_OPT_SOCKET}" ]];then
408         INNOEXTRA+=" --socket=${WSREP_SST_OPT_SOCKET}"
409     fi
410 }

Erm, yes, that's obviously quite a big difference. The sizes of the two scripts (see above) differ as well.
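To get an idea of the full extent of the changes, the two versions can also be compared directly. A quick sketch, assuming SSH access from node 2 to the not-yet-updated node 1:

# fetch the old script from node 1 and diff it against the updated one
root@galera-node-2:~# ssh galera-node-1 cat /usr/bin/wsrep_sst_xtrabackup > /tmp/wsrep_sst_xtrabackup.node1
root@galera-node-2:~# diff /tmp/wsrep_sst_xtrabackup.node1 /usr/bin/wsrep_sst_xtrabackup | less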

Replacing the new version of the script with the old one from node 1 fixed the problem: after another MariaDB start, the node was able to join the cluster again (see the sketch and log below).
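Roughly, the swap looks like this (assuming SSH access between the nodes; keep a copy of the new script so it can be restored once a fixed package is released):

# back up the updated (broken) script, then copy the old version over from node 1
root@galera-node-2:~# cp /usr/bin/wsrep_sst_xtrabackup /usr/bin/wsrep_sst_xtrabackup.20180425
root@galera-node-2:~# scp galera-node-1:/usr/bin/wsrep_sst_xtrabackup /usr/bin/wsrep_sst_xtrabackup
root@galera-node-2:~# chmod 755 /usr/bin/wsrep_sst_xtrabackup

With the old script in place, MariaDB was started again and the syslog followed: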

root@galera-node-2:~# service mysql start

root@galera-node-2:~# tail -f /var/log/syslog
Apr 25 22:14:07 galera-node-2 mysqld_safe: WSREP: Recovered position 32c39115-8d32-11e5-9122-b6a2b3e41530:724808778
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] /usr/sbin/mysqld (mysqld 10.0.34-MariaDB-1~trusty-wsrep) starting as process 30772 ...
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: Read nil XID from storage engines, skipping position init
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: wsrep_load(): Galera 25.3.22(r3764) by Codership Oy loaded successfully.
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: CRC-32C: using hardware acceleration.
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: Found saved state: 32c39115-8d32-11e5-9122-b6a2b3e41530:-1, safe_to_bootstrap: 0
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 10.55.55.86; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S;
Apr 25 22:14:07 galera-node-2 mysqld: pc.checksum = false; pc
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: GCache history reset: 32c39115-8d32-11e5-9122-b6a2b3e41530:0 -> 32c39115-8d32-11e5-9122-b6a2b3e41530:724808778
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: Assign initial position for certification: 724808778, protocol version: -1
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: wsrep_sst_grab()
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: Start replication
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: Setting initial position to 32c39115-8d32-11e5-9122-b6a2b3e41530:724808778
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: protonet asio version 0
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: Using CRC-32C for message checksums.
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: backend: asio
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: gcomm thread scheduling priority set to other:0
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: restore pc from disk failed
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: GMCast version 0
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: (3599655f, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: (3599655f, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: EVS version 0
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: gcomm: connecting to group 'GALERACLUSTER', peer '10.55.55.85:,10.55.55.86:,10.55.55.87:'
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: (3599655f, 'tcp://0.0.0.0:4567') connection established to 3599655f tcp://10.55.55.86:4567
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Warning] WSREP: (3599655f, 'tcp://0.0.0.0:4567') address 'tcp://10.55.55.86:4567' points to own listening address, blacklisting
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: (3599655f, 'tcp://0.0.0.0:4567') connection established to acfd2d2f tcp://10.55.55.87:4567
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: (3599655f, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
Apr 25 22:14:07 galera-node-2 mysqld: 180425 22:14:07 [Note] WSREP: (3599655f, 'tcp://0.0.0.0:4567') connection established to d607ef0f tcp://10.55.55.85:4567
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: declaring acfd2d2f at tcp://10.55.55.87:4567 stable
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: declaring d607ef0f at tcp://10.55.55.85:4567 stable
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Node acfd2d2f state prim
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: view(view_id(PRIM,3599655f,14) memb {
Apr 25 22:14:08 galera-node-2 mysqld: #0113599655f,0
Apr 25 22:14:08 galera-node-2 mysqld: #011acfd2d2f,0
Apr 25 22:14:08 galera-node-2 mysqld: #011d607ef0f,0
Apr 25 22:14:08 galera-node-2 mysqld: } joined {
Apr 25 22:14:08 galera-node-2 mysqld: } left {
Apr 25 22:14:08 galera-node-2 mysqld: } partitioned {
Apr 25 22:14:08 galera-node-2 mysqld: })
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: save pc into disk
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: gcomm: connected
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Opened channel 'GALERACLUSTER'
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Waiting for SST to complete.
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 35e60dd7-48c5-11e8-ab36-b74cbcd15212
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: STATE EXCHANGE: sent state msg: 35e60dd7-48c5-11e8-ab36-b74cbcd15212
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: STATE EXCHANGE: got state msg: 35e60dd7-48c5-11e8-ab36-b74cbcd15212 from 0 (mysql02)
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: STATE EXCHANGE: got state msg: 35e60dd7-48c5-11e8-ab36-b74cbcd15212 from 1 (mysql03)
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: STATE EXCHANGE: got state msg: 35e60dd7-48c5-11e8-ab36-b74cbcd15212 from 2 (mysql01)
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Quorum results:
Apr 25 22:14:08 galera-node-2 mysqld: #011version    = 4,
Apr 25 22:14:08 galera-node-2 mysqld: #011component  = PRIMARY,
Apr 25 22:14:08 galera-node-2 mysqld: #011conf_id    = 12,
Apr 25 22:14:08 galera-node-2 mysqld: #011members    = 2/3 (joined/total),
Apr 25 22:14:08 galera-node-2 mysqld: #011act_id     = 724818998,
Apr 25 22:14:08 galera-node-2 mysqld: #011last_appl. = -1,
Apr 25 22:14:08 galera-node-2 mysqld: #011protocols  = 0/7/3 (gcs/repl/appl),
Apr 25 22:14:08 galera-node-2 mysqld: #011group UUID = 32c39115-8d32-11e5-9122-b6a2b3e41530
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Flow-control interval: [28, 28]
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Trying to continue unpaused monitor
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 724818998)
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: State transfer required:
Apr 25 22:14:08 galera-node-2 mysqld: #011Group state: 32c39115-8d32-11e5-9122-b6a2b3e41530:724818998
Apr 25 22:14:08 galera-node-2 mysqld: #011Local state: 32c39115-8d32-11e5-9122-b6a2b3e41530:724808778
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: New cluster view: global state: 32c39115-8d32-11e5-9122-b6a2b3e41530:724818998, view# 13: Primary, number of nodes: 3, my index: 0, protocol version 3
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Warning] WSREP: Gap in state sequence. Need state transfer.
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'joiner' --address '10.55.55.86' --datadir '/var/lib/mysql/'   --parent '30772' --binlog '/var/log/mysql/mariadb-bin' '
Apr 25 22:14:08 galera-node-2 mysqld: WSREP_SST: [INFO] Streaming with tar (20180425 22:14:08.550)
Apr 25 22:14:08 galera-node-2 mysqld: WSREP_SST: [INFO] Using socat as streamer (20180425 22:14:08.552)
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Prepared SST request: xtrabackup|10.55.55.86:4444/xtrabackup_sst
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: REPL Protocols: 7 (3, 2)
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Assign initial position for certification: 724818998, protocol version: 3
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Service thread queue flushed.
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: IST receiver addr using tcp://10.55.55.86:4568
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Prepared IST receiver, listening at: tcp://10.55.55.86:4568
Apr 25 22:14:08 galera-node-2 mysqld: WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | tar xfi - --recursive-unlink -h; RC=( ${PIPESTATUS[@]} ) (20180425 22:14:08.561)
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Member 0.0 (mysql02) requested state transfer from '*any*'. Selected 2.0 (mysql01)(SYNCED) as donor.
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 724818998)
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Requesting state transfer: success, donor: 2
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: GCache history reset: 32c39115-8d32-11e5-9122-b6a2b3e41530:0 -> 32c39115-8d32-11e5-9122-b6a2b3e41530:724818998
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: 2.0 (mysql01): State transfer to 0.0 (mysql02) complete.
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: Member 2.0 (mysql01) synced with group.
Apr 25 22:14:08 galera-node-2 mysqld: WSREP_SST: [INFO] xtrabackup_ist received from donor: Running IST (20180425 22:14:08.676)
Apr 25 22:14:08 galera-node-2 mysqld: WSREP_SST: [INFO] Total time on joiner: 0 seconds (20180425 22:14:08.678)
Apr 25 22:14:08 galera-node-2 mysqld: WSREP_SST: [INFO] Removing the sst_in_progress file (20180425 22:14:08.680)
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] WSREP: SST complete, seqno: 724808778
Apr 25 22:14:08 galera-node-2 mysqld: 2018-04-25 22:14:08 7f623bbb97c0 InnoDB: Warning: Using innodb_locks_unsafe_for_binlog is DEPRECATED. This option may be removed in future releases. Please use READ COMMITTED transaction isolation level instead, see http://dev.mysql.com/doc/refman/5.6/en/set-transaction.html.
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] InnoDB: Using mutexes to ref count buffer pool pages
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] InnoDB: The InnoDB memory heap is disabled
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] InnoDB: Compressed tables use zlib 1.2.8
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] InnoDB: Using Linux native AIO
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] InnoDB: Using CPU crc32 instructions
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] InnoDB: Initializing buffer pool, size = 2.0G
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] InnoDB: Completed initialization of buffer pool
Apr 25 22:14:08 galera-node-2 mysqld: 180425 22:14:08 [Note] InnoDB: Highest supported file format is Barracuda.
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] InnoDB: 128 rollback segment(s) are active.
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] InnoDB: Waiting for purge to start
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.36-83.0 started; log sequence number 1013752338227
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] Plugin 'FEEDBACK' is disabled.
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] Server socket created on IP: '0.0.0.0'.
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] WSREP: Signalling provider to continue.
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] WSREP: SST received: 32c39115-8d32-11e5-9122-b6a2b3e41530:724808778
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] WSREP: Receiving IST: 10220 writesets, seqnos 724808778-724818998
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] WSREP: Receiving IST...  0.0% (    0/10220 events) complete.
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] Reading of all Master_info entries succeded
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] Added new Master_info '' to hash table
Apr 25 22:14:09 galera-node-2 mysqld: 180425 22:14:09 [Note] /usr/sbin/mysqld: ready for connections.
Apr 25 22:14:09 galera-node-2 mysqld: Version: '10.0.34-MariaDB-1~trusty-wsrep'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  mariadb.org binary distribution, wsrep_25.21.rc3fc46e
Apr 25 22:14:11 galera-node-2 mysqld: 180425 22:14:11 [Note] WSREP: (3599655f, 'tcp://0.0.0.0:4567') connection to peer 3599655f with addr tcp://10.55.55.86:4567 timed out, no messages seen in PT3S
Apr 25 22:14:11 galera-node-2 mysqld: 180425 22:14:11 [Note] WSREP: (3599655f, 'tcp://0.0.0.0:4567') turning message relay requesting off
Apr 25 22:14:19 galera-node-2 mysqld: 180425 22:14:19 [Note] WSREP: Receiving IST... 60.0% ( 6128/10220 events) complete.
Apr 25 22:14:26 galera-node-2 mysqld: 180425 22:14:26 [Note] WSREP: Receiving IST...100.0% (10220/10220 events) complete.
Apr 25 22:14:26 galera-node-2 mysqld: 180425 22:14:26 [Note] WSREP: IST received: 32c39115-8d32-11e5-9122-b6a2b3e41530:724818998
Apr 25 22:14:26 galera-node-2 mysqld: 180425 22:14:26 [Note] WSREP: 0.0 (mysql02): State transfer from 2.0 (mysql01) complete.
Apr 25 22:14:26 galera-node-2 mysqld: 180425 22:14:26 [Note] WSREP: Shifting JOINER -> JOINED (TO: 724819126)
Apr 25 22:14:26 galera-node-2 mysqld: 180425 22:14:26 [Note] WSREP: Member 0.0 (mysql02) synced with group.
Apr 25 22:14:26 galera-node-2 mysqld: 180425 22:14:26 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 724819126)
Apr 25 22:14:26 galera-node-2 mysqld: 180425 22:14:26 [Note] WSREP: Synchronized with group, ready for connections
Apr 25 22:14:26 galera-node-2 mysqld: 180425 22:14:26 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

A bug report for this issue already exists (https://jira.mariadb.org/browse/MDEV-15254?attachmentViewMode=list), but it concerns MariaDB 10.1.x and not 10.0.x as in my case on Ubuntu 14.04, so I opened https://jira.mariadb.org/browse/MDEV-16032.
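Note that the manually restored script will most likely be overwritten again by the next package upgrade. Until a fixed package is available, one option is to put the package that ships the script on hold; a small sketch (the exact package name depends on the installation, dpkg -S will reveal it):

# find out which package owns the SST script, then put that package on hold
root@galera-node-2:~# dpkg -S /usr/bin/wsrep_sst_xtrabackup
root@galera-node-2:~# apt-mark hold <package name reported by dpkg -S>

Remember to remove the hold again (apt-mark unhold) once the fixed version is released.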

 

New version 20180411 of check_esxi_hardware released
Wednesday - Apr 11th 2018 - by - (0 comments)

A new version (20180411) of the monitoring plugin check_esxi_hardware is available. This version contains an additional check which returns an UNKNOWN state if, for any reason, the CIM EnumerateInstances function does not return anything useful (error message "ThreadPool --- Failed to enqueue request").

Thanks to Peter Newman for the code contribution.

 

Unable to download attachment from Gmail, how to get it anyway (by using Takeout)
Monday - Apr 9th 2018 - by - (0 comments)

For backup and archive purposes I usually create a config backup of each of my monitoring setups. A couple of years back, in 2012, I simply sent myself the tar.gz file to my Gmail mailbox. Now that I had to look up how I used a certain monitoring plugin back then, it turned out this wasn't the best decision: Gmail suddenly decided that the attachment contains a virus (probably because the monitoring plugins in the compressed archive are executables, but I'm not certain) and didn't allow me to download the file.

Gmail attachment disabled

Luckily there's a way around it. Unfortunately it takes some time, but if you need that data from the attachment, you still gotta do it, right? (Imagine it's a bitcoin wallet from 2012! ^_^)

Google offers an "archive" service called Takeout, where you can download your data. You can access this service at https://takeout.google.com/. In this case I only needed the archive of my Gmail mailbox, so I only selected Gmail. After clicking next, the "split size" can be chosen. If your mailbox is larger than the chosen size, the export is split into several files (with the suffix .mbox).

Now I had to wait... The archive isn't ready immediately. A day later I got an e-mail telling me that my archive was ready for download:

Google data archive is ready

So I went on and downloaded archive file 1 of 2.
As mentioned above, this is an mbox file, which is basically your whole mailbox in plain-text format. The mutt command is perfectly capable of reading an mbox file, so there is no need to install a GUI client (like Thunderbird) just to extract the attachment I need.

$ mutt -f /tmp/All\ mail\ Including\ Spam\ and\ Trash-001.mbox
Reading /tmp/All mail Including Spam and Trash-001.mbox... 359578 (86%)

Once mutt loaded all e-mails, the following overview was presented to me:

Opening mailbox with mutt

I wasn't happy with this: I need to see the year of each mail right away to quickly find the one I am looking for. So I adapted my ~/.muttrc and added the year to the date format (the %{%Y %b %d} part is a strftime-style date specification):

$ cat ~/.muttrc
set index_format="%4C %Z %{%Y %b %d} %-15.15L (%4l) %s"

After opening the mailbox again with mutt, I now saw the year; however, the mails were not sorted the way I wanted them.

Displaying year in mutt index

To solve this, I added another line (set sort) in muttrc to sort by date:

$ cat ~/.muttrc
set index_format="%4C %Z %{%Y %b %d} %-15.15L (%4l) %s"
set sort = date

I opened the mailbox again; this time the oldest mails were shown first and I used "PageDown" to scroll down to the date I needed. And finally, there was my mail:

Found Gmail archive mail in mbox mutt 

Selecting the mail with the cursor and hitting "Enter" opens it:

Opening a mail in mutt 

The attachment can be shown by pressing "v":

View attachment in mutt 

I moved the cursor down to the tar.gz file and pressed "s" to save the file in /tmp:

Save attachment in mutt 

And finally I was able to open the file:

Open tar gz archive  
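As an aside: the extraction can also be scripted instead of clicking through mutt. A minimal sketch, assuming the procmail (for formail) and mpack (for munpack) packages are installed; it dumps the attachments of every message in the mbox into the current directory:

$ mkdir /tmp/attachments && cd /tmp/attachments
$ formail -s munpack < "/tmp/All mail Including Spam and Trash-001.mbox"

formail splits the mbox into individual messages and feeds each one to munpack, which writes the decoded attachments to files. This obviously extracts far more than a single attachment, but it can be handy if you are not sure in which mail the file you are looking for is hiding.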

 

