
resolvconf is not always helpful, especially in a chroot rescue system
Wednesday - Feb 22nd 2017

Yesterday I needed to completely re-partition a server (going from 12 separate partitions on /dev/sda to a two-disk setup with LVM). To ensure data integrity I booted from a live CD (Knoppix), mounted the old partitions inside /mnt and the new (target) partitions inside /mnt2.

The mounts inside /mnt2 looked like this:

/dev/sda1 /mnt2
/dev/mapper/vgsystem-lvvar /mnt2/var
/dev/mapper/vgsystem-lvtmp /mnt2/tmp
/dev/mapper/vgdata-lvdb /mnt2/db
/dev/mapper/vgdata-lvwww /mnt2/www

Once I synced /mnt to /mnt2, I mounted procfs, sysfs and dev into /mnt2:

mount --bind /dev /mnt2/dev
mount --bind /proc /mnt2/proc
mount --bind /sys /mnt2/sys

And then went into the "new system":

chroot /mnt2 /bin/bash

I adapted /etc/fstab with the new UUIDs of the real partitions and pointed the logical volumes to the correct mount points. I also knew I had to create a new initramfs, because the system now uses LVM. But none of the lvm commands worked, because lvm2 was not installed. So I configured the network card and wanted to run apt-get install lvm2, but apt complained that DNS resolution didn't work.
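As a sketch, the adapted /etc/fstab looked roughly like this (the UUID is a placeholder; the real values come from blkid):

```shell
# /etc/fstab (sketch - UUID is a placeholder, check with: blkid /dev/sda1)
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /     ext4  errors=remount-ro  0  1
/dev/mapper/vgsystem-lvvar                 /var  ext4  defaults           0  2
/dev/mapper/vgsystem-lvtmp                 /tmp  ext4  defaults           0  2
/dev/mapper/vgdata-lvdb                    /db   ext4  defaults           0  2
/dev/mapper/vgdata-lvwww                   /www  ext4  defaults           0  2
```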

I checked /etc/resolv.conf and it was a symlink:

/etc/resolv.conf -> ../run/resolvconf/resolv.conf

Of course ../run was empty. And even after I added it with yet another bind mount inside /mnt2, the resolvconf folder still didn't exist (/run was mounted from the Knoppix live image, and resolvconf inside the chroot wasn't started).

Because my maintenance window was running out, I deleted the symlink and simply created a plain /etc/resolv.conf with a nameserver entry. With that change I was finally able to run apt-get install lvm2. Once this package was installed, it automatically updated the initramfs. Well, thanks! Note: I would have run "update-initramfs -u" myself if apt hadn't done it automatically.
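For reference, the replacement was nothing more than this (the nameserver address is an example):

```shell
# Replace the dangling symlink with a plain static file:
rm /etc/resolv.conf
echo "nameserver 192.168.1.1" > /etc/resolv.conf   # example resolver IP
```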

The only thing left was to update the GRUB bootloader:

update-grub

This generated the new GRUB config file in /boot/grub/grub.cfg. Then I installed the bootloader itself on the new OS disk:

grub-install /dev/sdb

TL;DR: The resolvconf package may be helpful during a "normal" start of the OS, but for maintenance tasks like this one (in a rescue or repair mode), a classic static /etc/resolv.conf file does the job better, because it does not depend on another script (resolvconf).

 

MySQL InnoDB: Operating system error number 13 in a file operation
Tuesday - Feb 21st 2017

Got an interesting MySQL error today after updating an Ubuntu 14.04 system with apt-get upgrade. The system language is German; in short, dpkg asked whether to keep the locally modified AppArmor profile /etc/apparmor.d/usr.sbin.mysqld or install the maintainer's version. I chose to install the maintainer's version, after which MySQL failed to start:

[...]
libwbclient0:amd64 (2:4.3.11+dfsg-0ubuntu0.14.04.4) wird eingerichtet ...
libdbd-mysql-perl (4.025-1ubuntu0.1) wird eingerichtet ...
mysql-client-core-5.5 (5.5.54-0ubuntu0.14.04.1) wird eingerichtet ...
mysql-client-5.5 (5.5.54-0ubuntu0.14.04.1) wird eingerichtet ...
mysql-server-core-5.5 (5.5.54-0ubuntu0.14.04.1) wird eingerichtet ...
mysql-server-5.5 (5.5.54-0ubuntu0.14.04.1) wird eingerichtet ...

Konfigurationsdatei »/etc/apparmor.d/usr.sbin.mysqld«
 ==> Geändert (von Ihnen oder von einem Skript) seit der Installation.
 ==> Paketverteiler hat eine aktualisierte Version herausgegeben.
   Wie möchten Sie vorgehen? Ihre Wahlmöglichkeiten sind:
    Y oder I : Die Version des Paket-Betreuers installieren
    N oder O : Die momentan installierte Version beibehalten
       D     : Die Unterschiede zwischen den Versionen anzeigen
       Z     : Eine Shell starten, um die Situation zu begutachten
 Der Standardweg ist das Beibehalten der momentanen Version.
*** usr.sbin.mysqld (Y/I/N/O/D/Z) [Vorgabe=N] ? D

Konfigurationsdatei »/etc/apparmor.d/usr.sbin.mysqld«
 ==> Geändert (von Ihnen oder von einem Skript) seit der Installation.
 ==> Paketverteiler hat eine aktualisierte Version herausgegeben.
   Wie möchten Sie vorgehen? Ihre Wahlmöglichkeiten sind:
    Y oder I : Die Version des Paket-Betreuers installieren
    N oder O : Die momentan installierte Version beibehalten
       D     : Die Unterschiede zwischen den Versionen anzeigen
       Z     : Eine Shell starten, um die Situation zu begutachten
 Der Standardweg ist das Beibehalten der momentanen Version.
*** usr.sbin.mysqld (Y/I/N/O/D/Z) [Vorgabe=N] ? I
Neue Version der Konfigurationsdatei /etc/apparmor.d/usr.sbin.mysqld wird installiert ...
start: Job failed to start
invoke-rc.d: initscript mysql, action "start" failed.
dpkg: Fehler beim Bearbeiten des Paketes mysql-server-5.5 (--configure):
 Unterprozess installiertes post-installation-Skript gab den Fehlerwert 1 zurück
dpkg: Abhängigkeitsprobleme verhindern Konfiguration von mysql-server:
 mysql-server hängt ab von mysql-server-5.5; aber:
  Paket mysql-server-5.5 ist noch nicht konfiguriert.

dpkg: Fehler beim Bearbeiten des Paketes mysql-server (--configure):
 Abhängigkeitsprobleme - verbleibt unkonfiguriert
mysql-client (5.5.54-0ubuntu0.14.04.1) wird eingerichtet ...
Es wurde kein Apport-Bericht verfasst, da die Fehlermeldung darauf hindeutet, dass dies lediglich ein Folgefehler eines vorherigen Problems ist.
[...]

So MySQL didn't start, which is why apt-get upgrade didn't finish successfully. In /var/log/mysql/error I saw the following errors:

170221 14:24:14 [Warning] Using unique option prefix myisam-recover instead of myisam-recover-options is deprecated and will be removed in a future release. Please use the full name instead.
170221 14:24:14 [Note] Plugin 'FEDERATED' is disabled.
/usr/sbin/mysqld: Can't find file: './mysql/plugin.frm' (errno: 13)
170221 14:24:14 [ERROR] Can't open the mysql.plugin table. Please run mysql_upgrade to create it.
170221 14:24:14 InnoDB: The InnoDB memory heap is disabled
170221 14:24:14 InnoDB: Mutexes and rw_locks use GCC atomic builtins
170221 14:24:14 InnoDB: Compressed tables use zlib 1.2.8
170221 14:24:14 InnoDB: Using Linux native AIO
170221 14:24:14 InnoDB: Initializing buffer pool, size = 128.0M
170221 14:24:14 InnoDB: Completed initialization of buffer pool
170221 14:24:14  InnoDB: Operating system error number 13 in a file operation.
InnoDB: The error means mysqld does not have the access rights to
InnoDB: the directory.
InnoDB: File name ./ibdata1
InnoDB: File operation call: 'open'.
InnoDB: Cannot continue operation.

dmesg revealed some more information:

[  634.595214] init: mysql respawning too fast, stopped
[  790.457596] audit_printk_skb: 69 callbacks suppressed
[  790.457599] audit: type=1400 audit(1487683611.187:119): apparmor="DENIED" operation="mknod" profile="/usr/sbin/mysqld" name="/db/mysql/app003.lower-test" pid=32765 comm="mysqld" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[  790.457620] audit: type=1400 audit(1487683611.187:120): apparmor="DENIED" operation="mknod" profile="/usr/sbin/mysqld" name="/db/mysql/app003.lower-test" pid=32765 comm="mysqld" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[  790.459437] audit: type=1400 audit(1487683611.187:121): apparmor="DENIED" operation="open" profile="/usr/sbin/mysqld" name="/db/mysql/mysql/db.frm" pid=302 comm="mysqld" requested_mask="r" denied_mask="r" fsuid=105 ouid=105
[  790.472090] audit: type=1400 audit(1487683611.199:122): apparmor="DENIED" operation="mknod" profile="/usr/sbin/mysqld" name="/db/mysql/app003.lower-test" pid=315 comm="mysqld" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[  790.472117] audit: type=1400 audit(1487683611.199:123): apparmor="DENIED" operation="mknod" profile="/usr/sbin/mysqld" name="/db/mysql/app003.lower-test" pid=315 comm="mysqld" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[  790.486818] audit: type=1400 audit(1487683611.215:124): apparmor="DENIED" operation="open" profile="/usr/sbin/mysqld" name="/db/mysql/ibdata1" pid=315 comm="mysqld" requested_mask="wr" denied_mask="wr" fsuid=105 ouid=105
[  790.610780] audit: type=1400 audit(1487683611.339:125): apparmor="DENIED" operation="mknod" profile="/usr/sbin/mysqld" name="/db/mysql/app003.lower-test" pid=333 comm="mysqld" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[  790.610820] audit: type=1400 audit(1487683611.339:126): apparmor="DENIED" operation="mknod" profile="/usr/sbin/mysqld" name="/db/mysql/app003.lower-test" pid=333 comm="mysqld" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[  790.624533] audit: type=1400 audit(1487683611.351:127): apparmor="DENIED" operation="open" profile="/usr/sbin/mysqld" name="/db/mysql/ibdata1" pid=333 comm="mysqld" requested_mask="wr" denied_mask="wr" fsuid=105 ouid=105
[  790.740241] audit: type=1400 audit(1487683611.467:128): apparmor="DENIED" operation="mknod" profile="/usr/sbin/mysqld" name="/db/mysql/app003.lower-test" pid=349 comm="mysqld" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[  791.175361] init: mysql main process (430) terminated with status 1
[  791.175369] init: mysql main process ended, respawning
[  792.060469] init: mysql post-start process (431) terminated with status 1
[  792.203859] init: mysql main process (475) terminated with status 1
[  792.203867] init: mysql main process ended, respawning
[  793.085085] init: mysql post-start process (476) terminated with status 1
[  793.237776] init: mysql main process (521) terminated with status 1
[  793.237793] init: mysql respawning too fast, stopped

Combine the errors from MySQL ("InnoDB: Operating system error number 13 in a file operation.") and dmesg (apparmor="DENIED" operation="mknod" profile="/usr/sbin/mysqld") and you know where the problem is. During the system patching, a new version of the AppArmor rules file was installed as well. In most cases this is wanted, as the rule files themselves get patched and previous bugs get fixed. But on this particular server it was not a wise choice, because - for a reason unknown to me - the admin who installed this system (with German as the system language, ugh!) chose /db as the path for the MySQL datadir. The Debian/Ubuntu default is of course /var/lib/mysql. Instead of setting /db as datadir and creating a new partition on /db, it would have been easier to keep the system defaults and simply mount /var/lib/mysql as a separate partition. That would have kept the MySQL installation compatible with future updates.

Because I didn't want AppArmor to be running anyway, I stopped and removed it (the alternative would have been to edit the AppArmor rules file for the MySQL binary). But even after that, MySQL was not able to start. I needed to reboot the server completely to get rid of the AppArmor security settings. After the reboot, MySQL started correctly again.
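For completeness, the alternative (editing the AppArmor rules for the MySQL binary) would have looked roughly like this - a sketch assuming the stock Ubuntu profile, which includes a local/ override file:

```shell
# Allow the non-default datadir /db/mysql in the local profile override
# (the rules mirror what the stock profile grants for /var/lib/mysql):
cat >> /etc/apparmor.d/local/usr.sbin.mysqld <<'EOF'
/db/mysql/ r,
/db/mysql/** rwk,
EOF

# Reload the profile so the change takes effect:
apparmor_parser -r /etc/apparmor.d/usr.sbin.mysqld
```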

 

UEFI boot does not like software raid at all (GRUB Error 17)
Monday - Feb 20th 2017

A couple of days ago I built a new home server for testing purposes with the following components:

  • Chassis: BitFenix Phenom Mini ITX
  • Power Supply: Enermax Revolution X't II 550W
  • Motherboard: Gigabyte F2A88XN-WIFI (although I didn't care about the WIFI, important to me was the FM2+ socket)
  • CPU: AMD A10-7860K
  • Memory: 2 x 8GB Corsair Vengeance Pro DDR3
  • Hard Drives: 2 x 500 GB Western Digital Caviar Blue (yes they're slow)

Just for documentation purposes, here's what the final result looks like:

 Debian Jessie Testserver Gigabyte F2A88XN-WIFI and AMD A10-7860K

For the setup of Debian Jessie, I created a bootable memory stick with the netinst image. Setup started smoothly (missing firmware was mentioned for the WIFI chip, iwlwifi-7620-9/iwlwifi-7620-8, and for the Realtek NIC, rtl_nic/rtl8168e-3, but the network card worked fine with the default drivers) and I went through the different steps. I set up my partitions like this:

  • HDD1: 3 partitions (20G, 25G, max size), each selected to be used as a RAID device
  • HDD2: 3 partitions (20G, 25G, max size), each selected to be used as a RAID device
  • Raid Device 1: /dev/sda1 + /dev/sdb1, RAID-1, ext4 defaults, mounted as /
  • Raid Device 2: /dev/sda2 + /dev/sdb2, RAID-1, Use as LVM physical volume
  • Raid Device 3: /dev/sda3 + /dev/sdb3, RAID-1, Use as LVM physical volume
  • PV /dev/md1 used for VG vgsystem. Created three LVs: lvvar (mounted on /var), lvtmp (mounted on /tmp), and lvswap as swap partition
  • PV /dev/md2 used for VG vgdata.

I wanted to continue, but the Debian installer informed me: "You haven't set up an EFI partition".

Debian Installer missing EFI System Partition 

Oh jeez, right. This is a new motherboard which supports (U)EFI boot. I needed to read some docs to understand what exactly was meant by an EFI partition and how to set it up, but the Debian installer pretty much does the job automatically once the EFI System Partition (ESP) is selected. This meant I had to destroy my partition layout and create a 500MB partition (250MB would probably be enough) for the ESP at the beginning of the disk.

The new partition layout with the UEFI partition looked like this:

  • HDD1: 4 partitions (500M, 20G, 25G, max size), each selected to be used as a RAID device
  • HDD2: 4 partitions (500M, 20G, 25G, max size), each selected to be used as a RAID device
  • Raid Device 1: /dev/sda1 + /dev/sdb1, RAID-1, Use as EFI System Partition
  • Raid Device 2: /dev/sda2 + /dev/sdb2, RAID-1, ext4 defaults, mounted as /
  • Raid Device 3: /dev/sda3 + /dev/sdb3, RAID-1, Use as LVM physical volume
  • Raid Device 4: /dev/sda4 + /dev/sdb4, RAID-1, Use as LVM physical volume
  • PV /dev/md2 used for VG vgsystem. Created three LVs: lvvar (mounted on /var), lvtmp (mounted on /tmp), and lvswap as swap partition
  • PV /dev/md3 used for VG vgdata.

Debian Partitioning with EFI System Partition 

The Debian installer continued with the remaining steps of installing the base system and eventually finished. Time to reboot. But then came the surprise: the system didn't boot. Right after the UEFI/BIOS screen of the motherboard, the following error was shown:

GRUB Loading stage1.5.
GRUB loading, please wait...
Error 17

GRUB Error 17 UEFI Boot

According to http://www.uruk.org/orig-grub/errors.html the Error 17 means:

17: Invalid device requested
 This error is returned if a device string is recognizable but does not fall under the other device errors.

For some reason it looks like the UEFI loader can't find the EFI System Partition.
Some additional research revealed the following very important information (from https://wiki.debian.org/UEFI): 

RAID for the EFI System Partition
This is arguably a mis-design in the UEFI specification - the ESP is a single point of failure on one disk. For systems with hardware RAID, that will provide some backup in case of disk failure. But for software RAID systems there is currently no support for putting the ESP on two separate disks in RAID. There might be a way to do something useful with fallback options, but this will need some investigation...

What the %"*+&@??!! Are you seriously telling me that UEFI, some 20 years newer than the classic BIOS, cannot boot from an EFI System Partition which sits on a software raid partition? This is not "arguably a mis-design", this is clearly a no-go! Even for a test server, I don't want to invest any time into making the system boot again when a HDD fails. I could set up /dev/sda1 + /dev/sdb1 as normal ESP partitions and run a cronjob to sync them manually, but that's a hack/workaround. I finally decided to ditch UEFI and switch to the legacy (BIOS) boot mode:
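The cronjob workaround I decided against would have been something like this - only a sketch, assuming /dev/sda1 holds the live ESP and /dev/sdb1 its standby copy of identical size:

```shell
# Clone the live ESP to the second disk, so the firmware has a
# bootable copy to fall back to when disk 1 dies:
dd if=/dev/sda1 of=/dev/sdb1 bs=1M
```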

Gigabyte UEFI BIOS set to legacy boot 

Of course I needed to adapt the partition layout once again; I went back to the layout from the beginning. I let Debian finish the installation and then rebooted. This time, booting (with legacy BIOS mode and GRUB as bootloader) worked like a charm.

 

Unbound DNS server behind a VIP - solving reply from unexpected source
Friday - Feb 17th 2017

In my previous post I wrote about getting Unbound to work without an Internet connection (Get Unbound DNS lookups working in Ubuntu 16.04 Xenial, understand SERVFAIL). For the blog post I simplified the setup, but in reality the DNS architecture was built as a "cluster" of two Xenial hosts running keepalived and sharing a virtual IP address (VIP).

After a couple of tests, I stumbled across the following problem:

$ dig -t A www.claudiokuenzler.com @192.168.10.10
;; reply from unexpected source: 192.168.10.11#53, expected 192.168.10.10#53
;; reply from unexpected source: 192.168.10.11#53, expected 192.168.10.10#53
;; reply from unexpected source: 192.168.10.11#53, expected 192.168.10.10#53

You guessed it: 192.168.10.10 is the VIP against which I ran the DNS lookup. But the DNS response came back from the active server's primary IP address, 192.168.10.11. dig correctly says "dude, something's not right" and doesn't accept the answer.

I already thought I'd have to set up some weird masquerading NAT to always respond from the VIP, but that would cause problems on the second host, the VRRP BACKUP host. It would also prevent local lookups from working (unless I added yet another masquerading hack for localhost). Just from trying to write down potential masquerading rules, you can tell this would be nothing but a hack to make something work which should be simple and work out of the box.

Luckily I read through the manpage of unbound.conf again and I found something very interesting:

       interface-automatic:
              Detect source interface on UDP queries and copy them to replies.  This feature is experimental, and needs support in your OS for particular
              socket options.  Default value is no.

The description sounds pretty much what I need, so I gave it a shot and added "interface-automatic: yes" into the config:

server:
        interface: 0.0.0.0
        interface-automatic: yes
        access-control: 10.0.0.0/16 allow
        access-control: 127.0.0.0/8 allow
        access-control: 172.16.0.0/12 allow
        access-control: 192.168.0.0/16 allow
        verbosity: 1
        domain-insecure: *
        root-hints: /var/lib/unbound/root.hints

After a restart of Unbound I tried it again - with success:

$ dig -t A www.claudiokuenzler.com @192.168.10.10

; <<>> DiG 9.9.5-3ubuntu0.11-Ubuntu <<>> -t A www.claudiokuenzler.com @192.168.10.10
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 32099
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.claudiokuenzler.com.       IN      A

;; ANSWER SECTION:
www.claudiokuenzler.com. 28636  IN      A       144.76.83.23

;; Query time: 4 msec
;; SERVER: 192.168.10.10#53(192.168.10.10)
;; WHEN: Fri Feb 17 11:14:33 CET 2017
;; MSG SIZE  rcvd: 68

And it still works on the primary IP of the cluster hosts, too:

$ dig -t A www.claudiokuenzler.com @192.168.10.11

; <<>> DiG 9.9.5-3ubuntu0.11-Ubuntu <<>> -t A www.claudiokuenzler.com @192.168.10.11
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62942
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.claudiokuenzler.com.       IN      A

;; ANSWER SECTION:
www.claudiokuenzler.com. 28493  IN      A       144.76.83.23

;; Query time: 0 msec
;; SERVER: 192.168.10.11#53(192.168.10.11)
;; WHEN: Fri Feb 17 11:16:54 CET 2017
;; MSG SIZE  rcvd: 68

Well done, Unbound! I'm positively surprised that this config option is available!

PS: I also tried to bind the Unbound listener to the VIP by setting "interface" to the virtual IP address in unbound.conf (instead of using all interfaces, 0.0.0.0). But as the VIP is only assigned once Unbound is running correctly (I added specific DNS checks in keepalived.conf), Unbound was not able to start and failed with:

unbound[32235]: [1487320777] unbound[32535:0] error: can't bind socket: Cannot assign requested address for 192.168.10.10
unbound[32235]: [1487320777] unbound[32535:0] fatal error: could not open ports
unbound[32235]:    ...fail!
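An alternative I didn't pursue (untested here) would be to allow binding to addresses which are not yet configured on the host:

```shell
# Kernel-level: allow any process to bind() to a non-local IPv4 address
sysctl -w net.ipv4.ip_nonlocal_bind=1

# Or in the server: section of unbound.conf, if the Unbound build supports it:
#   ip-transparent: yes
```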

 

Get Unbound DNS lookups working in Ubuntu 16.04 Xenial, understand SERVFAIL
Thursday - Feb 16th 2017

For an internal DNS resolver/cacher, I decided to use Unbound. However after the installation of unbound 1.5.8-1ubuntu1 and an initial configuration, DNS lookups didn't work.

The config:

server:
    interface: 0.0.0.0
        access-control: 10.0.0.0/16 allow
        access-control: 127.0.0.0/8 allow
        access-control: 172.16.0.0/12 allow
        access-control: 192.168.0.0/16 allow
        verbosity: 1
        root-hints: /var/lib/unbound/root.hints

forward-zone:
      name: "."
      forward-addr: domaincontroller1.example.com
      forward-addr: domaincontroller2.example.com
      forward-addr: domaincontroller3.example.com
      #forward-addr: 8.8.4.4        # Google
      #forward-addr: 8.8.8.8        # Google

DNS resolving didn't work. Unbound returned a SERVFAIL error.

# dig claudiokuenzler.com A @localhost

; <<>> DiG 9.10.3-P4-Ubuntu <<>> claudiokuenzler.com A @localhost
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 8470
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;claudiokuenzler.com.            IN    A

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Feb 16 07:04:55 CET 2017
;; MSG SIZE  rcvd: 37

When I restarted Unbound, I was puzzled by how slow the restart was, and eventually some errors showed up in syslog:

Feb 16 06:46:52 dnscacher systemd[1]: Stopping unbound.service...
Feb 16 06:46:52 dnscacher unbound[20218]:  * Stopping DNS server unbound
Feb 16 06:46:52 dnscacher unbound: [19904:0] info: service stopped (unbound 1.5.8).
Feb 16 06:46:52 dnscacher unbound: [19904:0] info: server stats for thread 0: 8 queries, 4 answers from cache, 4 recursions, 0 prefetch
Feb 16 06:46:52 dnscacher unbound: [19904:0] info: server stats for thread 0: requestlist max 1 avg 0.5 exceeded 0 jostled 0
Feb 16 06:46:52 dnscacher unbound: [19904:0] info: average recursion processing time 0.183547 sec
Feb 16 06:46:52 dnscacher unbound: [19904:0] info: histogram of recursion processing times
Feb 16 06:46:52 dnscacher unbound: [19904:0] info: [25%]=0.000384 median[50%]=0.000512 [75%]=0.393216
Feb 16 06:46:52 dnscacher unbound: [19904:0] info: lower(secs) upper(secs) recursions
Feb 16 06:46:52 dnscacher unbound: [19904:0] info:    0.000256    0.000512 2
Feb 16 06:46:52 dnscacher unbound: [19904:0] info:    0.262144    0.524288 2
Feb 16 06:46:52 dnscacher unbound[20218]:    ...done.
Feb 16 06:46:52 dnscacher systemd[1]: Stopped unbound.service.
Feb 16 06:46:52 dnscacher systemd[1]: Starting unbound.service...
Feb 16 06:46:52 dnscacher unbound[20247]:  * Starting DNS server unbound
Feb 16 06:47:42 dnscacher unbound-anchor: /var/lib/unbound/root.key has content
Feb 16 06:47:42 dnscacher unbound-anchor: fail: the anchor is NOT ok and could not be fixed
Feb 16 06:47:42 dnscacher unbound: [20339:0] notice: init module 0: validator
Feb 16 06:47:42 dnscacher unbound: [20339:0] notice: init module 1: iterator
Feb 16 06:47:42 dnscacher unbound: [20339:0] info: start of service (unbound 1.5.8).
Feb 16 06:47:42 dnscacher unbound: [20339:0] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN
Feb 16 06:47:42 dnscacher unbound: message repeated 5 times: [ [20339:0] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN]
Feb 16 06:47:42 dnscacher unbound[20247]:    ...done.
Feb 16 06:47:42 dnscacher systemd[1]: Started unbound.service.
Feb 16 06:47:42 dnscacher systemd[1]: Reached target Host and Network Name Lookups.

It turned out that the slow start is caused by the unbound-anchor verification, which can be triggered manually with the following command:

# time unbound-anchor -a /var/lib/unbound/root.key -v
/var/lib/unbound/root.key has content
fail: the anchor is NOT ok and could not be fixed

real    0m50.453s
user    0m0.004s
sys    0m0.012s

Here we have the 50 seconds which slowed down the start (see the timestamps in the syslog output above). Using tcpdump, I was able to determine what happens during the start (and what causes the slow startup):

# service unbound restart & 2>&1; tcpdump port 53
[1] 11942
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens160, link-type EN10MB (Ethernet), capture size 262144 bytes
10:58:07.788300 IP dnscacher.23444 > 192.5.5.241.53: 1053% [1au] NS? . (28)
10:58:08.164795 IP dnscacher.58715 > 192.112.36.4.53: 41519% [1au] NS? . (28)
10:58:08.541290 IP dnscacher.26205 > 192.112.36.4.53: 52241% [1au] NS? . (28)
10:58:08.917835 IP dnscacher.18709 > 199.7.91.13.53: 42065% [1au] NS? . (28)
10:58:09.294371 IP dnscacher.42646 > 199.7.91.13.53: 21034% [1au] NS? . (28)
10:58:09.670891 IP dnscacher.56481 > 198.97.190.53.53: 9754% [1au] NS? . (28)
10:58:10.047466 IP dnscacher.61621 > 198.97.190.53.53: 51384% [1au] NS? . (28)
10:58:10.424028 IP dnscacher.37874 > 192.5.5.241.53: 14983% [1au] NS? . (28)
10:58:11.176957 IP dnscacher.43299 > 192.5.5.241.53: 31294% [1au] NS? . (28)
[...] # shortened output

Unbound tried to connect to these and some more IP addresses. They look kinda familiar, don't they? Ah, of course, the DNS root servers, the spine of the Internet!

# grep 192.5.5.241 /var/lib/unbound/root.hints
F.ROOT-SERVERS.NET.      3600000      A     192.5.5.241

As this internal DNS resolver/cacher doesn't have Internet access (and therefore no DNS access to the root servers), this explains the slow start. But why wouldn't Unbound return any results at all? The connectivity to the internal Active Directory controllers (domaincontroller1-3) works.
Here I stumbled across Debian bug report #699641:

"you can set the "domain-insecure" option on your local domains in order to prevent DNSSEC failures from impacting the resolution of those domains"

So this is the actual workaround (disable DNSSEC), but it doesn't explain the reason. There are two reasons why Unbound was unable to return results for the resolved domains:

1. Missing Internet connection

The error in syslog gives a good hint:

Feb 16 06:47:42 dnscacher unbound: [20339:0] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN
Feb 16 06:47:42 dnscacher unbound: message repeated 5 times: [ [20339:0] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN]

As tcpdump revealed, Unbound tried to connect to all DNS root servers. This happened not just for fun: Unbound tried to verify the DNSKEY in order to enable DNSSEC. As this host has no Internet connection, the verification couldn't succeed.

2. Unbound runs in DNSSEC mode

Unbound runs by default in DNSSEC (validation) mode, at least with the base installation on Ubuntu. Because the trust anchor could not be primed, validation fails for every answer - so even non-DNSSEC requests return a SERVFAIL status.

Solution: Either make sure the cacher host has Internet access (at least to the DNS Root servers) or disable DNSSEC completely.

As this is a DNS resolver and cacher for internal networks only, I didn't see the necessity of DNSSEC and disabled it. In the Ubuntu setup, two changes are necessary to disable DNSSEC:

First, I added the domain-insecure option in unbound.conf:

server:
    interface: 0.0.0.0
        access-control: 10.0.0.0/16 allow
        access-control: 127.0.0.0/8 allow
        access-control: 172.16.0.0/12 allow
        access-control: 192.168.0.0/16 allow
        verbosity: 1
        root-hints: /var/lib/unbound/root.hints
        domain-insecure: *

But that's not enough. There's also a separate config file active (/etc/unbound/unbound.conf.d/root-auto-trust-anchor-file.conf). In this file I commented out the auto-trust-anchor-file option:

# cat /etc/unbound/unbound.conf.d/root-auto-trust-anchor-file.conf
server:
    # The following line will configure unbound to perform cryptographic
    # DNSSEC validation using the root trust anchor.
    #auto-trust-anchor-file: "/var/lib/unbound/root.key"

Then another restart of Unbound:

# service unbound restart & 2>&1; tail -f /var/log/syslog
Feb 16 07:11:11 dnscacher unbound[23185]:  * Starting DNS server unbound
Feb 16 07:12:01 dnscacher unbound-anchor: /var/lib/unbound/root.key has content
Feb 16 07:12:01 dnscacher unbound-anchor: fail: the anchor is NOT ok and could not be fixed
Feb 16 07:12:01 dnscacher unbound: [23275:0] warning: did not exit gracefully last time (22816)
Feb 16 07:12:01 dnscacher unbound: [23276:0] notice: init module 0: validator
Feb 16 07:12:01 dnscacher unbound: [23276:0] notice: init module 1: iterator
Feb 16 07:12:01 dnscacher unbound: [23276:0] info: start of service (unbound 1.5.8).
Feb 16 07:12:01 dnscacher unbound[23185]:    ...done.
Feb 16 07:12:01 dnscacher systemd[1]: Started unbound.service.
Feb 16 07:12:01 dnscacher systemd[1]: Reached target Host and Network Name Lookups.

Yes, it still takes 50 seconds to start due to the missing Internet connection, but DNS resolution now works:

# dig -t A www.claudiokuenzler.com @localhost

; <<>> DiG 9.10.3-P4-Ubuntu <<>> -t A www.claudiokuenzler.com @localhost
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 15474
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.claudiokuenzler.com.    IN    A

;; ANSWER SECTION:
www.claudiokuenzler.com. 38400    IN    A    144.76.83.23

;; Query time: 47 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Feb 16 07:12:33 CET 2017
;; MSG SIZE  rcvd: 68

And the already queried DNS record can now be found in Unbound's cache:

# unbound-control dump_cache | grep claudiokuenzler
www.claudiokuenzler.com.    23458    IN    A    144.76.83.23
msg www.claudiokuenzler.com. IN A 33152 1 23458 2 1 0 0
www.claudiokuenzler.com. IN A 0

Note: Even though I disabled DNSSEC in this case, it's still wise to open the firewall so this host can reach the DNS root servers. Not only to avoid the slow start or to keep DNSSEC as an option, but because this is how DNS is designed: the root servers should be queried if the forwarders return no results.
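What "open the firewall" means depends on the setup (in my case the rules live on a separate firewall); as a host-level iptables sketch, assuming a default-deny OUTPUT chain, it would be:

```shell
# Allow outbound DNS queries (UDP and TCP port 53) to reach the root servers:
iptables -A OUTPUT -p udp --dport 53 -j ACCEPT
iptables -A OUTPUT -p tcp --dport 53 -j ACCEPT
```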

Update February 17th 2017: As I mentioned at the end of the article, a DNS connection to the root servers should work if possible. The missing firewall rules have now been created, and a restart is as quick as expected:

# time service unbound restart & 2>&1; tail -f /var/log/syslog
Feb 17 11:37:15 dnscacher systemd[1]: Stopping unbound.service...
Feb 17 11:37:15 dnscacher unbound[12066]:  * Stopping DNS server unbound
Feb 17 11:37:15 dnscacher unbound: [4569:0] info: service stopped (unbound 1.5.8).
Feb 17 11:37:15 dnscacher unbound: [4569:0] info: server stats for thread 0: 766 queries, 763 answers from cache, 3 recursions, 0 prefetch
Feb 17 11:37:15 dnscacher unbound: [4569:0] info: server stats for thread 0: requestlist max 0 avg 0 exceeded 0 jostled 0
Feb 17 11:37:15 dnscacher unbound: [4569:0] info: average recursion processing time 0.000464 sec
Feb 17 11:37:15 dnscacher unbound: [4569:0] info: histogram of recursion processing times
Feb 17 11:37:15 dnscacher unbound: [4569:0] info: [25%]=0 median[50%]=0 [75%]=0
Feb 17 11:37:15 dnscacher unbound: [4569:0] info: lower(secs) upper(secs) recursions
Feb 17 11:37:15 dnscacher unbound: [4569:0] info:    0.000256    0.000512 3
Feb 17 11:37:15 dnscacher unbound[12066]:    ...done.
Feb 17 11:37:15 dnscacher systemd[1]: Stopped unbound.service.
Feb 17 11:37:15 dnscacher systemd[1]: Starting unbound.service...
Feb 17 11:37:15 dnscacher unbound[12096]:  * Starting DNS server unbound
Feb 17 11:37:16 dnscacher unbound-anchor: /var/lib/unbound/root.key has content
Feb 17 11:37:16 dnscacher unbound-anchor: success: the anchor is ok
Feb 17 11:37:16 dnscacher unbound: [12112:0] notice: init module 0: validator
Feb 17 11:37:16 dnscacher unbound: [12112:0] notice: init module 1: iterator
Feb 17 11:37:16 dnscacher unbound: [12112:0] info: start of service (unbound 1.5.8).
Feb 17 11:37:16 dnscacher unbound[12096]:    ...done.
Feb 17 11:37:16 dnscacher systemd[1]: Started unbound.service.

real    0m0.152s
user    0m0.000s
sys     0m0.000s
Feb 17 11:37:16 dnscacher systemd[1]: Reached target Host and Network Name Lookups.

And as expected, the DNSKEY validation is now working, too:

Feb 17 11:37:16 dnscacher unbound-anchor: /var/lib/unbound/root.key has content
Feb 17 11:37:16 dnscacher unbound-anchor: success: the anchor is ok

 

The curious case of curl, SSL SNI and the HTTP Host header
Monday - Feb 13th 2017 - by - (0 comments)

As the digital age gets older, encryption of HTTP traffic is becoming the new standard. Encrypted HTTP connections will one day even be a requirement for serving web pages in Google's Chrome browser:

Eventually, we plan to label all HTTP pages as non-secure, and change the HTTP security indicator to the red triangle that we use for broken HTTPS.

Back in the dark ages of the Internet (= the 90's) every SSL listener required a dedicated IP address. Luckily, nowadays there is Server Name Indication (SNI) support. SNI allows running multiple SSL/TLS certificates on the same IP address, much like it has long been possible to serve multiple domains on the same IP address (virtual hosts). SNI has been broadly supported in operating systems and browsers since 2006. Still using something older than that? Your own bad luck (it's 2017, get over it).

However, not all tools are SNI-aware yet. I came across this myself when I wanted to check the response headers of a domain with an SNI certificate using the "curl" command:

$ curl -H "Host: testsite.example.net" https://lb.example.com -v -I
* Rebuilt URL to: https://lb.example.com/
* Hostname was NOT found in DNS cache
*   Trying 192.168.14.100...
* Connected to lb.example.com (192.168.14.100) port 443 (#0)
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Server key exchange (12):
* SSLv3, TLS handshake, Server finished (14):
* SSLv3, TLS handshake, Client key exchange (16):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSL connection using ECDHE-RSA-AES256-GCM-SHA384
* Server certificate:
*      subject: OU=Domain Control Validated; OU=Gandi Standard Wildcard SSL; CN=*.example.ch
*      start date: 2016-10-05 00:00:00 GMT
*      expire date: 2019-10-05 23:59:59 GMT
*      subjectAltName does not match lb.example.com
* SSL: no alternative certificate subject name matches target host name 'lb.example.com'
* Closing connection 0
* SSLv3, TLS alert, Client hello (1):
curl: (51) SSL: no alternative certificate subject name matches target host name 'lb.example.com'

The returned server certificate is a wildcard certificate for *.example.ch which has nothing to do with...

a) the requested HTTP Host Header testsite.example.net

b) the requested server name lb.example.com 

Of course curl then exits with a warning, that no certificate could be found for the host name. And there's actually a (more or less) simple explanation (found on http://superuser.com/questions/793600/curl-and-sni-enabled-server):

"SNI sends the hostname inside the TLS handshake (ClientHello). The server then chooses the correct certificate based on this information. Only after the TLS connection is successfully established it will send the HTTP-Request, which contains the Host header you specified."

Update February 24th 2017: I have since found the best technical and most understandable explanation on the Apache wiki (https://wiki.apache.org/httpd/NameBasedSSLVHosts and https://wiki.apache.org/httpd/NameBasedSSLVHostsWithSNI):

"Apache needs to know the name of the host in order to choose the correct certificate to setup the encryption layer. But the name of the host being requested is contained only in the HTTP request headers, which are part of the encrypted content. It is therefore not available until after the encryption is already negotiated. This means that the correct certificate cannot be selected"

"The solution is an extension to the SSL protocol called Server Name Indication (RFC 4366), which allows the client to include the requested hostname in the first message of its SSL handshake (connection setup). This allows the server to determine the correct named virtual host for the request and set the connection up accordingly from the start."

Of course this explanation applies to all web servers, not only Apache.
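In nginx terms, SNI is what makes a setup like the following possible: two certificates behind one IP address, selected by the server name sent in the ClientHello. A sketch with hypothetical certificate paths and the example hostnames from above:

```nginx
# Two TLS vhosts on the same IP: nginx picks the certificate
# based on the SNI server name from the TLS ClientHello,
# long before any HTTP Host header is seen.
server {
    listen 443 ssl;
    server_name testsite.example.net;
    ssl_certificate     /etc/ssl/certs/wildcard.example.net.pem;
    ssl_certificate_key /etc/ssl/private/wildcard.example.net.key;
}

server {
    listen 443 ssl;
    server_name othersite.example.ch;
    ssl_certificate     /etc/ssl/certs/wildcard.example.ch.pem;
    ssl_certificate_key /etc/ssl/private/wildcard.example.ch.key;
}
```

A client without SNI (or one sending the wrong SNI name, as curl did above) gets whichever certificate the default server block serves.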

So using the HTTP Host header is not the correct way to get to the SNI certificate. One would need a curl parameter to define the "SNI hostname". Exactly such a parameter, --sni-hostname, was actually requested in issue #607 on curl's GitHub repository. The issue itself was closed and the --resolve parameter was mentioned as a workaround. And this indeed does the job:

$ curl --resolve testsite.example.net:443:lb.example.com https://testsite.example.net -v -I
* Resolve testsite.example.net:443:lb.example.com found illegal!
* Rebuilt URL to: https://testsite.example.net/
* Hostname was NOT found in DNS cache
*   Trying 194.40.217.90...
* Connected to testsite.example.net (194.40.217.90) port 443 (#0)
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Server key exchange (12):
* SSLv3, TLS handshake, Server finished (14):
* SSLv3, TLS handshake, Client key exchange (16):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSL connection using ECDHE-RSA-AES256-GCM-SHA384
* Server certificate:
*      subject: OU=Domain Control Validated; OU=Gandi Standard Wildcard SSL; CN=*.example.net
*      start date: 2016-10-05 00:00:00 GMT
*      expire date: 2019-10-05 23:59:59 GMT
*      subjectAltName: testsite.example.net matched
*      issuer: C=FR; ST=Paris; L=Paris; O=Gandi; CN=Gandi Standard SSL CA 2
*      SSL certificate verify ok.
> HEAD / HTTP/1.1
> User-Agent: curl/7.35.0
> Host: testsite.example.net
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
[...]

This forces curl to resolve the requested "testsite.example.net" to lb.example.com. The correct certificate is now used.

The curl command becomes rather complicated, agreed. But there is also another issue, #614, which takes the SNI problem one step further with a new feature: the --connect-to parameter. This parameter was officially added in curl version 7.49.0.

Fixed in 7.49.0 - May 18 2016

Changes:

    schannel: Add ALPN support
    SSH: support CURLINFO_FILETIME
    SSH: new CURLOPT_QUOTE command "statvfs"
    wolfssl: Add ALPN support
    http2: added --http2-prior-knowledge
    http2: added CURL_HTTP_VERSION_2_PRIOR_KNOWLEDGE
    libcurl: added CURLOPT_CONNECT_TO
    curl: added --connect-to
    libcurl: added CURLOPT_TCP_FASTOPEN
    curl: added --tcp-fastopen
    curl: remove support for --ftpport, -http-request and --socks

However my curl version is still 7.35.0 and I was not in the mood to compile a newer one manually.

Besides the workaround of using curl with the --resolve parameter (and in the future the --connect-to parameter), there is also the option of checking the certificate using ... *drumroll* ... openssl!
Yes, openssl itself can of course be used to check the SNI certificate by using the -servername parameter:

$ openssl s_client -connect lb.example.com:443 -servername testsite.example.net
CONNECTED(00000003)
depth=2 C = US, ST = New Jersey, L = Jersey City, O = The USERTRUST Network, CN = USERTrust RSA Certification Authority
verify error:num=20:unable to get local issuer certificate
verify return:0
---
Certificate chain
 0 s:/OU=Domain Control Validated/OU=Gandi Standard Wildcard SSL/CN=*.example.net
   i:/C=FR/ST=Paris/L=Paris/O=Gandi/CN=Gandi Standard SSL CA 2
 1 s:/C=FR/ST=Paris/L=Paris/O=Gandi/CN=Gandi Standard SSL CA 2
   i:/C=US/ST=New Jersey/L=Jersey City/O=The USERTRUST Network/CN=USERTrust RSA Certification Authority
 2 s:/C=US/ST=New Jersey/L=Jersey City/O=The USERTRUST Network/CN=USERTrust RSA Certification Authority
   i:/C=SE/O=AddTrust AB/OU=AddTrust External TTP Network/CN=AddTrust External CA Root
---
Server certificate
-----BEGIN CERTIFICATE-----
[...]

As you can see, the correct wildcard certificate *.example.net was returned in the certificate chain. 

Reminder to myself: Don't let the curl output fool you.

 

Using check_netio monitoring plugin on CentOS 7 or RHEL 7
Friday - Jan 27th 2017 - by - (2 comments)

For years I've been using the monitoring plugin check_netio to get statistics of the network interfaces on Linux. Purely for graphing purposes as the plugin doesn't do a "classical check". An example of such a graph can be seen here:

check_netio graph 

But since RHEL 7 and CentOS 7, the plugin didn't return any values anymore. The reason lies in the changed output of the ifconfig command:

ifconfig on CentOS 6 and RHEL6 returns:

[root@centos6 ~]# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:50:56:B5:7A:26 
          inet addr:10.162.216.65  Bcast:10.162.216.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:124417619 errors:0 dropped:0 overruns:0 frame:0
          TX packets:45443808 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:824234857174 (767.6 GiB)  TX bytes:59269515633 (55.1 GiB)

ifconfig on CentOS 7 and RHEL7 returns:

[root@centos7 ~]# ifconfig ens192
ens192: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.162.214.132  netmask 255.255.255.0  broadcast 10.162.214.255
        inet6 fe80::fc3c:fc22:c796:54e1  prefixlen 64  scopeid 0x20<link>
        ether 00:50:56:8d:19:be  txqueuelen 1000  (Ethernet)
        RX packets 170715  bytes 56526535 (53.9 MiB)
        RX errors 0  dropped 38  overruns 0  frame 0
        TX packets 36453  bytes 7216222 (6.8 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

As the plugin uses grep 'bytes' and then cut to get the current number of bytes, you can see the problem by comparing the lines:

        RX bytes:824234857174 (767.6 GiB)  TX bytes:59269515633 (55.1 GiB)

vs.

        RX packets 170715  bytes 56526535 (53.9 MiB)
        TX packets 36453  bytes 7216222 (6.8 MiB)
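To make the byte extraction version-independent, the parsing has to handle both layouts. A minimal sketch (a hypothetical helper, not the actual plugin code) that pulls the RX/TX byte counters out of either format:

```shell
# Extract RX/TX byte counters from old-style (CentOS 6) or
# new-style (CentOS 7) ifconfig output read on stdin.
parse_netio() {
    awk '
        /RX bytes:/ {                      # old format: both counters on one line
            sub(/.*RX bytes:/, ""); rx = $1
            sub(/.*TX bytes:/, ""); tx = $1
        }
        /RX packets [0-9]/ { rx = $5 }     # new format: "RX packets N  bytes M ..."
        /TX packets [0-9]/ { tx = $5 }
        END { print rx, tx }
    '
}
```

Piping "ifconfig eth0" into parse_netio then prints the RX and TX byte counts regardless of which ifconfig version produced the output.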

In order to use the plugin on newer CentOS 7 and RHEL 7 installations, too, I adapted it. Unfortunately I couldn't find a source repository to contribute to, so I created a new repository on my GitHub account (see my repository check_netio).

Besides fixing the CentOS/RHEL 7 case, I also added a simple check whether the given network interface even exists on the system. In the original plugin, a missing interface simply caused the plugin to return OK with no values.

[root@rhel7 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.3 (Maipo)

[root@rhel7 ~]# /usr/lib/nagios/plugins/check_netio -i ens192
NETIO OK - ens192: RX=56936528, TX=7374799|NET_ens192_RX=56936528B;;;; NET_ens192_TX=7374799B;;;;

[root@rhel7 ~]# /usr/lib/nagios/plugins/check_netio -i ens999
NETIO UNKNOWN - No interface ens999 found
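The added existence check is essentially this (a simplified sketch assuming Linux sysfs, not the literal plugin code):

```shell
# Return Nagios state UNKNOWN (3) when the requested interface is missing.
check_iface() {
    if [ ! -d "/sys/class/net/$1" ]; then
        echo "NETIO UNKNOWN - No interface $1 found"
        return 3
    fi
    return 0
}
```

Calling "check_iface ens999" on a host without that interface prints the UNKNOWN line and returns 3, which Nagios interprets as the UNKNOWN state.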

 

Custom partitioning on new Red Hat Enterprise Linux 7 setup
Thursday - Jan 26th 2017 - by - (0 comments)

Unbelievable how annoying this RHEL 7 installation wizard is! Try to set up manual partitioning with a mix of primary partitions and logical volumes, and RHEL will not boot. Even worse: although I created the root partition first, the wizard somehow swapped it with the swap partition, putting swap on /dev/sda1 instead of /dev/sda2 (where I wanted it to be).

Solution to this: boot from Knoppix (yes, seriously) and manually create the partitions with parted. Once this was done, I rebooted from the RHEL image. On the "Installation Destination" submenu I selected "I will configure partitioning" and then clicked on the blue Done button.
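Since parted can operate on a plain file just like on a block device, the manual layout can be rehearsed risk-free on a loopback image first. A sketch with hypothetical sizes; point parted at the real device only once the layout is verified:

```shell
# Rehearse the partition layout on an image file instead of /dev/sda.
command -v parted >/dev/null || { echo "parted not installed"; exit 0; }
truncate -s 100M disk.img
parted -s disk.img mklabel msdos
parted -s disk.img mkpart primary ext4 1MiB 20MiB          # e.g. /boot
parted -s disk.img mkpart primary linux-swap 21MiB 40MiB   # e.g. swap
parted -s disk.img mkpart primary ext4 41MiB 100%          # e.g. /
parted -s disk.img print                                   # verify the result
```

The print output should list the three primary partitions exactly as created, with no surprises from any wizard.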

RHEL 7 Setup Installation Destination 

Inside the "Manual Partitioning" window I clicked on the rescan button. The already created partitions were then shown under a section called "unknown". To tell the RHEL 7 installer to use these partitions, I had to select each partition, enter the mount point and click on "Reformat". The partitions were then finally shown in the installer the way I wanted them:

RHEL 7 Setup custom partitioning 

This is surely the worst partitioning wizard I have seen in a recent Linux distribution (and I have installed new machines with Ubuntu 16.04, SLES 12, CentOS 7, Debian Jessie and Linux Mint 17.3 in the past months).

 

Custom HTTP headers not showing up on 404 response status
Monday - Jan 23rd 2017 - by - (0 comments)

One of the OWASP recommendations for SSL/TLS connections is to enable HSTS (HTTP Strict Transport Security). OWASP describes HSTS as:

HTTP Strict Transport Security (HSTS) is an opt-in security enhancement that is specified by a web application through the use of a special response header. Once a supported browser receives this header that browser will prevent any communications from being sent over HTTP to the specified domain and will instead send all communications over HTTPS. It also prevents HTTPS click through prompts on browsers. 

The mentioned HTTP header is called "Strict-Transport-Security" which I of course added into my nginx config:

  add_header Strict-Transport-Security max-age=2678400;

I've been using this header since 2014. So I was pretty surprised when a developer contacted me today and mentioned that the header doesn't appear on a 404 status page. He was right:

$ curl https://www.claudiokuenzler.com -I
HTTP/1.1 200 OK
Server: nginx
Date: Mon, 23 Jan 2017 14:31:11 GMT
Content-Type: text/html
Connection: keep-alive
X-Powered-By: PHP/5.4.19
Vary: Accept-Encoding
Strict-Transport-Security: max-age=2678400

$ curl https://www.claudiokuenzler.com/i-do-not-exist/ -I
HTTP/1.1 404 Not Found
Server: nginx
Date: Mon, 23 Jan 2017 14:31:39 GMT
Content-Type: text/html; charset=iso-8859-1
Connection: keep-alive
Vary: Accept-Encoding

Frankly, I couldn't explain it at first, but after reading the nginx documentation of add_header again, it is all explained:

Adds the specified field to a response header provided that the response code equals 200, 201, 204, 206, 301, 302, 303, 304, or 307.

Oops, 404 is not mentioned there. The documentation provides the solution to this, too:

 If the always parameter is specified (1.7.5), the header field will be added regardless of the response code.

So if your nginx is recent enough (at least 1.7.5) you can simply add the "always" parameter like this:

  add_header Strict-Transport-Security max-age=2678400 always;

And the HTTP header is added no matter what HTTP status is returned.
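A related gotcha worth knowing: nginx only inherits add_header directives from the enclosing level if the current level defines no add_header of its own. So a location block that adds its own header silently drops the HSTS header again. A sketch with a hypothetical location and header name:

```nginx
server {
    add_header Strict-Transport-Security max-age=2678400 always;

    location /api/ {
        # This add_header replaces ALL inherited ones: the HSTS header
        # from the server level is gone for /api/ responses...
        add_header X-Api-Version 1;
        # ...unless it is repeated at this level, too:
        add_header Strict-Transport-Security max-age=2678400 always;
    }
}
```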

 

Rancher: Error response from daemon /usr/bin/dockerd (deleted)
Monday - Jan 23rd 2017 - by - (0 comments)

In a recent post (The Docker Dilemma: Benefits and risks going into production with Docker) I mentioned we're going forward with Rancher as an orchestration layer on top of Docker. 

Since last Friday (Jan 20 2017) there were sporadic error messages shown in the user interface when new containers were about to be started:

(Expected state running but got error: Error response from daemon: oci runtime error: process_linux.go:330: running prestart hook 0 caused "fork/exec /usr/bin/dockerd (deleted): no such file or directory: ")

This got me puzzled as the file clearly exists on that particular Docker host:

root@dockerserver:~# ll /usr/bin/dockerd
-rwxr-xr-x 1 root root 39063824 Jan 13 19:44 /usr/bin/dockerd

It turned out that I had run an Ansible playbook on Friday afternoon which was written for these Rancher/Docker hosts. One of its tasks is to ensure that the docker.io package is installed:

  - name: 1.0 - Install docker.io
    apt: name={{item}} state=latest
    with_items:
    - docker.io

The apt logs confirmed my guess:

Start-Date: 2017-01-20  15:04:06
Commandline: /usr/bin/apt-get -y -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold install docker.io
Requested-By: ansible (1001)
Upgrade: runc:amd64 (1.0.0~rc1-0ubuntu1~16.04, 1.0.0~rc1-0ubuntu2~16.04.1), docker.io:amd64 (1.12.1-0ubuntu13~16.04.1, 1.12.3-0ubuntu4~16.04.2)
End-Date: 2017-01-20  15:04:10

The docker.io package was updated from 1.12.1-0ubuntu13~16.04.1 to 1.12.3-0ubuntu4~16.04.2, but the Docker daemon was somehow not fully restarted. This is what caused the error message in Rancher, because some stale or deleted files were probably still held open by the old Docker daemon process.
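To avoid ending up with a half-restarted daemon after such an upgrade, the playbook could restart Docker explicitly whenever the package changes, for example with a handler. A sketch using the standard service module (keep in mind a Docker restart also restarts the containers, so schedule it deliberately):

```yaml
  - name: 1.0 - Install docker.io
    apt: name={{item}} state=latest
    with_items:
    - docker.io
    notify: restart docker

  # In the handlers section of the play:
  handlers:
    - name: restart docker
      service: name=docker state=restarted
```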

Syslog shows more information about what exactly happened at 15:04:

Jan 20 15:04:03 dockerserver systemd[1]: Created slice User Slice of ansible.
Jan 20 15:04:03 dockerserver systemd[1]: Starting User Manager for UID 1001...
Jan 20 15:04:03 dockerserver systemd[1]: Started Session 14956 of user ansible.
Jan 20 15:04:03 dockerserver systemd[18545]: Reached target Sockets.
Jan 20 15:04:03 dockerserver systemd[18545]: Reached target Timers.
Jan 20 15:04:03 dockerserver systemd[18545]: Reached target Paths.
Jan 20 15:04:03 dockerserver systemd[18545]: Reached target Basic System.
Jan 20 15:04:03 dockerserver systemd[18545]: Reached target Default.
Jan 20 15:04:03 dockerserver systemd[18545]: Startup finished in 22ms.
Jan 20 15:04:03 dockerserver systemd[1]: Started User Manager for UID 1001.
Jan 20 15:04:08 dockerserver systemd[1]: Reloading.
Jan 20 15:04:09 dockerserver systemd[1]: apt-daily.timer: Adding 37min 2.565881s random time.
Jan 20 15:04:09 dockerserver systemd[1]: Started ACPI event daemon.
Jan 20 15:04:09 dockerserver systemd[1]: Reloading.
Jan 20 15:04:09 dockerserver systemd[1]: apt-daily.timer: Adding 5h 27min 32.130480s random time.
Jan 20 15:04:09 dockerserver systemd[1]: Started ACPI event daemon.
Jan 20 15:04:10 dockerserver systemd[1]: Reloading.
Jan 20 15:04:10 dockerserver systemd[1]: apt-daily.timer: Adding 11h 55min 14.530665s random time.
Jan 20 15:04:10 dockerserver systemd[1]: Started ACPI event daemon.
Jan 20 15:04:10 dockerserver systemd[1]: Reloading.
Jan 20 15:04:10 dockerserver systemd[1]: apt-daily.timer: Adding 11h 16min 17.514645s random time.
Jan 20 15:04:10 dockerserver systemd[1]: Started ACPI event daemon.
Jan 20 15:04:10 dockerserver systemd[1]: Started Docker Application Container Engine.
Jan 20 15:04:10 dockerserver systemd[1]: Reloading.
Jan 20 15:04:10 dockerserver systemd[1]: apt-daily.timer: Adding 11h 42min 36.963745s random time.
Jan 20 15:04:10 dockerserver systemd[1]: Started ACPI event daemon.

So we see that Docker was started (Jan 20 15:04:10 dockerserver systemd[1]: Started Docker Application Container Engine.) but where's the stop prior to that?

So to me it looks like Docker was started twice, running as two parallel processes and therefore causing sporadic error messages. A manual restart of the Docker daemon solved the problem.

If I launch a manual restart of the Docker service, this results in the following log entries:

Jan 23 09:54:38 dockerserver systemd[1]: Stopping Docker Application Container Engine...
Jan 23 09:54:50 dockerserver systemd[1]: Stopped Docker Application Container Engine.
Jan 23 09:54:50 dockerserver systemd[1]: Closed Docker Socket for the API.
Jan 23 09:54:50 dockerserver systemd[1]: Stopping Docker Socket for the API.
Jan 23 09:54:50 dockerserver systemd[1]: Starting Docker Socket for the API.
Jan 23 09:54:50 dockerserver systemd[1]: Listening on Docker Socket for the API.
Jan 23 09:54:50 dockerserver systemd[1]: Starting Docker Application Container Engine...
Jan 23 09:54:54 dockerserver dockerd[26724]: time="2017-01-23T09:54:54.088301837+01:00" level=info msg="Docker daemon" commit=6b644ec graphdriver=aufs version=1.12.3
Jan 23 09:54:54 dockerserver systemd[1]: Started Docker Application Container Engine.

The question remains why the update process itself didn't restart the Docker service (similar to updates of other packages like MySQL or Apache). If one takes a closer look at the Debian source package, docker.io.preinst doesn't exist. In other packages (for example mysql-server) this file is responsible for stopping the service prior to the package update.

 

