Tackling Confluence migration issues (Windows+MSSQL to Linux+MariaDB)
Friday - Mar 24th 2017

One of my latest tasks is a migration of Atlassian Confluence 5.7 from a pure Windows environment to a Linux environment.

Status quo: The Confluence application (Tomcat) runs on a Windows Server 2012. Its database runs on a Microsoft SQL Server 2008 R2.

Target installation: The Confluence application (Tomcat) runs on an Ubuntu 16.04 machine (probably an LXC container). The database runs on a MariaDB 10.x Galera Cluster. For the test environment, a single MariaDB instance running on localhost on the same server as the application was chosen.

But already during the first test migration I ran into a couple of issues; as soon as one was solved, the next one appeared. The following issues were all successfully tackled.

 

Confluence database installation does not complete

Before migrating anything, I wanted a clean and empty Confluence installation. But right after selecting the database, the browser-based setup failed with the following error message:

Hibernate operation: Could not execute query; bad SQL grammar

This turned out to be a problem with the MariaDB default settings. The default character set and collation were defined as utf8mb4 in /etc/mysql/mariadb.conf.d/50-server.cnf. Had I read the Atlassian documentation for Confluence Database Setup For MySQL beforehand, I would have known that this must be set to utf8 - not utf8mb4!

# MySQL/MariaDB default is Latin1, but in Debian we rather default to the full
# utf8 4-byte character set. See also client.cnf
#
# 2017-03-16 Claudio: Do not use utf8mb4 for Atlassian products!
#character-set-server  = utf8mb4
#collation-server      = utf8mb4_general_ci
character-set-server=utf8
collation-server=utf8_bin

There are other important MySQL settings mentioned in the Atlassian documentation. These were applied as well.

After the character set and collation were adjusted and the database was dropped and created from scratch, the setup ran through and created all necessary tables.
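For reference, dropping and recreating the database with the correct character set can look like this (a sketch only; the database name, user and password are assumptions - the Atlassian documentation lists the exact statements):

-- at the MariaDB prompt, as a privileged user:
DROP DATABASE IF EXISTS confluence;
CREATE DATABASE confluence CHARACTER SET utf8 COLLATE utf8_bin;
GRANT ALL PRIVILEGES ON confluence.* TO 'confluence'@'localhost' IDENTIFIED BY 'secret';
FLUSH PRIVILEGES;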

 

Database cannot be migrated

At first I thought the easiest (and fastest) way would be to use a dump from the MSSQL database and import it into the MariaDB/MySQL DB. But I got errors with every conversion tool. I tried MySQL Workbench's Migration Wizard, DBConvert for MSSQL & MySQL and HeidiSQL using database export/import - all of them failed.

MySQL Workbench failed MSSQL migration

DBConvert was able to show me why:

DBConvert Foreign Key Cyclic Reference Error 

In the CONTENT table there's a foreign key pointing to the SPACES table, while in the SPACES table there's a foreign key pointing back to the CONTENT table. See the dilemma? That's what's called a cyclic reference: one row cannot be created without the other. It results in "foreign key constraint fails" errors:

DBConvert Migration fails foreign key constraint

I tried setting MariaDB's global variable to ignore the foreign key constraints (SET GLOBAL FOREIGN_KEY_CHECKS=0;), but this didn't help either.
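For completeness: SET GLOBAL only affects connections opened after the change, so the usual approach is to disable the checks inside the import session itself. A sketch (the dump file name is hypothetical):

-- inside a mysql session connected to the target database:
SET FOREIGN_KEY_CHECKS=0;
SOURCE /tmp/confluence_mssql_dump.sql;
SET FOREIGN_KEY_CHECKS=1;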

Even a pure data import (using a data dump from a CSV file) didn't work.

In the end it turned out to be the wrong choice anyway, because the migration should be done with Confluence's backup/restore method. See below.

 

Restore process fails with invalid XML character

I also tried Confluence's backup and restore method, which was first considered as an alternative migration process. In the old Confluence application a backup without attachments was created. This took around 30 minutes and produced a zip file containing the whole database in XML format (entities.xml within the zip file). Once this zip file was transferred to the new Confluence server, the restore was launched from the browser (right after the database setup). But unfortunately the import failed with the following error:

Confluence Restore Progress Import Failed

Import failed. Check your server logs for more information.
com.atlassian.confluence.importexport.ImportExportException:  Unable to complete import: An invalid XML character (Unicode: 0xffff) was found in the CDATA section.

I came across similar problems on the net and found a Perl replace command which removes this invalid character. To do that, the zip file must be unzipped and the command run directly on entities.xml. Afterwards, the zip file needs to be recreated:

unzip xmlexport-20170316-145220-91.zip
perl -i -pe 's/\xef\xbf\xbf//g' entities.xml
rm xmlexport-20170316-145220-91.zip
zip xmlexport-20170316-145220-91.zip entities.xml exportDescriptor.properties plugin-data

With the newly created zip file I started the restore again, and this time it began importing. Until 4%...

 

Restore process fails with could not insert error

So once the invalid XML character error was gone, the next restore attempt failed at 4% with the following error:

Confluence Restore Progress Import Failed 

Import failed. Check your server logs for more information.
com.atlassian.confluence.importexport.ImportExportException:  Unable to complete import: Error while importing backup: could not insert: [com.atlassian.confluence.pages.Page#59705648]

This error message shown in the browser is not really helpful, though. The Confluence application log ($CONFLUENCE_HOME/logs/atlassian-confluence.log) gives a better idea of the real problem:

Caused by: net.sf.hibernate.HibernateException: could not insert: [com.atlassian.confluence.pages.Page#59705648]
    at net.sf.hibernate.persister.EntityPersister.insert(EntityPersister.java:475)
    at net.sf.hibernate.persister.EntityPersister.insert(EntityPersister.java:436)
    at net.sf.hibernate.impl.ScheduledInsertion.execute(ScheduledInsertion.java:37)
    at net.sf.hibernate.impl.SessionImpl.execute(SessionImpl.java:2476)
    at net.sf.hibernate.impl.SessionImpl.executeAll(SessionImpl.java:2462)
    at net.sf.hibernate.impl.SessionImpl.execute(SessionImpl.java:2419)
    at net.sf.hibernate.impl.SessionImpl.flush(SessionImpl.java:2288)
    at com.atlassian.confluence.importexport.xmlimport.ImportProcessorContext.flushIfNeeded(ImportProcessorContext.java:246)
    at com.atlassian.confluence.importexport.xmlimport.ImportProcessorContext.objectImported(ImportProcessorContext.java:81)
    at com.atlassian.confluence.importexport.xmlimport.DefaultImportProcessor.persist(DefaultImportProcessor.java:50)
    at com.atlassian.confluence.importexport.xmlimport.DefaultImportProcessor.processObject(DefaultImportProcessor.java:37)
    at com.atlassian.confluence.importexport.xmlimport.parser.BackupParser.endElement(BackupParser.java:51)
    ... 44 more
Caused by: net.sf.hibernate.exception.GenericJDBCException: Could not execute JDBC batch update
    at net.sf.hibernate.exception.ErrorCodeConverter.handledNonSpecificException(ErrorCodeConverter.java:90)
    at net.sf.hibernate.exception.ErrorCodeConverter.convert(ErrorCodeConverter.java:79)
    at net.sf.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:29)
    at net.sf.hibernate.impl.BatcherImpl.convert(BatcherImpl.java:328)
    at net.sf.hibernate.impl.BatcherImpl.executeBatch(BatcherImpl.java:135)
    at net.sf.hibernate.impl.BatcherImpl.prepareStatement(BatcherImpl.java:61)
    at net.sf.hibernate.impl.BatcherImpl.prepareStatement(BatcherImpl.java:58)
    at net.sf.hibernate.impl.BatcherImpl.prepareBatchStatement(BatcherImpl.java:111)
    at net.sf.hibernate.persister.EntityPersister.insert(EntityPersister.java:454)
    ... 55 more
Caused by: java.sql.BatchUpdateException: Data truncation: Incorrect string value: '\xF0\x9F\x93\xBAF&...' for column 'BODY' at row 1
    at com.mysql.jdbc.SQLError.createBatchUpdateException(SQLError.java:1167)
    at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:1778)
    at com.mysql.jdbc.PreparedStatement.executeBatchInternal(PreparedStatement.java:1262)
    at com.mysql.jdbc.StatementImpl.executeBatch(StatementImpl.java:958)
    at com.mchange.v2.c3p0.impl.NewProxyPreparedStatement.executeBatch(NewProxyPreparedStatement.java:1723)
    at net.sf.hibernate.impl.BatchingBatcher.doExecuteBatch(BatchingBatcher.java:54)
    at net.sf.hibernate.impl.BatcherImpl.executeBatch(BatcherImpl.java:128)
    ... 59 more
Caused by: com.mysql.jdbc.MysqlDataTruncation: Data truncation: Incorrect string value: '\xF0\x9F\x93\xBAF&...' for column 'BODY' at row 1
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3971)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3909)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2527)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2680)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2501)
    at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1858)
    at com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2079)
    at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:1756)
    ... 64 more

Again a problem with characters (incorrect string value)! In this case, \xF0\x9F\x93\xBA is the 4-byte UTF-8 encoding of an emoji - exactly the kind of character that fits into utf8mb4 but not into MySQL's 3-byte utf8 character set, so it cannot be stored in the BODY column.

After this we contacted Atlassian's support and a few hours later we got an answer with a very helpful link: https://confluence.atlassian.com/confkb/incorrect-string-value-error-thrown-when-restoring-xml-backup-in-confluence-777009599.html

On this page, three different commands are used to fix the entities.xml file:

  1. A Java program (atlassian-xml-cleaner-0.1.jar) which is run on entities.xml and the results are saved in a second file entities-clean.xml
  2. A Perl command to remove 0xffff characters (the same command as I already ran above!)
  3. A Perl command to remove 0xfffe characters

For the sake of completeness, here is the whole procedure with the zip file and the zipped entities.xml again:

# unzip backup file
unzip xmlexport-20170316-145220-91.zip

# Download Atlassian xml cleaner
wget https://confluence.atlassian.com/confkb/files/777009599/777009596/1/1439839413460/atlassian-xml-cleaner-0.1.jar

# Fix xml character issues
/opt/atlassian/confluence/jre/bin/java -jar atlassian-xml-cleaner-0.1.jar entities.xml > entities-clean.xml
perl -i -pe 's/\xef\xbf\xbf//g' entities-clean.xml
perl -i -pe 's/\xef\xbf\xbe//g' entities-clean.xml
 
# Recreate zip file
rm entities.xml
mv entities-clean.xml entities.xml
zip xmlexport-20170316-145220-91new.zip entities.xml exportDescriptor.properties plugin-data

Then the restore was launched again with the newly created zip file xmlexport-20170316-145220-91new.zip.


Restore process fails with out of memory error

The restore process was a bit over 20% complete when I spotted errors in the atlassian-confluence.log file (which I was tailing while the restore was running). The following errors appeared:

org.apache.velocity.exception.MethodInvocationException: Invocation of method 'getPrettyElapsedTime' in  class com.atlassian.confluence.importexport.actions.ImportLongRunningTask threw exception java.lang.OutOfMemoryError: Java heap space at /admin/longrunningtask-xml.vm[line 5, column 19]
        at org.apache.velocity.runtime.parser.node.ASTIdentifier.execute(ASTIdentifier.java:237)
        at org.apache.velocity.runtime.parser.node.ASTReference.execute(ASTReference.java:262)
        at org.apache.velocity.runtime.parser.node.ASTReference.render(ASTReference.java:342)
        at org.apache.velocity.runtime.parser.node.SimpleNode.render(SimpleNode.java:336)
        at org.apache.velocity.Template.merge(Template.java:328)
        at org.apache.velocity.Template.merge(Template.java:235)
        at com.opensymphony.webwork.dispatcher.VelocityResult.doExecute(VelocityResult.java:91)
[...]
Caused by: java.lang.OutOfMemoryError: Java heap space

The restore process receives a lot of data which is loaded into Tomcat's memory. The default maximum heap size of 1GB was simply too small and needed to be increased in Tomcat's setenv.sh:

grep Xmx /opt/atlassian/confluence/bin/setenv.sh
CATALINA_OPTS="$CATALINA_OPTS -Xms1024m -Xmx4096m -XX:MaxPermSize=384m -XX:+UseG1GC"

As you can see, I've set a new maximum heap size of 4GB.

I restarted the restore process and closely followed the memory usage of the Tomcat process. It went up to ~3500MB - until the import finally completed without additional errors!

Confluence Restore Progress Complete 

Hurray!

 

What looks very easy in retrospect was a time-consuming effort of several minds with a lot of trial and error. Especially the failed database migration caused a lot of headaches.
Special thanks go to Atlassian support for giving us the link we never stumbled upon ourselves (bad Google luck, I guess) and also to DBConvert, who gave us some help trying to get around the foreign key constraint errors (even though we hadn't bought the full software at that point)!

 

Unbound DNS not serving reverse lookup for internal addresses
Monday - Mar 20th 2017

If you've read some recent posts (Get Unbound DNS lookups working in Ubuntu 16.04 Xenial, understand SERVFAIL and Unbound DNS server behind a VIP - solving reply from unexpected source), you know that I've set up a highly available Unbound DNS resolver/cacher for internal networks.

But just a few days before the definitive rollout to all internal servers, a problem turned up: reverse DNS lookups didn't work.

With the Unbound DNS resolver defined as nameserver in /etc/resolv.conf:

$ host 192.168.253.153
Host 153.253.168.192.in-addr.arpa. not found: 3(NXDOMAIN)

$ host 10.161.206.153
Host 153.206.161.10.in-addr.arpa. not found: 3(NXDOMAIN)

The same commands and reverse lookups worked fine on a server still using the old standalone DNS servers. But why?

It's because of this, as found in the official documentation of unbound.conf:

The default zones are localhost, reverse 127.0.0.1 and ::1, the onion and the AS112 zones. The AS112 zones are reverse DNS zones for private use and reserved IP addresses for which the servers on the internet cannot provide correct answers. They are configured by default to give nxdomain (no reverse information) answers. The defaults can be turned off by specifying your own local-zone of that name, or using the 'nodefault' type.

So here we have the explanation: Unbound is configured by default to serve NXDOMAIN instead of serving the reverse DNS information.

In order to change that behaviour, the internal IP address ranges (see RFC1918) need to be defined as local-zones in the server: section of unbound.conf:

server:
        interface: 0.0.0.0
        interface-automatic: yes
        access-control: 10.0.0.0/16 allow
        access-control: 127.0.0.0/8 allow
        access-control: 172.16.0.0/12 allow
        access-control: 192.168.0.0/16 allow
        verbosity: 1
        domain-insecure: *
        root-hints: /var/lib/unbound/root.hints
        local-zone: "10.in-addr.arpa." nodefault
        local-zone: "16.172.in-addr.arpa." nodefault
        local-zone: "17.172.in-addr.arpa." nodefault
        local-zone: "18.172.in-addr.arpa." nodefault
        local-zone: "19.172.in-addr.arpa." nodefault
        local-zone: "20.172.in-addr.arpa." nodefault
        local-zone: "21.172.in-addr.arpa." nodefault
        local-zone: "22.172.in-addr.arpa." nodefault
        local-zone: "23.172.in-addr.arpa." nodefault
        local-zone: "24.172.in-addr.arpa." nodefault
        local-zone: "25.172.in-addr.arpa." nodefault
        local-zone: "26.172.in-addr.arpa." nodefault
        local-zone: "27.172.in-addr.arpa." nodefault
        local-zone: "28.172.in-addr.arpa." nodefault
        local-zone: "29.172.in-addr.arpa." nodefault
        local-zone: "30.172.in-addr.arpa." nodefault
        local-zone: "31.172.in-addr.arpa." nodefault
        local-zone: "168.192.in-addr.arpa." nodefault

forward-zone:
      name: "."
      forward-addr: domaincontroller1.example.com
      forward-addr: domaincontroller2.example.com
      forward-addr: domaincontroller3.example.com
      #forward-addr: 8.8.4.4        # Google
      #forward-addr: 8.8.8.8        # Google
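After adjusting the configuration, Unbound needs to pick it up. A quick way to validate the config, restart and verify (the verification assumes the remote control interface is enabled):

unbound-checkconf                 # validate the configuration first
service unbound restart
unbound-control list_local_zones  # the RFC1918 reverse zones should no longer show up as defaults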

Now that the RFC1918 networks are defined as local zones (and Unbound was restarted), I checked the reverse DNS lookup again on the same host using Unbound as DNS resolver:

$ host 192.168.253.153
153.253.168.192.in-addr.arpa domain name pointer olymp.localdomain.local.

$ host 10.161.206.153
153.206.161.10.in-addr.arpa domain name pointer domaincontroller1.example.com.


 

resolvconf is not always helpful, especially in a chroot rescue system
Wednesday - Feb 22nd 2017

Yesterday I needed to completely re-partition a server (going from 12 separate partitions on /dev/sda to a two-disk setup with LVM). In order to ensure data integrity, I booted from a live CD (Knoppix), mounted the old partitions inside /mnt and mounted the new (target) partitions in /mnt2.

The mounts inside /mnt2 looked like this:

/dev/sda1 /mnt2
/dev/mapper/vgsystem-lvvar /mnt2/var
/dev/mapper/vgsystem-lvtmp /mnt2/tmp
/dev/mapper/vgdata-lvdb /mnt2/db
/dev/mapper/vgdata-lvwww /mnt2/www

Once I synced /mnt to /mnt2, I mounted procfs, sysfs and dev into /mnt2:

mount --bind /dev /mnt2/dev
mount --bind /proc /mnt2/proc
mount --bind /sys /mnt2/sys

And then went into the "new system":

chroot /mnt2 /bin/bash

I adapted /etc/fstab to contain the new UUIDs of the real partitions and the logical volumes pointing to the correct mount points. I also knew I had to create a new initramfs because the system now uses LVM. But none of the lvm commands worked, because lvm2 was not installed. So I configured the network card and wanted to run apt-get install lvm2, but apt complained that DNS resolution didn't work.

I checked /etc/resolv.conf and it was a symlink:

/etc/resolv.conf -> ../run/resolvconf/resolv.conf

Of course ../run was empty. And even when I added it with yet another bind mount inside /mnt2, the resolvconf folder still didn't exist (/run was mounted from the Knoppix live image and resolvconf inside the chroot wasn't started).

Because my maintenance window was narrowing, I deleted the symlink and simply created a plain static /etc/resolv.conf with a nameserver entry. With that change I was finally able to run apt-get install lvm2. Once this package was installed, it automatically updated the initramfs. Well, thanks! Note: I would have run "update-initramfs -u" if apt hadn't done it automatically.
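The quick fix looked something like this (the nameserver IP is just an example):

rm /etc/resolv.conf                                # remove the dangling symlink
echo "nameserver 192.168.1.1" > /etc/resolv.conf   # create a plain static file
apt-get update && apt-get install lvm2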

The only thing left was to update the GRUB bootloader:

update-grub

This generated the new grub config file in /boot/grub/grub.cfg. Then the bootloader itself was installed on the new OS disk:

grub-install /dev/sdb

TL;DR: The resolvconf package may be helpful during a "normal" start of the OS, but when doing maintenance tasks as I did (in a rescue or repair mode), a classic static /etc/resolv.conf file does the job better because it does not depend on another script (resolvconf).

 

MySQL InnoDB: Operating system error number 13 in a file operation
Tuesday - Feb 21st 2017

Got an interesting MySQL error today, after I updated an Ubuntu 14.04 system with apt-get upgrade:

[...]
Setting up libwbclient0:amd64 (2:4.3.11+dfsg-0ubuntu0.14.04.4) ...
Setting up libdbd-mysql-perl (4.025-1ubuntu0.1) ...
Setting up mysql-client-core-5.5 (5.5.54-0ubuntu0.14.04.1) ...
Setting up mysql-client-5.5 (5.5.54-0ubuntu0.14.04.1) ...
Setting up mysql-server-core-5.5 (5.5.54-0ubuntu0.14.04.1) ...
Setting up mysql-server-5.5 (5.5.54-0ubuntu0.14.04.1) ...

Configuration file '/etc/apparmor.d/usr.sbin.mysqld'
 ==> Modified (by you or by a script) since installation.
 ==> Package distributor has shipped an updated version.
   What would you like to do about it ?  Your options are:
    Y or I  : install the package maintainer's version
    N or O  : keep your currently-installed version
      D     : show the differences between the versions
      Z     : start a shell to examine the situation
 The default action is to keep your current version.
*** usr.sbin.mysqld (Y/I/N/O/D/Z) [default=N] ? D

Configuration file '/etc/apparmor.d/usr.sbin.mysqld'
 ==> Modified (by you or by a script) since installation.
 ==> Package distributor has shipped an updated version.
   What would you like to do about it ?  Your options are:
    Y or I  : install the package maintainer's version
    N or O  : keep your currently-installed version
      D     : show the differences between the versions
      Z     : start a shell to examine the situation
 The default action is to keep your current version.
*** usr.sbin.mysqld (Y/I/N/O/D/Z) [default=N] ? I
Installing new version of config file /etc/apparmor.d/usr.sbin.mysqld ...
start: Job failed to start
invoke-rc.d: initscript mysql, action "start" failed.
dpkg: error processing package mysql-server-5.5 (--configure):
 subprocess installed post-installation script returned error exit status 1
dpkg: dependency problems prevent configuration of mysql-server:
 mysql-server depends on mysql-server-5.5; however:
  Package mysql-server-5.5 is not configured yet.

dpkg: error processing package mysql-server (--configure):
 dependency problems - leaving unconfigured
Setting up mysql-client (5.5.54-0ubuntu0.14.04.1) ...
No apport report written because the error message indicates it's a follow-up error from a previous failure.
[...]

So MySQL didn't start, which is why apt-get upgrade didn't finish successfully. In /var/log/mysql/error I saw the following errors:

170221 14:24:14 [Warning] Using unique option prefix myisam-recover instead of myisam-recover-options is deprecated and will be removed in a future release. Please use the full name instead.
170221 14:24:14 [Note] Plugin 'FEDERATED' is disabled.
/usr/sbin/mysqld: Can't find file: './mysql/plugin.frm' (errno: 13)
170221 14:24:14 [ERROR] Can't open the mysql.plugin table. Please run mysql_upgrade to create it.
170221 14:24:14 InnoDB: The InnoDB memory heap is disabled
170221 14:24:14 InnoDB: Mutexes and rw_locks use GCC atomic builtins
170221 14:24:14 InnoDB: Compressed tables use zlib 1.2.8
170221 14:24:14 InnoDB: Using Linux native AIO
170221 14:24:14 InnoDB: Initializing buffer pool, size = 128.0M
170221 14:24:14 InnoDB: Completed initialization of buffer pool
170221 14:24:14  InnoDB: Operating system error number 13 in a file operation.
InnoDB: The error means mysqld does not have the access rights to
InnoDB: the directory.
InnoDB: File name ./ibdata1
InnoDB: File operation call: 'open'.
InnoDB: Cannot continue operation.

dmesg revealed some more information:

[  634.595214] init: mysql respawning too fast, stopped
[  790.457596] audit_printk_skb: 69 callbacks suppressed
[  790.457599] audit: type=1400 audit(1487683611.187:119): apparmor="DENIED" operation="mknod" profile="/usr/sbin/mysqld" name="/db/mysql/app003.lower-test" pid=32765 comm="mysqld" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[  790.457620] audit: type=1400 audit(1487683611.187:120): apparmor="DENIED" operation="mknod" profile="/usr/sbin/mysqld" name="/db/mysql/app003.lower-test" pid=32765 comm="mysqld" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[  790.459437] audit: type=1400 audit(1487683611.187:121): apparmor="DENIED" operation="open" profile="/usr/sbin/mysqld" name="/db/mysql/mysql/db.frm" pid=302 comm="mysqld" requested_mask="r" denied_mask="r" fsuid=105 ouid=105
[  790.472090] audit: type=1400 audit(1487683611.199:122): apparmor="DENIED" operation="mknod" profile="/usr/sbin/mysqld" name="/db/mysql/app003.lower-test" pid=315 comm="mysqld" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[  790.472117] audit: type=1400 audit(1487683611.199:123): apparmor="DENIED" operation="mknod" profile="/usr/sbin/mysqld" name="/db/mysql/app003.lower-test" pid=315 comm="mysqld" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[  790.486818] audit: type=1400 audit(1487683611.215:124): apparmor="DENIED" operation="open" profile="/usr/sbin/mysqld" name="/db/mysql/ibdata1" pid=315 comm="mysqld" requested_mask="wr" denied_mask="wr" fsuid=105 ouid=105
[  790.610780] audit: type=1400 audit(1487683611.339:125): apparmor="DENIED" operation="mknod" profile="/usr/sbin/mysqld" name="/db/mysql/app003.lower-test" pid=333 comm="mysqld" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[  790.610820] audit: type=1400 audit(1487683611.339:126): apparmor="DENIED" operation="mknod" profile="/usr/sbin/mysqld" name="/db/mysql/app003.lower-test" pid=333 comm="mysqld" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[  790.624533] audit: type=1400 audit(1487683611.351:127): apparmor="DENIED" operation="open" profile="/usr/sbin/mysqld" name="/db/mysql/ibdata1" pid=333 comm="mysqld" requested_mask="wr" denied_mask="wr" fsuid=105 ouid=105
[  790.740241] audit: type=1400 audit(1487683611.467:128): apparmor="DENIED" operation="mknod" profile="/usr/sbin/mysqld" name="/db/mysql/app003.lower-test" pid=349 comm="mysqld" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[  791.175361] init: mysql main process (430) terminated with status 1
[  791.175369] init: mysql main process ended, respawning
[  792.060469] init: mysql post-start process (431) terminated with status 1
[  792.203859] init: mysql main process (475) terminated with status 1
[  792.203867] init: mysql main process ended, respawning
[  793.085085] init: mysql post-start process (476) terminated with status 1
[  793.237776] init: mysql main process (521) terminated with status 1
[  793.237793] init: mysql respawning too fast, stopped

Combine the errors from MySQL ("InnoDB: Operating system error number 13 in a file operation.") and from dmesg (apparmor="DENIED" operation="mknod" profile="/usr/sbin/mysqld") and you know where the problem is. During the system patching, a new version of the AppArmor rules file was installed as well. In most cases this is wanted, as the rule files themselves get patched and bugs get fixed.

But on this particular server this was not a wise choice, because - for a reason unknown to me - the admin who installed this system (with German as system language, ugh!) chose /db as path for the MySQL datadir. The Debian/Ubuntu default is of course /var/lib/mysql. Instead of setting /db as datadir and creating a new partition on /db, it would have been easier to keep the system defaults and just mount /var/lib/mysql as a separate partition. This would have kept the MySQL installation compatible with future updates.

Because I didn't want AppArmor to be running anyway, I stopped and removed it (the alternative would have been to edit the AppArmor rules file for the MySQL binary, see the sketch below). But even after that, MySQL was not able to start. I needed to reboot the server completely to get rid of the AppArmor security settings. After the reboot, MySQL started correctly again.
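For reference, that alternative would have meant allowing the custom datadir in the profile and reloading it. A sketch (assuming the datadir /db/mysql seen in the dmesg output above):

# lines to add in /etc/apparmor.d/usr.sbin.mysqld, next to the /var/lib/mysql/ rules:
  /db/mysql/ r,
  /db/mysql/** rwk,

# then reload the modified profile:
apparmor_parser -r /etc/apparmor.d/usr.sbin.mysqld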

 

UEFI boot does not like software raid at all (GRUB Error 17)
Monday - Feb 20th 2017

A couple of days ago I built a new home server for testing purposes with the following components:

  • Chassis: BitFenix Phenom Mini ITX
  • Power Supply: Enermax Revolution X't II 550W
  • Motherboard: Gigabyte F2A88XN-WIFI (although I didn't care about the WIFI, important to me was the FM2+ socket)
  • CPU: AMD A10-7860K
  • Memory: 2 x 8GB Corsair Vengeance Pro DDR3
  • Hard Drives: 2 x 500 GB Western Digital Caviar Blue (yes they're slow)

Just for documentation purposes, here's what the final result looks like:

 Debian Jessie Testserver Gigabyte F2A88XN-WIFI and AMD A10-7860K

For the setup of Debian Jessie, I created a bootable memory stick with the netinst image. Setup started smoothly (missing firmware was mentioned for the WIFI chip iwlwifi-7620-9/iwlwifi-7620-8 and for rtl_nic/rtl8168e-3, but the network card worked fine with the default drivers) and I went through the different steps. I set up my partitions like this:

  • HDD1: 3 partitions (20G, 25G, max size), each selected to be used as a RAID device
  • HDD2: 3 partitions (20G, 25G, max size), each selected to be used as a RAID device
  • Raid Device 1: /dev/sda1 + /dev/sdb1, RAID-1, ext4 defaults, mounted as /
  • Raid Device 2: /dev/sda2 + /dev/sdb2, RAID-1, Use as LVM physical volume
  • Raid Device 3: /dev/sda3 + /dev/sdb3, RAID-1, Use as LVM physical volume
  • PV /dev/md1 used for VG vgsystem. Created three LV's lvvar (mounted on /var), lvtmp (mounted on /tmp), and lvswap as swap partition
  • PV /dev/md2 used for VG vgdata.

I wanted to continue, but the Debian installer informed me that "You haven't set up an EFI partition".

Debian Installer missing EFI System Partition 

Oh jeez, right. That's a new motherboard which supports (U)EFI boot. I needed to read some docs on what exactly was meant by an EFI partition and how to set it up, but the Debian installer pretty much does the job automatically once the EFI System Partition (ESP) is selected. However, this meant I needed to destroy my partition layout and create a 500MB partition (250MB would probably be enough) for the ESP at the beginning of the disk.

The new partition layout with the UEFI partition looked like this:

  • HDD1: 4 partitions (500M, 20G, 25G, max size), each selected to be used as a RAID device
  • HDD2: 4 partitions (500M, 20G, 25G, max size), each selected to be used as a RAID device
  • Raid Device 1: /dev/sda1 + /dev/sdb1, RAID-1, Use as EFI System Partition
  • Raid Device 2: /dev/sda2 + /dev/sdb2, RAID-1, ext4 defaults, mounted as /
  • Raid Device 3: /dev/sda3 + /dev/sdb3, RAID-1, Use as LVM physical volume
  • Raid Device 4: /dev/sda4 + /dev/sdb4, RAID-1, Use as LVM physical volume
  • PV /dev/md2 used for VG vgsystem. Created three LV's lvvar (mounted on /var), lvtmp (mounted on /tmp), and lvswap as swap partition
  • PV /dev/md3 used for VG vgdata.

Debian Partitioning with EFI System Partition 

The Debian installer continued with the remaining steps of installing the base system and eventually finished. Time to reboot. But at that point I was in for a surprise - the system didn't boot. Right after the UEFI/BIOS of the motherboard, the following error was shown:

GRUB Loading stage1.5.
GRUB loading, please wait...
Error 17

GRUB Error 17 UEFI Boot

According to http://www.uruk.org/orig-grub/errors.html, Error 17 means:

17: Invalid device requested
 This error is returned if a device string is recognizable but does not fall under the other device errors.

For some reason it looks like the UEFI loader can't find the EFI System Partition.
Some additional research revealed the following very important information (from https://wiki.debian.org/UEFI): 

RAID for the EFI System Partition
This is arguably a mis-design in the UEFI specification - the ESP is a single point of failure on one disk. For systems with hardware RAID, that will provide some backup in case of disk failure. But for software RAID systems there is currently no support for putting the ESP on two separate disks in RAID. There might be a way to do something useful with fallback options, but this will need some investigation...

What the %"*+&@??!! Are you seriously telling me that UEFI, the Unified Extensible Firmware Interface that is 20 years newer than the classic BIOS, cannot boot from an EFI System Partition which sits on a software RAID? This is not "arguably a mis-design", this is clearly a no-go! Even for a test server, I don't want to invest any time in making the system boot again when an HDD fails. I could set up the partitions /dev/sda1 + /dev/sdb1 as normal ESP partitions and maybe run a cronjob to sync them manually, but that's a hack/workaround. I finally decided to ditch UEFI and switch to the Legacy (BIOS) boot mode:

Gigabyte UEFI BIOS set to legacy boot 

Of course I needed to adapt the partition layout once again; I went back to the layout from the beginning. I let Debian finish the installation and then rebooted. This time booting (with legacy BIOS mode and GRUB as bootloader) worked like a charm.

 

Unbound DNS server behind a VIP - solving reply from unexpected source
Friday - Feb 17th 2017

In my previous post I wrote about getting Unbound to work without an Internet connection (Get Unbound DNS lookups working in Ubuntu 16.04 Xenial, understand SERVFAIL). For the blog post I simplified the setup, but in reality the DNS architecture was built as a "cluster" of two Xenial hosts running Keepalived and sharing a virtual IP address (VIP).

After a couple of tests, I stumbled across the following problem:

$ dig -t A www.claudiokuenzler.com @192.168.10.10
;; reply from unexpected source: 192.168.10.11#53, expected 192.168.10.10#53
;; reply from unexpected source: 192.168.10.11#53, expected 192.168.10.10#53
;; reply from unexpected source: 192.168.10.11#53, expected 192.168.10.10#53

You've guessed it right: 192.168.10.10 is the VIP against which I ran the DNS lookup. But the DNS response came back from the active server's primary IP address 192.168.10.11. Dig correctly says "dude, something's not right" and doesn't accept the answer.

At first I thought I'd have to set up some weird masquerading NAT to always respond with the VIP, but that would cause problems on the second host, the VRRP BACKUP host. It would also prevent local lookups from working (unless I added yet another masquerading hack for localhost). Just from trying to write down my ideas for potential masquerading rules, you can guess that this would be nothing but a hack to make something work which should be simple and work out of the box.

Luckily I read through the manpage of unbound.conf again and I found something very interesting:

       interface-automatic:
              Detect source interface on UDP queries and copy them to replies.  This feature is experimental, and needs support in your OS for particular
              socket options.  Default value is no.

The description sounds pretty much like what I need, so I gave it a shot and added "interface-automatic: yes" to the config:

server:
        interface: 0.0.0.0
        interface-automatic: yes
        access-control: 10.0.0.0/16 allow
        access-control: 127.0.0.0/8 allow
        access-control: 172.16.0.0/12 allow
        access-control: 192.168.0.0/16 allow
        verbosity: 1
        domain-insecure: *
        root-hints: /var/lib/unbound/root.hints

After a restart of Unbound I tried it again - with success:

$ dig -t A www.claudiokuenzler.com @192.168.10.10

; <<>> DiG 9.9.5-3ubuntu0.11-Ubuntu <<>> -t A www.claudiokuenzler.com @192.168.10.10
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 32099
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.claudiokuenzler.com.       IN      A

;; ANSWER SECTION:
www.claudiokuenzler.com. 28636  IN      A       144.76.83.23

;; Query time: 4 msec
;; SERVER: 192.168.10.10#53(192.168.10.10)
;; WHEN: Fri Feb 17 11:14:33 CET 2017
;; MSG SIZE  rcvd: 68

And it still works on the primary IP of the cluster hosts, too:

$ dig -t A www.claudiokuenzler.com @192.168.10.11

; <<>> DiG 9.9.5-3ubuntu0.11-Ubuntu <<>> -t A www.claudiokuenzler.com @192.168.10.11
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62942
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.claudiokuenzler.com.       IN      A

;; ANSWER SECTION:
www.claudiokuenzler.com. 28493  IN      A       144.76.83.23

;; Query time: 0 msec
;; SERVER: 192.168.10.11#53(192.168.10.11)
;; WHEN: Fri Feb 17 11:16:54 CET 2017
;; MSG SIZE  rcvd: 68

Well done, Unbound! I'm positively surprised that this config option is available!

PS: I also tried to make Unbound listen on the VIP by setting "interface" to the virtual IP address in unbound.conf (instead of using all interfaces, 0.0.0.0). But as the VIP is only assigned once Unbound runs correctly (I added specific DNS checks in keepalived.conf), Unbound was not able to start and failed with:

unbound[32235]: [1487320777] unbound[32535:0] error: can't bind socket: Cannot assign requested address for 192.168.10.10
unbound[32235]: [1487320777] unbound[32535:0] fatal error: could not open ports
unbound[32235]:    ...fail!
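Had I insisted on binding Unbound directly to the VIP, Linux offers a sysctl which allows binding to a not-yet-assigned address. A possible workaround for the bind error (untested in this setup):

echo "net.ipv4.ip_nonlocal_bind = 1" >> /etc/sysctl.conf
sysctl -p   # Unbound can then bind the VIP even on the VRRP BACKUP host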

 

Get Unbound DNS lookups working in Ubuntu 16.04 Xenial, understand SERVFAIL
Thursday - Feb 16th 2017

For an internal DNS resolver/cacher, I decided to use Unbound. However, after installing unbound 1.5.8-1ubuntu1 and doing an initial configuration, DNS lookups didn't work.

The config:

server:
        interface: 0.0.0.0
        access-control: 10.0.0.0/16 allow
        access-control: 127.0.0.0/8 allow
        access-control: 172.16.0.0/12 allow
        access-control: 192.168.0.0/16 allow
        verbosity: 1
        root-hints: /var/lib/unbound/root.hints

forward-zone:
      name: "."
      forward-addr: domaincontroller1.example.com
      forward-addr: domaincontroller2.example.com
      forward-addr: domaincontroller3.example.com
      #forward-addr: 8.8.4.4        # Google
      #forward-addr: 8.8.8.8        # Google

DNS resolution didn't work. Unbound returned a SERVFAIL error:

# dig claudiokuenzler.com A @localhost

; <<>> DiG 9.10.3-P4-Ubuntu <<>> claudiokuenzler.com A @localhost
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 8470
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;claudiokuenzler.com.            IN    A

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Feb 16 07:04:55 CET 2017
;; MSG SIZE  rcvd: 37

When I restarted Unbound, I was puzzled by how slow the restart was, and eventually some errors showed up in syslog:

Feb 16 06:46:52 dnscacher systemd[1]: Stopping unbound.service...
Feb 16 06:46:52 dnscacher unbound[20218]:  * Stopping DNS server unbound
Feb 16 06:46:52 dnscacher unbound: [19904:0] info: service stopped (unbound 1.5.8).
Feb 16 06:46:52 dnscacher unbound: [19904:0] info: server stats for thread 0: 8 queries, 4 answers from cache, 4 recursions, 0 prefetch
Feb 16 06:46:52 dnscacher unbound: [19904:0] info: server stats for thread 0: requestlist max 1 avg 0.5 exceeded 0 jostled 0
Feb 16 06:46:52 dnscacher unbound: [19904:0] info: average recursion processing time 0.183547 sec
Feb 16 06:46:52 dnscacher unbound: [19904:0] info: histogram of recursion processing times
Feb 16 06:46:52 dnscacher unbound: [19904:0] info: [25%]=0.000384 median[50%]=0.000512 [75%]=0.393216
Feb 16 06:46:52 dnscacher unbound: [19904:0] info: lower(secs) upper(secs) recursions
Feb 16 06:46:52 dnscacher unbound: [19904:0] info:    0.000256    0.000512 2
Feb 16 06:46:52 dnscacher unbound: [19904:0] info:    0.262144    0.524288 2
Feb 16 06:46:52 dnscacher unbound[20218]:    ...done.
Feb 16 06:46:52 dnscacher systemd[1]: Stopped unbound.service.
Feb 16 06:46:52 dnscacher systemd[1]: Starting unbound.service...
Feb 16 06:46:52 dnscacher unbound[20247]:  * Starting DNS server unbound
Feb 16 06:47:42 dnscacher unbound-anchor: /var/lib/unbound/root.key has content
Feb 16 06:47:42 dnscacher unbound-anchor: fail: the anchor is NOT ok and could not be fixed
Feb 16 06:47:42 dnscacher unbound: [20339:0] notice: init module 0: validator
Feb 16 06:47:42 dnscacher unbound: [20339:0] notice: init module 1: iterator
Feb 16 06:47:42 dnscacher unbound: [20339:0] info: start of service (unbound 1.5.8).
Feb 16 06:47:42 dnscacher unbound: [20339:0] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN
Feb 16 06:47:42 dnscacher unbound: message repeated 5 times: [ [20339:0] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN]
Feb 16 06:47:42 dnscacher unbound[20247]:    ...done.
Feb 16 06:47:42 dnscacher systemd[1]: Started unbound.service.
Feb 16 06:47:42 dnscacher systemd[1]: Reached target Host and Network Name Lookups.

It turned out that the slow start is caused by the unbound-anchor verification, which can be triggered manually using the following command:

# time unbound-anchor -a /var/lib/unbound/root.key -v
/var/lib/unbound/root.key has content
fail: the anchor is NOT ok and could not be fixed

real    0m50.453s
user    0m0.004s
sys    0m0.012s

Here we have the 50 seconds which slowed down the start (see the timestamps in the syslog output above). Using tcpdump, I was able to determine what happens during the start (and what causes the slow startup):

# service unbound restart & 2>&1; tcpdump port 53
[1] 11942
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens160, link-type EN10MB (Ethernet), capture size 262144 bytes
10:58:07.788300 IP dnscacher.23444 > 192.5.5.241.53: 1053% [1au] NS? . (28)
10:58:08.164795 IP dnscacher.58715 > 192.112.36.4.53: 41519% [1au] NS? . (28)
10:58:08.541290 IP dnscacher.26205 > 192.112.36.4.53: 52241% [1au] NS? . (28)
10:58:08.917835 IP dnscacher.18709 > 199.7.91.13.53: 42065% [1au] NS? . (28)
10:58:09.294371 IP dnscacher.42646 > 199.7.91.13.53: 21034% [1au] NS? . (28)
10:58:09.670891 IP dnscacher.56481 > 198.97.190.53.53: 9754% [1au] NS? . (28)
10:58:10.047466 IP dnscacher.61621 > 198.97.190.53.53: 51384% [1au] NS? . (28)
10:58:10.424028 IP dnscacher.37874 > 192.5.5.241.53: 14983% [1au] NS? . (28)
10:58:11.176957 IP dnscacher.43299 > 192.5.5.241.53: 31294% [1au] NS? . (28)
[...] # shortened output

Unbound tried to connect to these and some more IP addresses. They look kinda familiar, don't they? Ah, of course, the DNS root servers, the spine of the Internet!

# grep 192.5.5.241 /var/lib/unbound/root.hints
F.ROOT-SERVERS.NET.      3600000      A     192.5.5.241

As this internal DNS resolver/cacher doesn't have Internet access (and therefore no DNS access to the root servers), this explains the slow start. But why wouldn't Unbound return any results at all? The connectivity to the internal Active Directory controllers (domaincontroller1-3) works.
Here I stumbled across Debian bug report #699641:

"you can set the "domain-insecure" option on your local domains in order to prevent DNSSEC failures from impacting the resolution of those domains"

So this is the actual workaround (disable DNSSEC), but it doesn't explain the reason. There are two reasons why Unbound was unable to return results for the resolved domains:

1. Missing Internet connection

The error in syslog gives a good hint:

Feb 16 06:47:42 dnscacher unbound: [20339:0] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN
Feb 16 06:47:42 dnscacher unbound: message repeated 5 times: [ [20339:0] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN]

As tcpdump revealed, Unbound tried to connect to all the DNS root servers. This happened not just for fun: Unbound tried to verify the DNSKEY in order to enable DNSSEC. As the Internet connection is non-existent on this host, the verification couldn't work.

2. Unbound runs in DNSSEC mode

Unbound runs in DNSSEC mode by default, at least with the base installation on Ubuntu. The failed DNSKEY validation causes Unbound to stop answering: because DNSSEC validation is enabled, even non-DNSSEC requests return a SERVFAIL status.
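A quick way to confirm that validation (and not the resolution itself) is the culprit: repeat the query with the CD (checking disabled) bit set. If this returns an answer while the normal query ends in SERVFAIL, DNSSEC validation is to blame:

# dig +cd -t A claudiokuenzler.com @localhost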

Solution: Either make sure the cacher host has Internet access (at least to the DNS Root servers) or disable DNSSEC completely.

As this is a DNS resolver and cacher for internal networks only, I didn't see the necessity for DNSSEC and disabled it. In the Ubuntu setup, two changes are necessary to disable DNSSEC:

First, add the domain-insecure option in unbound.conf:

server:
        interface: 0.0.0.0
        access-control: 10.0.0.0/16 allow
        access-control: 127.0.0.0/8 allow
        access-control: 172.16.0.0/12 allow
        access-control: 192.168.0.0/16 allow
        verbosity: 1
        root-hints: /var/lib/unbound/root.hints
        domain-insecure: *

But that's not enough. There's also a special config file active (/etc/unbound/unbound.conf.d/root-auto-trust-anchor-file.conf). In this file, I commented out the auto-trust-anchor-file option:

# cat /etc/unbound/unbound.conf.d/root-auto-trust-anchor-file.conf
server:
    # The following line will configure unbound to perform cryptographic
    # DNSSEC validation using the root trust anchor.
    #auto-trust-anchor-file: "/var/lib/unbound/root.key"

Then another restart of Unbound:

# service unbound restart & 2>&1; tail -f /var/log/syslog
Feb 16 07:11:11 dnscacher unbound[23185]:  * Starting DNS server unbound
Feb 16 07:12:01 dnscacher unbound-anchor: /var/lib/unbound/root.key has content
Feb 16 07:12:01 dnscacher unbound-anchor: fail: the anchor is NOT ok and could not be fixed
Feb 16 07:12:01 dnscacher unbound: [23275:0] warning: did not exit gracefully last time (22816)
Feb 16 07:12:01 dnscacher unbound: [23276:0] notice: init module 0: validator
Feb 16 07:12:01 dnscacher unbound: [23276:0] notice: init module 1: iterator
Feb 16 07:12:01 dnscacher unbound: [23276:0] info: start of service (unbound 1.5.8).
Feb 16 07:12:01 dnscacher unbound[23185]:    ...done.
Feb 16 07:12:01 dnscacher systemd[1]: Started unbound.service.
Feb 16 07:12:01 dnscacher systemd[1]: Reached target Host and Network Name Lookups.

Yes, it still takes 50 seconds to start due to the missing Internet connection, but DNS resolution now works:

# dig -t A www.claudiokuenzler.com @localhost

; <<>> DiG 9.10.3-P4-Ubuntu <<>> -t A www.claudiokuenzler.com @localhost
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 15474
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.claudiokuenzler.com.    IN    A

;; ANSWER SECTION:
www.claudiokuenzler.com. 38400    IN    A    144.76.83.23

;; Query time: 47 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Feb 16 07:12:33 CET 2017
;; MSG SIZE  rcvd: 68

And the already queried DNS record can now be found in Unbound's cache:

# unbound-control dump_cache | grep claudiokuenzler
www.claudiokuenzler.com.    23458    IN    A    144.76.83.23
msg www.claudiokuenzler.com. IN A 33152 1 23458 2 1 0 0
www.claudiokuenzler.com. IN A 0

Note: Even though I disabled DNSSEC in this case, it's still wise to open the firewall for this host to access the DNS root servers. Not only to avoid the slow start or to keep DNSSEC as a possibility, but rather because this is the design of DNS: the root servers should be queried if the forwarders return no results.

Update February 17th 2017: As I mentioned above, a DNS connection to the root servers should work if possible. The missing firewall rules have now been created, and a restart is as quick as expected:

# time service unbound restart & 2>&1; tail -f /var/log/syslog
Feb 17 11:37:15 dnscacher systemd[1]: Stopping unbound.service...
Feb 17 11:37:15 dnscacher unbound[12066]:  * Stopping DNS server unbound
Feb 17 11:37:15 dnscacher unbound: [4569:0] info: service stopped (unbound 1.5.8).
Feb 17 11:37:15 dnscacher unbound: [4569:0] info: server stats for thread 0: 766 queries, 763 answers from cache, 3 recursions, 0 prefetch
Feb 17 11:37:15 dnscacher unbound: [4569:0] info: server stats for thread 0: requestlist max 0 avg 0 exceeded 0 jostled 0
Feb 17 11:37:15 dnscacher unbound: [4569:0] info: average recursion processing time 0.000464 sec
Feb 17 11:37:15 dnscacher unbound: [4569:0] info: histogram of recursion processing times
Feb 17 11:37:15 dnscacher unbound: [4569:0] info: [25%]=0 median[50%]=0 [75%]=0
Feb 17 11:37:15 dnscacher unbound: [4569:0] info: lower(secs) upper(secs) recursions
Feb 17 11:37:15 dnscacher unbound: [4569:0] info:    0.000256    0.000512 3
Feb 17 11:37:15 dnscacher unbound[12066]:    ...done.
Feb 17 11:37:15 dnscacher systemd[1]: Stopped unbound.service.
Feb 17 11:37:15 dnscacher systemd[1]: Starting unbound.service...
Feb 17 11:37:15 dnscacher unbound[12096]:  * Starting DNS server unbound
Feb 17 11:37:16 dnscacher unbound-anchor: /var/lib/unbound/root.key has content
Feb 17 11:37:16 dnscacher unbound-anchor: success: the anchor is ok
Feb 17 11:37:16 dnscacher unbound: [12112:0] notice: init module 0: validator
Feb 17 11:37:16 dnscacher unbound: [12112:0] notice: init module 1: iterator
Feb 17 11:37:16 dnscacher unbound: [12112:0] info: start of service (unbound 1.5.8).
Feb 17 11:37:16 dnscacher unbound[12096]:    ...done.
Feb 17 11:37:16 dnscacher systemd[1]: Started unbound.service.

real    0m0.152s
user    0m0.000s
sys     0m0.000s
Feb 17 11:37:16 dnscacher systemd[1]: Reached target Host and Network Name Lookups.

And as expected, the DNSKEY validation is now working, too:

Feb 17 11:37:16 dnscacher unbound-anchor: /var/lib/unbound/root.key has content
Feb 17 11:37:16 dnscacher unbound-anchor: success: the anchor is ok

 

The curious case of curl, SSL SNI and the HTTP Host header
Monday - Feb 13th 2017

As the digital age gets older, encryption of HTTP traffic is becoming the new standard. Encrypted HTTP connections will one day even be a requirement for serving web pages in Google's Chrome browser:

Eventually, we plan to label all HTTP pages as non-secure, and change the HTTP security indicator to the red triangle that we use for broken HTTPS.

Back in the dark ages of the Internet (= the 90's), every SSL listener required a dedicated IP address. Luckily, nowadays there is Server Name Indication (SNI) support. SNI allows running multiple SSL/TLS certificates on the same IP address, much like it has long been possible to serve multiple domains on the same IP address (virtual hosts). SNI has been broadly supported in operating systems and browsers since 2006. Still using a client older than that? Your own bad luck (it's 2017, get over it).

However, not all tools are SNI-aware yet. I came across this myself when I wanted to check the response headers of a domain with an SNI certificate using the "curl" command:

$ curl -H "Host: testsite.example.net" https://lb.example.com -v -I
* Rebuilt URL to: https://lb.example.com/
* Hostname was NOT found in DNS cache
*   Trying 192.168.14.100...
* Connected to lb.example.com (192.168.14.100) port 443 (#0)
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Server key exchange (12):
* SSLv3, TLS handshake, Server finished (14):
* SSLv3, TLS handshake, Client key exchange (16):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSL connection using ECDHE-RSA-AES256-GCM-SHA384
* Server certificate:
*      subject: OU=Domain Control Validated; OU=Gandi Standard Wildcard SSL; CN=*.example.ch
*      start date: 2016-10-05 00:00:00 GMT
*      expire date: 2019-10-05 23:59:59 GMT
*      subjectAltName does not match lb.example.com
* SSL: no alternative certificate subject name matches target host name 'lb.example.com'
* Closing connection 0
* SSLv3, TLS alert, Client hello (1):
curl: (51) SSL: no alternative certificate subject name matches target host name 'lb.example.com'

The returned server certificate is a wildcard certificate for *.example.ch which has nothing to do with...

a) the requested HTTP Host Header testsite.example.net

b) the requested server name lb.example.com 

Of course curl then exits with a warning that no matching certificate could be found for the host name. And there's actually a (more or less) simple explanation (found on http://superuser.com/questions/793600/curl-and-sni-enabled-server):

"SNI sends the hostname inside the TLS handshake (ClientHello). The server then chooses the correct certificate based on this information. Only after the TLS connection is successfully established it will send the HTTP-Request, which contains the Host header you specified."

Update February 24th 2017: I have since found the best technical yet understandable explanation on the Apache wiki (https://wiki.apache.org/httpd/NameBasedSSLVHosts and https://wiki.apache.org/httpd/NameBasedSSLVHostsWithSNI):

"Apache needs to know the name of the host in order to choose the correct certificate to setup the encryption layer. But the name of the host being requested is contained only in the HTTP request headers, which are part of the encrypted content. It is therefore not available until after the encryption is already negotiated. This means that the correct certificate cannot be selected"

"The solution is an extension to the SSL protocol called Server Name Indication (RFC 4366), which allows the client to include the requested hostname in the first message of its SSL handshake (connection setup). This allows the server to determine the correct named virtual host for the request and set the connection up accordingly from the start."

Of course this explanation applies to all web servers, not only Apache.

So using the "HTTP Host Header" is not a correct way to get to the SNI certificate. One would need a curl parameter to define the "SNI Hostname". Something like this, a parameter called --sni-hostname, was actually requested in issue #607 on curl's Github repository. The issue itself was closed and as a workaround the --resolve parameter was mentioned. And this indeed does the job:

$ curl --resolve testsite.example.net:443:lb.example.com https://testsite.example.net -v -I
* Resolve testsite.example.net:443:lb.example.com found illegal!
* Rebuilt URL to: https://testsite.example.net/
* Hostname was NOT found in DNS cache
*   Trying 194.40.217.90...
* Connected to testsite.example.net (194.40.217.90) port 443 (#0)
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Server key exchange (12):
* SSLv3, TLS handshake, Server finished (14):
* SSLv3, TLS handshake, Client key exchange (16):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSL connection using ECDHE-RSA-AES256-GCM-SHA384
* Server certificate:
*      subject: OU=Domain Control Validated; OU=Gandi Standard Wildcard SSL; CN=*.example.net
*      start date: 2016-10-05 00:00:00 GMT
*      expire date: 2019-10-05 23:59:59 GMT
*      subjectAltName: testsite.example.net matched
*      issuer: C=FR; ST=Paris; L=Paris; O=Gandi; CN=Gandi Standard SSL CA 2
*      SSL certificate verify ok.
> HEAD / HTTP/1.1
> User-Agent: curl/7.35.0
> Host: testsite.example.net
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
[...]

This tells curl to force DNS resolution of the requested "testsite.example.net" to lb.example.com. The correct certificate is now used.

The curl command becomes rather complicated, agreed. But there's also another issue, #614, which takes the SNI problem one step further with a new feature: the --connect-to parameter. This new parameter was officially added in curl version 7.49.0.

Fixed in 7.49.0 - May 18 2016

Changes:

    schannel: Add ALPN support
    SSH: support CURLINFO_FILETIME
    SSH: new CURLOPT_QUOTE command "statvfs"
    wolfssl: Add ALPN support
    http2: added --http2-prior-knowledge
    http2: added CURL_HTTP_VERSION_2_PRIOR_KNOWLEDGE
    libcurl: added CURLOPT_CONNECT_TO
    curl: added --connect-to
    libcurl: added CURLOPT_TCP_FASTOPEN
    curl: added --tcp-fastopen
    curl: remove support for --ftpport, -http-request and --socks

However, my curl version is still 7.35.0 and I was not in the mood to recompile it manually.
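With curl >= 7.49.0, the --connect-to variant should look something like this (syntax HOST1:PORT1:HOST2:PORT2; a sketch, untested here due to my older curl):

$ curl --connect-to testsite.example.net:443:lb.example.com:443 https://testsite.example.net -v -I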

Besides the workaround of using curl with the --resolve parameter (and, in the future, the --connect-to parameter), there's also the option of checking the certificate using ... *drumrolls* ... openssl!
Yes, openssl itself can of course be used to check the SNI certificate, using the -servername parameter:

$ openssl s_client -connect lb.example.com:443 -servername testsite.example.net
CONNECTED(00000003)
depth=2 C = US, ST = New Jersey, L = Jersey City, O = The USERTRUST Network, CN = USERTrust RSA Certification Authority
verify error:num=20:unable to get local issuer certificate
verify return:0
---
Certificate chain
 0 s:/OU=Domain Control Validated/OU=Gandi Standard Wildcard SSL/CN=*.example.net
   i:/C=FR/ST=Paris/L=Paris/O=Gandi/CN=Gandi Standard SSL CA 2
 1 s:/C=FR/ST=Paris/L=Paris/O=Gandi/CN=Gandi Standard SSL CA 2
   i:/C=US/ST=New Jersey/L=Jersey City/O=The USERTRUST Network/CN=USERTrust RSA Certification Authority
 2 s:/C=US/ST=New Jersey/L=Jersey City/O=The USERTRUST Network/CN=USERTrust RSA Certification Authority
   i:/C=SE/O=AddTrust AB/OU=AddTrust External TTP Network/CN=AddTrust External CA Root
---
Server certificate
-----BEGIN CERTIFICATE-----
[...]

As you can see, the correct wildcard certificate *.example.net was returned in the certificate chain. 
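To display just the certificate subject and validity dates instead of the whole chain, the server certificate can be piped into openssl x509:

$ openssl s_client -connect lb.example.com:443 -servername testsite.example.net </dev/null 2>/dev/null | openssl x509 -noout -subject -dates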

Reminder to myself: Don't let the curl output fool you.

 

Using check_netio monitoring plugin on CentOS 7 or RHEL 7
Friday - Jan 27th 2017

For years I've been using the monitoring plugin check_netio to get statistics of the network interfaces on Linux. Purely for graphing purposes, as the plugin doesn't do a "classical check". An example of such a graph can be seen here:

check_netio graph 

But since RHEL 7 and CentOS 7, the plugin hasn't returned any values anymore. The reason lies in the changed output of the ifconfig command:

ifconfig on CentOS 6 and RHEL6 returns:

[root@centos6 ~]# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:50:56:B5:7A:26 
          inet addr:10.162.216.65  Bcast:10.162.216.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:124417619 errors:0 dropped:0 overruns:0 frame:0
          TX packets:45443808 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:824234857174 (767.6 GiB)  TX bytes:59269515633 (55.1 GiB)

ifconfig on CentOS 7 and RHEL7 returns:

[root@centos7 ~]# ifconfig ens192
ens192: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.162.214.132  netmask 255.255.255.0  broadcast 10.162.214.255
        inet6 fe80::fc3c:fc22:c796:54e1  prefixlen 64  scopeid 0x20<link>
        ether 00:50:56:8d:19:be  txqueuelen 1000  (Ethernet)
        RX packets 170715  bytes 56526535 (53.9 MiB)
        RX errors 0  dropped 38  overruns 0  frame 0
        TX packets 36453  bytes 7216222 (6.8 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

As the plugin uses grep 'bytes' and then cut to get the current number of bytes, you can see the problem by comparing the lines:

        RX bytes:824234857174 (767.6 GiB)  TX bytes:59269515633 (55.1 GiB)

vs.

        RX packets 170715  bytes 56526535 (53.9 MiB)
        TX packets 36453  bytes 7216222 (6.8 MiB)
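For illustration, the byte counters in the new format can be extracted with awk instead of grep/cut. A sketch (not the exact code used in the plugin):

# the byte counters are the 5th field on the "RX packets"/"TX packets" lines
RX=$(ifconfig ens192 | awk '/RX packets/ {print $5}')
TX=$(ifconfig ens192 | awk '/TX packets/ {print $5}')
echo "RX=${RX}, TX=${TX}"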

In order to use the plugin on newer CentOS 7 and RHEL 7 installations too, I adapted the plugin. Unfortunately I couldn't find a source repository to contribute to, so I created a new repository on my Github account (see my repository check_netio).

Besides fixing the CentOS/RHEL 7 case, I also added a simple check whether the given network interface even exists on the system. In the original plugin, a non-existing interface simply caused the plugin to return OK with no values.

[root@rhel7 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.3 (Maipo)

[root@rhel7 ~]# /usr/lib/nagios/plugins/check_netio -i ens192
NETIO OK - ens192: RX=56936528, TX=7374799|NET_ens192_RX=56936528B;;;; NET_ens192_TX=7374799B;;;;

[root@rhel7 ~]# /usr/lib/nagios/plugins/check_netio -i ens999
NETIO UNKNOWN - No interface ens999 found

 

Custom partitioning on new Red Hat Enterprise Linux 7 setup
Thursday - Jan 26th 2017

Unbelievable how annoying the RHEL 7 installation wizard is! Try to set up manual partitioning with a mix of primary partitions and logical volumes - and RHEL will not boot. Even worse: although I created the root partition first, the wizard somehow switched it with the swap partition, putting swap on /dev/sda1 instead of /dev/sda2 (where I wanted it to be).

Solution to this: Boot from Knoppix (yes, seriously) and manually create the partitions with parted (see the sketch below). Once this was done, I rebooted from the RHEL image. On the "Installation Destination" submenu I selected "I will configure partitioning" and then clicked on the blue Done button.
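The parted part looked something like this (a sketch only; device name, partition sizes and filesystem types are assumptions):

parted -s /dev/sda mklabel msdos
parted -s /dev/sda mkpart primary ext4 1MiB 20GiB          # /dev/sda1: /
parted -s /dev/sda mkpart primary linux-swap 20GiB 24GiB   # /dev/sda2: swap
parted -s /dev/sda mkpart primary ext4 24GiB 100%          # /dev/sda3: rest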

RHEL 7 Setup Installation Destination 

Inside the "Manual Partitioning" window I clicked on the rescan button and the disk was rescanned. The already created partitions were shown under a section called "unknown". To tell the RHEL7 installer to use these partitions, I had to select each partition and enter the mount point and click on "Reformat". The partitions were then finally shown as I wanted them in the installer:

RHEL 7 Setup custom partitioning 

This is surely the worst partitioning wizard I have seen in a recent Linux distribution (and I installed new machines with Ubuntu 16.04, SLES 12, CentOS 7, Debian Jessie and Linux Mint 17.3 in the past months).

 

