
Custom HTTP headers not showing up on 404 response status
Monday - Jan 23rd 2017 - by - (0 comments)

One of the OWASP recommendations for SSL/TLS connections is to enable HSTS (HTTP Strict Transport Security). OWASP describes HSTS as:

HTTP Strict Transport Security (HSTS) is an opt-in security enhancement that is specified by a web application through the use of a special response header. Once a supported browser receives this header that browser will prevent any communications from being sent over HTTP to the specified domain and will instead send all communications over HTTPS. It also prevents HTTPS click through prompts on browsers. 

The HTTP header in question is called "Strict-Transport-Security", which I of course added to my nginx config:

  add_header Strict-Transport-Security max-age=2678400;

I've been using this header since 2014, so I was pretty surprised when a developer contacted me today and mentioned that the header doesn't appear on a 404 status page. He was right:

$ curl https://www.claudiokuenzler.com -I
HTTP/1.1 200 OK
Server: nginx
Date: Mon, 23 Jan 2017 14:31:11 GMT
Content-Type: text/html
Connection: keep-alive
X-Powered-By: PHP/5.4.19
Vary: Accept-Encoding
Strict-Transport-Security: max-age=2678400

$ curl https://www.claudiokuenzler.com/i-do-not-exist/ -I
HTTP/1.1 404 Not Found
Server: nginx
Date: Mon, 23 Jan 2017 14:31:39 GMT
Content-Type: text/html; charset=iso-8859-1
Connection: keep-alive
Vary: Accept-Encoding

Frankly, I couldn't explain it at first, but re-reading the nginx documentation of add_header made it all clear:

Adds the specified field to a response header provided that the response code equals 200, 201, 204, 206, 301, 302, 303, 304, or 307.

Oops, 404 is not mentioned there. The documentation provides the solution to this, too:

 If the always parameter is specified (1.7.5), the header field will be added regardless of the response code.

So if your nginx is recent enough (at least 1.7.5) you can simply add the "always" parameter like this:

  add_header Strict-Transport-Security max-age=2678400 always;

And the HTTP header is added no matter what HTTP status is returned.
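
To verify the fix, the non-existing page from the example above can be requested again; the header should now show up on the 404 as well (a quick check, adjust the URL to your own site):

$ curl -sI https://www.claudiokuenzler.com/i-do-not-exist/ | grep -i Strict-Transport-Security
Strict-Transport-Security: max-age=2678400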

 

Rancher: Error response from daemon /usr/bin/dockerd (deleted)
Monday - Jan 23rd 2017 - by - (0 comments)

In a recent post (The Docker Dilemma: Benefits and risks going into production with Docker) I mentioned we're going forward with Rancher as an orchestration layer on top of Docker. 

Since last Friday (Jan 20 2017) there have been sporadic error messages in the user interface when new containers were being started:

(Expected state running but got error: Error response from daemon: oci runtime error: process_linux.go:330: running prestart hook 0 caused "fork/exec /usr/bin/dockerd (deleted): no such file or directory: ")

This got me puzzled as the file clearly exists on that particular Docker host:

root@dockerserver:~# ll /usr/bin/dockerd
-rwxr-xr-x 1 root root 39063824 Jan 13 19:44 /usr/bin/dockerd

It turned out that I had run an Ansible playbook on Friday afternoon which was written for these Rancher/Docker hosts. One of its tasks is to ensure that the docker.io package is installed:

  - name: 1.0 - Install docker.io
    apt: name={{item}} state=latest
    with_items:
    - docker.io

The apt logs confirmed my guess:

Start-Date: 2017-01-20  15:04:06
Commandline: /usr/bin/apt-get -y -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold install docker.io
Requested-By: ansible (1001)
Upgrade: runc:amd64 (1.0.0~rc1-0ubuntu1~16.04, 1.0.0~rc1-0ubuntu2~16.04.1), docker.io:amd64 (1.12.1-0ubuntu13~16.04.1, 1.12.3-0ubuntu4~16.04.2)
End-Date: 2017-01-20  15:04:10

The docker.io package was updated from 1.12.1-0ubuntu13~16.04.1 to 1.12.3-0ubuntu4~16.04.2, but the Docker daemon was somehow not fully restarted. This is what caused the error message in Rancher: some stale or deleted files were probably still held open by the old Docker daemon process.
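
A quick way to spot such a situation is to check whether the running daemon still executes the old, replaced binary; the /proc filesystem marks it as deleted (a small sketch, assuming a single dockerd process on the host):

# Does the running dockerd still run from a binary that was replaced on disk?
ls -l /proc/$(pgrep -o dockerd)/exe
# A target like "/usr/bin/dockerd (deleted)" means the daemon was not restarted after the upgrade.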

Syslog shows in more detail what exactly happened at 15:04:

Jan 20 15:04:03 dockerserver systemd[1]: Created slice User Slice of ansible.
Jan 20 15:04:03 dockerserver systemd[1]: Starting User Manager for UID 1001...
Jan 20 15:04:03 dockerserver systemd[1]: Started Session 14956 of user ansible.
Jan 20 15:04:03 dockerserver systemd[18545]: Reached target Sockets.
Jan 20 15:04:03 dockerserver systemd[18545]: Reached target Timers.
Jan 20 15:04:03 dockerserver systemd[18545]: Reached target Paths.
Jan 20 15:04:03 dockerserver systemd[18545]: Reached target Basic System.
Jan 20 15:04:03 dockerserver systemd[18545]: Reached target Default.
Jan 20 15:04:03 dockerserver systemd[18545]: Startup finished in 22ms.
Jan 20 15:04:03 dockerserver systemd[1]: Started User Manager for UID 1001.
Jan 20 15:04:08 dockerserver systemd[1]: Reloading.
Jan 20 15:04:09 dockerserver systemd[1]: apt-daily.timer: Adding 37min 2.565881s random time.
Jan 20 15:04:09 dockerserver systemd[1]: Started ACPI event daemon.
Jan 20 15:04:09 dockerserver systemd[1]: Reloading.
Jan 20 15:04:09 dockerserver systemd[1]: apt-daily.timer: Adding 5h 27min 32.130480s random time.
Jan 20 15:04:09 dockerserver systemd[1]: Started ACPI event daemon.
Jan 20 15:04:10 dockerserver systemd[1]: Reloading.
Jan 20 15:04:10 dockerserver systemd[1]: apt-daily.timer: Adding 11h 55min 14.530665s random time.
Jan 20 15:04:10 dockerserver systemd[1]: Started ACPI event daemon.
Jan 20 15:04:10 dockerserver systemd[1]: Reloading.
Jan 20 15:04:10 dockerserver systemd[1]: apt-daily.timer: Adding 11h 16min 17.514645s random time.
Jan 20 15:04:10 dockerserver systemd[1]: Started ACPI event daemon.
Jan 20 15:04:10 dockerserver systemd[1]: Started Docker Application Container Engine.
Jan 20 15:04:10 dockerserver systemd[1]: Reloading.
Jan 20 15:04:10 dockerserver systemd[1]: apt-daily.timer: Adding 11h 42min 36.963745s random time.
Jan 20 15:04:10 dockerserver systemd[1]: Started ACPI event daemon.

So we see that Docker was started (Jan 20 15:04:10 dockerserver systemd[1]: Started Docker Application Container Engine.) but where's the stop prior to that?

To me it looks like Docker was started twice, running as two parallel processes, which caused the sporadic error messages. A manual restart of the Docker daemon solved the problem.

If I launch a manual restart of the Docker service, this results in the following log entries:

Jan 23 09:54:38 dockerserver systemd[1]: Stopping Docker Application Container Engine...
Jan 23 09:54:50 dockerserver systemd[1]: Stopped Docker Application Container Engine.
Jan 23 09:54:50 dockerserver systemd[1]: Closed Docker Socket for the API.
Jan 23 09:54:50 dockerserver systemd[1]: Stopping Docker Socket for the API.
Jan 23 09:54:50 dockerserver systemd[1]: Starting Docker Socket for the API.
Jan 23 09:54:50 dockerserver systemd[1]: Listening on Docker Socket for the API.
Jan 23 09:54:50 dockerserver systemd[1]: Starting Docker Application Container Engine...
Jan 23 09:54:54 dockerserver dockerd[26724]: time="2017-01-23T09:54:54.088301837+01:00" level=info msg="Docker daemon" commit=6b644ec graphdriver=aufs version=1.12.3
Jan 23 09:54:54 dockerserver systemd[1]: Started Docker Application Container Engine.

The question remains why the update process itself didn't restart the Docker service (as updates of other packages like MySQL or Apache do). A closer look at the Debian source package shows that docker.io.preinst doesn't exist. In other packages (for example mysql-server) this file is responsible for stopping the service prior to the package update.
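
The maintainer scripts an installed package ships can be listed directly on the host, which makes the difference to a service-restarting package visible (a hedged sketch; mysql-server-5.7 is just an example package name on this Ubuntu 16.04 host):

# Maintainer scripts of the installed docker.io package (no preinst present)
ls -1 /var/lib/dpkg/info/docker.io.*
# Compare with a package that stops/restarts its service during upgrades
ls -1 /var/lib/dpkg/info/mysql-server-5.7.*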

 

grub-install: warning: Couldn't find physical volume
Monday - Jan 23rd 2017 - by - (0 comments)

I recently installed a new HP Microserver Gen8 with Debian Jessie and a software RAID-1 (because the integrated B120i controller is not a "real" RAID controller and is not supported by Linux).

After the initial installation, which installed grub2 only on the first disk, I also wanted to install grub2 on the second drive (sdb) so the server can boot from either disk in case one fails. So I ran grub-install on sdb:

# grub-install /dev/sdb

Then I shut down the server, removed the first drive and booted. But nothing. The BIOS skipped booting from the local hard drive and moved on in the boot order to PXE (network boot). So no boot loader was found on the remaining drive.

I booted the server again with both drives active and ran dpkg-reconfigure grub-pc. Some warnings showed up:

# dpkg-reconfigure grub-pc
Replacing config file /etc/default/grub with new version
Installing for i386-pc platform.
grub-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
Installing for i386-pc platform.
grub-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.
Generating grub configuration file ...
/usr/sbin/grub-probe: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Found linux image: /boot/vmlinuz-3.16.0-4-amd64
Found initrd image: /boot/initrd.img-3.16.0-4-amd64
/usr/sbin/grub-probe: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
done

I wondered about those warnings but carried on, since the messages said that the installation finished and no errors were reported. But my boot test with only the second drive active failed again.

Back into the system with both drives, I checked out the raid status and found this:

# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sda6[0]
      470674432 blocks super 1.2 [2/1] [U_]
      bitmap: 2/4 pages [8KB], 65536KB chunk

md2 : active raid1 sda5[0]
      3903488 blocks super 1.2 [2/1] [U_]

md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      3904512 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0]
      9756672 blocks super 1.2 [2/1] [U_]

unused devices: <none>

Now it all made sense. The second drive was not part of the arrays at all (I assume I rebooted the system too quickly after the installation finished, so mdadm didn't have enough time to finish the RAID build, which caused this problem). Hence the warning "Couldn't find physical volume".

I manually rebuilt the raid with mdadm commands and waited until the raid recovery finished:

# mdadm --add /dev/md0 /dev/sdb1
mdadm: added /dev/sdb1

# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sda6[0]
      470674432 blocks super 1.2 [2/1] [U_]
      bitmap: 2/4 pages [8KB], 65536KB chunk

md2 : active raid1 sda5[0]
      3903488 blocks super 1.2 [2/1] [U_]

md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      3904512 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sdb1[2] sda1[0]
      9756672 blocks super 1.2 [2/1] [U_]
      [===>.................]  recovery = 15.5% (1521792/9756672) finish=0.7min speed=190224K/sec

unused devices: <none>

# mdadm --add /dev/md2 /dev/sdb5
mdadm: added /dev/sdb5

# mdadm --add /dev/md3 /dev/sdb6
mdadm: re-added /dev/sdb6

# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sdb6[1] sda6[0]
      470674432 blocks super 1.2 [2/2] [UU]
      bitmap: 0/4 pages [0KB], 65536KB chunk

md2 : active raid1 sdb5[2] sda5[0]
      3903488 blocks super 1.2 [2/1] [U_]
      [=============>.......]  recovery = 67.8% (2649664/3903488) finish=0.1min speed=176644K/sec

md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      3904512 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sdb1[2] sda1[0]
      9756672 blocks super 1.2 [2/2] [UU]

unused devices: <none>

# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sdb6[1] sda6[0]
      470674432 blocks super 1.2 [2/2] [UU]
      bitmap: 0/4 pages [0KB], 65536KB chunk

md2 : active raid1 sdb5[2] sda5[0]
      3903488 blocks super 1.2 [2/2] [UU]

md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      3904512 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sdb1[2] sda1[0]
      9756672 blocks super 1.2 [2/2] [UU]

Now that the RAID-1 arrays were recovered and both drives are seen by the OS, I re-installed grub2 on both drives:

# for disk in sd{a,b} ; do grub-install --recheck /dev/$disk ; done
Installing for i386-pc platform.
Installation finished. No error reported.
Installing for i386-pc platform.
Installation finished. No error reported.

Looked much better this time!

After shutting down the server, I removed the first drive and booted again, and this time it worked. The BIOS found the grub2 bootloader on the second drive, booted from it, and the OS came up fully.
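
Pulling a drive is the real test, but the MBR of both disks can also be inspected quickly; a disk carrying a boot loader should be identified as a DOS/MBR boot sector (a small sketch with the device names used above):

# for disk in /dev/sda /dev/sdb; do dd if=$disk bs=512 count=1 2>/dev/null | file -; done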

 

Encrypted http connections (https) use four times more CPU resources
Thursday - Jan 19th 2017 - by - (0 comments)

Yesterday we finally enabled encrypted HTTP using TLS connections on https://www.nzz.ch, one of the largest newspapers in Switzerland. Besides the "switch" on the load balancers - which was the easy part - there was a lot of work involved between many different teams and external service providers. During the kickoff meeting a few weeks ago I was asked how the load balancers would perform when we enable HTTPS. I knew that the additional encryption of the HTTP traffic would use more CPU (every connection needs to be encrypted and decrypted), but I couldn't give an accurate number. But what I was sure of: we're not in the 90's anymore and the servers can handle additional load.

Well, yesterday was the big day and as soon as I forced the redirect from http to https, the CPU load went up. The network traffic itself stayed the same, so the increased CPU usage is caused by the HTTPS encryption. But see for yourself:

[Graph: CPU usage before and after enabling encrypted HTTP traffic]

Based on these graphs it's fair to say that encrypted http traffic uses around 4x more CPU than before.

 

Rename video files to filename with recorded date using mediainfo
Monday - Jan 16th 2017 - by - (0 comments)

In my family we use videos - of course - to document our kids growing up. Compared to our parents in the 80's, there isn't just one video camera available; nowadays there are cameras everywhere we look. Especially the cameras on mobile phones are handy for shooting some quick films.

The problem with several video sources, however, is that each source has its own file naming scheme (let alone the video and audio encoding). I found it especially annoying when I wanted to sort the videos by their recording date. Android in particular has the problem that you cannot rely on the "modified date" of the video file as the recorded date. Example: when you move all your photos and videos from the internal to the external SD card, the modified timestamp changes, and therefore they all end up with the same date.

Luckily video files have metadata containing a lot of information about the encoding used and the recorded date! This information can be retrieved using "mediainfo":

# mediainfo MOV_1259.mp4
General
Complete name                            : MOV_1259.mp4
Format                                   : MPEG-4
Format profile                           : Base Media / Version 2
Codec ID                                 : mp42
File size                                : 53.7 MiB
Duration                                 : 25s 349ms
Overall bit rate                         : 17.8 Mbps
Encoded date                             : UTC 2015-11-21 10:04:28
Tagged date                              : UTC 2015-11-21 10:04:28

Video
ID                                       : 1
Format                                   : AVC
Format/Info                              : Advanced Video Codec
Format profile                           : High@L4.0
Format settings, CABAC                   : Yes
Format settings, ReFrames                : 1 frame
Format settings, GOP                     : M=1, N=18
Codec ID                                 : avc1
Codec ID/Info                            : Advanced Video Coding
Duration                                 : 25s 349ms
Bit rate                                 : 17.5 Mbps
Width                                    : 1 920 pixels
Height                                   : 1 080 pixels
Display aspect ratio                     : 16:9
Frame rate mode                          : Variable
Frame rate                               : 29.970 fps
Minimum frame rate                       : 29.811 fps
Maximum frame rate                       : 30.161 fps
Color space                              : YUV
Chroma subsampling                       : 4:2:0
Bit depth                                : 8 bits
Scan type                                : Progressive
Bits/(Pixel*Frame)                       : 0.281
Stream size                              : 52.8 MiB (98%)
Title                                    : VideoHandle
Language                                 : English
Encoded date                             : UTC 2015-11-21 10:04:28
Tagged date                              : UTC 2015-11-21 10:04:28

Audio
ID                                       : 2
Format                                   : AAC
Format/Info                              : Advanced Audio Codec
Format profile                           : LC
Codec ID                                 : 40
Duration                                 : 25s 335ms
Duration_FirstFrame                      : 13ms
Bit rate mode                            : Constant
Bit rate                                 : 156 Kbps
Nominal bit rate                         : 96.0 Kbps
Channel(s)                               : 2 channels
Channel positions                        : Front: L R
Sampling rate                            : 48.0 KHz
Compression mode                         : Lossy
Stream size                              : 483 KiB (1%)
Title                                    : SoundHandle
Language                                 : English
Encoded date                             : UTC 2015-11-21 10:04:28
Tagged date                              : UTC 2015-11-21 10:04:28
mdhd_Duration                            : 25335

Told you there's a lot of meta information.

What I needed in this case was the line "Encoded date" - the day and time the video was encoded/recorded. Using this information, I am able to rename all the video files.

Simulation first:

# ls | grep "^MOV" | while read line; do targetname=$(mediainfo $line | grep "Encoded date" | sort -u | awk '{print $5"-"$6}'); echo "Old name: $line, new name: ${targetname}.mp4"; done
Old name: MOV_0323.mp4, new name: 2015-03-08-17:24:27.mp4
Old name: MOV_0324.mp4, new name: 2015-03-13-19:12:33.mp4
Old name: MOV_0325.mp4, new name: 2015-03-13-19:18:40.mp4
Old name: MOV_0329.mp4, new name: 2015-03-18-18:41:55.mp4
Old name: MOV_0355.mp4, new name: 2015-03-21-10:05:55.mp4
Old name: MOV_0369.mp4, new name: 2015-03-22-08:38:06.mp4
Old name: MOV_0370.mp4, new name: 2015-03-22-08:38:44.mp4
Old name: MOV_0371.mp4, new name: 2015-03-22-08:39:36.mp4
Old name: MOV_0372.mp4, new name: 2015-03-22-14:05:30.mp4
Old name: MOV_0374.mp4, new name: 2015-03-24-18:31:21.mp4
Old name: MOV_0375.mp4, new name: 2015-03-24-18:31:52.mp4
Old name: MOV_0392.mp4, new name: 2015-03-28-10:54:17.mp4
[...]

And final renaming:

# ls | grep "^MOV" | while read line; do targetname=$(mediainfo $line | grep "Encoded date" | sort -u | awk '{print $5"-"$6}'); echo "Old name: $line, new name: ${targetname}.mp4"; mv $line ${targetname}.mp4; done
Old name: MOV_0323.mp4, new name: 2015-03-08-17:24:27.mp4
Old name: MOV_0324.mp4, new name: 2015-03-13-19:12:33.mp4
Old name: MOV_0325.mp4, new name: 2015-03-13-19:18:40.mp4
Old name: MOV_0329.mp4, new name: 2015-03-18-18:41:55.mp4
Old name: MOV_0355.mp4, new name: 2015-03-21-10:05:55.mp4
Old name: MOV_0369.mp4, new name: 2015-03-22-08:38:06.mp4
Old name: MOV_0370.mp4, new name: 2015-03-22-08:38:44.mp4
Old name: MOV_0371.mp4, new name: 2015-03-22-08:39:36.mp4
Old name: MOV_0372.mp4, new name: 2015-03-22-14:05:30.mp4
Old name: MOV_0374.mp4, new name: 2015-03-24-18:31:21.mp4
Old name: MOV_0375.mp4, new name: 2015-03-24-18:31:52.mp4
Old name: MOV_0392.mp4, new name: 2015-03-28-10:54:17.mp4
[...]

Yes, looks good! With this approach I am now able to unify the file names of all different video sources and can save them according to their recorded (real) date.
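
The same loop can be written a bit more robustly using mediainfo's --Inform output (which avoids counting awk fields) and quoted file names; a sketch, assuming the "Encoded date" format shown above:

for f in MOV*.mp4; do
  # "Encoded date" comes back as e.g. "UTC 2015-11-21 10:04:28"
  encdate=$(mediainfo --Inform="General;%Encoded_Date%" "$f" | sed 's/^UTC //; s/ /-/')
  echo "Old name: $f, new name: ${encdate}.mp4"
  mv -n "$f" "${encdate}.mp4"   # -n refuses to overwrite an existing file
done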

 

Simple HTTP check monitoring plugin on Windows (check_http alternative)
Thursday - Jan 5th 2017 - by - (0 comments)

I was looking for a way to run a monitoring plugin, similar to check_http, on a Windows OS. The plugin itself would be executed through NRPE (using the NSClient installation) and report the HTTP connectivity from the point of view of this Windows server.

I came across some scripts, including some TCP port checks (worth mentioning: Protocol.vbs), some overblown PowerShell scripts and also a kind of check_tcp fork for Windows. Unfortunately none of them really did what I needed. So I built my own little VBScript, put together from information found in a couple of other articles, using the MSXML2.ServerXMLHTTP object.

In the end the script looks like this:

url = "ENTER_FULL_URL_HERE"
Set http = CreateObject("MSXML2.ServerXMLHTTP")
http.open "GET",url,false
http.send
If http.Status = 200 Then
  wscript.echo "HTTP OK - " & url & " returns " & http.Status
  exitCode = 0
ElseIf http.Status > 400 And http.Status < 500 Then
  wscript.echo "HTTP WARNING - " & url & " returns " & http.Status
  exitCode = 1
Else
  wscript.echo "HTTP CRITICAL - " & url & " returns " & http.Status
  exitCode = 2
End If

WScript.Quit(exitCode)

First define the URL in the first line (e.g. https://www.google.com) and then execute the script using cscript (without cscript you get the script's output as a dialog box):

C:\Users\Claudio\Documents>cscript check_http.vbs
Microsoft (R) Windows Script Host Version 5.8
Copyright (C) Microsoft Corporation. All rights reserved.

HTTP OK - https://www.google.com returns 200

Or hitting a page not found error:

C:\Users\Claudio\Documents>cscript check_http.vbs
Microsoft (R) Windows Script Host Version 5.8
Copyright (C) Microsoft Corporation. All rights reserved.

HTTP WARNING - https://www.google.com/this-should-not-exist returns 404

There's still much room to improve the script. It would be very nice to pass the URL as a command line argument. Maybe I'll get to that some time.

Finally, in nsclient.ini the script was defined so it can be called as an NRPE command:

; External scripts
[/settings/external scripts]
allow arguments=true
allow nasty characters=true
[/settings/external scripts/scripts]
check_http_google=scripts\\check_http_google.vbs

 

Creating custom PNP4Nagios templates in Icinga 2 for NRPE checks
Tuesday - Jan 3rd 2017 - by - (0 comments)

Since my early Nagios days (2005), I've used Nagiosgraph as my graphing service of choice. But in the last few years, other technologies came up. PNP4Nagios has become the de facto graphing standard for Nagios and Icinga installations. On big setups with several hundred hosts and thousands of services this is a wise choice; PNP4Nagios is a lot faster than Nagiosgraph. But Nagiosgraph can be more easily adapted to create custom graphs using the "map" file. That's why I ran PNP4Nagios and Nagiosgraph in parallel for the last few years on my Icinga 2 installation.

The main reason I couldn't get rid of Nagiosgraph was the performance data retrieved by plugins executed through check_nrpe. For example, the monitoring plugin check_netio:

$ /usr/lib/nagios/plugins/check_nrpe -H remotehost -c check_netio -a eth0
NETIO OK - eth0: RX=2849414346, TX=1809023474|NET_eth0_RX=2849414346B;;;; NET_eth0_TX=1809023474B;;;;

The plugin reads the RX and TX values from the ifconfig command. As we know, these are counter values; values which start at 0 (at boot time) and increase with the number of bytes passed through that interface.
While a check_disk through NRPE gives correct graphs in PNP4Nagios, the mentioned check_netio didn't:

[Graph: PNP4Nagios graph of check_disk through NRPE (correct)]
[Graph: PNP4Nagios graph of check_netio through NRPE (broken counter values)]

The first graph on top shows the values from a check_disk plugin. The second graph below represents the values from the check_netio plugin. Both plugins were executed through NRPE on the remote host.

The comparison between the two graphs shows pretty clearly that only absolute values (GAUGE in RRD terms; a good example: temperature) work correctly. The counter values are graphed as their ever-increasing raw value instead of the difference between two samples.

Where does this come from? Why doesn't PNP4Nagios reflect these values correctly? The problem can be found in the communication between Icinga 2 and PNP4Nagios.
Each time a host or service is checked in Icinga 2, the perfdata feature writes a performance data log file - by default in /var/spool/icinga2/perfdata. Inside such a log file Icinga 2 shows the following information:

$ cat /var/spool/icinga2/perfdata/service-perfdata*
[...]
DATATYPE::SERVICEPERFDATA    TIMET::1483441246    HOSTNAME::remotehost    SERVICEDESC::Network IO eth0    SERVICEPERFDATA::NET_eth0_RX=2316977534837B;;;; NET_eth0_TX=41612087322B;;;;    SERVICECHECKCOMMAND::nrpe    HOSTSTATE::UP    HOSTSTATETYPE::HARD    SERVICESTATE::OK    SERVICESTATETYPE::HARD
[...]

Take a closer look at the variable SERVICECHECKCOMMAND and you will see that it only contains nrpe - for each remote plugin executed through NRPE, whether it is check_disk, check_netio, check_ntp or whatever.
So Icinga 2 feeds this information to poor PNP4Nagios, which of course thinks all the checks are the same (nrpe) and handles all the graphs exactly the same way (GAUGE by default). Which explains why the graphs for plugins with COUNTER results fail.

In order to tell PNP4Nagios that we're running different kinds of plugins and values behind NRPE, Icinga 2's PerfdataWriter needs to be adapted a little bit. I edited the default PerfdataWriter object called "perfdata":

$ cat /etc/icinga2/features-enabled/perfdata.conf
object PerfdataWriter "perfdata" {
  service_format_template = "DATATYPE::SERVICEPERFDATA\tTIMET::$icinga.timet$\tHOSTNAME::$host.name$\tSERVICEDESC::$service.name$\tSERVICEPERFDATA::$service.perfdata$\tSERVICECHECKCOMMAND::$service.check_command$$pnp_check_arg1$\tHOSTSTATE::$host.state$\tHOSTSTATETYPE::$host.state_type$\tSERVICESTATE::$service.state$\tSERVICESTATETYPE::$service.state_type$"
  rotation_interval = 15s
}

I only changed the definition of the service_format_template. All other configurable options are still default. And even this is only a minor change, which in short looks like this:

SERVICECHECKCOMMAND::$service.check_command$$pnp_check_arg1$

With that change, Icinga 2's PerfdataWriter is ready. But the variable still needs to be set within the service object. As I use apply rules for generic service checks such as "Network IO", this was a quick modification in the apply rule of this service:

$ cat /etc/icinga2/zones.d/global-templates/applyrules/networkio.conf
apply Service "Network IO " for (interface in host.vars.interfaces) {
  import "generic-service"

  check_command = "nrpe"
  vars.nrpe_command = "check_netio"
  vars.nrpe_arguments = [ interface ]
  vars.pnp_check_arg1 = "_$nrpe_command$"

  assign where host.address && host.vars.interfaces && host.vars.os == "Linux"
  ignore where host.vars.applyignore.networkio == true
}

In this apply rule, where the "Network IO" service object is assigned to all Linux hosts (host.vars.os == "Linux") with existing interfaces (host.vars.interfaces), I simply added the value for the vars.pnp_check_arg1 variable. Which, in this case, is an underscore followed by the actual command launched by NRPE: "_check_netio".

After a reload of Icinga 2 and a manual check of the performance log file, everything looks good. Which means: the SERVICECHECKCOMMAND now contains both nrpe and the remote command (nrpe_check_netio):

$ cat /var/spool/icinga2/perfdata/service-perfdata*
[...]
DATATYPE::SERVICEPERFDATA    TIMET::1483441246    HOSTNAME::remotehost    SERVICEDESC::Network IO eth0    SERVICEPERFDATA::NET_eth0_RX=2316977634837B;;;; NET_eth0_TX=41612088322B;;;;    SERVICECHECKCOMMAND::nrpe_check_netio    HOSTSTATE::UP    HOSTSTATETYPE::HARD    SERVICESTATE::OK    SERVICESTATETYPE::HARD
[...]

Icinga 2 now gives correct and unique information to PNP4Nagios. But PNP4Nagios still needs to be told what to do. PNP4Nagios parses every line of the performance data it gets from Icinga 2 and checks if there is a template for the found command. Prior to the changes in the PerfdataWriter this was always only "nrpe", so PNP4Nagios used the following file: /etc/pnp4nagios/check_commands/check_nrpe.cfg. This is a standard file which comes with the PNP4Nagios installation.
Now that the command is "nrpe_check_netio", PNP4Nagios checks if there is a command definition with that name. When log level >=2 is activated within PNP4Nagios' perfdata process (set LOG_LEVEL to at least 2 in /etc/pnp4nagios/process_perfdata.cfg), the LOG_FILE (usually /var/log/pnp4nagios/perfdata.log) will show the following information:

$ cat /var/log/pnp4nagios/perfdata.log
[...]
2017-01-03 12:22:55 [15957] [3] DEBUG: RAW Command -> nrpe_check_netio
2017-01-03 12:22:55 [15958] [3]   -- name -> pl
2017-01-03 12:22:55 [15958] [3]   -- rrd_heartbeat -> 8460
2017-01-03 12:22:55 [15957] [2] No Custom Template found for nrpe_check_netio (/etc/pnp4nagios/check_commands/nrpe_check_netio.cfg)
[...]

PNP4Nagios now correctly understood that this is performance data for the command "nrpe_check_netio". Now we can create this config file and tell PNP4Nagios to create DERIVE graphs. DERIVE is another kind of COUNTER data type, with the difference that DERIVE values can be reset to 0, which is the case for the values in ifconfig.

$ cat /etc/pnp4nagios/check_commands/nrpe_check_netio.cfg
#
# Adapt the Template if check_command should not be the PNP Template
#
# check_command check_nrpe!check_disk!20%!10%
# ________0__________|          |      |  |
# ________1_____________________|      |  |
# ________2____________________________|  |
# ________3_______________________________|
#
CUSTOM_TEMPLATE = 1
#
# Change the RRD Datatype based on the check_command Name.
# Defaults to GAUGE.
#
# Adjust the whole RRD Database
DATATYPE = DERIVE
#
# Adjust every single DS by using a List of Datatypes.
DATATYPE = DERIVE,DERIVE

# Use the MIN value for newly created RRD Databases.
# This value defaults to 0
# USE_MIN_ON_CREATE = 1
#
# Use the MAX value for newly created RRD Databases.
# This value defaults to 0
# USE_MAX_ON_CREATE = 1

# Use a single RRD Database per Service
# This Option is only used while creating new RRD Databases
#
#RRD_STORAGE_TYPE = SINGLE
#
# Use multiple RRD Databases per Service
# One RRD Database per Datasource.
# RRD_STORAGE_TYPE = MULTIPLE
#
RRD_STORAGE_TYPE = MULTIPLE

# RRD Heartbeat in seconds
# This Option is only used while creating new RRD Databases
# Existing RRDs can be changed by "rrdtool tune"
# More on http://oss.oetiker.ch/rrdtool/doc/rrdtune.en.html
#
# This value defaults to 8640
# RRD_HEARTBEAT = 305


After a new check of a Network IO service, the xml file of that particular service was re-created with the following information:

# cat /var/lib/pnp4nagios/perfdata/remotehost/Network_IO_eth0.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<NAGIOS>
  <DATASOURCE>
    <TEMPLATE>nrpe_check_netio</TEMPLATE>
    <RRDFILE>/var/lib/pnp4nagios/perfdata/remotehost/Network_IO_eth0_NET_eth0_RX.rrd</RRDFILE>
    <RRD_STORAGE_TYPE>MULTIPLE</RRD_STORAGE_TYPE>
    <RRD_HEARTBEAT>8460</RRD_HEARTBEAT>
    <IS_MULTI>0</IS_MULTI>
    <DS>1</DS>
    <NAME>NET_eth0_RX</NAME>
    <LABEL>NET_eth0_RX</LABEL>
    <UNIT>B</UNIT>
    <ACT>1462883655</ACT>
    <WARN></WARN>
    <WARN_MIN></WARN_MIN>
    <WARN_MAX></WARN_MAX>
    <WARN_RANGE_TYPE></WARN_RANGE_TYPE>
    <CRIT></CRIT>
    <CRIT_MIN></CRIT_MIN>
    <CRIT_MAX></CRIT_MAX>
    <CRIT_RANGE_TYPE></CRIT_RANGE_TYPE>
    <MIN></MIN>
    <MAX></MAX>
  </DATASOURCE>
  <DATASOURCE>
    <TEMPLATE>nrpe_check_netio</TEMPLATE>
    <RRDFILE>/var/lib/pnp4nagios/perfdata/remotehost/Network_IO_eth0_NET_eth0_TX.rrd</RRDFILE>
    <RRD_STORAGE_TYPE>MULTIPLE</RRD_STORAGE_TYPE>
    <RRD_HEARTBEAT>8460</RRD_HEARTBEAT>
    <IS_MULTI>0</IS_MULTI>
    <DS>1</DS>
    <NAME>NET_eth0_TX</NAME>
    <LABEL>NET_eth0_TX</LABEL>
    <UNIT>B</UNIT>
    <ACT>1567726688</ACT>
    <WARN></WARN>
    <WARN_MIN></WARN_MIN>
    <WARN_MAX></WARN_MAX>
    <WARN_RANGE_TYPE></WARN_RANGE_TYPE>
    <CRIT></CRIT>
    <CRIT_MIN></CRIT_MIN>
    <CRIT_MAX></CRIT_MAX>
    <CRIT_RANGE_TYPE></CRIT_RANGE_TYPE>
    <MIN></MIN>
    <MAX></MAX>
  </DATASOURCE>
  <RRD>
    <RC>0</RC>
    <TXT>successful updated</TXT>
  </RRD>
  <NAGIOS_AUTH_HOSTNAME>remotehost</NAGIOS_AUTH_HOSTNAME>
  <NAGIOS_AUTH_SERVICEDESC>Network IO eth0</NAGIOS_AUTH_SERVICEDESC>
  <NAGIOS_CHECK_COMMAND>nrpe_check_netio</NAGIOS_CHECK_COMMAND>
  <NAGIOS_DATATYPE>SERVICEPERFDATA</NAGIOS_DATATYPE>
  <NAGIOS_DISP_HOSTNAME>remotehost</NAGIOS_DISP_HOSTNAME>
  <NAGIOS_DISP_SERVICEDESC>Network IO eth0</NAGIOS_DISP_SERVICEDESC>
  <NAGIOS_HOSTNAME>remotehost</NAGIOS_HOSTNAME>
  <NAGIOS_HOSTSTATE>UP</NAGIOS_HOSTSTATE>
  <NAGIOS_HOSTSTATETYPE>HARD</NAGIOS_HOSTSTATETYPE>
  <NAGIOS_MULTI_PARENT></NAGIOS_MULTI_PARENT>
  <NAGIOS_PERFDATA>NET_eth0_RX=1462883655B;;;; NET_eth0_TX=1567726688B;;;; </NAGIOS_PERFDATA>
  <NAGIOS_RRDFILE></NAGIOS_RRDFILE>
  <NAGIOS_SERVICECHECKCOMMAND>nrpe_check_netio</NAGIOS_SERVICECHECKCOMMAND>
  <NAGIOS_SERVICEDESC>Network_IO_eth0</NAGIOS_SERVICEDESC>
  <NAGIOS_SERVICEPERFDATA>NET_eth0_RX=1462883655B;;;; NET_eth0_TX=1567726688B;;;;</NAGIOS_SERVICEPERFDATA>
  <NAGIOS_SERVICESTATE>OK</NAGIOS_SERVICESTATE>
  <NAGIOS_SERVICESTATETYPE>HARD</NAGIOS_SERVICESTATETYPE>
  <NAGIOS_TIMET>1483442747</NAGIOS_TIMET>
  <NAGIOS_XMLFILE>/var/lib/pnp4nagios/perfdata/remotehost/Network_IO_eth0.xml</NAGIOS_XMLFILE>
  <XML>
   <VERSION>4</VERSION>
  </XML>
</NAGIOS>

The XML file shows that the nrpe_check_netio PNP4Nagios template is now used:

<TEMPLATE>nrpe_check_netio</TEMPLATE>

and the service check command is correctly identified as nrpe_check_netio:

<NAGIOS_CHECK_COMMAND>nrpe_check_netio</NAGIOS_CHECK_COMMAND>

Once /etc/pnp4nagios/check_commands/nrpe_check_netio.cfg was created, all the other hosts with this "Network IO" check were adapted and are now showing the correct graphs.

[Graph: PNP4Nagios check_netio through NRPE now showing correct counter values]

The same procedure can now be applied to all kinds of plugins which are executed through NRPE and output counter/derive values, for example check_diskio.
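
For check_diskio, for example, it boils down to copying the template and converting any RRD files that were already created with GAUGE data sources (a hedged sketch; the RRD path and data source names must match your own setup):

# New PNP4Nagios template for the next counter-based NRPE plugin
cp /etc/pnp4nagios/check_commands/nrpe_check_netio.cfg /etc/pnp4nagios/check_commands/nrpe_check_diskio.cfg

# Already existing RRD files keep their old data source type; convert them with rrdtool tune
rrdtool tune /var/lib/pnp4nagios/perfdata/remotehost/Network_IO_eth0_NET_eth0_RX.rrd --data-source-type NET_eth0_RX:DERIVE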

 

check_esxi_hardware and pywbem 0.10.x tested
Wednesday - Dec 21st 2016 - by - (0 comments)

Yesterday a new version (0.10.0) of pywbem was released. Will the monitoring plugin check_esxi_hardware continue to run without a glitch? It should, since the plugin was "made ready" for future releases of pywbem (see New version of check_esxi_hardware supports pywbem 0.9.x).

Let's try check_esxi_hardware with the new pywbem version. I used pip to upgrade pywbem to the latest available version:

 $ sudo pip install --upgrade pywbem
Downloading/unpacking pywbem from https://pypi.python.org/packages/9a/50/839b059c351c4bc22c181c0f6a5817da7ca38cc0ab676c9a76fec373d5f5/pywbem-0.10.0-py2.py3-none-any.whl#md5=1bc01e6fd91f5e7ca64c058f3e0c1254
  Downloading pywbem-0.10.0-py2.py3-none-any.whl (201kB): 201kB downloaded
Requirement already up-to-date: PyYAML in /usr/local/lib/python2.7/dist-packages (from pywbem)
Requirement already up-to-date: six in /usr/local/lib/python2.7/dist-packages (from pywbem)
Requirement already up-to-date: ply in /usr/local/lib/python2.7/dist-packages (from pywbem)
Requirement already up-to-date: M2Crypto>=0.24 in /usr/local/lib/python2.7/dist-packages (from pywbem)
Requirement already up-to-date: typing in /usr/local/lib/python2.7/dist-packages (from M2Crypto>=0.24->pywbem)
Installing collected packages: pywbem
  Found existing installation: pywbem 0.9.0
    Uninstalling pywbem:
      Successfully uninstalled pywbem
Successfully installed pywbem
Cleaning up...

And then launched the plugin:

$ ./check_esxi_hardware.py -H esxiserver -U root -P secret -v
20161221 12:17:44 Connection to https://esxiserver
20161221 12:17:44 Found pywbem version 0.10.0
20161221 12:17:44 Check classe OMC_SMASHFirmwareIdentity
20161221 12:17:45   Element Name = System BIOS
20161221 12:17:45     VersionString = B200M4.3.1.2b.0.042920161158
[...]
20161221 12:17:49 Check classe VMware_SASSATAPort
OK - Server: Cisco Systems Inc UCSB-B200-M4 s/n: XXXXXXXX Chassis S/N: XXXXXXXX  System BIOS: B200M4.3.1.2b.0.042920161158 2016-04-29

As you can see, the new version works without problems. Go ahead and upgrade pywbem.
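
Should a future pywbem release ever cause trouble, pip can also pin or roll back to a known good version (a sketch with the version numbers mentioned above):

$ sudo pip install pywbem==0.9.0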

 

The Docker Dilemma: Benefits and risks going into production with Docker
Friday - Dec 16th 2016 - by - (2 comments)

Over a period of more than one year I've followed the Docker hype. What is it about? And why does it seem that all developers absolutely want to use Docker and not other container technologies? Important note: although it may seem that I'm a sworn enemy of Docker, I am not! I find all kinds of new technologies interesting, to say the least. But I'm skeptical, always have been, when it comes to phrases like "this is the ultimate solution to all your problems". So this article mainly documents the most important points I dealt with over that year: handling risks and misunderstandings and trying to find solutions for them.

When Docker came up the first time as a request (which then turned into a demand), I began my research. And I completely understood what Docker was about. Docker was created to fire up new application instances quickly and therefore allow greater and faster scalability. A good idea, basically, which sounds very interesting and makes sense - as long as your application can run independently. What I mean by that is:

  • Data is stored elsewhere (not on local file system), for example in an object store or database which is accessed by the network layer
  • There are no hardcoded (internal) IP addresses in the code
  • The application is NOT run as root user, therefore not requiring privileged rights
  • The application is scalable and can run in parallel in several containers

But the first problems already arose. The developer in question (let's call him Dave) wanted to store data in the container. He didn't care if his application ran as root or not (I'm quoting: "At least then I got no permission problems"). And he wanted to access an existing NFS share from within the container.

I told him about Linux Containers and that this would be better solved with the LXC technology. To my big surprise, he didn't even know what LXC was. So not only was the original concept of Docker containers misunderstood, the origins of the project (Docker was originally based on LXC until Docker eventually replaced it with its own libcontainer library) were not even known. Another reason to use Docker, according to this developer: "I can just install any application as a Docker container I want - and I don't even need to know how to configure it." Good lord. It's as if I wanted to build a car myself just because I don't want anyone else to do it. The fact that I have no clue how to build a car does obviously not matter.

More or less in parallel, another developer from another dev team (let's call him Frank) was also pushing for Docker. He created his code in a Docker environment (which is absolutely fine) using a MongoDB in the background. It's important to understand that using a code library to access a MongoDB and managing a MongoDB are two entirely different things. So by installing MongoDB from a Docker image (which he had found on the Internet) he had a working MongoDB, yes. But what about the tuning and security settings of MongoDB? Those were left as is, because the knowledge of managing MongoDB was not there. As I've been managing MongoDB since 2013, I know where to find its most important weaknesses and how to tackle them (I wrote about this in an older article "It is 2015 and we still should not use MongoDB (POV of a Systems Engineer)"). If I had let this project go into production as is, the MongoDB would have been available to the whole wide world - without any authentication! So I was able to convince this developer that MongoDB should be run separately, managed separately, and most importantly: MongoDB stores persistent data. Don't run this as a Docker container.

While I was able to talk some sense into Frank, Dave still didn't see any issues or risks. So I created the following list to have an overview of unnecessary risks and problems:

  • Read-only file system (OverlayFS, layers per app): you can temporarily alter files, but at the next boot of the container these changes are gone. You will have to redeploy the full container, even for fixing a typo in a config file. This also means that no security patches can be applied.
  • If you want to save persistent data, an additional mount of a data volume is required. Which adds complexity, dependencies and risks (see details further down in this article).
  • Shutdown/Reboot means data loss, unless you are using data volumes or your application is programmed smart enough to use object stores like S3 (cloud-ready).
  • If you use data volumes mounted from the host, you lose flexibility of the containers, because they're now bound to the host.
  • Docker containers are meant to run one application/process, no additional services and daemons. This makes troubleshooting hard, because direct SSH is not possible, monitoring and backup agents are not running. You can solve this by using a docker image already being prepped up with all the necessary stuff. But when adding all this stuff in the first place, LXC would be a better choice.
  • A crash of the application which crashes the container cannot be analyzed properly, because log files are not saved (unless, again, a separate data volume is used for the logs).
  • Not a full network stack: Docker containers are not "directly attached" to the network. They're connected through the host and connections go through Network Address Translation (NAT) firewall rules. This adds additional complexity for troubleshooting network problems.
  • The containers run as root and install external contents through public registries (Dockerhub for example). Unless this is defined differently by using an internal and private Docker registry, this adds risks. What is installed? Who verified the integrity of the downloaded image/software? This is not me just saying this, it's proven that this is a security problem. See InfoQ article Security vulnerabilities in Docker Hub Images.
  • OverlayFS/ReadOnly FS are less performant.
  • In general troubleshooting a problem will take more time because of additional complexity compared to "classic" systems or Linux containers because of the network stack, additional file system layers, data volume mounts, missing log files and image analysis.
  • Most of these problems can be solved with workarounds. For example by using your own registry with approved code. Or rewrite your application code to use object stores for file handling. Or create custom base images which contain all your necessary settings and programs/daemons. Or use a central syslog server. But as we all know, workarounds mean additional work, which means costs.

Even with all these technical points, Dave went on with his "must use Docker for everything" monologue. He was even convinced that he wanted to manage all the servers himself, even database servers. I asked him why he'd want to do that in the first place and his answer was "So I can try a new MySQL version". Let's assume for a moment that this is a good idea and MySQL runs as a Docker container with an additional volume holding /var/lib/mysql. Now Dave deploys a new Docker container with a new MySQL version - being smart and shutting down the old version first. As soon as MySQL starts up, it will pick up the databases found in /var/lib/mysql and upgrade the tables according to the new version (mainly the tables in the mysql database). And now let's assume that after two days a bug shows up in the production app: the current application code is not fully compatible with the newer MySQL version. You cannot downgrade to the older MySQL version anymore because the tables were already altered. I've seen such problems in the past (see Some notes on a MySQL downgrade 5.1 to 5.0), so I know the problems of downgrading already upgraded data. But obviously my experience and my warnings didn't count and were ignored.
Eventually Dave's team started to build their own hosting environment. I later heard that they had destroyed their Elasticsearch data, because something went wrong within their Docker environment and the data volume holding the ES data...

Meanwhile I continued my research and created my own test lab using plain Docker (without any orchestration). I came across several risks. Especially the volume mounts from the host caught my eye. A Docker container is able to mount any path from its host when the container is started up (docker run). As a simple test, I created a Docker container with the following volume information:

docker run ... -v /:/tmp ...

The whole file system of the host was therefore mounted in the container as /tmp. With write permissions. Meaning you can delete your entire host's filesystem, by error or on purpose. You can read and alter the (hashed) passwords from /etc/shadow (in this case by simply accessing /tmp/etc/shadow in the container).

root@5a87a58982f9:/# cat /tmp/etc/shadow | head
root:$6$9JbiWxjT$QKL4M1GiRKwtQrJmgX657XvMW02u8KjOzxSaRRWhFaSJwcpLXLdJZwkD8QEwk0H
IaxzOlf.JtWcwVykXAex2..:17143:0:99999:7:::
daemon:*:17001:0:99999:7::: 
bin:*:17001:0:99999:7:::
sys:*:17001:0:99999:7:::
sync:*:17001:0:99999:7:::
games:*:17001:0:99999:7:::
man:*:17001:0:99999:7:::
lp:*:17001:0:99999:7:::
mail:*:17001:0:99999:7:::
news:*:17001:0:99999:7:::
root@5a87a58982f9:/#

Basically by being root in the container with such a volume mount, you take over the host - which is supposed to be the security guard for all containers. A nice article with another practical example can be found here: Using the docker command to root the host (totally not a security issue).

Another risk, less dangerous but still worth mentioning, is mounting the host's Docker socket (/var/run/docker.sock) into a container. Such a container is then able to pull information about all containers running on the same host. This information sometimes contains environment variables. Some of these may contain cleartext passwords (e.g. to start up a service to connect to a remote DB with given credentials, see The Dangers of Docker.sock).

In general you can find a lot of articles warning you about exposing the Docker socket. Interestingly, these articles were mainly written by systems engineers, rarely by developers.

Besides the volumes, another risk is the creation of privileged containers. They are basically allowed to do anything, even once they're already running. This means that within a running container you can create a new mount point and mount the host's file system right into the container. For unprivileged containers this would only work during the creation/start of the container. Privileged containers can do that at any time.
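
To illustrate the point, this is all it takes inside a privileged container (a sketch; the ubuntu image is the one used later in this article and /dev/sda1 is just an example device that depends on the host):

# On the Docker host: start a privileged container
docker run -it --privileged ubuntu:14.04.3 /bin/bash
# Inside the already running container: mount a host partition at any time
mkdir /hostfs && mount /dev/sda1 /hostfs
head -1 /hostfs/etc/shadow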

My task, being responsible for systems and their stability and security, is to prevent volumes and privileged containers in general. Once more: from the point of view of a container, a volume is only needed if persistent data needs to be written to the local filesystem. And if you do that, Docker is not the right solution for you anyway. I started looking, but to my big surprise there is no way to simply prevent Docker containers from creating and mounting volumes. So I created the following wrapper script, which acts as the main "docker" command:

#!/bin/bash
# Simple Docker wrapper script by www.claudiokuenzler.com
# Logs every invocation and refuses commands that try to mount volumes or
# start privileged containers; everything else is handed over to the real
# binary, which was renamed to /usr/bin/docker.orig.

CMD="$@"

echo "Your command was: $CMD" >> /var/log/dockerwrapper.log

if echo $CMD | grep -q -e "-v"; then echo "Parameter for volume mounting detected. This is not allowed."; exit 1; fi
if echo $CMD | grep -q -e "--volume"; then echo "Parameter for volume mounting detected. This is not allowed."; exit 1; fi
if echo $CMD | grep -q -e "--privileged"; then echo "Parameter for privileged containers detected. This is not allowed."; exit 1; fi

/usr/bin/docker.orig $CMD

While this works on the local Docker host, it does not work when the Docker API is used through the Docker socket. And because in the meantime we decided (together with yet another developer, who understands my concerns and will be in charge of the Docker deployments) to use Rancher as the overlying administration interface (which in the end uses the Docker socket through a local agent), the wrapper script is not enough. So a prevention should either be configurable in Docker or Rancher; most importantly, Docker itself should support security configurations to prevent certain functions or container settings (comparable to disable_functions in PHP).

In my attempts to prevent Docker from mounting host volumes, I also came across a plugin called docker-novolume-plugin. This plugin prevents the creation of data volumes - but unfortunately does not prevent the mounting of the host's filesystem. I opened up a feature request issue on the GitHub repository, but as of today it's not resolved.

Another potential solution could have been a working AppArmor profile of the Docker engine. But a working AppArmor profile is only in place for a running container itself, not for the engine creating and managing containers:

Docker automatically loads container profiles. The Docker binary installs a docker-default profile in the /etc/apparmor.d/docker file. This profile is used on containers, not on the Docker Daemon.

I also turned to Rancher and created a feature request issue on their GitHub repo as well. To be honest, with little hope that this will be implemented soon, because as of this writing, the Rancher repo still has over 1200 open issues to be addressed and solved.

So neither Docker, nor Rancher, nor AppArmor are at this moment capable of preventing dangerous (and unnecessary) container settings.

How to proceed from here? I didn't want to "block" the technology, yet "volume" and "privileged" are clear no-gos for a production environment (once again, OK in development environments). I started digging around in the Rancher API and it is actually a very nice and easy-to-learn API. It turns out a container can be stopped and deleted/purged through the API using an authorization key and password. I decided to combine this with our existing Icinga 2 monitoring. The goal: on each Docker host, a monitoring plugin is called every other minute. This plugin goes through the list of all containers using the IDs from the "docker ps" output.

root@dockerhost:~# docker ps | grep claudio
CONTAINER ID  IMAGE           COMMAND       CREATED        STATUS         PORTS     NAMES
5a87a58982f9  ubuntu:14.04.3  "/bin/bash"   5 seconds ago  Up 4 seconds             r-claudiotest

This ID represents the "externalId" which can be looked up in the Rancher API. Using this information, the Rancher API can be queried to find out about the data volumes of this container in the given environment (1a12), using the "externalId_prefix" filter:

curl -s -u "ACCESSUSER:ACCESSKEY" -X GET -H 'Accept: application/json' -H 'Content-Type: application/json' -d '{}' 'https://rancher.example.com/v1/projects/1a12/containers?externalId_prefix=5a87a58982f9' | jshon -e data -a -e dataVolumes
[
 "\/:\/tmp"
]

As soon as something shows up in the array, this is considered bad and further action takes place. The Docker "id" within the Rancher environment can be figured out, too:

curl -s -u "ACCESSUSER:ACCESSKEY" -X GET -H 'Accept: application/json' -H 'Content-Type: application/json' -d '{}' 'https://rancher.example.com/v1/projects/1a12/containers?externalId_prefix=5a87a58982f9' | jshon -e data -a -e id
"1i22342"

Using this "id", the bad container can then be stopped and deleted/purged:

curl -s -u "ACCESSUSER:ACCESSKEY" -X POST -H 'Accept: application/json' -H 'Content-Type: application/json' -d '{"remove":true, "timeout":0}' 'https://rancher.example.com/v1/projects/1a12/instances/1i22342/?action=stop'
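
Put together, the core of the monitoring plugin is little more than a loop over "docker ps" combined with the API calls shown above (a condensed sketch using the same access keys, environment id 1a12 and jshon calls; error handling omitted):

#!/bin/bash
# Sketch: stop and purge containers that were started with data volumes
RANCHER="https://rancher.example.com/v1/projects/1a12"
AUTH="ACCESSUSER:ACCESSKEY"

for cid in $(docker ps -q); do
  # Does Rancher report any data volumes for this container?
  volumes=$(curl -s -u "$AUTH" -H 'Accept: application/json' "${RANCHER}/containers?externalId_prefix=${cid}" | jshon -e data -a -e dataVolumes)
  if [ -n "$volumes" ] && [ "$volumes" != "[]" ]; then
    # Look up the Rancher id of the offending container and stop + purge it
    rancherid=$(curl -s -u "$AUTH" -H 'Accept: application/json' "${RANCHER}/containers?externalId_prefix=${cid}" | jshon -e data -a -e id | tr -d '"')
    curl -s -u "$AUTH" -X POST -H 'Content-Type: application/json' -d '{"remove":true, "timeout":0}' "${RANCHER}/instances/${rancherid}/?action=stop"
  fi
done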

Still, this does not prevent the creation and mounting of volumes or of the host's filesystem, nor does it prevent privileged containers upon creation of a container. But it will ensure that such containers, whether created on purpose, by accident or through a hack, are immediately destroyed. Hopefully before they can do any harm. I sincerely hope that Docker will look more into security and Docker settings though. Without such workarounds and efforts - and a cloud-ready application - it's not advisable to run Docker containers in production. And most importantly: you need your developer colleagues to understand where and when Docker containers make sense.

For now the path to Docker continues with the mentioned workaround and there will be a steep learning curve, probably with some inevitable problems at times - but in the end (hopefully) a stable, dynamic and scalable production environment running Docker containers.

How did YOU tackle these security issues? What kind of workarounds or top layer orchestration are you using to prevent the dangerous settings? Please leave a comment, that would be much appreciated!

 

Install Lineage 14 (Android 7 Nougat) on Samsung Galaxy S5 Plus G901F
Monday - Dec 12th 2016 - by - (6 comments)

It has been a while since I wrote an Android article, because it has been a while since I saw an update for the Samsung Galaxy S5 Plus (model number G901F). Back in July 2015 I wrote two articles for this device.

Since July 2015 I kept using CM 12.0 (Android 5.0) on the G901F. CM 12.1 turned out to be a battery burner, and 12.0, although no longer receiving updates, was still much better than the original stock Android (TouchWiz) from Samsung.

Out of curiosity I checked if there was a more recent version and, to my big surprise, someone really did care about that device and created new CM versions (see this XDA forums thread).

So this article describes how you can install CyanogenMod 14 (Android 7) on your Samsung Galaxy S5 Plus (G901F). But first some preparations need to be done. As it turns out, the newer Android versions require a newer bootloader and modem driver. I had to fall flat on my nose myself to figure that out. Please read and follow the steps below carefully.

1. You understand that you most likely void the warranty of your Samsung device. As with all other tutorials, you are responsible for your own actions etc. bla bla. If you brick/destroy your device it's your own fault.

2. Download the newest version of Odin from http://odindownload.com/download. Odin is a tool to install/flash firmware to Samsung devices. As of this writing I downloaded and installed Odin 3.12.3.

3. Download the newer bootloader and modem driver for this phone with version CPE1. The original links were given to me in the XDA forums by user ruchern.

Some notes on the Bootloader (BL) and Modem (CP) versions: Besides CPE1 I also tried versions BOH4 (always rebooted the phone during the Wifi screen in Android setup) and CPHA (which never completely booted Android).

4. Download a new recovery ROM. I chose TWRP, which can be downloaded here: http://teamw.in/devices/samsunggalaxys5plus.html . Download the "tar" package. At the time of writing the current version was twrp-3.0.2-0-kccat6.img.tar. Note: in my older article I used CWM recovery. TWRP can mount the phone as a USB drive while in recovery, which is very helpful for transferring the zip files.

5. Download and install the Samsung USB drivers (SAMSUNG_USB_Driver_for_Mobile_Phones.zip) if you haven't already. You can download this from http://developer.samsung.com/technical-doc/view.do?v=T000000117.

6. Power off the Galaxy S5.

7. Boot your phone into Download Mode by pressing the following buttons at the same time: [Volume Down] + [Home] + [Power] until you see a warning triangle. Accept the warning by pressing the [Volume Up] button.

[Screenshot: G901F in Download Mode]

8. Start the Odin executable. You might have to unzip/unpack the downloaded Odin version first. 

9. Connect the phone to the computer with the phone's USB cable. In Odin one of the ID:COM fields should now show a connection. In the "Log" field you should see an entry like "Added!!".

10. Let's start by installing TWRP recovery. In Odin click on the "AP" button and select the tar file from twrp (twrp-3.0.2-0-kccat6.img.tar).

[Screenshot: Odin with the TWRP tar file selected under AP]

Then click on Start. The phone will reboot (unless you have unticked auto-reboot in the Odin options). Let the phone finish booting your existing OS and then power it off again. Exit Odin and disconnect the USB cable.

11. This is for verification: boot the phone into Recovery mode by pressing the following buttons at the same time: [Volume Up] + [Home] + [Power] until you see blue text at the top. You should now see the TWRP recovery. If this worked for you - great, we can proceed. If not, you can try again or try to install another recovery (check out Samsung Galaxy S5 (G901F): Pain to install custom recovery or Cyanogenmod again). Power off the Galaxy S5.

[Screenshot: TWRP recovery on the G901F]

12. Boot your phone into Download Mode again by pressing the following buttons at the same time: [Volume Down] + [Home] + [Power] until you see a warning triangle. Accept the warning by pressing the [Volume Up] button.

13. Start Odin again and connect your phone with the USB cable. This time we're going to flash the new Bootloader (BL) and Modem (CP) versions. Click on the "BL" button and select the bootloader file (G901FXXU1CPE1_bootloader.tar.md5). Click on the "CP" button and select the modem file (G901FXXU1CPE1_modem.tar.md5).

[Screenshots: Odin flashing the G901F bootloader (BL) and modem (CP) files]

Then click on the "Start" button. The phone will reboot again, once done.

14. Now I'm not sure whether your old Android installation will still boot with the new bootloader or not. If it doesn't boot even after several minutes and is stuck on the same screen, just power off the phone (in the worst case by pulling the battery). If it does still boot your old Android OS, do a normal power off of the phone. Disconnect the USB cable. Exit Odin if you haven't already.

15. Boot the phone into Recovery mode by pressing the following buttons at the same time: [Volume Up] + [Home] + [Power] until you see blue text at the top. Connect the USB cable. In TWRP tap on "Mount". In the next window tap on "Mount USB Storage". Your phone should now appear as USB storage on your computer and you can simply transfer files to the phone.

[Screenshots: TWRP main menu and mounting the phone as USB storage]

16. On your computer download CM14 from http://ionkiwi.nl/archive/view/4/samsung-galaxy-s5--g901f--kccat6xx. In my case, I downloaded the currently latest CM14.0 (cm-14.0-20161208-UNOFFICIAL-kccat6xx.zip). Once download is complete, transfer the file to your phone using the mounted USB storage.

17. On your computer download the Google Apps (GApps) using http://opengapps.org/. Select Platform: ARM, Android: 7.0, Variant: mini (Note: the default "stock" variant didn't work for me; it caused a crash of "Google Play Services" in the Android setup after the initial boot of the phone). This should give you a file like this: open_gapps-arm-7.0-mini-20161211.zip. Once the download is complete, transfer the file to your phone using the mounted USB storage.

[Screenshot: the zip files placed on the internal SD card]

18. On your phone in TWRP go back to the main screen and tap on "Wipe". Swipe the blue bar to the right for a Factory Reset.

[Screenshots: factory reset in the TWRP Wipe menu]

19. In TWRP go back to the main screen and tap on "Install". Cool in TWRP: You can select several zip files to install one after another. So first select the cm-14 zip file, then tap on "Add more Zips" and then select the open_gapps zip file. After you selected the open_gapps zip file, tick the "Reboot after installation is complete" checkbox. Then swipe the blue bar to the right to install the zip files ("Swipe to confirm Flash").

[Screenshots: installing the CM14 and open_gapps zip files in TWRP]

20. After the installation the phone reboots and the CyanogenMod logo should appear. Give the phone some time to boot; the first boot took around 3 minutes on my phone. Then the Android setup starts, which I really don't need to explain.

[Screenshots: CM14 booting, the Android setup and Android 7 Nougat on the G901F]

21. After the Android setup you can check out your phone's version in Settings -> About.

[Screenshots: version information showing Android 7 / CyanogenMod 14 on the G901F]

Enjoy your phone not being dead :D

PS: I created a static mirror of the mentioned files in case the original links stop working in the future: https://www.claudiokuenzler.com/downloads/G901F-CM14/

Update January 3rd 2017: As you may have heard, the CyanogenMod project is dead. A fork of CM, called Lineage, is available though. This howto of course also works for the newer Lineage zip files. I changed the title of this howto accordingly.

 

