
Application (Docker/Kubernetes) containers and STDOUT logging
Tuesday - Jan 15th 2019

In our Docker container environment (on premise, using Rancher) I have configured the Docker daemon to forward STDOUT logs from the containers to a central Logstash using GELF.
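For reference, the relevant Docker daemon configuration (/etc/docker/daemon.json) for GELF logging looks roughly like this; the Logstash address below is a placeholder, not our actual endpoint:

```json
{
  "log-driver": "gelf",
  "log-opts": {
    "gelf-address": "udp://logstash.example.com:12201"
  }
}
```

After changing daemon.json, the Docker daemon needs a restart, and only newly created containers pick up the new logging driver.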

For applications logging by default to STDOUT this works out of the box. But for some hand-written applications this might require some additional work.

In this particular example the application simply logged into a local log file on the AUFS filesystem (/tmp/application.log). But of course none of these log messages ever arrived in the ELK stack, because they were written into a file instead of to STDOUT.

The developer then adjusted the Dockerfile and instead of creating the log file, created a symlink:

# forward logs to docker log collector
RUN ln -sf /dev/stdout /tmp/application.log

To be honest, I thought this would do the trick. But once the new container image was deployed, the application logs didn't arrive in our ELK stack. Why?

I went into the container and tested myself:

root@af8e2147f8ba:/app# cd /tmp/

root@af8e2147f8ba:/tmp# ls -la
total 12
drwxrwxrwt  3 root root 4096 Jan 15 12:55 .
drwxr-xr-x 54 root root 4096 Jan 15 12:57 ..
lrwxrwxrwx  1 root root   11 Jan 15 12:52 application.log -> /dev/stdout
drwxr-xr-x  3 root root 4096 Jan 15 12:52 npm-6-d456bc8a

Yes, there is the application log file, which is a symlink to /dev/stdout. Should work, right? Let's try this:

root@af8e2147f8ba:/tmp# echo "test test test" > application.log
test test test

Although I saw "test test test" appearing in the terminal, this message never made it into the ELK stack. While researching why, I came across a VERY GOOD explanation by user "phemmer" on this GitHub issue:

"The reason this doesn't work is because /dev/stdout is a link to STDOUT of the process accessing it. So by doing foo > /dev/stdout, you're saying "redirect my STDOUT to my STDOUT". Kinda doesn't do anything :-).
And since /var/log/test.log is a symlink to it, the same thing applies. What you want is to redirect output to STDOUT of PID 1. PID 1 is the process launched by docker, and its STDOUT will be what docker picks up."

So to sum this up, we need to use the STDOUT of PID 1 (the container itself), otherwise the message won't be picked up by the Docker daemon.

Let's try this inside the still running container:

root@af8e2147f8ba:/tmp# rm application.log
root@af8e2147f8ba:/tmp# ln -sf /proc/1/fd/1 /tmp/application.log
root@af8e2147f8ba:/tmp# echo 1 2 3 > application.log

And hey, my 1 2 3 appeared in Kibana!

Docker container logs STDOUT logging

I slightly modified the Dockerfile with that new knowledge:

# forward logs to docker log collector
RUN ln -sf /proc/1/fd/1 /tmp/application.log

Note: /proc/1 obviously is PID 1. fd/1 is its STDOUT, as you might know from typical cron jobs (e.g. */5 * * * * myscript.sh 2>&1); fd/2 would be STDERR by the way.
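The quoted explanation can even be verified without a container. In this minimal sketch, a child shell writes to its own /proc/$$/fd/1 and the text lands on the child's stdout, which the parent captures:

```shell
#!/bin/sh
# /proc/<PID>/fd/1 is the stdout of that PID: the child shell writes to
# its own /proc/$$/fd/1, and the message arrives on the child's stdout,
# which the command substitution below captures.
captured=$(sh -c 'echo "via proc fd" > /proc/$$/fd/1')
echo "captured: $captured"
```

Inside a container, PID 1 is the process started by Docker, so /proc/1/fd/1 is exactly the stream the Docker daemon reads.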

After the new container image was built, deployed and started, the ELK stack started receiving the application logs:

Container logs appearing in ELK stack


Outages in the Cloud. Whom to blame and how to prove it?
Friday - Jan 11th 2019

Enterprises have started to use "the cloud" more and more often in the past few years. Applications sometimes run completely in the cloud, sometimes as a hybrid construct. In a hybrid environment, some "parts" of the application architecture (for example a database) run in the cloud, while others (for example an API server) run on premise. Sometimes it's the other way around; the combinations are endless.

Using a hybrid architecture has several advantages. If you build your application and its architecture correctly, you can run the full stack in either location (on premise or in the cloud), which means you now have a disaster recovery environment for your application. That's pretty neat.

But this also leads to additional problems: Network interruptions and latency between on-premise and the cloud may cause at least delays, at worst outages, in the application. The Internet connection between these two points becomes a very important dependency.

We've been using a hybrid architecture for a couple of years now and mostly it runs pretty well. But when we do experience issues (e.g. timeouts), we need to find the source of the problem as soon as possible. Most of the time we've identified the Internet connectivity between the on-premise data center and the cloud as the source. But whose fault is it? Our own (firewall), our Internet Service Provider (ISP), the Cloud Service Provider (CSP) or something in between (Internet Exchanges)? That question was never easy to answer, and whenever we contacted our ISP to help identify where the problem was, it took two days to get a response which usually didn't help us technically either.

What you do in such moments of connectivity problems is: troubleshooting. And this includes getting as much data as possible from all the resources you have: check stats and graphs from monitoring, get external verification (is it just me or everyone?) and, most importantly, run a connectivity check. A very well-known tool for this is "mtr", which stands for "my traceroute". It's basically an advanced version of the classic traceroute (or tracert on Windows) command.
By following the "mtr" output, the hop causing the issue can sometimes be identified immediately. But in almost all cases I wished I had a comparison at hand: right now we're going through these 20 hops, but what happened 30 minutes ago, when there were no connectivity issues?

I had been looking for a monitoring plugin which basically runs mtr and graphs the data for a couple of months, but there was no real solution, until I finally came across a GitHub repository where mtr is run in the background from a Python script. The results are written into a time series database (InfluxDB). Sounded pretty good to me and I gave it a try. Unfortunately there were some problems running the Python script standalone: it was supposed to be started within a Docker container and install an InfluxDB inside that container, which I didn't want.
Note: If you don't have an InfluxDB at hand (or don't know how to administrate InfluxDB) and don't mind keeping the data in a container, the existing script is great!

I rewrote that script in Bash and added some additional features. Then I let it run at a 2-minute interval for a couple of destinations. Each hop and its data (latency, packet loss, etc.) is entered into InfluxDB using the timestamp of that run. Using a Grafana dashboard it is now possible to see the exact hops, their latencies and packet drops at a certain moment in time. This also allows comparing the different hops in case there was a routing change.
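The script itself will be published later, but the core idea can be sketched in a few lines. This is a hypothetical illustration, not the actual script: the sample line and the measurement/tag names are made up, and the field positions follow mtr's --report column layout (HOP, HOST, Loss%, Snt, Last, Avg, Best, Wrst, StDev):

```shell
#!/bin/sh
# Hypothetical sketch: convert one "mtr --report" line into InfluxDB
# line protocol (measurement,tag=... field=... timestamp).
line=" 3.|-- 10.0.0.1    0.0%    10    1.2   1.4   1.1   2.0   0.3"
target="es.example.com"   # tag: the destination of this mtr run (made up)
ts=$(date +%s%N)          # nanosecond timestamp, as InfluxDB expects

set -- $line              # split the report line into positional fields
hop=${1%%.*}              # " 3.|--" -> "3"
host=$2
loss=${3%\%}              # strip the trailing percent sign
avg=$6
echo "mtr,target=$target,hop=$hop,host=$host loss=$loss,avg=$avg $ts"
```

Each such line can then be POSTed to InfluxDB's /write endpoint, e.g. with curl, once per hop and per run.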

MTR Grafana Dashboard 

Now that I had this tool in place and collecting data, I just needed to wait for the next outage.

Yesterday, January 10th 2019, it finally happened. We experienced timeouts in all our web applications which connect to an Elasticsearch service in the cloud, running in AWS Ireland. Thanks to our monitoring we almost immediately saw that the latency between the applications in our data center and Elasticsearch spiked, while the same checks from within AWS Ireland stayed roughly the same:

ElasticSearch Roundtrip Graph

There was a downtime of 13 minutes for most of our web applications between 16:13 and 16:26. Once the issue was resolved, the blame game started (of course!).
But this time I was able to refer to the MTR dashboard and compare the hops at 16:00-16:05 with 16:15-16:20 and with 16:30-16:35:

Routing change within AWS caused downtime 

By comparing the hops side by side, something interesting is revealed: there were indeed routing changes, but only after the hop which already belongs to AWS.
This means that internal AWS routing changes caused the outage. Yet on the AWS status dashboard, everything was green (as always...).

Thanks to this MTR dashboard we are now able to identify when and where a routing change happened, which helps settle the blame game.

PS: I will release the Bash script, installation instructions and the Grafana dashboard (it's slightly different from the one in the mtr-monitor repository) on GitHub soon.


Monitoring plugin check_smart 5.11.1 released
Tuesday - Jan 8th 2019

The monitoring plugin check_smart, to monitor hard and solid state drives' SMART attributes, is out with a new version.

Version 5.11.1 is a bugfix version and removes Perl warnings of uninitialized values (issues #29 and #31 on GitHub).

It also adds relevant information in the "help" output for the recently introduced exclude parameter.


SSL TLS SNI certificate monitoring with Icingaweb2 x509 module
Monday - Jan 7th 2019

Back in November 2018 I attended the Open Source Monitoring Conference (OSMC) in Nuremberg, Germany. One of the presentations included a quick demo of a new module for Icingaweb2, the "quasi standard" user interface for Icinga2. The module is called "x509" and its purpose, as one may already guess from the name, is to monitor SSL/TLS certificates.

This article covers the installation, activation and configuration of the module, plus some first impressions (pros/cons).

[ Requirements ]

But before the actual module is installed, some requirements need to be verified.

1. Icingaweb2 requirements
- Your Icingaweb2 version shouldn't be too old; at least version 2.5 is required.
- Two additional modules are needed as dependencies: reactbundle and ipl (described later)

2. OpenSSL
- You need to have OpenSSL installed.

# apt-get install openssl

3. Database requirements
- The module supports either MySQL or MariaDB
- Make sure the InnoDB variables are the following (in my installation on Ubuntu 16.04 this was already the default):

# mysql -e "show variables where variable_name like 'innodb_file%'"
| Variable_name            | Value     |
| innodb_file_format       | Barracuda |
| innodb_file_format_check | ON        |
| innodb_file_format_max   | Barracuda |
| innodb_file_per_table    | ON        |

# mysql -e "show variables where variable_name like 'innodb_large%'"
| Variable_name       | Value |
| innodb_large_prefix | ON    |

4. PHP requirements
- At least PHP 5.6 is needed, PHP 7.x is recommended
- The additional PHP package "php-gmp" (or "php7.0-gmp") needs to be installed

# apt-get install php7.0-gmp


[ Installation ]

If you're already using multiple Icingaweb2 modules, this might be very easy for you. But this was my first additional Icingaweb2 module, some installation points were not very clear to me, and I had to figure out the details myself (e.g. it took me a couple of minutes to figure out that the module belongs into /usr/share/icingaweb2/modules and not into /etc/icingaweb2/modules).

Clone the repository into /usr/share/icingaweb2/modules:

root@icingaweb2:~# cd /usr/share/icingaweb2/modules/
root@icingaweb2:/usr/share/icingaweb2/modules# git clone https://github.com/Icinga/icingaweb2-module-x509.git

Then rename the newly created directory and give it the correct permissions (optional):

root@icingaweb2:~# mv /usr/share/icingaweb2/modules/icingaweb2-module-x509 /usr/share/icingaweb2/modules/x509
root@icingaweb2:~# chown -R www-data:icingaweb2 /usr/share/icingaweb2/modules/x509/

Now the database needs to be created. The module will use its own database (and therefore database resource).

mysql> CREATE DATABASE x509;
Query OK, 1 row affected (0.00 sec)

Then import the schema which is part of the repository you just cloned before:

root@icingaweb2:~# mysql -u root x509 < /usr/share/icingaweb2/modules/x509/etc/schema/mysql.schema.sql

As mentioned before, the x509 module requires two other modules (ipl and reactbundle) as dependencies. They can be installed pretty quickly. Note that MODULE_VERSION is not predefined: set it to the release you want to install (check each module's releases page on GitHub) before running the commands:

root@icingaweb2:~# MODULE_VERSION="<version>" \
&& REPO="https://github.com/Icinga/icingaweb2-module-ipl" \
&& MODULES_PATH="/usr/share/icingaweb2/modules" \
&& mkdir -p "$MODULES_PATH" \
&& git clone ${REPO} "${MODULES_PATH}/ipl" --branch v${MODULE_VERSION}
root@icingaweb2:~# icingacli module enable ipl

root@icingaweb2:~# MODULE_VERSION="<version>" \
&& REPO="https://github.com/Icinga/icingaweb2-module-reactbundle" \
&& MODULES_PATH="/usr/share/icingaweb2/modules" \
&& mkdir -p "$MODULES_PATH" \
&& git clone ${REPO} "${MODULES_PATH}/reactbundle" --branch v${MODULE_VERSION}
root@icingaweb2:~# icingacli module enable reactbundle

The next steps are taken in the Icingaweb2 user interface. Log in to Icingaweb2 with a privileged user (Administrator role) and enable the x509 module in Configuration -> Modules -> x509. You should also see the other new modules (ipl and reactbundle) already enabled.

As the x509 database is already prepared and ready, we now have to configure it in Icingaweb2 as a resource: Configuration -> Resources -> Create a new resource

Icingaweb2 x509 create database resource

Now the x509 module needs to be configured to use this resource in the backend:

Configuration -> Modules -> x509 -> Backend

Select the newly created resource:

Icingaweb2 x509 resource configuration 

That covers the setup and activation of the x509 module.


[ Scan jobs ]

The next step is to create "jobs" which run in the background and scan the network (according to the configured input) for certificates.

But before the first scan job is created, the module's own trust store needs to be filled. If you have the "ca-certificates" package installed, you can import its bundle (as the Apache user):

www-data@icingaweb2:~$ icingacli x509 import --file /etc/ssl/certs/ca-certificates.crt
Processed 148 X.509 certificates.

Now create the first scan job: Configuration -> Modules -> x509 -> Jobs -> Create a new job

Icingaweb2 X509 Module Create Scan Job 

In this example a job named "Subnet253" is created. Every day at 18:00 (6 pm) the job scans the entire class C network range on port 443.
You can also launch the scan job manually on the CLI as the Apache user:

www-data@icingaweb2:~$ icingacli x509 scan --job Subnet253
openssl verify failed for command openssl verify -CAfile '/tmp/5c3349e6374c4/ca5c3349e637541' '/tmp/5c3349e6374c4/cert5c3349e644d4e' 2>&1: /tmp/5c3349e6374c4/cert5c3349e644d4e: OU = Domain Control Validated, OU = Gandi Standard Wildcard SSL, CN = *.example.com
error 20 at 0 depth lookup:unable to get local issuer certificate

Such a warning can show up if either the Root CA issuer is unknown (was not imported into the module's trust store) or the server is missing a certificate in the chain. This will also be shown in the list of found certificates in the UI, marked with a red icon:

Icingaweb2 X509 module invalid chain

What the module does in the background is basically a scan of the job's defined IP range. You can do the same manually and verify with the openssl command:

$ openssl s_client -connect <ip>:443
depth=0 O = Acme Co, CN = Kubernetes Ingress Controller Fake Certificate
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 O = Acme Co, CN = Kubernetes Ingress Controller Fake Certificate
verify error:num=21:unable to verify the first certificate
verify return:1
Certificate chain
 0 s:/O=Acme Co/CN=Kubernetes Ingress Controller Fake Certificate
   i:/O=Acme Co/CN=Kubernetes Ingress Controller Fake Certificate
Server certificate
subject=/O=Acme Co/CN=Kubernetes Ingress Controller Fake Certificate
issuer=/O=Acme Co/CN=Kubernetes Ingress Controller Fake Certificate
No client certificate CA names sent
SSL handshake has read 1553 bytes and written 421 bytes
New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-GCM-SHA384
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
    Protocol  : TLSv1.2
    Cipher    : ECDHE-RSA-AES256-GCM-SHA384
    Session-ID: AAAEE62F2E75A3FB796C398636E7DA4AA051E7A91D067690C9FCA68DA2BDF0A8
    Master-Key: 1AF1ECB19DC5E830E73BF601878B72F486A710A207315E73168A079D991E2C8D9F72148942028048CFA65B6E2F018E6B
    Key-Arg   : None
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    TLS session ticket lifetime hint: 600 (seconds)
    TLS session ticket:
    0000 - ea 09 28 0c ef 2b 21 a6-22 dd a3 79 2a 14 5e 05   ..(..+!."..y*.^.
    0010 - c1 96 e8 0c dc 11 95 f6-f9 1c aa 18 75 3b ca 61   ............u;.a
    0020 - 4f d8 5e b6 f7 ac 88 fb-f9 e1 6e df 0d 18 3b a2   O.^.......n...;.
    0030 - 4f 54 39 55 07 df e5 d7-81 c0 c5 23 f8 b4 6e 26   OT9U.......#..n&
    0040 - c8 9c 69 53 29 b5 f4 c8-b1 60 fc 63 54 c6 8a e7   ..iS)....`.cT...
    0050 - b7 0d 94 72 dd 26 ce 2a-e7 ed 3d 63 61 2d 77 f8   ...r.&.*..=ca-w.
    0060 - 52 be bc 3d 9e f2 d7 f6-01 c0 c2 ba e0 68 e6 9d   R..=.........h..
    0070 - 72 c4 46 af 07 a7 0e 7c-08 ba fd 67 5f 1e 00 e7   r.F....|...g_...
    0080 - f2 ba 5b c4 0f e2 1c 7e-fb 91 ef e2 b6 9e 12 91   ..[....~........
    0090 - 5b eb bb fd 04 53 4f 47-77 0a 97 f5 ef b7 de 0e   [....SOGw.......
    00a0 - f3 46 04 08 30 5e 08 8d-e6 43 2b 58 f6 da 74 da   .F..0^...C+X..t.

    Start Time: 1546866271
    Timeout   : 300 (sec)
    Verify return code: 21 (unable to verify the first certificate)

The x509 module of course found this information on that host as well. Because it is an internal Docker host managed by Kubernetes, it has an invalid (self-signed) certificate from Kubernetes installed. This is nicely shown in the UI, too:

Icingaweb2 x509 module chain invalid tls

You could now create a cron job which runs the scan task at the desired time. Or you could install a systemd service using the unit file in the repo (see /usr/share/icingaweb2/modules/x509/config/systemd/icinga-x509.service).
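Such a cron entry could look like the following sketch; the job name and icingacli path are taken from this setup, while the exact schedule and file location are up to you:

```
# /etc/cron.d/icinga-x509 -- run the "Subnet253" scan job nightly at 18:05
# as the web server user (www-data on Debian/Ubuntu)
5 18 * * *   www-data   /usr/bin/icingacli x509 scan --job Subnet253
```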


[ SNI certificates ]

So far so good. The module shows us which certificates are invalid or have a missing chain. But what about SNI certificates?
The module only scans the primary certificate of each IP/host on the configured port (443 in this case).

To use Server Name Indication certificates, the module needs to be told the hostnames (how could it know otherwise). This can be added in the configuration of the module:

Configuration -> Modules -> x509 -> SNI -> Create a new SNI map

So basically you specify for each IP address all the hostnames (comma-separated) which you have configured on that host:

Icingaweb2 x509 Module SNI Mapping

After this you need to run another scan and afterwards all your SNI certificates show up, too.


[ Remove certificates ]

But what if a certificate was removed or needs to be deleted?
Well, that is a little problem as of now. The module doesn't delete "old" certificates from the database, and it is unfortunately not possible to manually delete a discovered certificate either. Check out this GitHub issue where this problem might be tackled.

As a workaround you can delete the certificate directly from the database, in the table x509_target:

mysql> DELETE FROM x509_target WHERE hostname = "myhostname.example.com";

Where hostname is either the discovered hostname from a scan or the hostname defined in the SNI map.


[ Pros vs. Cons ]

The module is relatively new but already offers a lot of cool features and graphics. But certain tasks (especially SNI certificate monitoring) still require manual fiddling, more than if you'd build a smart apply rule in Icinga2.

+ TLS Chain verification
+ Certificate verification (self signed vs. validated)
+ Several sorting options
+ Nice overview (dashboard) of all certificates, including CAs
- SNI certificates need to be configured manually as comma-separated hostnames per IP (that's a lot of work if you run huge central reverse proxies)
- Certificates (even falsely discovered ones) cannot be deleted in the UI

Altogether a very handy module, even if there's still a lot of room for improvement (which software doesn't have?).

Icingaweb2 x509 Module Dashboard


Linux going AMD Ryzen with Debian 9 (Stretch)
Monday - Dec 31st 2018

Ever since AMD announced the new Ryzen processors, I was eager to test such a CPU myself and replace a weaker CPU in a local micro-server running Debian Stretch.
But a couple of public forum posts kept me from buying one at first.

I never got a real confirmation that Ryzen works as it should, but neither a confirmation that it doesn't work at all. So I decided to go for it and test it myself: I ordered an AMD Ryzen 1700 (yes, that's the first generation Ryzen) and a new AM4 motherboard (ASRock AB350 GAMING-ITX).

While waiting for the delivery, I already prepared Debian Stretch to run with a newer Kernel. It seems that full Ryzen support arrived with Kernel 4.11, with further improvements and fixes in later releases. But Debian Stretch by default ships with Kernel 4.9. Luckily Kernel 4.18 is available in stretch-backports. So let's enable backports!

root@irbwsrvp01 ~ # echo "deb http://ftp.ch.debian.org/debian stretch-backports main non-free contrib" > /etc/apt/sources.list.d/backports.list
root@irbwsrvp01 ~ # apt-get update

To install the current Linux Kernel from backports:

root@irbwsrvp01 ~ # apt-get install -t stretch-backports linux-image-amd64
Reading package lists... Done
Building dependency tree      
Reading state information... Done
The following additional packages will be installed:
Suggested packages:
  linux-doc-4.18 debian-kernel-handbook
Recommended packages:
  firmware-linux-free irqbalance apparmor
The following NEW packages will be installed:
The following packages will be upgraded:
1 upgraded, 1 newly installed, 0 to remove and 83 not upgraded.
Need to get 45.4 MB of archives.
After this operation, 257 MB of additional disk space will be used.
Do you want to continue? [Y/n]

Just to make sure I don't run into missing firmware dependencies, I decided to install these packages from backports, too:

root@irbwsrvp01 ~ # apt-get install -t stretch-backports firmware-amd-graphics firmware-linux-nonfree firmware-misc-nonfree firmware-realtek

Followed by a reboot:

root@irbwsrvp01 ~ # reboot

The micro-server came up again and booted the new Kernel 4.18:

root@irbwsrvp01 ~ # uname -a
Linux irbwsrvp01 4.18.0-0.bpo.1-amd64 #1 SMP Debian 4.18.6-1~bpo9+1 (2018-09-13) x86_64 GNU/Linux

Yay! Ready for Ryzen!

Two days later I got my packages: The AMD Ryzen 1700 CPU, AM4 ASRock Motherboard, 2 x 8GB DDR4 RAM

AM4 Motherboard ASRock AB350 Mini-ITX with AMD Ryzen 1700

Once I had built the new motherboard with the Ryzen CPU into the micro-server, it was the moment of truth: would the server boot?

Side note: It did, but there was no video output! What did I do wrong? I tried both HDMI outputs of the motherboard, but there was no video output at all, not even a POST screen. It turns out the ASRock AB350 Gaming-ITX motherboard has plenty of video output connectors, but no embedded GPU, and Ryzen itself does not have an embedded GPU either. On the old motherboard I used an AMD A10 processor, which is very weak (compared to Ryzen) but has an embedded GPU. This meant I had to attach a GPU/graphics card to the PCIe slot. Luckily I still had one.

The processor shows up in ASRock's BIOS, so far so good:

ASRock AB350 Bios Ryzen 1700

What about the OS? Will it boot?

And yes. It did!

$ cat /proc/cpuinfo  | head -n 28
processor    : 0
vendor_id    : AuthenticAMD
cpu family    : 23
model        : 1
model name    : AMD Ryzen 7 1700 Eight-Core Processor
stepping    : 1
microcode    : 0x8001136
cpu MHz        : 1481.761
cache size    : 512 KB
physical id    : 0
siblings    : 16
core id        : 0
cpu cores    : 8
apicid        : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
bugs        : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips    : 5988.04
TLB size    : 2560 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate eff_freq_ro [13] [14]

And multi-threading is working correctly:

$ cat /proc/cpuinfo |grep "core id"
core id        : 0
core id        : 0
core id        : 1
core id        : 1
core id        : 2
core id        : 2
core id        : 3
core id        : 3
core id        : 4
core id        : 4
core id        : 5
core id        : 5
core id        : 6
core id        : 6
core id        : 7
core id        : 7
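Counting core ids by hand gets old quickly; the same check can be scripted (a generic sketch reading /proc/cpuinfo, nothing Ryzen-specific):

```shell
#!/bin/sh
# Count logical CPUs and distinct core ids from /proc/cpuinfo.
# On this Ryzen 1700 with SMT enabled, that is 16 logical CPUs
# spread over 8 physical cores (2 threads per core).
logical=$(grep -c '^processor' /proc/cpuinfo)
cores=$(awk -F': ' '/^core id/ {print $2}' /proc/cpuinfo | sort -un | wc -l)
echo "logical CPUs: $logical, distinct core ids: $cores"
```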

The view in htop:

AMD Ryzen Linux htop

So far this micro-server has been running for 16 days straight without any issues.

$ uptime
 14:23:53 up 16 days,  2:20,  2 users,  load average: 2.25, 3.06, 3.05

TL;DR: AMD Ryzen (first generation) works very well and fast with Debian Stretch and a backported Linux Kernel (4.18)!

And here's "Team Red" at work:

Debian Stretch running with AMD Ryzen 1700

Update January 9th 2019:
I saw a couple of forum posts from users complaining that they get no sensor data from the Ryzen CPU. I cannot confirm that. At least with the backported 4.18 Kernel on Debian 9 and the "lm-sensors" package installed, I'm able to read the temperature from the CPU:

# sensors
k10temp-pci-00c3
Adapter: PCI adapter
Tdie:         +38.9°C  (high = +70.0°C)
Tctl:         +38.9°C 

nouveau-pci-2600
Adapter: PCI adapter
GPU core:     +1.05 V  (min =  +0.80 V, max =  +1.19 V)
temp1:        +38.0°C  (high = +95.0°C, hyst =  +3.0°C)
                       (crit = +105.0°C, hyst =  +5.0°C)
                       (emerg = +135.0°C, hyst =  +5.0°C)

Important to know here: "k10temp-pci-00c3" is the Ryzen 1700 CPU. Below that, "nouveau-pci-2600" is an Nvidia GPU (GeForce GT 730).


Monitoring plugin check_smart 5.11 available, introducing exclude list
Friday - Dec 28th 2018

The monitoring plugin check_smart, to monitor hard drives' and solid state drives' SMART attributes, is out with a new version.

Version 5.11 introduces a new parameter "-e" or "--exclude" which stands for exclude list (aka ignore list).

The exclude list is a comma-separated list of strings. It basically tells the plugin which SMART attributes to ignore, even if they are in a failing or failed state.

Let's take a "temperature failed in the past" error as an example.

194 Temperature_Celsius     0x0002   113   113   000    Old_age   Always  In_the_past  53 (Lifetime Min/Max 25/62)

Without the exclude list, the plugin will return a WARNING when the temperature SMART attribute once failed in the past:

# ./check_smart.pl -d /dev/sda -i sat
WARNING: Attribute Temperature_Celsius failed at In_the_past|Raw_Read_Error_Rate=0 Throughput_Performance=67 Spin_Up_Time=0 Start_Stop_Count=3 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Seek_Time_Performance=34 Power_On_Hours=10617 Spin_Retry_Count=0 Power_Cycle_Count=3 Power-Off_Retract_Count=3 Load_Cycle_Count=3 Temperature_Celsius=53 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0

It's nice to know that it once failed in the past. But once we know that, we get over it and want the warning to disappear. With the exclude list, the plugin can be told to ignore the attribute "Temperature_Celsius":

# ./check_smart.pl -d /dev/sda -i sat -e Temperature_Celsius
OK: no SMART errors detected. |Raw_Read_Error_Rate=0 Throughput_Performance=67 Spin_Up_Time=0 Start_Stop_Count=3 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Seek_Time_Performance=34 Power_On_Hours=10617 Spin_Retry_Count=0 Power_Cycle_Count=3 Power-Off_Retract_Count=3 Load_Cycle_Count=3 Temperature_Celsius=53 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0

And hurray, no alert anymore. 

But this could also be a bit dangerous. What if the drive has a new (live!) temperature alert? You'd certainly want to know about it. That's why, besides excluding a SMART attribute, it is also possible to exclude certain values of the "When_Failed" column. In the following example, the "When_Failed" value "In_the_past" (as seen above) is used in the exclude list:

# ./check_smart.pl -d /dev/sda -i sat -e "In_the_past"
OK: no SMART errors detected. |Raw_Read_Error_Rate=0 Throughput_Performance=67 Spin_Up_Time=0 Start_Stop_Count=3 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Seek_Time_Performance=34 Power_On_Hours=10617 Spin_Retry_Count=0 Power_Cycle_Count=3 Power-Off_Retract_Count=3 Load_Cycle_Count=3 Temperature_Celsius=53 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0

As you can see, the plugin no longer alerts on "Temperature_Celsius" because it detected the "In_the_past" value in the "When_Failed" column and successfully ignored it.
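The matching idea behind the exclude list can be pictured in a few lines of shell. This is a hypothetical re-implementation for illustration only (the actual plugin is written in Perl): an attribute is skipped if either its name or its "When_Failed" value appears in the comma-separated list.

```shell
#!/bin/sh
# Illustration only (check_smart itself is Perl): an attribute is
# ignored when its name OR its When_Failed value is on the exclude list.
exclude="In_the_past,Current_Pending_Sector"
attribute="Temperature_Celsius"
when_failed="In_the_past"

ignored=no
IFS=','
for item in $exclude; do
    if [ "$item" = "$attribute" ] || [ "$item" = "$when_failed" ]; then
        ignored=yes
    fi
done
echo "ignored=$ignored"
```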

To ignore multiple attributes or values, simply separate them with a comma:

# ./check_smart.pl -d /dev/sda -i sat -e "In_the_past","Current_Pending_Sector"
OK: no SMART errors detected. |Raw_Read_Error_Rate=0 Throughput_Performance=67 Spin_Up_Time=0 Start_Stop_Count=3 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Seek_Time_Performance=34 Power_On_Hours=10617 Spin_Retry_Count=0 Power_Cycle_Count=3 Power-Off_Retract_Count=3 Load_Cycle_Count=3 Temperature_Celsius=53 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0

But make sure you don't cut yourself with this. The main reason the exclude list was created in the first place is clearly the temperature attribute.


Reduce the number of shards of an Elasticsearch index
Thursday - Dec 27th 2018

When you run a lot of indexes, this can add up to quite a large number of shards in your ELK stack. As the documentation states, the default creates 5 shards per index:

    The number of primary shards that an index should have. Defaults to 5. This setting can only be set at index creation time. It cannot be changed on a closed index. Note: the number of shards are limited to 1024 per index. This limitation is a safety limit to prevent accidental creation of indices that can destabilize a cluster due to resource allocation. The limit can be modified by specifying export ES_JAVA_OPTS="-Des.index.max_number_of_shards=128" system property on every node that is part of the cluster.

Another interesting default setting is the number of replicas of each shard:

    The number of replicas each primary shard has. Defaults to 1.

Once a new index has been created, its number of shards is fixed. The number of shards can only be defined when creating a new index, either during the creation of the index itself or by defining the settings in the template.

Note: Yes, it is possible to change the number of shards on already created indexes, but that means you must re-index that index again, possibly causing downtime.

In my setup, a classical ELK stack, a couple of indexes (logstash, filebeat, haproxy, ...) are created every day, typically with the date in the index name (logstash-2018.12.27). By adjusting the shard settings in the templates "logstash" and "filebeat", the indexes created from tomorrow on will have a reduced number of shards.

First let's take a backup:

# elk=localhost:9200
# curl $elk/_template/logstash?pretty -u elastic -p > /root/template-logstash.backup

Now create a new file, e.g. /tmp/logstash, based on the backup file. Add "number_of_shards" and "number_of_replicas" into the settings key.

Also make sure that you remove the "logstash" main key itself, so the file looks like this:

# cat /tmp/logstash
{
    "order" : 0,
    "version" : 60001,
    "index_patterns" : [
      "logstash-*"
    ],
    "settings" : {
      "number_of_shards" : 2,
      "number_of_replicas" : 1,
      "index" : {
        "refresh_interval" : "5s"
      }
    },
    "mappings" : {
      "properties" : {
        "@timestamp" : {
          "type" : "date"
        },
        "@version" : {
          "type" : "keyword"
        },
        "geoip" : {
          "dynamic" : true,
          "properties" : {
            "ip" : {
              "type" : "ip"
            },
            "location" : {
              "type" : "geo_point"
            },
            "latitude" : {
              "type" : "half_float"
            },
            "longitude" : {
              "type" : "half_float"
            }
          }
        }
      }
    },
    "aliases" : { }
}

And now this file can be "PUT" into Elasticsearch templates:

# curl -H "Content-Type: application/json" -XPUT $elk/_template/logstash -d "@/tmp/logstash" -u elastic -p
Enter host password for user 'elastic':

When checking the template again, our adjusted shard settings now show up:

# curl $elk/_template/logstash?pretty -u elastic -p
{
  "logstash" : {
    "order" : 0,
    "version" : 60001,
    "index_patterns" : [
      "logstash-*"
    ],
    "settings" : {
      "index" : {
        "number_of_shards" : "2",
        "number_of_replicas" : "1",
        "refresh_interval" : "5s"
      }
    },
    "mappings" : {
      [...]
    }
  }
}


Monitoring plugin check_netio 1.3 released
Friday - Dec 21st 2018 - by - (0 comments)

The monitoring plugin check_netio, which monitors input/output on Linux network interfaces, has been around for over a decade (since 2007). And I've been using it on thousands of servers (counting all the data centers of my former employers together) over the past years.

For me it was always a very lightweight yet incredibly easy way to monitor a network interface's performance. The plugin's code didn't change a lot.

But in 2017 I actually needed to adapt the source code because some Linux distributions changed the output of the command "ifconfig". Suddenly the plugin didn't work anymore on these distributions (it started with RHEL/CentOS 7 by the way).

Now in 2018 it seems that some distributions don't ship "ifconfig" by default anymore (seen in Ubuntu 18.04) because "ip" has been around for many years, too.

Version 1.2
Instead of relying on "ifconfig" and running into possible parsing errors due to different formatting in various distros, I changed the plugin to directly parse /proc/net/dev. This procfs file should be the same on all distributions. And it has another nice side effect: The plugin is now 50% faster, too!
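These counters can be inspected by hand, too. A minimal sketch of pulling the RX/TX byte counters for one interface out of /proc/net/dev (the interface name is just an example; this is not the plugin's exact code):

```shell
# Interface to inspect ("lo" is just an example; use e.g. eth0)
iface="lo"

# /proc/net/dev lines look like "  eth0: 12345 ...", and the interface name
# can be glued to the first counter, so turn the colon into a space first.
# After that, field 2 = RX bytes and field 10 = TX bytes.
sed 's/:/ /' /proc/net/dev | awk -v ifc="$iface" '$1 == ifc { print "RX bytes:", $2, "TX bytes:", $10 }'
```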

If for some reason the plugin doesn't work (which would mean there's no read access to /proc/net/dev for some reason), there's the new parameter "-l" to use the "legacy" mode. The legacy mode continues to use "ifconfig" in the background.

Version 1.3
Removed the plugin's dependency on /usr/lib/nagios/plugins/utils.sh, which is part of the nagios-plugins-common or monitoring-plugins-common package. check_netio actually only used one small variable from utils.sh, so a full dependency on it just doesn't make sense. And yes, this also means a small increase in execution speed.

And because there have been quite a few adjustments by now, I finally created a documentation page: https://www.claudiokuenzler.com/monitoring-plugins/check_netio.php.



Regex != regex in sed (or: replacing digits in sed)
Friday - Dec 14th 2018 - by - (0 comments)

This is supposed to be a quick reminder to myself, the next time I run into such a problem: regular expressions are not exactly the same in sed!

In my previous article "How to manually clean up Zoneminder events" I wrote a shell script in which I wanted to remove a certain part of a path:

/var/cache/zoneminder/events/5/18/12/14/.448512/06/45/12

should become:

/var/cache/zoneminder/events/5/18/12/14/06/45/12
Simple, right? Just use sed's replace and remove ".448512/" from the string.

But see for yourself:

$ echo "/var/cache/zoneminder/events/5/18/12/14/.448512/06/45/12" | sed "s/\.\d+\///g"
/var/cache/zoneminder/events/5/18/12/14/.448512/06/45/12

The old path is still shown. Nothing was replaced. My first thought was of course that I'd made a mistake in my regular expression, but all the online regex checkers confirmed my regex was correct. For example on https://regexr.com/:

[Screenshot: regex matching the dot and digits on regexr.com]

I was able to narrow it down to the regular expression for the number (\d+), because simply replacing the dot character alone works:

$ echo "/var/cache/zoneminder/events/5/18/12/14/.448512/06/45/12" | sed "s/\.//g"
/var/cache/zoneminder/events/5/18/12/14/448512/06/45/12

And then I received the final hint from a friend: some typical regex shorthands don't work in sed! Excerpt from sed's documentation:

*    Matches a sequence of zero or more instances of matches for the preceding regular expression, which must be an ordinary character, a special character preceded by \, a ., a grouped regexp (see below), or a bracket expression. As a GNU extension, a postfixed regular expression can also be followed by *; for example, a** is equivalent to a*. POSIX 1003.1-2001 says that * stands for itself when it appears at the start of a regular expression or subexpression, but many non-GNU implementations do not support this and portable scripts should instead use \* in these contexts.

\+   As *, but matches one or more. It is a GNU extension. 


‘[a-zA-Z0-9]’  In the C locale, this matches any ASCII letters or digits.

So first of all, the plus sign (+) must be escaped. And second, to match a digit, \d doesn't work; the bracket expression [0-9] must be used instead!

With these adjustments, sed now finally performs the replacement:

$ echo "/var/cache/zoneminder/events/5/18/12/14/.448512/06/45/12" | sed "s/\.[0-9]\+\///g"
/var/cache/zoneminder/events/5/18/12/14/06/45/12
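As an aside, GNU sed also supports extended regular expressions via the -E (or -r) option. With ERE the plus sign needs no backslash, although \d still doesn't work there either:

```shell
# Extended regular expressions (-E): the + quantifier is unescaped
echo "/var/cache/zoneminder/events/5/18/12/14/.448512/06/45/12" | sed -E "s/\.[0-9]+\///g"
# prints: /var/cache/zoneminder/events/5/18/12/14/06/45/12
```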

Dang it, I am sure I've run into this at least once before in my Linux career. Hence this post, so I don't lose as much time the next time it happens.


How to manually clean up Zoneminder events
Friday - Dec 14th 2018 - by - (0 comments)

Zoneminder is a great tool to build a surveillance system, combining all kinds of IP cameras in one dashboard and using it to manage recordings.

But sometimes Zoneminder can be a bit of a pain, especially when the disk is getting filled. With the high resolutions of today's IP cameras this can happen pretty quickly, although Zoneminder has an internal "filter" to automatically purge old events when disk usage hits a certain threshold.

[Screenshot: Zoneminder purge filter]

However, this filter only works if there's actually still some space left. The filter finds the oldest N events in the database and deletes them from the database and from the filesystem. But when there's no disk space available at all, the database is likely to be frozen/unavailable. Ergo: no cleanup anymore. And in this situation you're stuck with a non-working Zoneminder.

This has already happened to me twice on my Zoneminder installation (side note: I have to admit my current disk size of 120GB dedicated to Zoneminder is rather small), so I built a cleanup script, and that is what this post is about.

Step one: We don't want to delete the archived events!

When you archive footage, this usually means you want to keep it. The cleanup script needs to respect that, but it needs to know about the archived events first. This information can be retrieved from the database:

root@zoneminder:~# mysql -N -b -r -e "select Id from zm.Events where Archived = 1;"
| 135933 |
| 136590 |
| 154831 |
| 160832 |
| 162647 |
| 162649 |
| 167562 |

Step two: Find all events except the archived ones

We can use the find command and the IDs from step one to find all events except the archived ones:

root@zoneminder:~# find /var/cache/zoneminder/events/ -mindepth 2 -type l ! -name ".135933" ! -name ".136590" ! -name ".154831" ! -name ".160832" ! -name ".162647" ! -name ".162649" ! -name ".167562" -exec ls -la {} + > /tmp/xxx

find now looks in the path /var/cache/zoneminder/events/ for symlinks (-type l), except for the given names (excluded with ! -name). The full path and other information are saved in /tmp/xxx.
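With many archived events, typing the exclusion list by hand gets tedious. A small sketch of how the "! -name" arguments could be generated (the ids are hard-coded here for illustration; in practice they would come from the mysql query in step one):

```shell
# Archived event ids, hard-coded for illustration; in practice this would be
# the output of: mysql -N -b -r -e "select Id from zm.Events where Archived = 1;"
ids="135933 136590 154831"

# Build one "! -name .<id>" exclusion per archived event
excludes=""
for id in $ids; do
    excludes="$excludes ! -name .$id"
done

# Print the resulting find command (remove the echo to actually run it)
echo find /var/cache/zoneminder/events/ -mindepth 2 -type l $excludes
```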

The output file /tmp/xxx will now look something like this:

root@zoneminder:~# tail /tmp/xxx
lrwxrwxrwx 1 www-data www-data 8 Dec 14 06:45 /var/cache/zoneminder/events/5/18/12/14/.448512 -> 06/45/12
lrwxrwxrwx 1 www-data www-data 8 Dec 14 06:51 /var/cache/zoneminder/events/5/18/12/14/.448517 -> 06/51/29
lrwxrwxrwx 1 www-data www-data 8 Dec 14 06:51 /var/cache/zoneminder/events/5/18/12/14/.448518 -> 06/51/34
lrwxrwxrwx 1 www-data www-data 8 Dec 14 07:02 /var/cache/zoneminder/events/5/18/12/14/.448533 -> 07/02/28
lrwxrwxrwx 1 www-data www-data 8 Dec 14 07:44 /var/cache/zoneminder/events/5/18/12/14/.448546 -> 07/44/56
lrwxrwxrwx 1 www-data www-data 8 Dec 14 07:47 /var/cache/zoneminder/events/5/18/12/14/.448548 -> 07/47/09
lrwxrwxrwx 1 www-data www-data 8 Dec 14 08:22 /var/cache/zoneminder/events/5/18/12/14/.448551 -> 08/22/17
lrwxrwxrwx 1 www-data www-data 8 Dec 14 08:26 /var/cache/zoneminder/events/5/18/12/14/.448552 -> 08/26/13
lrwxrwxrwx 1 www-data www-data 8 Dec 14 08:27 /var/cache/zoneminder/events/5/18/12/14/.448555 -> 08/27/30
lrwxrwxrwx 1 www-data www-data 8 Dec 14 08:28 /var/cache/zoneminder/events/5/18/12/14/.448557 -> 08/28/19

Step three: Get the event id and real path

Each path in /tmp/xxx contains two important pieces of information: the event id and the real path.

/var/cache/zoneminder/events/5/18/12/14/.448512 -> 06/45/12

In this case .448512 is the symlink of the event pointing to the subfolders 06/45/12.
The name of the symlink also contains the event id (448512).
By removing the symlink name and appending the subfolders to the path, we get the real path where the footage is stored:

/var/cache/zoneminder/events/5/18/12/14/06/45/12

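Extracting both pieces of information can be done with plain shell parameter expansion; a small sketch using the example line from above:

```shell
# One entry from /tmp/xxx, reduced to "symlink -> target"
line="/var/cache/zoneminder/events/5/18/12/14/.448512 -> 06/45/12"

link="${line% -> *}"                       # full path of the symlink
target="${line#* -> }"                     # subfolders the symlink points to
eventid=$(basename "$link" | tr -d '.')    # event id without the leading dot
realpath="$(dirname "$link")/$target"      # real path where the footage is stored

echo "$eventid $realpath"
# prints: 448512 /var/cache/zoneminder/events/5/18/12/14/06/45/12
```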
Step four: Delete the footage and the info in the database

Now that the real path is known, it can be deleted:

root@zoneminder:~# rm -rf /var/cache/zoneminder/events/5/18/12/14/06/45/12

We should also delete the symlink:

root@zoneminder:~# rm -f /var/cache/zoneminder/events/5/18/12/14/.448512

And now that there's some disk space available again, the MySQL database should accept writes again, so we can delete this event (448512) from the relevant tables:

mysql> DELETE FROM zm.Events where Id = 448512;
mysql> DELETE FROM zm.Frames where EventId = 448512;
mysql> DELETE FROM zm.Stats where EventId = 448512;

Step five: Automate it with a script

As I mentioned at the beginning, I wrote a script to automate these tasks. It's called zoneminder-event-cleanup.sh and you can download it here:

Using the script is very simple.

1) Download it

# wget https://www.claudiokuenzler.com/downloads/scripts/zoneminder-event-cleanup.sh

2) Give execute permissions:

# chmod 755 zoneminder-event-cleanup.sh

3) Open the script with an editor and adjust the user variables:

# User variables
olderthan=2 # Defines the minimum age in days of the events to be deleted
zmcache=/var/cache/zoneminder/events # Defines the path where zm stores events
mysqlhost=localhost # Defines the MySQL host for the zm database
mysqldb=zm # Defines the MySQL database name used by zm
mysqluser=zmuser # Defines a MySQL user to connect to the database
mysqlpass=secret # Defines the password for the MySQL user

4) Run the script

root@zoneminder:~# ./zoneminder-event-cleanup.sh
Deleting 447900
Deleting 447901
Deleting 447902
Deleting 447903
Deleting 447904
Deleting 447905
Deleting 447906
Deleting 447907
Deleting 447908
Deleting 447909
Deleting 447911
Deleting 447912
Deleting 447913
Deleting 447914
Deleting 447915

The script is also available on Github: https://github.com/Napsty/scripts/blob/master/zoneminder/zoneminder-event-cleanup.sh

