
How to play an audio file on the command line or as a cron job in Linux
Tuesday - Jan 22nd 2019 - by - (0 comments)

Back in October 2016 I already wrote about how a multimedia file can be played with VLC as a cron job (Play a multimedia file in VLC as cron job).

The idea here is still the same as in that article from October 2016: a cron job should play the "It's coffee time" audio file. But launching the full VLC player just to play an audio file is kind of overkill.

Let's first create an audio file from the Youtube video using "youtube-dl":

$ youtube-dl -x https://www.youtube.com/watch?v=6SRXUufvZUE

The -x parameter extracts the audio from the video file, leaving you with just the sound of the video: COFFEE-TIME-6SRXUufvZUE.m4a.

Now this file can be played using ffplay, which is a command from the package "ffmpeg":

$ /usr/bin/ffplay -nodisp -autoexit /home/myuser/Music/COFFEE-TIME-6SRXUufvZUE.m4a

Important parameters here:

-nodisp: avoids opening a graphical window to play the audio (not needed for a cron job running in the background)
-autoexit: automatically exits ffplay once the file has finished playing, otherwise the command keeps running

With these parameters we can now schedule the cron job:

00 09 * * 1-5 /usr/bin/ffplay -nodisp -autoexit /home/myuser/Music/COFFEE-TIME-6SRXUufvZUE.m4a
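
Note: Depending on your audio setup, the cron environment may lack the variables needed to reach the sound server (e.g. PulseAudio). If the job stays silent, setting XDG_RUNTIME_DIR in the crontab of the user owning the audio session is worth a try (the UID 1000 below is an assumption):

# Environment assignment at the top of the crontab; path assumes the desktop user has UID 1000
XDG_RUNTIME_DIR=/run/user/1000
00 09 * * 1-5 /usr/bin/ffplay -nodisp -autoexit /home/myuser/Music/COFFEE-TIME-6SRXUufvZUE.m4a > /dev/null 2>&1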

Definitely a much more lightweight and elegant solution than the previous one using VLC. 

 

Investigating high load on Icinga2 monitoring (caused by browser accessing Nagvis)
Monday - Jan 21st 2019 - by - (0 comments)

Since this weekend we experienced a very high load on the Icinga 2 monitoring server, running Icinga 2 version 2.6:

Icinga2 high load 

Restarts of Icinga2 didn't help. It even got worse: Icinga2 became so slow that we experienced outages between the master and satellite servers, and the user interface (classicui in this case) showed outdated status data:

Icinga 2 status data outdated

In the application log (/var/log/icinga2/icinga2.log) I came across a lot of errors I hadn't seen before:

[2019-01-21 14:06:59 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-21 14:06:59 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-21 14:06:59 +0100] critical/LivestatusQuery: Cannot write query response to socket.
[2019-01-21 14:06:59 +0100] critical/ThreadPool: Exception thrown in event handler:
Error: Tried to read from closed socket.

    (0) libbase.so.2.6.1: (+0xc9148) [0x2b503d005148]
    (1) libbase.so.2.6.1: (+0xc91f9) [0x2b503d0051f9]
    (2) libbase.so.2.6.1: icinga::NetworkStream::Read(void*, unsigned long, bool) (+0x7e) [0x2b503cfa343e]
    (3) libbase.so.2.6.1: icinga::StreamReadContext::FillFromStream(boost::intrusive_ptr const&, bool) (+0x7f) [0x2b503cfab40f]
    (4) libbase.so.2.6.1: icinga::Stream::ReadLine(icinga::String*, icinga::StreamReadContext&, bool) (+0x5c) [0x2b503cfb3bbc]
    (5) liblivestatus.so.2.6.1: icinga::LivestatusListener::ClientHandler(boost::intrusive_ptr const&) (+0x103) [0x2b504c32da93]
    (6) libbase.so.2.6.1: icinga::ThreadPool::WorkerThread::ThreadProc(icinga::ThreadPool::Queue&) (+0x328) [0x2b503cfe9f78]
    (7) libboost_thread.so.1.54.0: (+0xba4a) [0x2b503c6a1a4a]
    (8) libpthread.so.0: (+0x8184) [0x2b503cd26184]
    (9) libc.so.6: clone (+0x6d) [0x2b503de51bed]

These errors started on January 20th at 02:58:

root@icingahost:/ # zgrep critical icinga2.log-20190120.gz |more
[2019-01-20 02:47:53 +0100] critical/ApiListener: Client TLS handshake failed (from [satellite]:55543)
[2019-01-20 02:52:53 +0100] critical/ApiListener: Client TLS handshake failed (from [satellite]:57401)
[2019-01-20 02:57:53 +0100] critical/ApiListener: Client TLS handshake failed (from [satellite]:59271)
[2019-01-20 02:58:50 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-20 02:58:50 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-20 02:58:50 +0100] critical/LivestatusQuery: Cannot write query response to socket.
[2019-01-20 02:58:50 +0100] critical/ThreadPool: Exception thrown in event handler:
[2019-01-20 02:59:05 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-20 02:59:05 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-20 02:59:05 +0100] critical/LivestatusQuery: Cannot write query response to socket.
[2019-01-20 02:59:05 +0100] critical/ThreadPool: Exception thrown in event handler:
[2019-01-20 02:59:08 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-20 02:59:08 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-20 02:59:08 +0100] critical/LivestatusQuery: Cannot write query response to socket.
[2019-01-20 02:59:08 +0100] critical/ThreadPool: Exception thrown in event handler:
[2019-01-20 03:02:10 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-20 03:02:10 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-20 03:02:10 +0100] critical/LivestatusQuery: Cannot write query response to socket.
[2019-01-20 03:02:10 +0100] critical/ThreadPool: Exception thrown in event handler:
[2019-01-20 03:02:44 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-20 03:02:44 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-20 03:02:44 +0100] critical/LivestatusQuery: Cannot write query response to socket.
[2019-01-20 03:02:44 +0100] critical/ThreadPool: Exception thrown in event handler:
[2019-01-20 03:02:53 +0100] critical/ApiListener: Client TLS handshake failed (from [satellite]:32903)
[2019-01-20 03:03:15 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-20 03:03:15 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-20 03:03:15 +0100] critical/LivestatusQuery: Cannot write query response to socket.
[2019-01-20 03:03:15 +0100] critical/ThreadPool: Exception thrown in event handler:
[2019-01-20 03:03:21 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-20 03:03:21 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-20 03:03:21 +0100] critical/LivestatusQuery: Cannot write query response to socket.

So this time correlates with the moment in the graph when the load started to increase!

But what is causing this? According to the errors in icinga2.log it must have something to do with Livestatus, which in this setup listens on a local socket and is only accessed by a Nagvis installation.

While searching for this error message, I came across an issue on GitHub which didn't really solve my load problem (there it was Thruk causing the errors), but the comments from dnsmichi pointed me in the right direction:

"If your client application does not close the socket, or wait for processing the response, such errors occur."

As in this case it can only be Nagvis accessing Livestatus, I checked the Apache access logs for Nagvis and narrowed it down to four internal IP addresses constantly accessing Nagvis. After identifying these hosts and the responsible teams, one browser after another was closed until only one machine was left accessing Nagvis. And it turned out to be this single machine causing the issues. After a reboot of this particular machine the load of our Icinga2 server immediately went back to normal and no more errors appeared in the logs.
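To give an idea of how the narrowing down looked: something like the following quickly reveals the top clients hammering Nagvis (the access log path is an example, adjust it to your vhost configuration):

# awk '{print $1}' /var/log/apache2/nagvis-access.log | sort | uniq -c | sort -rn | head -5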

TL;DR: It's not always the application on the server to blame. Clients/browsers can be the source of a problem, too.

 

Application (Docker/Kubernetes) containers and STDOUT logging
Tuesday - Jan 15th 2019 - by - (0 comments)

In our Docker container environment (on premise, using Rancher) I have configured the Docker daemon to forward STDOUT logs from the containers to a central Logstash using GELF.
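For reference, such a configuration roughly looks like this in /etc/docker/daemon.json (the Logstash address is an example):

{
  "log-driver": "gelf",
  "log-opts": {
    "gelf-address": "udp://logstash.example.com:12201"
  }
}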

For applications logging by default to STDOUT this works out of the box. But for some hand-written applications this might require some additional work.

In this particular example the application simply logged into a local log file on the AUFS filesystem (/tmp/application.log). But all these log messages of course never arrived in the ELK stack, because they were not logged to STDOUT but written into a file.

The developer then adjusted the Dockerfile and instead of creating the log file, created a symlink:

# forward logs to docker log collector
RUN ln -sf /dev/stdout /tmp/application.log

To be honest, I thought this would do the trick. But once the new container image was deployed, the application logs didn't arrive in our ELK stack. Why?

I went into the container and tested myself:

root@af8e2147f8ba:/app# cd /tmp/

root@af8e2147f8ba:/tmp# ls -la
total 12
drwxrwxrwt  3 root root 4096 Jan 15 12:55 .
drwxr-xr-x 54 root root 4096 Jan 15 12:57 ..
lrwxrwxrwx  1 root root   11 Jan 15 12:52 application.log -> /dev/stdout
drwxr-xr-x  3 root root 4096 Jan 15 12:52 npm-6-d456bc8a

Yes, there is the application log file, which is a symlink to /dev/stdout. Should work, right? Let's try this:

root@af8e2147f8ba:/tmp# echo "test test test" > application.log
test test test

Although I saw "test test test" appearing in the terminal, this message never made it into the ELK stack. While researching why, I came across a VERY GOOD explanation by user "phemmer" on this GitHub issue:

"The reason this doesn't work is because /dev/stdout is a link to STDOUT of the process accessing it. So by doing foo > /dev/stdout, you're saying "redirect my STDOUT to my STDOUT". Kinda doesn't do anything :-).
And since /var/log/test.log is a symlink to it, the same thing applies. What you want is to redirect output to STDOUT of PID 1. PID 1 is the process launched by docker, and its STDOUT will be what docker picks up."

So to sum this up, we need to use the STDOUT of PID 1 (the container itself), otherwise the message won't be picked up by the Docker daemon.

Let's try this inside the still running container:

root@af8e2147f8ba:/tmp# rm application.log
root@af8e2147f8ba:/tmp# ln -sf /proc/1/fd/1 /tmp/application.log
root@af8e2147f8ba:/tmp# echo 1 2 3 > application.log

And hey, my 1 2 3 appeared in Kibana!

Docker container logs STDOUT logging

I slightly modified the Dockerfile with that new knowledge:

# forward logs to docker log collector
RUN ln -sf /proc/1/fd/1 /tmp/application.log

Note: /proc/1 obviously refers to PID 1. fd/1 is STDOUT, as you might know from file descriptor redirection in typical cron jobs, e.g. */5 * * * * myscript.sh > /dev/null 2>&1. fd/2 would be STDERR, by the way.

After the new container image was built, deployed and started, the ELK stack now receives the application logs:

Container logs appearing in ELK stack

 

Outages in the Cloud. Whom to blame and how to prove it?
Friday - Jan 11th 2019 - by - (0 comments)

Enterprises have started to use "the cloud" more and more often in the past few years. Applications sometimes run completely in the cloud, sometimes as a hybrid construct. In a hybrid environment, some "parts" of the application architecture (for example a database) run in the cloud, while others (for example an API server) run on premise. Sometimes it's the other way around; the combinations are endless.

Using a hybrid architecture has several advantages. If you build your application and its architecture correctly, you can run the full stack in either location (on premise or in the cloud). This means you now have a disaster recovery environment for your application. That's pretty neat.

But this also leads to additional problems: Network interruptions and latency between on-premise and the cloud may cause at least delays, at worst outages, in the application. The Internet connection between these two points becomes a very important dependency.

We've been using a hybrid architecture for a couple of years now and mostly this runs pretty well. But when we do experience issues (e.g. timeouts), we need to find the source of the problem as soon as possible. Most of the time in the past we identified the Internet connectivity between the on-premise data center and the cloud as the source. But whose fault is it? Our own (firewall), our Internet Service Provider (ISP), the Cloud Service Provider (CSP) or something in between (Internet Exchanges)? That question was never easy to answer, and whenever we contacted our ISP to help identify where the problem was, it took two days to get a response which mostly didn't help us technically either.

What you do in such moments of connectivity problems is: troubleshooting. And this includes gathering as much data as possible from all the resources you have: stats and graphs from monitoring, external verification (is it just me or everyone?) and, most importantly, a connectivity check. A very well-known tool for this is "mtr", which stands for "my traceroute". It's basically an advanced version of the classic traceroute (or tracert on Windows) command.
By following the "mtr" output, the hop causing the issue can sometimes be identified immediately. But in almost all cases I wished I had a comparison at hand: right now we're going through these 20 hops, but what happened 30 minutes ago, when there were no connectivity issues?

For a couple of months I had been looking for a monitoring plugin which basically runs mtr and graphs the data, but there was no real solution. Until I finally came across a GitHub repository where mtr is run in the background by a Python script. The results are written into a time series database (InfluxDB). That sounded pretty good to me and I gave it a try. Unfortunately there were some problems running the Python script standalone. It was supposed to be started within a Docker container and install an InfluxDB inside that container, which I didn't want.
Note: If you don't have an InfluxDB at hand (or you don't know how to administrate InfluxDB) and don't mind keeping the data in a container, the existing script is great!

I rewrote that script in Bash and added some additional stuff. Then I let it run in a 2-minute interval for a couple of destinations. Each hop and its data (latency, packet loss, etc.) is entered into InfluxDB using the timestamp of that run. Using a Grafana dashboard it is now possible to see the exact hops, their latencies and packet drops at a certain moment in time. This also makes it possible to compare the hops in case there was a routing change.
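The full script and installation instructions will follow (see the PS at the end of this article), but the core idea can be sketched in a few lines of Bash. The InfluxDB URL, the database name "mtr" and the destination hostname below are assumptions for illustration:

#!/bin/bash
# Minimal sketch: run mtr once for a destination and write every hop into InfluxDB (line protocol)
INFLUX="http://localhost:8086/write?db=mtr"   # example InfluxDB endpoint and database
DEST="elasticsearch.example.com"              # example destination
TS=$(date +%s%N)                              # nanosecond timestamp for this run

# -r: report mode, -c 10: ten probes per hop, -n: no DNS resolution
# tail skips the two report header lines (Start/HOST)
mtr -r -c 10 -n "$DEST" | tail -n +3 | while read -r hop ip loss snt last avg best wrst stdev; do
  curl -s -XPOST "$INFLUX" --data-binary \
    "mtr,dest=$DEST,hop=${hop%%.*},ip=$ip loss=${loss%\%},avg=$avg,worst=$wrst $TS"
done

The real script does more (multiple destinations, error handling), but this shows the principle: one InfluxDB data point per hop per run, which is exactly what the Grafana dashboard then queries.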

MTR Grafana Dashboard 

Now that I had this tool in place and collecting data, I just needed to wait for the next outage.

Yesterday, January 10th 2019, it finally happened. We experienced timeouts in all our web applications which connect to an ElasticSearch service in the cloud, running in AWS Ireland. Thanks to our monitoring we almost immediately saw that the latency between our applications in our data center and ElasticSearch spiked, while the same checks from within AWS Ireland still showed roughly the same values:

ElasticSearch Roundtrip Graph

There was a downtime of 13 minutes for most of our web applications between 16:13 and 16:26. Once the issue was resolved, the blame game started (of course!).
But this time I was able to refer to the MTR dashboard and compare the hops at 16:00-16:05 with 16:15-16:20 and with 16:30-16:35:

Routing change within AWS caused downtime 

Comparing the hops side by side reveals something interesting: there were indeed routing changes, but only after the hop 52.93.128.156, which already belongs to AWS.
This means that internal routing changes within AWS caused the outage. Yet on the AWS status dashboard, everything was green (as always...).

Thanks to this MTR dashboard we are now able to identify when and where a routing change happened, which helps settle the blame game.

PS: I will release the Bash script, installation instructions and the Grafana dashboard (it's slightly different from the one in the mtr-monitor repository) on GitHub soon.

 

Monitoring plugin check_smart 5.11.1 released
Tuesday - Jan 8th 2019 - by - (0 comments)

The monitoring plugin check_smart, to monitor hard and solid state drives' SMART attributes, is out with a new version.

Version 5.11.1 is a bugfix version and removes Perl warnings of uninitialized values (issues #29 and #31 on GitHub).

It also adds relevant information in the "help" output for the recently introduced exclude parameter.

 

SSL TLS SNI certificate monitoring with Icingaweb2 x509 module
Monday - Jan 7th 2019 - by - (2 comments)

Back in November 2018 I attended the Open Source Monitoring Conference (OSMC) in Nuremberg, Germany. One of the presentations included a quick demo of a new module for Icingaweb2, the "quasi standard" user interface for Icinga2. The module is called "x509" and its purpose, as one might already guess from the name, is to monitor SSL/TLS certificates.

This article will cover the installation, activation and configuration of the module, followed by some first impressions (pros/cons).

[ Requirements ]

But before the actual module is installed, some prerequisites need to be verified.

1. Icingaweb2 requirements
- Your Icingaweb2 version shouldn't be too old. At least version 2.5 (or newer) is required.
- Two additional modules are needed as dependencies: reactbundle and ipl (I'll describe this later)

2. OpenSSL
- You need to have OpenSSL installed.

# apt-get install openssl

3. Database requirements
- The module supports either MySQL or MariaDB
- Make sure the InnoDB variables are set as follows (in my installation on Ubuntu 16.04 this was already the default; see the note after the output below if yours differ):

# mysql -e "show variables where variable_name like 'innodb_file%'"
+--------------------------+-----------+
| Variable_name            | Value     |
+--------------------------+-----------+
| innodb_file_format       | Barracuda |
| innodb_file_format_check | ON        |
| innodb_file_format_max   | Barracuda |
| innodb_file_per_table    | ON        |
+--------------------------+-----------+

# mysql -e "show variables where variable_name like 'innodb_large%'"
+---------------------+-------+
| Variable_name       | Value |
+---------------------+-------+
| innodb_large_prefix | ON    |
+---------------------+-------+
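
If your installation does not show these values, they can typically be set in the MySQL/MariaDB server configuration and activated with a restart of the database server. The config file path below is just an example:

# /etc/mysql/conf.d/innodb-barracuda.cnf (example path)
[mysqld]
innodb_file_format        = Barracuda
innodb_file_format_max    = Barracuda
innodb_file_per_table     = ON
innodb_large_prefix       = ON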

4. PHP requirements
- At least PHP 5.6 is needed, PHP 7.x is recommended
- The additional PHP package "php-gmp" (or "php7.0-gmp") needs to be installed

# apt-get install php7.0-gmp

 

[ Installation ]

If you're already using multiple Icingaweb2 modules, this might be very easy for you. But for me this was my first additional Icingaweb2 module, some installation points were not very clear to me, and I had to figure out the details myself (e.g. placing the module into /usr/share/icingaweb2/modules and not into /etc/icingaweb2/modules took me a couple of minutes to figure out).

Clone the repository into /usr/share/icingaweb2/modules:

root@icingaweb2:~# cd /usr/share/icingaweb2/modules/
root@icingaweb2:/usr/share/icingaweb2/modules# git clone https://github.com/Icinga/icingaweb2-module-x509.git

Then rename the newly created directory and give it the correct permissions (optional):

root@icingaweb2:~# mv /usr/share/icingaweb2/modules/icingaweb2-module-x509 /usr/share/icingaweb2/modules/x509
root@icingaweb2:~# chown -R www-data:icingaweb2 /usr/share/icingaweb2/modules/x509/

Now the database needs to be created. The module will use its own database (and therefore database resource).

mysql> CREATE DATABASE x509;
Query OK, 1 row affected (0.00 sec)

mysql> GRANT SELECT, INSERT, UPDATE, DELETE, DROP, CREATE VIEW, INDEX, EXECUTE ON x509.* TO 'icinga'@'localhost' IDENTIFIED BY 'secret';
Query OK, 0 rows affected, 1 warning (0.00 sec)

Then import the schema which is part of the repository you just cloned before:

root@icingaweb2:~# mysql -u root x509 < /usr/share/icingaweb2/modules/x509/etc/schema/mysql.schema.sql

As mentioned before, the x509 module requires two other modules (ipl and reactbundle) as dependencies. They can be installed pretty quickly:

root@icingaweb2:~# REPO="https://github.com/Icinga/icingaweb2-module-ipl" \
&& MODULES_PATH="/usr/share/icingaweb2/modules" \
&& MODULE_VERSION=0.1.1 \
&& mkdir -p "$MODULES_PATH" \
&& git clone ${REPO} "${MODULES_PATH}/ipl" --branch v${MODULE_VERSION}
icingacli module enable ipl


root@icingaweb2:~# REPO="https://github.com/Icinga/icingaweb2-module-reactbundle" \
&& MODULES_PATH="/usr/share/icingaweb2/modules" \
&& MODULE_VERSION=0.4.1 \
&& mkdir -p "$MODULES_PATH" \
&& git clone ${REPO} "${MODULES_PATH}/reactbundle" --branch v${MODULE_VERSION}
icingacli module enable reactbundle

The next steps are taken in the Icingaweb2 user interface. Log in to Icingaweb2 with a privileged user (Administrator role) and enable the x509 module in Configuration -> Modules -> x509. You should also see the other new modules (ipl and reactbundle) already enabled.
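If you prefer the command line, the x509 module can also be enabled with icingacli, just like the ipl and reactbundle modules above:

root@icingaweb2:~# icingacli module enable x509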

As the x509 database is already prepared and ready, we now have to configure it in Icingaweb2 as a resource: Configuration -> Resources -> Create a new resource

Icingaweb2 x509 create database resource


Now the x509 module needs to be configured to use this resource in the backend:

Configuration -> Modules -> x509 -> Backend

Select the newly created resource:

Icingaweb2 x509 resource configuration 

So far this was the setup and activation of the x509 module.

 

[ Scan jobs ]

The next step is to create "jobs" which run in the background and scan the network (according to the configured input) for certificates.

But before the first scan job is created, the module's own trust store needs to be filled. If you have the "ca-certificates" package installed, you can import its certificates (as the Apache user):

www-data@icingaweb2:~$ icingacli x509 import --file /etc/ssl/certs/ca-certificates.crt
Processed 148 X.509 certificates.

Now create the first scan job: Configuration -> Modules -> x509 -> Jobs -> Create a new job

Icingaweb2 X509 Module Create Scan Job 

In this example a job named "Subnet253" is created. Every day at 18:00 (6pm) the job should scan the entire 192.168.253.0/24 range (a class C network) on port 443.
You can now launch the scan job manually on the CLI as the Apache user:

www-data@icingaweb2:~$ icingacli x509 scan --job Subnet253
openssl verify failed for command openssl verify -CAfile '/tmp/5c3349e6374c4/ca5c3349e637541' '/tmp/5c3349e6374c4/cert5c3349e644d4e' 2>&1: /tmp/5c3349e6374c4/cert5c3349e644d4e: OU = Domain Control Validated, OU = Gandi Standard Wildcard SSL, CN = *.example.com
error 20 at 0 depth lookup:unable to get local issuer certificate

Such a warning shows up if either the root CA issuer is unknown (it was not imported into the module's trust store) or the server is missing a certificate in the chain. This is then also shown in the list of found certificates in the UI and marked with a red icon:

Icingaweb2 X509 module invalid chain

What the module does in the background is basically a scan of the job's defined IP range. You can do the same manually and verify with the openssl command:

$ openssl s_client -connect 192.168.253.15:443
CONNECTED(00000003)
depth=0 O = Acme Co, CN = Kubernetes Ingress Controller Fake Certificate
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 O = Acme Co, CN = Kubernetes Ingress Controller Fake Certificate
verify error:num=21:unable to verify the first certificate
verify return:1
---
Certificate chain
 0 s:/O=Acme Co/CN=Kubernetes Ingress Controller Fake Certificate
   i:/O=Acme Co/CN=Kubernetes Ingress Controller Fake Certificate
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIDbzCCAlegAwIBAgIQNjPMVgsbs/QcapdZeZZ5WDANBgkqhkiG9w0BAQsFADBL
MRAwDgYDVQQKEwdBY21lIENvMTcwNQYDVQQDEy5LdWJlcm5ldGVzIEluZ3Jlc3Mg
Q29udHJvbGxlciBGYWtlIENlcnRpZmljYXRlMB4XDTE4MTIwNDA3NDg1MVoXDTE5
MTIwNDA3NDg1MVowSzEQMA4GA1UEChMHQWNtZSBDbzE3MDUGA1UEAxMuS3ViZXJu
ZXRlcyBJbmdyZXNzIENvbnRyb2xsZXIgRmFrZSBDZXJ0aWZpY2F0ZTCCASIwDQYJ
KoZIhvcNAQEBBQADggEPADCCAQoCggEBAKquGymJpl49Weph8hsusqV4pLOdx6NV
8CCcumTJMSd35VeZUOaHh2mohvkJRaTeXD+QE1VX3vlT2Nt6CCHnM4Q1ldaXdazU
HXGy6XrDDax6GsDR72lpDQ3g6PYwffRwZlVTbISRIhIE0WQLSshjNQ4T29AbOazl
/3DJ2A34BBB3yzWLtMA5HEWDZF8h/RWXJgw2w2gDKq3doY0aYOnpOjjxEOlQIXZ2
GPpv7VHokbGU2f+6myqV9eevLtZy0zKrqmIPualuoDGKhmd0fQv70cA42HZj73Pf
pbbgHb+hSMGAXO1hUkusIfERXTVSxG/OEayrS3MrwGKDL8DrjDZmEeMCAwEAAaNP
ME0wDgYDVR0PAQH/BAQDAgWgMBMGA1UdJQQMMAoGCCsGAQUFBwMBMAwGA1UdEwEB
/wQCMAAwGAYDVR0RBBEwD4INaW5ncmVzcy5sb2NhbDANBgkqhkiG9w0BAQsFAAOC
AQEAeSAKvMz6TpK0MZuNkAwozRE9IQuzrUA77+xgmorYxjxfkX4q1biR6CzQgQx/
uY8LutvKCf5ygnhPJjunhjCCq0OrgWvqj268H8suWxXpErwYP7Nh8Zricn+ALLsq
48mF81tjoOa9FsYUU4hNrkqOMEuPSHIXTr4+xgmdzQjhBrP+tEq9ISwvVX5eQ7E3
BX79v4K3Wb/BFXii1xlPMbLjBAfOCGW9zlCapcil94mfpEHMwqitsgnpurZMhpvH
udQ1nzgXPcFeOCZBZecLnaG1kD2PEL/9zdq9FAB8Bk7iLFloAqFLOjjeTOpv+St6
ehGjRV7Cji4rGDVJSy5pE5GUNA==
-----END CERTIFICATE-----
subject=/O=Acme Co/CN=Kubernetes Ingress Controller Fake Certificate
issuer=/O=Acme Co/CN=Kubernetes Ingress Controller Fake Certificate
---
No client certificate CA names sent
---
SSL handshake has read 1553 bytes and written 421 bytes
---
New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-GCM-SHA384
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : ECDHE-RSA-AES256-GCM-SHA384
    Session-ID: AAAEE62F2E75A3FB796C398636E7DA4AA051E7A91D067690C9FCA68DA2BDF0A8
    Session-ID-ctx:
    Master-Key: 1AF1ECB19DC5E830E73BF601878B72F486A710A207315E73168A079D991E2C8D9F72148942028048CFA65B6E2F018E6B
    Key-Arg   : None
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    TLS session ticket lifetime hint: 600 (seconds)
    TLS session ticket:
    0000 - ea 09 28 0c ef 2b 21 a6-22 dd a3 79 2a 14 5e 05   ..(..+!."..y*.^.
    0010 - c1 96 e8 0c dc 11 95 f6-f9 1c aa 18 75 3b ca 61   ............u;.a
    0020 - 4f d8 5e b6 f7 ac 88 fb-f9 e1 6e df 0d 18 3b a2   O.^.......n...;.
    0030 - 4f 54 39 55 07 df e5 d7-81 c0 c5 23 f8 b4 6e 26   OT9U.......#..n&
    0040 - c8 9c 69 53 29 b5 f4 c8-b1 60 fc 63 54 c6 8a e7   ..iS)....`.cT...
    0050 - b7 0d 94 72 dd 26 ce 2a-e7 ed 3d 63 61 2d 77 f8   ...r.&.*..=ca-w.
    0060 - 52 be bc 3d 9e f2 d7 f6-01 c0 c2 ba e0 68 e6 9d   R..=.........h..
    0070 - 72 c4 46 af 07 a7 0e 7c-08 ba fd 67 5f 1e 00 e7   r.F....|...g_...
    0080 - f2 ba 5b c4 0f e2 1c 7e-fb 91 ef e2 b6 9e 12 91   ..[....~........
    0090 - 5b eb bb fd 04 53 4f 47-77 0a 97 f5 ef b7 de 0e   [....SOGw.......
    00a0 - f3 46 04 08 30 5e 08 8d-e6 43 2b 58 f6 da 74 da   .F..0^...C+X..t.

    Start Time: 1546866271
    Timeout   : 300 (sec)
    Verify return code: 21 (unable to verify the first certificate)
---

The x509 module of course found this information on host 192.168.253.15 as well. Because this is an internal Docker host managed by Kubernetes, it has an invalid (self-signed) certificate from Kubernetes installed. This is nicely shown in the UI, too:

Icingaweb2 x509 module chain invalid tls

You could now create a cron job which runs the scan task at your desired time. Or you could install a systemd service using the unit file from the repo (see /usr/share/icingaweb2/modules/x509/config/systemd/icinga-x509.service).
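A crontab entry matching the job's 18:00 schedule could look like this (a sketch; it assumes icingacli is installed in /usr/bin and the job runs as the Apache user):

# crontab -u www-data -e
0 18 * * * /usr/bin/icingacli x509 scan --job Subnet253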

 

[ SNI certificates ]

So far so good. The module shows us which certificates are invalid or have a missing chain. But what about SNI certificates?
The module only scans the primary certificate of each IP/host on the configured port (443 in this case).

To handle Server Name Indication certificates, the module needs to be told the hostnames (how would it know them otherwise?). These can be added in the module's configuration:

Configuration -> Modules -> x509 -> SNI -> Create a new SNI map

So basically you specify for each IP address all the hostnames (comma-separated) which you have configured on that host:

Icingaweb2 x509 Module SNI Mapping

After this, run another scan and all your SNI certificates will show up, too.

 

[ Remove certificates ]

But what if a certificate was removed or needs to be deleted?
Well, that is a bit of a problem as of now. The module doesn't delete "old" certificates from the database. Unfortunately it is not possible to manually delete a discovered certificate either. Check out this GitHub issue where this problem might be tackled.

As a workaround you can delete the certificate directly from the x509_target table in the database:

mysql> DELETE FROM x509_target WHERE hostname = "myhostname.example.com";

Where hostname is either the discovered hostname from a scan or the hostname defined in the SNI map.

 

[ Pros vs. Cons ]

The module is relatively new but already offers a lot of cool features and graphics. But certain features (especially SNI certificate monitoring) still require manual fiddling, more than if you'd build a smart apply rule in Icinga2.

+ TLS Chain verification
+ Certificate verification (self signed vs. validated)
+ Several sorting options
+ Nice overview (dashboard) of all certificates, including CAs
- SNI certificates need to be configured manually as comma-separated hostnames per IP (that's a lot of work if you run huge central reverse proxies)
- Certificates (even falsely discovered ones) cannot be deleted in the UI

Altogether a very handy module, even though there's still a lot of room for improvement (as with any software).

Icingaweb2 x509 Module Dashboard

 

Linux going AMD Ryzen with Debian 9 (Stretch)
Monday - Dec 31st 2018 - by - (0 comments)

Ever since AMD announced the new Ryzen processors, I was eager to test such a CPU myself and replace a weaker CPU on a local micro-server running Debian Stretch.
But a couple of public forum posts with mixed experiences kept me from buying one right away.

So I neither got a real confirmation that Ryzen works as it should, nor a confirmation that it doesn't work at all. I decided to go for it and test it myself. So I ordered an AMD Ryzen 1700 (yes, that's the first generation Ryzen) and a new AM4 motherboard (ASRock AB350 GAMING-ITX).

While waiting for the delivery, I already prepared Debian Stretch to run with a newer kernel. It seems that full Ryzen support started with kernel 4.11.0, with further improvements and fixes in later releases. But Debian Stretch ships with 4.9 by default. Luckily kernel 4.18 is available in stretch-backports. So let's enable backports then!

root@irbwsrvp01 ~ # echo "deb http://ftp.ch.debian.org/debian stretch-backports main non-free contrib" > /etc/apt/sources.list.d/backports.list
root@irbwsrvp01 ~ # apt-get update

To install the current Linux Kernel from backports:

root@irbwsrvp01 ~ # apt-get install -t stretch-backports linux-image-amd64
Reading package lists... Done
Building dependency tree      
Reading state information... Done
[...]
The following additional packages will be installed:
  linux-image-4.18.0-0.bpo.1-amd64
Suggested packages:
  linux-doc-4.18 debian-kernel-handbook
Recommended packages:
  firmware-linux-free irqbalance apparmor
The following NEW packages will be installed:
  linux-image-4.18.0-0.bpo.1-amd64
The following packages will be upgraded:
  linux-image-amd64
1 upgraded, 1 newly installed, 0 to remove and 83 not upgraded.
Need to get 45.4 MB of archives.
After this operation, 257 MB of additional disk space will be used.
Do you want to continue? [Y/n]

Just to make sure I don't run into missing firmware dependencies, I decided to install these packages from backports, too:

root@irbwsrvp01 ~ # apt-get install -t stretch-backports firmware-amd-graphics firmware-linux-nonfree firmware-misc-nonfree firmware-realtek

Followed by reboot:

root@irbwsrvp01 ~ # reboot

The micro-server came up again and booted the new Kernel 4.18:

root@irbwsrvp01 ~ # uname -a
Linux irbwsrvp01 4.18.0-0.bpo.1-amd64 #1 SMP Debian 4.18.6-1~bpo9+1 (2018-09-13) x86_64 GNU/Linux

Yay! Ready for Ryzen!

Two days later I got my packages: The AMD Ryzen 1700 CPU, AM4 ASRock Motherboard, 2 x 8GB DDR4 RAM

AM4 Motherboard ASRock AB350 Mini-ITX with AMD Ryzen 1700

Once I had built the new motherboard with the Ryzen CPU into the micro-server, it was the moment of truth: would the server boot?

Side note: It did, but there was no video output! What did I do wrong? I tried both HDMI outputs of the motherboard, but there was no video output at all, not even a POST screen. It turns out the ASRock AB350 Gaming-ITX motherboard has a lot of video output connectors, but no onboard GPU. On the old motherboard I used an AMD A10 processor, which is very weak (compared to Ryzen) but has an embedded GPU. Ryzen itself does not have an embedded GPU. This means I had to attach a graphics card to the PCIe slot. Luckily I still had one.

The processor shows up in ASRock's BIOS, so far so good:

ASRock AB350 Bios Ryzen 1700

What about the OS? Will it boot?

And yes. It did!

$ cat /proc/cpuinfo  | head -n 28
processor    : 0
vendor_id    : AuthenticAMD
cpu family    : 23
model        : 1
model name    : AMD Ryzen 7 1700 Eight-Core Processor
stepping    : 1
microcode    : 0x8001136
cpu MHz        : 1481.761
cache size    : 512 KB
physical id    : 0
siblings    : 16
core id        : 0
cpu cores    : 8
apicid        : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
bugs        : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips    : 5988.04
TLB size    : 2560 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate eff_freq_ro [13] [14]

And multi-threading is correctly working:

$ cat /proc/cpuinfo |grep "core id"
core id        : 0
core id        : 0
core id        : 1
core id        : 1
core id        : 2
core id        : 2
core id        : 3
core id        : 3
core id        : 4
core id        : 4
core id        : 5
core id        : 5
core id        : 6
core id        : 6
core id        : 7
core id        : 7

The view in htop:

AMD Ryzen Linux htop

This micro-server has now been running for 16 days straight, without any issues so far.

$ uptime
 14:23:53 up 16 days,  2:20,  2 users,  load average: 2.25, 3.06, 3.05

TL;DR: AMD Ryzen (first generation) works very well and fast with Debian Stretch and a backported Linux kernel (4.18)!

And here's "Team Red" at work:

Debian Stretch running with AMD Ryzen 1700

Update January 9th 2019:
I saw a couple of forum posts from users complaining that they get no sensor data from the Ryzen CPU. I cannot confirm that. At least with the backported 4.18 kernel on Debian 9 and the "lm-sensors" package installed, I'm able to read the CPU temperature:

# sensors
k10temp-pci-00c3
Adapter: PCI adapter
Tdie:         +38.9°C  (high = +70.0°C)
Tctl:         +38.9°C 

nouveau-pci-2600
Adapter: PCI adapter
GPU core:     +1.05 V  (min =  +0.80 V, max =  +1.19 V)
temp1:        +38.0°C  (high = +95.0°C, hyst =  +3.0°C)
                       (crit = +105.0°C, hyst =  +5.0°C)
                       (emerg = +135.0°C, hyst =  +5.0°C)

Important to know here: "k10temp-pci-00c3" is the Ryzen 1700 CPU.

Below that, nouveau-pci-2600 is an Nvidia GPU (GeForce GT 730).
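In case the sensors command is not available yet, installing lm-sensors and detecting the sensors is quickly done on Debian/Ubuntu:

# apt-get install lm-sensors
# sensors-detect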

 

Monitoring plugin check_smart 5.11 available, introducing exclude list
Friday - Dec 28th 2018 - by - (0 comments)

The monitoring plugin check_smart, to monitor hard drives' and solid state drives' SMART attributes, is out with a new version.

Version 5.11 introduces a new parameter "-e" or "--exclude" which stands for exclude list (aka ignore list).

The exclude list is a list of strings, separated by commas. It basically tells the plugin which SMART attributes to ignore, even if they are in a failing or failed state.

Let's take a "temperature failed in the past" error as an example.

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
194 Temperature_Celsius     0x0002   113   113   000    Old_age   Always  In_the_past  53 (Lifetime Min/Max 25/62)

Without the exclude list, the plugin will return a WARNING when the temperature SMART attribute once failed in the past:

# ./check_smart.pl -d /dev/sda -i sat
WARNING: Attribute Temperature_Celsius failed at In_the_past|Raw_Read_Error_Rate=0 Throughput_Performance=67 Spin_Up_Time=0 Start_Stop_Count=3 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Seek_Time_Performance=34 Power_On_Hours=10617 Spin_Retry_Count=0 Power_Cycle_Count=3 Power-Off_Retract_Count=3 Load_Cycle_Count=3 Temperature_Celsius=53 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0

It's nice to know that it once failed in the past. But once we know that, we get over it and want the warning to disappear. With the exclude list, the plugin can be told to ignore the "Temperature_Celsius" attribute:

# ./check_smart.pl -d /dev/sda -i sat -e Temperature_Celsius
OK: no SMART errors detected. |Raw_Read_Error_Rate=0 Throughput_Performance=67 Spin_Up_Time=0 Start_Stop_Count=3 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Seek_Time_Performance=34 Power_On_Hours=10617 Spin_Retry_Count=0 Power_Cycle_Count=3 Power-Off_Retract_Count=3 Load_Cycle_Count=3 Temperature_Celsius=53 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0

And hurray, no alert anymore. 

But this could also be a bit dangerous. What if the drive has a new (live!) temperature alert? You'd certainly want to know about it. That's why, besides excluding a SMART attribute, it is also possible to exclude certain values in the "When_failed" column. In the following example, the "When_Failed" value "In_the_past" (as seen above) can be used in the exclude list:

# ./check_smart.pl -d /dev/sda -i sat -e "In_the_past"
OK: no SMART errors detected. |Raw_Read_Error_Rate=0 Throughput_Performance=67 Spin_Up_Time=0 Start_Stop_Count=3 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Seek_Time_Performance=34 Power_On_Hours=10617 Spin_Retry_Count=0 Power_Cycle_Count=3 Power-Off_Retract_Count=3 Load_Cycle_Count=3 Temperature_Celsius=53 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0

As you can see, the plugin doesn't alert anymore on the "Temperature_Celsius" because it detected the "In_the_past" value in the "When_failed" column and successfully ignored it.

To ignore multiple attributes, simply separate them with a comma:

# ./check_smart.pl -d /dev/sda -i sat -e "In_the_past","Current_Pending_Sector"
OK: no SMART errors detected. |Raw_Read_Error_Rate=0 Throughput_Performance=67 Spin_Up_Time=0 Start_Stop_Count=3 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Seek_Time_Performance=34 Power_On_Hours=10617 Spin_Retry_Count=0 Power_Cycle_Count=3 Power-Off_Retract_Count=3 Load_Cycle_Count=3 Temperature_Celsius=53 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0

But make sure you don't cut yourself with this. The main reason the exclude list was created in the first place is clearly the temperature attribute.

 

Reduce the number of shards of an Elasticsearch index
Thursday - Dec 27th 2018 - by - (0 comments)

When you run a lot of indexes, this can create quite a large number of shards in your ELK stack. As the documentation states, the default creates 5 shards per index:

index.number_of_shards
    The number of primary shards that an index should have. Defaults to 5. This setting can only be set at index creation time. It cannot be changed on a closed index. Note: the number of shards are limited to 1024 per index. This limitation is a safety limit to prevent accidental creation of indices that can destabilize a cluster due to resource allocation. The limit can be modified by specifying export ES_JAVA_OPTS="-Des.index.max_number_of_shards=128" system property on every node that is part of the cluster.

Another interesting default setting is the number of replicas of each shard:

 index.number_of_replicas
    The number of replicas each primary shard has. Defaults to 1.

Once an index has been created, its number of shards is fixed. The number of shards can only be defined when a new index is created; either during the creation of the index itself or via the settings in the template.

Note: Yes, it is possible to change the number of shards of already created indexes, but that means you must re-index, possibly causing downtime.

In my setup, a classical ELK stack, there are a couple of indexes (logstash, filebeat, haproxy, ...) created every day, typically with the date in the index name (logstash-2018.12.27). By adjusting the shard settings in the "logstash" and "filebeat" templates, the indexes created from tomorrow on will have a reduced number of shards.

First let's take a backup:

# elk=localhost:9200
# curl $elk/_template/logstash?pretty -u elastic -p > /root/template-logstash.backup

Now create a new file, e.g. /tmp/logstash, based on the backup file. Add the "number_of_shards" and "number_of_replicas" settings into the settings key.

Also make sure that you remove the "logstash" top-level key itself, so the file looks like this:

# cat /tmp/logstash
{
    "order" : 0,
    "version" : 60001,
    "index_patterns" : [
      "logstash-*"
    ],
    "settings" : {
      "number_of_shards" : 2,
      "number_of_replicas" : 1,
      "index" : {
        "refresh_interval" : "5s"
      }
    },
    "mappings" : {
[...]
        ],
        "properties" : {
          "@timestamp" : {
            "type" : "date"
          },
          "@version" : {
            "type" : "keyword"
          },
          "geoip" : {
            "dynamic" : true,
            "properties" : {
              "ip" : {
                "type" : "ip"
              },
              "location" : {
                "type" : "geo_point"
              },
              "latitude" : {
                "type" : "half_float"
              },
              "longitude" : {
                "type" : "half_float"
              }
            }
          }
        }
      }
    },
    "aliases" : { }
}

And now this file can be "PUT" into Elasticsearch templates:

# curl -H "Content-Type: application/json" -XPUT $elk/_template/logstash -d "@/tmp/logstash" -u elastic -p
Enter host password for user 'elastic':
{"acknowledged":true}

When checking the template again, our adjusted shard settings now show up:

# curl $elk/_template/logstash?pretty -u elastic -p
{
  "logstash" : {
    "order" : 0,
    "version" : 60001,
    "index_patterns" : [
      "logstash-*"
    ],
    "settings" : {
      "index" : {
        "number_of_shards" : "2",
        "number_of_replicas" : "1",
        "refresh_interval" : "5s"
      }
    },
    "mappings" : {


 

Monitoring plugin check_netio 1.3 released
Friday - Dec 21st 2018 - by - (0 comments)

The monitoring plugin check_netio, to monitor input/output on Linux network interfaces, has been around for over a decade (2007). And I've been using it on thousands of servers (counting all the data centers of my former employers together) in the past years.

For me it was always a very lightweight yet incredibly easy way to monitor a network interface's performance. The plugin's code didn't change a lot over the years.

But in 2017 I actually needed to adapt the source code because some Linux distributions changed the output of the command "ifconfig". Suddenly the plugin didn't work anymore on these distributions (it started with RHEL/CentOS 7 by the way).

Now in 2018 it seems that some distributions don't ship "ifconfig" by default anymore (seen in Ubuntu 18.04) because "ip" has been around for many years, too.

Version 1.2
Instead of relying on "ifconfig" and running into possible parsing errors due to different formatting in various distros, I changed the plugin to directly parse /proc/net/dev. This procfs file should be the same on all distributions. And it has another nice side effect: The plugin is now 50% faster, too!
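To illustrate what the plugin now parses, the RX and TX byte counters of an interface can be read directly from /proc/net/dev, for example like this (the interface name is an example):

$ awk '/eth0:/ {print "RX bytes:", $2, "TX bytes:", $10}' /proc/net/dev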

If for some reason the plugin doesn't work (which would mean there's no read access to /proc/net/dev for some reason), there's the new parameter "-l" to use the "legacy" mode. The legacy mode continues to use "ifconfig" in the background.

Version 1.3
Removed the plugin's dependency on /usr/lib/nagios/plugins/utils.sh, which is part of the nagios-plugins-common or monitoring-plugins-common package. check_netio actually only used one small variable from utils.sh, so a full dependency on utils.sh just doesn't make sense. And yes, this also means a small gain in execution speed.

And because there have been quite a few adjustments now, I finally created a documentation page: https://www.claudiokuenzler.com/monitoring-plugins/check_netio.php

Enjoy!

 

