
Major update on Elasticsearch monitoring plugin check_es_system!
Wednesday - Feb 20th 2019

It's been quite a while since the last update on check_es_system, a monitoring plugin for Elasticsearch nodes.

That's why this post describes the latest changes in more detail.

Let's start with a change that flew under my radar: Tom Barton (@deric) created a pull request quite a while ago, in March 2018. I had my Github notifications turned off, so I never saw it - sorry!
He added a helpful new parameter "-m" which stands for "max time" (aka timeout). This allows an additional verification that Elasticsearch responds fast enough. This change is shown as 20180313 in the plugin's change history.
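
A hypothetical invocation could look like this (the -H and -t parameter names are assumptions based on typical plugin conventions; -m is the new max time flag):

$ ./check_es_system.sh -H esnode.example.com -t mem -m 5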

Yesterday I came across a strange bug when I made a configuration error in the Icinga2 service definition. This led to open issue #4, which was then solved in version 20190219. Basically this bug hits you when the plugin tries to access Elasticsearch on an https port but you didn't set the -S parameter. In the background this launches curl to talk plain http to an https listener port. Got the idea?
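
The mix-up is easy to reproduce with curl directly (a sketch; host and port are placeholders and the exact error depends on the setup). Talking plain http to a TLS-only listener fails:

$ curl http://esnode.example.com:9200/_cluster/health

While the https scheme (which the -S parameter selects) works:

$ curl -k https://esnode.example.com:9200/_cluster/health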

And also yesterday I started to work on a new check type: status. Yes, pretty standard, I know. I actually never added a status check in the first place because I was successfully using a different plugin (check_elasticsearch.sh by Andrew Lyon) for the status checks. But in recent weeks we increased our Elasticsearch fleet (internal and in the cloud), which led to many different credentials, ports, ES with and without HTTPS, etc. So I needed a plugin which can be as dynamic as our environment.

The new "status" check type does not only output green, yellow or red. No, it also adds some helpful information about the cluster structure. How many nodes are there, how many data nodes? How many shards are there? And in case i'm in yellow or red state, are there shards to be relocated/initialize/assign? And finally something which is not often though of: Number of documents. This seems irrelevant for a status check but when you create graphs with the numbers, you can see the growth rate of your Elasticsearch cluster.

To round this up, all of this is now released as version 1.1, which is a bit easier to remember than the history dates that served as release numbers before.

The documentation page has been updated and greatly enhanced with additional examples.

Enjoy!

 

LXC 2.0 container not starting on Debian 9 Stretch when using cgroup limits
Monday - Feb 18th 2019

I just hit a problem on a Debian 9 (Stretch) machine with the latest LXC 2.0.7 (package 2.0.7-2+deb9u2 from Debian repos) installed.

When I tried to run a LXC container with cgroup limits enabled, I got the following error:

# grep limit /var/lib/lxc/container/config
lxc.cgroup.memory.limit_in_bytes = 64G
lxc.cgroup.memory.memsw.limit_in_bytes = 68G

# lxc-start -n container -F
lxc-start: cgroups/cgfsng.c: cgfsng_setup_limits: 1949 Permission denied - Error setting memory.memsw.limit_in_bytes to 68G for container
lxc-start: start.c: lxc_spawn: 1190 Failed to setup cgroup limits for container "container".
lxc-start: start.c: __lxc_start: 1346 Failed to spawn container "container".
lxc-start: tools/lxc_start.c: main: 366 The container failed to start.
lxc-start: tools/lxc_start.c: main: 370 Additional information can be obtained by setting the --logfile and --logpriority options.

After some research I came across a very interesting thread in the linuxcontainers.org forums. There was indeed a problem in the 2.0.7 version, but it was fixed in 2.0.8. The problem with Debian? Stretch still runs with 2.0.7 and has for quite a long time (over a year) according to the changelog:

lxc (1:2.0.7-2+deb9u2) stretch; urgency=medium

  * 0005-debian-Use-iproute2-instead-of-iproute.patch: pull iproute2 instead
    of iproute, fixing the creation of testing and unstable containers after
    the iproute binary package was dropped.

 -- Antonio Terceiro   Mon, 29 Jan 2018 20:23:36 -0200

lxc (1:2.0.7-2+deb9u1) stretch; urgency=medium

  * 0003-lxc-debian-don-t-hardcode-valid-releases.patch: don't
    hardcode list of valid Debian releases. Allows creating stable, buster,
    testing, and unstable containers.
  * 0004-lxc-debian-don-t-write-C.-locales-to-etc-locale.gen.patch: don't
    insert C.* locales into /etc/locale.gen (Closes: #879595)

 -- Antonio Terceiro   Fri, 27 Oct 2017 15:13:31 -0200

lxc (1:2.0.7-2) unstable; urgency=high

  * use bash-completion's pkg-config support and don't move files around
  * ignore lxc-test-cloneconfig if kernel has no overlay support
  * CVE-2017-5985: Ensure target netns is caller-owned (Closes: #857295)

 -- Evgeni Golov   Sat, 11 Mar 2017 09:47:20 +0100

lxc (1:2.0.7-1) unstable; urgency=medium

  * New upstream version 2.0.7
    + Closes: #847909, #847894, #847466

 -- Evgeni Golov   Mon, 23 Jan 2017 22:03:24 +0100

According to the thread in the forums, the problem was fixed in lxcfs, particularly in the package libpam-cgfs. The discussion in the LXC forums also led to the report of Debian bug #867619. However, this bug was only reported on the upcoming Debian 10 (Buster). According to one of the maintainers (Evgeni Golov), this was fixed in 2.0.7-2. The problem? The latest available package version in Stretch is as of today (February 18th 2019) still 2.0.7-1:

# apt-cache show libpam-cgfs
Package: libpam-cgfs
Source: lxcfs
Version: 2.0.7-1+deb9u1
Installed-Size: 47
Maintainer: pkg-lxc
Architecture: amd64
Depends: libc6 (>= 2.14), libfuse2 (>= 2.2), libpam0g (>= 0.99.7.1), libpam-runtime (>= 1.0.1-6), systemd | cgroupfs-mount
Conflicts: libpam-cgm
Description-en: PAM module for managing cgroups for LXC
 LXCFS provides a FUSE based filesystem to improve the LXC experience
 within the containers.
 .
 This provides a Pluggable Authentication Module (PAM) to provide
 logged-in users with a set of cgroups which they can administer.
 This allows for instance unprivileged containers, and session
 management using cgroup process tracking.
Description-md5: e709f3eddd48d5ce8595be4d003fd4f5
Homepage: https://linuxcontainers.org
Section: admin
Priority: optional
Filename: pool/main/l/lxcfs/libpam-cgfs_2.0.7-1+deb9u1_amd64.deb
Size: 18332
MD5sum: df18b81dc8e1dabffa7be5eaf586dc01
SHA256: 76e265bfb9a361db019c2fc1dc2ad6cf2b58cc62528f160c1107b77a6377af00

So how can this be tackled?

There are several possibilities:

Note: I haven't tried these yet! Stand by!

1) Use a manually fixed and prepared package of libpam-cgfs from the Ubuntu suite, packaged by Stéphane Graber:

https://launchpad.net/ubuntu/+source/lxcfs/2.0.7-0ubuntu4/+build/12785691/+files/libpam-cgfs_2.0.7-0ubuntu4_amd64.deb

However these packages were made for an Ubuntu system, although they should be (pretty much) compatible with Debian Stretch.

2) Use the 2.0.7-2 packages from Debian maintainer Evgeni Golov:

https://people.debian.org/~evgeni/tmp/lxcfs/

However, these packages were made for Debian 10. To be tested...

3) Use Debian Stretch backports. 

stretch-backports offers the 2.0.8 version of LXC and related packages (see the installation sketch after the list):

  • lxcfs (2.0.8-1~bpo9+1)
  • libpam-cgfs (2.0.8-1~bpo9+1)
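
To pull these in, the backports repository must be enabled and the packages explicitly installed from it (a sketch, assuming a standard stretch-backports setup):

# echo "deb http://deb.debian.org/debian stretch-backports main" > /etc/apt/sources.list.d/stretch-backports.list
# apt-get update
# apt-get install -t stretch-backports lxcfs libpam-cgfs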

Interestingly, Pierre-Elliott Bécue, who uploaded the lxcfs package to backports, wrote this in the changelog:

lxcfs (2.0.8-1~bpo9+1) stretch-backports; urgency=medium

  * Team upload
  * Rebuild for stretch-backports.
  * This backport release is an alternative to 2.0.7-1 that has a couple of
    issues, and shouldn't have reached stable.
    See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=867619 for more
    intel.

 -- Pierre-Elliott Bécue   Sat, 17 Nov 2018 09:01:07 +0100

 "This backport release is an alternative to 2.0.7-1 that has a couple of issues, and shouldn't have reached stable. See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=867619 for more intel."

Ah, and here is the mentioned bug again ;-)

So let's try and see which of these potential solutions work.

Update February 19th 2019:
Today I continued my tests and finally got the cgroup limits to work on Debian Stretch with the following packages installed:

# dpkg -l|egrep "(lxc|libpam-cgfs)"
ii  liblxc1          1:2.0.7-2+deb9u2    amd64        Linux Containers userspace tools (library)
ii  libpam-cgfs      2.0.7-1+deb9u1      amd64        PAM module for managing cgroups for LXC
ii  lxc              1:2.0.7-2+deb9u2    amd64        Linux Containers userspace tools
ii  lxcfs            2.0.7-1+deb9u1      amd64        FUSE based filesystem for LXC
ii  python3-lxc      1:2.0.7-2+deb9u2    amd64        Linux Containers userspace tools (Python 3.x bindings)

On another Debian Stretch server I also successfully tested it with a newer lxcfs package from Debian stretch-backports (2.0.8-1~bpo9+1).

Additional Kernel parameters (cgroup_enable=memory swapaccount=1) were set in /etc/default/grub.
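
For reference, this typically looks like the following in /etc/default/grub (a sketch; merge the parameters into your existing line), followed by regenerating the grub configuration:

GRUB_CMDLINE_LINUX_DEFAULT="quiet cgroup_enable=memory swapaccount=1"

# update-grub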

However, as soon as I touched lxcfs (package upgrade or downgrade), I needed a reboot. Otherwise I would get the following error when trying to start a container with cgroup limits:

# lxc-start -n test -F
lxc-start: cgroups/cgfsng.c: cgfsng_setup_limits: 1949 Permission denied - Error setting memory.memsw.limit_in_bytes to 68G for test
lxc-start: start.c: lxc_spawn: 1190 Failed to setup cgroup limits for container "test".
lxc-start: start.c: __lxc_start: 1346 Failed to spawn container "test".
lxc-start: tools/lxc_start.c: main: 366 The container failed to start.
lxc-start: tools/lxc_start.c: main: 370 Additional information can be obtained by setting the --logfile and --logpriority options.

This is because lxcfs doesn't run anymore once the package was touched:

# systemctl status lxcfs
● lxcfs.service - FUSE filesystem for LXC
   Loaded: loaded (/lib/systemd/system/lxcfs.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2019-02-19 07:17:38 CET; 13h ago
     Docs: man:lxcfs(1)
 Main PID: 31389 (code=exited, status=1/FAILURE)
      CPU: 5ms

After a reboot:

# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.9.0-8-amd64 root=UUID=XXXXXXXX-XXXX-XXXX-XXXXXXXXXXXX ro quiet cgroup_enable=memory swapaccount=1

# lxc-start -n test -d
# lxc-ls -f
NAME   STATE   AUTOSTART GROUPS IPV4           IPV6
test   RUNNING 0         -      192.168.12.199 -    

The following cgroup limits were set by the way:

lxc.cgroup.cpuset.cpus = 1-12
lxc.cgroup.cpu.shares = 1024
lxc.cgroup.memory.limit_in_bytes = 64G
lxc.cgroup.memory.memsw.limit_in_bytes = 68G

 

Upgrade a Rancher 2 HA management cluster with helm
Thursday - Feb 14th 2019

Until recently, RKE (Rancher Kubernetes Engine) had to be used to upgrade Rancher to a newer version.

Since 2.0.8 it is possible to use helm for this. helm can be compared to the "apt" package manager for Debian based systems, just for Kubernetes clusters. It manages "repositories", and Rancher offers such a helm repository.
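
If the Rancher repository is not yet configured on your machine, it can be added with a single command (the stable channel URL matches the one shown in the repo list further below):

$ helm repo add rancher-stable https://releases.rancher.com/server-charts/stable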

You can initiate the upgrade from any machine you like, as long as you can access the Rancher 2 management URL and you have kubectl and helm installed locally. Read this article to learn how to install kubectl and how to connect to a Rancher 2 cluster. And read the helm installation instructions from Rancher.

Once you have kubectl and helm installed, you can now configure kubectl to connect to your Rancher 2 management cluster.
For this I have prepared a kube config yaml file:

$ export KUBECONFIG=~/.kube/local-teststage.yaml

Verify that you are able to connect to the Kubernetes cluster:

$ kubectl get nodes
NAME             STATUS    AGE       VERSION
192.168.253.15   Ready     98d       v1.11.3
192.168.253.16   Ready     98d       v1.11.3
192.168.253.17   Ready     98d       v1.11.3

Yep, these are the internal IPs of the Rancher 2 cluster "local". I can also verify this in the Rancher 2 UI:

Rancher 2 local cluster nodes 

Make sure your local helm version is up to date:

$ helm init --upgrade --service-account tiller
$HELM_HOME has been configured at /srv/ansible/.helm.

Tiller (the Helm server-side component) has been upgraded to the current version.
Happy Helming!

$ helm version
Client: &version.Version{SemVer:"v2.11.0", GitCommit:"2e55dbe1fdb5fdb96b75ff144a339489417b146b", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.11.0", GitCommit:"2e55dbe1fdb5fdb96b75ff144a339489417b146b", GitTreeState:"clean"}

As I mentioned before, helm is a package manager using repos. Let's make sure the rancher repository is active:

$ helm repo list
NAME              URL                                             
stable            https://kubernetes-charts.storage.googleapis.com
local             http://127.0.0.1:8879/charts                    
rancher-stable    https://releases.rancher.com/server-charts/stable

Let's get the latest updates from all the listed repos (comparable to apt-get update):

$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Skip local chart repository
...Successfully got an update from the "rancher-stable" chart repository
...Successfully got an update from the "stable" chart repository
Update Complete. ⎈ Happy Helming!⎈

Before you upgrade Rancher, check for specific package values (we need them in the upgrade command):

$ helm get values rancher
hostname: rancher2.example.com
ingress:
  tls:
    source: secret

So here we got two keys/values back:

  • hostname: rancher2.example.com (= Hostname of the Rancher 2 management URL/cluster)
  • ingress.tls.source: secret (= where to get the ingress certificates from or how to create the certificates)
See the Helm Chart Options for all available options.

With this information we can now launch the upgrade:

$ helm upgrade rancher rancher-stable/rancher --set hostname=rancher.example.com --set ingress.tls.source=secret
Release "rancher" has been upgraded. Happy Helming!
LAST DEPLOYED: Thu Feb 14 09:53:04 2019
NAMESPACE: cattle-system
STATUS: DEPLOYED

RESOURCES:
==> v1/Deployment
NAME     AGE
rancher  98d

==> v1beta1/Ingress
rancher  98d

==> v1/Pod(related)

NAME                      READY  STATUS   RESTARTS  AGE
rancher-5dc9f9b886-jhrrm  0/1    Pending  0         0s
rancher-6dc68bb996-j66lw  1/1    Running  1         72d
rancher-6dc68bb996-jrl7k  1/1    Running  0         72d
rancher-6dc68bb996-mjg8t  1/1    Running  0         72d

==> v1/ServiceAccount

NAME     AGE
rancher  98d

==> v1/ClusterRoleBinding
rancher  98d

==> v1/Service
rancher  98d


NOTES:
Rancher Server has been installed.

NOTE: Rancher may take several minutes to fully initialize. Please standby while Certificates are being issued and Ingress comes up.

Check out our docs at https://rancher.com/docs/rancher/v2.x/en/

Browse to https://rancher.example.com

Happy Containering!

In the Rancher UI it took a couple of seconds and then the version in the lower left corner changed from 1.2.1 to 1.2.6. The UI also stated that the cluster API was currently unavailable. It took around 5 minutes until the API was up again.

 

How to solve Rancher 1.x service stuck in removing (in progress)
Wednesday - Feb 13th 2019

Today I came across an annoying bug in Rancher 1.6 (currently running 1.6.26) where a service was stuck in removing state:

Rancher Service stuck removing

All containers of this service were already deleted in the user interface. I verified this on the Docker hosts using "docker ps -a" and yes, all container instances were correctly removed. But the service in Rancher was still stuck in removing.

Furthermore, in Admin -> Processes, the service.remove processes (which seem to be the cause of the service being stuck in "removing") never disappeared and were restarted every 2 minutes:

Rancher service.remove process running 

Rancher service.remove processes restarted

Although I'm not sure what caused this, the reason *might* be several actions happening on that particular service almost at the same time:

 Rancher audit log

As you can see, while I attempted a service rollback, another user deleted the same service at (almost) the same time. I wouldn't be surprised if this upset Rancher in such a way that the "delete" task completed faster than the "rollback", causing the "rollback" to hiccup the system. The second "delete" attempt was to see if it would somehow "force" the removal, but it didn't work. So much for the theory (only someone from Rancher could confirm this or, better, give the real reason for what happened); let's go solve this.

Because all attempts using the Rancher UI and API failed (the service stayed in removing state), I began my research and came across the following issues:

The last one (issue 16694) basically describes the exact same bug and I also shared my information there. Unfortunately the issue was closed recently (23 days ago at the time of this writing), indicating:

"With the release of Rancher 2.0, development on v1.6 is only limited to critical bug fixes and security patches."

That's a shame because I consider this bug a critical bug. But as I didn't have time to wait for a potential bugfix, I had to go ahead and fix the problem anyway.

The other links above often mention SQL queries directly in Rancher's database to solve the issue, but nobody so far has really solved the problem. Let's try and solve this in the DB then!

Spoiler + Disclaimer: The following steps worked for me and solved the problem. However, this should really be your last resort, and you have to make sure all containers launched by this service are really stopped/removed. Also make sure there are no volume mounts (or unmounts in that case) hanging, which could cause the service to be stuck in removing (and waiting for a host). You do this at your own risk! (You made a backup, right?)
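
On that note, dumping the Rancher database is a one-liner (a sketch; credentials are placeholders, and the database name "rancher" matches the table listing below):

# mysqldump -u root -p --single-transaction rancher > rancher-backup-$(date +%F).sql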

Let's take a closer look at the Rancher database. The following tables exist as of 1.6.26:

mysql> show tables;
+-----------------------------------------------+
| Tables_in_rancher                             |
+-----------------------------------------------+
| DATABASECHANGELOG                             |
| DATABASECHANGELOGLOCK                         |
| account                                       |
| account_link                                  |
| agent                                         |
| agent_group                                   |
| audit_log                                     |
| auth_token                                    |
| backup                                        |
| backup_target                                 |
| catalog                                       |
| catalog_category                              |
| catalog_file                                  |
| catalog_label                                 |
| catalog_template                              |
| catalog_template_category                     |
| catalog_version                               |
| catalog_version_label                         |
| certificate                                   |
| cluster                                       |
| cluster_host_map                              |
| cluster_membership                            |
| config_item                                   |
| config_item_status                            |
| container_event                               |
| credential                                    |
| credential_instance_map                       |
| data                                          |
| deployment_unit                               |
| dynamic_schema                                |
| dynamic_schema_role                           |
| environment                                   |
| external_event                                |
| external_handler                              |
| external_handler_external_handler_process_map |
| external_handler_process                      |
| generic_object                                |
| global_load_balancer                          |
| healthcheck_instance                          |
| healthcheck_instance_host_map                 |
| host                                          |
| host_ip_address_map                           |
| host_label_map                                |
| host_template                                 |
| host_vnet_map                                 |
| image                                         |
| image_storage_pool_map                        |
| instance                                      |
| instance_host_map                             |
| instance_label_map                            |
| instance_link                                 |
| ip_address                                    |
| ip_address_nic_map                            |
| ip_association                                |
| ip_pool                                       |
| label                                         |
| load_balancer                                 |
| load_balancer_certificate_map                 |
| load_balancer_config                          |
| load_balancer_config_listener_map             |
| load_balancer_host_map                        |
| load_balancer_listener                        |
| load_balancer_target                          |
| machine_driver                                |
| mount                                         |
| network                                       |
| network_driver                                |
| network_service                               |
| network_service_provider                      |
| network_service_provider_instance_map         |
| nic                                           |
| offering                                      |
| physical_host                                 |
| port                                          |
| process_execution                             |
| process_instance                              |
| project_member                                |
| project_template                              |
| region                                        |
| resource_pool                                 |
| scheduled_upgrade                             |
| secret                                        |
| service                                       |
| service_consume_map                           |
| service_event                                 |
| service_expose_map                            |
| service_index                                 |
| service_log                                   |
| setting                                       |
| snapshot                                      |
| snapshot_storage_pool_map                     |
| storage_driver                                |
| storage_pool                                  |
| storage_pool_host_map                         |
| subnet                                        |
| subnet_vnet_map                               |
| task                                          |
| task_instance                                 |
| ui_challenge                                  |
| user_preference                               |
| vnet                                          |
| volume                                        |
| volume_storage_pool_map                       |
| volume_template                               |
| zone                                          |
+-----------------------------------------------+
105 rows in set (0.03 sec)

In one of the links above, the table process_instance was mentioned. This table looks like this:

mysql> describe process_instance;
+---------------------------+--------------+------+-----+---------+----------------+
| Field                     | Type         | Null | Key | Default | Extra          |
+---------------------------+--------------+------+-----+---------+----------------+
| id                        | bigint(20)   | NO   | PRI | NULL    | auto_increment |
| start_time                | datetime     | YES  | MUL | NULL    |                |
| end_time                  | datetime     | YES  | MUL | NULL    |                |
| data                      | mediumtext   | YES  |     | NULL    |                |
| priority                  | int(11)      | YES  | MUL | 0       |                |
| process_name              | varchar(128) | YES  |     | NULL    |                |
| resource_type             | varchar(128) | YES  |     | NULL    |                |
| resource_id               | varchar(128) | YES  |     | NULL    |                |
| result                    | varchar(128) | YES  |     | NULL    |                |
| exit_reason               | varchar(128) | YES  |     | NULL    |                |
| phase                     | varchar(128) | YES  |     | NULL    |                |
| start_process_server_id   | varchar(128) | YES  |     | NULL    |                |
| running_process_server_id | varchar(128) | YES  |     | NULL    |                |
| execution_count           | bigint(20)   | NO   |     | 0       |                |
| run_after                 | datetime     | YES  | MUL | NULL    |                |
| account_id                | bigint(20)   | YES  | MUL | NULL    |                |
+---------------------------+--------------+------+-----+---------+----------------+
16 rows in set (0.03 sec)

After checking some entries of this table, I figured that "still running" processes (as seen in the UI) don't have an end_time, and I can also filter for a specific process_name:

mysql> select * from process_instance where end_time is NULL and process_name = 'service.remove';
+----------+---------------------+----------+------+----------+----------------+---------------+-------------+--------+-------------+----------+-------------------------+---------------------------+-----------------+---------------------+------------+
| id       | start_time          | end_time | data | priority | process_name   | resource_type | resource_id | result | exit_reason | phase    | start_process_server_id | running_process_server_id | execution_count | run_after           | account_id |
+----------+---------------------+----------+------+----------+----------------+---------------+-------------+--------+-------------+----------+-------------------------+---------------------------+-----------------+---------------------+------------+
| 16772517 | 2019-02-13 12:09:56 | NULL     | {}   |        0 | service.remove | service       | 534         | NULL   | NULL        | HANDLERS | 172.17.0.2              | 172.17.0.2                |              51 | 2019-02-13 13:34:35 |       3480 |
| 16785969 | 2019-02-13 12:46:11 | NULL     | {}   |        0 | service.remove | service       | 534         | NULL   | NULL        | HANDLERS | 172.17.0.2              | 172.17.0.2                |              33 | 2019-02-13 13:34:53 |       3480 |
+----------+---------------------+----------+------+----------+----------------+---------------+-------------+--------+-------------+----------+-------------------------+---------------------------+-----------------+---------------------+------------+
2 rows in set (0.03 sec)

Surprise surprise, these are the same entries as seen in the Rancher UI in Admin -> Processes!

From this query response (but also from the UI) we can grab the service's resource id (534). Let's take a closer look at the service table:

mysql> describe service;
+--------------------+---------------+------+-----+---------+----------------+
| Field              | Type          | Null | Key | Default | Extra          |
+--------------------+---------------+------+-----+---------+----------------+
| id                 | bigint(20)    | NO   | PRI | NULL    | auto_increment |
| name               | varchar(255)  | YES  | MUL | NULL    |                |
| account_id         | bigint(20)    | YES  | MUL | NULL    |                |
| kind               | varchar(255)  | NO   |     | NULL    |                |
| uuid               | varchar(128)  | NO   | UNI | NULL    |                |
| description        | varchar(1024) | YES  |     | NULL    |                |
| state              | varchar(128)  | NO   | MUL | NULL    |                |
| created            | datetime      | YES  |     | NULL    |                |
| removed            | datetime      | YES  | MUL | NULL    |                |
| remove_time        | datetime      | YES  | MUL | NULL    |                |
| data               | mediumtext    | YES  |     | NULL    |                |
| environment_id     | bigint(20)    | YES  | MUL | NULL    |                |
| vip                | varchar(255)  | YES  |     | NULL    |                |
| create_index       | bigint(20)    | YES  |     | NULL    |                |
| selector_link      | varchar(4096) | YES  |     | NULL    |                |
| selector_container | varchar(4096) | YES  |     | NULL    |                |
| external_id        | varchar(255)  | YES  | MUL | NULL    |                |
| health_state       | varchar(128)  | YES  |     | NULL    |                |
| system             | bit(1)        | NO   |     | b'0'    |                |
| skip               | bit(1)        | NO   |     | b'0'    |                |
+--------------------+---------------+------+-----+---------+----------------+
20 rows in set (0.03 sec)

Each service has a unique id, so let's check if we can find a service with id 534 as seen before:

mysql> select id,name,account_id,kind,uuid,description,state,created,removed,remove_time,environment_id,external_id,health_state from service where id = 534;
+-----+----------------------+------------+---------+--------------------------------------+-------------+----------+---------------------+---------+-------------+----------------+-------------+--------------+
| id  | name                 | account_id | kind    | uuid                                 | description | state    | created             | removed | remove_time | environment_id | external_id | health_state |
+-----+----------------------+------------+---------+--------------------------------------+-------------+----------+---------------------+---------+-------------+----------------+-------------+--------------+
| 534 | Q-Election-Executive |       3480 | service | 710e9254-d03d-4373-b37b-4f7ef854e2d4 | NULL        | removing | 2018-02-09 13:13:21 | NULL    | NULL        |             74 | NULL        | unhealthy    |
+-----+----------------------+------------+---------+--------------------------------------+-------------+----------+---------------------+---------+-------------+----------------+-------------+--------------+
1 row in set (0.03 sec)

Note: Basically I just removed the "data" column from the SELECT statement as it contained some sensitive data.

So far so good, this is definitely our service stuck in removing!
What about services which were successfully removed so far? How do they look?

mysql> select id,name,account_id,kind,state,removed,remove_time,health_state from service where state = 'removed';
+------+-----------------------------------+------------+---------------------+---------+---------------------+---------------------+--------------+
| id   | name                              | account_id | kind                | state   | removed             | remove_time         | health_state |
+------+-----------------------------------+------------+---------------------+---------+---------------------+---------------------+--------------+
|   49 | Loadbalancer-Spellchecker-Staging |         79 | loadBalancerService | removed | 2017-06-27 14:18:15 | 2017-06-27 14:26:27 | unhealthy    |
|   57 | rancher-compose-executor          |       3078 | service             | removed | 2016-12-15 14:00:44 | 2016-12-15 14:09:54 | unhealthy    |
|   61 | load-balancer-swarm               |       3078 | loadBalancerService | removed | 2016-12-15 14:00:45 | 2016-12-15 14:02:38 | unhealthy    |
|   62 | load-balancer                     |       3078 | loadBalancerService | removed | 2016-12-15 14:00:45 | 2016-12-15 14:08:16 | unhealthy    |
|  146 | Loadbalancer-Intern               |         79 | loadBalancerService | removed | 2017-02-01 12:03:34 | 2017-02-01 12:10:31 | unhealthy    |
|  231 | Departments-Tool                  |       5160 | service             | removed | 2019-02-12 11:11:40 | 2019-02-12 11:16:10 | unhealthy    |
|  403 | Departments-Tool-Regio            |       5160 | service             | removed | 2019-02-12 11:11:37 | 2019-02-12 11:13:35 | unhealthy    |
|  771 | Flugplan                          |       5160 | service             | removed | 2018-11-27 06:48:53 | 2018-11-27 06:49:11 | unhealthy    |
|  891 | st-1830-server                    |      59885 | service             | removed | 2019-01-14 16:20:16 | 2019-01-14 16:30:05 | unhealthy    |
|  917 | st-1830-server                    |       3480 | service             | removed | 2018-12-10 16:41:23 | 2018-12-10 16:41:25 | healthy      |
|  933 | Kenny-Varnish                     |         12 | service             | removed | 2018-12-14 13:03:16 | 2018-12-14 13:04:49 | unhealthy    |
|  941 | Kenny-Varnish                     |         79 | service             | removed | 2018-12-14 13:03:35 | 2018-12-14 13:06:28 | unhealthy    |
|  944 | Kenny-Varnish                     |         80 | service             | removed | 2018-12-14 13:03:44 | 2018-12-14 13:05:57 | unhealthy    |
| 1036 | KennyApp                          |       8735 | service             | removed | 2018-11-30 15:14:13 | 2018-11-30 15:14:45 | healthy      |
| 1043 | KennyApp                          |      60825 | service             | removed | 2019-02-13 11:33:31 | 2019-02-13 11:42:02 | healthy      |
| 1094 | Thunder-Judi                      |       5161 | service             | removed | 2019-02-12 11:03:29 | 2019-02-12 11:09:40 | unhealthy    |
| 1099 | claudiotest                       |      59885 | service             | removed | 2019-02-13 13:20:59 | 2019-02-13 13:28:41 | healthy      |
+------+-----------------------------------+------------+---------------------+---------+---------------------+---------------------+--------------+
17 rows in set (0.03 sec)

Obviously the state is set to "removed" (not removing) and the columns removed and remove_time contain a timestamp. The health_state doesn't seem to matter here.

Let's try and manually set our confused service to state "removed", including 'removed' and 'remove_time' timestamps:

mysql> UPDATE service SET state = 'removed', removed = '2019-02-13 12:50:00', remove_time = '2019-02-13 12:48:00' WHERE id = 534;
Query OK, 1 row affected (0.06 sec)
Rows matched: 1  Changed: 1  Warnings: 0

Let's take a look at our service again:

mysql> select id,name,account_id,kind,uuid,description,state,created,removed,remove_time,environment_id,external_id,health_state from service where id = 534;
+-----+----------------------+------------+---------+--------------------------------------+-------------+---------+---------------------+---------------------+---------------------+----------------+-------------+--------------+
| id  | name                 | account_id | kind    | uuid                                 | description | state   | created             | removed             | remove_time         | environment_id | external_id | health_state |
+-----+----------------------+------------+---------+--------------------------------------+-------------+---------+---------------------+---------------------+---------------------+----------------+-------------+--------------+
| 534 | Q-Election-Executive |       3480 | service | 710e9254-d03d-4373-b37b-4f7ef854e2d4 | NULL        | removed | 2018-02-09 13:13:21 | 2019-02-13 12:50:00 | 2019-02-13 12:48:00 |             74 | NULL        | unhealthy    |
+-----+----------------------+------------+---------+--------------------------------------+-------------+---------+---------------------+---------------------+---------------------+----------------+-------------+--------------+
1 row in set (0.03 sec)

According to the Rancher database, this service is now properly removed. What about the Rancher UI?

The service Q-Election-Executive, which was stuck in "Removing" has disappeared from the list:

Rancher Service Stuck in Removing has disappeared

Note: Q-Election-Executive-a was a clone of the original service, created as a workaround. But as the original service was still in removing status, the clone could not be renamed.

And also the service.remove processes have disappeared from the "Running" tab:

Rancher service.remove processes gone 

Now the cloned service could be renamed to run under the original service name again. Yes!

 

Getting the EDIMAX EW-7811UN to work with Linux Mint 18.1
Monday - Feb 11th 2019

At work we have an "emergency" machine available. The goal of this machine is to have a dedicated Internet link via another ISP. This also helps us to simulate and compare accessing web applications from internal networks vs. external networks.

The machine itself runs Linux Mint 18.1. While I have never experienced any major issues with the Linux Mint installation (17.3) on my notebook (Dell Latitude E7440) so far, this machine has random issues with wireless connectivity. A simple ping to 8.8.8.8 revealed extreme variations in round-trip times: sometimes the response time was 20ms, sometimes it jumped up to 5000ms. Sometimes the WLAN connectivity was lost completely.

"That's it, I've had it" I thought and ordered a new USB wireless adapter to replace the current wireless adapter from Ralink (Product: 802.11 n WLAN, idVendor=148f, idProduct=5370). Because I didn't really have the time to fiddle around with broken wlreless nic drivers, I ordered a Edimax EW-7811UN. Because it advertises to work with Mac, Windows and Linux and it is "ideal for Rasberry Pi" (which uses Debian in the background).

 Edimax EW 7811

To my understanding this should work out of the box. *buzzer* EEERRRR *buzzer* Mistake!

Once I connected the new EW-7811Un adapter to our emergency machine, it got discovered just fine in dmesg and I was able to select a wireless LAN from the wireless connections. But as soon as I clicked on my target WLAN, the machine froze. Ugh. Reset the machine.
Tried it again and Linux Mint froze again when I tried to connect. Reset again. Third time's the charm? But again, a freeze immediately when I clicked on the WLAN connect. Dang it! I will have to spend more time on that after all.

Before going into the solution, let's take a look at the current OS versions of this machine:

$ cat /etc/*release*
DISTRIB_ID=LinuxMint
DISTRIB_RELEASE=18.1
DISTRIB_CODENAME=serena
DISTRIB_DESCRIPTION="Linux Mint 18.1 Serena"
NAME="Linux Mint"
VERSION="18.1 (Serena)"
ID=linuxmint
ID_LIKE=ubuntu
PRETTY_NAME="Linux Mint 18.1"
VERSION_ID="18.1"
HOME_URL="http://www.linuxmint.com/"
SUPPORT_URL="http://forums.linuxmint.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/linuxmint/"
VERSION_CODENAME=serena
UBUNTU_CODENAME=xenial

$ uname -a
Linux emergency 4.8.17-040817-generic #201701090438 SMP Mon Jan 9 09:40:28 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

When I plugged in the Edimax Wi-Fi Nano USB adapter, this was logged just before the freeze:

Feb 11 09:52:11 emergency kernel: [348572.061670] usb 1-4: new high-speed USB device number 5 using xhci_hcd
Feb 11 09:52:11 emergency kernel: [348572.203106] usb 1-4: New USB device found, idVendor=7392, idProduct=7811
Feb 11 09:52:11 emergency kernel: [348572.203111] usb 1-4: New USB device strings: Mfr=1, Product=2, SerialNumber=3
Feb 11 09:52:11 emergency kernel: [348572.203114] usb 1-4: Product: 802.11n WLAN Adapter
Feb 11 09:52:11 emergency kernel: [348572.203116] usb 1-4: Manufacturer: Realtek
Feb 11 09:52:11 emergency kernel: [348572.203119] usb 1-4: SerialNumber: 00e04c000001
Feb 11 09:52:12 emergency kernel: [348573.311847] rtl8192cu: Chip version 0x10
Feb 11 09:52:12 emergency kernel: [348573.344421] rtl8192cu: Board Type 0
Feb 11 09:52:12 emergency kernel: [348573.344494] rtl_usb: rx_max_size 15360, rx_urb_num 8, in_ep 1
Feb 11 09:52:12 emergency kernel: [348573.344537] rtl8192cu: Loading firmware rtlwifi/rtl8192cufw_TMSC.bin
Feb 11 09:52:12 emergency kernel: [348573.347070] ieee80211 phy1: Selected rate control algorithm 'rtl_rc'
Feb 11 09:52:12 emergency kernel: [348573.348931] usbcore: registered new interface driver rtl8192cu
Feb 11 09:52:12 emergency NetworkManager[800]:   [1549875132.6946] (wlan0): using nl80211 for WiFi device control
Feb 11 09:52:12 emergency NetworkManager[800]:   [1549875132.6948] device (wlan0): driver supports Access Point (AP) mode
Feb 11 09:52:12 emergency NetworkManager[800]:   [1549875132.6968] manager: (wlan0): new 802.11 WiFi device (/org/freedesktop/NetworkManager/Devices/3)
Feb 11 09:52:12 emergency kernel: [348573.400854] usbcore: registered new interface driver rtl8xxxu
Feb 11 09:52:12 emergency NetworkManager[800]:   [1549875132.7526] rfkill2: found WiFi radio killswitch (at /sys/devices/pci0000:00/0000:00:14.0/usb1/1-4/1-4:1.0/ieee80211/phy1/rfkill2) (driver rtl8192cu)
Feb 11 09:52:12 emergency kernel: [348573.406344] rtl8192cu 1-4:1.0 wlx74da38f4dfe0: renamed from wlan0
Feb 11 09:52:12 emergency NetworkManager[800]:   [1549875132.7715] device (wlan0): interface index 4 renamed iface from 'wlan0' to 'wlx74da38f4dfe0'
Feb 11 09:52:12 emergency NetworkManager[800]:   [1549875132.7781] devices added (path: /sys/devices/pci0000:00/0000:00:14.0/usb1/1-4/1-4:1.0/net/wlx74da38f4dfe0, iface: wlx74da38f4dfe0)
Feb 11 09:52:12 emergency NetworkManager[800]:   [1549875132.7781] device added (path: /sys/devices/pci0000:00/0000:00:14.0/usb1/1-4/1-4:1.0/net/wlx74da38f4dfe0, iface: wlx74da38f4dfe0): no ifupdown configuration found.
Feb 11 09:52:12 emergency NetworkManager[800]:   [1549875132.7790] device (wlx74da38f4dfe0): state change: unmanaged -> unavailable (reason 'managed') [10 20 2]
Feb 11 09:52:12 emergency kernel: [348573.433745] IPv6: ADDRCONF(NETDEV_UP): wlx74da38f4dfe0: link is not ready
Feb 11 09:52:12 emergency kernel: [348573.435559] rtl8192cu: MAC auto ON okay!
Feb 11 09:52:12 emergency kernel: [348573.448007] rtl8192cu: Tx queue select: 0x05
Feb 11 09:52:13 emergency NetworkManager[800]:   [1549875133.1725] (wlx74da38f4dfe0): using nl80211 for WiFi device control
Feb 11 09:52:13 emergency kernel: [348573.826813] IPv6: ADDRCONF(NETDEV_UP): wlx74da38f4dfe0: link is not ready
Feb 11 09:52:13 emergency NetworkManager[800]:   [1549875133.2179] device (wlx74da38f4dfe0): supplicant interface state: starting -> ready
Feb 11 09:52:13 emergency NetworkManager[800]:   [1549875133.2180] device (wlx74da38f4dfe0): state change: unavailable -> disconnected (reason 'supplicant-available') [20 30 42]
Feb 11 09:52:13 emergency kernel: [348573.872647] IPv6: ADDRCONF(NETDEV_UP): wlx74da38f4dfe0: link is not ready
Feb 11 09:52:14 emergency NetworkManager[800]:   [1549875134.1708] device (wlx74da38f4dfe0): supplicant interface state: ready -> inactive

Something important to read out from here: the Edimax EW-7811UN uses a Realtek chip, which uses the rtl8192cu driver. This driver seems to be defective, according to several posts:

  • https://adamscheller.com/systems-administration/rtl8192cu-fix-wifi/
  • https://forums.linuxmint.com/viewtopic.php?t=94495&f=53
  • https://edimax.freshdesk.com/support/solutions/articles/14000035492-how-to-resolve-ew-7811un-built-in-driver-issues-in-linux-kernel-v3-10-or-higher
  • http://www.cianmcgovern.com/getting-the-edimax-ew-7811un-working-on-linux/#comment-764845109
  • https://askubuntu.com/questions/509498/is-there-a-standard-wifi-driver-for-the-edimax-ew-7811un

Fortunately, a driver fix can be installed as a dynamic Kernel module (via DKMS, Dynamic Kernel Module Support). The following steps explain how to do it.

Note: As of Kernel 4.4 and later, the Realtek chipset is _supposed_ to be handled by a newer driver called rtl8xxxu (see https://wireless.wiki.kernel.org/en/users/Drivers/rtl819x). But obviously this was not the case here with Kernel 4.8.

Install necessary build tools and the Kernel headers:

$ sudo apt-get install --reinstall linux-headers-$(uname -r) linux-headers-generic build-essential dkms git

Clone the following repository from Github to get the source code of the Kernel module:

$ git clone https://github.com/pvaret/rtl8192cu-fixes.git

Run dkms to add the module to the build tree:

$ sudo dkms add ./rtl8192cu-fixes

Install the new module (this will take some time):

$ sudo dkms install 8192cu/1.11

Kernel preparation unnecessary for this kernel.  Skipping...

Building module:
cleaning build area....
make KERNELRELEASE=4.8.17-040817-generic -C /lib/modules/4.8.17-040817-generic/build M=/var/lib/dkms/8192cu/1.11/build...........
cleaning build area....

DKMS: build completed.

8192cu.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/4.8.17-040817-generic/kernel/drivers/net/wireless//

depmod.....

Backing up initrd.img-4.8.17-040817-generic to /boot/initrd.img-4.8.17-040817-generic.old-dkms
Making new initrd.img-4.8.17-040817-generic
(If next boot fails, revert to initrd.img-4.8.17-040817-generic.old-dkms image)
update-initramfs..........

DKMS: install completed.

Note: As of this writing, the module's version is 1.11. This might of course change.

Copy the module blacklist config, to disable the Kernel's "internal" driver of rtl8192cu:

$ sudo cp ./rtl8192cu-fixes/blacklist-native-rtl8192.conf /etc/modprobe.d/

Recalculate the module dependencies:

$ sudo depmod -a

I now rebooted the machine to make sure this works properly after booting:

$ sudo reboot

After the machine was up again, I connected the Edimax USB adapter. Interestingly, the name of the adapter changed in the wireless connection list (in the UI): with the original driver it was shown as a Realtek adapter, now it just says "Wifi".

Nevertheless I connected to the destination wireless LAN.... No freeze! And the connection was established!

Edimax EW-7811UN Linux Mint 18.1 connection established 

I verified with some ping checks for a couple of minutes and saw a much more stable connection than with the previous Ralink adapter.
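
Such a quick comparison can be as simple as letting ping run for a while and looking at the min/avg/max summary it prints at the end:

$ ping -c 120 8.8.8.8 | tail -n 2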

By the way, the module is shown as being in use for wireless:

$ lsmod | grep 8192
8192cu                569344  0
cfg80211              581632  5 iwlmvm,iwlwifi,rt2x00lib,mac80211,8192cu

Success - technically speaking! My plan was to avoid all these manual tasks after all.

 

It is 2019, time to upgrade your Ubuntu 14.04 Trusty machines!
Wednesday - Jan 30th 2019

News! It's 2019!

Usually the uneven years are "boring". There is no big football tournament in summer. That's probably the reason why Canonical decided to use uneven years as the EOL years for their long term support (LTS) versions.

Ubuntu 14.04 (Trusty Tahr) was released in April 2014 (hence 14.04) and came with a 5 year LTS "patching warranty". That's 2019. Oh no, that's now!

So it's time to upgrade the still running (and damn stable! pre-systemd... ;P) 14.04 machines.

For the sake of documentation, I noted down my double-release-upgrade steps from 14.04 -> 16.04 -> 18.04.

0. Before you even begin, make sure you have a backup. Create a snapshot of your VM or of your LXC LVs, dd your hard drive, or whatever you're using.

1. Run apt-get update and apt-get dist-upgrade on existing version to get the latest updates.

# apt-get update && apt-get dist-upgrade

Note: I'd even reboot the machine afterwards, to make sure the latest Kernel is running.

2. Copy existing apt source list file /etc/apt/sources.list

# cp -p  /etc/apt/sources.list{,.trusty}

3. Disable special sources in /etc/apt/sources.list.d/ directory -> rename files to end with .disabled or similar

# for file in $(ls /etc/apt/sources.list.d/*.list); do mv ${file} ${file}.disabled; done

4. Create the new /etc/apt/sources.list for 16.04 (xenial):

deb http://ch.archive.ubuntu.com/ubuntu/ xenial main restricted universe multiverse
deb http://ch.archive.ubuntu.com/ubuntu/ xenial-updates main restricted universe multiverse
deb http://security.ubuntu.com/ubuntu xenial-security main restricted universe multiverse

Note: Obviously you should use your own local mirror or country.

5. Run apt-get update and apt-get dist-upgrade. If apt asks to overwrite config files, usually choose "Y" to get the new config file from the updated package, unless you're sure that you require the current config file.

# apt-get update && apt-get dist-upgrade

6. If no errors occurred during dist-upgrade, reboot

# reboot

7. Verify system/applications work

8. Copy existing apt source list file /etc/apt/sources.list -> /etc/apt/sources.list.xenial

# cp -p /etc/apt/sources.list{,.xenial}

9. Create the new /etc/apt/sources.list for 18.04 (bionic):

deb http://ch.archive.ubuntu.com/ubuntu/ bionic main restricted universe multiverse
deb http://ch.archive.ubuntu.com/ubuntu/ bionic-updates main restricted universe multiverse
deb http://security.ubuntu.com/ubuntu bionic-security main restricted universe multiverse

10. Run apt-get update and apt-get dist-upgrade. If apt asks to overwrite config files, usually choose "Y" to get the new config file from the updated package, unless you're sure that you require the current config file (in my case that was only nrpe.cfg).

# apt-get update && apt-get dist-upgrade

11. If no errors occurred during dist-upgrade, reboot

# reboot

12. Verify system/applications work

13. Enable the additional apt sources in /etc/apt/sources.list.d again (adjust them to match the new distribution version; see the sketch below). Maybe this requires manual research to see if these additional/third party repos even support the new distribution version.
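
A minimal sketch to undo the rename from step 3 (the distribution codename inside each file still needs to be adjusted manually):

# for file in $(ls /etc/apt/sources.list.d/*.disabled); do mv ${file} ${file%.disabled}; done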

Hopefully these steps worked as well for you as they did for me. It always depends on the applications you're running on the Ubuntu machine, of course.

 

Elasticsearch ignored disk watermark settings and enforced read only index
Monday - Jan 28th 2019

When Elasticsearch experiences a disk full event, its "defense mechanism" is to try to move shards to another cluster member or, when nothing helps, set its indexes to read-only.
By default Elasticsearch starts to react at 85% disk usage (filesystem usage). To change the thresholds, the cluster.routing.allocation.disk.watermark parameters can be used. I already wrote about this in an older article, "ElasticSearch stopped to assign shards due to low disk space", back in December 2017. Back then I was using Elasticsearch 5.x.

This morning I came across a similar issue with Elasticsearch 6.5.4. In Kibana no new log entries were shown; the last entries were from Friday evening:

Elasticsearch stopped logging, seen in Kibana Graph 

When I looked at the Elasticsearch log files, I found a lot of lines like this one:

[2019-01-28T08:52:55,038][ERROR][o.e.x.w.e.ExecutionService] [inf-elkesi01-p] could not store triggered watch with id [prOmO-8rSreTrgLGraNv6w_kibana_version_mismatch_672f0d26-e20b-40d0-a54c-006543689e7c-2019-01-28T07:52:55.036Z]: [ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]]

So Elasticsearch decided to set the read-only flag. But why? Although disk usage was at 96%, I remembered that we had set different watermark thresholds, based on absolute sizes instead of percentages:

# df -h | grep elastic
/dev/mapper/vges-lves      ext4         3.9T  3.7T  168G  96% /var/lib/elasticsearch

# grep watermark /etc/elasticsearch/elasticsearch.yml
#cluster.routing.allocation.disk.watermark.low: "95%"
# 20180719: Set new watermarks
cluster.routing.allocation.disk.watermark.low: "100G"
cluster.routing.allocation.disk.watermark.high: "50G"
cluster.routing.allocation.disk.watermark.flood_stage: "10G"

Elasticsearch should only start to be aware of high disk usage when there's only 100GB left, yet it seems that the defaults were used. According to the documentation:

 cluster.routing.allocation.disk.watermark.low
    Controls the low watermark for disk usage. It defaults to 85%

 cluster.routing.allocation.disk.watermark.high
    Controls the high watermark. It defaults to 90%

 cluster.routing.allocation.disk.watermark.flood_stage
    Controls the flood stage watermark. It defaults to 95%, meaning that Elasticsearch enforces a read-only index block on every index

Even though the watermark thresholds were defined in elasticsearch.yml, the cluster settings didn't show them:

# curl -s http://localhost:9200/_cluster/settings -u elastic:XXX
{"persistent":{},"transient":{}}

Note: I'm actually not sure if the parameters defined in the config file are supposed to show up in the API. Maybe this is normal. Couldn't find any info for this.
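
One way to see which values the cluster effectively uses is to include the defaults in the settings query; this may also reveal where a value comes from (a sketch using the same credentials placeholder as above):

# curl -s "http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true" -u elastic:XXX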

So I decided to set the thresholds on the Elasticsearch cluster using the API:

# curl -X PUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings -u elastic:XXX -d '{ "persistent": { "cluster.routing.allocation.disk.watermark.low": "100gb", "cluster.routing.allocation.disk.watermark.high": "50gb", "cluster.routing.allocation.disk.watermark.flood_stage": "10gb", "cluster.info.update.interval": "1m" } }'
{"acknowledged":true,"persistent":{"cluster":{"routing":{"allocation":{"disk":{"watermark":{"low":"100gb","flood_stage":"10gb","high":"50gb"}}}},"info":{"update":{"interval":"1m"}}}},"transient":{}}

Note: I used a "persistent" setting to survive cluster restarts.

The watermark thresholds now show up in the API:

# curl -s http://localhost:9200/_cluster/settings -u elastic:XXX
{"persistent":{"cluster":{"routing":{"allocation":{"disk":{"watermark":{"low":"100gb","flood_stage":"10gb","high":"50gb"}}}},"info":{"update":{"interval":"1m"}}}},"transient":{}}

But the cluster was still read-only at this time. I decided to manually delete some older indexes and freed up quite some space:

# df -h | grep elastic
/dev/mapper/vges-lves      ext4         3.9T  3.6T  279G  93% /var/lib/elasticsearch

So we're far away from the thresholds now, yet still no new data.

I tried a full restart of the cluster and waited until the cluster was green again. But still, no data coming in. Logstash reports:

[2019-01-28T11:00:56,068][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 403 ({"type"=>"cluster_block_exception", "reason"=>"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"})

During my research I came across a blog post which shed some light on the read-only situation:

" [...] elasticsearch is switching to read-only if it cannot index more documents because your hard drive is full [...] Elasticsearch will not switch back automatically [...] "

Oh? So I basically need to tell Elasticsearch "hey, you're good and stop being read-only"? Let's do this:

# curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}' -u elastic:XXX
{"acknowledged":true}

Right after this, data started to arrive in ES again!

To sum it up, there are two issues and solutions here:

1) The watermark thresholds from the Elasticsearch config file were obviously ignored. As we have a support contract with Elastic, I will ask in a ticket why this happened. To work around this for now, I defined the thresholds manually via the Elasticsearch API using persistent settings.

2) When Elasticsearch switches to read-only mode, it will not recover automatically. You will have to manually reset this setting in the API.
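
To verify that no index still carries the read-only block after such a reset, the index settings can be queried (a sketch; filter_path simply trims the response down to the relevant keys):

# curl -s "http://localhost:9200/_all/_settings?filter_path=*.settings.index.blocks" -u elastic:XXX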

 

HAProxy backend server behind AWS LB remains down with HTTP 503
Friday - Jan 25th 2019

A few days ago I came across an issue on an internal HAProxy (1.6.3) which uses a backend server in the AWS cloud. The backend server in this case was using a DNS record which was a CNAME to an AWS load balancer. 

Over the last couple of weeks this particular backend reported being down several times and only a manual reload of HAProxy would resolve the issue.

After a detailed analysis, I came to the conclusion that this is related to HAProxy's internal DNS caching and the fact that AWS changes the DNS records of their load balancers (sometimes more, sometimes less often).

I posted the analysis and solution as a response on Stackoverflow, but I'll also share it here.

The HAProxy running in our internal networks would suddenly take this backend server DOWN with an L7STS/503 check result, while our monitoring was accessing the backend server (directly) just fine. As we run an HAProxy pair (LB01 and LB02), a reload of LB01 immediately worked and the backend server was UP again. On LB02 (not reloaded on purpose) this backend server remained down.

All this seems to be related to a DNS change of the AWS LB and how HAProxy does DNS caching. By default, HAProxy resolves all DNS records (e.g. for backends) at startup/reload. These resolved DNS records then stay in HAProxy's own DNS cache, so you would have to launch a reload of HAProxy to renew the DNS cache.

Another, and without doubt better, solution is to define DNS servers and a TTL for HAProxy's internal DNS cache. This has been possible since HAProxy version 1.6, with a config snippet like this:

global
[...]

defaults
[...]

resolvers mydns
  nameserver dnsmasq 127.0.0.1:53
  nameserver dns1 192.168.1.1:53
  nameserver dns2 192.168.1.253:53
  hold valid 60s

frontend app-in
  bind *:8080
  default_backend app-out

backend app-out
  server appincloud myawslb.example.com:443 check inter 2s ssl verify none resolvers mydns resolve-prefer ipv4 

This defines a DNS resolvers section called "mydns", using the DNS servers listed in the entries starting with "nameserver". The internal DNS cache is kept valid for 60 seconds, defined by "hold valid 60s". In the backend server's definition you then refer to this resolvers section by adding "resolvers mydns". In this example IPv4 addresses are preferred by adding "resolve-prefer ipv4" (the default is to use ipv6).

Note that in order to use "resolvers" on a backend server, "check" must be defined as well. The DNS lookup happens whenever the backend server check is triggered. In this example "check inter 2s" is defined, which means a DNS lookup would happen every 2 seconds. That would be quite a lot of lookups. By setting the internal "hold" cache to 60 seconds you limit the number of DNS lookups until the cache expires; a new DNS lookup should therefore happen at the latest after 62 seconds.

Starting with HAProxy version 1.8 there is an even more advanced possibility called "Service Discovery over DNS", which uses DNS SRV records. These records contain additional response fields such as priorities and weights, which HAProxy can parse to update the backend servers accordingly.
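
I haven't used this on the setup described here, but a minimal sketch of such an SRV-based backend in HAProxy 1.8+ could look like this, assuming a hypothetical SRV record _https._tcp.myawslb.example.com and the "mydns" resolvers section from above:

backend app-out
  server-template appincloud 5 _https._tcp.myawslb.example.com resolvers mydns resolve-prefer ipv4 check inter 2s ssl verify none

HAProxy then creates up to 5 server slots (appincloud1 to appincloud5) and fills them with the hosts, ports and weights returned in the SRV record.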


 

How to play an audio file on the command line or as a cron job in Linux
Tuesday - Jan 22nd 2019 - by - (0 comments)

In October 2016 I already wrote about how a multimedia file can be played with VLC from a cron job (Play a multimedia file in VLC as cron job).

The idea is still the same as in that article: the cron job should play the "It's coffee time" audio file. But launching a full VLC player just to play an audio file is kind of overkill.

Let's first create an audio file from the Youtube video using "youtube-dl":

$ youtube-dl -x https://www.youtube.com/watch?v=6SRXUufvZUE

The -x parameter extracts the audio from the video file, leaving you with just the sound of the video: COFFEE-TIME-6SRXUufvZUE.m4a.
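
In case your player cannot handle m4a, youtube-dl can also transcode the extracted audio directly (this requires ffmpeg to be installed), for example:

$ youtube-dl -x --audio-format mp3 https://www.youtube.com/watch?v=6SRXUufvZUE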

Now this file can be played using ffplay, which is a command from the package "ffmpeg":

$ /usr/bin/ffplay -nodisp -autoexit /home/myuser/Music/COFFEE-TIME-6SRXUufvZUE.m4a

Important parameters here:

-nodisp: Avoids opening a user interface to play the audio (we don't need one for a cron job running in the background)
-autoexit: Automatically exits ffplay once the file has finished playing; otherwise the command would continue to run

With these parameters we can now schedule the cron job:

00 09 * * 1-5 /usr/bin/ffplay -nodisp -autoexit /home/myuser/Music/COFFEE-TIME-6SRXUufvZUE.m4a
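
Note that cron jobs run with a minimal environment, so depending on your sound setup the job may need to be pointed at your user's audio session. A hedged variant, assuming PulseAudio and a user id of 1000, which also silences the output so cron doesn't send mails:

00 09 * * 1-5 XDG_RUNTIME_DIR=/run/user/1000 /usr/bin/ffplay -nodisp -autoexit /home/myuser/Music/COFFEE-TIME-6SRXUufvZUE.m4a > /dev/null 2>&1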

Definitely a much more lightweight and elegant solution than the previous one using VLC. 

 

Investigating high load on Icinga2 monitoring (caused by browser accessing Nagvis)
Monday - Jan 21st 2019 - by - (0 comments)

Since this weekend we experienced a very high load on the Icinga 2 monitoring server, running Icinga 2 version 2.6:

[Graph: Icinga2 high load]

Restarts of Icinga2 didn't help. And it became worse: Icinga2 became so slow that we experienced outages between master and satellite servers, and the user interface (classicui in this case) showed outdated status data:

[Screenshot: Icinga 2 status data outdated]

In the application log (/var/log/icinga2/icinga2.log) I came across a lot of errors I hadn't seen before:

[2019-01-21 14:06:59 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-21 14:06:59 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-21 14:06:59 +0100] critical/LivestatusQuery: Cannot write query response to socket.
[2019-01-21 14:06:59 +0100] critical/ThreadPool: Exception thrown in event handler:
Error: Tried to read from closed socket.

    (0) libbase.so.2.6.1: (+0xc9148) [0x2b503d005148]
    (1) libbase.so.2.6.1: (+0xc91f9) [0x2b503d0051f9]
    (2) libbase.so.2.6.1: icinga::NetworkStream::Read(void*, unsigned long, bool) (+0x7e) [0x2b503cfa343e]
    (3) libbase.so.2.6.1: icinga::StreamReadContext::FillFromStream(boost::intrusive_ptr const&, bool) (+0x7f) [0x2b503cfab40f]
    (4) libbase.so.2.6.1: icinga::Stream::ReadLine(icinga::String*, icinga::StreamReadContext&, bool) (+0x5c) [0x2b503cfb3bbc]
    (5) liblivestatus.so.2.6.1: icinga::LivestatusListener::ClientHandler(boost::intrusive_ptr const&) (+0x103) [0x2b504c32da93]
    (6) libbase.so.2.6.1: icinga::ThreadPool::WorkerThread::ThreadProc(icinga::ThreadPool::Queue&) (+0x328) [0x2b503cfe9f78]
    (7) libboost_thread.so.1.54.0: (+0xba4a) [0x2b503c6a1a4a]
    (8) libpthread.so.0: (+0x8184) [0x2b503cd26184]
    (9) libc.so.6: clone (+0x6d) [0x2b503de51bed]

These errors started on January 20th at 02:58:

root@icingahost:/ # zgrep critical icinga2.log-20190120.gz |more
[2019-01-20 02:47:53 +0100] critical/ApiListener: Client TLS handshake failed (from [satellite]:55543)
[2019-01-20 02:52:53 +0100] critical/ApiListener: Client TLS handshake failed (from [satellite]:57401)
[2019-01-20 02:57:53 +0100] critical/ApiListener: Client TLS handshake failed (from [satellite]:59271)
[2019-01-20 02:58:50 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-20 02:58:50 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-20 02:58:50 +0100] critical/LivestatusQuery: Cannot write query response to socket.
[2019-01-20 02:58:50 +0100] critical/ThreadPool: Exception thrown in event handler:
[2019-01-20 02:59:05 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-20 02:59:05 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-20 02:59:05 +0100] critical/LivestatusQuery: Cannot write query response to socket.
[2019-01-20 02:59:05 +0100] critical/ThreadPool: Exception thrown in event handler:
[2019-01-20 02:59:08 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-20 02:59:08 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-20 02:59:08 +0100] critical/LivestatusQuery: Cannot write query response to socket.
[2019-01-20 02:59:08 +0100] critical/ThreadPool: Exception thrown in event handler:
[2019-01-20 03:02:10 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-20 03:02:10 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-20 03:02:10 +0100] critical/LivestatusQuery: Cannot write query response to socket.
[2019-01-20 03:02:10 +0100] critical/ThreadPool: Exception thrown in event handler:
[2019-01-20 03:02:44 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-20 03:02:44 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-20 03:02:44 +0100] critical/LivestatusQuery: Cannot write query response to socket.
[2019-01-20 03:02:44 +0100] critical/ThreadPool: Exception thrown in event handler:
[2019-01-20 03:02:53 +0100] critical/ApiListener: Client TLS handshake failed (from [satellite]:32903)
[2019-01-20 03:03:15 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-20 03:03:15 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-20 03:03:15 +0100] critical/LivestatusQuery: Cannot write query response to socket.
[2019-01-20 03:03:15 +0100] critical/ThreadPool: Exception thrown in event handler:
[2019-01-20 03:03:21 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-20 03:03:21 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-20 03:03:21 +0100] critical/LivestatusQuery: Cannot write query response to socket.

So this time correlates with the moment in the graph when the load started to increase!

But what was causing this? According to the errors in icinga2.log it must have something to do with Livestatus, which in this setup is served through a local socket and only accessed by a Nagvis installation.
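
For context: Livestatus is enabled as a feature in Icinga 2, and the listener configuration in a setup like this looks roughly like the following sketch (host and port are illustrative; socket_type can also be "unix" with a socket_path instead):

object LivestatusListener "livestatus" {
  socket_type = "tcp"
  bind_host = "127.0.0.1"
  bind_port = "6558"
}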

By searching for this error message, I came across an issue on GitHub which didn't really solve my load problem (in that case it was Thruk causing the errors), but the comments from dnsmichi pointed me in the right direction:

"If your client application does not close the socket, or wait for processing the response, such errors occur."

As in this case only Nagvis accesses Livestatus, I checked the Apache access logs for Nagvis and narrowed it down to four internal IP addresses constantly accessing Nagvis. After identifying these hosts and the responsible teams, one browser after another was closed until only one machine was left accessing Nagvis. And it turned out to be this single machine causing the issues. After a reboot of that particular machine the load of our Icinga2 server immediately went back to normal and no more errors appeared in the logs.
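
If you need to do a similar hunt, counting requests per client IP in the Apache access log quickly surfaces the noisy candidates; a sketch (the log path is an assumption):

# awk '{print $1}' /var/log/apache2/nagvis_access.log | sort | uniq -c | sort -rn | head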

TL;DR: It's not always the application on the server to blame. Clients/browsers can be the source of a problem, too.

 

