Investigating high load on Icinga2 monitoring (caused by browser accessing Nagvis)

Written by Claudio Kuenzler - 0 comments

Published on January 21st 2019 - last updated on March 13th 2019 - Listed in Icinga Monitoring

Since this weekend we experienced a very high load on the Icinga 2 monitoring server, running Icinga 2 version 2.6:

Restarts of Icinga2 didn't help. And it became worse: Icinga2 became so slow, we experienced outages between master and satellite servers and also in the user interface (classicui in this case) experienced status outages:

In the application log (/var/log/icinga2/icinga2.log) I came across a lot of such errors I haven't seen before:

[2019-01-21 14:06:59 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-21 14:06:59 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-21 14:06:59 +0100] critical/LivestatusQuery: Cannot write query response to socket.
[2019-01-21 14:06:59 +0100] critical/ThreadPool: Exception thrown in event handler:
Error: Tried to read from closed socket.

    (0) libbase.so.2.6.1: (+0xc9148) [0x2b503d005148]
    (1) libbase.so.2.6.1: (+0xc91f9) [0x2b503d0051f9]
    (2) libbase.so.2.6.1: icinga::NetworkStream::Read(void*, unsigned long, bool) (+0x7e) [0x2b503cfa343e]
    (3) libbase.so.2.6.1: icinga::StreamReadContext::FillFromStream(boost::intrusive_ptr const&, bool) (+0x7f) [0x2b503cfab40f]
    (4) libbase.so.2.6.1: icinga::Stream::ReadLine(icinga::String*, icinga::StreamReadContext&, bool) (+0x5c) [0x2b503cfb3bbc]
    (5) liblivestatus.so.2.6.1: icinga::LivestatusListener::ClientHandler(boost::intrusive_ptr const&) (+0x103) [0x2b504c32da93]
    (6) libbase.so.2.6.1: icinga::ThreadPool::WorkerThread::ThreadProc(icinga::ThreadPool::Queue&) (+0x328) [0x2b503cfe9f78]
    (7) libboost_thread.so.1.54.0: (+0xba4a) [0x2b503c6a1a4a]
    (8) libpthread.so.0: (+0x8184) [0x2b503cd26184]
    (9) libc.so.6: clone (+0x6d) [0x2b503de51bed]

These errors started on January 20th at 02:58:

root@icingahost:/ # zgrep critical icinga2.log-20190120.gz |more
[2019-01-20 02:47:53 +0100] critical/ApiListener: Client TLS handshake failed (from [satellite]:55543)
[2019-01-20 02:52:53 +0100] critical/ApiListener: Client TLS handshake failed (from [satellite]:57401)
[2019-01-20 02:57:53 +0100] critical/ApiListener: Client TLS handshake failed (from [satellite]:59271)
[2019-01-20 02:58:50 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-20 02:58:50 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-20 02:58:50 +0100] critical/LivestatusQuery: Cannot write query response to socket.
[2019-01-20 02:58:50 +0100] critical/ThreadPool: Exception thrown in event handler:
[2019-01-20 02:59:05 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-20 02:59:05 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-20 02:59:05 +0100] critical/LivestatusQuery: Cannot write query response to socket.
[2019-01-20 02:59:05 +0100] critical/ThreadPool: Exception thrown in event handler:
[2019-01-20 02:59:08 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-20 02:59:08 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-20 02:59:08 +0100] critical/LivestatusQuery: Cannot write query response to socket.
[2019-01-20 02:59:08 +0100] critical/ThreadPool: Exception thrown in event handler:
[2019-01-20 03:02:10 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-20 03:02:10 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-20 03:02:10 +0100] critical/LivestatusQuery: Cannot write query response to socket.
[2019-01-20 03:02:10 +0100] critical/ThreadPool: Exception thrown in event handler:
[2019-01-20 03:02:44 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-20 03:02:44 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-20 03:02:44 +0100] critical/LivestatusQuery: Cannot write query response to socket.
[2019-01-20 03:02:44 +0100] critical/ThreadPool: Exception thrown in event handler:
[2019-01-20 03:02:53 +0100] critical/ApiListener: Client TLS handshake failed (from [satellite]:32903)
[2019-01-20 03:03:15 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-20 03:03:15 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-20 03:03:15 +0100] critical/LivestatusQuery: Cannot write query response to socket.
[2019-01-20 03:03:15 +0100] critical/ThreadPool: Exception thrown in event handler:
[2019-01-20 03:03:21 +0100] critical/Socket: send() failed with error code 32, "Broken pipe"
[2019-01-20 03:03:21 +0100] critical/LivestatusQuery: Cannot write to TCP socket.
[2019-01-20 03:03:21 +0100] critical/LivestatusQuery: Cannot write query response to socket.

So this time correlates with the graph when the load started to increase!

But what is causing this? According to the errors in icinga2.log it must have something to do with Livestatus, which in this setup is serving as a local socket and only accessed by a Nagvis installation.

By looking for this error message, I came across an issue on GitHub which didn't really solve my load problem (in this case it was Thruk causing the errors) but the comments from dnsmichi pointed me in the right direction:

"If your client application does not close the socket, or wait for processing the response, such errors occur."

As in this case it can only be Nagvis accessing Livestatus, I checked the Apache access logs for Nagvis and narrowed it down to four internal IP addresses constantly accessing Nagvis. By identifying these hosts and the responsible teams one browser after another was closed until finally only one machine left was accessing Nagvis. And it turned out to be this single machine causing the issues. After a reboot of this particular machine the load of our Icinga2 server immediately went back to normal and no more errors appeared in the logs.

TL;DR: It's not always the application on the server to blame. Clients/browsers can be the source of a problem, too.

Add a comment

Show form to leave a comment

Comments (newest first)

No comments yet.

Blog Tags:

AWS Android Ansible Apache Apple Atlassian BSD Backup Bash Bluecoat CMS Chef Cloud Coding Consul Containers CouchDB DB DNS Database Databases Docker ELK Elasticsearch Filebeat FreeBSD Galera Git GlusterFS Grafana Graphics HAProxy HTML Hacks Hardware Icinga Influx Internet Java KVM Kibana Kodi Kubernetes LVM LXC Linux Logstash Mac Macintosh Mail MariaDB Minio MongoDB Monitoring Multimedia MySQL NFS Nagios Network Nginx OSSEC OTRS Observability Office OpenSearch PGSQL PHP Perl Personal PostgreSQL Postgres PowerDNS Proxmox Proxy Python Rancher Rant Redis Roundcube SSL Samba Seafile Security Shell SmartOS Solaris Surveillance Systemd TLS Tomcat Ubuntu Unix VMWare VMware Varnish Virtualization Windows Wireless Wordpress Wyse ZFS Zoneminder