Elasticsearch monitoring plugin check_es_system 1.12.0 released: Major improvements and enhancements!


Published on December 3rd 2021 - Listed in Elasticsearch ELK Monitoring


A new version of check_es_system, an open source monitoring plugin to monitor Elasticsearch nodes and clusters, is available!

Version 1.12.0 brings major improvements and enhancements, which this post walks through.

The major code changes were written by user chicco27 and submitted in PR #41. To build on these changes, preserve backward compatibility and add some further fixes, PR #41 was replaced by PR #44.

Thanks to everyone involved in getting this into the code, and to the testers!

Additional authentication method

Besides the "classical" authentication method using user and password credentials, a second method is now available: authentication with an SSL key and client certificate(s). It is enabled with the new -K (followed by the path to the key file) and -E (followed by the path to the certificate file) parameters.
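As a rough sketch of how the two parameters might be used in a monitoring setup, the wrapper below checks that both files are readable before handing them to the plugin. The function name es_check_with_cert, the PLUGIN variable and the file paths in the usage line are hypothetical; only -K and -E come from the plugin itself.

```shell
# Hypothetical wrapper, not part of check_es_system.
# PLUGIN is an assumed location of the plugin script.
PLUGIN="${PLUGIN:-./check_es_system.sh}"

es_check_with_cert() {
  key="$1"
  cert="$2"
  shift 2
  # Fail early with the conventional UNKNOWN exit code (3) if a file is unreadable.
  for f in "$key" "$cert"; do
    if [ ! -r "$f" ]; then
      echo "WRAPPER UNKNOWN - cannot read $f"
      return 3
    fi
  done
  # Pass key and certificate to the plugin, followed by all remaining arguments.
  "$PLUGIN" -K "$key" -E "$cert" "$@"
}
```

It could then be called like this, with placeholder paths: es_check_with_cert /path/to/client.key /path/to/client.crt -H elasticsearch.example.com -P 9243 -S -t cpu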

New CPU check

A new check type was added: CPU! This allows you to monitor the CPU usage of the Elasticsearch cluster or node:

$ ./check_es_system.sh -H elasticsearch.example.com -P 9243 -S -u user -p secret -t cpu
ES SYSTEM OK - CPU usage is at 2% |es_cpu=2%;80;95;0;100

The default warning threshold is 80% and the default critical threshold is 95%. Both can be adjusted, of course:

$ ./check_es_system.sh -H elasticsearch.example.com -P 9243 -S -u user -p secret -t cpu -w 2 -c 5
ES SYSTEM WARNING - CPU usage is at 3% |es_cpu=3%;1;5;0;100
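The threshold logic in the CPU examples above can be pictured with a small function. This is an illustration only, not the plugin's code: the name cpu_state is made up, and it assumes that a value equal to a threshold already triggers that state. The return codes 0, 1 and 2 follow the usual monitoring plugin convention for OK, WARNING and CRITICAL.

```shell
# Illustrative only: maps a CPU percentage to a monitoring state.
# Assumption: a value equal to a threshold already triggers that state.
cpu_state() {
  usage="$1"
  warn="${2:-80}"   # default warning threshold mentioned in this post
  crit="${3:-95}"   # default critical threshold mentioned in this post
  if [ "$usage" -ge "$crit" ]; then
    echo "CRITICAL"
    return 2
  elif [ "$usage" -ge "$warn" ]; then
    echo "WARNING"
    return 1
  fi
  echo "OK"
  return 0
}
```

With the defaults, cpu_state 2 reports OK, while cpu_state 3 2 5 reports WARNING, matching the two plugin runs shown above.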

Disabled -d parameter / capacities read directly from Elasticsearch

In previous versions, the -d parameter was used in combination with the mem and disk check types. The goal was to "tell" check_es_system how many resources Elasticsearch had at hand. For example, if your cluster had 16GB of memory available, you would have used -t mem -d 16. However, the available resources can also be read directly from the Elasticsearch API, which makes the -d parameter obsolete.

$ ./check_es_system.sh -H elasticsearch.example.com -P 9243 -S -u user -p secret -t disk
ES SYSTEM OK - Disk usage is at 1% (21 G from 1860 G)|es_disk=23105256049B;1597727834112;1897301803008;0;1997159792640

$ ./check_es_system.sh -H elasticsearch.example.com -P 9243 -S -u user -p secret -t mem
ES SYSTEM OK - Memory usage is at 45% (13 G from 29 G)|es_memory=14490041000B;25425870848;30193221632;0;31782338560
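The performance data after the pipe follows the usual label=value;warn;crit;min;max layout, where max is the capacity discovered from Elasticsearch. As a sanity check, the sketch below recomputes the usage and threshold percentages from such a string. The function name perfdata_percent is hypothetical, and the sketch only handles byte-based values such as es_disk and es_memory.

```shell
# Hypothetical helper: print "<used%> <warn%> <crit%>" for one perfdata entry.
# Runs in a subshell (note the parentheses) so the IFS change stays local.
perfdata_percent() (
  fields="${1#*=}"     # drop the "label=" prefix
  IFS=';'
  set -- $fields       # split into value, warn, crit, min, max
  value="${1%B}"       # strip the byte unit from the value
  max="$5"
  # Integer arithmetic: results are rounded down.
  echo "$(( $value * 100 / $max )) $(( $2 * 100 / $max )) $(( $3 * 100 / $max ))"
)
```

Fed with the es_disk string from the disk example above, this prints 1 80 95, which lines up with the reported 1% usage and the default 80% and 95% thresholds.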

If the -d parameter is still being used, the plugin will return an error that the parameter is now invalid:

$ ./check_es_system.sh -H elasticsearch.example.com -P 9243 -S -u user -p secret -t disk -d 1860
ES SYSTEM UNKNOWN: -d parameter is now invalid. Capacities are now discovered directly from Elasticsearch.
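If you have many existing check definitions that still pass -d, a small filter can help with the cleanup. The following is a minimal sketch, assuming the value after -d is a whole number; the name strip_d_param is made up for this example.

```shell
# Hypothetical migration helper: reads check command lines on stdin and
# removes "-d <number>" from each of them.
strip_d_param() {
  sed -E 's/[[:space:]]+-d[[:space:]]+[0-9]+//'
}
```

For example, piping the failing command from above through strip_d_param yields the same command without -d 1860. Review the result before applying it to real configuration files.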

Local node checks

Version 1.12.0 also introduces a new parameter -L. This is short for "Local" and does what it says: The checks are executed against a local node instead of the cluster. The previous versions of check_es_system executed all the checks on a cluster level. By adding the -L parameter to the check command, a different Elasticsearch API URI is requested in the background, retrieving only the statistics of the checked cluster member.

As of version 1.12.0, the following check types are supported in combination with the -L parameter: cpu, disk, mem, jthreads.

Note: Using the -L parameter only makes sense if you have direct access to the cluster members. If you are using a hosted Elasticsearch or EaaS (Elasticsearch as a Service), you shouldn't use this parameter.
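To picture the difference between the two scopes, here is a sketch of how a request URL could be chosen. It assumes the standard Elasticsearch stats endpoints _cluster/stats and _nodes/_local/stats; the exact requests the plugin sends are not shown in this post and may differ. The function name es_stats_url is hypothetical.

```shell
# Assumption: standard Elasticsearch stats endpoints are used for illustration;
# the plugin's real requests may differ.
es_stats_url() (
  base="$1"    # e.g. http://esnode1.example.com:9200
  scope="$2"   # "cluster" or "local"
  case "$scope" in
    local)   echo "$base/_nodes/_local/stats" ;;
    cluster) echo "$base/_cluster/stats" ;;
    *)       echo "unknown scope: $scope" >&2; exit 3 ;;
  esac
)
```

The resulting URL can be inspected manually, for instance with curl -s "$(es_stats_url http://esnode1.example.com:9200 local)".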

In the following examples, the checks are executed against the same Elasticsearch cluster member twice: first on the cluster level, then on the local node level:

# Cluster level checks
$ ./check_es_system.sh -H esnode1.example.com -P 9200 -t cpu
ES SYSTEM CRITICAL - CPU usage is at 98% |es_cpu=98%;80;95;0;100

$ ./check_es_system.sh -H esnode1.example.com -P 9200 -t disk
ES SYSTEM OK - Disk usage is at 22% (105 G from 470 G)|es_disk=112912303491B;404114689228;479886193459;0;505143361536

$ ./check_es_system.sh -H esnode1.example.com -P 9200 -t mem
ES SYSTEM OK - Memory usage is at 44% (14 G from 31 G)|es_memory=15332326008B;27376222208;32509263872;0;34220277760

$ ./check_es_system.sh -H esnode1.example.com -P 9200 -t jthreads
ES SYSTEM OK - Number of JVM threads is 272|es_jvm_threads=272;;;;

# Local node checks
$ ./check_es_system.sh -H esnode1.example.com -P 9200 -t cpu -L
ES SYSTEM OK - CPU usage is at 29% |es_cpu=29%;80;95;0;100

$ ./check_es_system.sh -H esnode1.example.com -P 9200 -t disk -L
ES SYSTEM OK - Disk usage is at 20% (23 G from 117 G)|es_disk=25317853327B;101028672307;119971548364;0;126285840384

$ ./check_es_system.sh -H esnode1.example.com -P 9200 -t mem -L
ES SYSTEM OK - Memory usage is at 63% (5 G from 7 G)|es_memory=5424138456B;6844055552;8127315968;0;8555069440

$ ./check_es_system.sh -H esnode1.example.com -P 9200 -t jthreads -L
ES SYSTEM OK - Number of JVM threads is 67|es_jvm_threads=67;;;;

Local node checks can help to quickly identify which cluster member(s) are under heavy load.
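One way to put this into practice is to run the local CPU check against every node and sort the results. The sketch below assumes each node is directly reachable without authentication, as in the examples above; the function names, the PLUGIN and ES_PORT variables and the host names in the usage line are placeholders.

```shell
# Hypothetical helpers, not part of check_es_system.
PLUGIN="${PLUGIN:-./check_es_system.sh}"   # assumed plugin location
ES_PORT="${ES_PORT:-9200}"                 # assumed Elasticsearch port

# Print the CPU percentage found in one line of plugin output.
cpu_from_output() {
  printf '%s\n' "$1" | sed -n 's/.*es_cpu=\([0-9][0-9]*\)%.*/\1/p'
}

# Run the local CPU check on each host and print "<cpu> <host>",
# sorted with the busiest node first.
rank_nodes_by_cpu() {
  for host in "$@"; do
    # WARNING/CRITICAL results exit non-zero; that is expected here.
    out="$("$PLUGIN" -H "$host" -P "$ES_PORT" -t cpu -L)" || true
    echo "$(cpu_from_output "$out") $host"
  done | sort -rn
}
```

A call such as rank_nodes_by_cpu esnode1.example.com esnode2.example.com esnode3.example.com would then list the nodes by CPU usage.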

