check_rancher2

Last update: February 08, 2024

Monitoring Plugin check_rancher2

This is a monitoring plugin to check Kubernetes (Docker) container infrastructures managed with Rancher 2.x. It uses Rancher 2's API to monitor states of clusters, nodes, workloads or pods.

Note: This plugin is not created nor officially supported by Rancher Labs nor SUSE. As with most monitoring plugins, it's the (monitoring) open source community contributing to this plugin.

Commercial support

If you are looking for commercial support for this monitoring plugin, need customized modifications or in general customized monitoring plugins, contact us at Infiniroot.com.

Download

Download check_rancher2.sh

check_rancher2.sh

3518 downloads so far...

Download plugin and save it in your Nagios/Monitoring plugin folder (usually /usr/lib/nagios/plugins, depends on your distribution). Afterwards adjust the permissions (usually chmod 755).

Community contributions welcome on GitHub repo.

Compatibility

The check_rancher2 plugin was successfully tested on the following Rancher 2 versions:

  • 2.0.x
  • 2.1.x
  • 2.2.x
  • 2.3.x
  • 2.4.x
  • 2.5.x
  • 2.6.x
  • 2.7.x
  • 2.8.x

Version history / Changelog

# 20180629 alpha Started programming of script
# 20180713 beta1 Public release in repository
# 20180803 beta2 Check for "type", echo project name in "all workload" check, too
# 20180806 beta3 Fix important bug in for loop in workload check, check for 'paused'
# 20180906 beta4 Catch cluster not found and zero workloads in workload check
# 20180906 beta5 Fix paused check (type 'object' has no elements to extract (arg 5)
# 20180921 beta6 Added pod(s) check within a project
# 20180926 beta7 Handle a workflow in status 'updating' as warning, not critical
# 20181107 beta8 Missing pod check type in help, documentation completed
# 20181109 1.0.0 Do not alert for succeeded pods
# 20190308 1.1.0 Added node(s) check
# 20190903 1.1.1 Detect invalid hostname (non-API hostname)
# 20190903 1.2.0 Allow self-signed certificates (-s)
# 20190913 1.2.1 Detect additional redirect (308)
# 20200129 1.2.2 Fix typos in workload perfdata (#11) and single cluster health (issue #12)
# 20200523 1.2.3 Handle 403 forbidden error (issue #15)
# 20200617 1.3.0 Added ignore parameter (-i)
# 20210210 1.4.0 Checking specific workloads and pods inside a namespace
# 20210413 1.5.0 Plugin now uses jq instead of jshon, fix cluster error check (issue #19)
# 20210504 1.6.0 Add usage performance data on single cluster check, fix project check
# 20210824 1.6.1 Fix cluster and project not found error (#24)
# 20211021 1.7.0 Check for additional node (pressure) conditions (#27)
# 20211201 1.7.1 Fix cluster state detection (#26)
# 20220610 1.8.0 More performance data, long parameters, other improvements (#31)
# 20220729 1.9.0 Output improvements (#32), show workload namespace (#33)
# 20220909 1.10.0 Fix ComponentStatus (#35), show K8s version in single cluster check
# 20220909 1.10.0 Allow ignoring statuses on workload checks (#29)
# 20230110 1.11.0 Allow ignoring specific workload names, provisioning cluster not critical (#39)
# 20230202 1.12.0 Add local-certs check type
# 20231208 1.12.1 Use 'command -v' instead of 'which' for required command check

Requirements

  • Rancher 2 API access
  • curl package/command
  • jq package/command (prior to version 1.5.0 the jshon command was required)

How to create a Rancher2 API access

This section describes how to create an API access in Rancher 2.x. Log in to your Rancher 2 environment and on the top right corner, hover over your user icon and then click on 'API and Keys'. Starting with Rancher 2.6 this is now called 'Account & API Keys'.

Rancher 2 user api and keys

You will see an overview of existing API access tokens. To create a new API access, click on the button 'Create API Key' (previously 'Add Key').

Rancher 2 add api key

In Rancher 2.6.x and newer, make sure you create an API Key without scope! Otherwise the plugin will not work.

Rancher 2.6 Add API Key

Rancher 2 will output the access credentials with two fields: The access key (username starting with token-) and the secret key (password). You must store these credentials in a safe place as this is the one and only time the password is shown.

Rancher 2 API access created

You can now use these credentials with check_rancher2.

Definition of the parameters

Parameter Short Parameter Long Description
-H * --apihost * Hostname (DNS name) of Rancher 2 management
-U * --apiuser * Username for API access (will be in format 'token-xxxxx')
-P * --apipass * Password for API access
-S --secure Use secure connection (https) to Rancher API
-s --selfsigned Allow self-signed certificates
-t * --type * Check type; defines what kind of check you want to run
-c --clustername Cluster name (for specific cluster check)
-p --projectname Project ID in format c-xxxxx:p-xxxxx (for specific project check, needed for workload checks)
-n --namespacename Namespace name (optional for specific workload, required for specific pod check)
-w --workloadname Workload name (needed for specific workload checks)
-o --podname Pod name (needed for specific pod checks, this makes only sense if you have pods with static names)
-i --ignore Comma-separated list of status(es) to ignore ('node' and 'workload' check types), specific workload name(s) to ignore ('workload' check type) or certificate to ignore ('local-certs' check type)
N/A --cpu-warn Exit with WARNING status if more than PERCENT of cpu capacity is used (supported check types: node, cluster)
N/A --cpu-crit Exit with CRITICAL status if more than PERCENT of cpu capacity is used (supported check types: node, cluster)
N/A --memory-warn Exit with WARNING status if more than PERCENT of mem capacity is used (supported check types: node, cluster)
N/A --memory-crit Exit with CRITICAL status if more than PERCENT of mem capacity is used (supported check types: node, cluster)
N/A --pods-warn Exit with WARNING status if more than PERCENT of pod capacity is used (supported check types: node, cluster)
N/A --pods-crit Exit with CRITICAL status if more than PERCENT of pod capacity is used (supported check types: node, cluster)
N/A --cert-warn Warning threshold in days to warn before a certificate expires (supported check types: local-certs)
-h --help Show help and usage

* mandatory parameter

Definition of the check types

Type Description
info Informs about available clusters and projects and their API ID's. These ID's are needed for specific checks.
cluster Checks the current status of all clusters or of a specific cluster (defined with -c clusterid)
node Checks the current status of all nodes or nodes of a specific cluster (defined with -c clusterid)
project Checks the current status of all projects or of a specific project (defined with -p projectid)
workload Checks the current status of all or a specific (-w workloadname) workload within a project (-p projectid must be set!) and optional within a namespace (-n namespace)
pod Checks the current status of all or a specific (-o podname) pod within a project (-p projectid must be set!).
A specific pod check can be used with -o podname, however this also requires the namespace (-n namespace).
Note: A specific pod check makes only sense if you use pods with static names (this is rather rare)
local-certs Checks the current status of all internal Rancher certificates (e.g. rancher-webhook) in local cluster under the System project (namespace: cattle-system)

Usage / running the plugin on the command line

Usage:

./check_rancher2.sh -H hostname -U token -P password [-S] [-s] -t checktype [-p string] [-o string] [-n string] [-w string]

Example: Show information about discovered clusters:

./check_rancher2.sh -H rancher2.example.com -S -U token-xxxxx -P aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai -t info
CHECK_RANCHER2 OK - Found 3 clusters: "c-5f7k2" alias "gamma-stage" - "c-scsb6" alias "hugo-stage" - "local" alias "local" - and 6 projects: "c-5f7k2:p-4fdsd" alias "Gamma" - "c-5f7k2:p-cngpb" alias "System" - "c-scsb6:p-cxk9s" alias "hugo-stage" - "c-scsb6:p-dqxdx" alias "System" - "local:p-9rwc7" alias "Default" - "local:p-ls96l" alias "System" -|'clusters'=3;;;; 'projects'=6;;;;

Example: Check all pods within a project (c-5f7k2:p-4fdsd):

./check_rancher2.sh -H rancher2.example.com -S -U token-xxxxx -P aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai -t pod -p c-5f7k2:p-4fdsd
CHECK_RANCHER2 OK - All pods (85) in project c-5f7k2:p-4fdsd are running|'pods_total'=85;;;; 'pods_errors'=0;;;;

Example: Single pod check within a project (c-5f7k2:p-4fdsd) and namespace (gamma):

./check_rancher2.sh -H rancher2.example.com -S -U token-xxxxx -P aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai -t pod -p c-5f7k2:p-4fdsd -n gamma -o nginx-85789c55b6-625tz
CHECK_RANCHER2 OK - Pod nginx-85789c55b6-625tz is running|'pod_active'=1;;;; 'pod_error'=0;;;;

Example: Monitor all nodes across all clusters:

./check_rancher2.sh -H rancher2.example.com -S -U token-xxxxx -P aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai -t node
CHECK_RANCHER2 OK - All 73 nodes are active|'nodes_total'=73;;;; 'node_errors'=0;;;; 'node_ignored'=0;;;; 'nodes_cpu_total'=98149;;;0;372000 'nodes_memory_total'=211285966848B;;;0;1321235513344 'nodes_pods_total'=1431;;;0;8030

Example: Monitor all nodes across all clusters but ignore cordoned and drained nodes:

./check_rancher2.sh -H rancher2.example.com -S -U token-xxxxx -P aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai -t node -i "cordoned,drained"
CHECK_RANCHER2 OK - All nodes OK - Info: 1 node errors ignored|'nodes_total'=17;;;; 'node_errors'=0;;;; 'node_ignored'=1;;;; 'nodes_cpu_total'=23374;;;0;48000 'nodes_memory_total'=30635196416B;;;0;130837262336 'nodes_pods_total'=497;;;0;1870
node12 in cluster c-xxxxx is drained but ignored

Example: Monitor all clusters:

./check_rancher2.sh -H rancher2.example.com -S -U token-xxxxx -P aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai -t cluster
CHECK_RANCHER2 OK - All clusters (5) are healthy|'clusters_total'=5;;;; 'clusters_errors'=0;;;;

Example: Monitor a single cluster (note the performance data here):

./check_rancher2.sh -H rancher2.example.com -S -U token-xxxxx -P aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai -t cluster -c c-xxxxx
CHECK_RANCHER2 OK - Cluster my-test is healthy|'cluster_healthy'=1;;;; 'component_errors'=0;;;; 'cpu'=2380;;;;4000 'memory'=1572864000B;;;0;12469846016 'pods'=55;;;;220 'usage_cpu'=59%;;;0;100 'usage_memory'=12%;;;0;100 'usage_pods'=25%

Example: Monitor Rancher internal certificates:

./check_rancher2.sh -H rancher2.example.com -S -U token-xxxxx -P aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai -t local-certs
CHECK_RANCHER2 CRITICAL - 1 certificate(s) expired (tls-rancher-internal expired 71 days ago -) |'total_certs'=6;;;; 'expired_certs'=1;;;; 'warning_certs'=0;;;; 'ignored_certs'=0;;;;

Example: Monitor Rancher internal certificates but ignore a specific certificate:

./check_rancher2.sh -H rancher2.example.com -S -U token-xxxxx -P aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai -t local-certs -i "tls-rancher-internal"
CHECK_RANCHER2 OK - All 6 certificates are valid - 1 certificate(s) ignored: tls-rancher-internal|'total_certs'=6;;;; 'expired_certs'=0;;;; 'warning_certs'=0;;;; 'ignored_certs'=1;;;;

Command definition

Command definition in Nagios, Icinga 1.x, Shinken, Naemon

# check_rancher2 command definition
define command{
  command_name check_rancher2
  command_line $USER1$/check_rancher2.sh -H $HOSTADDRESS$ -S -U $ARG1$ -P $ARG2$ -t $ARG3$ $ARG4$
}

Note: HTTPS is used in this case (-S). All mandatory parameters are fixed defined. The optional parameters can be added inside $ARG4$ (e.g. -c clustername).

Important: Using $HOSTADDRESS$ as -H value assumes that you have created a host object which uses the Rancher2 DNS name as address.

Command definition in Icinga 2.x

Also see CheckCommand object in GitHub repository

# check_rancher2 command definition
object CheckCommand "check_rancher2" {
  import "plugin-check-command"
  command = [ PluginDir + "/check_rancher2.sh" ]

  arguments = {
   "-H" = "$rancher2_address$"
   "-U" = "$rancher2_username$"
   "-P" = "$rancher2_password$"
   "-S" = { set_if = "$rancher2_ssl$" }
   "-t" = "$rancher2_type$"
   "-c" = "$rancher2_cluster$"
   "-p" = "$rancher2_project$"
   "-n" = "$rancher2_namespace$"
   "-w" = "$rancher2_workload$"
   "-o" = "$rancher2_pod$"
   "-i" = "$rancher2_ignore_status$"
   "--cpu-warn" = "$rancher2_cpu_warn$"
   "--cpu-crit" = "$rancher2_cpu_crit$"
   "--memory-warn" = "$rancher2_memory_warn$"
   "--memory-crit" = "$rancher2_memory_crit$"
   "--pods-warn" = "$rancher2_pods_warn$"
   "--pods-crit" = "$rancher2_pods_crit$"
   "--cert-warn" = "$rancher2_cert_warn$"
  }

  vars.rancher2_address = "$address$"
  vars.rancher2_cert_warn = "14"
  # If you only run one Rancher2, you can define api access here, too:
  #vars.rancher2_username = "token-xxxxx"
  #vars.rancher2_password = "aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai"
  #vars.rancher2_ssl = true
}

Service definition

Service definition in Nagios, Icinga 1.x, Shinken, Naemon

Show Rancher 2 info:

# Check Rancher 2 Pods in project c-5f7k2:p-4fdsd
define service{
  use generic-service
  host_name my-rancher2-host
  service_description Rancher2 Info
  check_command check_rancher2!token-xxxxx!aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai!info
}

Check all pods in a project:

# Check Rancher 2 Pods in project c-5f7k2:p-4fdsd
define service{
  use generic-service
  host_name my-rancher2-host
  service_description Rancher2 Project 1 Pods
  check_command check_rancher2!token-xxxxx!aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai!pod!-c c-5f7k2:p-4fdsd
}

Service object definition Icinga 2.x

Information about discovered clusters and projects:

# Just show some info about discovered clusters and projects
object Service "Rancher2 Info" {
  import "generic-service"
  host_name = "my-rancher2-host"
  check_command = "check_rancher2"
  vars.rancher2_username = "token-xxxxx"
  vars.rancher2_password = "aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai"
  vars.rancher2_ssl = true
  vars.rancher2_type = "info"
}

Check all available/found clusters for their health:

# Check all available/found clusters for their health
object Service "Rancher2 All Clusters" {
  import "generic-service"
  host_name = "my-rancher2-host"
  check_command = "check_rancher2"
  vars.rancher2_username = "token-xxxxx"
  vars.rancher2_password = "aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai"
  vars.rancher2_ssl = true
  vars.rancher2_type = "cluster"
}

Check a single cluster for its health:

# Check a single cluster for its health
object Service "Rancher2 Cluster Test" {
  import "generic-service"
  host_name = "my-rancher2-host"
  check_command = "check_rancher2"
  vars.rancher2_username = "token-xxxxx"
  vars.rancher2_password = "aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai"
  vars.rancher2_ssl = true
  vars.rancher2_type = "cluster"
  vars.rancher2_cluster = "c-5f7k2"
}

Check all nodes of a single cluster for their health but ignore cordoned and drained statuses:

# Check nodes in cluster c-5f7k2 (ignore status cordoned and drained)
object Service "Rancher2 Nodes in Cluster c-5f7k2" {
  import "generic-service"
  host_name = "my-rancher2-host"
  check_command = "check_rancher2"
  vars.rancher2_username = "token-xxxxx"
  vars.rancher2_password = "aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai"
  vars.rancher2_ssl = true
  vars.rancher2_type = "node"
  vars.rancher2_cluster = "c-5f7k2"
  vars.rancher2_ignore_status = "cordoned,drained"
}

Check all available/found projects (across all clusters) for their health:

# Check all available/found projects (across all clusters) for their health
object Service "Rancher2 All Projects" {
  import "generic-service"
  host_name = "my-rancher2-host"
  check_command = "check_rancher2"
  vars.rancher2_username = "token-xxxxx"
  vars.rancher2_password = "aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai"
  vars.rancher2_ssl = true
  vars.rancher2_type = "project"
}

Check a single project:

# Check a single projects
object Service "Rancher2 Project Test" {
  import "generic-service"
  host_name = "my-rancher2-host"
  check_command = "check_rancher2"
  vars.rancher2_username = "token-xxxxx"
  vars.rancher2_password = "aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai"
  vars.rancher2_ssl = true
  vars.rancher2_type = "project"
  vars.rancher2_project = "c-5f7k2:p-4fdsd"
}

Check all workloads in a certain project:

# Check all workloads in a certain project
object Service "Rancher2 Workloads in Project Test" {
  import "generic-service"
  host_name = "my-rancher2-host"
  check_command = "check_rancher2"
  vars.rancher2_username = "token-xxxxx"
  vars.rancher2_password = "aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai"
  vars.rancher2_ssl = true
  vars.rancher2_type = "workload"
  vars.rancher2_project = "c-5f7k2:p-4fdsd"
}

Check a single workload in a certain project:

# Check a single workload in a certain project
object Service "Rancher2 Workload Web in Project Test" {
  import "generic-service"
  host_name = "my-rancher2-host"
  check_command = "check_rancher2"
  vars.rancher2_username = "token-xxxxx"
  vars.rancher2_password = "aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai"
  vars.rancher2_ssl = true
  vars.rancher2_type = "workload"
  vars.rancher2_project = "c-5f7k2:p-4fdsd"
  vars.rancher2_workload = "Web"
}

Check all pods in a certain project:

# Check all pods in a certain project object Service "Rancher2 Pods in Project Test" {
  import "generic-service"
  host_name = "my-rancher2-host"
  check_command = "check_rancher2"
  vars.rancher2_username = "token-xxxxx"
  vars.rancher2_password = "aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai"
  vars.rancher2_ssl = true
  vars.rancher2_type = "pod"
  vars.rancher2_project = "c-5f7k2:p-4fdsd"
}

Check a single pod in a certain project and namespace:

# Check a single pod in a certain project and namespace
object Service "Rancher2 Pod Nginx1 in Project Test Namespace Test" {
  import "generic-service"
  host_name = "my-rancher2-host"
  check_command = "check_rancher2"
  vars.rancher2_username = "token-xxxxx"
  vars.rancher2_password = "aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai"
  vars.rancher2_ssl = true
  vars.rancher2_type = "pod"
  vars.rancher2_project = "c-5f7k2:p-4fdsd"
  vars.rancher2_namespace = "test"
  vars.rancher2_pod = "nginx1"
}

Check internal certificates deployed in Rancher in "local" cluster:

# Check internal certificates in local cluster
object Service "Rancher2 internal certificates (cattle-system)" {
  import "generic-service"
  host_name = "my-rancher2-host"
  check_command = "check_rancher2"
  vars.rancher2_username = "token-xxxxx"
  vars.rancher2_password = "aethooFaaGohthah8aezup5wiew5aedainooG2goh9Kaeti9hurolai"
  vars.rancher2_ssl = true
  vars.rancher2_type = "local-certs"
}

Screenshots

check_rancher2 pods all ok
check_rancher2 local internal certificates check
check_rancher2 node has disk pressure
check_rancher2 grafana pods errors
check_rancher2 grafana pods total

Presentation

The monitoring plugin check_rancher2 was presented and introduced at the Open Source Monitoring Conference (OSMC) 2018 in Nuremberg, Germany. You can download the presentation as PDF document or watch the recorded video online.

Its all about the containers presentation