How monitoring plugin check_nwc_health calculates network interface traffic speed

Published on - last updated on March 19th 2020 - Listed in Monitoring Network


The graphs must be wrong!

As someone who lives and breathes monitoring (yes, this is my passion), I was caught by surprise by this statement from a network administrator. The graphs in question were interface traffic graphs shown in Grafana, built from the interface performance data that check_nwc_health writes into an InfluxDB (via Icinga 2).

But nothing can be ruled out unless proven. So I decided to take a closer look at how check_nwc_health calculates the interface traffic.

check_nwc_health and its output

When running check_nwc_health on the command line using the interface-usage mode, the following output is shown:

$ ./check_nwc_health --hostname=switch --statefilesdir=/tmp --protocol=2c --community=public --mode=interface-usage --name TenGigabitEthernet0/0/1
OK - interface TenGigabitEthernet0/0/1 (alias Link to LAN) usage is in:2.09% (209197071.70bit/s) out:0.91% (90867146.68bit/s) | 'TenGigabitEthernet0/0/1_usage_in'=2.09%;80;90;0;100 'TenGigabitEthernet0/0/1_usage_out'=0.91%;80;90;0;100 'TenGigabitEthernet0/0/1_traffic_in'=209197071.70;8000000000;9000000000;0;10000000000 'TenGigabitEthernet0/0/1_traffic_out'=90867146.68;8000000000;9000000000;0;10000000000

The plugin shows two relevant kinds of performance data here:

  1. Usage: The interface's capacity (here 10 Gbit/s) is read from SNMP and the current traffic is set in relation to it, resulting in the percentage of the capacity in use.
  2. Traffic: The interface's incoming (in) and outgoing (out) traffic is shown in bit/s.
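
The relation between the two values can be reproduced by hand. The following is a minimal Python sketch, not code from the plugin; the function name utilization_percent is made up for illustration, and the numbers are copied from the output above:

```python
def utilization_percent(rate_bps: float, capacity_bps: float) -> float:
    """Return the traffic rate as a percentage of the interface capacity."""
    if capacity_bps <= 0:
        raise ValueError("capacity must be positive")
    return rate_bps / capacity_bps * 100


# Values taken from the plugin output above (10 Gbit/s interface).
capacity = 10_000_000_000
usage_in = utilization_percent(209197071.70, capacity)
usage_out = utilization_percent(90867146.68, capacity)
print(f"in:{usage_in:.2f}% out:{usage_out:.2f}%")  # in:2.09% out:0.91%
```

The result agrees with the usage_in and usage_out values reported by the plugin.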

To dig deeper, check_nwc_health provides different verbosity levels, selected by appending one or more -v parameters. This shows additional information coming directly from SNMP, along with values the plugin derives from it:

$ ./check_nwc_health --hostname=switch --statefilesdir=/tmp --protocol=2c --community=public --mode=interface-usage --name TenGigabitEthernet0/0/1 -vvv
[INTERFACESUBSYSTEM]
bootTime: 1535241918
duplicates: HASH(0x5741770)
ifCacheLastChange: 1584633871
ifTableLastChange: 1566825882.12
interface_cache: HASH(0x5745480)
info: checking interfaces
[64BIT_2]
delta_ifHCInOctets: 12851200579
delta_ifHCOutOctets: 6711710306
delta_ifInBits: 102809604632
delta_ifOutBits: 53693682448
delta_timestamp: 547
ifAlias: Link to LAN
ifDescr: TenGigabitEthernet0/0/1
ifHCInOctets: 974071772432616
ifHCInOctets_per_sec: 23493968.1517367
ifHCOutOctets: 619657250186324
ifHCOutOctets_per_sec: 12270037.1224863
ifHighSpeed: 10000
ifInOctets: 3254470888
ifIndex: 2
ifName: Te0/0/1
ifOperStatus: up
ifOutOctets: 843555924
ifSpeed: 4294967295
inputRate: 187951745.213894
inputUtilization: 1.87951745213894
maxInputRate: 10000000000
maxOutputRate: 10000000000
outputRate: 98160296.9798903
outputUtilization: 0.981602969798903
info: interface TenGigabitEthernet0/0/1 (alias Link to LAN) usage is in:1.88% (187951745.21bit/s) out:0.98% (98160296.98bit/s)


OK - interface TenGigabitEthernet0/0/1 (alias Link to LAN) usage is in:1.88% (187951745.21bit/s) out:0.98% (98160296.98bit/s)
checking interfaces
interface TenGigabitEthernet0/0/1 (alias Link to LAN) usage is in:1.88% (187951745.21bit/s) out:0.98% (98160296.98bit/s) | 'TenGigabitEthernet0/0/1_usage_in'=1.88%;80;90;0;100 'TenGigabitEthernet0/0/1_usage_out'=0.98%;80;90;0;100 'TenGigabitEthernet0/0/1_traffic_in'=187951745.21;8000000000;9000000000;0;10000000000 'TenGigabitEthernet0/0/1_traffic_out'=98160296.98;8000000000;9000000000;0;10000000000

Thanks to the verbose output we can see that the inputRate and outputRate values are the ones reported as the interface's actual traffic rates in bit/s.

Traffic calculations

When looking at the plugin's source code, one can find the following relevant code parts in InterfaceSubsystem.pm in the "usage" part:

    $self->valdiff({name => $self->{ifIndex}.'#'.$self->{ifDescr}}, qw(ifInOctets ifOutOctets));
    $self->{delta_ifInBits} = $self->{delta_ifInOctets} * 8;
    $self->{delta_ifOutBits} = $self->{delta_ifOutOctets} * 8;
   [...]
    $self->{inputRate} = $self->{delta_ifInBits} / $self->{delta_timestamp};
    $self->{outputRate} = $self->{delta_ifOutBits} / $self->{delta_timestamp};

First, let's look at the valdiff function. When the plugin runs, it writes a state file into the directory given by --statefilesdir (/tmp in the commands above), using the target hostname and the interface as the file name. This file stores the interface's counter values along with the timestamp of the check:

$ cat "/tmp/switch_device::interfaces::usage_2#tengigabitethernet0_0_123cf4a9df48212f44"
$VAR1 = {
  'timestamp' => 1584634579,
  'localtime' => 'Thu Mar 19 17:16:19 2020',
  'ifHCInOctets' => '974071772432616',
  'ifHCOutOctets' => '619657250186324'
};

When the plugin runs the same check on the same interface again, valdiff kicks in: the previously stored data is compared with the newly collected data, and the difference (hence valdiff = value difference) is stored in "delta" variables. The octet deltas are then multiplied by 8 to get the same values in bits (1 octet = 8 bits).

Now for the tricky part: how does the plugin correctly determine the rate in bit/s? If the plugin ran the same check exactly once per second, the bit delta would already be the bit/s value. But what about running the plugin every 5 minutes, or only once an hour? That's why the timestamp stored in the state file is so important! The current and the stored timestamp are compared and yet another delta (the time difference in seconds) is calculated. Dividing the bit delta by this time difference yields the final bit/s values, which are stored in the inputRate and outputRate variables seen in the verbose output. Note that the result is an average over the whole interval between two checks, so short bursts are smoothed out when the interval is long.
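
The division can be checked against the verbose output shown earlier. This is a small Python sketch under my own naming (rate_bits_per_second is not a plugin function); the deltas are the delta_ifHCInOctets, delta_ifHCOutOctets and delta_timestamp values from that output:

```python
def rate_bits_per_second(delta_octets: int, delta_seconds: float) -> float:
    """Convert an octet delta over a time delta into bit/s."""
    if delta_seconds <= 0:
        raise ValueError("the time delta must be positive")
    return delta_octets * 8 / delta_seconds


# Deltas copied from the verbose output above (547 seconds between checks).
input_rate = rate_bits_per_second(12851200579, 547)
output_rate = rate_bits_per_second(6711710306, 547)
print(f"in:{input_rate:.2f}bit/s out:{output_rate:.2f}bit/s")
# in:187951745.21bit/s out:98160296.98bit/s
```

Both numbers are identical to the inputRate and outputRate lines of the verbose output.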

And, you probably guessed it, the state file is then rewritten with the new data, ready for comparison by valdiff on the next run.

$ cat "/tmp/switch_device::interfaces::usage_2#tengigabitethernet0_0_123cf4a9df48212f44"
$VAR1 = {
  'ifHCInOctets' => '974114105928875',
  'timestamp' => 1584636386,
  'localtime' => 'Thu Mar 19 17:46:26 2020',
  'ifHCOutOctets' => '619679721786303'
};
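
The whole cycle (read the old state, calculate the deltas, write the new state) can be sketched in a few lines of Python. This is an illustration of the idea only, not a port of valdiff: the check function, the JSON format and the file name are assumptions of mine, and returning no rate on a negative delta is my own choice, not confirmed plugin behaviour. The counter values and timestamps are the ones from the two state files shown above:

```python
import json
import tempfile
from pathlib import Path


def check(state_file: Path, octets: int, timestamp: int):
    """Return bit/s since the previous run (None on the first run), then save state."""
    rate = None
    if state_file.exists():
        previous = json.loads(state_file.read_text())
        delta_octets = octets - previous["ifHCInOctets"]
        delta_seconds = timestamp - previous["timestamp"]
        # Assumption: skip the calculation after a counter reset or clock jump.
        if delta_octets >= 0 and delta_seconds > 0:
            rate = delta_octets * 8 / delta_seconds
    # Overwrite the state so the next run compares against this one.
    state_file.write_text(json.dumps({"ifHCInOctets": octets, "timestamp": timestamp}))
    return rate


with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "switch_TenGigabitEthernet0_0_1.json"  # hypothetical file name
    first = check(state, 974071772432616, 1584634579)   # values of the first state file
    second = check(state, 974114105928875, 1584636386)  # values of the second state file
    print(first, f"{second / 1e6:.2f} Mbit/s")  # None 187.42 Mbit/s
```

The two state files are 1807 seconds apart, so the result is the average incoming rate over roughly half an hour.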

Note: ifHCInOctets and ifHCOutOctets are 64-bit SNMP counters. Compared to the 32-bit counters (ifInOctets and ifOutOctets) they can hold much larger numbers. A 32-bit octet counter wraps around after 2^32 octets (about 4.3 GB), which a fully loaded 10 Gbit/s link transfers in less than four seconds. On fast or busy interfaces it is therefore mandatory to use the 64-bit counters, or you will get incorrect traffic data.
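
How quickly a 32-bit counter becomes useless can be put into numbers. The Python sketch below uses helper names of my own (seconds_until_wrap, delta_counter32); the modulo trick shown is a common way to survive a single wrap and is not taken from the plugin's source:

```python
COUNTER32_MODULUS = 2**32  # a 32-bit counter wraps back to 0 after 4294967295


def seconds_until_wrap(rate_bps: float) -> float:
    """Time a 32-bit octet counter needs to wrap at a constant bit rate."""
    return COUNTER32_MODULUS * 8 / rate_bps


def delta_counter32(previous: int, current: int) -> int:
    """Octet delta that tolerates at most one wrap between two readings."""
    return (current - previous) % COUNTER32_MODULUS


print(f"{seconds_until_wrap(10_000_000_000):.1f} s at 10 Gbit/s")        # 3.4 s
print(f"{seconds_until_wrap(187951745.21):.0f} s at the inputRate above")  # 183 s
print(delta_counter32(4294967000, 704))  # one wrap in between: 1000 octets
```

With 547 seconds between the two checks in the verbose example, the interface moved about three times the range of a 32-bit counter, and no single-wrap correction can recover that.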

Manual traffic comparisons

To rule out a potential bug in valdiff or another part of the plugin, the raw SNMP counter values were collected and the traffic was calculated manually. The plugin was run once a minute using a while loop, sleeping 60s in between:

$ while true; do date; ./check_nwc_health --hostname=switch --statefilesdir=/tmp --protocol=2c --community=public --mode=interface-usage --name TenGigabitEthernet0/0/1 -vvv | egrep "(ifHCInOctets|ifInOctets|inputRate)"; sleep 60; echo "---------"; done

Example output:

Thu Mar 19 11:45:03 CET 2020
delta_ifHCInOctets: 60840931786
ifHCInOctets: 973732081243736
ifHCInOctets_per_sec: 21830258.983136
ifInOctets: 2865698392
inputRate: 174642071.865088
-------
Thu Mar 19 11:46:04 CET 2020
delta_ifHCInOctets: 855553551
ifHCInOctets: 973732936797287
ifHCInOctets_per_sec: 14259225.85
ifInOctets: 3721251943
inputRate: 114073806.8

The raw data was put into a spreadsheet and Mbit/s values were calculated from the following fields:

  • delta_ifHCInOctets (delta calculated by the plugin using valdiff)
  • ifHCInOctets (raw data from 64bit SNMP counter, using manual diff calculation)
  • ifHCInOctets_per_sec
  • ifInOctets (raw data from 32bit SNMP counter, using manual diff calculation)
  • inputRate
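
The same calculations can be repeated outside a spreadsheet. The Python sketch below uses the two samples from the example output above and assumes 60 seconds between the two plugin runs, which is what the sleep in the loop and the ifHCInOctets_per_sec value suggest; the helper name mbit_per_second is my own:

```python
INTERVAL_SECONDS = 60  # assumption: time between the two plugin runs


def mbit_per_second(octets: int, seconds: float) -> float:
    """Convert an octet delta over a number of seconds into Mbit/s."""
    return octets * 8 / seconds / 1_000_000


# Raw values from the two samples in the example output above.
results = {
    "delta_ifHCInOctets": mbit_per_second(855553551, INTERVAL_SECONDS),
    "ifHCInOctets (manual diff)": mbit_per_second(
        973732936797287 - 973732081243736, INTERVAL_SECONDS
    ),
    "ifHCInOctets_per_sec": 14259225.85 * 8 / 1_000_000,
    "ifInOctets (manual diff)": mbit_per_second(
        3721251943 - 2865698392, INTERVAL_SECONDS
    ),
    "inputRate": 114073806.8 / 1_000_000,
}

for name, value in results.items():
    print(f"{name:28s} {value:.2f} Mbit/s")
```

For this pair of samples all five values come out at 114.07 Mbit/s. The 32-bit counter agrees here only because it did not wrap during this particular minute.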

The results speak for themselves:

[Image: Interface traffic calculations comparisons]

The Mbit/s traffic calculated manually from the raw ifHCInOctets values matched the speed calculated from inputRate. Only the 32-bit counter (ifInOctets) showed different Mbit/s data, most likely because the counter hit its 32-bit limit and wrapped around between two readings.

Columns G (Mbit/s based on the plugin's delta) and H (manual calculation from raw data) always matched.

On certain lines the columns G/H and I (Mbit/s based on the plugin's ifHCInOctets_per_sec value) differed. But column K (Mbit/s based on the inputRate from the plugin) always matched column I. On the non-matching lines the difference was small.

So at the end the plugin's interface statistics turned out to be completely correct and so did the Grafana graphs!

