How to monitor a PostgreSQL replication

Written by Claudio Kuenzler - 3 comments

Published on July 26th 2017 - last updated on June 15th 2021 - Listed in DB Postgres Database Monitoring Icinga Nagios

There are multiple ways of monitoring a working master-slave-replication on PostgreSQL servers.

Using PSQL

First of all there is of course the replication status which can be read directly from the master PostgreSQL server:

This information can only be read on the master. If you try that on the slave (hot_standby = on), you don't get to see anything:

postgres@dbslave:~$ psql -x -c "select * from pg_stat_replication;"
(0 rows)

Obviously most important information here is the sync_state:

postgres@dbmaster:~$ psql -x -c "select sync_state from pg_stat_replication;"
-[ RECORD 1 ]----
sync_state | sync

Possible values of sync_state are:

async: This standby server is asynchronous -> CRITICAL!
potential: This standby server is asynchronous, but can potentially become synchronous if one of current synchronous ones fails -> WARNING
sync: This standby server is synchronous -> OK
quorum: This standby server is considered as a candidate for quorum standbys -> OK

Other important values are the different "locations":

sent_location    | 0/6000290
write_location   | 0/6000290
flush_location   | 0/6000290
replay_location | 0/6000290

From the documentation:

sent_location: Last write-ahead log location sent on this connection
write_location: Last write-ahead log location written to disk by this standby server
flush_location: Last write-ahead log location flushed to disk by this standby server
replay_location: Last write-ahead log location replayed into the database on this standby server

This basically shows where the slave server is. If all values are the same it is caught up 100%.

Using monitoring plugin check_postgres

Note: There are different monitoring plugins with the name "check_postgres". Make sure you use check_postgresl.pl!

The monitoring plugin check_postgres also features a replication check (hot_standby_delay). The trick is to correctly understand this check. Using the hot_standby_delay check, the plugin connects to both the master and slave and compares the replay delay and receive delay to the given warning and critical thresholds. In order to connect to both the master and the slave, the pg_hba.conf must be adapted accordingly.

On the master (IP 10.10.10.10) I added the following lines:

# Monitoring
host all monitoring 127.0.0.1/32 md5
host all monitoring 10.10.10.11/32 md5

On the slave (IP 10.10.10.11) I added the following lines:

# Monitoring
host all monitoring 127.0.0.1/32 md5

The plugin will be executed on the slave server ergo there the monitoring line for localhost is enough.

To not use the db password with the plugin (the password would show up in cleartext in the process list), I created a .pgpass file for the nagios user (under which this plugin will run). This file contains two entries; first for the localhost connection and secondly for the remote connection to the master server:

nagios@dbslave:~$ whoami
nagios

nagios@dbslave:~$ ls -la .pgpass
-rw------- 1 nagios nagios 94 Jul 26 15:25 .pgpass

nagios@dbslave:~$ cat .pgpass
localhost:5432:*:monitoring:mysupersecretpassword
dbmaster:5432:*:monitoring:mysupersecretpassword

Make sure the .pgpass file has correct permissions (chmod 0600), otherwise it won't be used for psql commands!

Now the plugin can be executed with the hot_standby_delay check:

nagios@dbslave:~$ /usr/lib/nagios/plugins/check_postgres.pl -H localhost,dbmaster -u monitoring -db mydb --action hot_standby_delay --warning 60 --critical 600
POSTGRES_HOT_STANDBY_DELAY OK: DB "mydb" (host:localhost) 0 and 432 seconds | time=0.05s replay_delay=0;60;600 receive-delay=0;60;600 time_delay=432;

Note the -H parameter uses two hostnames (comma separated). The plugin will connect to both localhost and the dbmaster host using the SQL user "monitoring" (password will automatically be read from .pgpass file). I set a delay warning to 60 seconds, a critical delay to 600 seconds (10 minutes).

Add a comment

Show form to leave a comment

Comments (newest first)

ck from Switzerland wrote on Aug 28th, 2017:

Majales, I wrote the answer in the Github issue. For sake of completeness here again:

I went through the source code and time_delay is $time_delta:

if ($version >= 9.1) {
$db->{perf} .= sprintf ' %s=%s;%s;%s',
perfname(msg('hs-time-delay')), $time_delta, $wtime, $ctime;
}

time_delta is read from xlog positions (line 5110 and below). So if you run the SQL command you see these "seconds":

postgres=# SELECT pg_last_xlog_receive_location() AS receive, pg_last_xlog_replay_location() AS replay , COALESCE(ROUND(EXTRACT(epoch FROM now() - pg_last_xact_replay_timestamp())),0) AS seconds;
receive | replay | seconds
------------+------------+---------
2/266055D8 | 2/266055D8 | 151763
(1 row)

So seconds is defined by the function pg_last_xact_replay_timestamp(). The value represents the "time passed since last sync operation" (it doesn't mean that the servers are out of sync then).

ck from Switzerland wrote on Aug 19th, 2017:

Hi Majales. This is indeed a very good question and I don't know either. I found this issue on the check_postgres Github repository: https://github.com/bucardo/check_postgres/issues/119. The OP asks the same question as you. This Github issue is probably the best way to get a correct answer.
Update: Oh!! I just saw that YOU are the OP of that Github issue. Loop :D

Majales from wrote on Aug 18th, 2017:

What actualy means that time_delay=432? I can probably understand replay_delay or receive-delay. time should be check execution time, but what is time_delay?

Blog Tags:

AWS Android Ansible Apache Apple Atlassian BSD Backup Bash Bluecoat CMS Chef Cloud Coding Consul Containers CouchDB DB DNS Database Databases Docker ELK Elasticsearch Filebeat FreeBSD Galera Git GlusterFS Grafana Graphics HAProxy HTML Hacks Hardware Icinga Influx Internet Java KVM Kibana Kodi Kubernetes LVM LXC Linux Logstash Mac Macintosh Mail MariaDB Minio MongoDB Monitoring Multimedia MySQL NFS Nagios Network Nginx OSSEC OTRS Observability Office OpenSearch PGSQL PHP Perl Personal PostgreSQL Postgres PowerDNS Proxmox Proxy Python Rancher Rant Redis Roundcube SSL Samba Seafile Security Shell SmartOS Solaris Surveillance Systemd TLS Tomcat Ubuntu Unix VMWare VMware Varnish Virtualization Windows Wireless Wordpress Wyse ZFS Zoneminder