Header RSS Feed
 
If you only want to see the articles of a certain category, please click on the desired category below:
ALL Android Backup BSD Database Hacks Hardware Internet Linux Mail MySQL Monitoring Network Personal PHP Proxy Shell Solaris Unix Virtualization VMware Windows Wyse

How to monitor a PostgreSQL replication
Wednesday - Jul 26th 2017 - by - (3 comments)

There are multiple ways of monitoring a working master-slave-replication on PostgreSQL servers.

Using PSQL

First of all there is of course the replication status which can be read directly from the master PostgreSQL server:

postgres@dbmaster:~$ psql -x -c "select * from pg_stat_replication;"
-[ RECORD 1 ]----+------------------------------
pid              | 13014
usesysid         | 16387
usename          | replica
application_name | dbslave
client_addr      | 10.10.10.11
client_hostname  |
client_port      | 48596
backend_start    | 2017-07-26 13:07:00.617621+00
backend_xmin     |
state            | streaming
sent_location    | 0/6000290
write_location   | 0/6000290
flush_location   | 0/6000290
replay_location  | 0/6000290
sync_priority    | 1
sync_state       | sync

This information can only be read on the master. If you try that on the slave (hot_standby = on), you don't get to see anything:

postgres@dbslave:~$ psql -x -c "select * from pg_stat_replication;"
(0 rows)

Obviously the moast important information here is the sync_state:

postgres@dbmaster:~$ psql -x -c "select sync_state from pg_stat_replication;"
-[ RECORD 1 ]----
sync_state | sync

Possible values of sync_state are:

  • async: This standby server is asynchronous -> CRITICAL!
  • potential: This standby server is asynchronous, but can potentially become synchronous if one of current synchronous ones fails -> WARNING
  • sync: This standby server is synchronous -> OK
  • quorum: This standby server is considered as a candidate for quorum standbys -> OK

Other important values are the different "locations":

sent_location    | 0/6000290
write_location   | 0/6000290
flush_location   | 0/6000290
replay_location  | 0/6000290

From the documentation:

  • sent_location: Last write-ahead log location sent on this connection
  • write_location: Last write-ahead log location written to disk by this standby server
  • flush_location: Last write-ahead log location flushed to disk by this standby server
  • replay_location: Last write-ahead log location replayed into the database on this standby server

This basically shows where the slave server is. If all values are the same it is caught up 100%.

Using monitoring plugin check_postgres

The monitoring plugin check_postgres also features a replication check (hot_standby_delay). The trick is to correctly understand this check. Using the hot_standby_delay check, the plugin connects to both the master and slave and compares the replay delay and receive delay to the given warning and critical thresholds. In order to connect to both the master and the slave, the pg_hba.conf must be adapted accordingly.

On the master (IP 10.10.10.10) I added the following lines:

# Monitoring
host    all             monitoring      127.0.0.1/32          md5
host    all             monitoring      10.10.10.11/32        md5

On the slave (IP 10.10.10.11) I added the following lines:

# Monitoring
host    all             monitoring      127.0.0.1/32          md5

The plugin will be executed on the slave server ergo there the monitoring line for localhost is enough.

To not use the db password with the plugin (the password would show up in cleartext in the process list), I created a .pgpass file for the nagios user (under which this plugin will run). This file contains two entries; first for the localhost connection and secondly for the remote connection to the master server:

nagios@dbslave:~$ whoami
nagios

nagios@dbslave:~$ ls -la .pgpass
-rw------- 1 nagios nagios 94 Jul 26 15:25 .pgpass

nagios@dbslave:~$ cat .pgpass
localhost:5432:*:monitoring:mysupersecretpassword
dbmaster:5432:*:monitoring:mysupersecretpassword

Make sure the .pgpass file has correct permissions (chmod 0600), otherwise it won't be used for psql commands!

Now the plugin can be executed with the hot_standby_delay check:

nagios@dbslave:~$ /usr/lib/nagios/plugins/check_postgres.pl -H localhost,dbmaster -u monitoring -db mydb --action hot_standby_delay --warning 60 --critical 600
POSTGRES_HOT_STANDBY_DELAY OK: DB "mydb" (host:localhost) 0 and 432 seconds | time=0.05s replay_delay=0;60;600  receive-delay=0;60;600 time_delay=432;

Note the -H parameter uses two hostnames. The plugin will connect to both localhost and the dbmaster host using the SQL user "monitoring" (password will automatically be read from .pgpass file). I set a delay warning to 60 seconds, a critical delay to 600 seconds (10 minutes).

 

Add a comment

Show form to leave a comment

Comments (newest first):

ck from Switzerland wrote on Aug 28th, 2017:
Majales, I wrote the answer in the Github issue. For sake of completeness here again:

I went through the source code and time_delay is $time_delta:

if ($version >= 9.1) {
$db->{perf} .= sprintf ' %s=%s;%s;%s',
perfname(msg('hs-time-delay')), $time_delta, $wtime, $ctime;
}



time_delta is read from xlog positions (line 5110 and below). So if you run the SQL command you see these "seconds":

postgres=# SELECT pg_last_xlog_receive_location() AS receive, pg_last_xlog_replay_location() AS replay , COALESCE(ROUND(EXTRACT(epoch FROM now() - pg_last_xact_replay_timestamp())),0) AS seconds;
receive | replay | seconds
------------+------------+---------
2/266055D8 | 2/266055D8 | 151763
(1 row)



So seconds is defined by the function pg_last_xact_replay_timestamp(). The value represents the "time passed since last sync operation" (it doesn't mean that the servers are out of sync then).

ck from Switzerland wrote on Aug 19th, 2017:
Hi Majales. This is indeed a very good question and I don't know either. I found this issue on the check_postgres Github repository: https://github.com/bucardo/check_postgres/issues/119. The OP asks the same question as you. This Github issue is probably the best way to get a correct answer.
Update: Oh!! I just saw that YOU are the OP of that Github issue. Loop :D

Majales wrote on Aug 18th, 2017:
What actualy means that time_delay=432? I can probably understand replay_delay or receive-delay. time should be check execution time, but what is time_delay?


Go to Homepage home
Linux Howtos how to's
Monitoring Plugins monitoring plugins
Links links

Valid HTML 4.01 Transitional
Valid CSS!
[Valid RSS]

7393 Days
until Death of Computers
Why?