It's been a couple of weeks since I set up a PostgreSQL replication and added it to our monitoring system (see How to monitor a PostgreSQL replication) and it has been running smoothly so far. But in the past few days a disk usage warning popped up.
Although the databases themselves only use around 10GB of disk space, the WAL files (especially the archived WAL files) eat 63GB!
This is because by default the archived WAL files are kept forever if "archive_mode" is set to on in the PostgreSQL config:
archive_mode = on # enables archiving; off, on, or always
archive_command = 'cp %p /var/lib/postgresql/9.6/main/archive/%f'
I thought the solution is easy: I just disable the archiv_mode on the master and enable it on the replica (a hot standby). NOPE! I was following the replica as the WAL files were rotating through (I have wal_keep_segments = 32) but no files in the archive directory were created.
A look at an older mail from February 2014 in the PostgreSQL mailing list reveals:
"It works fine, only the server will not generate WAL while it is in recovery. As soon as you promote the standby, it will archive ist WALs."
A hot_standby replica server is basically ALWAYS running in recovery; means that the "archive_command" will never run on it. Lesson 1 learned: Cleaning up must be done on the master server.
Note: This is only true for hot_standby, it may be different for other kinds of replication modes.
To clean up the archived WAL files, there's a special command pg_archivecleanup. The program can be added into the recovery.conf on a standby server (not hot_standby!) or used as standalone command:
pg_archivecleanup [option...] archivelocation oldestkeptwalfile
I decided to go with the standalone command and build a wrapper around the command. This resulted in a shell script walarchivecleanup.sh. The script allows different options and is able to dynamically looking up a the "oldestkeptwalfile" given by the a max age parameter (-a). A specific "oldestkeptwalfile" can also be given (-f).
# ./walarchivecleanup.sh -p /var/lib/postgresql/9.6/main/archive -a 14 -d
pg_archivecleanup: keep WAL file "/var/lib/postgresql/9.6/main/archive/0000000100000002000000B6" and later
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000001000000E6"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000002000000B1"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000001000000B0"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/000000010000000200000056"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/00000001000000020000008F"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/00000001000000020000006F"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000001000000BC"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000001000000A2"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000001000000B6"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000001000000A4"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/00000001000000020000004F"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000001000000D0"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/00000001000000020000004E"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000001000000F1"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000002000000B5"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/000000010000000200000070"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/00000001000000020000001C"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000002000000B4"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/000000010000000200000039"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000001000000E0"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000001000000FD"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/00000001000000020000003E"
General information and usage:
./walarchivecleanup.sh (c) 2017 Claudio Kuenzler
This script helps to clean up archived WAL logs on a PostgreSQL master server using the pg_archivecleanup command.
Please note that WAL archiving currently only works on a master server (as of 9.6).
-p Path to the archived WAL logs (e.g. /var/lib/postgresql/9.6/main/archive)
-a Age of archived logs to keep (days), anything older will be deleted
-f Specify a certain archived WAL file, anything older than this file will be deleted
Note: If you use -f, it will override -a parameter
-c Full path to pg_archivecleanup command (if not found in $PATH)
-d Show debug information
-n Dry run (simulation only)
Usage: ./walarchivecleanup.sh -p archivepath -a age (days) [-d debug] [-f archivefile] [-c path_to_pg_archivecleanup]
Example 1: ./walarchivecleanup.sh -p /var/lib/postgresql/9.6/main/archive -a 10
Example 2: ./walarchivecleanup.sh -p /var/lib/postgresql/9.6/main/archive -f 00000001000000010000001E
Cronjob example: 00 03 * * * /root/scripts/walarchivecleanup.sh -p /var/lib/postgresql/9.6/main/archive -a 14
The script is now published on Github and can be found here: https://github.com/Napsty/scripts/blob/master/pgsql/walarchivecleanup.sh. Enjoy!