
How to monitor a PostgreSQL replication
Wednesday - Jul 26th 2017 - by - (0 comments)

There are multiple ways of monitoring a working master-slave replication on PostgreSQL servers.

Using PSQL

First of all there is of course the replication status which can be read directly from the master PostgreSQL server:

postgres@dbmaster:~$ psql -x -c "select * from pg_stat_replication;"
-[ RECORD 1 ]----+------------------------------
pid              | 13014
usesysid         | 16387
usename          | replica
application_name | dbslave
client_addr      |
client_hostname  |
client_port      | 48596
backend_start    | 2017-07-26 13:07:00.617621+00
backend_xmin     |
state            | streaming
sent_location    | 0/6000290
write_location   | 0/6000290
flush_location   | 0/6000290
replay_location  | 0/6000290
sync_priority    | 1
sync_state       | sync

This information can only be read on the master. If you try that on the slave (hot_standby = on), you don't get to see anything:

postgres@dbslave:~$ psql -x -c "select * from pg_stat_replication;"
(0 rows)

Obviously the most important information here is the sync_state:

postgres@dbmaster:~$ psql -x -c "select sync_state from pg_stat_replication;"
-[ RECORD 1 ]----
sync_state | sync

Possible values of sync_state are:

  • async: This standby server is asynchronous -> CRITICAL!
  • potential: This standby server is asynchronous, but can potentially become synchronous if one of current synchronous ones fails -> WARNING
  • sync: This standby server is synchronous -> OK
  • quorum: This standby server is considered as a candidate for quorum standbys -> OK
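These states can be turned into a tiny check of their own. A minimal sketch (not the official check_postgres logic; the map_state helper is hypothetical), mapping the states above to Nagios exit codes:

```shell
#!/bin/sh
# Map a pg_stat_replication sync_state to a Nagios status, following
# the table above. map_state is a hypothetical helper, not a real plugin.
map_state() {
    case "$1" in
        sync|quorum) echo "OK - sync_state is $1"; return 0 ;;
        potential)   echo "WARNING - sync_state is $1"; return 1 ;;
        async)       echo "CRITICAL - sync_state is $1"; return 2 ;;
        *)           echo "UNKNOWN - no replication state"; return 3 ;;
    esac
}

# On the master this could be fed from psql:
# map_state "$(psql -At -c 'select sync_state from pg_stat_replication;')"
```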

Other important values are the different "locations":

sent_location    | 0/6000290
write_location   | 0/6000290
flush_location   | 0/6000290
replay_location  | 0/6000290

From the documentation:

  • sent_location: Last write-ahead log location sent on this connection
  • write_location: Last write-ahead log location written to disk by this standby server
  • flush_location: Last write-ahead log location flushed to disk by this standby server
  • replay_location: Last write-ahead log location replayed into the database on this standby server

This basically shows how far the slave server has progressed. If all values are the same, the slave has caught up 100%.
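These locations are WAL positions of the form high/low in hexadecimal. To compare two of them numerically (e.g. to compute a lag in bytes), they can be converted to byte offsets. A small sketch, assuming the 64-bit WAL pointer layout (high part * 2^32 + low part); wal_to_bytes is a hypothetical helper:

```shell
#!/bin/sh
# Convert a WAL location like "0/6000290" into an absolute byte offset,
# assuming the 64-bit layout: high part * 2^32 + low part (both hex).
wal_to_bytes() {
    hi=${1%%/*}
    lo=${1##*/}
    echo $(( 0x$hi * 4294967296 + 0x$lo ))
}

# Example: lag between sent_location and replay_location in bytes:
# echo $(( $(wal_to_bytes "0/6000290") - $(wal_to_bytes "0/6000290") ))
```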

Using monitoring plugin check_postgres

The monitoring plugin check_postgres also features a replication check (hot_standby_delay). The trick is to correctly understand this check. Using the hot_standby_delay check, the plugin connects to both the master and slave and compares the replay delay and receive delay to the given warning and critical thresholds. In order to connect to both the master and the slave, the pg_hba.conf must be adapted accordingly.

On the master I added the following lines:

# Monitoring
host    all             monitoring          md5
host    all             monitoring        md5

On the slave I added the following lines:

# Monitoring
host    all             monitoring          md5

The plugin will be executed on the slave server, so there the monitoring line for localhost is enough.

To avoid passing the DB password to the plugin on the command line (the password would show up in cleartext in the process list), I created a .pgpass file for the nagios user (under which this plugin will run). This file contains two entries: first for the localhost connection and second for the remote connection to the master server:

nagios@dbslave:~$ whoami

nagios@dbslave:~$ ls -la .pgpass
-rw------- 1 nagios nagios 94 Jul 26 15:25 .pgpass

nagios@dbslave:~$ cat .pgpass

Make sure the .pgpass file has correct permissions (chmod 0600), otherwise it won't be used for psql commands!
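For reference, the .pgpass format is hostname:port:database:username:password, one connection per line. A sketch with placeholder values only:

```
localhost:5432:mydb:monitoring:topsecret
dbmaster:5432:mydb:monitoring:topsecret
```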

Now the plugin can be executed with the hot_standby_delay check:

nagios@dbslave:~$ /usr/lib/nagios/plugins/check_postgres.pl -H localhost,dbmaster -u monitoring -db mydb --action hot_standby_delay --warning 60 --critical 600
POSTGRES_HOT_STANDBY_DELAY OK: DB "spark" (host:localhost) 0 and 432 seconds | time=0.05s replay_delay=0;60;600  receive-delay=0;60;600 time_delay=432;

Note that the -H parameter takes two hostnames. The plugin will connect to both localhost and the dbmaster host using the SQL user "monitoring" (the password is automatically read from the .pgpass file). I set the warning threshold to a delay of 60 seconds and the critical threshold to 600 seconds (10 minutes).


check_nwc_health: rumms - UNKNOWN no interfaces
Thursday - Jul 20th 2017 - by - (0 comments)

Today I had to solve a special case where an Icinga 2 satellite server ran out of disk space in /var. After I increased the disk size I noticed that almost all network switches, checked via this satellite using check_nwc_health, returned an UNKNOWN status. Service output: rumms. 

check_nwc_health rumms

I manually verified this on the cli:

# /usr/lib/nagios/plugins/check_nwc_health --hostname aswitch --community public --mode interface-usage --name Ethernet1/1
UNKNOWN - no interfaces

I manually re-listed all interfaces:

# /usr/lib/nagios/plugins/check_nwc_health --hostname aswitch --community public --mode list-interfaces
83886080 mgmt0
151060482 Vlan2
526649088 Ethernet101/1/29
526649152 Ethernet101/1/30
526649216 Ethernet101/1/31
526649280 Ethernet101/1/32
OK - have fun

And then the check worked again:

# /usr/lib/nagios/plugins/check_nwc_health --hostname aswitch --community public --mode interface-usage --name Ethernet1/1
OK - interface Ethernet1/1 (alias UCS-FI-A) usage is in:0.82% (82014272.36bit/s) out:3.21% (320758024.71bit/s) | 'Ethernet1/1_usage_in'=0.82%;80;90;0;100 'Ethernet1/1_usage_out'=3.21%;80;90;0;100 'Ethernet1/1_traffic_in'=82014272.36;8000000000;9000000000;0;10000000000 'Ethernet1/1_traffic_out'=320758024.71;8000000000;9000000000;0;10000000000

The reason for this is that by default check_nwc_health creates a "cached" list of interfaces per checked device. This cached list is a file in /var/tmp/check_nwc_health:

# ls -l /var/tmp/check_nwc_health | grep cache
-rw-r--r-- 1 nagios nagios  8192 Jul 20 08:03 01switch_interface_cache_d2e08e73bba4b976b8b4dcdcf66e3c7d
-rw-r--r-- 1 nagios nagios  8577 Jul 20 08:17 02switch_interface_cache_d2e08e73bba4b976b8b4dcdcf66e3c7d
-rw-r--r-- 1 nagios nagios  8192 Jul 20 08:04 aswitch_interface_cache_81b3d521b731e73215515a4f1f4a3ccf
-rw-r--r-- 1 nagios nagios     0 Jul 20 07:32 bswitch_interface_cache_81b3d521b731e73215515a4f1f4a3ccf
-rw-r--r-- 1 nagios nagios  8192 Jul 20 08:06 cswitch_interface_cache_81b3d521b731e73215515a4f1f4a3ccf
-rw-r--r-- 1 nagios nagios  7017 Jul 20 08:18 dswitch_interface_cache_81b3d521b731e73215515a4f1f4a3ccf
-rw-r--r-- 1 nagios nagios  7013 Jul 20 08:19 eswitch_interface_cache_81b3d521b731e73215515a4f1f4a3ccf
-rw-r--r-- 1 nagios nagios     0 Jul 20 07:31 fswitch_interface_cache_81b3d521b731e73215515a4f1f4a3ccf
-rw-rw-r-- 1 nagios nagios  9291 Jul 20 08:16 gswitch_interface_cache_81b3d521b731e73215515a4f1f4a3ccf
-rw-r--r-- 1 nagios nagios  6245 Jul 20 07:44 hswitch_interface_cache_d2e08e73bba4b976b8b4dcdcf66e3c7d
-rw-r--r-- 1 nagios nagios     0 Jul 20 07:46 iswitch_interface_cache_d2e08e73bba4b976b8b4dcdcf66e3c7d
-rw-r--r-- 1 nagios nagios  4096 Jul 20 08:12 jswitch_interface_cache_d2e08e73bba4b976b8b4dcdcf66e3c7d
-rw-r--r-- 1 nagios nagios  4096 Jul 20 07:46 kswitch_interface_cache_d2e08e73bba4b976b8b4dcdcf66e3c7d

Note the cache files with a 0-byte size. That's an empty list of interfaces for that specific device - ergo any given interface is unknown.
Because /var was full the last time the interface cache file was written, a 0-byte file was created, causing check_nwc_health to believe there are no interfaces at all on the network device.

By removing the cache files the check worked again (if there is no interface cache file, it will be re-created on the next run).
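A quick cleanup for such cases: delete the 0-byte cache files so they get rebuilt on the next plugin run (purge_empty_caches is a hypothetical helper; the directory argument would be /var/tmp/check_nwc_health as shown above):

```shell
#!/bin/sh
# Delete 0-byte interface cache files in the given directory so that
# check_nwc_health rebuilds them on its next run.
purge_empty_caches() {
    find "$1" -maxdepth 1 -name '*_interface_cache_*' -size 0 -print -delete
}

# Usage on the satellite server:
# purge_empty_caches /var/tmp/check_nwc_health
```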


Presenting new monitoring plugin: check_lxc
Wednesday - Jul 19th 2017 - by - (1 comments)

I'm proud to announce a new Nagios/Monitoring plugin: check_lxc.

As the name already tells you, this is a plugin to monitor Linux Containers (LXC). It needs to run on the LXC host and lets you check the CPU, memory and swap usage of a container. The plugin can also check whether a container is set to start automatically.

The work on this plugin began several years ago, in 2013. After recently adding a CPU check, I think the plugin is now "ready" to be used in the wild.

Back in 2013, when plugin development started, LXC was at version 0.8. I have taken extra precautions to keep compatibility across LXC releases. As of today I can say that the plugin works from LXC 0.8 upwards.

Enough talk for now. Read the documentation of check_lxc, use the plugin and enjoy!


Varnish vcl reload not working with SystemD on Ubuntu 16.04
Tuesday - Jul 18th 2017 - by - (0 comments)

When using Varnish, a restart is often unwanted because it clears the cache. For configuration changes in a VCL file, a reload comes in much handier.

However, I came across an issue today: this reload doesn't work with SystemD. The OS is Ubuntu 16.04.2 LTS. The reason for this is the "ExecReload" command in the SystemD unit file for Varnish:

# grep ExecReload /etc/systemd/system/varnish.service

This command (/usr/share/varnish/reload-vcl) reads the config file /etc/default/varnish - which is obsolete when using SystemD (see Configure Varnish custom settings on Debian 8 Jessie and Ubuntu 16.04 LTS). An issue on the Varnish Github repository confirms this bug.

A workaround (and a working one, I tested it) is to use the new "varnishreload" script. As of this writing this script is not yet part of the varnish package, but will probably be added soon. I downloaded the script, saved it as /usr/sbin/varnishreload and gave it executable permissions. Then I modified the SystemD unit file for the Varnish service:

# grep ExecReload /etc/systemd/system/varnish.service
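The adapted ExecReload entry would look like this (a sketch, assuming the stock Varnish unit file is otherwise unchanged):

```ini
# /etc/systemd/system/varnish.service (excerpt)
[Service]
ExecReload=/usr/sbin/varnishreload
```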

Followed by a reload of SystemD:

# systemctl daemon-reload

 and a restart of Varnish:

# systemctl restart varnish

To test this, I modified the VCL in use (which is not the default.vcl, by the way) and removed special debug headers in the new config. If the reload works, Varnish should stop sending these headers in the response.

# systemctl reload varnish

# systemctl status varnish
● varnish.service - Varnish Cache, a high-performance HTTP accelerator
   Loaded: loaded (/etc/systemd/system/varnish.service; disabled; vendor preset: enabled)
   Active: active (running) since Tue 2017-07-18 15:52:08 CEST; 31min ago
  Process: 7229 ExecReload=/usr/sbin/varnishreload (code=exited, status=0/SUCCESS)
  Process: 26848 ExecStart=/usr/sbin/varnishd -a :6081 -T localhost:6082 -f /etc/varnish/zerberos.vcl -S /etc/varnish/secret -s malloc,2048m (code=exited, status
 Main PID: 26850 (varnishd)
    Tasks: 218
   Memory: 143.6M
      CPU: 3.623s
   CGroup: /system.slice/varnish.service
           ├─26850 /usr/sbin/varnishd -a :6081 -T localhost:6082 -f /etc/varnish/zerberos.vcl -S /etc/varnish/secret -s malloc,2048m
           └─26858 /usr/sbin/varnishd -a :6081 -T localhost:6082 -f /etc/varnish/zerberos.vcl -S /etc/varnish/secret -s malloc,2048m

Jul 18 16:22:01 varnish1 varnishd[26850]: CLI telnet ::1 54288 ::1 6082 Wr 200 VCL compiled.
Jul 18 16:22:01 varnish1 varnishreload[7229]: VCL compiled.
Jul 18 16:22:01 varnish1 varnishd[26850]: CLI telnet ::1 54294 ::1 6082 Rd auth b3a13c2d09d6d3551504ace7665994ea9bccab035be9d9518d00ea6f36a8ead3
Jul 18 16:22:01 varnish1 varnishd[26850]: CLI telnet ::1 54294 ::1 6082 Wr 200 -----------------------------
                                            Varnish Cache CLI 1.0
                                            varnish-5.1.2 revision 6ece695
                                            Type 'help' for command list.
                                            Type 'quit' to close CLI session.
Jul 18 16:22:01 varnish1 varnishd[26850]: CLI telnet ::1 54294 ::1 6082 Rd ping
Jul 18 16:22:01 varnish1 varnishd[26850]: CLI telnet ::1 54294 ::1 6082 Wr 200 PONG 1500387721 1.0
Jul 18 16:22:01 varnish1 varnishd[26850]: CLI telnet ::1 54294 ::1 6082 Rd vcl.use reload_20170718_162201
Jul 18 16:22:01 varnish1 varnishd[26850]: CLI telnet ::1 54294 ::1 6082 Wr 200 VCL 'reload_20170718_162201' now active
Jul 18 16:22:01 varnish1 varnishreload[7229]: VCL 'reload_20170718_162201' now active
Jul 18 16:22:01 varnish1 systemd[1]: Reloaded Varnish Cache, a high-performance HTTP accelerator.

systemctl status seems to verify a working reload. However, don't let yourself be fooled - the same kind of entries also appeared with the non-working reload script before. But a manual check confirmed that the reload of the changed VCL config actually worked; the debug headers were gone from the HTTP responses.

I changed the VCL again, re-enabled the headers and ran another systemctl reload varnish - and the headers were back. So make sure you use the new varnishreload script when running Varnish on Ubuntu 16.04 LTS with SystemD (this might also affect other Linux distributions; I didn't test that).


Count backwards with seq in Linux
Wednesday - Jul 12th 2017 - by - (0 comments)

I needed to manually create some basic web statistics using awstats (a one-shot statistic). My approach was to take all the rotated logs and create one big access log. I wanted the lines of that access log in the correct order to avoid awstats stumbling.

First I unzipped all rotated logs:

gunzip *gz

Then I needed to get the log entries from rotated file 40 down to rotated file 9. But here's the catch: how do I count down without having to write out every single number from 40 to 9 (that would be something like for i in 40 39 38 37, etc.)? I know how to count up automatically using seq:

$ seq 1 5

So I needed to find a way to count backwards. The solution? seq again :-)

seq offers an optional INCREMENT parameter between the starting and the ending number. From the --help output:

$ seq --help
Usage: seq [OPTION]... LAST
  or:  seq [OPTION]... FIRST LAST
Print numbers from FIRST to LAST, in steps of INCREMENT.

Example: Count up to 10 in steps of 2:

$ seq 1 2 10

The INCREMENT number can be negative, too:

$ seq 10 -1 1

And this is actually the way to count down. To put together all rotated logs in the correct order, I finally used the following command:

$ for i in $(seq 40 -1 9); do cat access.log.$i >> final.access.log; done
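As a side note: Bash brace expansion can count downwards, too, so the same loop works without seq (Bash-only; plain sh does not expand braces, hence the explicit bash call in this sketch):

```shell
# Bash-only alternative to seq for counting down.
bash -c 'for i in {40..9}; do echo "access.log.$i"; done'
```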


Gandi domain registrar hacked?
Friday - Jul 7th 2017 - by - (0 comments)

Today we've received several messages that some websites didn't work anymore. Further analysis revealed that several domains suddenly had their DNS nameservers changed.

A whois lookup of an affected domain showed the following nameservers:


A DNS lookup using "dig -t NS" on affected domains all showed NS records of 


A records were set to: (an IP address registered in Latvia).

Currently we have 922 domains registered at Gandi. 7 domains were affected and all nameservers pointed to the ones above. Without our doing. Without Gandi having done anything.

Direct communication with Gandi revealed that these manipulations didn't happen on our account only; several customers were affected. I was also assured that it has nothing to do with the new Gandi v5 platform, but that the problem lay somewhere between the Gandi backend and its communication with the domain registries (like nic.ch for Swiss domains).

This pretty much sounds like a hack of Gandi's backend to me. Ouch :-((

The domain settings were quickly restored and an update to the NIC servers was initiated. After a couple of hours our affected domains were working again. However, I'm still curious to hear what exactly caused this.

Update July 10th 2017: Gandi confirmed an "unauthorized connection" in their backend in a statement sent to the affected customers:

Following an unauthorized connection which occurred at one of the
technical providers we use to manage a number of geographic TLDs[2].

In all, 751 domains in total were affected by this incident, which
involved a unauthorized modification of the name servers [NS] assigned
to the affected domains that then forwarded traffic to a malicious site
exploiting security flaws in several browsers.

Additionally, SWITCH security (the registry of .ch domains) added a good technical article about that case here: https://securityblog.switch.ch/2017/07/07/94-ch-li-domain-names-hijacked-and-used-for-drive-by/ 

Update July 11th 2017: Gandi added a special article on their news blog, in which they share details about what happened. It's really worth checking out. I appreciate the transparency at Gandi!


55700 hours or the grey old age of a SATA drive
Friday - Jul 7th 2017 - by - (0 comments)

A few days ago I wrote about a problem where I could not boot Linux Mint 18.1 after running apt-get upgrade. It turned out to be a problem of a dying hard drive.

The SMART output of this drive is quite impressive, especially the power on hours:

smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-79-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

Model Family:     Western Digital Caviar Blue (SATA)
Device Model:     WDC WD5000AAKS-00V1A0
Serial Number:    WD-WMAWF2141256
LU WWN Device Id: 5 0014ee 0575fb08f
Firmware Version: 05.01D05
User Capacity:    500.107.862.016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Tue Jul  4 08:43:57 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)    Offline data collection activity
                          was suspended by an interrupting command from host.
                          Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                          without error or no self-test has ever
                          been run.
Total time to complete Offline
data collection:           ( 7380) seconds.
Offline data collection
capabilities:               (0x7b) SMART execute Offline immediate.
                          Auto Offline data collection on/off support.
                          Suspend Offline collection upon new
                          Offline surface scan supported.
                          Self-test supported.
                          Conveyance Self-test supported.
                          Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                          power-saving mode.
                          Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                          General Purpose Logging supported.
Short self-test routine
recommended polling time:  (   2) minutes.
Extended self-test routine
recommended polling time:  (  88) minutes.
Conveyance self-test routine
recommended polling time:  (   5) minutes.
SCT capabilities:           (0x303f) SCT Status supported.
                          SCT Error Recovery Control supported.
                          SCT Feature Control supported.
                          SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       698
  3 Spin_Up_Time            0x0027   144   141   021    Pre-fail  Always       -       3800
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       104
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   024   024   000    Old_age   Always       -       55700
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       102
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       37
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       66
194 Temperature_Celsius     0x0022   097   093   000    Old_age   Always       -       46
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   189   189   000    Old_age   Always       -       899
198 Offline_Uncorrectable   0x0030   197   197   000    Old_age   Offline      -       277
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   198   193   000    Old_age   Offline      -       464

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     53425         579131664
# 2  Extended offline    Completed without error       00%     50463         -
# 3  Extended offline    Completed without error       00%     40082         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Yes, this drive ran for 55'700 hours. Although a lot of defective sectors (and read errors) were detected, I was still able to copy the whole Linux Mint 18.1 installation to a new hard drive (only one file could not be copied due to read errors).


Samsung UE55KU6400 defect picture when using external sources
Tuesday - Jul 4th 2017 - by - (0 comments)

A few days ago something strange happened with my television, a Samsung UE55KU6400 (Samsung 6 Series) I bought in November 2016. All of a sudden the picture from external sources (connected via HDMI) was gone/defective. The TV channels via cable were still working fine - so it was not a defect of the LED screen itself. But see for yourself:

Samsung UE55KU6400 defect picture on external HDMI source 

I tried several things to find the exact problem:

  • TV channels: Picture normal
  • Different HDMI cable: Same problem
  • Different external source (using a notebook's HDMI output): Same problem
  • Different HDMI port: Same problem

So either all HDMI ports suddenly became defective, or it had to be a problem in the TV's software (a few days before this issue appeared I had noticed a software update of the TV). So I decided to go for a reset of the TV.

Once I reset the TV (through "Settings") I needed to set up the TV again, including setting the language, searching through all TV channels, setting up a Samsung account, etc. But in the end this turned out to be successful: the external sources were shown correctly again.

Summary: In case you experience picture problems with your external HDMI sources on your Samsung 6 series TV, try a software reset through the settings menu (of course only once you have ruled out a bad HDMI cable or port and you're sure the external source works correctly, too).


Permissions of log files automatically being reset by syslog
Tuesday - Jun 27th 2017 - by - (0 comments)

For a special application I built a simple monitoring check which reads /var/log/mail.log (and the rotated /var/log/mail.log.1) and counts the number of e-mails sent by that application.

The check itself is executed through NRPE which runs as "nagios" user. Therefore the nagios user needs to be able to read /var/log/mail.log. Easy:

# chmod 644 /var/log/mail.log
# chmod 644 /var/log/mail.log.1

I even adapted the logrotate config file to ensure the rotated log file is also readable after a log rotation (using the "create" option):

    create 644 root adm
        invoke-rc.d syslog-ng reload > /dev/null

I enabled the monitoring check and it worked. But just a couple of minutes later the check returned critical because the nagios user wasn't able to read the log file anymore. I verified and indeed, the permissions of /var/log/mail.log were reset:

$ ll /var/log/mail.log
-rw-r----- 1 root adm 108437 Jun 27 10:00 /var/log/mail.log

It turns out that syslog-ng (which runs on this application server) resets the permissions automatically to the ones defined in the syslog-ng config. By default (here on a Debian Wheezy installation) this means:

# grep 640 /etc/syslog-ng/syslog-ng.conf
      owner("root"); group("adm"); perm(0640); stats_freq(0);

This ownership and permission setting is part of syslog-ng's global configuration. Of course I could just set the permissions to 0644 here. But this means that all log files would be readable by all users on this application server. Some logs contain sensitive information, so I don't want to grant read access to everyone.

Instead the permissions can also be set in syslog-ng's "destination" option. For /var/log/mail.log this is the default setting:

# grep "mail.log" /etc/syslog-ng/syslog-ng.conf
destination d_mail { file("/var/log/mail.log"); };

For this destination d_mail I want to create special file permissions:

# grep "mail.log" /etc/syslog-ng/syslog-ng.conf
destination d_mail { file("/var/log/mail.log" perm(0644)); };

Followed by a syslog-ng restart:

# /etc/init.d/syslog-ng restart
[ ok ] Stopping system logging: syslog-ng.
[ ok ] Starting system logging: syslog-ng.

Checking the log's file permissions again:

# ll /var/log/mail.log
-rw-r----- 1 root adm 108437 Jun 27 10:00 /var/log/mail.log

Hmm... the permissions are still the same?! Oh, wait... maybe syslog-ng needs to actually receive something from the mail log facility before it resets the permissions? Let's try that:

# echo "testmail" | mailx -s test root

Checking again:

# ll /var/log/mail.log
-rw-r--r-- 1 root adm 113261 Jun 27 10:55 /var/log/mail.log

Yep, that's it!
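For completeness, the counting check mentioned at the beginning can be sketched like this (count_mails and the sender pattern "myapp@" are assumptions for illustration, not the actual check):

```shell
#!/bin/sh
# Count the mails sent by the application in the given log files.
# The sender pattern "from=<myapp@" is an assumed example.
count_mails() {
    cat "$@" 2>/dev/null | grep -c 'from=<myapp@'
}

# On the server, running as the nagios user:
# count_mails /var/log/mail.log /var/log/mail.log.1
```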


Magento2 Load Balancer Health Check for AWS ELB
Monday - Jun 26th 2017 - by - (0 comments)

The Problem

Amazon Web Services (AWS) Elastic Load Balancing (ELB) performs a back-end 'Health Check' against each registered compute resource and expects a response code such as HTTP 200 'OK' before it starts sending traffic to it.

Out of the box, the eCommerce CMS Magento 2 can signal 2 potential causes of failure in a typical LEMP (Linux, NginX, MySQL & PHP) stack -

  • PHP fails = 502 Bad Gateway
  • MySQL fails = 503 Service Unavailable

The Solution

How are others doing it?

A few minutes of Googling turned up this approach: https://serverfault.com/questions/578984/how-to-set-up-elb-health-checks-with-multiple-applications-running-on-each-ec2-i

Adapting it to the Magento2 API URI '/rest/default/schema' results in the following code snippet:

## All AWS Health Checks from the ELBs arrive at the default server. Forward these requests on the appropriate configuration on this host.
  location /health-check/ {
    rewrite ^/health-check/(?<domain>[a-zA-Z0-9\.]+) /rest/default/schema break;
    # Lie about incoming protocol, to avoid the backend issuing a 301 redirect from insecure->secure,
    #  which would not be considered successful.
    proxy_set_header X-Forwarded-Proto 'https';
    proxy_set_header "Host" $domain;
  }

Why is this useful?

Unfortunately the ELB does not allow us to pass header information to NginX, so the above must be included in the default server block - more on that in a minute. The way the above ingests the URL lets us specify the intended host header, which is very useful if you run more than one site on each server. So now, by calling this check in the format http://{EC2 Public IP}/health-check/{base url}, we retrieve a poorly formatted JSON-encoded response.

What are the problems running in this configuration? 

It is a useful building block that we can extend, but it does not cover all of the failure domains of a Magento2 store. These include -

  • Being Administratively in maintenance mode 'php bin/magento maintenance:enable'
  • /vendor folder missing dependencies (returns: Autoload error Vendor autoload is not found. Please run 'composer install' under application root directory)
  • Redis Server becoming unavailable (returns: An error has happened during application run. See exception log for details. Could not write error message to log. Please use developer mode to see the message)

The above health check still returns 'HTTP 200 OK' in these scenarios, even though the responses contain the error messages, meaning the nodes will still receive live traffic despite not functioning correctly.

So how can we cater to these additional failure domains? Enter some simple PHP JSON decoding.

<?php
// Build the URL for the NginX health-check location from the hostname
// passed by the ELB (the base URL prefix is left empty here).
$url = '' . $_REQUEST['hostname'];
$ch = curl_init($url);

curl_setopt($ch, CURLOPT_TIMEOUT, 5);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$data = curl_exec($ch);
curl_close($ch);

$result = json_decode($data);

if (json_last_error() === JSON_ERROR_NONE) {
    // Valid JSON: Magento answered properly, report healthy to the ELB.
    http_response_code(200);
    echo 'OK';
} else {
    // Invalid or missing JSON (maintenance mode, autoload error, Redis down):
    // answer with a non-2xx code so the ELB stops routing traffic here.
    http_response_code(418);
    echo 'KO';
}

PHP is the natural language choice, given that this is what Magento2 uses natively. You can place the above inside an index.php file in an alternative location on the server - and use a separate PHP pool and user combination for additional security if required. It calls the initial solution and parses the result to test for valid JSON. In normal circumstances this will correctly route traffic to functioning nodes, but when any of the previously mentioned failure conditions is met, it will instead answer the Load Balancer with 'HTTP 418 I AM A TEAPOT' so that no more traffic is routed to the node.

As an added bonus, the data sent back to the load balancer is only a few bytes, while the full API response stays local to the server.

Full NginX Config

# ELB Health Check Config
server {
  listen 80 default_server;
  server_name  _;
  index index.php;
  root /var/www/lbstatus;
  access_log off;
  location = /favicon.ico {access_log off; log_not_found off;}
  location ~ index\.php$ {
    try_files $uri =404;
    fastcgi_intercept_errors on;
    fastcgi_pass fastcgi_backend;
    fastcgi_index index.php;
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
  }

  ## All AWS Health Checks from the ELBs arrive at the default server. Forward these requests on the appropriate configuration on this host.
  location /health-check/ {
    rewrite ^/health-check/(?<domain>[a-zA-Z0-9\.]+) /rest/default/schema break;
    # Lie about incoming protocol, to avoid the backend issuing a 301 redirect from insecure->secure,
    #  which would not be considered successful.
    proxy_set_header X-Forwarded-Proto 'https';
    proxy_set_header "Host" $domain;
  }
}

So there we go: with 2 levels of 'Health Checks' within the NginX default server configuration - one injecting the host header to return the API JSON (when working) and a simple PHP function checking the JSON response - we can tell the ELB when Magento2 is healthy. The health check URI in the ELB becomes '/?hostname={your base url}' and you should see 200 OK responses in the logs, presuming everything is ship shape.

Failure Domains Not Catered For

  • Front end issues, eg. broken CSS, JS, images
  • Products missing from categories, often Indexing issues
If you have any solutions that might help us programmatically address them, please get in touch and let us know in the comments.

