
check_nwc_health: rumms - UNKNOWN no interfaces
Thursday - Jul 20th 2017 - by - (0 comments)

Today I had to solve a special case where an Icinga 2 satellite server ran out of disk space in /var. After I increased the disk size I noticed that almost all network switches, checked via this satellite using check_nwc_health, returned an UNKNOWN status. Service output: rumms. 

[Screenshot: check_nwc_health service output "rumms"]

I verified this manually on the CLI:

# /usr/lib/nagios/plugins/check_nwc_health --hostname aswitch --community public --mode interface-usage --name Ethernet1/1
rumms
UNKNOWN - no interfaces

I manually re-listed all interfaces:

# /usr/lib/nagios/plugins/check_nwc_health --hostname aswitch --community public --mode list-interfaces
83886080 mgmt0
151060482 Vlan2
[...]
526649088 Ethernet101/1/29
526649152 Ethernet101/1/30
526649216 Ethernet101/1/31
526649280 Ethernet101/1/32
OK - have fun

And then the check worked again:

# /usr/lib/nagios/plugins/check_nwc_health --hostname aswitch --community public --mode interface-usage --name Ethernet1/1
OK - interface Ethernet1/1 (alias UCS-FI-A) usage is in:0.82% (82014272.36bit/s) out:3.21% (320758024.71bit/s) | 'Ethernet1/1_usage_in'=0.82%;80;90;0;100 'Ethernet1/1_usage_out'=3.21%;80;90;0;100 'Ethernet1/1_traffic_in'=82014272.36;8000000000;9000000000;0;10000000000 'Ethernet1/1_traffic_out'=320758024.71;8000000000;9000000000;0;10000000000

The reason for this is that by default check_nwc_health creates a "cached" list of interfaces per checked device. This cached list is a file in /var/tmp/check_nwc_health:

# ls -l /var/tmp/check_nwc_health | grep cache
-rw-r--r-- 1 nagios nagios  8192 Jul 20 08:03 01switch_interface_cache_d2e08e73bba4b976b8b4dcdcf66e3c7d
-rw-r--r-- 1 nagios nagios  8577 Jul 20 08:17 02switch_interface_cache_d2e08e73bba4b976b8b4dcdcf66e3c7d
-rw-r--r-- 1 nagios nagios  8192 Jul 20 08:04 aswitch_interface_cache_81b3d521b731e73215515a4f1f4a3ccf
-rw-r--r-- 1 nagios nagios     0 Jul 20 07:32 bswitch_interface_cache_81b3d521b731e73215515a4f1f4a3ccf
-rw-r--r-- 1 nagios nagios  8192 Jul 20 08:06 cswitch_interface_cache_81b3d521b731e73215515a4f1f4a3ccf
-rw-r--r-- 1 nagios nagios  7017 Jul 20 08:18 dswitch_interface_cache_81b3d521b731e73215515a4f1f4a3ccf
-rw-r--r-- 1 nagios nagios  7013 Jul 20 08:19 eswitch_interface_cache_81b3d521b731e73215515a4f1f4a3ccf
-rw-r--r-- 1 nagios nagios     0 Jul 20 07:31 fswitch_interface_cache_81b3d521b731e73215515a4f1f4a3ccf
-rw-rw-r-- 1 nagios nagios  9291 Jul 20 08:16 gswitch_interface_cache_81b3d521b731e73215515a4f1f4a3ccf
-rw-r--r-- 1 nagios nagios  6245 Jul 20 07:44 hswitch_interface_cache_d2e08e73bba4b976b8b4dcdcf66e3c7d
-rw-r--r-- 1 nagios nagios     0 Jul 20 07:46 iswitch_interface_cache_d2e08e73bba4b976b8b4dcdcf66e3c7d
-rw-r--r-- 1 nagios nagios  4096 Jul 20 08:12 jswitch_interface_cache_d2e08e73bba4b976b8b4dcdcf66e3c7d
-rw-r--r-- 1 nagios nagios  4096 Jul 20 07:46 kswitch_interface_cache_d2e08e73bba4b976b8b4dcdcf66e3c7d
[...]

Note the cache files with a size of 0 bytes. Such a file represents an empty list of interfaces for that device, so any requested interface is unknown.
Because /var was full the last time the interface cache file was written, the file ended up empty (0 bytes), causing check_nwc_health to believe there are no interfaces at all on the network device.

After removing the affected cache files the check worked again (if no interface cache file exists, it is simply re-created on the next run).
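
To clean up in bulk, the empty cache files can be found and removed in one go. A minimal sketch, assuming the default cache directory /var/tmp/check_nwc_health and that it is acceptable for the plugin to rebuild the deleted files on its next run:

# list the 0-byte interface cache files first
find /var/tmp/check_nwc_health -maxdepth 1 -name '*_interface_cache_*' -size 0 -ls

# then delete them; check_nwc_health re-creates them automatically
find /var/tmp/check_nwc_health -maxdepth 1 -name '*_interface_cache_*' -size 0 -delete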

 

Presenting new monitoring plugin: check_lxc
Wednesday - Jul 19th 2017 - by - (1 comments)

I'm proud to announce a new Nagios/Monitoring plugin: check_lxc.

As the name already tells you, this is a plugin to monitor Linux Containers (LXC). It needs to run on the LXC host and allows you to check the CPU, memory and swap usage of a container. The plugin can also verify that a container is configured for automatic boot.

Work on this plugin began several years ago, back in 2013. After recently adding a CPU check, I think the plugin is now "ready" to be used in the wild.

Back in 2013, when plugin development started, LXC was at version 0.8. I have taken extra care to keep compatibility across LXC releases, and as of today the plugin works from LXC 0.8 upwards.

Enough talk for now. Read the documentation of check_lxc, use the plugin and enjoy!

 

Varnish vcl reload not working with SystemD on Ubuntu 16.04
Tuesday - Jul 18th 2017 - by - (0 comments)

When running Varnish, a restart is usually unwanted because it clears the cache. For configuration changes in a VCL, a reload is the better option.

However, I came across an issue today where this reload doesn't work with SystemD (the OS is Ubuntu 16.04.2 LTS). The reason is the "ExecReload" line in the SystemD unit file for Varnish:

# grep ExecReload /etc/systemd/system/varnish.service
ExecReload=/usr/share/varnish/reload-vcl

This command (/usr/share/varnish/reload-vcl) reads the config file /etc/default/varnish - which is now obsolete when using SystemD (see Configure Varnish custom settings on Debian 8 Jessie and Ubuntu 16.04 LTS). An issue on the Github repository of Varnish confirms this bug.

A workaround (and a working one, I tested it) is to use the new "varnishreload" script. As of this writing the script is not yet part of the varnish package, but it will probably be added soon. I downloaded the script, saved it as /usr/sbin/varnishreload and gave it executable permissions. Then I modified the SystemD unit file for the Varnish service:

# grep ExecReload /etc/systemd/system/varnish.service
ExecReload=/usr/sbin/varnishreload
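
For reference, getting the script in place boils down to a couple of commands. A minimal sketch, where the download URL is a placeholder (fetch varnishreload from wherever the Varnish project publishes it); instead of editing the unit file directly, a drop-in created with "systemctl edit varnish" would work just as well:

# download the script (placeholder URL) and make it executable
wget -O /usr/sbin/varnishreload https://example.org/path/to/varnishreload
chmod 755 /usr/sbin/varnishreload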

Followed by a reload of SystemD:

# systemctl daemon-reload

 and a restart of Varnish:

# systemctl restart varnish

To test this, I modified the VCL in use (which, by the way, is not default.vcl) and removed a special debug header in the new config. If the reload works, Varnish should stop sending this header in its responses.
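
The presence of the header can be verified with a quick curl before and after the reload. A small sketch, assuming Varnish listens on port 6081 (as configured in the unit file above) and using X-Debug as a stand-in for the actual header name:

# shows the debug header while the old VCL is active; returns nothing once the reload removed it
curl -sI http://localhost:6081/ | grep -i '^X-Debug'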

# systemctl reload varnish

# systemctl status varnish
● varnish.service - Varnish Cache, a high-performance HTTP accelerator
   Loaded: loaded (/etc/systemd/system/varnish.service; disabled; vendor preset: enabled)
   Active: active (running) since Tue 2017-07-18 15:52:08 CEST; 31min ago
  Process: 7229 ExecReload=/usr/sbin/varnishreload (code=exited, status=0/SUCCESS)
  Process: 26848 ExecStart=/usr/sbin/varnishd -a :6081 -T localhost:6082 -f /etc/varnish/zerberos.vcl -S /etc/varnish/secret -s malloc,2048m (code=exited, status
 Main PID: 26850 (varnishd)
    Tasks: 218
   Memory: 143.6M
      CPU: 3.623s
   CGroup: /system.slice/varnish.service
           ├─26850 /usr/sbin/varnishd -a :6081 -T localhost:6082 -f /etc/varnish/zerberos.vcl -S /etc/varnish/secret -s malloc,2048m
           └─26858 /usr/sbin/varnishd -a :6081 -T localhost:6082 -f /etc/varnish/zerberos.vcl -S /etc/varnish/secret -s malloc,2048m

Jul 18 16:22:01 varnish1 varnishd[26850]: CLI telnet ::1 54288 ::1 6082 Wr 200 VCL compiled.
Jul 18 16:22:01 varnish1 varnishreload[7229]: VCL compiled.
Jul 18 16:22:01 varnish1 varnishd[26850]: CLI telnet ::1 54294 ::1 6082 Rd auth b3a13c2d09d6d3551504ace7665994ea9bccab035be9d9518d00ea6f36a8ead3
Jul 18 16:22:01 varnish1 varnishd[26850]: CLI telnet ::1 54294 ::1 6082 Wr 200 -----------------------------
                                            Varnish Cache CLI 1.0
                                            -----------------------------
                                            Linux,4.4.0-77-generic,x86_64,-junix,-smalloc,-smalloc,-hcritbit
                                            varnish-5.1.2 revision 6ece695
                                           
                                            Type 'help' for command list.
                                            Type 'quit' to close CLI session.
Jul 18 16:22:01 varnish1 varnishd[26850]: CLI telnet ::1 54294 ::1 6082 Rd ping
Jul 18 16:22:01 varnish1 varnishd[26850]: CLI telnet ::1 54294 ::1 6082 Wr 200 PONG 1500387721 1.0
Jul 18 16:22:01 varnish1 varnishd[26850]: CLI telnet ::1 54294 ::1 6082 Rd vcl.use reload_20170718_162201
Jul 18 16:22:01 varnish1 varnishd[26850]: CLI telnet ::1 54294 ::1 6082 Wr 200 VCL 'reload_20170718_162201' now active
Jul 18 16:22:01 varnish1 varnishreload[7229]: VCL 'reload_20170718_162201' now active
Jul 18 16:22:01 varnish1 systemd[1]: Reloaded Varnish Cache, a high-performance HTTP accelerator.

systemctl status seems to confirm a working reload. However, don't let yourself be fooled: the same kind of entries also appeared with the non-working reload script before. A manual check confirmed that the reload of the changed VCL config actually worked this time; the debug header was gone from the HTTP responses.

I changed the VCL again, re-enabled the header and ran another systemctl reload varnish, and the header showed up again. So make sure you're using the new varnishreload script when running Varnish on Ubuntu 16.04 LTS with SystemD (other Linux distributions might be affected as well; I didn't test that).

 

Count backwards with seq in Linux
Wednesday - Jul 12th 2017 - by - (0 comments)

I needed to manually create some basic web statistics using awstats (a one-shot statistic). My approach was to take all the rotated logs and create one big access log. I wanted the lines of that access log in the correct order to avoid confusing awstats.

First I unzipped all rotated logs:

gunzip *gz

Then I needed to get the log entries from rotated file 40 down to rotated file 9. But here's the catch: how do I count down without having to write out every single number from 40 to 9 (something like for i in 40 39 38 37, and so on)? I know how to count up automatically using seq:

$ seq 1 5
1
2
3
4
5

So I needed to find a way to count backwards. The solution? seq again :-)

seq offers an optional parameter between the starting and the ending number. From the --help output:

$ seq --help
Usage: seq [OPTION]... LAST
  or:  seq [OPTION]... FIRST LAST
  or:  seq [OPTION]... FIRST INCREMENT LAST
Print numbers from FIRST to LAST, in steps of INCREMENT.
[...]

Example: count up to 10 in steps of 2:

$ seq 1 2 10
1
3
5
7
9

The INCREMENT number can be negative, too:

$ seq 10 -1 1
10
9
8
7
6
5
4
3
2
1

And this is actually the way to count down. To put together all rotated logs in the correct order, I finally used the following command:

$ for i in $(seq 40 -1 9); do cat access.log.$i >> final.access.log; done
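
As a side note: recent Bash versions can do the same countdown with brace expansion, so the loop works without calling seq at all. A small equivalent sketch:

$ for i in {40..9}; do cat access.log.$i >> final.access.log; done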

 

Gandi domain registrar hacked?
Friday - Jul 7th 2017 - by - (0 comments)

Today we've received several messages that some websites didn't work anymore. Further analysis revealed that several domains suddenly had their DNS nameservers changed.

A whois lookup of an affected domain showed the following nameservers:

ns1.dnshost.ga
ns2.dnshost.ga

A DNS lookup using "dig -t NS" on the affected domains showed NS records of

ns1.example.com
ns2.example.com

The A records were set to 46.183.219.205 (an IP address registered in Latvia).
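
For reference, this kind of verification is quickly done with standard tools. A small sketch, with example.com standing in for an affected domain:

# registrar/registry view: which nameservers are on file?
whois example.com | grep -i 'name server'

# DNS view: which NS and A records are actually served?
dig -t NS +short example.com
dig -t A +short www.example.com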

We currently have 922 domains registered at Gandi. 7 of them were affected, and all their nameservers pointed to the ones above. Without our doing. Without Gandi having changed anything.

Direct communication with Gandi revealed that these manipulations didn't happen on our account only; several customers were affected. I was also assured that it has nothing to do with the new Gandi v5 platform, but that the problem lay somewhere in the communication between the Gandi backend and the domain registries (like nic.ch for Swiss domains).

This pretty much sounds like a hack of Gandi's backend to me. Ouch :-((

The domain settings were quickly restored and an update to the NIC servers was initiated. After a couple of hours our affected domains were working again. However, I'm still curious to hear what exactly caused this.

Update July 10th 2017: Gandi confirmed an "unauthorized connection" in their backend in a statement sent to the affected customers:

Following an unauthorized connection which occurred at one of the
technical providers we use to manage a number of geographic TLDs[2].

In all, 751 domains in total were affected by this incident, which
involved a unauthorized modification of the name servers [NS] assigned
to the affected domains that then forwarded traffic to a malicious site
exploiting security flaws in several browsers.

Additionally, SWITCH security (the registry of .ch domains) added a good technical article about that case here: https://securityblog.switch.ch/2017/07/07/94-ch-li-domain-names-hijacked-and-used-for-drive-by/ 

Update July 11th 2017: Gandi published a dedicated article on their news blog, in which they share details about what happened. It's really worth checking out. I appreciate Gandi's transparency!

 

55700 hours or the grey old age of a SATA drive
Friday - Jul 7th 2017 - by - (0 comments)

A few days ago I wrote about a problem where I could not boot Linux Mint 18.1 after running apt-get upgrade. It turned out to be caused by a dying hard drive.

The SMART output of this drive is quite impressive, especially the power-on hours:

smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-79-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Blue (SATA)
Device Model:     WDC WD5000AAKS-00V1A0
Serial Number:    WD-WMAWF2141256
LU WWN Device Id: 5 0014ee 0575fb08f
Firmware Version: 05.01D05
User Capacity:    500.107.862.016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Tue Jul  4 08:43:57 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)    Offline data collection activity
                          was suspended by an interrupting command from host.
                          Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                          without error or no self-test has ever
                          been run.
Total time to complete Offline
data collection:           ( 7380) seconds.
Offline data collection
capabilities:               (0x7b) SMART execute Offline immediate.
                          Auto Offline data collection on/off support.
                          Suspend Offline collection upon new
                          command.
                          Offline surface scan supported.
                          Self-test supported.
                          Conveyance Self-test supported.
                          Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                          power-saving mode.
                          Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                          General Purpose Logging supported.
Short self-test routine
recommended polling time:  (   2) minutes.
Extended self-test routine
recommended polling time:  (  88) minutes.
Conveyance self-test routine
recommended polling time:  (   5) minutes.
SCT capabilities:           (0x303f) SCT Status supported.
                          SCT Error Recovery Control supported.
                          SCT Feature Control supported.
                          SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       698
  3 Spin_Up_Time            0x0027   144   141   021    Pre-fail  Always       -       3800
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       104
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   024   024   000    Old_age   Always       -       55700
10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       102
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       37
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       66
194 Temperature_Celsius     0x0022   097   093   000    Old_age   Always       -       46
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   189   189   000    Old_age   Always       -       899
198 Offline_Uncorrectable   0x0030   197   197   000    Old_age   Offline      -       277
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   198   193   000    Old_age   Offline      -       464

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     53425         579131664
# 2  Extended offline    Completed without error       00%     50463         -
# 3  Extended offline    Completed without error       00%     40082         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Yes, this drive ran for 55'700 hours (more than six years of power-on time). Although a lot of defective sectors (and read errors) were detected, I was still able to copy the whole Linux Mint 18.1 installation to a new hard drive (only one file could not be copied due to read errors).
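
If you only want to keep an eye on the most telling attributes instead of reading the full report, a one-liner does the trick. A minimal sketch, assuming the drive is /dev/sda (adjust the device name):

# full report (as shown above)
smartctl -a /dev/sda

# just the attributes that matter most on an aging drive
smartctl -A /dev/sda | grep -E 'Power_On_Hours|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'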

 

Samsung UE55KU6400 defect picture when using external sources
Tuesday - Jul 4th 2017 - by - (0 comments)

A few days ago something strange happened with my television, a Samsung UE55KU6400 (Samsung 6 Series) I bought in November 2016. All of a sudden the picture from external sources (connected via HDMI) was gone/defective. The TV channels via cable were still working fine, so it was not a defect of the LED screen itself. But see for yourself:

[Photo: Samsung UE55KU6400 showing a distorted picture from an external HDMI source]

I tried several things to find the exact problem:

  • TV channels: Picture normal
  • Different HDMI cable: Same problem
  • Different external source (using a notebook's HDMI output): Same problem
  • Different HDMI port: Same problem

So either all HDMI ports had suddenly gone defective, or it had to be a problem in the TV's software (a few days before the issue I had noticed a software update on the TV). So I decided to go for a reset of the TV.

Once I reset the TV (through "Settings") I had to set it up again, including choosing the language, searching for all TV channels, setting up a Samsung account, etc. But in the end this turned out to be successful: the external sources were shown correctly again.

Summary: if you experience picture problems with external HDMI sources on your Samsung 6 series TV, try a software reset through the settings menu (once you've ruled out a bad HDMI cable or port and you're sure the external source itself works correctly, of course).

 

Permissions of log files automatically being reset by syslog
Tuesday - Jun 27th 2017 - by - (0 comments)

For a special application I built a simple monitoring check which reads /var/log/mail.log (and the rotated /var/log/mail.log.1) and counts the number of e-mails sent by that application.
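
The counting part is nothing fancy. A minimal sketch of the idea, with the sender address (app@example.com) and the Postfix-style log pattern as placeholders for the real check; the exit codes follow the usual Nagios plugin convention:

#!/bin/bash
# count mails sent by the application in the current and the previous mail log
COUNT=$(grep -c "from=<app@example.com>" /var/log/mail.log /var/log/mail.log.1 2>/dev/null | awk -F: '{sum+=$2} END {print sum+0}')
if [ "$COUNT" -gt 0 ]; then
  echo "OK - $COUNT mails sent by the application"; exit 0
else
  echo "CRITICAL - no mails sent by the application found"; exit 2
fi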

The check itself is executed through NRPE which runs as "nagios" user. Therefore the nagios user needs to be able to read /var/log/mail.log. Easy:

# chmod 644 /var/log/mail.log
# chmod 644 /var/log/mail.log.1

I even adapted the logrotate config file to ensure the rotated log file is also readable after a log rotation (using the "create" option):

/var/log/mail.info
/var/log/mail.warn
/var/log/mail.err
/var/log/mail.log
{
    weekly
    missingok
    notifempty
    compress
    delaycompress
    create 644 root adm
    sharedscripts
    postrotate
        invoke-rc.d syslog-ng reload > /dev/null
    endscript
}

I enabled the monitoring check and it worked. But just a couple of minutes later the check returned CRITICAL because the nagios user wasn't able to read the log file anymore. I checked, and indeed the permissions of /var/log/mail.log had been reset:

$ ll /var/log/mail.log
-rw-r----- 1 root adm 108437 Jun 27 10:00 /var/log/mail.log

It turns out that syslog-ng (which runs on this application server) resets the permissions automatically to the ones defined in the syslog-ng config. By default (here on a Debian Wheezy installation) this means:

# grep 640 /etc/syslog-ng/syslog-ng.conf
      owner("root"); group("adm"); perm(0640); stats_freq(0);

This ownership and permission setting is part of syslog-ng's global configuration. Of course I could just set the permissions to 0644 here. But this means that all log files would be readable by all users on this application server. Some logs contain sensitive information, so I don't want to grant read access to everyone.

Instead, the permissions can also be set per destination in syslog-ng. For /var/log/mail.log this is the default setting:

# grep "mail.log" /etc/syslog-ng/syslog-ng.conf
destination d_mail { file("/var/log/mail.log"); };

For this destination d_mail I want to set specific file permissions:

# grep "mail.log" /etc/syslog-ng/syslog-ng.conf
destination d_mail { file("/var/log/mail.log" perm(0644)); };
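
Before restarting, syslog-ng can validate the modified configuration. A quick sketch using its built-in syntax check:

# exits with 0 and prints nothing when the configuration parses cleanly
syslog-ng --syntax-only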

Followed by a syslog-ng restart:

# /etc/init.d/syslog-ng restart
[ ok ] Stopping system logging: syslog-ng.
[ ok ] Starting system logging: syslog-ng.

Checking the log's file permissions again:

# ll /var/log/mail.log
-rw-r----- 1 root adm 108437 Jun 27 10:00 /var/log/mail.log

Hmm... the permissions are still the same?! Oh, wait... maybe syslog-ng needs to actually receive something from the mail log facility before it applies the new permissions? Let's try that:

# echo "testmail" | mailx -s test root

Checking again:

# ll /var/log/mail.log
-rw-r--r-- 1 root adm 113261 Jun 27 10:55 /var/log/mail.log

Yep, that's it!

 

Magento2 Load Balancer Health Check for AWS ELB
Monday - Jun 26th 2017 - by - (0 comments)

The Problem

Amazon Web Services (AWS) Elastic Load Balancing (ELB) performs a back-end 'Health Check' against the compute resources behind it, expecting a response code such as HTTP 200 'OK' before it starts sending traffic.

Out of the box, the eCommerce CMS Magento 2 can cater for two potential causes of failure in a typical LEMP (Linux, NginX, MySQL & PHP) stack:

  • PHP fails = 502 Bad Gateway
  • MySQL fails = 503 Service Unavailable

The Solution

How are others doing it?

A few minutes of Googling turned up this approach: https://serverfault.com/questions/578984/how-to-set-up-elb-health-checks-with-multiple-applications-running-on-each-ec2-i

Adapting it to the Magento2 API URI '/rest/default/schema' results in the following code snippet:

## All AWS Health Checks from the ELBs arrive at the default server. Forward these requests on the appropriate configuration on this host.
  location /health-check/ {
    rewrite ^/health-check/(?<domain>[a-zA-Z0-9\.]+) /rest/default/schema break;
    # Lie about incoming protocol, to avoid the backend issuing a 301 redirect from insecure->secure,
    #  which would not be considered successful.
    proxy_set_header X-Forwarded-Proto 'https';
    proxy_set_header "Host" $domain;
    proxy_pass http://127.0.0.1;
  }

Why is this useful?

Unfortunately the ELB does not allow us to pass a Host header to NginX, so the above must be included in the default server block - more on that in a minute. The way the URL is ingested lets us specify the intended host header, which is very useful if you run more than one site on each server. So now, by calling this check in the format http://{EC2 Public IP}/health-check/{base url}, we retrieve a (poorly formatted) JSON-encoded response.
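
A quick way to see this first level in action from the server itself, sketched with shop.example.com standing in for the base URL of one of the hosted stores:

# should print 200 and (silently) fetch the Magento2 REST schema as JSON
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1/health-check/shop.example.com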

What are the problems running in this configuration? 

It is a useful building block that we can extend, but it does not cover all of the failure domains of a Magento2 store. These include -

  • Being Administratively in maintenance mode 'php bin/magento maintenance:enable'
  • /vendor folder missing dependencies (returns: Autoload error Vendor autoload is not found. Please run 'composer install' under application root directory)
  • Redis Server becoming unavailable (returns: An error has happened during application run. See exception log for details. Could not write error message to log. Please use developer mode to see the message)

In all of the above scenarios the health check still returns 'HTTP 200 OK' (just with an error message in the body), meaning the nodes will still have live traffic routed to them despite not functioning correctly.

So how can we cater to these additional failure domains? Enter some simple PHP JSON decoding.

<?php
// Forward the request to the first-level NginX health check location above,
// passing the requested hostname through to select the right site.
$url = 'http://127.0.0.1/health-check/' . $_REQUEST['hostname'];
$ch = curl_init($url);

curl_setopt($ch, CURLOPT_TIMEOUT, 5);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$data = curl_exec($ch);
curl_close($ch);

// The Magento2 REST schema is valid JSON; the error pages (maintenance mode,
// missing vendor folder, Redis down) are not.
$result = json_decode($data);

if (json_last_error() === JSON_ERROR_NONE) {
    http_response_code(200);
} else {
    http_response_code(418);
}
?>

PHP is the natural language choice, given that this is what Magento2 uses natively. You can place the above inside an index.php file in a separate location on the server - and use a separate PHP pool and user combination for additional security if required. It calls the initial solution and parses the result to test for valid JSON. In normal circumstances this will correctly route traffic to functioning nodes, but when any of the previously mentioned failure conditions are met it will instead answer the Load Balancer with 'HTTP 418 I AM A TEAPOT' so that traffic stops being routed to that node.

As an added bonus, the data being sent back to the load balancer is only 5 bytes, with the full API response staying locally within the server. 

Full NginX Config

# ELB Health Check Config
server {
  listen 80 default_server;
  server_name  _;
  index index.php;
  root /var/www/lbstatus;
  access_log off;
  location = /favicon.ico {access_log off; log_not_found off;}
  location ~ index\.php$ {
    try_files $uri =404;
    fastcgi_intercept_errors on;
    fastcgi_pass fastcgi_backend;
    fastcgi_index index.php;
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
  }

  ## All AWS Health Checks from the ELBs arrive at the default server. Forward these requests on the appropriate configuration on this host.
  location /health-check/ {
    rewrite ^/health-check/(?<domain>[a-zA-Z0-9\.]+) /rest/default/schema break;
    # Lie about incoming protocol, to avoid the backend issuing a 301 redirect from insecure->secure,
    #  which would not be considered successful.
    proxy_set_header X-Forwarded-Proto 'https';
    proxy_set_header "Host" $domain;
    proxy_pass http://127.0.0.1;
  }
}

So there we go: with two levels of 'Health Checks' within the NginX default server configuration (one for injecting the header information to return the API JSON when everything works, and a simple PHP script to validate the JSON response) we can tell the ELB when Magento2 is healthy. The health check URI in the ELB becomes '/?hostname={your base url}', and you should see 200 OK responses in the logs provided everything is ship shape.
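
The end-to-end behaviour can be verified from the instance itself. A small sketch, again with shop.example.com standing in for your base URL:

# healthy store: expect 200; maintenance mode, missing vendor folder or Redis down: expect 418
curl -s -o /dev/null -w '%{http_code}\n' 'http://127.0.0.1/?hostname=shop.example.com'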

Failure Domains Not Catered For

  • Front end issues, eg. broken CSS, JS, images
  • Products missing from categories, often Indexing issues

If you have any solutions that might help us programmatically address these, please get in touch and let us know in the comments.

 

apt-get update error: Could not get lock (SystemD and unattended APT)
Friday - Jun 23rd 2017 - by - (0 comments)

On an Ubuntu 16.04 Xenial machine I got the following error:

 # apt-get update
Reading package lists... Done
E: Could not get lock /var/lib/apt/lists/lock - open (11: Resource temporarily unavailable)
E: Unable to lock directory /var/lib/apt/lists/

A quick look at the processes showed that a daily run of apt-get update, managed by systemd, seemed to be hanging:

# ps auxf| grep -i apt
root     28044  0.0  0.0   4508  1700 ?        Ss   03:34   0:00 /bin/sh /usr/lib/apt/apt.systemd.daily
root     28077  0.0  0.1  44628  7344 ?        S    03:34   0:03  \_ apt-get -qq -y update
_apt     28081  0.2  1.4 237124 57924 ?        S    03:34   1:44      \_ /usr/lib/apt/methods/https
_apt     28082  0.0  0.1  43212  5844 ?        S    03:34   0:00      \_ /usr/lib/apt/methods/http
_apt     28083  0.0  0.1  43276  5584 ?        S    03:34   0:00      \_ /usr/lib/apt/methods/http
_apt     28588  0.0  0.1  41036  5432 ?        S    03:36   0:00      \_ /usr/lib/apt/methods/gpgv

I tried to see what these processes were doing... For some reason the main process (PID 28077) seemed to have run into a timeout and was stuck in a loop:

# strace -s 1000 -f -p 28077
strace: Process 28077 attached
select(10, [5 6 7 9], [], NULL, {0, 76562}) = 0 (Timeout)
select(10, [5 6 7 9], [], NULL, {0, 500000}) = 0 (Timeout)
select(10, [5 6 7 9], [], NULL, {0, 500000}) = 0 (Timeout)
select(10, [5 6 7 9], [], NULL, {0, 500000}) = 0 (Timeout)
select(10, [5 6 7 9], [], NULL, {0, 500000}) = 0 (Timeout)
select(10, [5 6 7 9], [], NULL, {0, 500000}) = 0 (Timeout)
select(10, [5 6 7 9], [], NULL, {0, 500000}) = 0 (Timeout)
select(10, [5 6 7 9], [], NULL, {0, 500000}) = 0 (Timeout)
select(10, [5 6 7 9], [], NULL, {0, 500000}) = 0 (Timeout)
select(10, [5 6 7 9], [], NULL, {0, 500000}) = 0 (Timeout)
select(10, [5 6 7 9], [], NULL, {0, 500000}) = 0 (Timeout)
select(10, [5 6 7 9], [], NULL, {0, 500000}) = 0 (Timeout)
select(10, [5 6 7 9], [], NULL, {0, 500000}) = 0 (Timeout)
select(10, [5 6 7 9], [], NULL, {0, 500000}) = 0 (Timeout)
select(10, [5 6 7 9], [], NULL, {0, 500000}) = 0 (Timeout)
select(10, [5 6 7 9], [], NULL, {0, 500000}) = 0 (Timeout)
select(10, [5 6 7 9], [], NULL, {0, 500000}) = 0 (Timeout)
select(10, [5 6 7 9], [], NULL, {0, 500000}) = 0 (Timeout)
select(10, [5 6 7 9], [], NULL, {0, 500000}^Cstrace: Process 28077 detached
 

So what is causing this timeout? Let's check what exactly SystemD did at 03:34 this morning:

Jun 23 03:34:19 onl-lb04-s systemd[1]: Started Daily apt activities.
Jun 23 03:34:19 onl-lb04-s systemd[1]: apt-daily.timer: Adding 9h 3min 10.786515s random time.
Jun 23 03:34:19 onl-lb04-s systemd[1]: apt-daily.timer: Adding 6h 10min 44.777700s random time.
Jun 23 03:34:19 onl-lb04-s systemd[1]: Starting Daily apt activities...

WTF? SystemD seems to have added a total of 15 hours and 13 minutes of random delay to the apt-daily timer? Is this why the process keeps hanging in a timeout and therefore locks apt?

There's quite some information on the Internet (if one knows what to look for) concerning this "problem". Obviously I'm not the only one stumbling over the automatic apt updates/upgrades which have been enabled by default since Ubuntu 16.04 Xenial. Some good reads:

In general the apt folks agree that the current setup of adding random execution delays is not a good idea:

"We should think about this a bit more" - Julian Andres Klode (apt maintainer)

In the second bug, 1686470 (which serves as general brainstorming for a technical redesign of the whole automatic apt update/upgrade mechanism), a definitive solution seems to be in the works, but it has yet to be released for Xenial.

Until a definitive fix (a redesign of the automatic apt process) is released, there are the following workarounds:

  • Manually kill such apt processes which appear "hanging" (due to the random and sometimes huge added time)
  • Disable automatic updates/upgrades in /etc/apt/apt.conf.d/20auto-upgrades by setting both values to 0 (see the example after this list)
  • If you want automatic apt updates and/or upgrades on your Xenial system, do so the legacy way using cron
  • Wait until bug 1686470 is fixed
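
For the second workaround, this is what the file looks like with automatic runs turned off. A minimal sketch (run as root) that simply overwrites both settings:

# disable the daily apt update and the unattended upgrade run
cat > /etc/apt/apt.conf.d/20auto-upgrades <<'EOF'
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";
EOF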

Update June 26th 2017:
After additional research it seems the timeout (seen in the strace above) didn't come from the added random time, but rather from apt itself trying to establish a connection to a proxy. apt on this particular server was set up to use an HTTP proxy:

# cat /etc/apt/apt.conf
Acquire::http::Proxy "http://myproxyserver.local:8080";

However apt-get update got stuck on a https repository:

# apt-get update
Ign:1 https://packagecloud.io/varnishcache/varnish5/ubuntu xenial InRelease
0% [Waiting for headers] [Waiting for headers]^C                        

The reason seems to be a problem with the firewall rules, as I wasn't able to communicate with the proxy server at all, not even with curl:

# export https_proxy=http://myproxyserver.local:8080
# curl https://www.claudiokuenzler.com/robots.txt
curl: (56) Proxy CONNECT aborted




 

