
Monitoring multiple Varnish instances
Tuesday - Nov 14th 2017

To monitor Varnish performance I've been using check_varnish.py for quite some time now. It uses the varnishadm command in the background to retrieve all kinds of values, for example the number of hits, misses and requests.

Varnish also allows multiple processes to run in parallel on the same host (see the example after the list below), as long as:

- The work directory is different. This is handled by assigning a different work directory name (-n name) to the process. This setting is better known as the "instance name", although work directory would be the technically correct term.

- The listener ports are different. This applies to both the http listener (-a) and the management listener (-T).
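
As an illustration, here is a minimal sketch of launching two such instances side by side (instance names, ports, VCL paths and cache sizes are made up for this example, not taken from an actual setup):

# first instance: its own work directory ("instance name") and its own listener ports
varnishd -n varnish-prod -a :6081 -T localhost:6082 -f /etc/varnish/prod.vcl -s malloc,1g

# second instance: different instance name and different http/management ports
varnishd -n varnish-test -a :6091 -T localhost:6092 -f /etc/varnish/test.vcl -s malloc,256m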

The problem with check_varnish.py? It doesn't support the -n parameter to query a certain Varnish process. Or better said: It didn't support the -n parameter. I modified the plugin and created a pull request for the upstream/original plugin.

With these modifications, check_varnish.py is now able to monitor multiple Varnish processes/instances, and it stays backward compatible with single Varnish processes launched without the -n parameter:

# ./check_varnish.py -f MAIN.cache_miss -n varnish-test
VARNISH OK - MAIN.cache_miss is 683744
| 'MAIN.cache_miss'=683744
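
If you have several instances running on the same host, the check can simply be repeated per instance name; a small hypothetical loop (instance names are just examples) illustrates this:

# loop over all local instance names and query the same counter on each
for instance in varnish-prod varnish-test; do
    ./check_varnish.py -f MAIN.cache_miss -n "$instance"
done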

 

Wetek Play/Openelec Box update to LibreElec 8.2 features tvheadend wizard
Sunday - Nov 12th 2017

It's been quite some time since I last fiddled around with my Wetek Openelec Box (which is the same device as a Wetek Play 1, just branded differently). I've kept the original OpenElec 6.95.3 on it, but I thought it was time to update.

I downloaded LibreELEC-WeTek_Play.arm-8.2.0.tar from https://libreelec.tv/downloads/, placed the tar file into the "Update" folder of the Wetek Openelec device's Samba share, and then rebooted the device. This is probably the easiest way to update the device.

Openelec Shares

After the reboot the device powered up, detected the new tar update file and started the upgrade. After a while Kodi booted up and I had to redo my configuration. I also needed to set up the channels in TVHeadend again... I remember that doing this last time was quite a pain (see "TVHeadend Mux Scan Settings for Cablecom and Thurcom (Switzerland)"). First I needed to uninstall the existing TVHeadend application and install the newest one: due to the switch from OpenElec to LibreElec the repositories changed, and this was not updated correctly.

Once I was in the browser configuring TVHeadend (default on port 9981), something new caught my eye: A wizard!

Screenshots: TVHeadend setup wizard steps

Looks like TVHeadend setup got much easier now!

Note: Be careful with the authentication step! As you can see, I added the whole local range (192.168.1.0/24) to the "Allowed network" field. It turned out that the Tvheadend HTSP Client (the client application connecting to the TVHeadend server) connects to TVHeadend via localhost. So make sure you also add a user for localhost connections - or change "Allowed network" to "0.0.0.0/0". You need to add the user credentials in the Tvheadend HTSP Client settings, too.

Note 2: I also decided to update my second Wetek device, a Wetek Play 2, today. This one is a little different, as it runs Android with Kodi on top. The upgrade happens through an OTA update using the WeOS app inside Android. After the upgrade to the latest WeOS 3.2 with Kodi 17.4, the "back" button of my remote control didn't work anymore. One of the most important buttons! It turned out that there seems to be a bug in Kodi 17.x causing this problem. A workaround is to disable "Joystick Support" for the peripheral devices. To do this: Home Screen -> Add-ons -> My add-ons -> Peripheral libraries -> select "Joystick Support" and disable the add-on.

 

Confused ElasticSearch refuses to insert data due to mapping conflict
Tuesday - Oct 31st 2017

The biggest and best reason to run an ELK stack is that you have one big database (oh no, I just wrote it: big data...) for all kinds of logs. All kinds of filters in Kibana let you find exactly what you need (once you've figured out how) and let you create nice graphs for statistics or monitoring.

But some negative points, or better said potential conflicts, may cross your path too. I'm talking about mapping conflicts.

Let's assume you have the following log message arriving in Logstash and then sent to ElasticSearch:

"message": "{\"time\":\"2017-10-31T12:13:36.194Z\",\"tags\":[\"foul\",\"penalty\",\"home\"],\"action\":\"referee.decision\",\"data\":{\"team_id\":404,\"player_id\":652020}}\r",

The message runs through a json filter in Logstash in order to split up the fields. By default, Logstash automatically recognizes the "time" field as a "date", because of its ISO8601 format. From https://www.elastic.co/guide/en/logstash/current/plugins-filters-date.html:

"ISO8601 - should parse any valid ISO8601 timestamp, such as 2011-04-19T03:44:01.103Z"

But now another message from another application arrives:

"message": "{\"action\":\"lap.completed\",\"time\":\"01:51:560\",\"data\":{\"car_id\":23,\"pilot_id\":60}}\r",

In this message the "time" field holds the amount of time needed to complete the race lap (1 minute, 51 seconds, 560 ms). That's definitely not a date. But because the index is the same in this case, ElasticSearch gets confused about the mapping.
This can also be seen in Kibana under Management -> Index Patterns:

Kibana mapping conflict 

In the details of the field "time" one can see that in almost every daily index the "time" field was mapped as "date". But in certain daily indexes it was mapped as "long":

Kibana field conflict 
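
The same information can also be pulled directly from ElasticSearch with the field mapping API. A quick check from the command line might look like this (assuming the default port 9200 on localhost and the daily docker-YYYY.MM.DD index pattern seen in the logs below):

# compare how the "time" field is mapped in two different daily indexes
curl -s 'http://localhost:9200/docker-2017.10.30/_mapping/field/time?pretty'
curl -s 'http://localhost:9200/docker-2017.10.31/_mapping/field/time?pretty'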

ElasticSearch doesn't like such mapping conflicts at all and refuses to insert the received message. Error messages will show up in the ElasticSearch log:

[2017-10-31T13:09:44,836][DEBUG][o.e.a.b.TransportShardBulkAction] [ES02] [docker-2017.10.31][0] failed to execute bulk item (index) BulkShardRequest [[docker-2017.10.31][0]] containing [index {[docker-2017.10.31][docker][AV9yVmKDj3U_Ft3cxfu2], source[{"source_host":"somehost","data":{"player_id":1325124,"team_id":52},"level":6,"created":"2017-10-24T12:06:39.663803227Z","message":"{\"time\":\"2017-10-31T12:09:44.791Z\",\"tags\":[\"foul\",\"penalty\",\"guest\"],\"action\":\"referee.decision\",\"data\":{\"team_id\":52,\"player_id\":1325124}}\r","type":"docker","version":"1.1","tags":["foul","penalty","guest"]"protocol":0,"@timestamp":"2017-10-31T12:09:44.791Z","host":"docker01","@version":"1","action":"referee.decision","time":"2017-10-31T12:09:44.791Z"}]}]
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [time]
    at org.elasticsearch.index.mapper.FieldMapper.parse(FieldMapper.java:298) ~[elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:468) ~[elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:591) ~[elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:396) ~[elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:373) ~[elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.index.mapper.DocumentParser.internalParseDocument(DocumentParser.java:93) ~[elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:66) ~[elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:277) ~[elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:530) ~[elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.index.shard.IndexShard.prepareIndexOnPrimary(IndexShard.java:507) ~[elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.prepareIndexOperationOnPrimary(TransportShardBulkAction.java:458) ~[elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.executeIndexRequestOnPrimary(TransportShardBulkAction.java:466) ~[elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:146) [elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:115) [elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:70) [elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:975) [elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:944) [elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:113) [elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:345) [elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:270) [elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:924) [elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:921) [elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:151) [elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationLock(IndexShard.java:1659) [elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryShardReference(TransportReplicationAction.java:933) [elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.action.support.replication.TransportReplicationAction.access$500(TransportReplicationAction.java:92) [elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:291) [elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:266) [elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:248) [elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:644) [elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.0.jar:5.6.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: java.lang.NumberFormatException: For input string: "2017-10-31T12:09:44.791Z"
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043) ~[?:?]
    at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110) ~[?:?]
    at java.lang.Double.parseDouble(Double.java:538) ~[?:1.8.0_144]
    at org.elasticsearch.common.xcontent.support.AbstractXContentParser.longValue(AbstractXContentParser.java:187) ~[elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.index.mapper.NumberFieldMapper$NumberType$7.parse(NumberFieldMapper.java:737) ~[elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.index.mapper.NumberFieldMapper$NumberType$7.parse(NumberFieldMapper.java:709) ~[elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.index.mapper.NumberFieldMapper.parseCreateField(NumberFieldMapper.java:1072) ~[elasticsearch-5.6.0.jar:5.6.0]
    at org.elasticsearch.index.mapper.FieldMapper.parse(FieldMapper.java:287) ~[elasticsearch-5.6.0.jar:5.6.0]
    ... 36 more

How can this be solved? There are of course several possibilities:

- The proper but almost impossible way: All applications writing into the same index must have a defined structure and a common understanding of field names. If you can get all developers thinking the same way, you're either working for the perfect company with a perfect documentation system or you're a dictator.

- Resolve the conflict: You could run all the indexes where "time" was mapped as "long" through Logstash again, adding a mutate filter that forces the "time" field to type "date". This helps for the messages using "time" as a date field, but not for the other kinds of messages, and it means effectively re-indexing those "long" indexes. Good luck...

- In my case I chose a different solution, which is by far not perfect. But I figured that all relevant messages in that index which contain a "time" field do indeed use it as a date. As this information is already covered by the automatically created "@timestamp" field, I simply decided to drop the "time" field in Logstash:

filter {
    if [type] == "docker" {
    [...]
        mutate { remove_field => [ "time" ] }
    }
}

Right after this and a restart of Logstash, the following log entries appeared in ElasticSearch:

[2017-10-31T13:12:19,386][INFO ][o.e.c.m.MetaDataMappingService] [ES02] [docker-2017.10.31/kpW-7vceQWCQgza3lGK6Dg] update_mapping [docker]
[2017-10-31T13:12:28,584][INFO ][o.e.c.m.MetaDataMappingService] [ES02] [docker-2017.10.31/kpW-7vceQWCQgza3lGK6Dg] update_mapping [docker]
[2017-10-31T13:12:39,458][INFO ][o.e.c.m.MetaDataMappingService] [ES02] [docker-2017.10.31/kpW-7vceQWCQgza3lGK6Dg] update_mapping [docker]
[2017-10-31T13:13:41,338][INFO ][o.e.c.m.MetaDataMappingService] [ES02] [docker-2017.10.31/kpW-7vceQWCQgza3lGK6Dg] update_mapping [docker]

And the log entries were added into the ElasticSearch index again.
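
To double-check that documents keep flowing into the index after such a change, a simple count query against the daily index helps (again assuming ElasticSearch listens on the default localhost:9200):

# the count should keep increasing while Logstash is shipping events
curl -s 'http://localhost:9200/docker-2017.10.31/_count?pretty'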

PS: Let me know in the comments if this can be handled in a better way.

 

XT Commerce http status 500 without any PHP errors
Tuesday - Oct 31st 2017

While I was trying to figure out why an XTCommerce shop didn't run anymore after a server migration (it returned HTTP status 500, without any errors), I came across the following important information (source):

Inside the document root is a folder "export". Inside this folder there is a file called "_error_reporting.admin". In order to see PHP errors, this file needs to be renamed to "_error_reporting.all" so the application is allowed to display PHP errors; otherwise simply nothing will show up:

root@webserver /var/www/shop/export # mv _error_reporting.admin _error_reporting.all

Without renaming this file there won't be any PHP errors - neither in the browser nor in the log files.
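
If you are unsure whether display_errors is enabled at all, a quick look at the effective PHP configuration helps; note that the Plesk PHP 5.3 path below is only an assumption derived from the include_path in the error message further down:

# CLI configuration (may differ from the web server's SAPI)
php -i | grep -i display_errors

# the Plesk PHP 5.3 binary presumably used by this vhost
/opt/plesk/php/5.3/bin/php -i | grep -i display_errors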

Right after this (and having display_errors = on) there was finally a PHP error message showing up:

Warning: require_once(/var/www/shop/admin/includes/magnalister/php/lib/MagnaConnector.php): failed to open stream: No such file or directory in /var/www/shop/magnaCallback.php on line 653 Fatal error: require_once(): Failed opening required '/var/www/shop/admin/includes/magnalister/php/lib/MagnaConnector.php' (include_path='.:/opt/plesk/php/5.3/share/pear') in /var/www/shop/magnaCallback.php on line 653

It turned out to be a missing file: /var/www/shop/admin/includes/magnalister/php/lib/MagnaConnector.php didn't exist anymore on this server.

 

Automatically cleaning up archived WAL files on a PostgreSQL server
Friday - Oct 27th 2017

It's been a couple of weeks since I set up a PostgreSQL replication and added it to our monitoring system (see How to monitor a PostgreSQL replication) and it has been running smoothly so far. But in the past few days a disk usage warning popped up.

Although the databases themselves only use around 10GB of disk space, the WAL files (especially the archived WAL files) eat 63GB!

This is because by default the archived WAL files are kept forever if "archive_mode" is set to on in the PostgreSQL config:

archive_mode = on        # enables archiving; off, on, or always
archive_command = 'cp %p /var/lib/postgresql/9.6/main/archive/%f'
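
To see how much space the archive actually uses and how many WAL segments it holds, a quick check on the archive directory (path taken from the archive_command above) is enough:

# size of the archive directory and number of archived WAL segments
du -sh /var/lib/postgresql/9.6/main/archive
ls /var/lib/postgresql/9.6/main/archive | wc -l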

I thought the solution would be easy: just disable archive_mode on the master and enable it on the replica (a hot standby). NOPE! I watched the replica as the WAL files rotated through (I have wal_keep_segments = 32), but no files were created in the archive directory.

A look at an older mail from February 2014 in the PostgreSQL mailing list reveals:

"It works fine, only the server will not generate WAL while it is in recovery.  As soon as you promote the standby, it will archive ist WALs."

A hot_standby replica server is basically ALWAYS running in recovery, which means that the "archive_command" will never run on it. Lesson 1 learned: cleaning up must be done on the master server.
Note: This is only true for hot_standby; it may be different for other kinds of replication modes.

To clean up the archived WAL files, there is a dedicated command: pg_archivecleanup. It can be added to the recovery.conf on a standby server (not hot_standby!) or used as a standalone command:

pg_archivecleanup [option...] archivelocation oldestkeptwalfile
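
Before deleting anything, pg_archivecleanup's own dry-run flag (-n) can be used to preview what would be removed; the WAL file name below is just an example taken from the output further down:

# print the files that would be removed without actually deleting them
pg_archivecleanup -n /var/lib/postgresql/9.6/main/archive 0000000100000002000000B6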

I decided to go with the standalone command and build a wrapper around it. This resulted in the shell script walarchivecleanup.sh. The script accepts different options and is able to dynamically look up the "oldestkeptwalfile" based on a maximum age parameter (-a). A specific "oldestkeptwalfile" can also be given (-f).

Example:

# ./walarchivecleanup.sh -p /var/lib/postgresql/9.6/main/archive -a 14 -d
pg_archivecleanup: keep WAL file "/var/lib/postgresql/9.6/main/archive/0000000100000002000000B6" and later
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000001000000E6"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000002000000B1"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000001000000B0"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/000000010000000200000056"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/00000001000000020000008F"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/00000001000000020000006F"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000001000000BC"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000001000000A2"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000001000000B6"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000001000000A4"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/00000001000000020000004F"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000001000000D0"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/00000001000000020000004E"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000001000000F1"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000002000000B5"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/000000010000000200000070"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/00000001000000020000001C"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000002000000B4"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/000000010000000200000039"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000001000000E0"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/0000000100000001000000FD"
pg_archivecleanup: removing file "/var/lib/postgresql/9.6/main/archive/00000001000000020000003E"
[...]

General information and usage:

$ ./walarchivecleanup.sh
./walarchivecleanup.sh (c) 2017 Claudio Kuenzler
This script helps to clean up archived WAL logs on a PostgreSQL master server using the pg_archivecleanup command.
Please note that WAL archiving currently only works on a master server (as of 9.6).
---------------------
Options:
  -p         Path to the archived WAL logs (e.g. /var/lib/postgresql/9.6/main/archive)
  -a         Age of archived logs to keep (days), anything older will be deleted
  -f         Specify a certain archived WAL file, anything older than this file will be deleted
             Note: If you use -f, it will override -a parameter
  -c         Full path to pg_archivecleanup command (if not found in $PATH)
  -d         Show debug information
  -n         Dry run (simulation only)
---------------------
Usage: ./walarchivecleanup.sh -p archivepath -a age (days) [-d debug] [-f archivefile] [-c path_to_pg_archivecleanup]
Example 1: ./walarchivecleanup.sh -p /var/lib/postgresql/9.6/main/archive -a 10
Example 2: ./walarchivecleanup.sh -p /var/lib/postgresql/9.6/main/archive -f 00000001000000010000001E
---------------------
Cronjob example: 00 03 * * * /root/scripts/walarchivecleanup.sh -p /var/lib/postgresql/9.6/main/archive -a 14
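
For those wondering how the age-based lookup behind the -a option can work, the rough idea is to pick the newest archived file older than the given age and hand it to pg_archivecleanup as "oldestkeptwalfile". A simplified sketch (not the actual script, which adds option handling, dry-run and sanity checks):

# keep the last 14 days: find the newest archived WAL file older than 14 days...
ARCHIVE=/var/lib/postgresql/9.6/main/archive
OLDEST=$(find "$ARCHIVE" -maxdepth 1 -type f -mtime +14 -printf '%T@ %f\n' | sort -n | tail -1 | awk '{print $2}')

# ...and let pg_archivecleanup delete everything older than that file
[ -n "$OLDEST" ] && pg_archivecleanup "$ARCHIVE" "$OLDEST"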

The script is now published on Github and can be found here: https://github.com/Napsty/scripts/blob/master/pgsql/walarchivecleanup.sh. Enjoy!

 

check_smart v5.8 now supports regex for MegaRaid interface
Thursday - Oct 19th 2017

Although there hasn't been a lot of news about the monitoring plugins I maintain lately, that doesn't mean they're dead.

Today a new version of check_smart.pl is ready: v5.8. The new version allows the global parameter (-g/--global) to be used in conjunction with a regular expression in the interface parameter (-i/--interface) for MegaRaid controllers.

A practical example:

$ sudo ./check_smart.pl -g /dev/sda -i megaraid,[8-9]
OK: [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean|

This checks the logical drive /dev/sda together with the physical drives megaraid,8 and megaraid,9. It is helpful for folks who prefer to run one monitoring check across all drives instead of setting up individual checks per drive.

Credits go to Bobby Jones for the contribution.

Documentation and examples were updated accordingly on the plugin's page: Monitoring Nagios Plugin check_smart SMART Hard Drive Disk Check

 

Nginx: Serve error pages from reverse proxy, not from upstream server
Thursday - Sep 21st 2017

If you run Nginx as a reverse proxy and you want to serve error pages from the reverse proxy itself, don't forget to set the following proxy setting:

proxy_intercept_errors on;

Without this, Nginx will forward the error page coming from the upstream server to the client.

If you want certain error pages to still be delivered from the upstream server (for example 404), then simply don't define an error_page 404 on the reverse proxy:

  error_page 400 /400.html;
  location /400.html {
    root   /var/www/errorpages;
    internal;
  }

  error_page 500 /500.html;
  location /500.html {
    root   /var/www/errorpages;
    internal;
  }

  error_page 502 /502.html;
  location /502.html {
    root   /var/www/errorpages;
    internal;
  }

  error_page 503 /503.html;
  location /503.html {
    root   /var/www/errorpages;
    internal;
  }

  error_page 504 /504.html;
  location /504.html {
    root   /var/www/errorpages;
    internal;
  }
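
To verify the behavior after reloading Nginx, a quick curl against the proxy helps: with proxy_intercept_errors enabled, an upstream failure (for example a stopped backend) should now return the proxy's own error page instead of the upstream's. Host name and path are placeholders:

# the status code stays 502/503, but the body should now come from /var/www/errorpages
curl -s -o /dev/null -w '%{http_code}\n' http://proxy.example.com/
curl -s http://proxy.example.com/ | head -n 5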

 

Varnish panic / crash due to low workspace sizing
Monday - Sep 18th 2017

Last week I experienced several Varnish panics/crashes on a high-traffic website. Before this Varnish was put into production, stress tests (with up to 3200 concurrent connections) had all been successful. However, as soon as this Varnish went into production, it took around 20 minutes until the first crash happened.
The log entry was unfortunately not very helpful:

Child (6970) Panic at: Tue, 12 Sep 2017 18:23:34 GMT#012Assert error in VEP_Init(), cache/cache_esi_parse.c line 1033:#012  Condition((vep) != 0) not true.#012version = varnish-5.1.3 revision 05c5ac6b9, vrt api = 6.0#012ident = Linux,4.4.0-89-generic,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll#012now = 3055882.671440 (mono), 1505240579.638393 (real)#012Backtrace:#012  0x438c25: /usr/sbin/varnishd() [0x438c25]#012  0x4256c2: /usr/sbin/varnishd(VEP_Init+0x172) [0x4256c2]#012  0x42324b: /usr/sbin/varnishd() [0x42324b]#012  0x42a2f5: /usr/sbin/varnishd(VFP_Open+0x85) [0x42a2f5]#012  0x427811: /usr/sbin/varnishd() [0x427811]#012  0x4520a2: /usr/sbin/varnishd() [0x4520a2]#012  0x45254b: /usr/sbin/varnishd() [0x45254b]#012  0x7feff213a6ba: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7feff213a6ba]#012  0x7feff1e703dd: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7feff1e703dd]#012thread = (cache-worker)#012thr.req = (nil) {#012},#012thr.busyobj = 0x7fefb779f020 {#012  ws = 0x7fefb779f0a0 {#012    OVERFLOWED id = \"Bo\",#012    {s, f, r, e} = {0x7fefb77a0f60, +57472, (nil), +57496},#012  },#012  retries = 1, failed = 0, flags = {do_esi, is_gzip},#012  http_conn = 0x7fefb77a8b30 {#012    fd = 137 (@0x7fefcd3887f4),#012    doclose = NULL,#012    ws = 0x7fefb779f0a0 {#012      [Already dumped, see above]#012    },#012    {rxbuf_b, rxbuf_e} = {0x7fefb77a8b90, 0x7fefb77a8eb4},#012    {pipeline_b, pipeline_e} = {0x7fefb77a8eb4, 0x7fefb77aeef0},#012    content_length = -1,#012    body_status = chunked,#012    first_byte_timeout = 60.000000,#012    between_bytes_timeout = 60.000000,#012  },#012  filters = ESI_GZIP=0 GUNZIP=0#012  director_req = 0x7fefcf76e2f8 {#012    vcl_name = niobe,#012    type = backend {#012      display_name = reload_20170821_100741.niobe,#012      ipv4 = 127.0.0.1,#012      port = 8080,#012      hosthdr = 127.0.0.1,#012      health = healthy,#012      admin_health = probe, changed = 1505067128.961975,#012      n_conn = 82,#012    },#012  },#012  director_resp = director_req,#012  http[bereq] = 0x7fefb779f670 {#012    ws = 0x7fefb779f0a0 {#012      [Already dumped, see above]
[...]

After a restart, the next crash happened just a few minutes later. Because this was a go-live, there was not much time to find the cause, so I switched the traffic to a standby machine on which I installed Varnish 4.1.
Note: As of this writing in September 2017, Varnish 4.1 is considered the "stable" version while 5.1 is the "latest" version.
But even with Varnish 4.1 there was a crash, although it took much longer:

Child (1826) Last panic at: Tue, 12 Sep 2017 20:41:32 GMT#012"Assert error in http_EstimateWS(), cache/cache_http.c line 793:#012  Condition((fm->hd[u].b) != 0) not true.#012thread = (cache-worker)#012version = varnish-4.1.8 revision d266ac5c6#012ident = Linux,4.4.0-89-generic,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll#012now = 2968968.394245 (mono), 1505248890.997950 (real)#012Backtrace:#012  0x434122: pan_ic+0x182#012  0x42d4f7: http_EstimateWS+0x3e7#012  0x422a81: vbf_beresp2obj+0x51#012  0x4233ff: vbf_fetch_thread+0x37f#012  0x44a5d9: WRK_Thread+0x4c9#012  0x44aa4b: pool_thread+0x2b#012  0x7f8f0038d6ba: libpthread.so.0(+0x76ba) [0x7f8f0038d6ba]#012  0x7f8f000c33dd: libc.so.6(clone+0x6d) [0x7f8f000c33dd]#012busyobj = 0x7f8ee7124020 {#012  ws = 0x7f8ee71240e0 {#012    OVERFLOWED id = \"Bo\",#012    {s,f,r,e} = {0x7f8ee7125f98,+57440,(nil),+57440},#012  },#012  refcnt = 2,#012  retries = 1, failed = 1, state = 1,#012  flags = {do_esi, is_gzip},#012  director_req = 0x7f8eff553ab8 {#012    vcl_name = niobe,#012    type = backend {#012      display_name = boot.niobe,#012      ipv4 = 127.0.0.1,#012      port = 8080,#012      hosthdr = 127.0.0.1,#012      health=healthy, admin_health=probe, changed=1505248796.1,#012      n_conn = 34,#012    },#012  },#012  director_resp = director_req,#012  http[bereq] = 0x7f8ee71246a8 {#012    ws[Bo] = 0x7f8ee71240e0,#012

Here the error message is different (Assert error in http_EstimateWS) than before. But at least Varnish 4.1 behaved much more stably and didn't crash every few minutes like Varnish 5.1. This finally gave me time to troubleshoot.

During my research I came across an old bug (trac ticket 1352, meanwhile migrated to GitHub issue 1352) from 2013. It wasn't exactly the same error, but it looked similar due to the cache_esi_parse file and the vep condition:

Child (22377) Panic message: Assert error in VEP_Init(), cache_esi_parse.c line 1001:
Condition((sp->wrk->vep) != 0) not true.
thread = (cache-worker)

The ticket was closed back then as "invalid", with a comment to increase the sess_workspace:

"This assert is caused by the session workspace running out. When doing ESI deliveries, the session workspace usage will increase, especially when doing nested ESI includes. Increase the sess_workspace runtime parameter.
Regards, Martin Blix Grydeland"

Martin Blix Grydeland is a senior developer at Varnish. So if he pinpoints this error message to the session workspace, I believe him (who else would be able to understand the panic message anyway?).

While researching "sess_workspace" and its usage, I came across a helpful article (Varnish "sess_workspace" and why it is important) by Cosimo Streppone. It basically explains what this parameter means and how to set it during Varnish startup (-p sess_workspace=N).
This description also matches the environment of my Varnish setup:

"We found out we had to increase the default (16kb), especially since we're doing quite a bit of HTTP header copying and rewriting around."

Indeed the website running through my panicked Varnish also has a lot more HTTP headers (and rewrites) than a usual website. 

I finally thought this could be the cause; however, the sess_workspace parameter doesn't exist anymore in Varnish 5.1. From the upgrade notes of Varnish 4.0:

sess_workspace

In 3.0 it was often necessary to increase sess_workspace if a lot of VMODs, complex header operations or ESI were in use.
This is no longer necessary, because ESI scratch space happens elsewhere in 4.0.
If you are using a lot of VMODs, you may need to increase either workspace_backend and workspace_client based on where your VMOD is doing its work.

So instead I had to increase two parameters: workspace_backend and workspace_client. For this I adapted Varnish's systemd unit file:

claudio@varnish02:~$ cat /etc/systemd/system/varnish.service  | grep ExecStart
ExecStart=/usr/sbin/varnishd -a :6081 -T localhost:6082 -f /etc/varnish/niobe.vcl -S /etc/varnish/secret -s malloc,20480m -p workspace_client=256k -p workspace_backend=256k

This worked fine on Varnish 4.1, and after several hours without any panic/crash I decided to try the same with Varnish 5.1.
Once I adapted these parameters, followed by a systemctl daemon-reload and a restart of Varnish, I haven't seen any Varnish panics/crashes for several days.
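
To confirm that the new values are really active after the restart, and to keep an eye on workspace pressure, varnishadm and varnishstat can be queried (counter names as found in Varnish 4.1/5.1; add -n if you run a named instance):

# show the effective workspace parameters
varnishadm param.show workspace_client
varnishadm param.show workspace_backend

# workspace overflow counters - these should stay at 0 after the increase
varnishstat -1 -f MAIN.ws_client_overflow -f MAIN.ws_backend_overflow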

 

Linux Mint Dual Monitor Setup: Screens dark after login in Cinnamon
Friday - Sep 15th 2017

I re-organized my workspace today and switched the order of the two monitor screens. After I attached my notebook, running Linux Mint 17.3, to the docking station connected to the two physical screens, everything looked as it should - at first.

The logon screen appeared, I selected my user and logged in. Then both screens went dark. The funny part: when I plugged the screens in the old way (switched DVI port 1 with port 2), the screens re-appeared.

When I detached the notebook from the docking station and rebooted it, I was able to log on and saw my Cinnamon desktop. Then I plugged it into the docking station again and now at least the primary screen got enabled - but the right-hand screen stayed dark.

Finally I was able to get both screens working again AND keep the primary screen as I wanted it, by using xrandr.

xrandr without options shows what screens are detected and usable:

$ xrandr
Screen 0: minimum 8 x 8, current 3840 x 1200, maximum 32767 x 32767
eDP1 connected (normal left inverted right x axis y axis)
   1920x1080      60.0 +   59.9 
   1680x1050      60.0     59.9 
   1600x1024      60.2 
   1400x1050      60.0 
   1280x1024      60.0 
   1440x900       59.9 
   1280x960       60.0 
   1360x768       59.8     60.0 
   1152x864       60.0 
   1024x768       60.0 
   800x600        60.3     56.2 
   640x480        59.9 
DP1 disconnected (normal left inverted right x axis y axis)
DP1-1 connected primary 1920x1200+0+0 (normal left inverted right x axis y axis) 518mm x 324mm
   1920x1200      60.0*+
   1920x1080      60.0 
   1600x1200      60.0 
   1680x1050      59.9 
   1280x1024      60.0 
   1280x960       60.0 
   1024x768       60.0 
   800x600        60.3 
   640x480        60.0 
   720x400        70.1 
DP1-2 connected 1920x1200+1920+0 (normal left inverted right x axis y axis) 518mm x 324mm
   1920x1200      60.0 +
   1920x1080      60.0 
   1600x1200      60.0 
   1680x1050      59.9 
   1280x1024      60.0 
   1280x960       60.0 
   1024x768       60.0 
   800x600        60.3 
   640x480        60.0 
   720x400        70.1 
DP1-3 disconnected (normal left inverted right x axis y axis)
HDMI1 disconnected (normal left inverted right x axis y axis)
HDMI2 disconnected (normal left inverted right x axis y axis)
VIRTUAL1 disconnected (normal left inverted right x axis y axis)

eDP1 is the notebook's own screen.
DP1-1 is the first DVI port, to which I connected my left screen, which I want to be the primary. Note the asterisk (*) behind the 60 Hz rate: it means this output is enabled.
DP1-2 is the second DVI port, to which I connected my right screen. This output is obviously disabled (missing asterisk).
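
To quickly list only the connected outputs without the full mode tables, xrandr's output can simply be filtered (the leading space in the pattern excludes "disconnected"):

$ xrandr --query | grep ' connected'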

So now that I knew the "names" of the outputs, I could tell xrandr which output to enable or disable, and with which refresh rate, resolution, etc. This is the command I used:

$ xrandr --output DP1-1 --auto --primary --rotate normal --output DP1-2 --auto --right-of DP1-1 --output eDP1 --off

And the second (right-hand) screen switched on!

Now, in order for Linux Mint to use these display settings at the next logon, I opened "Display" and clicked Apply, followed by "Keep this configuration".

Linux Mint Cinnamon Display Settings 

After a reboot (the notebook's lid still closed and plugged into the docking station), both screens were working again as they should.

 

Increase JVM Heap Size Memory for ElasticSearch
Thursday - Sep 14th 2017

After increasing the usage of the ELK stack in the last few days, I noticed more and more lag and even timeouts in searches. I found the bottleneck in ElasticSearch itself: the available memory for ElasticSearch was limited to 2GB (probably the default).

Elasticsearch memory

To increase the JVM heap size for ElasticSearch (remember, ES is a Java application), the preferred way to define the heap sizes is the jvm.options file. This file is located in ELASTICSEARCH_CONF_DIR, which in a standard installation means /etc/elasticsearch/:

root@elkes01:~# cat /etc/elasticsearch/jvm.options |egrep "(Xms|Xmx)"
## -Xms4g
## -Xmx4g
# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space
-Xms12g
-Xmx12g

Important: Both the initial (Xms) and maximum (Xmx) sizes must be set to the same value, or you will get an error like this and ES won't start:

[1]: initial heap size [4294967296] not equal to maximum heap size [12884901888]; this can cause resize pauses and prevents mlockall from locking the entire heap

After adapting the heap size values, ElasticSearch must be restarted. If you follow the log file, it will confirm the new values:

[2017-09-14T09:28:00,021][INFO ][o.e.n.Node               ] [elkes01] initializing ...
[2017-09-14T09:28:00,118][INFO ][o.e.e.NodeEnvironment    ] [elkes01] using [1] data paths, mounts [[/var/lib/elasticsearch (/dev/mapper/vges-lves)]], net usable_space [819.5gb], net total_space [1007.8gb], spins? [possibly], types [ext4]
[2017-09-14T09:28:00,118][INFO ][o.e.e.NodeEnvironment    ] [elkes01] heap size [11.9gb], compressed ordinary object pointers [true]
[2017-09-14T09:28:00,531][INFO ][o.e.n.Node               ] [elkes01] node name [elkes01], node ID [t3GAvhY1SS2xZkt4U389jw]
[2017-09-14T09:28:00,531][INFO ][o.e.n.Node               ] [elkes01] version[5.6.0], pid[9390], build[781a835/2017-09-07T03:09:58.087Z], OS[Linux/4.4.0-83-generic/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_144/25.144-b01]
[2017-09-14T09:28:00,531][INFO ][o.e.n.Node               ] [elkes01] JVM arguments [-Xms4g, -Xmx12g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -Des.path.home=/usr/share/elasticsearch]
[2017-09-14T09:28:01,212][INFO ][o.e.p.PluginsService     ] [elkes01] loaded module [aggs-matrix-stats]
[2017-09-14T09:28:01,212][INFO ][o.e.p.PluginsService     ] [elkes01] loaded module [ingest-common]
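
Apart from the log file, the active heap size can also be verified through the API at any time (assuming the default HTTP port 9200 on localhost):

# heap.max should now show roughly 12gb for the node
curl -s 'http://localhost:9200/_cat/nodes?h=name,heap.max'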

Searches and also visualizations (which in the end run searches, too) are now much faster in Kibana!

 

