
How to setup collaborative (parallel) editing in Atlassian Confluence
Friday - Aug 18th 2017 - by - (0 comments)

It's been a while since I focused on Confluence (see the article "Tackling Confluence migration issues (Windows+MSSQL to Linux+MariaDB)" from March 2017). Now it was time to upgrade the migrated Confluence (5.7) to a newer version (6.3.3 as of this writing).
Since Confluence version 6 it is possible to edit a document (page) collaboratively. This means users Dave and Eddy can work on the same page at the same time and even see each other's changes in real time. Sounds pretty cool, right?

In order to use this new feature, which is technically called "Synchrony", some changes must be made to your Confluence installation. Check out the official documentation, too.

  • The Context path of Tomcat must be changed from "" to "/confluence".
  • Due to the changed Context path, the Confluence URL changes (append /confluence).
  • Due to the changed Confluence URL, the base URL must be changed in the Confluence settings (can be done in UI).
  • Next to the main Tomcat process, a second process will be started by Confluence. This second process listens on TCP port 8091 (so make sure nothing else uses that port on that machine!).
  • Requests to /confluence have to go to the main Tomcat (default listener tcp port 8090).
  • Requests to /synchrony have to go to the second process (port 8091).
  • Requests to /synchrony need additional http headers.
  • Synchrony does not speak SSL/HTTPS. Therefore, if you use https, you must go through a reverse proxy.

Putting all this information together, let's go through the required changes step by step.

--------------------------------- ---------------------------------

1. Change the Context path

To separate "normal" Confluence requests from "dynamic" Synchrony requests (which use web sockets), the paths must be split into /confluence and /synchrony. The main Tomcat server (which runs Confluence) must be served under /confluence. To do that, edit server.xml (default path: /opt/atlassian/confluence/conf/server.xml) and adapt the Context path (right after the Host snippet):

<Context path="/confluence" docBase="../confluence" debug="0" reloadable="false" useHttpOnly="true">
  <!-- Logger is deprecated in Tomcat 5.5. Logging configuration for Confluence is specified in confluence/WEB-INF/classes/log4j.properties -->
  <Manager pathname="" />
  <Valve className="org.apache.catalina.valves.StuckThreadDetectionValve" threshold="60" />
</Context>

Oh, and while we're at it: I'm pretty sure you run a reverse proxy in front of Confluence like I do, so don't forget to add the proxy parameters to your Connector snippet (also in server.xml, at the top):

        <Connector port="8090" connectionTimeout="20000" redirectPort="8443"
                maxThreads="200" minSpareThreads="10"
                enableLookups="false" acceptCount="10" debug="0" URIEncoding="UTF-8"
                proxyName="wiki-test.example.com" proxyPort="443" scheme="https" secure="true" />

Restart Confluence after the changes:

root@inf-wiki01-t:~# /etc/init.d/confluence restart
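
Once Confluence is back up, a quick local sanity check of the new context path might look like this (just a sketch; any HTTP response, even a redirect to the login page, shows that the /confluence path is served):

root@inf-wiki01-t:~# curl -sI http://localhost:8090/confluence/ | head -n 1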

--------------------------------- ---------------------------------

2. Adapt Reverse Proxy config

As I wrote before, there are now two paths to consider: /confluence and /synchrony. This must be handled in the Reverse Proxy; the upstreams for these paths are now on different ports.
Confluence's Base URL has also changed due to the adapted Context path, so I turned the "location /" into a general rewrite (redirect) that prepends the path "/confluence". This is especially helpful for direct links/bookmarks still using the old Base URL.
I mentioned additional http headers before. You can see them in the "location /synchrony".
Here's my full Reverse Proxy config (obviously Nginx):

server {
  listen 80;
  server_name wiki-test.example.com;
  access_log /var/log/nginx/wiki-test.example.com.access.log;
  error_log /var/log/nginx/wiki-test.example.com.error.log;

  location / {
    rewrite ^(.*) https://wiki-test.example.com$1 permanent;
  }
}

server {
  listen 443;
  server_name wiki-test.example.com;
  access_log /var/log/nginx/wiki-test.example.com.access.log;
  error_log /var/log/nginx/wiki-test.example.com.error.log;

  ssl on;
  ssl_certificate /etc/nginx/ssl.crt/wildcard.example.com.crt;
  ssl_certificate_key /etc/nginx/ssl.key/wildcard.example.com.key;
  ssl_session_timeout 5m;
  ssl_prefer_server_ciphers on;
  ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
  ssl_ciphers ALL:!aNULL:!eNULL:!LOW:!EXP:!RC4:!3DES:+HIGH:+MEDIUM;
  ssl_dhparam /etc/ssl/private/dh2048.pem;

  error_page   500 502 503 504  /50x.html;
  error_page   403  /403.html;
  error_page   404  /404.html;

  location = /50x.html {
      root   /usr/share/nginx/html;
  }

  location = /403.html {
      root   /usr/share/nginx/html;
  }

  location = /404.html {
      root   /usr/share/nginx/html;
  }

  location / {
    rewrite ^(.*) https://wiki-test.example.com/confluence$1 permanent;
  }

  location /confluence {
    include /etc/nginx/proxy.conf;
    proxy_pass http://inf-wiki01-t.example.com:8090/confluence;
  }

  location /synchrony {
    include /etc/nginx/proxy.conf;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "Upgrade";
    proxy_pass http://inf-wiki01-t.example.com:8091/synchrony;
  }
}

For the sake of completeness, here's the proxy.conf as well:

proxy_redirect off;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
client_max_body_size 200m;
client_body_buffer_size 128k;
proxy_connect_timeout 6000;
proxy_send_timeout 6000;
proxy_read_timeout 6000;
proxy_buffer_size 4k;
proxy_buffers 4 32k;
proxy_busy_buffers_size 64k;
proxy_temp_file_write_size 64k;
send_timeout 6000;
proxy_buffering off;
proxy_next_upstream error;

Needless to say, these changes require an Nginx reload.
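
For example, a combined config test and reload could look like this (a sketch assuming a systemd-based host; otherwise use your init script or "service nginx reload"):

# nginx -t && systemctl reload nginx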

--------------------------------- ---------------------------------

Now that the necessary changes are done, let's try this out!

On the systems running Confluence, the successful start of Synchrony can be checked in the Tomcat log files right after starting Confluence:

2017-08-18 11:55:41,195 INFO [ListenableFutureAdapter-thread-1] [plugins.synchrony.bootstrap.DefaultSynchronyProcessManager] updateSynchronyConfiguration Synchrony Internal Service URL: http://127.0.0.1:8091/synchrony/v1

The listener on port 8091 is then up:

root@inf-wiki01-t:~# netstat -lntup | grep java
tcp6       0      0 :::8090                 :::*                    LISTEN      13608/java     
tcp6       0      0 :::8091                 :::*                    LISTEN      14346/java     
tcp6       0      0 127.0.0.1:8000          :::*                    LISTEN      13608/java  

As you can already see in the netstat output, the process ID differs from the one of the main Tomcat process. In detail:

conflue+ 13608 14.7 30.7 8425044 1878852 ?     Sl   13:53   9:32 /opt/atlassian/confluence/jre//bin/java -Djava.util.logging.config.file=/opt/atlassian/confluence/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.catalina.webresources -Dconfluence.context.path=/confluence -Datlassian.plugins.startup.options= -Dorg.apache.tomcat.websocket.DEFAULT_BUFFER_SIZE=32768 -Dsynchrony.enable.xhr.fallback=true -Xms1024m -Xmx4096m -XX:+UseG1GC -Datlassian.plugins.enable.wait=300 -Djava.awt.headless=true -XX:G1ReservePercent=20 -Xloggc:/opt/atlassian/confluence/logs/gc-2017-08-18_13-53-42.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=2M -XX:-PrintGCDetails -XX:+PrintGCDateStamps -XX:-PrintTenuringDistribution -Djava.endorsed.dirs=/opt/atlassian/confluence/endorsed -classpath /opt/atlassian/confluence/bin/bootstrap.jar:/opt/atlassian/confluence/bin/tomcat-juli.
conflue+ 14346  1.8 11.7 4704556 715768 ?      Sl   13:57   1:08  \_ /opt/atlassian/confluence/jre/bin/java -classpath /opt/atlassian/confluence/temp/1.0.0-release-confluence_6.1-78073294.jar:/opt/atlassian/confluence/lib/mysql-connector-java-5.1.41-bin.jar -Xss2048k -Xmx1g synchrony.core sql
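
Beyond netstat, the internal service URL from the log line above can be queried directly; a quick sketch (any HTTP status code returned here already proves that the Synchrony listener answers):

root@inf-wiki01-t:~# curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:8091/synchrony/v1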

Now, logging into Confluence in the browser, the first thing you notice is a warning about the changed Base URL:

[Screenshot: Confluence Base URL changed after Synchrony]

Obviously, the Base URL needs to be changed (append /confluence).

In the administration area, there's a new page called "Collaborative editing". Everything on that page should be green and "Running"; otherwise you have to start troubleshooting.

[Screenshot: Confluence Synchrony status]

Note: During my first try I got errors on that page saying "The Synchrony service is stopped or has errors". This was because I had previously run the main Tomcat with two listeners (8090 for proxied connections, 8091 for non-proxied). After disabling the second Tomcat connector/listener and therefore freeing up port 8091 the errors were gone.

Time to test collaborative editing! I opened a page and saw a new icon in the top right corner:

[Screenshot: Confluence collaborative editing]

That's obviously me. The plus icon allows inviting additional users to work with you on the same page. Shortly after that, my colleague arrived on the page and I saw his avatar appear.

[Screenshot: Confluence collaborative editing]

So we started working. I was able to see in real time which part of the document my colleague was in (indicated by a colored flag) and what he had added so far:

[Screenshot: Confluence collaborative editing]

Yes! This is pretty cool and it's working! Thumbs up to Atlassian!

The only negative point is that Confluence's URL changes, as mentioned before. But with the automatic redirect in the reverse proxy, this can be handled properly. Maybe Atlassian will one day merge the main Confluence process and Synchrony so that a Base URL change is no longer necessary.

 

Install and configure Elastic Filebeat through Ansible
Friday - Aug 11th 2017 - by - (0 comments)

While building an ELK stack for centralized logging and log visualization, I also came across Filebeat. Filebeat is basically a log parser and shipper that runs as a daemon on the client.

So far the first tests using Nginx access logs were quite successful.

Now I wanted to go one step further and automatically deploy Filebeat through an Ansible playbook. This playbook should also be used to automatically configure the "logs to be followed", called "prospectors" in Filebeat terminology.

Well, the following playbook does it. With the current code, it checks whether Nginx and/or HAProxy are installed on the target machine and automatically configures the prospectors and, of course, the output (a Logstash receiver in my setup):

$ cat /srv/ansible/playbooks/filebeat/filebeat.yml
- name: ANSIBLE - Filebeat installation and configuration by www.claudiokuenzler.com
  hosts: '{{ target }}'
  roles:
    - yaegashi.blockinfile

  sudo: yes

  tasks:

  - name: APT - Add elastic.co key
    apt_key: url="https://artifacts.elastic.co/GPG-KEY-elasticsearch"
    when: ansible_distribution == "Ubuntu"

  - name: APT - Add elastic.co repository
    apt_repository: repo="deb https://artifacts.elastic.co/packages/5.x/apt stable main" filename="elastic-5.x" update_cache=yes
    when: ansible_distribution == "Ubuntu"

  - name: FILEBEAT - Install Filebeat
    apt: pkg=filebeat
    when: ansible_distribution == "Ubuntu"
 
  - name: FILEBEAT - Copy base filebeat config file
    copy: src=/srv/ansible/setup-files/filebeat/filebeat.yml dest=/etc/filebeat/filebeat.yml

  - name: FILEBEAT - Set shipper name
    lineinfile: "dest=/etc/filebeat/filebeat.yml state=present regexp='^name:' line='name: {{ ansible_hostname }}' insertafter='# Shipper Name'"
 
  - name: FILEBEAT - Configure Logstash output
    blockinfile:
      dest: /etc/filebeat/filebeat.yml
      insertafter: '# Logstash output'
      marker: "# {mark} -- Logstash output configured by Ansible"
      block: |
        output.logstash:
          hosts: ["logstashreceiver.example.com:5044"]


  - name: FILEBEAT - Check if Nginx is installed
    command: dpkg -l nginx
    register: nginxinstalled
    # without ignore_errors the play would abort on hosts where nginx is absent (dpkg -l returns rc != 0)
    ignore_errors: yes

  - name: FILEBEAT - Configure Nginx Logging
    blockinfile:
      dest: /etc/filebeat/filebeat.yml
      insertafter: 'filebeat.prospectors:'
      marker: "# {mark} -- Nginx logging configured by Ansible"
      block: |
        - input_type: log
          paths:
            - /var/log/nginx/*.log
          document_type: nginx-access
    when: nginxinstalled.rc == 0

  - name: FILEBEAT - Check if HAProxy is installed
    command: dpkg -l haproxy
    register: haproxyinstalled
    # same as above: don't abort on hosts without haproxy
    ignore_errors: yes

  - name: FILEBEAT - Configure HAProxy Logging
    blockinfile:
      dest: /etc/filebeat/filebeat.yml
      insertafter: 'filebeat.prospectors:'
      marker: "# {mark} -- HAProxy logging configured by Ansible"
      block: |
        - input_type: log
          paths:
            - /var/log/haproxy.log
          document_type: haproxy
    when: haproxyinstalled.rc == 0

  - name: FILEBEAT - Restart filebeat
    service: name=filebeat state=restarted

Of course this only works when the correct "template" is used (see "FILEBEAT - Copy base filebeat config file"). This is a minimal config file prepared with the comment lines the playbook uses as insertion anchors (insertafter):

$ cat /srv/ansible/setup-files/filebeat/filebeat.yml
#=========================== Filebeat prospectors =============================

filebeat.prospectors:

#================================ General =====================================

# Shipper Name

#================================ Outputs =====================================

# Logstash output

#================================ Logging =====================================

I let the playbook run on a test machine (which has Nginx and HAProxy installed):

ansible-playbook playbooks/filebeat.yaml --extra-vars "target=testmachine"
[DEPRECATION WARNING]: Instead of sudo/sudo_user, use become/become_user and make sure become_method is 'sudo'
(default).
This feature will be removed in a future release. Deprecation warnings can be disabled by setting
deprecation_warnings=False in ansible.cfg.

PLAY [ANSIBLE - Filebeat installation and configuration] ***********************

TASK [setup] *******************************************************************
ok: [testmachine]

TASK [APT - Add elastic.co key] ************************************************
ok: [testmachine]

TASK [APT - Add elastic.co repository] *****************************************
ok: [testmachine]

TASK [FILEBEAT - Install Filebeat] *********************************************
ok: [testmachine]

TASK [FILEBEAT - Copy base filebeat config file] *******************************
changed: [testmachine]

TASK [FILEBEAT - Set shipper name] *********************************************
changed: [testmachine]

TASK [FILEBEAT - Configure Logstash output] ****************
skipping: [testmachine]

TASK [FILEBEAT - Check if Nginx is installed] **********************************
changed: [testmachine]

TASK [FILEBEAT - Configure Nginx Logging] **************************************
changed: [testmachine]

TASK [FILEBEAT - Check if HAProxy is installed] ********************************
changed: [testmachine]

TASK [FILEBEAT - Configure HAProxy Logging] ************************************
changed: [testmachine]

TASK [FILEBEAT - Restart filebeat] *********************************************
changed: [testmachine]

PLAY RECAP *********************************************************************
testmachine     : ok=13   changed=8    unreachable=0    failed=0  

On the testmachine itself, the Filebeat config was correctly set:

root@testmachine:~# cat /etc/filebeat/filebeat.yml
#=========================== Filebeat prospectors =============================

filebeat.prospectors:
# BEGIN -- HAProxy logging configured by Ansible
- input_type: log
  paths:
    - /var/log/haproxy.log
  document_type: haproxy
# END -- HAProxy logging configured by Ansible
# BEGIN -- Nginx logging configured by Ansible
- input_type: log
  paths:
    - /var/log/nginx/*.log
  document_type: nginx-access
# END -- Nginx logging configured by Ansible

#================================ General =====================================

# Shipper Name
name: testmachine

#================================ Outputs =====================================

# Logstash output
# BEGIN -- Logstash output configured by Ansible
output.logstash:
  hosts: ["logstashreceiver.example.com:5044"]
# END -- Logstash output configured by Ansible

#================================ Logging =====================================
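
If you want to make sure that blockinfile produced syntactically valid YAML, a quick sanity check is possible (a sketch; it assumes Python with PyYAML is available on the host):

root@testmachine:~# python -c 'import yaml; yaml.safe_load(open("/etc/filebeat/filebeat.yml"))' && echo "YAML OK"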

Now that Logstash receives these logs and adds them to Elasticsearch, I am able to see them in Kibana:

[Screenshot: Filebeat logs in Kibana]


 

HTTP POST benchmarking / stress-testing an API behind HAProxy with siege
Thursday - Aug 3rd 2017 - by - (0 comments)

I was looking for a way to stress-test a SOAP API running behind a HAProxy load balancer using HTTP POST. For my usual stress-testing scenarios (using GET) I've been relying on "wrk" and "httperf" for years. But this week I came across something else: siege.

According to the description, siege is:

HTTP regression testing and benchmarking utility
 Siege is an regression test and benchmark utility. It can stress test a single
 URL with a user defined number of simulated users, or it can read many URLs
 into memory and stress them simultaneously. The program reports the total
 number of hits recorded, bytes transferred, response time, concurrency, and
 return status. Siege supports HTTP/1.0 and 1.1 protocols, the GET and POST
 directives, cookies, transaction logging, and basic authentication. Its
 features are configurable on a per user basis.

Installation in Debian and Ubuntu is very easy as siege is already part of the standard repositories:

$ sudo apt-get install siege

To run a benchmarking (stress-) test using POST data, I used the following command:

# siege -b --concurrent=10 --content-type="text/xml;charset=utf-8" -H "Soapaction: ''" 'http://localhost:18382/ POST < /tmp/xxx' -t 10S

So let's break that down:

-b: Run a benchmark/stress test (no delays between requests, unlike the behaviour of a normal Internet user)
--concurrent=n: Simulate n concurrent users; here I chose 10 concurrent users
--content-type: Define the content type. Can be pretty important when testing a POST (to send data in the correct format)
-H: Additional HTTP headers can be sent by using -H (multiple times)
-t: For how long do you want to run siege? Here I chose 10 seconds.

Watch out for the destination URL syntax: it must be in single quotes and also contain the request method:

'URL[:port][URI] METHOD'

Additionally, I directly used the data stored in the file /tmp/xxx as the POST body:

$ cat /tmp/xxx
<?xml version='1.0' encoding='UTF-8'?><note><to>Internet</to><from>Claudio Kuenzler</from><heading>Hello world</heading><body>This is a test.</body></note>

This can of course be in any format you want (for example JSON), as long as the destination API correctly handles the data.
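
For illustration, a hypothetical JSON variant of the same call could look like this (the /api endpoint and the payload file are placeholders, not part of my setup):

# siege -b --concurrent=10 --content-type="application/json" 'http://localhost:18382/api POST < /tmp/payload.json' -t 10S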

The siege command above prints statistics after it has finished its run:

# siege -b --concurrent=10 --content-type="text/xml;charset=utf-8" -H "Soapaction: ''" 'http://localhost:18382/ POST < /tmp/xxx' -t 10S
** SIEGE 3.0.8
** Preparing 10 concurrent users for battle.
The server is now under siege...
Lifting the server siege...      done.

Transactions:                 224 hits
Availability:              100.00 %
Elapsed time:                9.60 secs
Data transferred:            0.10 MB
Response time:                0.23 secs
Transaction rate:           23.33 trans/sec
Throughput:                0.01 MB/sec
Concurrency:                5.41
Successful transactions:         224
Failed transactions:               0
Longest transaction:            4.91
Shortest transaction:            0.00
 
FILE: /var/log/siege.log
You can disable this annoying message by editing
the .siegerc file in your home directory; change
the directive 'show-logfile' to false.

Most of them are pretty self-explanatory (otherwise consult the man page, which is well documented), but just to add some notes:

Transactions: siege was able to hit the target 224 times (HTTP responses below 500)
Elapsed time: The test ran for 9.6 seconds
Successful transactions: Of all transactions, 224 were successful
Availability: ... ergo resulting in 100% availability

I mentioned before that I was testing a SOAP API balanced through a HAProxy loadbalancer (listening on port 18382 as you probably noticed).
In this particular setup the SOAP servers can only handle a maximum of 6 concurrent connections each. In HAProxy I therefore set the maxconn value for each backend server to "6", with a minimal queue and a very low queuing timeout (I don't want requests to pile up in a queue; instead HAProxy should return an error). The HAProxy backend runs with 16 servers, each allowing 6 concurrent connections, which makes a total of 96 possible concurrent connections (16 x 6). Let's try siege with 100 concurrent users; there should be some failed requests.

# siege -b --concurrent=100 --content-type="text/xml;charset=utf-8" -H "Soapaction: ''" 'http://localhost:18382/ POST < /tmp/xxx' -t 10S
** SIEGE 3.0.8
** Preparing 100 concurrent users for battle.
The server is now under siege...
Lifting the server siege...      done.

Transactions:                 224 hits
Availability:               39.16 %
Elapsed time:                9.35 secs
Data transferred:            0.13 MB
Response time:                1.52 secs
Transaction rate:           23.96 trans/sec
Throughput:                0.01 MB/sec
Concurrency:               36.34
Successful transactions:         224
Failed transactions:             348
Longest transaction:            5.09
Shortest transaction:            0.00

As in the previous test, siege was able to hit the target 224 times.
However, this time there are 348 failed and only 224 successful transactions, resulting in an availability of 39.16%.

And now the same test with the maximum of 96 concurrent connections:

# siege -b --concurrent=96 --content-type="text/xml;charset=utf-8" -H "Soapaction: ''" 'http://localhost:18382/ POST < /tmp/xxx' -t 10S
** SIEGE 3.0.8
** Preparing 96 concurrent users for battle.
The server is now under siege...
Lifting the server siege...      done.

Transactions:                 224 hits
Availability:              100.00 %
Elapsed time:                9.38 secs
Data transferred:            0.10 MB
Response time:                1.52 secs
Transaction rate:           23.88 trans/sec
Throughput:                0.01 MB/sec
Concurrency:               36.20
Successful transactions:         224
Failed transactions:               0
Longest transaction:            5.09
Shortest transaction:            0.00

100% success rate here. This shows that with exactly 96 concurrent users, siege stays within what HAProxy allows.
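
For reference, the HAProxy directives behind this limit look roughly like the following sketch (backend and server names/addresses are placeholders; the real backend has 16 such server lines):

backend soap_backend
    timeout queue 1s
    server soap01 10.0.0.1:8080 check maxconn 6 maxqueue 1
    server soap02 10.0.0.2:8080 check maxconn 6 maxqueue 1
    [...]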

By the way: siege also allows comparing previous test results by checking the log file it writes:

# cat /var/log/siege.log
Date & Time,  Trans,  Elap Time,  Data Trans,  Resp Time,  Trans Rate,  Throughput,  Concurrent,    OKAY,   Failed
2017-08-03 08:56:46,    224,       9.19,           0,       1.52,       24.37,        0.00,       36.95,     224,     340
2017-08-03 09:04:01,    224,       9.35,           0,       1.52,       23.96,        0.00,       36.34,     224,     348
2017-08-03 09:05:24,    224,       9.03,           0,       1.58,       24.81,        0.00,       39.19,     224,     336
2017-08-03 09:05:39,    224,       9.23,           0,       1.50,       24.27,        0.00,       36.50,     224,     329
2017-08-03 09:06:55,    224,       9.49,           0,       1.45,       23.60,        0.00,       34.23,     224,       0
2017-08-03 09:08:03,    224,       9.38,           0,       1.52,       23.88,        0.00,       36.20,     224,       0

Hands down a very good tool for HTTP benchmarking/stress-testing, and not just for POST requests.

 

check_mssql_health: Monitoring MSSQL with SQL query and negative thresholds
Friday - Jul 28th 2017 - by - (0 comments)

When monitoring an MSSQL server, my first choice is the monitoring plugin check_mssql_health by Gerhard Lausser. Thanks to its different modes, the most important checks are already pre-defined and ready to use.

The plugin also supports running SQL queries on the target MSSQL server. This is exactly what I needed in order to collect performance data from a specially created database (which itself collects statistics from the SQL instance).

But first the SQL query needs to be encoded so that check_mssql_health can understand it:

$ echo 'SELECT TOP (1) [buffer_cache_hit] FROM [PerformanceDB].dbo.tblPerformanceSampling ORDER BY [timestamp] DESC;' | /usr/lib/nagios/plugins/check_mssql_health --mode encode
SELECT%20TOP%20%281%29%20%5Bbuffer%5Fcache%5Fhit%5D%20FROM%20%5BPerformanceDB%5D%2Edbo%2EtblPerformanceSampling%20ORDER%20BY%20%5Btimestamp%5D%20DESC%3B

Now the plugin can be run with mode "sql", passing the encoded SQL query as the "name" parameter:

$ /usr/lib/nagios/plugins/check_mssql_health --server="mssqlserver\instance001" --port=1433 --username="sql_monitoring" --password=supersecret --mode=sql --name="SELECT%20TOP%20%281%29%20%5Bbuffer%5Fcache%5Fhit%5D%20FROM%20%5BPerformanceDB%5D%2Edbo%2EtblPerformanceSampling%20ORDER%20BY%20%5Btimestamp%5D%20DESC%3B" --commit --warning=99 --critical=98
CRITICAL - select top (1) [buffer_cache_hit] from [PerformanceDB].dbo.tblperformancesampling order by [timestamp] desc;: 100.000000 | 'select'=100.00;99;98;;

So far so good. The SQL query worked and returned a result from the database: 100. But as this is the buffer_cache_hit value, everything below 100 is not good.

Note: The plugin allows checking the buffer hit ratio directly (without having to craft SQL queries) using "--mode mem-pool-data-buffer-hit-ratio". But in this particular scenario, the DBA didn't want to grant my monitoring user the needed privileges on the MSSQL server... Don't ask for details.

There is, however, a possibility to tell the plugin to use "negative thresholds" by appending colons to the threshold values (source):

$ /usr/lib/nagios/plugins/check_mssql_health --server="mssqlserver\instance001" --port=1433 --username="sql_monitoring" --password=supersecret --mode=sql --name="SELECT%20TOP%20%281%29%20%5Bbuffer%5Fcache%5Fhit%5D%20FROM%20%5BPerformanceDB%5D%2Edbo%2EtblPerformanceSampling%20ORDER%20BY%20%5Btimestamp%5D%20DESC%3B" --commit --warning=99: --critical=98:
OK - select top (1) [buffer_cache_hit] from [PerformanceDB].dbo.tblperformancesampling order by [timestamp] desc;: 100.000000 | 'select'=100.00;99:;98:;;

Now the plugin will return a WARNING if the result from the SQL query drops below 99 and a CRITICAL if it drops below 98.
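
As a quick reference, this follows the usual Nagios/monitoring plugin range convention:

--warning=99     alert if the value is above 99 or negative (OK range 0..99)
--warning=99:    alert if the value is below 99 (OK range 99..infinity)
--warning=98:99  alert if the value is outside 98..99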

 

How to monitor a PostgreSQL replication
Wednesday - Jul 26th 2017 - by - (2 comments)

There are multiple ways of monitoring a working master-slave replication on PostgreSQL servers.

Using PSQL

First of all, there is of course the replication status, which can be read directly on the master PostgreSQL server:

postgres@dbmaster:~$ psql -x -c "select * from pg_stat_replication;"
-[ RECORD 1 ]----+------------------------------
pid              | 13014
usesysid         | 16387
usename          | replica
application_name | dbslave
client_addr      | 10.10.10.11
client_hostname  |
client_port      | 48596
backend_start    | 2017-07-26 13:07:00.617621+00
backend_xmin     |
state            | streaming
sent_location    | 0/6000290
write_location   | 0/6000290
flush_location   | 0/6000290
replay_location  | 0/6000290
sync_priority    | 1
sync_state       | sync

This information can only be read on the master. If you try that on the slave (hot_standby = on), you don't get to see anything:

postgres@dbslave:~$ psql -x -c "select * from pg_stat_replication;"
(0 rows)

Obviously the most important information here is the sync_state:

postgres@dbmaster:~$ psql -x -c "select sync_state from pg_stat_replication;"
-[ RECORD 1 ]----
sync_state | sync

Possible values of sync_state are:

  • async: This standby server is asynchronous -> CRITICAL!
  • potential: This standby server is asynchronous, but can potentially become synchronous if one of current synchronous ones fails -> WARNING
  • sync: This standby server is synchronous -> OK
  • quorum: This standby server is considered as a candidate for quorum standbys -> OK

Other important values are the different "locations":

sent_location    | 0/6000290
write_location   | 0/6000290
flush_location   | 0/6000290
replay_location  | 0/6000290

From the documentation:

  • sent_location: Last write-ahead log location sent on this connection
  • write_location: Last write-ahead log location written to disk by this standby server
  • flush_location: Last write-ahead log location flushed to disk by this standby server
  • replay_location: Last write-ahead log location replayed into the database on this standby server

This basically shows how far the slave server has caught up. If all values are the same, it is 100% in sync.
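
On pre-10 PostgreSQL versions (which use the xlog naming shown above), this delta can also be expressed in bytes directly on the master; a sketch:

postgres@dbmaster:~$ psql -x -c "select application_name, pg_xlog_location_diff(sent_location, replay_location) as replay_lag_bytes from pg_stat_replication;"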

Using monitoring plugin check_postgres

The monitoring plugin check_postgres also features a replication check (hot_standby_delay). The trick is to correctly understand this check. Using the hot_standby_delay check, the plugin connects to both the master and slave and compares the replay delay and receive delay to the given warning and critical thresholds. In order to connect to both the master and the slave, the pg_hba.conf must be adapted accordingly.

On the master (IP 10.10.10.10) I added the following lines:

# Monitoring
host    all             monitoring      127.0.0.1/32          md5
host    all             monitoring      10.10.10.11/32        md5

On the slave (IP 10.10.10.11) I added the following lines:

# Monitoring
host    all             monitoring      127.0.0.1/32          md5

The plugin will be executed on the slave server, so there the monitoring entry for localhost is enough.

To avoid passing the DB password to the plugin on the command line (it would show up in cleartext in the process list), I created a .pgpass file for the nagios user (under which this plugin runs). This file contains two entries: one for the localhost connection and one for the remote connection to the master server:

nagios@dbslave:~$ whoami
nagios

nagios@dbslave:~$ ls -la .pgpass
-rw------- 1 nagios nagios 94 Jul 26 15:25 .pgpass

nagios@dbslave:~$ cat .pgpass
localhost:5432:*:monitoring:mysupersecretpassword
dbmaster:5432:*:monitoring:mysupersecretpassword

Make sure the .pgpass file has correct permissions (chmod 0600), otherwise it won't be used for psql commands!

Now the plugin can be executed with the hot_standby_delay check:

nagios@dbslave:~$ /usr/lib/nagios/plugins/check_postgres.pl -H localhost,dbmaster -u monitoring -db mydb --action hot_standby_delay --warning 60 --critical 600
POSTGRES_HOT_STANDBY_DELAY OK: DB "mydb" (host:localhost) 0 and 432 seconds | time=0.05s replay_delay=0;60;600  receive-delay=0;60;600 time_delay=432;

Note that the -H parameter takes two hostnames. The plugin connects to both localhost and the dbmaster host using the SQL user "monitoring" (the password is automatically read from the .pgpass file). I set the warning threshold to a delay of 60 seconds and the critical threshold to 600 seconds (10 minutes).

 

check_nwc_health: rumms - UNKNOWN no interfaces
Thursday - Jul 20th 2017 - by - (0 comments)

Today I had to solve a special case where an Icinga 2 satellite server ran out of disk space in /var. After I increased the disk size I noticed that almost all network switches, checked via this satellite using check_nwc_health, returned an UNKNOWN status. Service output: rumms. 

[Screenshot: check_nwc_health rumms]

I manually verified this on the CLI:

# /usr/lib/nagios/plugins/check_nwc_health --hostname aswitch --community public --mode interface-usage --name Ethernet1/1
rumms
UNKNOWN - no interfaces

I manually re-listed all interfaces:

# /usr/lib/nagios/plugins/check_nwc_health --hostname aswitch --community public --mode list-interfaces
83886080 mgmt0
151060482 Vlan2
[...]
526649088 Ethernet101/1/29
526649152 Ethernet101/1/30
526649216 Ethernet101/1/31
526649280 Ethernet101/1/32
OK - have fun

And then the check worked again:

# /usr/lib/nagios/plugins/check_nwc_health --hostname aswitch --community public --mode interface-usage --name Ethernet1/1
OK - interface Ethernet1/1 (alias UCS-FI-A) usage is in:0.82% (82014272.36bit/s) out:3.21% (320758024.71bit/s) | 'Ethernet1/1_usage_in'=0.82%;80;90;0;100 'Ethernet1/1_usage_out'=3.21%;80;90;0;100 'Ethernet1/1_traffic_in'=82014272.36;8000000000;9000000000;0;10000000000 'Ethernet1/1_traffic_out'=320758024.71;8000000000;9000000000;0;10000000000

The reason for this is that by default check_nwc_health creates a "cached" list of interfaces per checked device. This cached list is a file in /var/tmp/check_nwc_health:

# ls -l /var/tmp/check_nwc_health | grep cache
-rw-r--r-- 1 nagios nagios  8192 Jul 20 08:03 01switch_interface_cache_d2e08e73bba4b976b8b4dcdcf66e3c7d
-rw-r--r-- 1 nagios nagios  8577 Jul 20 08:17 02switch_interface_cache_d2e08e73bba4b976b8b4dcdcf66e3c7d
-rw-r--r-- 1 nagios nagios  8192 Jul 20 08:04 aswitch_interface_cache_81b3d521b731e73215515a4f1f4a3ccf
-rw-r--r-- 1 nagios nagios     0 Jul 20 07:32 bswitch_interface_cache_81b3d521b731e73215515a4f1f4a3ccf
-rw-r--r-- 1 nagios nagios  8192 Jul 20 08:06 cswitch_interface_cache_81b3d521b731e73215515a4f1f4a3ccf
-rw-r--r-- 1 nagios nagios  7017 Jul 20 08:18 dswitch_interface_cache_81b3d521b731e73215515a4f1f4a3ccf
-rw-r--r-- 1 nagios nagios  7013 Jul 20 08:19 eswitch_interface_cache_81b3d521b731e73215515a4f1f4a3ccf
-rw-r--r-- 1 nagios nagios     0 Jul 20 07:31 fswitch_interface_cache_81b3d521b731e73215515a4f1f4a3ccf
-rw-rw-r-- 1 nagios nagios  9291 Jul 20 08:16 gswitch_interface_cache_81b3d521b731e73215515a4f1f4a3ccf
-rw-r--r-- 1 nagios nagios  6245 Jul 20 07:44 hswitch_interface_cache_d2e08e73bba4b976b8b4dcdcf66e3c7d
-rw-r--r-- 1 nagios nagios     0 Jul 20 07:46 iswitch_interface_cache_d2e08e73bba4b976b8b4dcdcf66e3c7d
-rw-r--r-- 1 nagios nagios  4096 Jul 20 08:12 jswitch_interface_cache_d2e08e73bba4b976b8b4dcdcf66e3c7d
-rw-r--r-- 1 nagios nagios  4096 Jul 20 07:46 kswitch_interface_cache_d2e08e73bba4b976b8b4dcdcf66e3c7d
[...]

Note the cache files with a 0-byte size. That's an empty list of interfaces for the specific device - ergo any interface you query is unknown.
Because /var was full the last time the interface cache file was written, a 0-byte file was left behind, causing check_nwc_health to think there are no interfaces at all on the network device.

After removing the cache files the check worked again (if there is no interface cache file, it will be re-created).
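
A quick way to clean up only the broken (empty) cache files would be something like this (a sketch; double-check the path on your satellite before deleting anything):

# find /var/tmp/check_nwc_health -name "*_interface_cache_*" -size 0 -delete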

 

Presenting new monitoring plugin: check_lxc
Wednesday - Jul 19th 2017 - by - (1 comments)

I'm proud to announce a new Nagios/Monitoring plugin: check_lxc.

As the name already tells you, this is a plugin to monitor Linux Containers (LXC). It needs to run on the LXC host and allows checking the CPU, memory and swap usage of a container. The plugin also allows checking for an automatic boot of a container.

Work on this plugin began several years ago, in 2013. After recently adding a CPU check, I think the plugin is now "ready" to be used in the wild.

Back in 2013, when plugin development started, LXC was at version 0.8. I have taken extra care to keep compatibility across LXC releases. As of today I can say that the plugin works from LXC 0.8 upwards.

Enough talk for now. Read the documentation of check_lxc, use the plugin and enjoy!

 

Varnish vcl reload not working with SystemD on Ubuntu 16.04
Tuesday - Jul 18th 2017 - by - (0 comments)

When using Varnish, a restart is often unwanted because it clears the cache. For configuration changes in a vcl, a reload comes in more handy.

However, I came across an issue today: this reload doesn't work with SystemD. The OS is Ubuntu 16.04.2 LTS. The reason for this is the "ExecReload" in the SystemD unit file for Varnish:

# grep ExecReload /etc/systemd/system/varnish.service
ExecReload=/usr/share/varnish/reload-vcl

This command (/usr/share/varnish/reload-vcl) reads the config file /etc/default/varnish - which is now obsolete when using SystemD (see Configure Varnish custom settings on Debian 8 Jessie and Ubuntu 16.04 LTS). An issue on the Github repository of Varnish confirms this bug.

A workaround (and it's a working workaround, I tested it) is to use the new "varnishreload" script. As of this writing this script is not part of the varnish package yet, but will probably soon be added. I downloaded the script and saved it as /usr/sbin/varnishreload (and gave it executable permissions). Then I modified the SystemD unit file for the Varnish service:

# grep ExecReload /etc/systemd/system/varnish.service
ExecReload=/usr/sbin/varnishreload

Followed by a reload of SystemD:

# systemctl daemon-reload

and a restart of Varnish:

# systemctl restart varnish

To test this, I modified the vcl in use (which is not the default.vcl, by the way) and removed a special debug header in the new config. If the reload works, Varnish should stop sending this header in its responses.

# systemctl reload varnish

# systemctl status varnish
● varnish.service - Varnish Cache, a high-performance HTTP accelerator
   Loaded: loaded (/etc/systemd/system/varnish.service; disabled; vendor preset: enabled)
   Active: active (running) since Tue 2017-07-18 15:52:08 CEST; 31min ago
  Process: 7229 ExecReload=/usr/sbin/varnishreload (code=exited, status=0/SUCCESS)
  Process: 26848 ExecStart=/usr/sbin/varnishd -a :6081 -T localhost:6082 -f /etc/varnish/zerberos.vcl -S /etc/varnish/secret -s malloc,2048m (code=exited, status
 Main PID: 26850 (varnishd)
    Tasks: 218
   Memory: 143.6M
      CPU: 3.623s
   CGroup: /system.slice/varnish.service
           +-26850 /usr/sbin/varnishd -a :6081 -T localhost:6082 -f /etc/varnish/zerberos.vcl -S /etc/varnish/secret -s malloc,2048m
           +-26858 /usr/sbin/varnishd -a :6081 -T localhost:6082 -f /etc/varnish/zerberos.vcl -S /etc/varnish/secret -s malloc,2048m

Jul 18 16:22:01 varnish1 varnishd[26850]: CLI telnet ::1 54288 ::1 6082 Wr 200 VCL compiled.
Jul 18 16:22:01 varnish1 varnishreload[7229]: VCL compiled.
Jul 18 16:22:01 varnish1 varnishd[26850]: CLI telnet ::1 54294 ::1 6082 Rd auth b3a13c2d09d6d3551504ace7665994ea9bccab035be9d9518d00ea6f36a8ead3
Jul 18 16:22:01 varnish1 varnishd[26850]: CLI telnet ::1 54294 ::1 6082 Wr 200 -----------------------------
                                            Varnish Cache CLI 1.0
                                            -----------------------------
                                            Linux,4.4.0-77-generic,x86_64,-junix,-smalloc,-smalloc,-hcritbit
                                            varnish-5.1.2 revision 6ece695
                                           
                                            Type 'help' for command list.
                                            Type 'quit' to close CLI session.
Jul 18 16:22:01 varnish1 varnishd[26850]: CLI telnet ::1 54294 ::1 6082 Rd ping
Jul 18 16:22:01 varnish1 varnishd[26850]: CLI telnet ::1 54294 ::1 6082 Wr 200 PONG 1500387721 1.0
Jul 18 16:22:01 varnish1 varnishd[26850]: CLI telnet ::1 54294 ::1 6082 Rd vcl.use reload_20170718_162201
Jul 18 16:22:01 varnish1 varnishd[26850]: CLI telnet ::1 54294 ::1 6082 Wr 200 VCL 'reload_20170718_162201' now active
Jul 18 16:22:01 varnish1 varnishreload[7229]: VCL 'reload_20170718_162201' now active
Jul 18 16:22:01 varnish1 systemd[1]: Reloaded Varnish Cache, a high-performance HTTP accelerator.

systemctl status seems to verify a working reload. However, don't let yourself be fooled - the same kind of entries also appeared with the non-working reload script before. A manual check confirmed that the reload of the changed vcl config actually worked this time; the debug headers were gone from the HTTP responses.
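
Such a manual check can be as simple as looking at the response headers on the Varnish listener, for example (a sketch; "X-Debug" is just a placeholder for whatever debug header your vcl sets):

# curl -sI http://localhost:6081/ | grep -i x-debug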

I changed the vcl again, re-enabled the headers and ran another systemctl reload varnish - and the headers were back. So make sure you're using the new varnishreload script when running Varnish on Ubuntu 16.04 LTS with SystemD (this might also affect other Linux distributions; I didn't test that).

 

Count backwards with seq in Linux
Wednesday - Jul 12th 2017 - by - (0 comments)

I needed to manually create some basic web statistics using awstats (a one-shot statistic). My approach was to take all the rotated logs and create one big access log. I wanted the lines of that access log in the correct order to avoid awstats stumbling over out-of-order entries.

First I unzipped all rotated logs:

gunzip *gz

Then I needed the log entries from rotated file 40 down to rotated file 9. But here's the catch: how do I count down without having to write out every single number from 40 to 9 (that would be something like "for i in 40 39 38 37", etc.)? I know how to automatically count up using seq:

$ seq 1 5
1
2
3
4
5

So I needed to find a way to count backwards. The solution? seq again :-)

seq offers an optional parameter between the starting and the ending number. From the --help output:

$ seq --help
Usage: seq [OPTION]... LAST
  or:  seq [OPTION]... FIRST LAST
  or:  seq [OPTION]... FIRST INCREMENT LAST
Print numbers from FIRST to LAST, in steps of INCREMENT.
[...]

Example: count up to 10 in steps of 2:

$ seq 1 2 10
1
3
5
7
9

The INCREMENT number can be negative, too:

$ seq 10 -1 1
10
9
8
7
6
5
4
3
2
1

And this is actually the way to count down. To put together all rotated logs in the correct order, I finally used the following command:

$ for i in $(seq 40 -1 9); do cat access.log.$i >> final.access.log; done
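
As a side note, reasonably recent bash versions can do the same with brace expansion, which counts downwards automatically when the first number is larger (bash-specific, not POSIX sh):

$ for i in {40..9}; do cat access.log.$i >> final.access.log; done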

 

Gandi domain registrar hacked?
Friday - Jul 7th 2017 - by - (0 comments)

Today we received several messages that some websites no longer worked. Further analysis revealed that several domains had suddenly had their DNS nameservers changed.

A whois lookup of an affected domain showed the following nameservers:

ns1.dnshost.ga
ns2.dnshost.ga

A DNS lookup using "dig -t NS" on the affected domains showed NS records of

ns1.example.com
ns2.example.com

The A records were set to 46.183.219.205 (an IP address registered in Latvia).

Currently we have 922 domains registered at Gandi. 7 domains were affected and all their nameservers pointed to the ones above. Without any action on our side. Without Gandi having done anything.

Direct communication with Gandi revealed that these manipulations didn't happen on our account only; several customers were affected. I was also assured that it has nothing to do with the new Gandi v5 platform, but that the problem lies somewhere between the Gandi backend and its communication with the domain registries (like nic.ch for Swiss domains).

This pretty much sounds like a hack of Gandi's backend to me. Ouch :-((

The domain settings were quickly restored and an update to the NIC servers was initiated. After a couple of hours our affected domains were working again. However, I'm still curious to hear what exactly caused this.

Update July 10th 2017: Gandi confirmed an "unauthorized connection" in their backend in a statement sent to the affected customers:

Following an unauthorized connection which occurred at one of the
technical providers we use to manage a number of geographic TLDs[2].

In all, 751 domains in total were affected by this incident, which
involved a unauthorized modification of the name servers [NS] assigned
to the affected domains that then forwarded traffic to a malicious site
exploiting security flaws in several browsers.

Additionally, SWITCH security (the registry of .ch domains) added a good technical article about that case here: https://securityblog.switch.ch/2017/07/07/94-ch-li-domain-names-hijacked-and-used-for-drive-by/ 

Update July 11th 2017: Gandi published a dedicated article on their news blog. In this article Gandi shares details about what happened. It's really worth checking out. I appreciate Gandi's transparency!

 

