
Installing check_vmware_esx.pl and VMware Perl SDK 6.7 on Ubuntu 16.04 Xenial
Friday - Jul 20th 2018

While preparing to put a new highly available Icinga 2 cluster into production, I got to the step of migrating the monitoring plugins. While most of the plugins are easily migrated, some depend on third-party modules.

Today's headscratcher was the migration of "check_vmware_esx" because it relies on the VMware Perl SDK. I've already had my share of experiences and troubleshooting cases with that SDK in the past, so I knew what I was getting into... For reference see:

Because I just spent several hours figuring out which versions work well together (new OS, newer Perl version and modules, new VMware SDK, new ESXi versions), follow the steps below to install the check_vmware_esx plugin on Ubuntu 16.04 (Xenial). Trust me, it will save you a lot of effort.

1. Install pre-requirements
Both the VMware Perl SDK and the check_vmware_esx plugin require some Perl modules. Install the following:

root@xenial:~# apt-get install libssl-dev libxml-libxml-simple-perl libsoap-lite-perl libdata-dump-perl libuuid-perl libdata-uuid-libuuid-perl libuuid-tiny-perl libarchive-zip-perl libcrypt-ssleay-perl libclass-methodmaker-perl libtime-duration-perl

2. Download the SDK
The VMware Perl SDK can be downloaded from here: https://code.vmware.com/web/sdk/67/vsphere-perl. If the link doesn't work anymore, use your favourite search engine to find it. As of this writing the newest available version is 6.7. I downloaded VMware-vSphere-Perl-SDK-6.7.0-8156551.x86_64.tar.gz.

Note: In the past (= a couple of years ago) it was not advised to use VMware Perl SDK above version 5.5. The SDK 6.0 was bloody slow and even the plugin maintainer, Martin Fuerstenau, called it "pretty little thing of bull..t.". This does not apply anymore as 6.7 seems to be working rather fast.

3. Unpack and install the SDK
Note: The SDK needs to be installed as root (or use sudo).

root@xenial:~# tar -xzf VMware-vSphere-Perl-SDK-6.7.0-8156551.x86_64.tar.gz
root@xenial:~# cd vmware-vsphere-cli-distrib/

Now launch the installation of the Perl SDK. Note: This can take quite some time, especially the CPAN installations. Grab a coffee...

root@xenial:~/vmware-vsphere-cli-distrib# ./vmware-install.pl
Creating a new vSphere CLI installer database using the tar4 format.
 
Installing vSphere CLI 6.7.0 build-8156551 for Linux.
 
You must read and accept the vSphere CLI End User License Agreement to
continue.
Press enter to display it.

[...]

Do you accept? (yes/no) yes
 
Thank you.
 
warning: vSphere CLI requires Perldoc.
Please install perldoc.
 
WARNING: The http_proxy environment variable is not set. If your system is
using a proxy for Internet access, you must set the http_proxy environment
variable .
 
If your system has direct Internet access, you can ignore this warning .
 
WARNING: The ftp_proxy environment variable is not set.  If your system is
using a proxy for Internet access, you must set the ftp_proxy environment
variable .
 
If your system has direct Internet access, you can ignore this warning .
 
Please wait while configuring CPAN ...
 
Can't locate Module/Build.pm in @INC (you may need to install the Module::Build module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.22.1 /usr/local/share/perl/5.22.1 /usr/lib/x86_64-linux-gnu/perl5/5.22 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.22 /usr/share/perl/5.22 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base .).
BEGIN failed--compilation aborted.
Below mentioned modules with their version needed to be installed,
these modules are available in your system but vCLI need specific
version to run properly
 
Module: ExtUtils::MakeMaker, Version: 6.96
Module: Module::Build, Version: 0.4205
Module: Net::FTP, Version: 2.77
Do you want to continue? (yes/no) yes

Please wait while configuring perl modules using CPAN ...
CPAN is downloading and installing pre-requisite Perl module "Path::Class" .
CPAN is downloading and installing pre-requisite Perl module "Socket6 " .
CPAN is downloading and installing pre-requisite Perl module "Text::Template" .
CPAN is downloading and installing pre-requisite Perl module "IO::Socket::INET6" .
CPAN is downloading and installing pre-requisite Perl module "Net::INET6Glue" .
 
In which directory do you want to install the executable files?
[/usr/bin]  [Enter]
 
Please wait while copying vSphere CLI files...
 
The installation of vSphere CLI 6.7.0 build-8156551 for Linux completed
successfully. You can decide to remove this software from your system at any
time by invoking the following command:
"/usr/bin/vmware-uninstall-vSphere-CLI.pl".
 
This installer has successfully installed both vSphere CLI and the vSphere SDK
for Perl.
 
The following Perl modules were found on the system but may be too old to work
with vSphere CLI:
 
Time::Piece 1.31 or newer
Try::Tiny 0.28 or newer
UUID 0.27 or newer
XML::NamespaceSupport 1.12 or newer
XML::LibXML::Common 2.0129 or newer
XML::LibXML 2.0129 or newer
LWP 6.26 or newer
LWP::Protocol::https 6.07 or newer
 
Enjoy,
 
--the VMware team

4. Test the Perl SDK
You can launch the following command (which came from the SDK installation) to test if the SDK works:

root@xenial:~# /usr/lib/vmware-vcli/apps/general/connect.pl --server myesxhost
Enter username: root
Enter password:
 
Connection Successful
Server Time : 2018-07-20T13:17:48.861076Z

That looks good! Connection was successful!

Note: With an older SDK (I tested 5.5) an error message "Server version unavailable" appeared and no matter what I tried, I couldn't fix it this time.

5. The monitoring plugin check_vmware_esx.pl
Let's proceed with the monitoring plugin! I suggest cloning the whole GitHub repository because the plugin alone is not enough.

root@xenial:~# cd /tmp
root@xenial:/tmp# git clone https://github.com/BaldMansMojo/check_vmware_esx.git

Then copy the plugin to your other plugins and give it the correct permissions:

root@xenial:/tmp# cp /tmp/check_vmware_esx/check_vmware_esx.pl /usr/lib/nagios/plugins/check_vmware_esx.pl
root@xenial:/tmp# chmod 755 /usr/lib/nagios/plugins/check_vmware_esx.pl

The plugin comes with a few Perl modules of its own. They're required for the plugin to run correctly.
To keep them separate from the system Perl modules, I created a dedicated folder and copied them into it:

root@xenial:/tmp# mkdir -p /opt/check_vmware_esx/modules
root@xenial:/tmp# cp check_vmware_esx/modules/* /opt/check_vmware_esx/modules/

The plugin now needs to know where it can find its own modules. The default is just "modules":

root@xenial:/tmp# grep "use lib" /usr/lib/nagios/plugins/check_vmware_esx.pl
use lib "modules";
#use lib "/usr/lib/nagios/vmware/modules";

So set the correct path (/opt/check_vmware_esx/modules):

root@xenial:/tmp# sed -i "/^use lib/s/modules/\/opt\/check_vmware_esx\/modules/" /usr/lib/nagios/plugins/check_vmware_esx.pl

So it is now:

root@xenial:/tmp# grep "use lib" /usr/lib/nagios/plugins/check_vmware_esx.pl
use lib "/opt/check_vmware_esx/modules";
#use lib "/usr/lib/nagios/vmware/modules";

6. Launch the plugin
Lean back and enjoy.

root@xenial:~# /usr/lib/nagios/plugins/check_vmware_esx.pl -H myesxhost -u root -p secret --select=runtime
OK: 0/21 VMs suspended - 1/21 VMs powered off - 20/21 VMs powered on - overallstatus=green - connection state=connected - All 125 health checks are GREEN: CPU (2x), power (14x), Processors (20x), voltage (32x), system (1x), System (1x), Platform Alert (1x), temperature (28x), other (25x), Memory (1x) - 0 config issues  - 0 config issues ignored
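
Tip: instead of passing the password on the command line (where it shows up in the process list), check_vmware_esx also supports an authentication file. The path and file format below are a sketch from memory - verify the exact option name and format against the plugin's --help output:

root@xenial:~# cat > /etc/icinga2/vmware.authfile << EOF
username=root
password=secret
EOF
root@xenial:~# chmod 600 /etc/icinga2/vmware.authfile
root@xenial:~# /usr/lib/nagios/plugins/check_vmware_esx.pl -H myesxhost -f /etc/icinga2/vmware.authfile --select=runtime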

7. Icinga 2's ITL config
To work with Icinga 2 out of the box, the plugin either needs to be renamed to "/usr/lib/nagios/plugins/check_vmware_esx" or we can create a symlink:

root@xenial:~# ln -s /usr/lib/nagios/plugins/check_vmware_esx.pl /usr/lib/nagios/plugins/check_vmware_esx

The reason for this is that Icinga 2's ITL (Icinga Template Library) defines the commands "vmware-esx-*" to use the plugin check_vmware_esx (without the .pl extension):

root@xenial:~# grep check_vmware_esx /usr/share/icinga2/include/ -rni
/usr/share/icinga2/include/plugins-contrib.d/vmware.conf:25:    command = [ PluginContribDir + "/check_vmware_esx" ]
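
A minimal service definition using one of these ITL commands could then look roughly like the following sketch. The CheckCommand and custom variable names below are written from memory; verify them against the CheckCommand definitions in plugins-contrib.d/vmware.conf before using this:

apply Service "esx-runtime" {
  check_command = "vmware-esx-soap-host-runtime"   // verify the exact command name in the ITL
  vars.vmware_host = host.address                  // custom variable names are assumptions, check vmware.conf
  vars.vmware_username = "root"
  vars.vmware_password = "secret"
  assign where host.vars.os == "ESXi"
}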

That's it. You're welcome.

 

Freeing up disk space from CouchDB - do not forget the views!
Thursday - Jul 19th 2018

In the past few weeks I've seen a steady increase of disk usage of a CouchDB cluster I'm managing:

CouchDB Disk Usage steadily increases 

Time to free up some disk space! I already knew there was a "compaction" mechanism (comparable to the "vacuum" process in PostgreSQL) which frees up used disk space by removing old revisions of data. But when I ran a compaction on the database using the most disk space, it didn't really help.

Before I ran the compaction on the DB, it had a disk size of 10846902016 bytes:

# curl -q -s localhost:5984/bigdb
{
 "db_name": "bigdb",
 "update_seq": "13559532-g1AAAAHTeJzLYWBg4...",
 "sizes": {
  "file": 10846902016,
  "external": 3690681900,
  "active": 5355688382
 },
 "purge_seq": 0,
 "other": {
  "data_size": 3690681900
 },
 "doc_del_count": 46,
 "doc_count": 13559486,
 "disk_size": 10846902016,
 "disk_format_version": 6,
 "data_size": 5355688382,
 "compact_running": true,
 "cluster": {
  "q": 8,
  "n": 3,
  "w": 2,
  "r": 2
 },
 "instance_start_time": "0"
}

Running compact:

# curl -q -s -H "Content-Type: application/json" -X POST localhost:5984/bigdb/_compact
{"ok":true}

After this, I was able to see the status of the database compaction processes in the Fauxton UI:

CouchDB Database Compaction Progress
 

But once the compaction was completed, I found the disk size didn't change:

# curl -q -s localhost:5984/bigdb
{
 "db_name": "bigdb",
 "update_seq": "13559692-g1AAA...",
 "sizes": {
  "file": 10851612416,
  "external": 3690734442,
  "active": 5355768390
 },
 "purge_seq": 0,
 "other": {
  "data_size": 3690734442
 },
 "doc_del_count": 46,
 "doc_count": 13559646,
 "disk_size": 10851612416,
 "disk_format_version": 6,
 "data_size": 5355768390,
 "compact_running": false,
 "cluster": {
  "q": 8,
  "n": 3,
  "w": 2,
  "r": 2
 },
 "instance_start_time": "0"
}

Even worse: The compaction process used even more disk space. The opposite of what I expected!

I then checked at the file system level where most of the disk space was being used and came across the following folders:

root@st-cdb01-p:/var/lib/couchdb# du -ksh shards/
15G    shards/

root@st-cdb01-p:/var/lib/couchdb# du -ksh .shards/
30G    .shards/

Note the dot in the second folder (.shards). According to the documentation, the ".shards" folder contains "views" and not "databases". So I manually checked the size of a view using the Fauxton UI:

CouchDB View Size 

Woah! Taking a look and comparing "Actual data size (bytes): 1,444,291,661" and "Data size on disk (bytes): 19,648,534,600" I was pretty sure I found the bad guy.

A compaction can also be run on a view (in this case "stats" is the design document containing the views, as can also be seen in the UI screenshot above):

root@couchdb:~# curl -q -s -H "Content-Type: application/json" -X POST localhost:5984/bigdb/_compact/stats
{"ok":true}

The compaction processes and their current progress can also be checked in the UI:

CouchDB View Compaction Progress
 

Once all of these processes were completed, 20GB of disk space were freed!

CouchDB Disk Usage after Views Compaction 

The change can also be seen in Fauxton:

CouchDB View Size after compaction

Some additional questions related to compaction and their answers below:

How did the compaction affect the cluster?
I ran the compaction on node 1 of a two node cluster. I could not see an immediate change of disk usage on the second node. I had to run the same compaction commands on node 2 to free disk space there, too.

Shouldn't auto compaction do this job?
That's what I thought, too. I verified that automatic compaction is enabled and this seems to be the case by default (Ubuntu 16.04, CouchDB 2.1):

root@couchdb:~# grep "\[daemons\]" -A 10 /opt/couchdb/etc/default.ini
[daemons]
index_server={couch_index_server, start_link, []}
external_manager={couch_external_manager, start_link, []}
query_servers={couch_proc_manager, start_link, []}
vhosts={couch_httpd_vhost, start_link, []}
httpd={couch_httpd, start_link, []}
uuids={couch_uuids, start, []}
auth_cache={couch_auth_cache, start_link, []}
os_daemons={couch_os_daemons, start_link, []}
compaction_daemon={couch_compaction_daemon, start_link, []}

The compaction_daemon is enabled and so are the settings:

root@couchdb:~# grep "\[compaction_daemon\]" -A 8 /opt/couchdb/etc/default.ini  
[compaction_daemon]
; The delay, in seconds, between each check for which database and view indexes
; need to be compacted.
check_interval = 300
; If a database or view index file is smaller then this value (in bytes),
; compaction will not happen. Very small files always have a very high
; fragmentation therefore it's not worth to compact them.
min_file_size = 131072

root@couchdb:~# grep "\[compactions\]" -A 78 /opt/couchdb/etc/default.ini  | egrep -v "^;"
[compactions]
_default = [{db_fragmentation, "70%"}, {view_fragmentation, "50%"}, {from, "00:00"}, {to, "04:00"}, {parallel_view_compaction, true}]

Note: I changed view_fragmentation from the default 60% to 50% and added the "from" and "to" timeslot.

So auto compaction should have been doing its job to free up disk space. According to the logs the compaction daemon did indeed run (on databases and views) but nothing was freed up.

TL;DR of this article?
Do not forget to compact your db views, too! Check their sizes (either in the UI or via CLI) and you should be able to determine where your disk space is getting wasted.
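
To check a view's size via the CLI, the design document's _info endpoint can be queried; a quick sketch, using the "stats" design document from above:

# curl -q -s localhost:5984/bigdb/_design/stats/_info

The response contains a sizes object for the view index ("file" vs. "active"), comparable to the database sizes shown earlier.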

How can I make sure to run compact on all relevant databases and views?
For this purpose I created a script called compact_couchdb.sh. It iterates through all databases found on the addressed CouchDB, detects the views in each database, and then compacts every database and every view it found.
The script can be found here (on Github): https://github.com/Napsty/scripts/blob/master/couchdb/compact_couchdb.sh

 

Monitoring memory usage of a LXC container (comparing 1.x vs 2.x)
Wednesday - Jul 18th 2018

Monitoring Linux Containers (LXC, also known as System Containers to separate them from the Docker world) is as important as monitoring your LXC host. But the resource usage seen inside a container is sometimes "unreal".

Back in 2013, when I started the monitoring plugin check_lxc, the only way to really check the memory usage of a container was to check the current cgroup values of the container:

root@lxchost:~# /usr/lib/nagios/plugins/check_lxc.sh -n container1 -t mem
LXC container1 OK - Used Memory: 6187 MB|mem=6488358912B;0;0;0;0

Behind the scenes, check_lxc.sh reads the cgroup values of container1. But why so complicated? Why not just run a classic check_mem.pl inside the container?

To answer that question, take a look at the following picture:

LXC 1 Memory Usage 

Focus on the memory usage; both the LXC host (top), running LXC 1.x, and the LXC container (bottom) show the exact same values.

Or to see it in text form:

root@lxchost:~# free -m
             total       used       free     shared    buffers     cached
Mem:         32176      30911       1264        119       1855      21921
-/+ buffers/cache:       7135      25041
Swap:         3814        165       3649


root@container1:~# free -m
             total       used       free     shared    buffers     cached
Mem:         32176      30911       1264        119       1855      21921
-/+ buffers/cache:       7135      25041
Swap:         3814        165       3649

The container only sees the same values as the host. But the container itself only uses 6187 MB according to cgroups, not 7135 MB.
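
For illustration, this is roughly what check_lxc.sh reads on the host (a sketch assuming the cgroup v1 memory controller and the default LXC cgroup path; adjust to your setup):

root@lxchost:~# cat /sys/fs/cgroup/memory/lxc/container1/memory.usage_in_bytes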

That's why you should use check_lxc on the host to get a more accurate memory usage of the containers.

Until recently.

Now that I'm working on a new LXC environment on Debian Stretch, there's a newer LXC version (LXC 2.x). Something immediately caught my eye the first time I ran (h)top:

LXC 2.x container memory usage 

Focus again on the memory usage. This time the LXC host (top) and the LXC container (bottom) show different values. True, the (CPU) load and the swap usage are still the same on both host and container, but it's already something!

Doing the same check with free:

root@lxchost:~# free -m
              total        used        free      shared  buff/cache   available
Mem:          64421       21419         361         179       42639       42173
Swap:         15258         668       14590


root@container1:~# free -m
              total        used        free      shared  buff/cache   available
Mem:          64421        1038       62770         179         612       62770
Swap:         15258         668       14590

Note: These are different hosts and different containers than the values seen above from LXC 1.x.

Both host and container show the total capacity of memory and swap, but the used column clearly shows a difference.
But don't be fooled: The calculation of the available memory (last column) is kind of wrong in the container. That is because the container cannot know about the other containers running beside it and is therefore unaware of other memory consumers.

What about check_lxc in that case?

root@lxchost:~# /usr/lib/nagios/plugins/check_lxc.sh -n container1 -t mem
LXC container1 OK - Used Memory: 1596 MB|mem=1673646080B;0;0;0;0

The host tells us the container is using 1596 MB, which is almost the same value as 1038 (used) + 612 (buff/cache) (=1650 MB).

The big question now: Can check_mem.pl be used inside the container and give accurate alerts?

root@container1:~# ./check_mem.pl -u -w 90 -c 95
OK - 2.6% (1693488 kB) used.|TOTAL=65967908KB;;;; USED=1693488KB;59371117;62669512;; FREE=64274420KB;;;; CACHES=443236KB;;;;

The answer is: No. The check_mem.pl plugin (as of today) bases its calculation on the "free" output shown above, and as long as those values are somewhat incorrect, the container's resource consumption (disk, memory, CPU) should still be monitored on the host.

If you'd create a script/plugin which only checks the "used" value, you're probably good to go though.
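
Such a minimal "used only" check could look something like this sketch (the warning/critical thresholds in percent are made-up example values):

#!/bin/bash
# Sketch: only look at the "used" column of free, ignore buffers/cache
warn=90; crit=95
used_pct=$(free | awk '/^Mem:/ { printf "%d", $3/$2*100 }')
if [ "$used_pct" -ge "$crit" ]; then echo "CRITICAL - ${used_pct}% memory used"; exit 2
elif [ "$used_pct" -ge "$warn" ]; then echo "WARNING - ${used_pct}% memory used"; exit 1
else echo "OK - ${used_pct}% memory used"; exit 0; fi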

But let's focus on the good news: When you're logged into the container and you run (h)top you now see the (more or less) correct memory consumption of the container. That's already a big improvement and really helpful.

 

Creating a persistent volume in Rancher 2.0 from a NFS share
Wednesday - Jun 27th 2018

I'm currently building a new Docker environment. For almost two years now I've successfully been running Docker containers in Rancher 1.x; see some related posts:

Now that Rancher 2.0 has recently come out, it's definitely worth seeing what can be achieved with it. Something that caught my eye while clicking through the new user interface was the "persistent storage" section. As it turns out, the new Docker environment I'm building needs some Docker containers which require an external file system (NFS share) mounted from a central NFS server. As you can read in my post "The Docker Dilemma: Benefits and risks going into production with Docker", I'm not a fan of mounting local volumes from the Docker host into the container (mainly for security reasons), but mounting a network file system like an NFS share is less of a security risk. Let's call it by its name though: a Docker container, or rather the application running inside the container, should be built with cloud-readiness in mind. This means there shouldn't be any fixed mounts of (internal) file servers (object storage accessed through HTTP calls is the future). But anyway, in the short term this environment still requires that particular NFS share mounted into some of the containers.

Back to the topic: Rancher 2.0 comes with a cluster-wide storage solution. A lot of storage drivers (volume plugins) are ready to be used, including the "NFS Share". And here's how you do it.

1. Add the NFS share as persistent volume on the Kubernetes cluster

At the Kubernetes cluster level (here mh-gamma-stage), you can find the top menu entry "Storage" and, inside it, "Persistent Volumes".

Click on "Add Volume" and the following form will be shown:

Rancher 2.0 NFS Persistent Storage 

Give the new volume a meaningful name; here I chose nfs-gamma-stage.
Select the correct volume plugin; here "NFS Share" (self-explanatory).
I defined a volume capacity of 500GB here, but it doesn't actually matter as the NFS server defines the capacity; see later in this article.
Path is the export path from the NFS server (see /etc/exports on the NFS server; a hypothetical example entry is shown after the screenshot below).
Server of course is the IP or DNS name of the NFS server.
It's also possible to choose whether this volume should be read-only or not.
Hit the "Save" button and you will see the volume being "Available":

Rancher 2.0 NFS Share as persistent volume 
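
For reference, the "Path" and "Server" fields correspond to an export on the NFS server. A purely hypothetical /etc/exports entry could look like this (export path, network and options are examples only):

/export/gamma-stage  10.0.0.0/16(rw,sync,no_subtree_check)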

2. Create project and namespace

Inside the Kubernetes cluster level, make sure you create a project and a namespace inside the project - if you haven't already.

3. Claim the persistent volume in the project

The previously created volume can now be claimed inside a project. Enter the project where you want to claim the volume (here: "Gamma" project).

In the tab "Workloads", select the navigation tab "Volumes".

Rancher 2.0 persistent volume from NFS share 

Click on the "Add Volume" button and the "Add Volume Claim" form will show up:

Rancher 2.0 NFS Share as persistent volume

Name: Enter a meaningful name, it can even be the same name as on Kubernetes cluster level.
Namespace: Select the namespace in which the volume should be seen.
Source: Select "Use an existing persistent volume".
Persistent Volume: Select the volume created before.

Click on "Create" and the volume will then show up as "Bound":

NFS Share in Rancher 2.0  

4. Deploy a new workload with the volume attached

So far so good, but now we want to have some containers with this persistent volume! Change to the tab "Workloads" and click on the "Deploy" button to deploy a new workload (and therefore container/s):

NFS Share in Rancher 2.0

I chose a commonly used image "ubuntu:xenial" and scrolled down to the "Volumes" configuration.

Rancher 2.0 NFS Volume attach to Docker container 

Here I selected the persistent volume I created before and also chose two mount points.
In this example the persistent volume (ergo the NFS share) will be mounted twice:
- /mnt will be used as mount point within the container to mount the whole volume. This will be mounted read-only.
- /logs will be used as mount point within the container to mount a subfolder (logs) of the volume. This will be mounted with read-write permissions.

So this is actually pretty useful: The same volume can be used for multiple mount points. It's not necessary to create several volumes and then mount each volume separately into the container. Saves a lot of work!

After this, deploy the workload.

5. Inside the container

Once the workload is deployed (you can see this on the green dots), you can execute a shell into a container and verify that the volumes were mounted:

Rancher 2.0 Volume in Container
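
From that shell, a quick way to verify the two mounts is (using the example mount points chosen above):

root@container:/# mount | grep -E ' /mnt | /logs '
root@container:/# df -h /mnt /logs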

So far so good! Several containers were able to write into the volume at the same time (where read-write was given).

But what about the volume sizing? As you could see above, I set a capacity of 500GB in the user interface but the NFS share in the container clearly shows a size of 95GB.
When we increased the NFS share on the NFS server, this was immediately seen inside the container. So this capacity limit in the Rancher UI seems to be more informational than a restriction (not sure though).

 

Search field in Firefox is gone, how to get it back
Friday - Jun 15th 2018

In recent Firefox versions I noticed that the search field next to the address bar has disappeared. While it is still possible to simply enter keywords in the address bar and therefore use it as a search field, it is not possible to dynamically change the search engine, which was pretty handy at times.

Search field in Firefox gone 

But the search field can be made visible again. It simply requires a quick change of Firefox's settings.

Open a new tab, enter "about:config" in the address bar and press Enter. Accept the warning that you'll be careful.

In the search field (inside the about:config tab), enter: "browser.search.widget.inNavBar".

Firefox change config to show search field again 

As you can see, the value is set to "False". Double-click on the preference's line and it will change to "True" (and the text will become bold). You'll also see that the search field has magically re-appeared next to the address bar:

Firefox showing search field again
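
If you prefer a file-based way over clicking through about:config, the same preference can also be set in a user.js file inside your Firefox profile directory (a sketch; the profile path differs per installation):

user_pref("browser.search.widget.inNavBar", true);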

 

Ansible: Detect and differentiate between LXC containers and hosts
Wednesday - Jun 13th 2018

While looking for a way to handle certain tasks differently inside an LXC container and on a (physical) host, I first tried to use Ansible variables (facts) based on hardware information.

But the problem, as you might know, is that LXC containers basically see the same hardware as the host because they use the same kernel (there is no hardware virtualization layer in between).
Note: That's what makes the containers much faster than VM's, just sayin'.

So checking for hardware will not work, as both host and container see the same:

$ ansible host -m setup | grep ansible_system_vendor
        "ansible_system_vendor": "HP",

$ ansible container -m setup | grep ansible_system_vendor
        "ansible_system_vendor": "HP",

When I looked through all available variables coming from "-m setup", I stumbled across ansible_virtualization_role at the end of the output. Looks interesting!

$ ansible host -m setup | grep ansible_virtualization_role
        "ansible_virtualization_role": "host",

$ ansible container -m setup | grep ansible_virtualization_role
        "ansible_virtualization_role": "guest",

Awesome!
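
In a playbook or role, this fact can then be used in a when condition so that a task only runs inside containers (a minimal sketch; it assumes fact gathering is enabled):

# Sketch: only run this task on guests (e.g. LXC containers)
- name: Do container-specific stuff
  debug:
    msg: "This host is a container"
  when: ansible_virtualization_role == "guest"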

 

Ubuntu 18.04: /etc/rc.local does not exist anymore - does it still work?
Monday - Jun 11th 2018

I was about to create a new Galera Cluster where a third node is simply used as Arbitrator, not as a data node.

As this is just a simple process to start up, I usually added a line in /etc/rc.local to launch it. But in the new Ubuntu 18.04 this file does not exist anymore:

root@arbitrator:~# file /etc/rc.local
/etc/rc.local: cannot open `/etc/rc.local' (No such file or directory)

Sure, I could write a SystemD unit file for a garbd.service now, but lazy, you know.

So I wondered if and how /etc/rc.local is still working in Ubuntu 18.04 and I stumbled on the manpage of systemd-rc-local-generator:

"systemd-rc-local-generator is a generator that checks whether /etc/rc.local exists and is executable, and if it is pulls the rc-local.service unit into the boot process[...]"

Cool, let's give it a shot.

root@arbitrator:~# echo "/usr/bin/garbd --cfg /etc/garbd.conf &" > /etc/rc.local
root@arbitrator:~# echo "exit 0" >> /etc/rc.local
root@arbitrator:~# chmod 755 /etc/rc.local

After a reboot, the process didn't show up. What happened?

root@arbitrator:~# systemctl status rc-local
● rc-local.service - /etc/rc.local Compatibility
   Loaded: loaded (/lib/systemd/system/rc-local.service; enabled-runtime; vendor preset: enabled)
  Drop-In: /lib/systemd/system/rc-local.service.d
           └─debian.conf
   Active: failed (Result: exit-code) since Mon 2018-06-11 16:53:47 CEST; 1min 53s ago
     Docs: man:systemd-rc-local-generator(8)
  Process: 1182 ExecStart=/etc/rc.local start (code=exited, status=203/EXEC)

Jun 11 16:53:46 arbitrator systemd[1]: Starting /etc/rc.local Compatibility...
Jun 11 16:53:47 arbitrator systemd[1182]: rc-local.service: Failed to execute command: Exec format error
Jun 11 16:53:47 arbitrator systemd[1182]: rc-local.service: Failed at step EXEC spawning /etc/rc.local: Exec format error
Jun 11 16:53:47 arbitrator systemd[1]: rc-local.service: Control process exited, code=exited status=203
Jun 11 16:53:47 arbitrator systemd[1]: rc-local.service: Failed with result 'exit-code'.
Jun 11 16:53:47 arbitrator systemd[1]: Failed to start /etc/rc.local Compatibility.

Exec format error? Huh?

I came across an older article from SUSE (SLES12) which looked kind of similar: the after.local.service fails to start with an exec format error. The resolution in that knowledge base article:

"Add a hashpling to the begining [...]"

D'oh! I compared with an /etc/rc.local from another system (Xenial) and indeed it starts with a shebang, like a normal shell script. I changed my /etc/rc.local and added #!/bin/bash:

root@arbitrator:~# cat /etc/rc.local
#!/bin/bash
/usr/bin/garbd --cfg /etc/garbd.conf &
exit 0

The rc-local service could now be restarted with success:

root@arbitrator:~# systemctl restart rc-local

root@arbitrator:~# systemctl status rc-local
● rc-local.service - /etc/rc.local Compatibility
   Loaded: loaded (/lib/systemd/system/rc-local.service; enabled-runtime; vendor preset: enabled)
  Drop-In: /lib/systemd/system/rc-local.service.d
           └─debian.conf
   Active: active (running) since Mon 2018-06-11 16:59:31 CEST; 11s ago
     Docs: man:systemd-rc-local-generator(8)
  Process: 1948 ExecStart=/etc/rc.local start (code=exited, status=0/SUCCESS)
    Tasks: 3 (limit: 4664)
   CGroup: /system.slice/rc-local.service
           └─1949 /usr/bin/garbd --cfg /etc/garbd.conf

Jun 11 16:59:31 arbitrator systemd[1]: Starting /etc/rc.local Compatibility...
Jun 11 16:59:31 arbitrator systemd[1]: Started /etc/rc.local Compatibility.

And garbd is started:

root@arbitrator:~# ps auxf | grep garb | grep -v grep
root      1949  0.0  0.2 181000  9664 ?        Sl   16:59   0:00 /usr/bin/garbd --cfg /etc/garbd.conf

TL;DR: /etc/rc.local still works in Ubuntu 18.04, when
1) it exists
2) it is executable
3) it starts with a valid shebang (the "hashpling" from the SUSE article)
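
And for the record, the non-lazy alternative mentioned above would only be a handful of lines anyway. A minimal, untested sketch of such a unit using the same garbd command line (some distributions may already ship a garbd service with their Galera packages):

# /etc/systemd/system/garbd.service (sketch)
[Unit]
Description=Galera Arbitrator
After=network.target

[Service]
ExecStart=/usr/bin/garbd --cfg /etc/garbd.conf
Restart=on-failure

[Install]
WantedBy=multi-user.target

Followed by the usual systemctl daemon-reload and systemctl enable --now garbd.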

 

Filesystem full because of Filebeat still hanging on to rotated log
Monday - Jun 11th 2018

In the past weeks I've seen recurring file system warnings on certain servers. After some investigation it turned out to be Filebeat still hanging on to an already rotated log file and therefore not releasing the inode, ergo not giving back the disk space to the file system.

Let's start at the beginning. Icinga sent the disk warning for /var:

root@server:~# df -h
Filesystem            Type        Size  Used Avail Use% Mounted on
sysfs                 sysfs          0     0     0    - /sys
proc                  proc           0     0     0    - /proc
udev                  devtmpfs    2.0G   12K  2.0G   1% /dev
devpts                devpts         0     0     0    - /dev/pts
tmpfs                 tmpfs       396M  736K  395M   1% /run
/dev/sda1             ext4        4.5G  2.0G  2.3G  48% /
none                  tmpfs       4.0K     0  4.0K   0% /sys/fs/cgroup
none                  fusectl        0     0     0    - /sys/fs/fuse/connections
none                  debugfs        0     0     0    - /sys/kernel/debug
none                  securityfs     0     0     0    - /sys/kernel/security
none                  tmpfs       5.0M     0  5.0M   0% /run/lock
none                  tmpfs       2.0G     0  2.0G   0% /run/shm
none                  tmpfs       100M     0  100M   0% /run/user
none                  pstore         0     0     0    - /sys/fs/pstore
/dev/mapper/vg0-lvvar ext4         37G   33G  3.0G  92% /var
/dev/mapper/vg0-lvtmp ext4        922M  1.2M  857M   1% /tmp
systemd               cgroup         0     0     0    - /sys/fs/cgroup/systemd

As you can see, /var is at 92% full.

With lsof we checked for open yet deleted files:

root@server:~# lsof +L1
COMMAND    PID     USER   FD   TYPE DEVICE   SIZE/OFF NLINK   NODE NAME
init         1     root   11w   REG  252,1        309     0     79 /var/log/upstart/systemd-logind.log.1 (deleted)
filebeat 30336     root    3r   REG  252,1 1538307897     0    326 /var/log/haproxy.log.1 (deleted)
filebeat 30336     root    4r   REG  252,1 1474951702     0   2809 /var/log/haproxy.log.1 (deleted)
filebeat 30336     root    6r   REG  252,1 1513061121     0   1907 /var/log/haproxy.log.1 (deleted)
filebeat 30336     root    7r   REG  252,1 1566966965     0     72 /var/log/haproxy.log.1 (deleted)
filebeat 30336     root    8r   REG  252,1 1830485663     0   2558 /var/log/haproxy.log.1 (deleted)
filebeat 30336     root    9r   REG  252,1 1426600050     0    163 /var/log/haproxy.log.1 (deleted)
nginx    31673 www-data   64u   REG  252,1     204800     0   2978 /var/lib/nginx/proxy/9/53/0008826539 (deleted)
nginx    31674 www-data  154u   REG  252,1     204800     0 131334 /var/lib/nginx/proxy/2/95/0008825952 (deleted)
nginx    31676 www-data  111u   REG  252,1     241664     0   4064 /var/lib/nginx/proxy/0/54/0008826540 (deleted)

There it is: Filebeat still hanging on to (meanwhile rotated) HAProxy log files, which are quite big as you can see in the SIZE/OFF column.

To release the inodes, Filebeat can either be restarted or force-reloaded.

root@server:~# /etc/init.d/filebeat force-reload
 * Restarting Filebeat sends log files to Logstash or directly to Elasticsearch. filebeat
2018/05/23 13:39:20.004534 beat.go:297: INFO Home path: [/usr/share/filebeat] Config path: [/etc/filebeat] Data path: [/var/lib/filebeat] Logs path: [/var/log/filebeat]
2018/05/23 13:39:20.004580 beat.go:192: INFO Setup Beat: filebeat; Version: 5.6.9
2018/05/23 13:39:20.004623 logstash.go:91: INFO Max Retries set to: 3
2018/05/23 13:39:20.004663 outputs.go:108: INFO Activated logstash as output plugin.
2018/05/23 13:39:20.004681 metrics.go:23: INFO Metrics logging every 30s
2018/05/23 13:39:20.004727 publish.go:300: INFO Publisher name: server
2018/05/23 13:39:20.004850 async.go:63: INFO Flush Interval set to: 1s
2018/05/23 13:39:20.006018 async.go:64: INFO Max Bulk Size set to: 2048
Config OK

Verification with lsof again:

root@server:~# lsof +L1
COMMAND   PID     USER   FD   TYPE DEVICE SIZE/OFF NLINK   NODE NAME
init        1     root   11w   REG  252,1      309     0     79 /var/log/upstart/systemd-logind.log.1 (deleted)
nginx   31674 www-data  154u   REG  252,1   204800     0 131334 /var/lib/nginx/proxy/2/95/0008825952 (deleted)

Looks better! What about the file system size?

root@server:~# df -h
Filesystem            Type        Size  Used Avail Use% Mounted on
sysfs                 sysfs          0     0     0    - /sys
proc                  proc           0     0     0    - /proc
udev                  devtmpfs    2.0G   12K  2.0G   1% /dev
devpts                devpts         0     0     0    - /dev/pts
tmpfs                 tmpfs       396M  736K  395M   1% /run
/dev/sda1             ext4        4.5G  2.0G  2.3G  48% /
none                  tmpfs       4.0K     0  4.0K   0% /sys/fs/cgroup
none                  fusectl        0     0     0    - /sys/fs/fuse/connections
none                  debugfs        0     0     0    - /sys/kernel/debug
none                  securityfs     0     0     0    - /sys/kernel/security
none                  tmpfs       5.0M     0  5.0M   0% /run/lock
none                  tmpfs       2.0G     0  2.0G   0% /run/shm
none                  tmpfs       100M     0  100M   0% /run/user
none                  pstore         0     0     0    - /sys/fs/pstore
/dev/mapper/vg0-lvvar ext4         37G   24G   12G  68% /var
/dev/mapper/vg0-lvtmp ext4        922M  1.2M  857M   1% /tmp
systemd               cgroup         0     0     0    - /sys/fs/cgroup/systemd

The /var partition is now back to 68% used. Good!

But how does one prevent this from happening again? In the logrotate config for HAProxy (/etc/logrotate.d/haproxy) I added a postrotate command to reload Filebeat:

root@server:~# cat /etc/logrotate.d/haproxy
/var/log/haproxy.log {
    daily
    rotate 52
    missingok
    notifempty
    compress
    delaycompress
    postrotate
        invoke-rc.d rsyslog rotate >/dev/null 2>&1 || true
        service filebeat force-reload >/dev/null 2>&1
    endscript
}

For the last couple of weeks I've been watching this on that particular server, and it turns out this definitely solved the problem of Filebeat hanging on to inodes that are no longer in use. The following graph shows the disk usage of /var:

Filebeat still hanging on to logfiles after rotation 

The red circles show when I manually forced a reload of Filebeat.
The blue circle notes the day when I added the "service filebeat force-reload >/dev/null 2>&1" line into postrotate in the logrotate file - and when it first executed (note the significant fall compared to the days before).

I also had to add the reload line to the logrotate config of Nginx, as I'm using Filebeat to go through HAProxy and Nginx logs.
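
The change in /etc/logrotate.d/nginx is analogous; the force-reload line is simply added to the existing postrotate block (a sketch - keep whatever commands your distribution already runs there):

    postrotate
        invoke-rc.d nginx rotate >/dev/null 2>&1
        service filebeat force-reload >/dev/null 2>&1
    endscript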

Note: This happened under both Ubuntu 14.04 and 16.04 with Filebeat 5.6.9. Meanwhile Filebeat 6.x is out; maybe it fixes the rotated log file issue, but I haven't had the time to upgrade yet.

 

Network config in Ubuntu 18.04 Bionic is now elsewhere and handled by netplan
Friday - Jun 8th 2018

While creating a new template for Ubuntu 18.04, I stumbled across something I didn't expect.

At the end of the template creation process, I usually comment out the used IP addresses (to avoid human error when someone powers up the template with the network card enabled). So I wanted to edit /etc/network/interfaces, as always. But:

root@ubuntu1804:~# cat /etc/network/interfaces
root@ubuntu1804:~#

Umm... the config file is empty. Huh? What's going on?

But during the setup I defined a static IP (10.10.4.44)?

root@ubuntu1804:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:8d:5c:45 brd ff:ff:ff:ff:ff:ff
    inet 10.10.4.44/24 brd 194.40.217.255 scope global ens192
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:fe8d:5c45/64 scope link
       valid_lft forever preferred_lft forever

Yep, the static IP is configured. Where the hell is this stored now?

root@ubuntu1804:~# grep -rni 10.10.4.44 /etc/*
/etc/hosts:2:10.10.4.44    ubuntu1804.nzzmg.local    ubuntu1804
/etc/netplan/01-netcfg.yaml:8:      addresses: [ 10.10.4.44/24 ]

/etc/netplan? Never heard of it... 

Let's check out this config file (which is written in YAML format):

root@ubuntu1804:~# cat /etc/netplan/01-netcfg.yaml
# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    ens192:
      addresses: [ 10.10.4.44/24 ]
      gateway4: 10.10.4.1
      nameservers:
          search: [ example.com ]
          addresses:
              - "1.1.1.1"
              - "8.8.8.8"

OK, this doesn't look that complicated and is quite understandable. Also important here is the "renderer"; according to the documentation, this refers to the network daemon reading the netplan config file:

"Netplan reads network configuration from /etc/netplan/*.yaml which are written by administrators, installers, cloud image instantiations, or other OS deployments. During early boot, Netplan generates backend specific configuration files in /run to hand off control of devices to a particular networking daemon."

The Ubuntu default is to hand off control to networkd, which is part of the SystemD jungle:

root@ubuntu1804:~# ps auxf|grep networkd
systemd+   750  0.0  0.1 71816  5244 ?        Ss   13:27   0:00 /lib/systemd/systemd-networkd
root       930  0.0  0.4 170424 17128 ?        Ssl  13:27   0:00 /usr/bin/python3 /usr/bin/networkd-dispatcher

So if I change the IP address to something else, how does one apply the new IP and will it be immediate?

root@ubuntu1804:~# ip address show ens192
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:8d:5d:37 brd ff:ff:ff:ff:ff:ff
    inet 10.10.4.44/25 brd 10.150.2.127 scope global ens192
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:fe8d:5d37/64 scope link
       valid_lft forever preferred_lft forever

root@ubuntu1804:~# sed -i "s/10.10.4.44/10.150.2.98/g" /etc/netplan/01-netcfg.yaml

root@ubuntu1804:~# netplan apply

# [ Note: Here my SSH session got disconnected - so yes, IP change was immediate ]

admck@ubuntu1804:~$ ip address show ens192
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:8d:5d:37 brd ff:ff:ff:ff:ff:ff
    inet 10.150.2.98/25 brd 10.150.2.127 scope global ens192
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:fe8d:5d37/64 scope link
       valid_lft forever preferred_lft forever

Right after "netplan apply", the IP was changed. I lost my SSH connection (of course) and could log back in using the new IP. So yes, IP change was immediate.

I just wonder how often I will still use "vi /etc/network/interfaces" as this has been stuck in my head for more than 10 years. Old habits die hard.

 

Rancher 1.6: Github user not able to login (Internal Server Error)
Thursday - May 31st 2018

On our Rancher 1.6 environment one particular developer was unable to log in to Rancher using his GitHub account. Whenever he tried, he got an "Internal Server Error".

Rancher Login Internal Server Error 

But all other users were able to log in, so I first suspected a problem with his GitHub account (e.g. OAuth disabled or similar). While researching the problem we came across an issue in Rancher's GitHub repository describing similar problems:

Some of the users from the organization added in the environment can log in with no problem, but some get Internal Server Error

The issue was closed by the OP himself with the reason:

It was because of mysql being in latin and not utf8

Could this be our problem, too?

I verified and indeed the database and all its tables were set to latin1_swedish_ci (the "old default" of MySQL). The Rancher 1.6 install documentation clearly states to create the database with utf8 though:

Instead of using the internal database that comes with Rancher server, you can start Rancher server pointing to an external database. The command would be the same, but appending in additional arguments to direct how to connect to your external database.

Here is an example of a SQL command to create a database and users.

> CREATE DATABASE IF NOT EXISTS cattle COLLATE = 'utf8_general_ci' CHARACTER SET = 'utf8';
> GRANT ALL ON cattle.* TO 'cattle'@'%' IDENTIFIED BY 'cattle';
> GRANT ALL ON cattle.* TO 'cattle'@'localhost' IDENTIFIED BY 'cattle';

Why wasn't our database created with utf8? Turns out we're using AWS RDS for this database, and there is no option to set the collation and character set there:

RDS Aurora DB Options 

But can the failed login really be blamed on the database's character set? I entered the Rancher container and checked the logs:

root@rancher01:/# docker exec -it 1d0ec49acc20 bash

root@1d0ec49acc20:/# cd /var/lib/cattle/logs

root@1d0ec49acc20:/var/lib/cattle/logs# cat cattle-debug.log|grep -i github
[...]
Query is: insert into `account` (`name`, `kind`, `uuid`, `state`, `created`, `data`, `external_id`, `external_id_type`, `health_state`, `version`) values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?), parameters ['Paweł Lewandowski','user','de7b9940-e699-422f-8586-5a964adec35a','requested','2018-05-30 10:10:27.511','{}','6796097','github_user','healthy','2']
[...]

Yes, there is indeed a special character, "L with stroke", in the first name. This character is not part of the latin1_swedish_ci character set. The failed login can therefore really be blamed on the wrong database/table character sets.
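
The fix would therefore be to convert the database and the affected tables to utf8, roughly along these lines (a sketch - test on a copy first, as converting rewrites the tables; "account" is just the table from the log snippet above):

> ALTER DATABASE cattle CHARACTER SET utf8 COLLATE utf8_general_ci;
> ALTER TABLE account CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;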

 

