systemd-journald stopped logging on CentOS 7 (Failed to write entry)

Written by - 0 comments

Published on March 17th 2016 - Listed in Linux


On a production server running CentOS 7 I stumbled on a (new?) systemd problem that systemd's journald, which is responsible for system logging (and kind of replacing syslog...). I tried to analyze what happened on the system yesterday and couldn't find any logs. Neither in /var/log, nor anything in journalctl. 

The last logged entries according to journalctl were from February 18th, almost a month ago:

root@centos log]# journalctl -f
-- Logs begin at Mon 2016-02-08 15:33:13 CET. --
Feb 18 15:15:23 centos.local sudo[13290]:  ansible : TTY=pts/4 ; PWD=/home/ansible ; USER=root ; COMMAND=/bin/sh -c echo BECOME-SUCCESS-nvxhcickrkxejyyxiyxddxidvnexliak; LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/python /home/ansible/.ansible/tmp/ansible-tmp-1455804928.67-234040995840251/user; rm -rf /home/ansible/.ansible/tmp/ansible-tmp-1455804928.67-234040995840251/ >/dev/null 2>&1
Feb 18 15:15:23 centos.local ansible-user[13292]: Invoked with comment=Application user User Test ssh_key_bits=2048 update_password=always non_unique=False force=False ssh_key_type=rsa ssh_key_passphrase=None home=None append=False uid=984 ssh_key_comment=ansible-generated on centos.local group=testuser system=False state=present shell=/bin/bash expires=None ssh_key_file=None groups=None move_home=False password=NOT_LOGGING_PASSWORD name=testuser createhome=True remove=False login_class=None generate_ssh_key=None
Feb 18 15:15:23 centos.local useradd[13295]: new user: name=testuser, UID=984, GID=984, home=/home/testuser, shell=/bin/bash
Feb 18 15:15:24 centos.local unknown[13301]: [audit ansible/11589 as ansible/13214 on pts/4/10.162.214.210:39712->10.162.215.53:22]: #=== bash session ended. ===
Feb 18 15:15:24 centos.local unknown[13370]: [audit ansible/11589 as ansible/13302 on pts/4/10.162.214.210:39712->10.162.215.53:22]: #=== New bash session started. ===
Feb 18 15:15:24 centos.local unknown[13379]: [audit ansible/11589 as ansible/13302 on pts/4/10.162.214.210:39712->10.162.215.53:22]: #=== bash session ended. ===
Feb 18 15:15:24 centos.local unknown[13448]: [audit / as ansible/13380 on /10.162.214.210:39712->10.162.215.53:22]: #=== New bash session started. ===
Feb 18 15:15:24 centos.local unknown[13455]: [audit / as ansible/13380 on /10.162.214.210:39712->10.162.215.53:22]: #=== bash session ended. ===
Feb 18 15:15:24 centos.local unknown[13524]: [audit ansible/11589 as ansible/13456 on pts/4/10.162.214.210:39712->10.162.215.53:22]: #=== New bash session started. ===
Feb 18 15:15:24 centos.local sudo[13532]:  ansible : TTY=pts/4 ; PWD=/home/ansible ; USER=root ; COMMAND=/bin/sh -c echo BECOME-SUCCESS-ixocfinomkjmwqhxmdecoqaehteqxyed; LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/python /home/ansible/.ansible/tmp/ansible-tmp-1455804928.97-213542358801429/authorized_key; rm -rf /home/ansible/.ansible/tmp/ansible-tmp-1455804928.97-213542358801429/ >/dev/null 2>&1

After February 18th 15:15:24 no new entries were logged anymore. I mean in general no more logs; no logins, no kernel messages, no mail logs, nothing.
I checked if the system time was correct (it is) and also looked at the current uptime:

[root@centos log]# date
Thu Mar 17 08:42:55 CET 2016

[root@centos log]# uptime
 08:43:48 up 37 days, 17:09,  2 users,  load average: 0.00, 0.01, 0.05

The uptime is older than when the problem arose, so a reboot is definitely not the cause.

I suspected that maybe the systemd-journald process crashed. This would explain the immediate stop of logging. However systemctl showed that the "systemd-journald.service" is loaded and actively running:

[root@centos log]# systemctl list-units | egrep "(log|journal)"
rhel-dmesg.service                          loaded active exited    Dump dmesg to /var/log/dmesg
rsyslog.service                             loaded active running   System Logging Service
systemd-journal-flush.service               loaded active exited    Flush Journal to Persistent Storage
systemd-journald.service                    loaded active running   Journal Service
systemd-logind.service                      loaded active running   Login Service
systemd-journald.socket                     loaded active running   Journal Socket

A status check on systemd-journald.service shows that the service is up for 1 month and 7 days (and running):

[root@centos log]# systemctl status systemd-journald.service
● systemd-journald.service - Journal Service
   Loaded: loaded (/usr/lib/systemd/system/systemd-journald.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2016-02-08 15:33:17 CET; 1 months 7 days ago
     Docs: man:systemd-journald.service(8)
           man:journald.conf(5)
 Main PID: 361 (systemd-journal)
   Status: "Processing requests..."
   CGroup: /system.slice/systemd-journald.service
           └─361 /usr/lib/systemd/systemd-journald

Feb 08 15:33:17 centos.local systemd-journal[361]: Runtime journal is using 8.0M (max allowed 189.5M, trying to leave 284.3M free of 1.8G available → current limit 189.5M).
Feb 08 15:33:17 centos.local systemd-journal[361]: Runtime journal is using 8.0M (max allowed 189.5M, trying to leave 284.3M free of 1.8G available → current limit 189.5M).
Feb 08 15:33:17 centos.local systemd-journal[361]: Journal started
Feb 08 15:33:19 centos.local systemd-journal[361]: Runtime journal is using 8.0M (max allowed 189.5M, trying to leave 284.3M free of 1.8G available → current limit 189.5M).

 I checked dmesg to maybe catch a segfault or something similar but I caught this:

[root@centos log]# dmesg | tail
[3258033.227855] systemd-journald[361]: Failed to write entry (13 items, 318 bytes), ignoring: Cannot assign requested address
[3258089.234540] systemd-journald[361]: Failed to write entry (23 items, 605 bytes), ignoring: Cannot assign requested address
[3258089.236248] systemd-journald[361]: Failed to write entry (24 items, 636 bytes), ignoring: Cannot assign requested address
[3258089.238195] systemd-journald[361]: Failed to write entry (22 items, 618 bytes), ignoring: Cannot assign requested address
[3258093.202606] systemd-journald[361]: Failed to write entry (20 items, 481 bytes), ignoring: Cannot assign requested address
[3258093.209852] systemd-journald[361]: Failed to write entry (20 items, 517 bytes), ignoring: Cannot assign requested address
[3258093.212118] systemd-journald[361]: Failed to write entry (16 items, 370 bytes), ignoring: Cannot assign requested address
[3258095.090851] systemd-journald[361]: Failed to write entry (11 items, 295 bytes), ignoring: Cannot assign requested address
[3258105.616016] systemd-journald[361]: Failed to write entry (12 items, 314 bytes), ignoring: Cannot assign requested address
[3258105.618950] systemd-journald[361]: Failed to write entry (16 items, 395 bytes), ignoring: Cannot assign requested address

What the hell?! There is definitely something bad happening with systemd-journald! By reading the text I thought maybe the disk is full. But then my monitoring would have alerted me and a verification revealed all OK on /var:

root@centos log]# df -h | grep var
/dev/mapper/vgsystem-lvvar     ext4         3.9G  210M  3.4G   6% /var

So the problem is somewhere else...
Of course I tried to restart the service with "systemctl restart systemd-journald.service" - but nothing happened.
I then came across a forum post at linuxquestions.org which mentioned to run verify. And that's what I did:

[root@centos log]# journalctl --verify
PASS: /run/log/journal/5050c3bd03844f988c3383ff7e5a4b52/system.journal
454530: unused data (entry_offset==0)
454598: invalid object
File corruption detected at /run/log/journal/5050c3bd03844f988c3383ff7e5a4b52/system@11d03d9ffaa240bbb2cdbdee87d322bb-000000000001442e-00052bff5257e30c.journal:454598 (of 8388608 bytes, 54%).

FAIL: /run/log/journal/5050c3bd03844f988c3383ff7e5a4b52/system@11d03d9ffaa240bbb2cdbdee87d322bb-000000000001442e-00052bff5257e30c.journal (Cannot assign requested address)
PASS: /run/log/journal/5050c3bd03844f988c3383ff7e5a4b52/system@11d03d9ffaa240bbb2cdbdee87d322bb-000000000001242d-00052bdfb8cb8d83.journal
PASS: /run/log/journal/5050c3bd03844f988c3383ff7e5a4b52/system@11d03d9ffaa240bbb2cdbdee87d322bb-0000000000010460-00052bbf48b22ed6.journal
PASS: /run/log/journal/5050c3bd03844f988c3383ff7e5a4b52/system@11d03d9ffaa240bbb2cdbdee87d322bb-000000000000d417-00052ba467e8f90a.journal
PASS: /run/log/journal/5050c3bd03844f988c3383ff7e5a4b52/system@11d03d9ffaa240bbb2cdbdee87d322bb-0000000000009faf-00052b8afeaa9672.journal
PASS: /run/log/journal/5050c3bd03844f988c3383ff7e5a4b52/system@11d03d9ffaa240bbb2cdbdee87d322bb-0000000000006b98-00052b7171ab2d26.journal
PASS: /run/log/journal/5050c3bd03844f988c3383ff7e5a4b52/system@11d03d9ffaa240bbb2cdbdee87d322bb-0000000000003705-00052b5868f72dc3.journal
PASS: /run/log/journal/5050c3bd03844f988c3383ff7e5a4b52/system@11d03d9ffaa240bbb2cdbdee87d322bb-0000000000000001-00052b4314b1902b.journal

 The lines appearing in red show quite obviously a file corruption.
After I ran the verify command, logging started working again:

[root@centos log]# journalctl -f
-- Logs begin at Mon 2016-02-08 15:33:13 CET. --
Mar 17 09:01:05 centos.local sshd[1643]: Set /proc/self/oom_score_adj to 0
Mar 17 09:01:05 centos.local sshd[1643]: Connection from 10.162.215.204 port 51034 on 10.162.215.53 port 22
Mar 17 09:01:05 centos.local sshd[1643]: Connection closed by 10.162.215.204 [preauth]
Mar 17 09:02:01 centos.local systemd[1]: Started Session 61871 of user root.
Mar 17 09:02:01 centos.local systemd[1]: Starting Session 61871 of user root.
Mar 17 09:02:01 centos.local CROND[1724]: (root) CMD (/app/scripts/idssrvjobs/idsjobindex)
Mar 17 09:02:05 centos.local sshd[1743]: Set /proc/self/oom_score_adj to 0
Mar 17 09:02:05 centos.local sshd[1743]: Connection from 10.162.215.204 port 51202 on 10.162.215.53 port 22
Mar 17 09:02:05 centos.local sshd[1743]: Connection closed by 10.162.215.204 [preauth]

Another run of --verify still showed the same corrupted journal file, so I deleted it and then ran verify again:

[root@centos log]# rm "/run/log/journal/5050c3bd03844f988c3383ff7e5a4b52/system@11d03d9ffaa240bbb2cdbdee87d322bb-000000000001442e-00052bff5257e30c.journal"
rm: remove regular file ‘/run/log/journal/5050c3bd03844f988c3383ff7e5a4b52/system@11d03d9ffaa240bbb2cdbdee87d322bb-000000000001442e-00052bff5257e30c.journal’? y

[root@centos log]# journalctl --verify
PASS: /run/log/journal/5050c3bd03844f988c3383ff7e5a4b52/system.journal
PASS: /run/log/journal/5050c3bd03844f988c3383ff7e5a4b52/system@11d03d9ffaa240bbb2cdbdee87d322bb-000000000001242d-00052bdfb8cb8d83.journal
PASS: /run/log/journal/5050c3bd03844f988c3383ff7e5a4b52/system@11d03d9ffaa240bbb2cdbdee87d322bb-0000000000010460-00052bbf48b22ed6.journal
PASS: /run/log/journal/5050c3bd03844f988c3383ff7e5a4b52/system@11d03d9ffaa240bbb2cdbdee87d322bb-000000000000d417-00052ba467e8f90a.journal
PASS: /run/log/journal/5050c3bd03844f988c3383ff7e5a4b52/system@11d03d9ffaa240bbb2cdbdee87d322bb-0000000000009faf-00052b8afeaa9672.journal
PASS: /run/log/journal/5050c3bd03844f988c3383ff7e5a4b52/system@11d03d9ffaa240bbb2cdbdee87d322bb-0000000000006b98-00052b7171ab2d26.journal
PASS: /run/log/journal/5050c3bd03844f988c3383ff7e5a4b52/system@11d03d9ffaa240bbb2cdbdee87d322bb-0000000000003705-00052b5868f72dc3.journal
PASS: /run/log/journal/5050c3bd03844f988c3383ff7e5a4b52/system@11d03d9ffaa240bbb2cdbdee87d322bb-0000000000000001-00052b4314b1902b.journal

This time all checks passed.

For over 10 years I never had to worry that Linux stops logging. And even if that happened, the syslog process could be monitored.
Here systemd told me that systemd-journald is running correctly although it clearly had a problem. As this is not my first systemd problem in a production environment, I avoid or ditch systemd wherever possible. On Debian and Ubuntu I replaced systemd by the stable SysV init system.  Unfortunately this system is a CentOS and systemd is very much bundled fix into the system. I just hope this was a one time problem, but I doubt that.

Update September 15th 2016:
More and more distributions have switched to systemd. I just experienced the same problem on a openSUSE 13.2 machine:

# journalctl --verify
Invalid tail monotonic timestamp██████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  49%
File corruption detected at /var/log/journal/74fc18841ff54a7f8e09f6b58d4bad53/user-1002@e2b1911792f44ec2a5af6eef8cc2fdf1-00000000000cecf4-00053876db13a8e9.journal:000000 (of 8388608 bytes, 0%).
FAIL: /var/log/journal/74fc18841ff54a7f8e09f6b58d4bad53/user-1002@e2b1911792f44ec2a5af6eef8cc2fdf1-00000000000cecf4-00053876db13a8e9.journal (Bad message)
Invalid tail monotonic timestamp██████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  49%
File corruption detected at /var/log/journal/74fc18841ff54a7f8e09f6b58d4bad53/user-900@46649045ac2d45bc8543ab113c770566-0000000000064f0c-00052e4c2c98616c.journal:000000 (of 8388608 bytes, 0%).
FAIL: /var/log/journal/74fc18841ff54a7f8e09f6b58d4bad53/user-900@46649045ac2d45bc8543ab113c770566-0000000000064f0c-00052e4c2c98616c.journal (Bad message)
Invalid tail monotonic timestamp██████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  49%
File corruption detected at /var/log/journal/74fc18841ff54a7f8e09f6b58d4bad53/user-1000@b78fe5fd971048d6a4abf38182ad32ec-00000000000052c7-000522238f831f95.journal:000000 (of 8388608 bytes, 0%).
FAIL: /var/log/journal/74fc18841ff54a7f8e09f6b58d4bad53/user-1000@b78fe5fd971048d6a4abf38182ad32ec-00000000000052c7-000522238f831f95.journal (Bad message)
Invalid tail monotonic timestamp██████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  49%
File corruption detected at /var/log/journal/74fc18841ff54a7f8e09f6b58d4bad53/user-902@d37e827bc73b42bba5e732f37a4c6bdd-00000000000c5b03-000537962000a79a.journal:000000 (of 8388608 bytes, 0%).
FAIL: /var/log/journal/74fc18841ff54a7f8e09f6b58d4bad53/user-902@d37e827bc73b42bba5e732f37a4c6bdd-00000000000c5b03-000537962000a79a.journal (Bad message)
 

So this is a general systemd-journald problem, not just one on CentOS 7.


Add a comment

Show form to leave a comment

Comments (newest first)

No comments yet.