For the last couple of weeks, I have been chasing a Logstash issue where thousands of JSON parsing errors were logged after the whole ELK stack was upgraded to 7.15.
This caused the /var partition to be filled up within a very short time:
Even after clearing (emptying) /var/log/syslog, it took less than 30 minutes until more than 30GB of errors were written, filling up the partition again.
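To keep an eye on how fast the partition and the log file were growing again, a simple loop like the following can be used (just a sketch, the interval is arbitrary):

# Show /var usage and the syslog file size every 60 seconds
root@logstash:~# watch -n 60 "df -h /var; ls -lh /var/log/syslog"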
The local syslog log file (/var/log/syslog) contained thousands of errors, indicating JSON parsing errors from Logstash:
Oct 20 11:17:46 log01 logstash[3628473]: [2021-10-20T11:17:46,964][ERROR][logstash.codecs.json ][main][6becfe14a36e9ff58e20a97f0820ea946e3122df132f9ddd7c5d61b99c609832] JSON parse error, original data now in message field {:message=>"Unexpected end-of-input: was expecting closing quote for a string value\n at [Source: (String)\"{\"@timestamp\":\"2021-10-20T11:17:46.960783+02:00\",\"@version\":\"1\",\"message\":\" [2021-10-20T11:17:46,960][ERROR][logstash.codecs.json ][main][6becfe14a36e9ff58e20a97f0820ea946e3122df132f9ddd7c5d61b99c609832] JSON parse error, original data now in message field {:message=>\\\"Unexpected end-of-input in character escape sequence\\\\n at [Source: (String)\\\\\\\"{\\\\\\\"@timestamp\\\\\\\":\\\\\\\"2021-10-20T11:17:46.956975+02:00\\\\\\\",\\\\\\\"@version\\\\\\\":\\\\\\\"1\\\\\\\",\\\\\\\"message\\\\\\\":\\\\\\\" [2021-10-20T11:17:46,955][ERROR][logs\"[truncated 7596 chars]; line: 1, column: 16193]", :exception=>LogStash::Json::ParserError, :data=>"{\"@timestamp\":\"2021-10-20T11:17:46.960783+02:00\",\"@version\":\"1\",\"message\":\" [2021-10-20T11:17:46,960][ERROR][logstash.codecs.json ][main][6becfe14a36e9ff58e20a97f0820ea946e3122df132f9ddd7c5d61b99c609832] JSON parse error, original data now in message field {:message=>\\\"Unexpected end-of-input in character escape sequence\\\\n at [Source: (String)\\\\\\\"{\\\\\\\"@timestamp\\\\\\\":\\\\\\\"2021-10-20T11:17:46.956975+02:00\\\\\\\",\\\\\\\"@version\\\\\\\":\\\\\\\"1\\\\\\\",\\\\\\\"message\\\\\\\":\\\\\\\" [2021-10-20T11:17:46,955][ERROR][logstash.codecs.json ][main][6becfe14a36e9ff58e20a97f0820ea946e3122df132f9ddd7c5d61b99c609832] JSON parse error, original data now in message field {:message=>\\\\\\\\\\\\\\\"Unexpected end-of-input: was expecting closing quote for a string value\\\\\\\\\\\\\\\\n at [Source: (String)
[...]
Help was requested on the public Elastic discussion forum (find source of events causing Logstash errors), with the goal of identifying which application or program was causing these errors - but nobody answered. An additional support ticket with Elastic support was created as well.
After having tried lots of different configuration options, including disabling the "codec => json" setting on the Logstash input or disabling JSON parsing altogether (which worked, but resulted in our logs no longer being split into the needed fields), Logstash was finally downgraded to 7.12.1.
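For context, the relevant input looks roughly like this (a minimal sketch; the input type and port are assumptions, the important part is the JSON codec):

input {
  tcp {
    # hypothetical listener port - the relevant setting is the codec,
    # which parses every incoming event as JSON
    port => 5000
    codec => json
  }
}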
And what a change this was! Since Logstash was downgraded from 7.15.2 to 7.12.1, the massive error logging stopped, and the disk usage on the /var partition remained stable (within the expected growth):
But not only did the disk usage remain very steady, with only a few occasional Logstash errors being logged, we also saw a sharp decrease in CPU and network usage (towards Elasticsearch):
The graphs all look very good - but are there any downsides? So far we have discovered no downsides at all. We are still getting all the logs (= we are not missing any logs), the events are correctly parsed as JSON and split into fields, and indexing in Elasticsearch works as it should. Yet there are no more mass errors being logged and performance is better by a large factor.
This leads to the conclusion that there must be a bug in Logstash itself, introduced somewhere between 7.12.1 and 7.15.x (we have seen this mass error happening on Logstash 7.15.0 through 7.15.2). What exactly is not yet known; we hope Elastic support might find the cause of it. For now we will stick with the 7.12 version.
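To make sure the hosts stay on 7.12 until the cause is found, the package can be pinned (a sketch for a Debian/Ubuntu based installation; the exact deb version string is an assumption):

root@logstash:~# apt-get install logstash=1:7.12.1-1
root@logstash:~# apt-mark hold logstash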
Updated December 8th, 2021
Further analysis and discussion with Elastic support indeed showed a different logging behaviour between Logstash 7.12 and 7.15. Logstash's JSON codec changed the logging event in case of parsing errors:
An initial JSON parsing error (1) is therefore logged by Logstash to syslog and written into /var/log/syslog (2). One particular configuration in our environment is that we use rsyslogd (3) to format syslog events as JSON and then forward them to - you guessed it - Logstash (4). Just by looking at the logged error message (which contains partial JSON data, too), it doesn't take a JSON genius to guess that Logstash will AGAIN throw a JSON parsing error when the forwarded syslog entry reaches the Logstash input (5).
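For illustration, the rsyslog side of this setup looks roughly like the following (a simplified sketch; the file name, template contents and forwarding target are assumptions, not our exact configuration):

# /etc/rsyslog.d/60-logstash.conf (hypothetical) - wrap each syslog event
# into a JSON document and forward it to the Logstash input
template(name="ls_json" type="list") {
  constant(value="{\"@timestamp\":\"")
  property(name="timereported" dateFormat="rfc3339")
  constant(value="\",\"@version\":\"1\",\"sysloghost\":\"")
  property(name="hostname" format="json")
  constant(value="\",\"programname\":\"")
  property(name="programname" format="json")
  constant(value="\",\"message\":\"")
  property(name="msg" format="json")
  constant(value="\"}\n")
}

action(type="omfwd" target="logstash.example.com" port="5000" protocol="tcp" template="ls_json")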
Yet again, another error is logged in syslog, which is then again picked up by Logstash, which creates another error, which is logged again, and so on. A classic case of a loop. The following drawing should explain this visually:
So right now we are working on a way to prevent Logstash from writing into syslog at all and have it keep its own dedicated log file instead. This would prevent the loop from being triggered, and only the initial JSON parse error would be logged (which would be fine).
Another alternative workaround would be to drop the Logstash syslog lines as soon as they arrive in the LS input:
filter {
  # Drop Logstash's own log lines (forwarded via syslog) as soon as they arrive,
  # so they never reach the JSON parsing stage again
  if "log01" in [sysloghost] and "logstash" in [programname] {
    drop { }
  }
}
But of course this costs performance, too.
Updated January 3rd, 2022
As described in the previous update, the suspected reason was that errors were logged to syslog and then picked up by Logstash again, hence creating a loop. To disable logging to syslog, the following log4j setting needs to be disabled (commented):
root@logstash:~# grep "console.ref" /etc/logstash/log4j2.properties
#rootLogger.appenderRef.console.ref = ${sys:ls.log.format}_console # Disabled logging to syslog
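The change only takes effect once the service is restarted (assuming Logstash runs as a systemd service):

root@logstash:~# systemctl restart logstash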
After another Logstash restart, all Logstash related logs (including errors) are now logged in a dedicated log file under /var/log/logstash/logstash-plain.log.
This did the trick! Syslog doesn't pick up any Logstash logs anymore and the JSON parse errors are gone. This can be nicely seen in the disk usage of the /var partition:
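To verify, the dedicated log file can simply be followed while syslog stays quiet:

root@logstash:~# tail -f /var/log/logstash/logstash-plain.log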