ELK stack not sending notifications anymore because of DNS cache

Written by - 0 comments

Published on - Listed in ELK Java DNS

In our primary ELK stack we enabled XPack to send notifications to a Slack channel:

# Slack config for data team
      url: https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXX
        from: watcher

But one day these notifications suddenly stopped. As you can see from the config, the xpack.notification is supposed to connect to https://hooks.slack.com.  

When we checked the firewal logs we saw that ElasticSearch always connected to the same IP address, yet our DNS resolution check pointed towards another (new) IP address. Means: Slack has changed the public IP for hooks.slack.com. But ElasticSearch, which uses the local Java settings, wasn't aware of that change. This is because, by default, DNS is cached forever in the JVM (see DNS cache settings).

To change this, I checked $JAVA_HOME/jre/lib/security/java.security and the defaults were the following:

# The Java-level namelookup cache policy for successful lookups:
# any negative value: caching forever
# any positive value: the number of seconds to cache an address for
# zero: do not cache
# default value is forever (FOREVER). For security reasons, this
# caching is made forever when a security manager is set. When a security
# manager is not set, the default behavior in this implementation
# is to cache for 30 seconds.
# NOTE: setting this to anything other than the default value can have
#       serious security implications. Do not set it unless
#       you are sure you are not exposed to DNS spoofing attack.

# The Java-level namelookup cache policy for failed lookups:
# any negative value: cache forever
# any positive value: the number of seconds to cache negative lookup results
# zero: do not cache
# In some Microsoft Windows networking environments that employ
# the WINS name service in addition to DNS, name service lookups
# that fail may take a noticeably long time to return (approx. 5 seconds).
# For this reason the default caching policy is to maintain these
# results for 10 seconds.

And changed it to use an internal DNS cache of 5 minutes (300s) but failed resolutions should not be cached at all:

# grep ttl $JAVA_HOME/jre/lib/security/java.security

After this a restart of ElasticSearch was needed.

Add a comment

Show form to leave a comment

Comments (newest first)

No comments yet.