As you can see on my LinkedIn profile, I'm working for one of the leading news corporations in Switzerland. For a news portal, you can imagine how important the domain is. And what if this domain is suddenly deleted? This is exactly what happened last Friday, April 15th 2016. But let's start at the beginning.
At 2.14pm our satellite monitoring running in the AWS cloud (which gives us an outside view of our web services) reported a failed HTTP check of our main domain. And not just an HTTP 500 or something like that - it was the following error:
Name or service not known HTTP CRITICAL - Unable to open TCP socket
When I got this alert (by both e-mail and SMS) I immediately knew something was off. "Name or service not known" means that the domain name could not be resolved. At first I suspected a resolving problem on the AWS DNS servers. I logged onto the satellite server and confirmed the DNS resolution failure - only to find that other domains resolved without a hiccup. What the hell...?!
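To illustrate the difference between "the name does not resolve" and "the web server is broken", here is a minimal Python sketch. It is not the monitoring plugin we actually run; the function name `classify_check` and the injectable `resolver` parameter are made up for this example. It only relies on `socket.getaddrinfo`, which raises `socket.gaierror` when a name cannot be resolved.

```python
import socket


def classify_check(hostname, port=80, resolver=socket.getaddrinfo):
    """Tell a DNS resolution failure apart from everything that comes after it.

    Returns a (status, detail) tuple:
      ("dns_failure", <resolver error text>) if the name cannot be resolved,
      ("resolved", "") if the resolver returned at least an answer.
    The resolver argument is only there so the lookup can be replaced
    (hypothetical helper, e.g. for offline testing).
    """
    try:
        resolver(hostname, port)
    except socket.gaierror as exc:
        # This is the "Name or service not known" class of error:
        # the problem is in DNS, not in the HTTP service itself.
        return "dns_failure", str(exc)
    return "resolved", ""
```

Calling `classify_check("example.com")` for the affected domain and for an unrelated one is roughly what I did by hand on the satellite server: if only one of them comes back as `dns_failure`, the resolver is fine and the problem lies with that one domain.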
A few minutes after this (the time difference was most likely caused by the domain's TTL), we received alerts on our internal systems as well. Now we were in deep trouble.
A whois of our domain did not show any DNS nameservers anymore, so I suspected a problem at our domain registrar (Gandi). Maybe someone had deleted the DNS servers from the domain? But when I logged into our account, the DNS servers were there. No modification had been made. I called Gandi to ask them for help figuring out what was going on with our domain - but they assured me that the DNS configuration seemed correct and that they couldn't explain why the domain wasn't working.
After Gandi's response, I decided to call SWITCH, the registry operator, also called the NIC (network information center), for domains ending with .ch (Switzerland) and .li (Principality of Liechtenstein). That was at exactly 2.59pm. In a few short sentences I explained our domain problem to the first level support guy and he asked me to hold on while he checked with the responsible team (which I know is just a few feet away, I visited their offices back in 2012). A few minutes later he was back and explained to me that our domain was blocked - probably because of malware (those were his words). I should contact the security team of SWITCH by e-mail. He couldn't give me any additional information. I sent the mail, explaining the situation as briefly as possible and asking for an immediate call back to explain what was going on. That was at 3.06pm. I didn't get a call back.
At 3.15pm I called again, reached the same guy as before and demanded to speak directly to the security team or to a supervisor. That didn't work, with the excuse that they don't have a direct phone number. My ass. Our company is completely down (e-mails as well) and I'm being kept waiting on the phone... At least, at my request, he went to see his colleagues from the security team again. A few minutes later he was back on the phone and told me that the domain would be reactivated shortly. But still no answer to my question "But why? What happened?!". I was told the security team would contact me.
At 3.29pm we received the first recovery alerts. A whois command showed the DNS nameservers again. But of course this is only a direct whois query against the central servers - the DNS cache servers at the big providers had "deleted" our domain, meaning they had cached the answer that it doesn't exist. It would take more than a few minutes to get the domain "back in".
At 4.18pm I got word from a colleague who has a direct contact at SWITCH and was able to talk to him. It turned out that a human error had occurred and that our domain had been accidentally deleted. It wasn't until 4.40pm that we saw normal incoming traffic again.
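Why does the recovery lag behind the fix? A caching resolver that was told "this domain doesn't exist" keeps that negative answer for a while before asking again. The small sketch below does the arithmetic. Note that the negative-cache TTL of 3600 seconds is a purely hypothetical value chosen for illustration, not the value that was actually in effect, and `latest_stale_answer` is a made-up helper name.

```python
from datetime import datetime, timedelta


def latest_stale_answer(restored_at, negative_ttl_seconds):
    """Worst case: a resolver cached the "does not exist" answer right
    before the domain was restored and keeps it for the full negative TTL."""
    return restored_at + timedelta(seconds=negative_ttl_seconds)


# Domain restored at 3.29pm; the TTL below is an ASSUMED example value.
restored = datetime(2016, 4, 15, 15, 29)
worst_case = latest_stale_answer(restored, negative_ttl_seconds=3600)
print(worst_case.strftime("%H:%M"))
```

With these assumed numbers, some resolvers could keep answering "unknown domain" for up to an hour after the fix, which is the same order of magnitude as the delay we observed before traffic returned to normal.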
Besides the downtime which was costly, avoidable and, as you can imagine, hectic, there are a few facts which still anger me:
1) Communication disaster. To this day, nobody has called or mailed me back to explain (technically) what happened.
2) Technically in shape? What kind of official registry operator/network information center just deletes a domain by "error"? What are your monitoring tools? Is there no prevention and verification before "accidentally" deleting a domain? Can anyone working at SWITCH just delete a domain without validation? Let's say you "accidentally" delete a domain like SBB.ch (the Swiss Federal Railways) - oh congratz, you've just brought a huge part of Switzerland's transportation system down.
3) Lies - sweet, sweet lies: SWITCH told my colleague that they "found the problem ourselves at around 3pm". Remember the time when I called and sent an e-mail? At least be honest and acknowledge that the end user had to report your mistake to you.
Later that day, SWITCH posted a "sorry" on Twitter: "nzz.ch is back online. We're sorry for the erroneous manipulation on our side!".
Interestingly, on the very same date this "accident" happened to our domain, the Swiss government released a public document stating:
"Technical management of the .ch domain in relation to the global internet domain name system is being provided by Switch until 2017"
On 15th April 2016, OFCOM launched a public invitation to tender to award the management mandate for .ch domain names. (registry function).
So after 2017 a new private or public organization will take over the registry function currently held by SWITCH. After last Friday I very much welcome this.