On Friday's there's always a bit more time for the curious cases which require some deeper research.
A developer contacted me with the suspicion that the data inside a clustered MariaDB (using Galera) was not in sync. Sometimes a certain query resulted X results, sometimes Y results.
Technically this sounded plausible because the application connects to 127.0.0.1:3306 which is a HAProxy listening on 3306. The requests are then balanced across multiple Galera nodes in round-robin style. So when the query lands on node1, X results are shown. But when the query lands on node2, Y results are shown. That could indeed be a data sync problem.
But it wasn't. The data across the cluster was exactly the same on all nodes. The same amount of rows inside the tables, the same data.
On node1:
MariaDB [app]> select count(*) from docs;
+----------+
| count(*) |
+----------+
| 3916 |
+----------+
1 row in set (0.00 sec)
MariaDB [app]> select count(*) from pages;
+----------+
| count(*) |
+----------+
| 446 |
+----------+
1 row in set (0.01 sec)
On node2:
MariaDB [app]> select count(*) from docs;
+----------+
| count(*) |
+----------+
| 3916 |
+----------+
1 row in set (0.00 sec)
MariaDB [app]> select count(*) from pages;
+----------+
| count(*) |
+----------+
| 446 |
+----------+
1 row in set (0.01 sec)
Even a dump created on both nodes showed the exact same size:
root@node1:/tmp# mysqldump app > /tmp/app.sql
root@node1:/tmp# du -ks /tmp/app.sql
3416 /tmp/app.sql
root@node2:/tmp# mysqldump app > /tmp/app.sql
root@node2:/tmp# du -ks /tmp/app.sql
3416 /tmp/app.sql
Could the two nodes interpret the query differently? The query itself is indeed a bit special, joining two tables together and using several internal MySQL functions:
SELECT DISTINCT d.*, p.page, p.pagestatus, DATE_FORMAT(creationdate,"%d.%m.%Y %H:%i:%s") as dcreationdate, DATE_FORMAT(updatedate,"%d.%m.%Y %H:%i:%s") as dupdatedate, DATE_FORMAT(lastpubdate,"%d.%m.%Y %H:%i:%s") as dlastpubdate, DATE_FORMAT(modified,"%d.%m.%Y %H:%i:%s") as dmodified FROM docs d LEFT JOIN (SELECT * FROM pages WHERE id IN (SELECT MAX(id) FROM pages GROUP BY wwid)) p ON (p.wwid = d.wwid) WHERE d.id in (SELECT MAX(id) FROM docs WHERE channelid = 2 GROUP BY ldid) AND d.printpublicationdate = '2018-06-22' ORDER BY updatedate DESC;
Agreed, not your typical select query... And this query did indeed show two different result sets from node1 and node2:
node1: 62 rows in set (0.02 sec)
node2: 82 rows in set (0.01 sec)
I checked for differences on the two nodes and indeed found a different patch level of MariaDB: node1 had 10.0.33 installed, node2 was running a newer version 10.0.36. Could this really be the source of the difference?
As this problem was identified on a staging server, I updated all cluster nodes to the current MariaDB patch level 10.0.37 (I wouldn't have done that on a Friday afternoon on PROD, no way!).
And by "magic", both nodes reported the same results once they ran the same version:
node1: 82 rows in set (0.02 sec)
node2: 82 rows in set (0.01 sec)
There were two important lessons learned here:
1) The data was still in sync and therefore not corrupted.
2) Always make sure a Galera cluster uses the same version across all nodes.
I'm not sure which exact version (10.0.34, 35 or 36) fixed this, but it could be related to bug MDEV-15035 which was fixed in 10.0.35.
No comments yet.
AWS Android Ansible Apache Apple Atlassian BSD Backup Bash Bluecoat CMS Chef Cloud Coding Consul Containers CouchDB DB DNS Database Databases Docker ELK Elasticsearch Filebeat FreeBSD Galera Git GlusterFS Grafana Graphics HAProxy HTML Hacks Hardware Icinga Influx Internet Java KVM Kibana Kodi Kubernetes LVM LXC Linux Logstash Mac Macintosh Mail MariaDB Minio MongoDB Monitoring Multimedia MySQL NFS Nagios Network Nginx OSSEC OTRS Office PGSQL PHP Perl Personal PostgreSQL Postgres PowerDNS Proxmox Proxy Python Rancher Rant Redis Roundcube SSL Samba Seafile Security Shell SmartOS Solaris Surveillance Systemd TLS Tomcat Ubuntu Unix VMWare VMware Varnish Virtualization Windows Wireless Wordpress Wyse ZFS Zoneminder