MySQL Galera cluster not starting (failed to open channel)


Published on - last updated on December 22nd 2020 - Listed in MySQL Database Galera


In a Galera Cluster test environment which had previously been shut down (two virtual servers on the same physical machine), I got the following error message when I tried to start MySQL on the first cluster node:

/etc/init.d/mysql start
 * Starting MariaDB database server mysqld     [fail]

The detailed information was logged to syslog in the background:

Jul 14 15:17:07 node1 mysqld_safe: Starting mysqld daemon with databases from /var/lib/mysql
Jul 14 15:17:07 node1 mysqld_safe: WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.iuhkNF' --pid-file='/var/lib/mysql/node1-recover.pid'
Jul 14 15:17:09 node1 mysqld_safe: WSREP: Recovered position cc4fb7ad-e5ab-11e3-8fae-d3fd14daa6a4:391488
Jul 14 15:17:09 node1 mysqld: 140714 15:17:09 [Note] WSREP: wsrep_start_position var submitted: 'cc4fb7ad-e5ab-11e3-8fae-d3fd14daa6a4:391488'
[...]
Jul 14 15:17:13 node1 mysqld: 140714 15:17:13 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50224S), skipping check
Jul 14 15:17:38 node1 /etc/init.d/mysql[11978]: 0 processes alive and '/usr/bin/mysqladmin --defaults-file=/etc/mysql/debian.cnf ping' resulted in
Jul 14 15:17:38 node1 /etc/init.d/mysql[11978]: #007/usr/bin/mysqladmin: connect to server at 'localhost' failed
Jul 14 15:17:38 node1 /etc/init.d/mysql[11978]: error: 'Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111 "Connection refused")'
Jul 14 15:17:38 node1 /etc/init.d/mysql[11978]: Check that mysqld is running and that the socket: '/var/run/mysqld/mysqld.sock' exists!
Jul 14 15:17:38 node1 /etc/init.d/mysql[11978]:
Jul 14 15:17:42 node1 mysqld: 140714 15:17:42 [Note] WSREP: view((empty))
Jul 14 15:17:42 node1 mysqld: 140714 15:17:42 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
Jul 14 15:17:42 node1 mysqld: #011 at gcomm/src/pc.cpp:connect():141
Jul 14 15:17:42 node1 mysqld: 140714 15:17:42 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():202: Failed to open backend connection: -110 (Connection timed out)
Jul 14 15:17:42 node1 mysqld: 140714 15:17:42 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open channel 'Galera Test' at 'gcomm://192.168.41.11,192.168.41.12': -110 (Connection timed out)
Jul 14 15:17:42 node1 mysqld: 140714 15:17:42 [ERROR] WSREP: gcs connect failed: Connection timed out
Jul 14 15:17:42 node1 mysqld: 140714 15:17:42 [ERROR] WSREP: wsrep::connect() failed: 7
Jul 14 15:17:42 node1 mysqld: 140714 15:17:42 [ERROR] Aborting
Jul 14 15:17:42 node1 mysqld:
Jul 14 15:17:42 node1 mysqld: 140714 15:17:42 [Note] WSREP: Service disconnected.
Jul 14 15:17:43 node1 mysqld: 140714 15:17:43 [Note] WSREP: Some threads may fail to exit.
Jul 14 15:17:43 node1 mysqld: 140714 15:17:43 [Note] /usr/sbin/mysqld: Shutdown complete
Jul 14 15:17:43 node1 mysqld:
Jul 14 15:17:43 node1 mysqld_safe: mysqld from pid file /var/run/mysqld/mysqld.pid ended

The important information here is the following line:

[ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)

When node1 starts MySQL, it tries to join an existing cluster. But because both nodes are currently down, there is no primary component (primary node) available to join (see this page for a good and short explanation).

So when a Galera Cluster must be started from scratch again, the first node must be started with the "--wsrep-new-cluster" option (exactly as during the initial setup of a new cluster):

service mysql start --wsrep-new-cluster
 * Starting MariaDB database server mysqld                               [ OK ]
 * Checking for corrupt, not cleanly closed and upgrade needing tables.

In syslog, the following log entries can be found:

Jul 14 15:18:43 node1 mysqld_safe: Starting mysqld daemon with databases from /var/lib/mysql
Jul 14 15:18:43 node1 mysqld_safe: WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.hK64YC' --pid-file='/var/lib/mysql/node1-recover.pid'
Jul 14 15:18:46 node1 mysqld_safe: WSREP: Recovered position cc4fb7ad-e5ab-11e3-8fae-d3fd14daa6a4:391488
Jul 14 15:18:46 node1 mysqld: 140714 15:18:46 [Note] WSREP: wsrep_start_position var submitted: 'cc4fb7ad-e5ab-11e3-8fae-d3fd14daa6a4:391488'
[...]
Jul 14 15:18:46 node1 mysqld: 140714 15:18:46 [Note] WSREP: Start replication
Jul 14 15:18:46 node1 mysqld: 140714 15:18:46 [Note] WSREP: Setting initial position to cc4fb7ad-e5ab-11e3-8fae-d3fd14daa6a4:391488
Jul 14 15:18:46 node1 mysqld: 140714 15:18:46 [Note] WSREP: protonet asio version 0
Jul 14 15:18:46 node1 mysqld: 140714 15:18:46 [Note] WSREP: Using CRC-32C (optimized) for message checksums.
Jul 14 15:18:46 node1 mysqld: 140714 15:18:46 [Note] WSREP: backend: asio
Jul 14 15:18:46 node1 mysqld: 140714 15:18:46 [Note] WSREP: GMCast version 0
Jul 14 15:18:46 node1 mysqld: 140714 15:18:46 [Note] WSREP: (62a145e9-0b59-11e4-9a2f-c62c46c73c36, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
Jul 14 15:18:46 node1 mysqld: 140714 15:18:46 [Note] WSREP: (62a145e9-0b59-11e4-9a2f-c62c46c73c36, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
Jul 14 15:18:46 node1 mysqld: 140714 15:18:46 [Note] WSREP: EVS version 0
Jul 14 15:18:46 node1 mysqld: 140714 15:18:46 [Note] WSREP: PC version 0
Jul 14 15:18:46 node1 mysqld: 140714 15:18:46 [Note] WSREP: gcomm: bootstrapping new group 'Galera Test'
[...]
Jul 14 15:18:46 node1 mysqld: 140714 15:18:46 [Note] WSREP: gcomm: connected
Jul 14 15:18:46 node1 mysqld: 140714 15:18:46 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
Jul 14 15:18:46 node1 mysqld: 140714 15:18:46 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
Jul 14 15:18:46 node1 mysqld: 140714 15:18:46 [Note] WSREP: Opened channel 'Galera Test'
Jul 14 15:18:46 node1 mysqld: 140714 15:18:46 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
[...]
Jul 14 15:18:46 node1 mysqld: 140714 15:18:46 [Note] WSREP: Quorum results:
Jul 14 15:18:46 node1 mysqld: #011version    = 3,
Jul 14 15:18:46 node1 mysqld: #011component  = PRIMARY,
Jul 14 15:18:46 node1 mysqld: #011conf_id    = 0,
Jul 14 15:18:46 node1 mysqld: #011members    = 1/1 (joined/total),
[...]
Jul 14 15:18:46 node1 mysqld: 140714 15:18:46 [Note] /usr/sbin/mysqld: ready for connections.
Jul 14 15:18:46 node1 mysqld: Version: '10.0.10-MariaDB-1~trusty-wsrep-log'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  mariadb.org binary distribution, wsrep_25.10.r3968
Jul 14 15:18:47 node1 /etc/mysql/debian-start[13061]: Upgrading MySQL tables if necessary.

The other nodes can be started normally and they will automatically connect to the primary node.
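
Once all nodes are up again, the cluster membership can be verified on any node, for example with the following query (assuming a MySQL account with sufficient privileges):

# mysql -u root -p -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_%';"

With both nodes joined, wsrep_cluster_size should show 2 and wsrep_cluster_status should show Primary.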

Update April 25th 2018:
On newer machines using SystemD as the init system, additional steps might be necessary to start the first cluster node again.

First make sure that the node which will be the new primary node is allowed to bootstrap the cluster (this part is independent of SystemD):

# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    6a1f102a-13a3-11e7-b710-b2876418a643
seqno:   -1
safe_to_bootstrap: 0
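
When all nodes were shut down, it is usually best to bootstrap from the node with the most recent data. A quick way to compare the saved states is to look at grastate.dat on every node (a sketch assuming SSH access; node1 and node2 are the example hostnames of this setup):

# for h in node1 node2; do echo "== $h =="; ssh $h "cat /var/lib/mysql/grastate.dat"; done

The node with the highest seqno should be the one that bootstraps the cluster.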

Change the value of safe_to_bootstrap from 0 to 1:

# sed -i "/safe_to_bootstrap/s/0/1/" /var/lib/mysql/grastate.dat
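
A quick check confirms the change:

# grep safe_to_bootstrap /var/lib/mysql/grastate.dat
safe_to_bootstrap: 1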

Then run the command

# galera_new_cluster

which takes care of bootstrapping the cluster.
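
On SystemD based systems this script is roughly equivalent to the following (a sketch of what the MariaDB-provided script does; details may differ between versions):

# systemctl set-environment _WSREP_NEW_CLUSTER='--wsrep-new-cluster'
# systemctl start mariadb
# systemctl set-environment _WSREP_NEW_CLUSTER=''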

Further readings:

Galera Sync and Systemd timeout

Updated September 24th 2018
When you want to re-sync (SST) a cluster node and you use SystemD as the init system, the sync process can fail on larger databases because the service start runs into SystemD's startup timeout:

node02 systemd[1]: mariadb.service: Start operation timed out. Terminating.
node02 mysqld[22981]: Terminated
node02 rsyncd[23270]: rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at io.c(513) [generator=3.1.2]
node02 rsyncd[23269]: rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at io.c(513) [generator=3.1.2]
node02 rsyncd[23030]: sent 0 bytes  received 0 bytes  total size 0
node02 rsyncd[23269]: rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at io.c(513) [receiver=3.1.2]
node02 systemd[1]: mariadb.service: State 'stop-sigterm' timed out. Skipping SIGKILL.
node02 mysqld[22981]: 2018-09-24 11:12:30 0 [ERROR] WSREP: Process was aborted.
node02 mysqld[22981]: 2018-09-24 11:12:30 0 [ERROR] WSREP: Process completed with error: wsrep_sst_rsync --role 'joiner' --address '10.150.2.76' --datadir '/var/lib/mysql/'   --parent '22981'  '' : 2 (No such file or directory)
node02 mysqld[22981]: 2018-09-24 11:12:30 0 [ERROR] WSREP: Failed to read uuid:seqno and wsrep_gtid_domain_id from joiner script.
node02 mysqld[22981]: 2018-09-24 11:12:30 0 [ERROR] WSREP: SST failed: 2 (No such file or directory)
node02 mysqld[22981]: 2018-09-24 11:12:30 0 [ERROR] Aborting
node02 mysqld[22981]: 2018-09-24 11:12:30 0 [Warning] WSREP: 0.0 (mh-mysql01-s): State transfer to 2.0 (node02) failed: -255 (Unknown error 255)
node02 mysqld[22981]: 2018-09-24 11:12:30 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():737: Will never receive state. Need to abort.
node02 systemd[1]: mariadb.service: Main process exited, code=killed, status=6/ABRT
node02 systemd[1]: mariadb.service: Failed with result 'timeout'.
node02 systemd[1]: Failed to start MariaDB 10.3.7 database server.

To fix this, you need to tell SystemD to allow an infinite startup time for MariaDB (see this helpful blog article). Create an additional SystemD config file for the mariadb service (/etc/systemd/system/mariadb.service.d/timeout.conf) and set the value TimeoutSec=infinity:

# cat /etc/systemd/system/mariadb.service.d/timeout.conf
[Service]
TimeoutSec=infinity
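
As an alternative to creating the file manually, the same drop-in can also be created interactively with SystemD's edit command (this writes an override.conf into the same directory):

# systemctl edit mariadb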

Reload systemd unit files:

# systemctl daemon-reload
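
To verify that the new timeout is active (TimeoutSec= covers both the start and stop timeouts), the effective value can be queried from SystemD; it should now report infinity:

# systemctl show mariadb -p TimeoutStartUSec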

Now start MariaDB the normal way; this time SystemD will allow a longer startup time and give the Galera process time to complete the sync.
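
For example:

# systemctl start mariadb
# journalctl -fu mariadb

The second command follows the MariaDB journal and shows the progress of the state transfer.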

Galera Sync fails with no route to host

Updated April 21st 2020

If you get an error like the following:

[ERROR] WSREP: Requesting state transfer failed: -113(No route to host)
[ERROR] WSREP: State transfer request failed unrecoverably: 113 (No route to host). Most likely it is due to inability to communicate with the cluster primary component. Restart required.

... then check out the following article: Galera node unable to sync from donor: Requesting state transfer failed: -113(No route to host).
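
In most cases this points to a network or firewall problem between the joining node and the donor. A quick connectivity check against the standard Galera ports (4567 group communication, 4568 IST, 4444 SST), assuming netcat is installed and using the donor IP of this setup as an example:

# for p in 4567 4568 4444; do nc -zvw3 192.168.41.11 $p; done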

Need help in Galera troubleshooting?

Problems in Galera Clusters are not always easy to spot. Need help troubleshooting a Galera cluster? Contact us on Infiniroot.com.


