Skip to navigation

Error Recovery

At some point, a server will probably break down and need to be replaced.  When this happens it's important to minimize down-time and avoid losing data.  Error recovery procedures need to be simple so that you can solve problems quickly.

I'm not planning on setting up automated failover.  Automated failover can kick in at the wrong time and cause more problems than it solves, as the engineers at Github discovered.

It's a good idea to rehearse error recovery before you have to do it for real.  I need to plan for either a master failure or a slave failure.  A slave failure is fairly easy to deal with - I just need to remove the failed slave from the cluster and add another one.

A master failure is only slightly harder to deal with.  I need to promote one of the remaining slaves and add another slave to the cluster.

Master Failure

I'm using the same cluster as in the other pages in this section.  I've initialized it and started replication.  I forced a failure by simply unplugging the master server's power cable.

The first thing I need to do is remove the server from the cluster:

./db_cluster_utils.py -r 192.168.0.35
Remove node: 192.168.0.35
 Failed to connect
Line: 74
(2002, "Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)")
 *** Couldn't revoke replication privilege from slave_user
Line: 290
db_server instance has no attribute 'curs'
Removing slave
192.168.0.35 Failed to connect
Line: 74
(2003, "Can't connect to MySQL server on '192.168.0.35' (113)")
192.168.0.35 Failed to stop slave
Line: 375
db_server instance has no attribute 'curs'
192.168.0.35 *** Failed to drop user slave_user
Line: 193
db_server instance has no attribute 'curs'
192.168.0.35 *** Failed to drop user db_user
Line: 193
db_server instance has no attribute 'curs'
192.168.0.35 *** Failed to delete database my_db
Line: 204
db_server instance has no attribute 'curs'
Exception AttributeError: "db_server instance has no attribute 'conn'" in <bound method db_server.__del__ of <__main__.db_server instance at 0xb6aa7da0>> ignored
Exception AttributeError: "db_server instance has no attribute 'conn'" in <bound method db_server.__del__ of <__main__.db_server instance at 0xf5db20>> ignored

There are lots of error messages when this command is executed.  The server that's being removed can't accept a connection, so parts of this command didn't work properly.  The important thing is that the entry for this server was removed from status_log_records.dat.

The master server has been removed from the cluster, so one of the slaves needs to be promoted:

./db_cluster_utils.py -p 192.168.0.36
192.168.0.36 Stopped slave thread
192.168.0.37 Stopped slave thread
192.168.0.36 Cleared read only
Slave IP list:
['192.168.0.37']
192.168.0.36 Granted replication privilege to slave_user@192.168.0.37 id'd by mypassword
192.168.0.36 Got master status: mysql-bin.000031, 34726
192.168.0.37
CHANGE MASTER TO MASTER_HOST='192.168.0.36', 
                    MASTER_USER='slave_user', 
                    MASTER_PASSWORD='mypassword', 
                    MASTER_LOG_FILE='mysql-bin.000031', 
                    MASTER_LOG_POS=34726;
192.168.0.37 Set master
192.168.0.37 Started slave thread
Database name: my_db
Master server IP: 192.168.0.36
Control Host: 192.168.0.8
Slave IP address list:
['192.168.0.37']
Root password: T4nk3r!12
User password: mypassword
Slave user password: mypassword
New master IP: 192.168.0.36

Again, I'm going to test that replication is working by adding a record to the new master:

mysql> INSERT INTO users VALUES ('Isabelle', '123456', 'email@addr.com', CURRENT_DATE());
Query OK, 1 row affected (0.03 sec)

mysql> COMMIT;
Query OK, 0 rows affected (0.00 sec)

At the MySQL prompt on the remaining slave, I should see the new record in the list of all records from the users table:

mysql> SELECT * FROM users;
+----------+----------+-----------------+---------------------+
| username | password | email_addr      | joined              |
+----------+----------+-----------------+---------------------+
| Anna     | 123456   | anna@email.com  | 2014-10-20 00:00:00 |
| Bob      | 333456   | bob@addr.com    | 2014-10-20 00:00:00 |
| Claire   | 444456   | claire@addr.com | 2014-10-20 00:00:00 |
| Dave     | 555456   | dave@addr.com   | 2014-10-20 00:00:00 |
| Emma     | 666456   | emma@addr.com   | 2014-10-20 00:00:00 |
| Fred     | 111456   | fred@addr.com   | 2014-10-20 00:00:00 |
| Gwen     | 222456   | gwen@addr.com   | 2014-10-20 00:00:00 |
| Harry    | 123123   | harry@mail.com  | 2014-10-20 00:00:00 |
| Isabelle | 123456   | email@addr.com  | 2014-10-20 00:00:00 |
+----------+----------+-----------------+---------------------+
9 rows in set (0.00 sec)

Add the recovered master as a slave

Normally I would need to fix whatever went wrong with the server, but in this instance, I can simply boot it up again and drop the database from this server.  Then I can add the server back into the cluster:

./db_cluster_utils.py -a 192.168.0.35
192.168.0.36 Got master status: mysql-bin.000032, 1067
192.168.0.36 Granted replication privilege to slave_user@192.168.0.35 id'd by mypassword
Initializing new slave
Init server 192.168.0.35
192.168.0.35 Created database my_db
192.168.0.35 Created user db_user
192.168.0.35 Granted privileges
192.168.0.35 Created user slave_user
192.168.0.35 Granted privileges
192.168.0.35 Got master status: mysql-bin.000008, 8813
192.168.0.35 Set read only
CHANGE MASTER TO MASTER_HOST='192.168.0.36', 
                    MASTER_USER='slave_user', 
                    MASTER_PASSWORD='mypassword', 
                    MASTER_LOG_FILE='mysql-bin.000032', 
                    MASTER_LOG_POS=1067;
192.168.0.35 Set master
New slave IP: 192.168.0.35

Now I need to copy the database from the new master to the new slave:

./db_cluster_utils.py -m 192.168.0.35
Dump the database on the master server
mysqldump -h 192.168.0.36 -u root -p --opt my_db --result-file=my_db_2014.10.21.175558.sql
Enter password: 
Move the database file
scp my_db_2014.10.21.175558.sql bananapi@192.168.0.35:./my_db_2014.10.21.175558.sql
bananapi@192.168.0.35's password: 
my_db_2014.10.21.175558.sql                   100% 3703     3.6KB/s   00:00    
Enter password: 

This command shows the configuration information for the cluster:

./db_cluster_utils.py -c
Database name: my_db
Master server IP: 192.168.0.36
Control Host: 192.168.0.8
Slave IP address list:
['192.168.0.37', '192.168.0.35']
Root password: T4nk3r!12
User password: mypassword
Slave user password: mypassword

192.168.0.36:mysql-bin.000031:34486:master
192.168.0.37:mysql-bin.000030:31273:slave
192.168.0.35:mysql-bin.000008:8813:slave

I wanted to check that replication is working, so I added a new record to the master at 192.168.0.36:

mysql> INSERT INTO users VALUES ('John', '123456', 'email@addr.com', CURRENT_DATE());
Query OK, 1 row affected (0.02 sec)

mysql> COMMIT;
Query OK, 0 rows affected (0.00 sec)

On other servers, the contents of the users table should look like this:

mysql> SELECT * FROM users;
+----------+----------+-----------------+---------------------+
| username | password | email_addr      | joined              |
+----------+----------+-----------------+---------------------+
| Anna     | 123456   | anna@email.com  | 2014-10-20 00:00:00 |
| Bob      | 333456   | bob@addr.com    | 2014-10-20 00:00:00 |
| Claire   | 444456   | claire@addr.com | 2014-10-20 00:00:00 |
| Dave     | 555456   | dave@addr.com   | 2014-10-20 00:00:00 |
| Emma     | 666456   | emma@addr.com   | 2014-10-20 00:00:00 |
| Fred     | 111456   | fred@addr.com   | 2014-10-20 00:00:00 |
| Gwen     | 222456   | gwen@addr.com   | 2014-10-20 00:00:00 |
| Harry    | 123123   | harry@mail.com  | 2014-10-20 00:00:00 |
| Isabelle | 123456   | email@addr.com  | 2014-10-20 00:00:00 |
| John     | 123456   | email@addr.com  | 2014-10-21 00:00:00 |
+----------+----------+-----------------+---------------------+
10 rows in set (0.00 sec)

Share this page:

comments powered by Disqus