Usually the purpose of scaling out is to cope with a high volume of traffic by adding more servers. I want to try this with my Banana Pi cluster in order to develop a system which I can use on clusters of more powerful machines.
The first step was to develop Pyplate Multi-Site, which allows several sites to be served from a single server. The next step is to set up Pyplate Multi-Site on a cluster of servers. To do that, I need to scale out database access and make sure that all content can be accessed from each web server.
Pyplate was originally written to work with SQLite, which is not a network database. I modified Pyplate to work with MySQL so that I could offload the database to another set of servers. This means the web server nodes don't have to spend CPU time and memory on processing database queries, so they can handle more concurrent connections.
I'm using a four-node database cluster that I previously used to develop a database cluster management utility. I changed the value of binlog_do_db in /etc/mysql/my.cnf on each server from my_db to pyplate_db. On the control node, I edited cluster-utils.conf (the configuration file for my database cluster management tool) to include settings for the Pyplate database, and I used the db_cluster_utils.py script's init option to set up the cluster.
A second command then starts database replication on the cluster.
I exported the databases of a couple of Pyplate sites that I own into XML files, and then imported the data from those files into the master MySQL server in the database cluster.
Replication then copies the imported data to the other MySQL servers in the cluster.
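Each web server needs to be able to access the files (themes, templates and content) for every site hosted on the cluster. I considered two approaches: using rsync to copy the files to every web server, or keeping the files on shared network storage such as an NFS share or GlusterFS.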
The rsync option is quite simple to set up. The downside is that there has to be a complete copy of every site's /var/www directory on each server. My sites are quite small (in the range of tens of megabytes), so making a complete copy of all sites on each server isn't a problem. Even with 8GB SD cards, there's enough space to run several Pyplate sites on each server. In this configuration, one web server is the master, which runs rsync to copy data to all the slave servers. All modifications to files on the web servers should be made on the master server, and rsync will propagate the changes to the other servers.
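The master-to-slave push can be sketched in Python as follows. This is only an illustration of the idea, not the command actually used on the cluster: the host names `web2` and `web3` are hypothetical, and the `-az --delete` flags are an assumption about how the copy might be configured.

```python
import subprocess

# Hypothetical slave host names; substitute the real nodes in the cluster.
SLAVES = ["web2", "web3"]


def build_rsync_command(directory, host):
    """Return the rsync argument list that mirrors `directory` to `host`."""
    # A trailing slash makes rsync copy the directory's contents rather than
    # creating a nested copy of the directory itself on the destination.
    source = directory.rstrip("/") + "/"
    return [
        "rsync",
        "-az",        # archive mode (recursion, permissions, times) plus compression
        "--delete",   # remove files on the slave that were deleted on the master
        "-e", "ssh",  # use ssh as the transport
        source,
        f"{host}:{source}",
    ]


def push_to_slaves(directory="/var/www", slaves=SLAVES, dry_run=True):
    """Mirror `directory` to every slave; with dry_run=True only return the commands."""
    commands = [build_rsync_command(directory, host) for host in slaves]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if rsync exits with an error.
            subprocess.run(command, check=True)
    return commands
```

With `dry_run=True` the function only returns the commands, so they can be inspected before anything is copied or deleted on the slaves.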
Copying each site's data to every server is not an efficient use of disk space, so it's not a sensible approach for large web sites with huge datasets. A web site the size of Wikipedia or Facebook cannot be contained on a single server, so using rsync to copy the site's content to each node won't work. Instead, files that need to be accessed by several servers can be stored on a network storage device. This can be as simple as a single node with an NFS share, or a group of nodes with a distributed file system like GlusterFS.
I decided against using Gluster because it would have been an extra system to maintain and debug. This site's storage requirements aren't that complicated, and Gluster isn't really necessary. The Banana Pi nodes that I would have used to run Gluster can be used as web servers instead. If I were setting up a mass hosting system capable of running thousands of sites, I would use Gluster.
I'm using rsync as described on the page about building the Banana Pi cluster.
I've set up ssh keys so that rsync can run without prompting for a user password.
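One way to confirm that key-based login really works before relying on it is to run ssh with prompting disabled. The sketch below is an assumption about how such a check could look, not part of the original setup; the host name passed in and the five-second timeout are illustrative.

```python
import subprocess


def ssh_check_command(host):
    """Return an ssh command that fails instead of prompting for a password."""
    # BatchMode=yes tells ssh never to ask for a password or passphrase,
    # so the command only succeeds if key-based login already works.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"]


def key_login_works(host, runner=subprocess.run):
    """Return True if `host` accepts a key-based login without any prompt."""
    # `runner` is injectable so the check can be exercised without a network.
    result = runner(ssh_check_command(host), capture_output=True)
    return result.returncode == 0
```

If `key_login_works` returns False for a node, rsync over ssh to that node would stop to ask for a password, or fail outright when run from a script.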
I considered using rsync in daemon mode so that changes to the master node are flushed through to the rest of the cluster immediately, but this has some disadvantages. When I'm modifying a site, I don't want changes to be flushed through to the cluster while I'm halfway through making changes. I want to be able to make changes on the master node, and only synchronize the master node with the rest of the cluster when I've finished making updates. After making changes, I always clear the cache and rebuild it, so I added some code to trigger rsync every time the cache is built.
I wrote a bash script named sync.sh to run rsync.
The script synchronizes the CMS and web root directories on all nodes in the web server cluster. If caching is always enabled for all sites, then only the /var/www directory needs to be synchronized, but I've updated the script so that it also synchronizes the CMS directories. A Python function which I added to the code that handles caching calls sync.sh.
When the cache is built, the sync function is called, and the slave servers are synchronized with the master server.
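A minimal sketch of that hook is shown below. It is not Pyplate's actual code: the script path in `SYNC_SCRIPT` and the names `build_cache` and `build_pages` are hypothetical, chosen only to show the order of operations.

```python
import subprocess

# Hypothetical location of the script; the real path may differ.
SYNC_SCRIPT = "/usr/local/bin/sync.sh"


def sync(script=SYNC_SCRIPT, runner=subprocess.call):
    """Run the sync script and report whether it exited successfully."""
    # subprocess.call returns the script's exit status; 0 means success.
    return runner(["bash", script]) == 0


def build_cache(build_pages, sync_func=sync):
    """Rebuild the cache, then push the result to the slave servers."""
    build_pages()        # stand-in for the CMS's own cache-building code
    return sync_func()   # synchronize only once the rebuild has finished
```

Calling the sync step at the end of the cache build means half-finished edits on the master never reach the slaves; they only see the site once the cache has been rebuilt.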