For starters, I had dutifully prepared the weekend before by moving away the dastardly data/ directory (and the odd image_gallery/ directory), retrofitting everything with symbolic links, and ensuring that the site was still running correctly. I even checked out the repository to a temporary directory and verified that our rsync faithfully pushed all changes from the master out to the slaves.
My plan for Tuesday evening was to:
A. Deploy application with ‘svn checkout’ in the test environment (master):
- disable rsync on the master (keeping the live website undisturbed and running the existing application)
- move away the existing application directory (called docs/) for rollback purposes
- rename and move the already checked out code repository to docs/
- add the www-data user to the Linux group that owns the repository (which has recursive read permissions)
- restart Apache to flush out any permission problems
- manual “smoke test” of the test site
B. Deploy application via ‘svn checkout’ on the live slave servers
- Stop Apache (on just one of the nodes at a time to ensure the site stays up = uptime good, downtime bad)
- Repeat deployment steps done on the test server (see above)
- Start Apache and manually “smoke test” www100.
- if everything looks good, rinse & repeat on the other slaves
C. Reenable rsync on the master test server
- go to the document root and run ‘svn status’ and ‘svn info’ (if “ignorance is bliss” then I say “knowledge is nirvana”)
- make a change locally and commit, then run ‘svn update’ on the Master
- wait 5 minutes and verify rsync is keeping the slaves up to date by running ‘svn info’
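The heart of steps A and B is the same directory swap, so here it is as a shell sketch. The directory names (docs/, docs.orig/) come from the post; the scratch setup, file names, and commented-out production commands are illustrative assumptions, not our actual deployment script:

```shell
#!/bin/sh
# Rehearsal of the docs/ swap in a scratch directory (illustrative only).
set -eu
WORK=$(mktemp -d)
cd "$WORK"
mkdir docs checkout                  # docs/ = live app, checkout/ = fresh 'svn checkout'
touch docs/index.php checkout/index.php

mv docs docs.orig                    # keep the old tree around for rollback
mv checkout docs                     # promote the checked-out code into place
# on the real servers this is where you would (as root):
#   adduser www-data netdoktor       # give Apache group read access
#   /etc/init.d/apache2 restart      # flush out any permission problems
```

The nice property of the mv-based swap is that rollback is a single mv in the other direction.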
The Plot Thickens
I figured I’d need about 10 minutes per server to do the deployment – in just half an hour we’d all know exactly what was running on those servers. As our peak time is between 8 and 10pm, I planned to do the actual deployment at 10:30.
But the late, great John Steinbeck had a thing or two to say about plans in “Of Mice and Men”, and early Tuesday afternoon we got a customer request that was easily handled by executing the initial part of the release on the Master. I was fine running without rsync for the afternoon and looked forward to getting a jump on the deployment.
First things first – disable the replication. On our systems this is done by renaming a dummy file inside the docs/ directory: .do-not-sync.not => .do-not-sync … don’t you just love double negatives? Why are sysadmins so negative? This file would be so much more easily understood as .do-sync.not, right? But I digress… (however, keep this lovely pitfall in mind – we will of course refer to it later)
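The guard inside the sync cronjob worked roughly like this. I'm reconstructing it from its behavior – the script wasn't mine – so treat the variable names and messages as assumptions; only the .do-not-sync convention is real:

```shell
#!/bin/sh
# Sketch of the replication cronjob's guard: the marker file *disables* rsync.
DOCROOT=$(mktemp -d)               # stand-in for the real document root
touch "$DOCROOT/.do-not-sync"      # renaming .do-not-sync.not => .do-not-sync disables sync

if [ -e "$DOCROOT/.do-not-sync" ]; then
    MSG="sync disabled, skipping"
else
    MSG="syncing"                  # the real cronjob would run rsync here
fi
echo "$MSG"
```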
Taking the Plunge
I dutifully logged into one of the slaves, and tailed the sync log for 5 minutes, verifying that synchronization had indeed been successfully disabled. We were “Go for launch”!
Taking a deep breath, I moved the docs/ directory to docs.orig/ and moved the subversion repository into its place as the new docs/. I added the www-data user to the appropriate group and restarted Apache, keeping all my good little ducks in a row. Clicking around the test site gave no surprises – a bit anti-climactic, I admit, but I’d been running this exact configuration in the staging environment for almost two months now.
We gave the “All clear” signal and notified the customer they could check out the test site. If this pre-deployment was any indicator, the release that night would be a breeze.
What Actually Happened
About an hour later, I made a local change & commit for a minor bugfix. Wanting to try out my newfound powers, I logged into the test server and ran ‘svn update’. Sweet, this is almost too easy now.
Wanting to keep the slaves up-to-date (and the initial rsync overhead to a minimum during the night’s release), I hopped on over and went to the temp directory where I had checked out the repository the weekend prior. Hmm, that’s strange – it wasn’t there anymore. How the hell could it have just disappeared?
pwd … yep, I’m in the right directory but no svn root.
Something tickles in the back of my brain. After all, I’d been bitten hard in the past by this damn rsync – could I have just been bitten again? I jump up a couple of directories to take a look at the document root.
$ ls -ld docs*
drwxrwsr-x 11 netdoktor netdoktor 4096 2009-04-22 12:01 docs
drwxrwsr-x 12 www-data netdoktor 4096 2009-04-21 15:07 docs.orig
$ svn info docs/
Repository Root: svn+ssh://svn
Repository UUID: db2ab612-952d-4ba0-912a-6502137340e4
WTF?! All the slaves were already serving the subversion-deployed directory on the live site. How the hell did that happen?
Well, remember that happy little file called .do-not-sync ? He and his friends were too cool to hang out in the subversion repository, so when I moved the original docs/ directory away and replaced it with the one from svn, the system no longer saw this file.
And guess what? The default behavior of the cronjob was to damn the torpedoes and run a full rsync if this file was not around to bully it into submission. Maybe the creator of this cronjob hadn’t heard of safe defaults, but if you’ve read this far you, for better or worse, have. Write scripts in such a way that if one of the keystones of the whole process is gone, they fail fast and LOUD.
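Here's the fail-fast inversion of that default as a sketch: require a positive .do-sync marker and refuse to run when it's missing. The .do-sync name is my suggestion, not what we actually ran, and the rsync line is illustrative:

```shell
#!/bin/sh
# Safe-default sketch: replication only runs when the keystone file exists.
set -eu
DOCROOT=$(mktemp -d)               # stand-in for the real document root
# note: we deliberately do NOT create "$DOCROOT/.do-sync" here

if [ ! -e "$DOCROOT/.do-sync" ]; then
    echo "FATAL: $DOCROOT/.do-sync missing - refusing to rsync" >&2
    STATUS=1                       # the real cronjob should 'exit 1' and page someone
else
    STATUS=0
    # rsync -a "$DOCROOT"/ slave:/var/www/docs/   # the actual replication would go here
fi
echo "status=$STATUS"
```

With this shape, losing the marker file during a directory swap stops replication cold instead of silently turning it back on.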
All’s Well that Ends Well
Luckily, I had prepared intensely for this release, so nobody (not even myself, for almost an hour) realized that the live site had been updated. The Zend page caching caught most of the requests that snuck through in the 20 seconds it took to rsync the files over, so the majority of our customers didn’t notice either (yes, you bet your ass I carefully pored over the Apache logs looking for 404s during this timeframe).
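For the curious, the log check was nothing fancier than counting 404 status codes in the combined log. A sketch against a fabricated log line – the sample entries and the approach are illustrative, not my exact command:

```shell
#!/bin/sh
# Sketch: count 404s in an Apache combined-format log (sample lines are made up).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
10.0.0.1 - - [22/Apr/2009:13:05:01 +0200] "GET /index.php HTTP/1.1" 200 5120
10.0.0.2 - - [22/Apr/2009:13:05:09 +0200] "GET /missing.gif HTTP/1.1" 404 312
EOF
NOTFOUND=$(awk '$9 == 404' "$LOG" | wc -l)   # field 9 is the HTTP status code
echo "404s: $NOTFOUND"
```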
What did I learn from this experience? Me and rsync do NOT get along for deploying our website. Don’t get me wrong, it’s superb for backups, just not code deployment. Matthias, I’ll soon be asking you for some help writing my Capistrano scripts!