The next step toward getting a better grip on your environment is figuring out exactly what kind of production configurations you have running out there. If you’ve ever caught yourself walking through the data center and wondering just what in the hell those servers in the back corner are for, this phase will be quite the eye-opener. Now that you have changes to this environment under better control, you can start to invest some time in assessing what you actually have.
“Catch & Release”
This part does just what it sounds like – although I’m a bit more fond of “bag & tag” myself. You’re going to want to inventory every piece of hardware and every piece of software in the data center. Begin to draw a network of dependencies between servers and services to help you understand how it all works together. Interview any sysadmins with experience of (or even passing knowledge of) these systems to fill in the nooks and crannies of your team’s “tribal knowledge”. It’s important to do these seemingly tedious and time-consuming tasks now, when everyone is relaxed and able to discuss things intelligently, because trying to do so during a 3am Nagios alert is just ridiculous. Here are some typical inventory questions to answer:
- What does it do (technically and business impact)?
- What is the hardware & OS?
- What apps are installed and which services do they support?
- Who is authorized to make changes?
- Can we build a new one of these if it fails?
- What is the outage cost (per minute of downtime)?
- How is it backed up?
- Are we monitoring this box?
- How long can the business afford to live without it?
- If it’s mission-critical, are there enough hardware backups in place?
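Some of these questions can only be answered by a human, but the basic hardware and OS facts can be collected automatically from each box. Here’s a minimal sketch in Python using only the standard library – the field names are my own illustration, not any CMDB standard:

```python
# Hypothetical inventory-collection sketch: gathers a few basic facts
# about the host it runs on and emits them as JSON, ready to feed
# into whatever inventory store you settle on.
import json
import platform
import socket

def collect_facts():
    """Return a dict of basic hardware/OS facts for this host."""
    return {
        "hostname": socket.gethostname(),
        "os": platform.system(),          # e.g. "Linux"
        "os_release": platform.release(), # kernel / OS release string
        "arch": platform.machine(),       # e.g. "x86_64"
    }

if __name__ == "__main__":
    print(json.dumps(collect_facts(), indent=2))
```

Run it on each server (via SSH, cron, whatever you have) and you’ve got the mechanical half of the inventory for free; the business-impact questions still need those sysadmin interviews.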
Although this may sound tedious, don’t assign the task to a junior! Remember the ultimate goal of this step: automatically recreating your entire production infrastructure. You’ll need someone with intimate system knowledge who also has the respect of the whole organization, so they can go in and do this job right the first time! “The Visible Ops Handbook” gives a much more in-depth analysis of this critical step in establishing your CMDB.
Find Fragile Artifacts
These are fairly easy to find but much more difficult to fix. You know the servers I’m talking about – the ones with the big yellow post-it “DON’T TOUCH!!” stuck all over. Any change requests made on these assets are done at the company’s peril, and you know that if one does crash and burn there will be hell to pay at all levels. Hopefully, these problematic boxes were among the first to receive the more stringent change control processes we discussed in the last step. Getting an inventory of all production assets and storing it in a Configuration Management Database (CMDB) gives you an invaluable overview of your production server topology; without it you won’t be able to refactor (simplify) or improve in the next step. But what do you do with all this information now? It’s time to hand it over to the release team – you do have one of those, right? Now these guys can work on automatically rebuilding the environment, finally giving you the confidence and control you need to keep things ship-shape.
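One way to think about the hand-off: each CMDB entry answers the inventory questions above, and the fragile artifacts fall out of a simple query. Here’s a minimal sketch in Python – the record fields and the `fragile` test are my own assumptions, not part of any tool mentioned here:

```python
# Hypothetical CMDB record sketch: one entry per production asset,
# mirroring the inventory questions from earlier in this post.
from dataclasses import dataclass, field

@dataclass
class Asset:
    hostname: str
    purpose: str                         # what it does, technically and for the business
    hardware: str
    os: str
    services: list = field(default_factory=list)    # services it supports
    depends_on: list = field(default_factory=list)  # upstream assets
    rebuildable: bool = False            # can we build a new one if it fails?
    backed_up: bool = False
    monitored: bool = False
    outage_cost_per_min: float = 0.0

def fragile(assets):
    """Flag assets we couldn't rebuild, don't back up, or don't monitor."""
    return [a for a in assets
            if not (a.rebuildable and a.backed_up and a.monitored)]
```

Any asset that `fragile` flags is a candidate for a big yellow post-it – and for the release team’s rebuild-automation queue.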
Besides open source CM tools like Cfengine and Puppet, thousands of customers worldwide use Tripwire, co-founded by one of the authors of “The Visible Ops Handbook”, Gene Kim. But even if you’re just starting out and can’t justify the time and expense these tools require, don’t use that as an excuse not to try automating this process! Matthias will be sharing a real-world example of applied configuration management using Capistrano in a couple of days. Subscribe here to get it delivered right to your inbox!