Visible Ops: If At First You Don’t Succeed, Build, Build Again

October 12, 2008

By now you should have a better understanding of how your data center is set up and exactly what is in it. You’ve been carefully monitoring changes to this environment and are ready to take it to the next level. The “Visible Ops Handbook” introduces the concept of “production fuses”: when a fuse fails, you don’t try to repair it; instead, you simply replace it with a working one right out of the package. Not only does this mean less downtime, but you can be pretty sure that the new fuse will work as expected (and not accidentally burn down your entire home!). Your confidence that the new fuse will function correctly is directly related to how well you’ve been monitoring changes on that production fuse. If some engineer has made undocumented fixes or “improvements” to it (which is probably what caused it to short out to begin with), then the pre-fabricated fuse you replace it with may cause unintended side effects and further headaches. Replacing instead of fixing is what all high-performing organizations do, and it gives them high server-to-sysadmin ratios, less unplanned work, and the ability to maintain manageable system configurations.

Create A Release Management Team

This team’s primary responsibility is constructing the mechanisms to deploy the best configurations into production. They won’t be doing the builds; instead, they’ll be designing the build. You’ll want your best and brightest for this critical task. Take your most senior IT operations staff and move them out of reactive firefighting and into a more proactive, structured regimen of release engineering designed to decrease downtime and increase your successful change rate. Rebuilding infrastructure rather than repairing it reduces complexity and cost while improving the manageability of your operations environment. Its benefits, in increasing order of importance:

  •  An automated rebuild takes a known amount of time, whereas firefighting almost always takes longer than originally estimated
  •  Rebuilding introduces less configuration variance than repeated break-and-fix cycles (“What all did we just change to get it to work now?”)
  •  Automated builds are usually better documented and less complicated, so junior staff can monitor them, which frees senior staff from firefighting
  •  When senior staff break free of firefighting, they can work on new build projects that fix other systemic issues (addressing the real root causes of problems instead of just their symptoms)

Create A Repeatable Build Process

From the “Find Fragile Artifacts” step I introduced last week, we have our starting point in this next project – namely, getting fragile infrastructure replaced with stable builds that can be run by junior staff. Infrastructure blueprints should be stored in an online, protected, backed-up repository known as a Definitive Software Library (DSL). This library not only houses all the relevant software applications your environment requires to operate, but also license keys, patches, etc. The release management team should now consider the following:

  1.  Identify the common set of essential services and components used across your infrastructure (OSs, applications, business rules, and data)
  2.  Create a list of standardized components, called a “build catalog.” Look for ways to create components that can be reused and combined to create standardized configurations (e.g. Apache and Oracle both installed on a Solaris OS)
  3.  For each component in the build catalog, create a repeatable build process that generates it. The goal is to have an automated build system that can provision the package by “pushing a button” (e.g. NIM for AIX or JumpStart for Solaris)
  4.  Any testing environment should be isolated from the production network to ensure that it does not disrupt production systems and to make sure that all dependencies outside the test environment are fully documented and understood
  5.  Ensure that you can recreate each system with a bare metal build. The goal is a repeatable process that eliminates tedious, error-prone manual work (e.g. by using virtual images from Xen or VMware)
  6.  For critical HA or load-balanced environments, develop reference builds that can provision a box from bare metal without human intervention (e.g. triggered as a workflow action by a certain Nagios alert)
  7.  When the build engineering process has been completed, store the resulting builds in the DSL
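
To make steps 2 and 3 a little more concrete, here is a minimal Python sketch of a build catalog in which each component declares what it depends on, so a standardized configuration can be assembled in a predictable install order. The `Component` schema, the `build_order` helper, and the version strings are all assumptions made for illustration; they do not come from the handbook or from any particular provisioning tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One reusable entry in the build catalog (hypothetical schema)."""
    name: str
    version: str
    requires: tuple = ()  # names of components that must be installed first


def build_order(catalog, targets):
    """Return component names in install order, dependencies first."""
    order, visiting = [], set()

    def visit(name):
        if name in order:
            return  # already scheduled
        if name in visiting:
            raise ValueError(f"dependency cycle at {name}")
        if name not in catalog:
            raise KeyError(f"{name} is not in the build catalog")
        visiting.add(name)
        for dep in catalog[name].requires:
            visit(dep)
        visiting.discard(name)
        order.append(name)

    for target in targets:
        visit(target)
    return order


# Illustrative catalog; the version strings are made up.
CATALOG = {
    c.name: c
    for c in (
        Component("solaris", "10"),
        Component("apache", "2.2", requires=("solaris",)),
        Component("oracle", "10g", requires=("solaris",)),
    )
}

# A standardized "web + database" configuration built from catalog entries.
print(build_order(CATALOG, ["apache", "oracle"]))
```

Because every configuration is expressed only in terms of catalog entries, asking for a component that is not in the catalog fails immediately rather than producing a one-off build.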

Create And Maintain The Definitive Software Library (DSL)

Let’s face it – it can be downright horrifying to check out source code from a development repository for release at 1 a.m. As a sysadmin, you cringe while executing the build script, cross your fingers and plead, “Will it build without any errors?” If you weren’t able to convince the development group to adopt a Committer Role, here’s your second chance. Let the release management team build it for you and commit the result to the DSL (at 1 p.m., when they can easily track down the developer who made that “oh, just one more quick fix” commit), so the operations team can release it with confidence.

  1.  Designate a manager to maintain the DSL who will be responsible for authorizing the acceptance of new applications and packages
  2.  Create an approval process for accepting items into the DSL
  3.  Establish a separate network to house the DSL and the required build servers (e.g. a DMZ)
  4.  Any software accepted into the DSL (both retail and internally developed applications) must be under revision control
  5.  Audit the DSL to ensure that it contains only authorized components
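
The audit in step 5 can be partly automated. The sketch below assumes the DSL is a directory tree and that the DSL manager keeps an approved manifest mapping each relative path to a SHA-256 digest. Both the layout and the `audit_dsl` helper are hypothetical and would need adapting to however your library is actually stored.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large packages need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_dsl(dsl_root, approved):
    """Compare files under dsl_root with the approved manifest.

    approved maps a relative path to its expected SHA-256 hex digest.
    Returns (unauthorized, modified, missing) as sorted lists of paths.
    """
    root = Path(dsl_root)
    found = {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in root.rglob("*")
        if p.is_file()
    }
    unauthorized = sorted(set(found) - set(approved))  # present, never approved
    missing = sorted(set(approved) - set(found))       # approved, not present
    modified = sorted(
        name
        for name in set(found) & set(approved)
        if found[name] != approved[name]               # contents have changed
    )
    return unauthorized, modified, missing
```

Anything reported as unauthorized or modified would go back through the approval process from step 2 rather than being quietly accepted.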

Close The Loop Between Production And Pre-Production

So you have a beautiful test environment that mirrors your production infrastructure – or so you think. Do you have automated detection of changes on all production servers? Are you sure that not a single developer has access to production? Even if only your operations staff has access, most systems today comprise thousands of files and perhaps hundreds of separate configurations, so the odds are against anyone remembering to document every single production change. You have to use automated configuration management tools (such as Cfengine or Puppet) to ensure that production builds stay in sync with the golden builds in your DSL.
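
To show the basic idea behind drift detection, here is a small Python sketch that compares the settings recorded for a golden build with those found on a production server. It is not how Cfengine or Puppet work internally and uses none of their APIs. The `config_drift` function and the sample settings are assumptions for illustration only.

```python
def config_drift(golden, production):
    """Return settings that differ between the golden build and a server.

    Both arguments map a setting name to its value. The result maps each
    drifted name to a (golden_value, production_value) pair; None marks a
    setting that is absent on that side.
    """
    drift = {}
    for key in sorted(set(golden) | set(production)):
        expected = golden.get(key)
        actual = production.get(key)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift


# Hypothetical Apache-style settings, for illustration only.
golden = {"MaxClients": "150", "KeepAlive": "On", "Timeout": "300"}
production = {"MaxClients": "400", "KeepAlive": "On", "LogLevel": "debug"}

for name, (expected, actual) in config_drift(golden, production).items():
    print(f"{name}: golden={expected!r} production={actual!r}")
```

An empty result means the server still matches its golden build. Anything else is either an undocumented change to roll back or a legitimate change that should be folded into the next build.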

If you’ve made it this far, you should have reduced unplanned work to 15% or less. The IT Process Institute reports that by drastically reducing configuration counts, you can shift a significant share of staff time from unplanned to planned work, and consequently increase the server-to-sysadmin ratio. By now you have:

  •  created a release engineering team to define and generate infrastructure that can be repeatedly built
  •  further decreased uncontrolled production changes, which increases the amount of time available to work on planned tasks
  •  created a new problem resolution mechanism, making it cheaper to rebuild than to repair – a viable alternative to protracted fire-fighting
  •  enabled shifting of senior staff from the front-lines to the release management area, where the defect repair costs are lowest
  •  closed the loop between production and release management, to curb production configuration variance
  •  enabled the continual reduction of unique configurations in deployment, increasing the server to sysadmin ratio

Stay tuned next week for some concluding words about the Visible Ops Handbook and how to pull it all together!
