No Manual Changes refers to the behavioural trait of not messing with any production systems by hand. Let’s discuss why messing with production systems is bad and what to do about it.
Manual Changes Lead to Configuration Drift
You know how this goes: Your servers are under heavy load and you just want to kick up the number of worker processes a bit. You ssh into the first server and just start a bunch of additional workers. Now you sit back and see whether the box is behaving as expected. You find that all is fine and make a mental note that you need to persist your change in the startup script of your service and you need to roll out that change to all other servers. But your mental note is quickly forgotten because another urgent issue demands your full attention…
You end up with one server configured differently from the rest of the bunch. Even worse, your configuration is not even restart-safe. As soon as anyone starts an automated configuration run or restarts the box, your changes are gone. In our example, this might lead to the box crashing under load without anyone understanding why.
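To make drift like this visible, here is a minimal sketch in Python. It assumes you can already collect each server's effective settings into a dictionary; the host names, the `workers` key, and the `find_drift` helper are all made up for illustration and do not come from any particular tool.

```python
from collections import Counter


def find_drift(host_settings):
    """Report hosts whose settings differ from the most common value.

    host_settings maps a host name to a dict of its effective settings.
    Returns {host: {key: (actual, expected)}} for every deviating host.
    Hypothetical helper: "expected" is simply the majority value, which
    is an assumption and not necessarily the value you actually want.
    """
    drift = {}
    keys = {key for settings in host_settings.values() for key in settings}
    for key in sorted(keys):
        values = [settings.get(key) for settings in host_settings.values()]
        expected, _count = Counter(values).most_common(1)[0]
        for host, settings in host_settings.items():
            actual = settings.get(key)
            if actual != expected:
                drift.setdefault(host, {})[key] = (actual, expected)
    return drift


# Example data (invented): web1 was changed by hand during the incident.
fleet = {
    "web1": {"workers": 16},
    "web2": {"workers": 8},
    "web3": {"workers": 8},
}
print(find_drift(fleet))
```

A report like this only tells you that a box has drifted, not why. The real fix is still to stop making the manual change in the first place.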
Automate All Changes
The way to go is obvious. Change the number of worker processes in your configuration management tool and let it reconfigure the box for you. This approach makes sure that your changes survive restarts and configuration runs. And it makes sure that your teammates see what you’ve done if any problems arise. Even your developers are now able to see that their code might need optimizations because it requires so many worker processes to run (way more than they initially expected).
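The core idea behind such a tool can be sketched in a few lines of Python: describe the desired state in one version-controlled place, then apply it idempotently. This is only an illustration under stated assumptions, not how Puppet or Chef are implemented; `render_service_config`, `ensure_file`, and the `key=value` file format are hypothetical.

```python
from pathlib import Path


def render_service_config(settings):
    """Render settings as sorted key=value lines (an assumed file format)."""
    lines = [f"{key}={settings[key]}" for key in sorted(settings)]
    return "\n".join(lines) + "\n"


def ensure_file(path, content):
    """Write content to path only if it differs from what is already there.

    Returns True if the file was changed and False if it was already in
    the desired state, so repeated runs are safe.
    """
    path = Path(path)
    if path.exists() and path.read_text() == content:
        return False
    path.write_text(content)
    return True
```

In a setup like this, raising the worker count means editing the settings in version control and running something like `ensure_file("service.conf", render_service_config({"workers": 16}))` on every box. The same run repeated later changes nothing, and a hand-edited file is put back to the recorded state.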
Congratulations, you’ve made one more step toward DevOps collaboration!
No Rule Without Any Exception – Really?
You might argue that my example is a bad one. If your production systems are in trouble, there’s no time to lose in going the extra mile of automation. Jump in and fight the fire!
While this approach might look reasonable at first glance, it is a very dangerous one. Especially when in firefighting mode, you might not remember all the changes made to your production system. You think your problems are solved, but in reality they’ll come back worse than ever (and sooner than you expect).
You need to prepare in advance to avoid the need to go in and hot fix. Set up log monitoring services like Splunk or Graylog2 to be able to analyze what is happening. And you should have your configuration management tool (like Puppet or Chef) set up so that you can try out possible solutions without having to go in and do it manually.
How do you avoid manual changes to your production systems? Do you “electrify the fence” as described in the Visible Ops Handbook? Please let us know in the comments.