Tuesday 25 October 2022

The Ten Commandments of IT Operations

Ten Commandments of IT

I've had these committed to memory for some time, and I've finally decided to write them down. There's always been a bit of fuzziness around the edges of these rules, so I suppose writing them here kinda sets them in stone. Anyway, for better or worse:

1. What happens in the Server Room stays in the Server Room 

A long time ago when I was a poor Uni student, I had a temp job at "The Roofing Company". On my first day I was told to tie up a pallet of corrugated iron using the strapping machine, which used a reel of metal strap. Nobody showed me how to use it and I fed the metal strap into the machine the wrong way round. The whole thing unspooled about 50 metres of strap onto the floor of the warehouse.

Later in the day a manager saw the mess and wanted to know who did it. A colleague looked directly at me and said "It was that bloke from Queensland who was here yesterday, wasn't it?" I nodded my head and the manager stormed away.

The lesson wasn't lost on me. Although I was to blame, the root cause lay in my not being provided training on the equipment.

Most of the time, your boss (or his boss) will be non-technical. Anyone who works in IT knows that the only people in IT who don't make mistakes are those who do no work. So don't throw your colleagues under the bus when they make mistakes. Do a root cause analysis. Discuss it amongst yourselves. Work out what changes need to be made to prevent it in the future. Then report - as a team - the actual root cause (or at least a plausible one). 

If the blame must lie with a person, then it lies with all of you, or the team leader.

2. Always have a second method of access - preferably more 

This applies to anything. If you have an electronic lock, have a physical key somewhere. If you use Citrix, make sure you have VPN as well. This includes having an extra admin account for all systems. Test the other methods of access.

By extension, have a backup internet link for inbound purposes, and out-of-band access for physical devices (such as iLO and DRAC), including the ability to power cycle them if necessary.

In short, eliminate all SPOFs (Single Points Of Failure) for everything.
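
"Test the other methods of access" is easy to say and easy to forget, so here's a rough Python sketch of how it could be scripted. The path names, hostnames and ports are all made up for illustration (they are assumptions, not real systems), and a successful TCP connect only proves the port answers, not that you can actually log in.

```python
import socket

# Hypothetical access paths: every name, host and port here is a placeholder.
ACCESS_PATHS = {
    "citrix": ("citrix.example.com", 443),
    "vpn": ("vpn.example.com", 443),
    "oob-console": ("ilo.example.com", 443),
}


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def check_access_paths(paths, probe=is_reachable):
    """Probe every access path and return a {name: reachable} dict."""
    return {name: probe(host, port) for name, (host, port) in paths.items()}


def working_paths(results):
    """Return the sorted names of the paths that answered."""
    return sorted(name for name, ok in results.items() if ok)
```

Run `check_access_paths(ACCESS_PATHS)` on a schedule and raise a flag when `working_paths()` returns fewer than two names. The point is to find out that the second way in is broken before you need it.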

3. DDUF - Don't Do Updates on Friday 

Unless you like working over the weekend.
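
If your updates are kicked off by a script, the rule can be enforced rather than remembered. A minimal sketch in Python, assuming a hypothetical `is_update_day` helper that your update script calls first; blocking only Friday is my reading of the rule, so add other days to taste.

```python
import datetime

# Python's date.weekday() numbers Monday as 0, so Friday is 4.
FRIDAY = 4


def is_update_day(day, blocked_weekdays=(FRIDAY,)):
    """Return True if updates are allowed on the given date."""
    return day.weekday() not in blocked_weekdays
```

An update script could then bail out early with something like `if not is_update_day(datetime.date.today()): raise SystemExit("DDUF")`.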

4. Take snapshots. Test Backups. 

Take snapshots of virtual machines before making changes. Also test your backups by doing a trial restore from time to time. On occasion, test a DR restore. 
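
A trial restore is only a test if you check what came back. Here's one way that check might look in Python: restore to a scratch directory, then compare checksums against the source. The function names are mine, and this sketch assumes the source files haven't changed since the backup was taken, so it suits static data or a quiesced copy.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_restore(source_dir, restored_dir):
    """Return relative paths that are missing from, or differ in, the restore."""
    source_dir = Path(source_dir)
    restored_dir = Path(restored_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file() or sha256_of(src) != sha256_of(restored):
            problems.append(rel.as_posix())
    return problems
```

An empty list means every source file came back intact. Anything else is a list of files to go and investigate before you need that backup for real.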

5. ABC - Always Be Coding (or scripting) 

A lot of people who gain a tertiary qualification in IT will never write a line of code after they graduate. I don't care what it's in (Bash, PowerShell, Python, C#, VBScript); make some form of coding a regular part of your work week. Don't let your skills atrophy. Sure, you CAN do your job without coding, but you can do it better WITH coding.

6. A ten minute job takes ten minutes. Five minute jobs take two hours. 

We're all tempted to do something quick. Something that will only take five minutes. Usually we do them when we only have five minutes to spare. 

It's a trap. Those five minute jobs have been sitting there, gathering dust for some time, otherwise you would have done them some time ago. Chances are that your recollection of what really needs to be done has faded and there's more to the job than you remember. Or the conditions have changed. Or someone has done something else in the meantime. By contrast, ten minute jobs are usually on point. You do them regularly and know them well. They take ten minutes to do and that's it.

All of a sudden that five minute job is taking some time, and you can't roll back. You have to plough on through and finish in around two hours - usually after calling home to say you will be late. But that leads us to point 7.

7. Take breaks. Don't start anything new thirty minutes before you leave. 

Sometimes you need to come up for air. Take a break, go for a walk and come back with a fresh set of eyes. You will be way more productive this way than simply having your head down for several hours.

The second part of this is don't start any new work with thirty minutes (or less) to go. Step back, do some documentation, fill out a form, tidy your desk, plan tomorrow, reply to emails, rearrange files. There's always plenty of busy work to do instead of starting something you realistically won't be able to finish, or that you'll feel pressured to rush. Any new work you start now won't be finished before you leave unless you work back late, and chances are that tomorrow you won't remember where you got up to.

8. Sanity check everything. Even simple stuff. 

Whether it's code, configuration, deployment or modification, get someone to check your work. You'd be surprised how easy it is for someone else to pick up a mistake you may have made. And if something you do does go south, you can always say you had someone check it out too.

9. Especially simple stuff.

This particularly applies to 'simple things'. Mistakes are much easier for others to spot, because you can easily be blinded by proximity or familiarity. Forgot to remove a comment tag? Still logging in debug mode? Forgot to enable that service?

Rookie mistakes, but we all do them. Not just rookies.

10. It's not yours. It never was.

Work with any system long enough, put enough of your blood, sweat and tears into something and it will feel like you own it. Like it's your baby.

It's not. It never was.

Sometimes, our recommendations will be ignored. We'll be told to do something we don't want to do. We'll be tempted to ignore that instruction because we know better.

Don't. It's not your system. You're just paid to maintain it. If your boss wants to melt it down and make ornaments of it, he can do that. It's his, not yours.


So that's it. Thirty years of wisdom condensed down into ten simple rules. If you can add to the list, feel free to comment.