Thursday 6 December 2018

On Cargo-Cult System Administration

During World War II, many Pacific islands were used as fortified air bases by Japanese and Allied forces. The vast amount of equipment airdropped onto these islands drastically changed the lifestyle of the indigenous inhabitants. There were manufactured goods, clothing, medicine, tinned food and more. Some of this was shared with the inhabitants, many of whom had never seen outsiders and for whom modern technology may as well have been magic - the purveyors of which seemed like gods.

After WWII finished, the military abandoned the bases and stopped dropping cargo. On the island of Tanna in Vanuatu, the "John Frum" cult sprang up. In an attempt to get cargo dropped by parachute or landed by plane, natives carved headphones from wood, made uniforms, performed parade drills, built and manned towers, and waved landing signals on runways. They imitated in every way possible what they had observed the military doing, in an attempt to "summon" cargo from the sky.

The practice wasn't limited to Vanuatu; many other Pacific islands developed "Cargo Cults". This may be surprising to us - even amusing - but fundamentally it stems from a disconnect between observed practice and an understanding of how systems (in this case logistical systems) work. The native observer has no idea that the actions of the soldiers they saw don't cause the cargo to appear; they merely facilitate it.

Eric Lippert described "cargo cult programming" as follows:
"The cargo cultists had the unimportant surface elements right, but did not see enough of the whole picture to succeed. They understood the form but not the content. There are lots of cargo cult programmers - programmers who understand what the code does, but not how it does it. Therefore, they cannot make meaningful changes to the program. They tend to proceed by making random changes, testing, and changing again until they manage to come up with something that works."

The IT world is rife with Cargo Cult System Administrators. The condition usually manifests itself as an instinctive reaction to reboot a server as a first resort when anything goes wrong, without any effort to understand cause and effect. If that doesn't work, they start disabling firewalls or running system cleaners. If they actually manage to fix a problem, they invent a reason with no evidence ("It must have been a virus"). If they are reporting to someone non-technical (which is usually the case), this explanation often gets accepted.

The big problem with "fixes" like this is that they often cause collateral damage, degrade performance or weaken system security.

How do these people keep jobs in IT? It's not surprising that this level of ignorance exists amongst lay users, but it shouldn't exist in support staff. If those in charge of IT hiring are not themselves experts, though, they don't know any better either, and there seems to be this idea that a non-technical manager is perfectly capable of overseeing IT departments.

With the increasing complexity of systems, it is all the more important to hire people who actually know what they are doing. At the very least, you need people who don't think IT is somehow mystical and who have a grounding in the basic technologies: people who understand binary and hexadecimal number systems; have a grounding in basic electronics and circuit theory; can follow and develop algorithms and cut code in at least one programming language; can formulate troubleshooting steps from a block diagram of the system; and understand the principle of abstraction and how to work through layers of abstraction.

A simple yet effective method of Root Cause Analysis for any problem is known as 5Ys (Five Whys). It was developed by Sakichi Toyoda and adopted as best practice by the Toyota Motor Corporation. In its simplest form, it involves asking an initial 'why' question and getting a simple answer. You then ask 'why' of that answer and work it out, then ask 'why' of the next answer, and so on until you've asked 'why' five times and been given five increasingly low-level answers. When you can no longer ask 'why', you have your root cause. An example of 5Ys is:


The vehicle will not start. (the problem)
  1. Why? - The battery is dead. (First why)
  2. Why? - The alternator is not functioning. (Second why)
  3. Why? - The alternator belt has broken. (Third why)
  4. Why? - The alternator belt was well beyond its useful service life and not replaced. (Fourth why)
  5. Why? - The vehicle was not maintained according to the recommended service schedule. (Fifth why, a root cause)
An extension of this method is the Ishikawa diagram - also known as a fishbone diagram.
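
The Five Whys chain in the vehicle example can also be written down as a small data structure, which is handy if you want to keep the analysis alongside an incident record. This is only a minimal sketch: the `FiveWhys` class and its method names are made up for illustration and are not part of any standard tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FiveWhys:
    """Records one problem and the chain of 'why' answers (hypothetical helper)."""
    problem: str
    answers: List[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> None:
        """Record the answer to 'why?' asked of the previous statement."""
        self.answers.append(answer)

    def root_cause(self) -> Optional[str]:
        """Return the deepest answer so far, or None if nothing has been asked."""
        return self.answers[-1] if self.answers else None

    def report(self) -> str:
        """Format the chain in the same layout as the worked example."""
        lines = [f"Problem: {self.problem}"]
        for depth, answer in enumerate(self.answers, start=1):
            lines.append(f"  {depth}. Why? - {answer}")
        return "\n".join(lines)


analysis = FiveWhys("The vehicle will not start.")
for answer in [
    "The battery is dead.",
    "The alternator is not functioning.",
    "The alternator belt has broken.",
    "The alternator belt was well beyond its useful service life and not replaced.",
    "The vehicle was not maintained according to the recommended service schedule.",
]:
    analysis.ask_why(answer)

print(analysis.report())
print("Root cause:", analysis.root_cause())
```

The code does nothing clever; its only value is that it forces each answer to be written down before the next 'why' is asked.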

So, the next time you are tempted (or pressured) to reboot a server because it is not "working", perform 5Ys as a starting point. Find out what isn't working and ask why. Look at all the possible things that could stop it from working and test them one at a time. You may still end up rebooting the server, but you will have a better understanding of how the system you are fixing works, and you may even be able to put preventative measures in place to stop the problem from happening again.
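
Testing the possible causes one at a time can be as simple as a list of named checks run in order before anyone reaches for the power button. The sketch below assumes Python; `run_checks`, `first_failure` and `disk_has_free_space` are illustrative names, the 10% free-space threshold is arbitrary, and the second check is a placeholder to be replaced with something real for your system.

```python
import shutil
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]


def disk_has_free_space(path: str = "/", min_free_fraction: float = 0.10) -> bool:
    """True if at least min_free_fraction of the filesystem at path is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_fraction


def run_checks(checks: List[Check]) -> List[Tuple[str, bool]]:
    """Run each check in order, one at a time, and record pass or fail."""
    results = []
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:
            # A check that raises is treated as a failure worth investigating.
            ok = False
        results.append((name, ok))
    return results


def first_failure(results: List[Tuple[str, bool]]) -> Optional[str]:
    """Name of the first failed check, or None if everything passed."""
    for name, ok in results:
        if not ok:
            return name
    return None


checks: List[Check] = [
    ("disk has free space", disk_has_free_space),
    ("service answers on its port", lambda: True),  # placeholder, not a real test
]

results = run_checks(checks)
for name, ok in results:
    print(f"{'PASS' if ok else 'FAIL'}: {name}")
print("Start asking 'why' at:", first_failure(results))
```

The first failing check is where the 'why' questions start; a reboot only goes on the list once the cheaper explanations have been ruled out.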
