Yup, that’s me. You’re probably wondering how I ended up in this situation.
Ever been stuck in a comedy of errors and thought the above to yourself? Maybe it wasn’t even a comedy of errors, just an ordinary day. You go to work, you check your dashboards like you do every morning, and maybe you resolve a few issues or work on a project that for some reason keeps changing/has no business existing/makes you question all the choices that led to this point in time. But suddenly, something goes wrong and it all comes toppling down, crashing and burning.
Is this a coincidence? Let’s take a look at the definition:
I am not superstitious, but I am a little stitious. Still, I refuse to believe that waking up on the wrong side of the bed, or my last Change Request failing to appease the IT Gods, is to blame, especially when disaster occurs. So let’s take a scientific approach, or as it is called: The Swiss Cheese Model!
The Swiss Cheese Model
You are welcome to google a nicer definition, but this is how I learnt and understood it: a layered approach to looking at an incident and analysing its causes, to better understand how to prevent it in the future. The model has traditionally been used in aviation safety, healthcare and emergency services. However, it is starting to find its way into computer security and defense in depth.
Most incidents can be traced to one or more of four levels of failure:
- Organisational influences: Organisational culture, processes and resource management.
- Unsafe supervision: Usually centred on the human elements and individuals involved. Inadequate supervision, planned inappropriate operations, failure to correct a known problem.
- Preconditions for unsafe acts: Situational factors such as physical environment, tools and technology. Personnel factors such as communication, coordination and planning.
- Unsafe acts: Errors, whether decision, perceptual or skill-based. Violations, whether routine or exceptional.
Why Swiss Cheese? The cheese slices represent a series of barriers, and the holes in each slice are the weaknesses in that part of the system, if you will. When the holes in every slice align, the incident is free to pass straight through.
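For the numerically inclined, here is a minimal sketch of the "all the holes align" idea. It assumes each slice fails independently of the others (a generous simplification), and the percentages are invented purely for illustration; none of them come from a real incident.

```python
from math import prod


def incident_probability(hole_chances):
    """Chance a problem slips through every slice, assuming independent slices."""
    return prod(hole_chances)


# Hypothetical chance that each layer fails to stop the problem.
layers = {
    "organisational influences": 0.10,
    "unsafe supervision": 0.20,
    "preconditions for unsafe acts": 0.25,
    "unsafe acts": 0.50,
}

# Every slice has to fail for the incident to get through.
print(f"All four slices: {incident_probability(layers.values()):.4f}")

# Take one slice away and see how much more gets through.
without_supervision = [p for name, p in layers.items() if name != "unsafe supervision"]
print(f"Three slices:    {incident_probability(without_supervision):.4f}")
```

The takeaway is the shape rather than the numbers: every extra slice, or every hole you plug, multiplies the odds down.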
TL;DR – Root Cause Analysis
Multiple servers are unresponsive, particularly those hosted in the Ship Creek data centre. Attempts to contact the data centre have been fruitless. The on-call engineer is contacted at 2AM, is perplexed as to what has occurred and needs to get things back up five minutes ago. However, they are flying blind: it is the festive period, and blizzard conditions and distance mean they cannot travel to the site.
Attempts to log in to the monitoring systems have failed. However, the engineer’s email shows multiple Veeam ONE alerts reporting a temperature alarm.
This hints at the current issue, but more information is required, let alone confirmation! The engineer recalls that they follow the 3-2-1 rule (3 copies of data, 2 media types and 1 offsite) with a cloud service provider. Contact is made with the service provider to establish the timestamp of the last successful backup and/or replica and so assess the level of data loss.
There is still the matter of physical confirmation. What happened to this site?
The data centre eventually makes contact and confirms that a fire broke out and may have affected the servers and their related hosts.
Failover plans are available, and a copy of the backup configuration file was made as well. Based on the information from the cloud service provider, the engineer outlines their plan of attack… or in this case, resolution. However, the engineer is not authorised to run the failover: authorisation is restricted to one user (the CTO), who is currently unavailable because he is on a plane.
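As a rough sketch of what the engineer is working out at this point: does the setup actually satisfy 3-2-1, and how much data is at risk? The copy names, media types and timestamps below are all made up for illustration, and this is plain Python rather than anything from Veeam’s tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DataCopy:
    name: str
    media: str      # e.g. "disk", "tape", "object storage"
    offsite: bool


def meets_3_2_1(copies):
    """3 copies of the data, on 2 media types, with at least 1 offsite."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


def data_loss_window(last_good_copy, incident_time):
    """Time between the last successful backup/replica and the incident."""
    return max(incident_time - last_good_copy, timedelta(0))


# Hypothetical inventory: production, a local repository, and the cloud copy.
copies = [
    DataCopy("production servers", "disk", offsite=False),
    DataCopy("local backup repository", "disk", offsite=False),
    DataCopy("cloud service provider", "object storage", offsite=True),
]
print(meets_3_2_1(copies))

# Hypothetical timestamps: last good cloud copy vs. when things went dark.
last_good = datetime(2023, 12, 25, 22, 0)
incident = datetime(2023, 12, 26, 1, 30)
print(data_loss_window(last_good, incident))
```

With those invented timestamps the window is three and a half hours, which is the number the business will want to hear before anyone presses the failover button.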
Applying cheese to the cracker
Now, let’s look at some of the holes that this problem crept through and continued. There are some things that were within the engineer’s control and some that weren’t.
There is a multitude of fixes that could be considered to resolve the incident. For starters, the CTO can delegate authority and stop being the single point of failure! This cheese platter outlines a variety of preventative measures and factors that need to be addressed.
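To make the delegation point concrete, here is a small sketch of an approver chain. The role names other than the CTO are hypothetical, as is the idea of storing the chain as a simple ordered list; a real organisation would keep this in its change management or identity tooling.

```python
def first_available_approver(approvers, available):
    """Return the first approver in the chain who can be reached, or None."""
    for person in approvers:
        if person in available:
            return person
    return None


# Who can actually be reached at 2AM over the festive period (hypothetical).
available = {"On-call engineer", "IT Manager"}

# The scenario as written: one approver, currently at 30,000 feet.
single_approver = ["CTO"]
print(first_available_approver(single_approver, available))

# With a delegation of authority: a hypothetical second name on the list.
delegated = ["CTO", "IT Manager"]
print(first_available_approver(delegated, available))
```

The first lookup comes back empty and the second finds someone, which is the whole difference between waiting for a plane to land and starting the failover.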
What has not been outlined is the planning the engineer did to reach the desired destination, such as identifying the latest restore points available and confirming the failover plans.
So, to answer the question: what do DR and Swiss cheese have in common? Holes. By analysing a procedure, we can reduce the number of holes that lead to that slice of incident. Although some of the failures in the scenario may seem blatant, it isn’t far outside the realm of possibility, and chances are some readers have had worse.
I leave you with a few healthy cheese puns:
What did the aged cheddar say when his mom told him he couldn’t see a movie that was rated R?
“I’m mature for my age.”
What is a cheese lover’s favorite Village People song?
What did Shakespeare say as he was making a cheese plate?
To brie or not to brie.