The need for downtime
If your business thinks it cannot afford to schedule controlled downtime for its services, you’re doomed to face a major issue when an unexpected failure occurs.
One line I keep hearing lately is that a company cannot afford any kind of downtime. Not even a few seconds.
Let’s face it: whatever the system is, it is bound to go offline at some point. There is no such thing as 100% uptime. Amazon, Google, Facebook, Twitter, Microsoft Azure and even the NYSE have all gone offline. All these downtimes were unexpected and hit the businesses relying on those services hard.
I don’t pretend this to be a guideline for avoiding such large outages, but if your business thinks it cannot afford to schedule controlled downtime for its services, you’re doomed to face a major issue when an unexpected failure occurs. Things can go very badly when lives are at risk, but they can also go badly when you are simply running a service-oriented business.
Unexpected downtime in high availability environments can be caused by many factors: lack of patching, firmware failure, an unmonitored standby node malfunctioning, missed replication of data across nodes, and so on. Not to forget hardware issues such as power supply unit failures, data corruption on hard drives and memory faults, or external factors such as the loss of the power supply itself.
A good practice I have been following and enforcing is to plan and schedule controlled downtimes.
Finding a time window in which the business allows you to tear down a single system and see what happens when you bring it up again is a good exercise. It also confirms that your startup procedures are correct when the time comes to bring the business back online.
The downtime also allows you to reboot systems and patch them, so you can be sure that they will actually restart after any type of power failure or crash, and that the operating system boots correctly with the new patches installed.
If anything goes wrong, you need to be sure you can roll back your configuration and return systems to their previous state, which in turn puts your backup and patch management policies to the test.
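One way to check that a system really came back after a planned reboot is to probe the services it is supposed to expose. The following is only a minimal sketch in Python: the function names and the service list are hypothetical, and a TCP connection succeeding is assumed to be a good enough sign that a service has started, which will not be true for every application.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, host unreachable, and similar errors.
        return False


def failed_checks(services: dict[str, tuple[str, int]]) -> list[str]:
    """Return the names of the services that did not answer after the restart."""
    return [
        name
        for name, (host, port) in services.items()
        if not port_is_open(host, port)
    ]


if __name__ == "__main__":
    # Hypothetical hosts and ports: replace them with your own services.
    expected = {
        "web": ("127.0.0.1", 443),
        "database": ("127.0.0.1", 5432),
    }
    for name in failed_checks(expected):
        print(f"{name} did not come back after the reboot")
```

Running a check like this at the end of every maintenance window turns "I think it restarted" into a list of what is still down.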
In clustered environments, flipping the nodes’ roles and periodically rebooting the slave nodes helps ensure that the cluster is actually able to fail over. A cluster that does not converge has no reason to exist and can do more harm than good. Also, a passive node, as long as it is powered on, is just as subject to failure as the active node. Discovering during a cluster failover that the passive node was in a failed state is not on the list of things you’d like to happen.
Planned downtime is good; in my opinion, it is a requirement. You should not simply rely on cluster features: you need to test them, you need to master the process of managing an HA infrastructure, and you need to drill simulated failure events. It may be called the “Kata of HA”, where common tasks are repeated to ensure that the failover and failback processes work correctly and keep the business running safely.
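To make the idea of a drill concrete, here is a toy simulation in Python. It is a sketch, not a real cluster manager: the `Node` and `Cluster` classes, the `healthy` flag and the `drill` function are all hypothetical names invented for the illustration, and a real drill would trigger the failover through your own cluster software.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True  # hypothetical flag; a real check would probe the node


@dataclass
class Cluster:
    active: Node
    passive: Node
    log: list = field(default_factory=list)

    def failover(self) -> bool:
        """Swap roles if the passive node is healthy; report whether it worked."""
        if not self.passive.healthy:
            self.log.append(f"failover refused: {self.passive.name} is unhealthy")
            return False
        self.active, self.passive = self.passive, self.active
        self.log.append(f"{self.active.name} is now active")
        return True


def drill(cluster: Cluster) -> bool:
    """Fail over, then fail back; the drill passes only if both steps succeed."""
    original = cluster.active.name
    if not cluster.failover():
        return False
    if not cluster.failover():
        return False
    return cluster.active.name == original


if __name__ == "__main__":
    good = Cluster(Node("node-a"), Node("node-b"))
    print(drill(good), good.log)

    broken = Cluster(Node("node-a"), Node("node-b", healthy=False))
    print(drill(broken), broken.log)
```

The second cluster is the case described above: the passive node has silently failed, and only the drill reveals it, at a moment you chose rather than during a real outage.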
It also gets users psychologically used to dealing with downtime: it’s never nice to have 50 people wandering around the coffee machines without a purpose, and having 50 people call the IT service desk during an outage does not exactly help solve the issue. In exactly the same way, you wouldn’t like to receive complaints from all your customers during a service outage.
Anticipating “Murphy’s Law” is a good practice that every business should embrace, and it doesn’t matter whether you have an on-premises installation or a cloud service. You should consider the impact of downtime and manage it so that it doesn’t affect your activity.
11/03/2015 00:00:00