There’s a lot of discussion about preventing downtime. As a DBA and IT professional, it’s my sworn duty to prevent downtime. I usually describe my job as a DBA along the lines of, “to make sure data is always available to the people and applications that need it, and never available to the people and applications that shouldn’t have it.” Preventing downtime is certainly important for that first part. But how the heck do you define downtime?
What “downtime” means to most DBAs
When DBAs think about downtime, we often think in terms of “the server is offline” or “the database isn’t responding to queries” or even “the database isn’t responding to queries fast enough.” But is that really downtime?
We get caught up in making sure systems are always online, as much as possible. We want redundant storage, redundant servers, redundant networks. We agonize over RPO & RTO. But how much of that is necessary? Are we actually worrying about the right things? Are we forgetting about other scenarios that will affect our having “downtime” and our ability to recover from it?
If I fail over my AG to another replica, is that downtime?
If my secondary AG replica is offline, is that downtime?
If a VM reboots, and “disappears” for 3 minutes, is that downtime?
If a VM reboots, and “disappears” for 3 minutes at 2AM, is that downtime?
If a login is temporarily locked out, is that downtime?
If an offline index build makes dbo.transactions unavailable for an hour, is that downtime?
If overnight processing doesn’t run, is that downtime?
If overnight processing runs longer than usual, is that downtime?
If downtime is scheduled, is that downtime?
What “downtime” means to the business
For most business users, “downtime” is just their inability to use the application/system. They aren’t concerned with what is happening behind the scenes. They don’t care (at least they shouldn’t) if any given server is up, or if there was a drive failure. They just care that they can do work, and that the company can make money.
If a user cannot buy widgets from your website, is that downtime?
If widgets disappear from the user’s online basket, is that downtime?
If the CEO can’t log in to the system after dinner, is that downtime?
If order confirmation emails don’t send, is that downtime?
If users can’t log in (but logged in users can still do things), is that downtime?
If finance can’t process payroll when they normally do, is that downtime?
If accounts receivable can’t generate invoices, is that downtime?
If nobody is logging into the system mid-day, is that downtime?
If nobody is logging into the system at 3AM on a Sunday morning, is that downtime?
It depends on your business
If you are hoping I give you the answer to all those questions, then I’m sorry. I can’t answer any of those things for you. You’ll need to do it yourself: for your environment, your company, and your business users.
I do recommend starting by talking to the business to define what “downtime” means. I once worked at a company where our primary measure of downtime was order volume. As long as we got at least one order per minute, there was no “downtime.” I also worked at a finance company where a delay of just a couple of seconds in executing a trade was unacceptable–but where systems could be down on the weekend, every weekend if necessary.
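To make the order-volume measure concrete, here is a minimal sketch of how "at least one order per minute" could be checked. It assumes you can pull order timestamps from somewhere; the function name `minutes_without_orders` and the sample data are hypothetical, not taken from any real system.

```python
from datetime import datetime, timedelta

def minutes_without_orders(order_times, start, end):
    """Return the start of every one-minute bucket in [start, end) with no orders."""
    # Truncate each order timestamp to the minute it falls in.
    busy = {t.replace(second=0, microsecond=0) for t in order_times}
    gaps = []
    minute = start.replace(second=0, microsecond=0)
    while minute < end:
        if minute not in busy:
            gaps.append(minute)  # by this business rule, a minute of "downtime"
        minute += timedelta(minutes=1)
    return gaps

# Hypothetical sample: five minutes of trading, orders in three of them.
start = datetime(2024, 1, 1, 9, 0)
end = datetime(2024, 1, 1, 9, 5)
orders = [
    datetime(2024, 1, 1, 9, 0, 12),
    datetime(2024, 1, 1, 9, 1, 45),
    datetime(2024, 1, 1, 9, 1, 50),
    datetime(2024, 1, 1, 9, 4, 3),
]
gaps = minutes_without_orders(orders, start, end)
print([g.strftime("%H:%M") for g in gaps])  # ['09:02', '09:03']
```

Notice that nothing here asks whether a server was up. Under this definition, a failover that completes between orders counts as zero downtime, while a healthy database behind a broken checkout page does not.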
Figure out what your business goals are for uptime first. Be realistic. (“Absolutely zero downtime” probably isn’t realistic!) Usually, the business will be thinking in terms of lost revenue, lost productivity, and your reputation as a company. As you discuss scenarios that matter, be careful to separate what the business wants from what the business needs. Only after you have identified the business needs can you start to form a technology plan to support them. Then you can set your own goals so that if you accomplish them, the business will accomplish theirs.
And maybe you can achieve “zero downtime” even if you take a server offline.
Ooo, you should check out the Database Reliability Engineering book. They do a great job of thinking about it differently: not in terms of clock time, but in error rate budgets. The business sets an error rate budget, which determines your financial budgets too. Then, the admins schedule their planned outage windows for when they’ll have fewer app requests (thereby keeping the monthly error rate lower). If they have too many unplanned outages in a month or quarter, then they have to hold off on planned outages until the error rate stabilizes to match the budget. When they don’t have any errors (and they’re not spending enough of their error rate budget), then that’s a problem too, because it means they’re not patching, not experimenting to figure out how to cut costs, etc. Really thought-provoking stuff.
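A rough sketch of the error-budget arithmetic described above, under stated assumptions: the budget is expressed as allowed failed requests per million, and the function name, parameters, and numbers are all made up for illustration. None of this is taken from the book itself.

```python
def error_budget_status(total_requests, failed_requests, budget_per_million):
    """Compare observed failures against an agreed error budget.

    budget_per_million is an assumed unit: allowed failed requests per
    1,000,000 requests in the reporting period.
    """
    # Integer arithmetic keeps the allowance exact.
    allowed_failures = total_requests * budget_per_million // 1_000_000
    remaining = allowed_failures - failed_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": remaining,
        # Planned outages only go ahead while some budget is left.
        "planned_outage_ok": remaining > 0,
    }

# Hypothetical month: 2M requests, budget of 1,000 failures per million.
status = error_budget_status(2_000_000, 1_500, 1_000)
print(status["remaining_failures"], status["planned_outage_ok"])  # 500 True

# Same month with more unplanned failures: the budget is overspent.
overspent = error_budget_status(2_000_000, 2_500, 1_000)
print(overspent["remaining_failures"], overspent["planned_outage_ok"])  # -500 False
```

Because the budget counts failed requests rather than minutes, the same outage costs far less at 3AM on a Sunday than at mid-day, which is why the planned windows get pushed to quiet periods.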