Are Basic Availability Groups a reliable HA solution without integrity checks on the secondary replica?

Question

We have a few Basic Availability Groups in production, and I’ve been reading about their limitations. I’m concerned about these two:

No backups on secondary replica.

No integrity checks on secondary replicas.

Suppose that we haven’t failed over in months. In our case, that seems to imply that we haven’t run DBCC CHECKDB for a long time on any of the databases on the secondary. There could have been a storage corruption issue that occurred months ago that we still don’t know about. If a disaster occurs on the primary replica and we fail over to the secondary, we might end up with the production application pointing at corrupt data.

Would it be considered a best practice to perform one of the following on a fixed schedule?

Perform a planned manual failover to switch the primary and secondary and leave the former secondary in the primary role until the next planned failover.
Take a database snapshot of the secondary replica and run DBCC CHECKDB against that.
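For illustration, the second option could be scripted roughly as below. This is only a sketch: it assumes snapshot creation is permitted on the secondary in your environment, and the database name, logical file name, and snapshot path are hypothetical placeholders. The helper just assembles the T-SQL to run on the secondary; it does not connect to anything.

```python
def ident(name: str) -> str:
    """Quote a SQL Server identifier with brackets, doubling any ']'."""
    return "[" + name.replace("]", "]]") + "]"


def lit(value: str) -> str:
    """Quote a Unicode string literal, doubling any single quotes."""
    return "N'" + value.replace("'", "''") + "'"


def snapshot_checkdb_script(database, files):
    """Build T-SQL that snapshots `database`, checks the snapshot, then drops it.

    `files` is a list of (logical_file_name, snapshot_file_path) pairs,
    one per data file of the source database (hypothetical values here).
    """
    snapshot = database + "_checkdb_snap"  # hypothetical naming convention
    file_specs = ",\n    ".join(
        f"(NAME = {lit(logical)}, FILENAME = {lit(path)})"
        for logical, path in files
    )
    return "\n".join([
        f"CREATE DATABASE {ident(snapshot)} ON\n    {file_specs}\n"
        f"AS SNAPSHOT OF {ident(database)};",
        f"DBCC CHECKDB ({lit(snapshot)}) WITH NO_INFOMSGS, ALL_ERRORMSGS;",
        f"DROP DATABASE {ident(snapshot)};",
    ])


# Example with placeholder names; adjust to your own database and storage.
print(snapshot_checkdb_script(
    "SalesDb", [("SalesDb_data", r"D:\Snapshots\SalesDb_data.ss")]
))
```

A snapshot needs one file entry per data file in the source database, which is why the helper takes a list.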

Or am I overthinking the risks here?

asked 2021-10-26 by Joe Obbish

Answer

"Or am I overthinking the risks here?"

No, I think you’re worrying about the right things.

You can’t spell Schrodinger without DR.

"Schoinge" doesn’t roll off the tongue the same way.

Just like you can’t rely on your backups in a disaster if you haven’t tested restores, you can’t rely on your DR site if you haven’t tested using it.

For the purposes of this question, I am also going to assume that "DR" means "a secondary data center in a separate location for when the primary data center is unable to serve users."

What I would do:

I would personally take copy-only backups from the secondary as part of my existing DR plan. This ensures I have a backup copy at the DR location, in addition to any backup copies at the primary location. Boom, now I have my offsite backups for the DR plan. ✅

Then, I need to test restores of those backups on a regular basis to ensure I’m not relying on corrupt backups. I can implement some sort of automated "Restore my most recent offsite backup" plan, possibly by extending what I am already doing with testing restores of my backups from Production. Once a backup is restored, I would run CHECKDB on the restored copy. Since the restored copy originated from the DR AG replica, this should be sufficient for validating that the DR AG replica doesn’t have corruption. ✅
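A minimal sketch of that restore-then-check step, assuming a separate test instance with room for the restored copy. The backup path, restored database name, and logical file names are all hypothetical placeholders. The helper only builds the T-SQL; how it gets scheduled and executed is up to whatever job tooling you already use.

```python
def ident(name: str) -> str:
    """Quote a SQL Server identifier with brackets, doubling any ']'."""
    return "[" + name.replace("]", "]]") + "]"


def lit(value: str) -> str:
    """Quote a Unicode string literal, doubling any single quotes."""
    return "N'" + value.replace("'", "''") + "'"


def restore_and_check_script(backup_file, restored_name, moves):
    """Build T-SQL that restores a backup under a new name, runs CHECKDB, and cleans up.

    `moves` is a list of (logical_file_name, target_path) pairs so the
    restored files land somewhere that does not collide with existing ones.
    """
    move_clauses = ",\n    ".join(
        f"MOVE {lit(logical)} TO {lit(path)}" for logical, path in moves
    )
    return "\n".join([
        f"RESTORE DATABASE {ident(restored_name)}",
        f"FROM DISK = {lit(backup_file)}",
        f"WITH {move_clauses},",
        "    RECOVERY;",
        f"DBCC CHECKDB ({lit(restored_name)}) WITH NO_INFOMSGS, ALL_ERRORMSGS;",
        f"DROP DATABASE {ident(restored_name)};",
    ])


# Example with placeholder names and paths.
print(restore_and_check_script(
    r"\\dr-share\backups\SalesDb.bak",
    "SalesDb_restore_test",
    [("SalesDb_data", r"E:\RestoreTest\SalesDb_data.mdf"),
     ("SalesDb_log", r"E:\RestoreTest\SalesDb_log.ldf")],
))
```

Restoring under a different name keeps the test from touching any live database on the instance where it runs.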

Test failovers too

Testing that you can fail over to the DR site and serve users from that location is an important part of DR testing, but I’d separate that kind of DR test from running CHECKDB. If a DR readiness test also requires CHECKDB to complete, the test could run very long when your databases are big. I wouldn’t want to entangle the two.

answered 2021-10-26 by Andy Mallon