Windows cluster failed over, but SQL instances didn’t move
Question
I recently inherited a SQL Cluster (2008R2) which for the most part behaves itself impeccably.
The Windows cluster is made up of two nodes running Active/Passive; Node1 and Node2 are dedicated blades in two different data centers. There are 3 SQL instances, all running on Node1. Quorum is established by a File Share Witness, and we have a heartbeat between the two nodes.
The other day someone switched off the file share witness by mistake, and the Windows cluster failed over from Node1 to Node2. Or should I say, in Failover Cluster Manager, Node2 was now specified as the active node by Windows.
However, the SQL Cluster didn’t do anything. All the instances stayed up and hosted on Node1. I would have expected them to move nodes, but no.
There was no adverse effect on the databases at all.
Once power was restored to the File Share Witness, I brought it back online and the Windows Cluster failed back to Node1.
Our Windows Technicians are looking into why the cluster failed over, and I’m left scratching my head with the SQL bit.
All I can think of is that the heartbeat kept the SQL instances on Node1 and losing the witness wasn’t important.
I’m still learning the small details of Windows Clustering, being much more used to Log Shipping and Mirroring when it comes to HA solutions, so any insight into why the SQL instances didn’t fail over would be appreciated.
asked 2016-04-01 by Molenpad
Answer
Clustering is complex, and there are lots of moving parts (no pun intended). Let me try to break this down into more manageable chunks:
From a terminology perspective, there’s your Windows Server Failover Cluster (WSFC), and your SQL Server Failover Cluster Instances (FCIs). I try to avoid saying “Cluster” on its own and use these acronyms instead, to avoid ambiguity.
Quorum:
The quorum is the number of votes necessary to transact business on your WSFC. Depending on your WSFC configuration, voters can be nodes (servers), a drive, or a file share. You need more than 50% of your votes in order for the WSFC to be online. If you lose 50% or more of your voters, then the WSFC and all clustered services (including your FCI) will go offline and not come back until you have (or force) quorum.
In your configuration, you have two nodes and one file share witness, for a total of three votes; a majority is two. Any one of those voters can go offline and you still have quorum. When you lost the file share, you still had two of the three votes online, so your WSFC and all clustered services stayed online.
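If you want to check this yourself, the FailoverClusters PowerShell module (run from one of the nodes) will show the quorum configuration and the voters. A minimal sketch, assuming the module is installed; note that the NodeWeight property is only exposed on patched 2008 R2 builds and later:

```powershell
# Run on one of the cluster nodes, in an elevated PowerShell session
Import-Module FailoverClusters

# Quorum type and the witness resource (here, the File Share Witness)
Get-ClusterQuorum

# Node state; on patched builds, NodeWeight shows whether each node carries a quorum vote
Get-ClusterNode | Select-Object Name, State, NodeWeight
```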
Cluster Owner/Host Server:
When you say that “Node2 was now specified as the active node by Windows”, I suspect you are referring to the “Current Host Server” for the cluster. So what is that?
Your WSFC has a network name and an IP address. That name & IP have to be tied to a machine that is part of your cluster; more specifically, they can be tied to any one machine in your cluster. This is part of your WSFC, but not your FCI.
In your scenario, you have three FCIs on a two-node WSFC. It would be perfectly valid to have one FCI on Node1, and two FCIs on Node2. And the “Current Host Server” for the WSFC could be either node. SQL Server won’t care.
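A quick way to see the distinction is to list the cluster groups and their owner nodes. This is just a sketch; the “SQL Server (INST1)” name below is a placeholder for whatever your instances are actually called:

```powershell
# Run on either node; requires the FailoverClusters module
Import-Module FailoverClusters

# "Cluster Group" holds the WSFC name/IP (the "Current Host Server" you saw move to Node2);
# the "SQL Server (...)" groups are your FCIs, and each can be owned by a different node
Get-ClusterGroup | Select-Object Name, OwnerNode, State

# Check a single FCI (instance name is a placeholder)
Get-ClusterGroup -Name "SQL Server (INST1)" | Select-Object Name, OwnerNode, State
```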
So what happened: As you said, there were no adverse effects on the databases. I’d expect that, because SQL Server isn’t tied to that WSFC host server. I wouldn’t have expected the host server to move when the file share failed, but I’d let your Windows guys dig into that more. From a SQL perspective, everything worked as expected.
answered 2016-04-01 by Andy Mallon