Why did taking SQL Agent offline cause WSFC to fail over to the passive node?
Question
I have a 2-node Windows Failover Cluster with a quorum disk.
SQL agent is NOT a resource of the cluster.
I needed to enable Service Broker on the server; to do that I had to take SQL Agent offline, run a T-SQL statement, then simply bring it back online.
However, as soon as I stopped SQL Agent using SSMS, the cluster failed over to the passive node.
I thought that, because SQL Agent is not listed as a resource in cluster manager, I just needed to stop it on the active node, make the change, and bring it back online.
My questions are:
- Why did stopping a service that is not part of the cluster cause the cluster to fail over?
- What would be the proper way to stop SQL Agent in my case, for example for maintenance?
I simulated the same actions on my test cluster and everything worked fine; the cluster didn’t fail over. The test cluster has the same structure, but without quorum.
UPDATE:
If I right-click the cluster name itself, I can see SQL Agent under property type.
Does that mean all those resources are in the cluster even though they are not visible under "Roles"?
asked 2021-08-12 by Serdia
Answer
In Failover Cluster Manager, select the role for the Failover Cluster Instance (FCI), then select the "Resources" tab at the bottom. You’ll see that the role is actually built with both the SQL Server service and the SQL Server Agent service as resources under that role.
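The same list can also be pulled from PowerShell rather than the GUI. Below is a minimal sketch in Python that only builds and runs the PowerShell call. It assumes the FailoverClusters PowerShell module is available on the node, and the role name "SQL Server (MSSQLSERVER)" is a guess for a default instance, so substitute whatever your role is actually called. The function names are made up for illustration.

```python
import subprocess


def ps_quote(value):
    """Wrap a value in single quotes for PowerShell, doubling embedded quotes."""
    return "'" + value.replace("'", "''") + "'"


def list_role_resources_command(role_name):
    """Build a PowerShell invocation that lists the resources inside one cluster role."""
    # Get-ClusterGroup and Get-ClusterResource come from the FailoverClusters module.
    script = f"Get-ClusterGroup -Name {ps_quote(role_name)} | Get-ClusterResource"
    return ["powershell.exe", "-NoProfile", "-Command", script]


def list_role_resources(role_name):
    """Run the command on a cluster node and return its text output."""
    result = subprocess.run(
        list_role_resources_command(role_name),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling list_role_resources("SQL Server (MSSQLSERVER)") on a cluster node should print one row per resource in the role. If the Agent appears in that output, it is a clustered resource, even if it is not obvious from the "Roles" view.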
When you stopped the Agent service, the Windows cluster detected that it had stopped "unexpectedly" and failed over to the other node.
Instead of stopping the service from SSMS or the Services control panel, you’ll want to right-click the "SQL Server Agent" resource in Failover Cluster Manager and take the resource offline there. That tells the WSFC what you intend, so it will not fail over. Instead, it will show the FCI role as being partially online. To restart SQL Agent, right-click the resource again and bring it online.
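If you would rather script this than click through the GUI, the FailoverClusters PowerShell module has Stop-ClusterResource and Start-ClusterResource, which take a resource offline and bring it online through the cluster. Here is a rough sketch in Python that builds those calls. The resource name "SQL Server Agent" is an assumption for a default instance (named instances usually carry the instance name in parentheses), and the helper names are hypothetical.

```python
import subprocess

# Assumed resource name; check the "Resources" tab for the exact name on your cluster.
AGENT_RESOURCE = "SQL Server Agent"

# Cluster-aware cmdlets from the FailoverClusters module.
CMDLETS = {
    "offline": "Stop-ClusterResource",
    "online": "Start-ClusterResource",
}


def ps_quote(value):
    """Wrap a value in single quotes for PowerShell, doubling embedded quotes."""
    return "'" + value.replace("'", "''") + "'"


def agent_resource_command(action, resource_name=AGENT_RESOURCE):
    """Build the PowerShell invocation that takes the resource offline or online."""
    if action not in CMDLETS:
        raise ValueError(f"action must be one of {sorted(CMDLETS)}")
    script = f"{CMDLETS[action]} -Name {ps_quote(resource_name)}"
    return ["powershell.exe", "-NoProfile", "-Command", script]


def run_on_node(command):
    """Execute a built command on a cluster node; raises if PowerShell reports failure."""
    subprocess.run(command, check=True)
```

A maintenance window would then look like run_on_node(agent_resource_command("offline")), your change, then run_on_node(agent_resource_command("online")). Because the request goes through the cluster, it is treated as intentional rather than as a failure.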
answered 2021-08-13 by Andy Mallon