As a DBA, we have the opportunity & responsibility to build highly available database platforms for use by applications & users. Even if you are a DBA at a small company supporting databases for only internal applications, we are usually still responsible for providing databases as a service for application teams. Sometimes, in order for us to provide the best service, we need application teams to do some work to make use of the super awesome thing we built. And sometimes, getting application teams to make that change isn’t easy, or isn’t fast, or isn’t even possible.
Such is often the case with multi-subnet availability groups.
Multi-subnet availability groups
When building a highly-available availability group (AG), we often need to straddle multiple subnets. The most common case where an AG spans multiple subnets is when an AG spans data centers or geographies–such as a primary data center, and a disaster recover (DR) data center. Another common case is when an AG spans more than one availability zone (AZ) within a data center. All the major cloud providers support AZs, and it is becoming more common for companies to build on prem data centers with AZ support. AZs create “shared nothing” zones where Zone A & Zone B have fully separate storage, network, and compute–which allows for teams to build platforms that are zone-redundant, so that even in dramatic failure situations (ie a SAN failure), the failure is limited to servers in that zone, and the redundant systems stay online in the other zone.
This type of zone-redundant high availability is great for uptime, but does introduce one wrinkle for connectivity.
MultiSubnetFailover=true
With a default configuration, multi-subnet AGs require that the clients connecting to them include MultiSubnetFailover=true
as a connection string attribute. This attribute tells the driver to expect DNS to provide multiple IP addresses for the Listener name, and to try all of them to find the correct IP to connect to for that network name. Clients that do not specify this attribute will get multiple IPs and not know how to handle them properly–most drivers will pick up one of the returned IPs at random (or maybe just seemingly random), and try to connect to that. This can result in random (or seemingly random) connection failures when it picks the wrong IP.
However, not every client or application will support this connection string attribute. In my experience there are two extremely common reasons that you can’t use MultiSubnetFailover=true
:
- Third party apps that don’t support it. Software vendors take all sorts of different approaches to connection strings, and sometimes it’s just not possible to edit anything but the server name. If your software vendor doesn’t support it, then all you can do is ask them to fix it in the future (and keep reading for the workaround).
- Legacy apps with a huge codebase. Even if all your software is home-grown, and you have full control of it, getting changes in can take time. In an environment of legacy apps, you might have dozens or hundreds of client-side apps, services, and containers that need to be updated to the proper drivers and then have connection strings modified. It’s all within your power, but it’s going to take time to accomplish.
RegisterAllProvidersIP=0
There is a configuration on the AG listener itself–the network name on the Windows cluster–that you can flip in order to change the way the listener registers IPs, so that only one IP is registered at a time. When the listener fails over to a different subnet, then the Windows cluster will re-register the listener with a different IP.
There is a trade off to be made here:
MultiSubnetFailover=true | RegisterAllProvidersIP=0 |
---|---|
❌ Microsoft drivers support it, but not jTDS & other drivers. | ✅ IP changes are handled at the server, so all drivers can work with this config. |
❌ Requires connection string changes from all clients. | ✅ No changes needed from clients |
✅ During failover clients can reconnect immediately. | ❌ During failover clients will not be able to reconnect until the DNS changes propagate & TTL expires. |
That third item is really the problem with using RegisterAllProvidersIP
. In addition to TTL, most multi-subnet network configurations will also have some level of DNS propagation where each subnet has it’s own DNS server, so the changes also have to propagate across to that other subnet. If you have to wait for TTL to expire and for DNS propagation, you’re going to bump failover downtime from “a few seconds” to “a minute or more,” even if you set TTL relatively low.
Setting RegisterAllProvidersIP
This setting has to be set on the listener object on the Windows Server Failover Cluster(WSFC). This setting needs to be set via PowerShell, as it’s not exposed from the Failover Cluster Manager snap-in.
First, let’s just find & look at the setting to see it’s current value. In my example, I’ve not yet changed it, so it will have the default value of 1 (true) to register all IPs.
First off, let’s talk about example configuration. I have a WSFC with a single AG, and a single listener. The listener has two IPs, each in a different subnet:
- The Windows Cluster:
WSFC01
- The Availability Group:
AG01
- The AG Listener:
AGListener01
- 10.10.10.xxx subnet
- 10.10.20.xxx subnet
I’ll step through the PowerShell slowly, assuming that most of us DBAs don’t play around with the FailoverClusters PowerShell module very often.
First, let’s take a look at the listener in DNS:
Resolve-DnsName AGListener01
Name Type TTL Section IPAddress ---- ---- --- ------- --------- AGListener01 A 1200 Answer 10.10.10.138 AGListener01 A 1200 Answer 10.10.20.138
Notice both IPs are listed in DNS.
Next, we can see the list of all cluster resources using the Get-ClusterResource
cmdlet:
Get-ClusterResouce -Cluster WSFC01
Name State OwnerGroup ResourceType ---- ----- ---------- ------------ 10.10.10.199 Online Cluster Group IP Address Cluster IP Address Online Cluster Group IP Address Cluster Name Online Cluster Group Network Name AG01 Online AG01 SQL Server Availability Group AG01_10.10.20.138 Offline AG01 IP Address AGListener01 Online AG01 Network Name File Share Witness Online Cluster Group File Share Witness IP Address 10.10.10.138 Online AG01 IP Address
Notice that both IPs are listed–but the 10.10.20 IP is offline. It is offline, but still registered in DNS. The DNS registration allows for clients to know both IPs, and connect to whichever it finds online.
Out of all of the things returned by Get-ClusterResource, we only care about the objects that belong to the OwnerGroup for the AG. We can filter a bit further by using our AG name as a filter:
Get-ClusterResource -Cluster WSFC01 | Where OwnerGroup -eq AG01
Name State OwnerGroup ResourceType ---- ----- ---------- ------------ AG01 Online AG01 SQL Server Availability Group AG01_10.10.20.138 Offline AG01 IP Address AGListener01 Online AG01 Network Name IP Address 10.10.10.138 Online AG01 IP Address
These 4 items represent the resources that make up my AG: The AG itself, the listener (listed as “Network Name”), and the two IP addresses, each in different subnets.
The item that I want to check for the RegisterAllProvidersIP
parameter is the listener (Network name) resource. If I pass the Cluster name & resource name to Get-ClusterResource
, I can then pass that resource along the pipeline to Get-ClusterParameter
which will list out all parameters for that one resource:
Get-ClusterResource -Cluster WSFC01 -Name AGListener01 | Get-ClusterParameter
Object Name Value Type ------ ---- ----- ---- AGListener01 Name AGListener01 String AGListener01 DnsName AGListener01 String AGListener01 Aliases String AGListener01 RemapPipeNames 1 UInt32 AGListener01 HostRecordTTL 1200 UInt32 AGListener01 RegisterAllProvidersIP 1 UInt32 AGListener01 PublishPTRRecords 0 UInt32 AGListener01 ResourceData {1, 0, 0, 0...} ByteArray AGListener01 StatusNetBIOS 0 UInt32 AGListener01 StatusDNS 0 UInt32 AGListener01 StatusKerberos 0 UInt32 AGListener01 CreatingDC \\dc01.am2.co String AGListener01 LastDNSUpdateTime 9/29/2020 3:57:12 PM DateTime AGListener01 ObjectGUID abc1234def567890ab12cd3456ef7890 String AGListener01 DnsSuffix am2.co String AGListener01 ADAware 1 UInt32
And I can get the specific RegisterAllProvidersIP parameter by explicitly asking for it. I’m also curious about TTL, so lets get just these two parameters:
Get-ClusterResource -Cluster WSFC01 -Name AGListener01 | Get-ClusterParameter -Name RegisterAllProvidersIP,HostRecordTTL
Object Name Value Type ------ ---- ----- ---- AGListener01 RegisterAllProvidersIP 1 UInt32 AGListener01 HostRecordTTL 1200 UInt32
To change this sample script from the docs specify five commands to run:
Import-Module FailoverClusters Get-ClusterResource yourListenerName | Set-ClusterParameter RegisterAllProvidersIP 0 Get-ClusterResource yourListenerName | Set-ClusterParameter HostRecordTTL 300 Stop-ClusterResource yourListenerName Start-ClusterResource yourListenerName Start-Clustergroup yourListenerGroupName
Note that these commands are written in a way that they do not pass in the
Cluster
parameter, so as written, they would need to be run on a cluster node for the cluster.
PowerShell is more powerful when I can run it from my desktop without fussing with remoting into every server to run something locally on the cluster node, so let’s update this example to allow that (and to substitute our own cluster object names for our example):
Import-Module FailoverClusters Get-ClusterResource -Cluster WSFC01 -Name AGListener01 | Set-ClusterParameter -Name RegisterAllProvidersIP 0 Get-ClusterResource -Cluster WSFC01 -Name AGListener01 | Set-ClusterParameter -Name HostRecordTTL 300 Stop-ClusterResource -Cluster WSFC01 -Name AGListener01 Start-ClusterResource -Cluster WSFC01 -Name AGListener01 Start-Clustergroup -Cluster WSFC01 -Name AG01
There are five steps outlined here:
- Change the RegisterAllProvidersIP parameter.
- Reduce the HostRecordTTL. The docs recommend 300 seconds (5 minutes), but you’ll need to determine the right value for your environment.
- Take the listener offline.
- Bring the listener back online.
- Ensure the entire AG is online.
Be careful about running these commands in production, as those stop/start steps will be disruptive to users. It’s important to note that if you change these cluster parameters, they will not take effect until you offline & online the resource. The Windows Cluster service will continue to renew the DNS entry using the configuration from the last time the resource was brought online. Therefore you’ll need to briefly take the listener (but not the entire AG) offline, and back online for it to realize the changes & update DNS.
After we’ve run the Set
commands, we can use the Get
commands to confirm that it is now set properly:
Get-ClusterResource -Cluster WSFC01 -Name AGListener01 | Get-ClusterParameter -Name RegisterAllProvidersIP,HostRecordTTL
Object Name Value Type ------ ---- ----- ---- AGListener01 RegisterAllProvidersIP 0 UInt32 AGListener01 HostRecordTTL 300 UInt32
After it is changed and the old TTL is expired, you’ll be able to see the Lister has only a single IP address configured, and now shows the shorter TTL:
Resolve-DnsName AGListener01
Name Type TTL Section IPAddress ---- ---- --- ------- --------- AGListener01 A 300 Answer 10.10.10.138
Now your clients that are not multi-subnet aware will be able to connect, at the cost of longer downtime during an AG failover. Tune in tomorrow for another post that will let you support legacy clients, and modern clients that can use MultiSubnetFailover=true
, all on the same AG.