Replication latency when an Availability Group replica is offline

Assume you have configured a transactional replication publisher whose published database is part of an availability group (AG). Now an asynchronous secondary AG (or database mirroring) replica goes offline. Perhaps it is a DR node, and you are doing patching or other planned maintenance on it, with the node intentionally offline.

When the SQL Server instance on the secondary replica goes offline, data stops flowing through transactional replication from the primary replica. When the secondary replica comes back online, transactional replication begins flowing again. What happened? Why did maintenance on the DR node affect replication on the primary?
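
If you want to confirm that the replication stall lines up with a disconnected replica, you can check replica connectivity from the primary. Here is a minimal diagnostic sketch using the standard AG DMVs (nothing here is specific to any one environment); an offline secondary shows up as DISCONNECTED:

    -- Run on the primary replica: connectivity and health of each AG replica.
    SELECT ag.name AS ag_name,
           ar.replica_server_name,
           ar.availability_mode_desc,      -- SYNCHRONOUS_COMMIT / ASYNCHRONOUS_COMMIT
           ars.role_desc,
           ars.connected_state_desc,       -- DISCONNECTED for an offline replica
           ars.synchronization_health_desc
    FROM sys.availability_groups AS ag
    JOIN sys.availability_replicas AS ar
      ON ar.group_id = ag.group_id
    JOIN sys.dm_hadr_availability_replica_states AS ars
      ON ars.replica_id = ar.replica_id;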

The Log Reader Agent is very conservative

The Log Reader Agent will only read as far as the last transaction hardened on all other Availability Group replicas. This means that if one AG replica is offline, the Log Reader Agent stops harvesting transactions until that replica is back online. This behavior is deliberately ultra-safe: it ensures that only transactions committed everywhere are replicated.
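
You can watch this happen from the publisher. A quick sketch using the documented sp_replcounters procedure: while the replica is offline, the count of pending replicated transactions and the replication latency keep climbing, because the Log Reader Agent refuses to read past the last change hardened everywhere.

    -- Run at the Publisher: returns latency, throughput, and transaction
    -- counts for each published database. Watch 'replicated transactions'
    -- and 'replication latency' grow while the secondary is offline.
    EXEC sp_replcounters;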

Can I change this behavior?

In some cases, this may be too conservative a rule for your setup. If so, you can add Trace Flag 1448 to your startup parameters, which disables this behavior for asynchronous replicas. An offline synchronous replica will still prevent the Log Reader Agent from reading unsent transactions for replication.

This trace flag affects the server that is currently hosting the primary replica of the database, and thus must be set on every server that might host the primary replica for the AG.
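
Per the documentation quoted below, the trace flag takes effect immediately, so you can enable it on a running instance with DBCC TRACEON and then persist it as a startup parameter. A sketch of both steps:

    -- Enable TF 1448 globally; takes effect immediately, no restart needed.
    DBCC TRACEON (1448, -1);

    -- Verify that the trace flag is active globally.
    DBCC TRACESTATUS (1448, -1);

    -- To persist across restarts, also add -T1448 to the startup parameters
    -- (SQL Server Configuration Manager > service properties > Startup
    -- Parameters) on every instance that can host the primary replica.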

From the Microsoft documentation on TF 1448:

Enables the replication log reader to move forward even if the async secondaries have not acknowledged the reception of a change. Even with this trace flag enabled the log reader always waits for the sync secondaries. The log reader will not go beyond the min ack of the sync secondaries. This trace flag applies to the instance of SQL Server, not just an availability group, an availability database, or a log reader instance. Takes effect immediately without a restart. This trace flag can be activated ahead of time or when an async secondary fails.

Why does Microsoft impose this restriction on synchronous replicas?

If the LSN has not been hardened on a synchronous replica, then the transaction isn't fully committed, and replication is very conservative about what it considers a fully committed transaction that is safe to replicate. Say, for example, that the AG isn't in sync due to a network issue or some other problem (i.e., the secondary is online, but the primary node just doesn't know that). If you then force a failover of the AG and allow data loss, your replication subscribers would have data that was lost during the failover, potentially breaking replication.

Essentially, the default behavior is that SQL Server ensures a change reaches every AG replica before allowing transactional replication to pick it up. It does this to guarantee that no AG failover-with-data-loss scenario results in replicating data that was lost in the failover. The trace flag can be used to relax that limitation for asynchronous replicas only: even with the trace flag enabled, data must be committed to all synchronous replicas before transactional replication will pick up the change.
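
To see how far each replica has hardened the log, and therefore how far the Log Reader Agent is allowed to advance, you can compare per-replica progress from the primary. A minimal sketch using the standard DMVs:

    -- Run on the primary: log-hardening progress per replica and database.
    -- Without TF 1448 the log reader waits on the minimum of all replicas;
    -- with it, only on the minimum of the synchronous replicas.
    SELECT ar.replica_server_name,
           ar.availability_mode_desc,
           drs.database_id,
           drs.synchronization_state_desc,
           drs.last_hardened_lsn,
           drs.last_commit_time
    FROM sys.dm_hadr_database_replica_states AS drs
    JOIN sys.availability_replicas AS ar
      ON ar.replica_id = drs.replica_id
    ORDER BY ar.replica_server_name;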
