log_reuse_wait_desc = AVAILABILITY_REPLICA, log is 100% full but AG is all fine

clip art of 
 a double-quote character

Question

That’s a weird scenario. The transaction log is full. It waits on AG. AG is all good. No latency, last commit/hardened time is almost real-time on all the nodes. The Estimated Recovery Time is zero. I checked this from the primary node and secondaries as well (in case if Dashboard is not updating).
This page lists only two reasons for this error, a delivery latency and redo latency. Nothing applies to my system. How to troubleshoot this further?

Edit 1. The only change we have done to the environment, we added two SQL Server 2019 nodes in preparation for migration. It shouldn’t follow to such problem though.

Edit 2. The only workaround I found is to re-add the DB to the AG.

asked 2021-11-08 by Alsin


Answer

You should check the DMVs to confirm that the AG is healthy with send & redo queues. The dashboard in SSMS does not do a great job at representing the true send & redo health.

As part of my open-source DBA database I have a Check_AGHealth stored procedure that you can use to give more information.

Without using the full stored procedures, a couple of the important statements to run on primary would be these:

Check dm_hadr_database_replica_states

This query returns a bunch of data, but the interesting things you’ll want to look for are:

  • Anything reporting that it is suspended
  • Anything reporting it is unhealthy
  • Large log send queue
  • Large redo queue

The ORDER BY clause on this should bubble those things up to the top of the result set, but any of those things would present a problem to clearing the log and would show a log reuse wait of AVAILABILITY_REPLICA

SELECT ServerName         = ar.replica_server_name, 
        AgName             = ag.Name,
        DbName             = db_name(ds.database_id), 
        AgRole             = rs.role_desc,
        rs.role ,
        UnsentLogMB        = ds.log_send_queue_size/1024.0,
        RedoQueueSizeMB    = ds.redo_queue_size/1024.0,
        RedoEstCompletion  = CASE WHEN redo_rate = 0 THEN 0 ELSE ds.redo_queue_size/ds.redo_rate END,
        SynchState         = ds.synchronization_state_desc, 
        AgHealth           = ds.synchronization_health_desc, 
        SuspendReason      = ds.suspend_reason_desc, 
        SynchHardenedLSN   = ds.last_hardened_lsn,
        LastHardenedTime   = ds.last_hardened_time,
        LastRedoneTime     = ds.last_redone_time,
        LastCommitTime     = ds.last_commit_time
FROM sys.availability_groups AS ag
JOIN sys.availability_replicas AS ar ON ar.group_id = ag.group_id 
JOIN sys.dm_hadr_database_replica_states AS ds ON ds.group_id = ar.group_id AND ds.replica_id = ar.replica_id
JOIN sys.dm_hadr_availability_replica_states AS rs ON rs.replica_id = ds.replica_id
ORDER BY 
    CASE 
        WHEN ds.suspend_reason_desc IS NOT NULL THEN 1
        WHEN ds.synchronization_state_desc <> 'HEALTHY' THEN 1
        ELSE 9
    END,
    rs.role DESC,
    ds.log_send_queue_size + ds.redo_queue_size DESC
;

answered 2021-11-08 by Andy Mallon