Multi-site MSSQL Availability Group & Disaster Recovery

There is a lot of information scattered around different blogs and Microsoft documentation about running SQL Availability Groups across multiple sites for DR or reporting requirements, this post outlines how to get a 3 node cluster working again when you have to failover to your DR site.

Assumptions:

Failover Cluster created with all Cluster members, Cloud Witness or Fileshare Witness (in DR site or 3rd site) created
SQL AG Listener IPs configured for both regions

The following guide is based on a 3 node cluster across two sites (PRI and DR):

PRI-SQL01
- Synchronous Commit
- Automatic Failover
- Read Only Replica
PRI-SQL02
- Synchronous Commit
- Automatic Failover
- Read Only Replica
DR-SQL01
- Synchronous Commit (if network permits)/Asynchronous Commit
- Manual Failover
- No Read Only Replica
- Failover cluster vote removed so it does not interfere with Quorum

(get-clusternode -name "DR-SQL01").NodeWeight=0

Disaster Occurs! Time to Bring up the Disaster Recovery Site

First issue – the Cluster will most likely have lost quorum and the cluster has stopped – so you can’t failover, now what? On the DR SQL Server you do the following (Source for these steps):

Stop the Cluster Service with the following Powershell

Stop-Service ClusSvc

Start the Cluster, fixing/ignoring Quorum

Start-ClusterNode -name DR-SQL01 -FixQuorum

Check the status of the Cluster with the following command, it will transition from Joining to Up (good to go)

Get-ClusterNode -Name DR-SQL01

Return the DR SQL server a vote in the Cluster Quorum

(Get-ClusterNode -name "DR-SQL01").NodeWeight=1

Remove votes from Offline Primary site servers

(get-clusternode -name "PRI-SQL01").NodeWeight=0
(get-clusternode -name "PRI-SQL02").NodeWeight=0

Confirm the votes assigned to the Cluster

Get-ClusterNode | fl Name,NodeWeight

The Failover Cluster should now be up, but the Role(s) for the SQL Availability Group will be offline or failed
Start SQL Server Management Studio and run the following to check the health of the databases
- If Failover Ready = 1 – there should be no dataloss
- If Failover Ready = 0 – there will be dataloss (will always be this value for Asynchronous Commit)
- Keep a record of Recovery_lsn and Truncation_lsn to compare against the PRI servers when they come back up to see how much was missing

select dharcs.replica_server_name, dhdrcs.is_failover_ready, dhdrcs.database_name, dhdrcs.recovery_lsn, dhdrcs.truncation_lsn
from sys.dm_hadr_database_replica_cluster_states dhdrcs join sys.dm_hadr_availability_replica_cluster_states dharcs
on(dhdrcs.replica_id = dharcs.replica_id)
where dharcs.replica_server_name = @@servername

Failover to the DR Server inside SQL Availability Groups
- I’ve found this works best in this scenario with T-SQL instead of using the Wizard

//** In this example the AG name is SYSMONKEY-AG **//
ALTER AVAILABILITY GROUP "SYSMONKEY-AG" FORCE_FAILOVER_ALLOW_DATA_LOSS

Database should now be active, check that it shows as Primary in SSMS

Primary Site is back up and running, lets move back to there

Failing back without dataloss won’t be available at first because the returning Cluster Members are now in a paused state as they recognise the change in Primary for the Availability Group and Failover Cluster (source)

Return node votes to the other cluster members

(get-clusternode -name "PRI-SQL01").NodeWeight=1
(get-clusternode -name "PRI-SQL02").NodeWeight=1

Resume the replication on each server by running the following T-SQL on each per database

//** Run this against each database on each of the servers in the Primary site that should now show as 'secondary' database servers **//
 
alter database dbname set hadr resume

Let the servers true up their data, if there has been a high rate of change or they have been offline for too long then you may have to remove them from the Availability Group and reseed them.
Once up to date, you will be able to failover without dataloss back to the Primary Site.
Make sure to remove the Vote for the DR site again from the Cluster

(get-clusternode -name "DR-SQL01").NodeWeight=0

Hopefully if you get to here you should be back to normal operations!

Reduce Failover time between subnets – planned & unplanned

One key thing to remember is if your Apps support it, you should use the MultiSubnetFailover=True SQL Connection String option to ensure speedy failover between sites.

.NET Framework 4 is required for this, and essentially when set to True apps will query DNS for all IPs associated with the Availability Group and try to connect on the Database Server port (1433 or other…) on all IPs at the same time.

Since only one of the Sites IPs is active it will immediately use that one. This greatly speeds up failover during normal operations as well.

Should your apps not support this then you should follow the information in this Microsoft Document on how to disable this feature and reduce the DNS TTL instead.