SRM User Guide: FlashArray Continuous Replication (ActiveDR) Workflows
For a detailed overview of ActiveDR and VMware, please refer to this guide.
Disaster recovery management is a critical piece of any infrastructure, and array-based replication plays an important role in it. Ensuring that the data that runs the business is not only protected in the present production site, but also available in a separate location at a moment's notice, allows the business to remain functioning in the case of a small failure or a large catastrophe. Virtualized environments such as VMware implementations are no exception.
ActiveDR provides an array-based mechanism to protect the important data in a simple and robust way over great distances with extremely low RPOs. The simplicity of management makes disaster recovery (and, importantly, testing your disaster recovery plans) straightforward and repeatable.
In order for SRM to run any operations on FlashArray-hosted datastores/RDMs with ActiveDR there are a few requirements:
- FlashArray SRA version 4.0.0 with Purity 6.0.0 or 6.0.1. For Purity 6.0.2 or later you must be using SRA version 4.1.0 (Linux Appliance support only) or later if you intend to use ActiveDR
- Replication must be configured on the pod hosting the volumes
- Array managers must be configured
- Array pairs protecting those volumes must be enabled
- Volumes must be discovered by SRM
- The pod volumes must be added to a protection group
- The protection group must be in at least one recovery plan
- Hosts and host groups must be pre-created on the recovery FlashArray for the target recovery clusters
- Place volumes in protection groups for use with SRM protection groups and recovery plans. Using hosts or host groups as the placement for volumes to be protected by SRM behaves inconsistently, and support for it is best effort. Pure Storage is working to improve these workflows in a future release of the SRA, but at this time Pure recommends avoiding host or host group placement for FlashArray protection groups
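The version requirement in the first bullet above can be sketched as a small check. This is an illustration only: the function names are hypothetical, and the sketch assumes that SRA 4.1.0 or later also works with Purity 6.0.0 and 6.0.1 and that Purity releases earlier than 6.0.0 do not qualify, neither of which the list above states explicitly.

```python
def parse_version(text):
    """Turn a dotted version string such as '6.0.2' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def activedr_versions_ok(sra, purity):
    """Return True if the SRA/Purity pair satisfies the stated requirement.

    Assumption (not stated in the guide): SRA 4.1.0+ is also acceptable
    with Purity 6.0.0 and 6.0.1, and Purity below 6.0.0 is not acceptable.
    """
    sra_v, purity_v = parse_version(sra), parse_version(purity)
    if purity_v < (6, 0, 0) or sra_v < (4, 0, 0):
        return False
    if purity_v >= (6, 0, 2):
        # Purity 6.0.2 or later requires SRA 4.1.0 or later.
        return sra_v >= (4, 1, 0)
    return True


print(activedr_versions_ok("4.0.0", "6.0.2"))  # False
```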
There is no need to pre-create any recovery volumes or pre-connect anything to the recovery hosts on the FlashArray(s). ActiveDR itself automates the process of creating recovery volumes and the SRA automates connecting them to the appropriate host resources.
As of the 4.0 release of the SRA, Pure Storage does not support adding volumes not in use by the VMware environment under the control of SRM into an ActiveDR pod that has one or more volumes under SRM control. Supporting pods with non-VMware volumes to be recovered by SRM will be considered for a future release of the SRA.
For the below workflow walkthrough, the following example environment will be used:
The source pod called SRM-podA on a FlashArray named flasharray-m20-1.
...linked to a target pod called SRM-podB on a FlashArray named flasharray-m20-2.
There are three volumes in this pod.
With corresponding recovery volumes in the target pod.
Two of which are in use as VMFS datastores and one as an RDM.
These are discovered in SRM.
In a SRM protection group called ActiveDR-PG protecting 31 VMs.
In a recovery plan called ActiveDR-RP.
Synchronization with ActiveDR
During a test recovery, recovery, or reprotect, SRM issues an operation, internally referred to as syncOnce, to synchronize the data in the SRM protection group.
ActiveDR has no specific "synchronize" operation because it is inherently always synchronizing as fast as it can. The underlying goal of this operation is to ensure that, before the test starts, at least the point-in-time of the start of the test has been replicated to the second site. Therefore, the SRA uses the recovery point property of ActiveDR replica links to confirm that the pod is in sync.
The recovery point indicates the last point-in-time of the entire pod that is available on the second array. This is calculated by using the lag property (which says how far behind the replication is) and subtracting it from the current time. So if the lag is 30 seconds and the current time is 08:00:00, the recovery point is 07:59:30. So if the replication were to fail at that moment, ActiveDR would be able to recover all data, snapshots, configurations, etc. from 07:59:30.
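As a minimal sketch of the arithmetic above (the function name and the calendar date are illustrative assumptions, not part of the SRA or the Purity API), the recovery point can be derived from the lag like this:

```python
from datetime import datetime, timedelta


def recovery_point(now, lag):
    """Latest point-in-time of the pod that is available on the second array."""
    return now - lag


# Example from the text: lag of 30 seconds at 08:00:00 (the date is arbitrary).
now = datetime(2024, 1, 1, 8, 0, 0)
print(recovery_point(now, timedelta(seconds=30)).time())  # 07:59:30
```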
So when SRM issues a synchronization, the SRA uses the timestamp of that request as the target point-in-time. If the recovery point is not at that time or later, the SRA will tell SRM to wait and try again. Once the recovery point equals or surpasses the identified timestamp, the synchronization is deemed to be complete.
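The wait-and-retry behavior described above could be sketched roughly as follows. This is an illustration only: wait_for_sync, its parameters, and the polling limits are hypothetical, and the sketch collapses the exchange into a single loop, whereas the real SRA tells SRM to wait and try again.

```python
import time
from datetime import datetime, timedelta


def wait_for_sync(requested_at, get_recovery_point,
                  poll_seconds=1.0, max_polls=60, sleep=time.sleep):
    """Poll until the recovery point reaches the sync request timestamp.

    get_recovery_point is a caller-supplied function (hypothetical) that
    returns the current recovery point of the replica link.
    Returns the number of polls that were needed.
    """
    for attempt in range(1, max_polls + 1):
        if get_recovery_point() >= requested_at:
            return attempt  # synchronization is deemed complete
        sleep(poll_seconds)  # not caught up yet: wait and try again
    raise TimeoutError("recovery point never reached the requested time")


# Simulated replica link whose recovery point catches up on the third poll.
t0 = datetime(2024, 1, 1, 8, 0, 0)
samples = iter([t0 - timedelta(seconds=30), t0 - timedelta(seconds=5), t0])
print(wait_for_sync(t0, lambda: next(samples), sleep=lambda s: None))  # 3
```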
To initiate a test recovery, the following prerequisites must be met:
- The source pod is promoted and the target pod is demoted
- There is no "undo" pod still in existence for the target pod. If one exists, the test failover start will complete successfully, but the test failover stop (a.k.a. cleanup) will fail because a pod cannot be demoted if an undo pod for it already exists
- The pod is entirely populated by volumes owned and managed by the VMware environment controlled by an SRM pair. If SRM manages one device in a pod, that same pair must manage all devices in the pod
- The appropriate host or host groups are created on the target FlashArray
If you encounter errors, please see the ActiveDR test recovery troubleshooting KB.
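The test recovery prerequisites listed above can be expressed as a simple pre-flight check. The sketch below is hypothetical: the PodPairState fields and the function name are made up for illustration, and the caller is assumed to have gathered the state from the arrays and SRM by some other means.

```python
from dataclasses import dataclass


@dataclass
class PodPairState:
    """Hypothetical snapshot of the state relevant to a test recovery."""
    source_promoted: bool
    target_promoted: bool
    target_undo_pod_exists: bool
    all_volumes_srm_managed: bool
    target_hosts_created: bool


def unmet_test_prerequisites(state):
    """Return a list of human-readable problems; empty means ready to test."""
    problems = []
    if not state.source_promoted:
        problems.append("source pod is not promoted")
    if state.target_promoted:
        problems.append("target pod is not demoted")
    if state.target_undo_pod_exists:
        problems.append("an undo pod exists for the target pod (cleanup would fail)")
    if not state.all_volumes_srm_managed:
        problems.append("pod contains volumes not managed by the SRM pair")
    if not state.target_hosts_created:
        problems.append("hosts or host groups are missing on the target FlashArray")
    return problems


ready = PodPairState(True, False, False, True, True)
print(unmet_test_prerequisites(ready))  # []
```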
To execute a test recovery, find the SRM recovery plan and on the Recovery Steps tab, click Test.
A basic wizard appears with one option: to synchronize or not before the test begins.
The default behavior is to replicate any changes at the start of the test. To skip this step, deselect Replicate recent changes to recovery site. Unless your ActiveDR lag is high (in the minutes range, which would be abnormal), selecting this option or not makes no significant difference.
The SRA will then promote the target pod.
Once the pod is promoted, the GUI will reflect the current state.
The SRA will tag each volume in the pod with the key of puresra-testFailover_<timestamp> with a value of the source pod ID.
The SRA will then remove any existing connections of the target volumes and then connect them to the hosts or host groups identified by SRM.
SRM will then rescan the host HBAs, resignature the datastores, and continue on with the test (registering VMs, powering them on, etc.).
The datastores can now be seen in vCenter.
Note that the resignature process for VMFS includes a mandatory step that adds a name prefix in the form of snap-XXXXXXX. SRM has an advanced setting that can automatically remove this prefix if preferred, which is documented here.
When the test has been verified, click Cleanup to reset the environment.
Confirm the cleanup in the wizard and click Finish.
A cleanup will remove all VMs and resources created by the test. The SRA will disconnect all volumes in the pod and remove the tags on the volumes. The pod will then be demoted, resetting the environment for another test or recovery.
The undo pod created by the demotion will also be removed by the SRA (unless safe mode is enabled, in which case this step is skipped).
A recovery is very similar to a test recovery in process. There are some differences, however, that are worth noting. ActiveDR was specifically designed to make the process of disaster recovery easier (read: Active Disaster Recovery). The processes for a test recovery and full recovery (planned or otherwise) were thought through to find the most common areas of trouble: how can we make DR safer? How can the FlashArray make it simpler? Because of this attention to detail in the design of ActiveDR, it is extremely simple for a user to manage and for the SRA to automate.
Prior to a recovery, a source pod is in the promoted state and the target pod is in the demoted state. If the target pod is in the promoted state, it must be manually demoted prior to an attempted recovery. To initiate a recovery:
- The source pod must be promoted and the target pod must be demoted. The exception is when the source pod is inaccessible, whether through site failure, array failure, or network partition
- There is no "undo" pod still in existence for the source pod if it is online
- The pod is entirely populated by volumes owned and managed by the VMware environment controlled by an SRM pair. If SRM manages one volume in a pod, that same pair must manage all volumes in the pod
- The appropriate host or host groups are created on the target FlashArray
To initiate a recovery, navigate within SRM to the recovery plan and click Run.
The recovery wizard will ask you to accept the risk (a recovery takes production offline before it is restored on the recovery site).
Next, choose either a Planned Migration or a Disaster Recovery.
There is no significant difference in the process between these options; the main difference is how they respond to certain events. A Planned Migration is what it sounds like: a movement of the workload from site A to site B that is expected, anticipated, and well planned. In this case, any failure encountered in the recovery process will halt the execution of the workflow. The failure must be resolved and the recovery must be restarted. If the Disaster Recovery option is chosen, the workflow will tolerate almost any failure on the source site, up to the failure of the entire source environment; this includes but is not limited to the source network, vCenter, SRM server, storage, compute, and even the entire building that houses it all. It will not, however, tolerate failures on the target side, so a disaster recovery operation might fail if the target side is misconfigured in some way. If failures are encountered there, the process will halt. Fix the issue and re-run the process; the recovery will re-attempt the steps and continue if the problem has been resolved.
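The difference in failure handling between the two modes can be summarized in a short sketch. The names below are illustrative and not SRM or SRA identifiers, and the sketch simplifies "almost any failure on the source site" to "any source-side failure".

```python
from enum import Enum


class Mode(Enum):
    PLANNED_MIGRATION = "planned migration"
    DISASTER_RECOVERY = "disaster recovery"


class Site(Enum):
    SOURCE = "source"
    TARGET = "target"


def halts_workflow(mode, failed_site):
    """Return True if a failure at failed_site stops the recovery workflow."""
    if mode is Mode.PLANNED_MIGRATION:
        return True  # any failure halts a planned migration
    # Disaster recovery tolerates source-side failures only
    # (simplification: the guide says "almost any" source failure).
    return failed_site is Site.TARGET


print(halts_workflow(Mode.DISASTER_RECOVERY, Site.SOURCE))  # False
```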
The first operation is a synchronization.
Next, the VMs at the source site will be shut down and unregistered, and the datastores will be unmounted.
During the Change storage to read-only step, the SRA will:
- Disconnect the volumes in the pod from any hosts or host groups
- Tag all of the volumes (more detail on that in a moment)
- Demote the source pod. The demotion will use the "quiesce" flag, meaning that the pod will not finish the demotion until all of the pod's data has been replicated to the remote FlashArray
Each volume in the source pod will be tagged with the key puresra-demoted and a value of the source pod's UUID.
Right after the tags are created, the pod will be demoted.
There will be a final synchronization at that point.
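The order of the source-side operations in this step (disconnect, tag, demote) can be illustrated with a small in-memory model. The Volume and Pod classes and the function name are hypothetical stand-ins; the sketch makes no calls to a real FlashArray, and the quiesce behavior is only noted in a comment.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    name: str
    hosts: set = field(default_factory=set)   # connected hosts or host groups
    tags: dict = field(default_factory=dict)  # tag key -> tag value


@dataclass
class Pod:
    uuid: str
    volumes: list
    promoted: bool = True


def change_storage_to_read_only(pod):
    """Model of the source-side steps, in the order described above."""
    for volume in pod.volumes:
        volume.hosts.clear()                      # 1. disconnect from hosts
    for volume in pod.volumes:
        volume.tags["puresra-demoted"] = pod.uuid  # 2. tag with source pod UUID
    # 3. demote; the real demotion uses the quiesce flag so that all data
    #    (including the new tags) reaches the remote array first.
    pod.promoted = False


pod = Pod(uuid="example-pod-uuid", volumes=[Volume("vol1", hosts={"esxi-01"})])
change_storage_to_read_only(pod)
print(pod.promoted, pod.volumes[0].tags)
```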
When logging in to the target array and running a tag query on the volumes in the pod, the demoted tags appear on the target side volumes too:
This is because tags (along with everything else) get replicated to the target pod. So not only the data, but also the tags, QoS settings, etc. get copied to the target volumes. The final synchronization ensures these tags are replicated over.
The target side FlashArray operations occur during the Change recovery site storage to writable step.
First the pod is promoted.
Then the volumes are tagged with a tag that has a key of puresra-failover and the value of the UUID of the target pod.
If you query for tags without a filter (the above query looks for tags with the key of puresra-failover), you will see all of the tags.
Why does the SRA tag things at all? SRM requires operations to be idempotent, which means that running an operation, or re-running it, should not produce a different result. A better way to explain this is that SRM needs to be able to pick up where it left off without being supplied new information. The SRA needs to be able to figure out what it did and what it didn't do on the last attempt. Tagging the source volumes provides the array a way to say "yes, we demoted the source pod directly; it is not an error". The first time a failover is run, it is expected that the source pod is NOT demoted. But if the failover is run once, the array demotes the pod, something fails, and the failover is then run again, the SRA shouldn't fail because the pod is demoted. Tagging gives us a way to know: "yes, it is demoted, but it is demoted because we demoted it on purpose, so move forward". Tags get added during transition states and removed during static states (a transition state being, for example, between a recovery and a reprotect or between a test recovery start and a test recovery cleanup).
When querying on the target side for tags, if the puresra-demoted tags are not visible it means that the source site was down during the recovery and could not be tagged. Ensure that the recovery is re-run to complete the process (assuming of course the source site is back online).
The last step is the volumes will be connected to the appropriate hosts and host groups. Note that, before any connections are made, the SRA will disconnect all of the volumes in the recovery plan from any hosts or host groups first (in case they were pre-connected) and then connect them as needed. This ensures that the volumes are only connected to the hosts dictated by SRM.
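The disconnect-then-connect approach described above means the final connections depend only on what SRM dictates, not on what was connected before. A minimal sketch, using a hypothetical function name and plain dictionaries instead of real array calls:

```python
def connect_for_recovery(current_connections, srm_hosts):
    """Return new volume -> hosts connections for a recovery.

    current_connections maps each volume name to the set of hosts or host
    groups it is connected to before the recovery. srm_hosts is the set of
    hosts or host groups identified by SRM.
    """
    result = {}
    for volume in current_connections:
        connections = set()        # disconnect everything first
        connections |= srm_hosts   # then connect only what SRM dictates
        result[volume] = connections
    return result


before = {"vol1": {"pre-connected-host"}, "vol2": set()}
print(connect_for_recovery(before, {"esxi-01", "esxi-02"}))
```

Whatever was pre-connected, every volume ends up connected to exactly the SRM-identified hosts.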
The volumes will be connected, and VMware will rescan the ESXi environment, resignature the datastores, and mount them. VMs will then be powered on according to the recovery plan.
Note that the resignature process adds the snap-XXXXXXX prefix to the datastore names. SRM has the ability to automatically remove this by enabling a non-default option described here.
At the end of the recovery process (a reprotect has NOT been run yet), the target side is already automatically replicating back to the original source pod. In the example above, the original configuration was the pod called srm-podA replicating to srm-podB. During the recovery, srm-podA was demoted and then srm-podB was promoted. When Purity sees that the original source pod is demoted and the target pod is promoted, it automatically swaps the replication link. The original target pod (srm-podB) is now a source pod, replicating back to the original source pod (srm-podA), which is now a target pod.
This is the state of the pods BEFORE recovery.
This is the state AFTER recovery and BEFORE reprotect.
Notice that the replication direction is switched.
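The direction swap can be modeled in a few lines. This is a toy model of the rule described above (source demoted plus target promoted reverses the link), not Purity's actual implementation; ReplicaLink and update_link are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaLink:
    source: str
    target: str


def update_link(link, promoted):
    """Swap the link direction when the source is demoted and the target promoted.

    promoted maps each pod name to True (promoted) or False (demoted).
    """
    if not promoted[link.source] and promoted[link.target]:
        return ReplicaLink(source=link.target, target=link.source)
    return link


link = ReplicaLink("srm-podA", "srm-podB")
after_recovery = update_link(link, {"srm-podA": False, "srm-podB": True})
print(after_recovery)  # ReplicaLink(source='srm-podB', target='srm-podA')
```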
What if this turns out to be a mistake? What if the data that was in the original source pod (srm-podA) is needed after replication has reversed?
This is why the undo pod exists.
The data (and configuration) point-in-time of the original source pod (srm-podA) is stored in a "snapshot" of the pod that is created upon demotion of the source pod. That pod remains around for the eradication window (default 24 hours) or until manually eradicated, whichever comes first.
This can be seen under Destroyed and Undo Pods on the original source FlashArray.
The process to restore from an undo pod is not built into SRM and will be a manual operation. See the following guide for details.
While the SRA attempts to eradicate undo pods created by demotions in a test recovery cleanup, it does not eradicate undo pods created by the demotion during an actual recovery.
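The lifetime of an undo pod described above (the eradication window or manual eradication, whichever comes first) can be sketched as follows. The function names are hypothetical and the 24-hour value is the default mentioned in the text; a real array may be configured differently.

```python
from datetime import datetime, timedelta

DEFAULT_ERADICATION_WINDOW = timedelta(hours=24)  # default stated in the text


def undo_pod_expiry(demoted_at, window=DEFAULT_ERADICATION_WINDOW):
    """Time at which the undo pod is eradicated by the timer."""
    return demoted_at + window


def undo_pod_exists(now, demoted_at, manually_eradicated=False,
                    window=DEFAULT_ERADICATION_WINDOW):
    """True while the undo pod can still be used for a manual restore."""
    if manually_eradicated:
        return False
    return now < undo_pod_expiry(demoted_at, window)


demoted_at = datetime(2024, 1, 1, 8, 0, 0)  # arbitrary example time
print(undo_pod_expiry(demoted_at))  # 2024-01-02 08:00:00
```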
When an actual failure is the reason for a recovery operation within Site Recovery Manager (as opposed to disaster avoidance or a migration, which implies things are currently, at least, fully functional), some steps in the recovery plan may not be executable. Any and all steps that involve making changes at the source site can fail and SRM can still recover the workload.
If errors are encountered during a recovery, it is generally recommended to try to fix those errors and then re-run the recovery plan in planned migration mode. If that is not possible (due to an extended outage or irrecoverable loss of equipment), the recovery plan can be run (or re-run as the case may be) in Disaster Recovery mode.
If the source SRM server is offline, SRM will give no choice but to run it in that mode. Running the recovery workflow in disaster recovery mode does not change how the SRA behaves: the operations on the target side are the same whether or not the source site is down.
The SRA operations that might fail are:
- Demoting the source pod
- Tagging the source volumes
- Disconnecting the source volumes from hosts
Run the recovery in disaster recovery mode to get the workload started in the target site. Once the source site is back online, re-run the workflow in planned migration mode. The source site might not be fully recoverable; if it is not, manual reconfiguration is necessary.
If it is recoverable, a few scenarios might be encountered:
Volumes were never disconnected
If the VMs were never powered off, they will be shut down, the datastores unmounted, and the volumes disconnected.
Source pod is still promoted
In the case that the source pod is still promoted, the SRA will see that and check if the target pod is already promoted AND the volumes in the target pod have the puresra-failover tags. If so, it knows that the SRA previously promoted the target pod and that it is running production now. Therefore it will demote the source pod WITHOUT the quiesce option. This means that any changes built up since then on the source side are discarded (or rather, not sent to the target side). Once the demotion is complete, the target side becomes the source and it will begin replicating the changes back to the original source side (which is now the target). This demotion reverses the replication direction.
Remember that the demotion will create an undo pod of the state of the original source pod at the point of demotion. So any changes, volumes, configurations will be protected until the undo pod is eradicated manually or eradicated with time as dictated by the eradication timer (defaults to 24 hours).
Note that the source volumes are not tagged with the puresra-demoted tag unlike with a planned migration. Why? The demotion without quiesce will cause the tags to never be sent to the target. A valid question is: why not quiesce? Well because the target is promoted already. Source volume changes, like tags, only get reflected on target volumes when:
- A demoted target pod periodically refreshes itself with the latest replicated data
- A demoted target pod gets promoted
Since the target pod is already promoted, it will never pick up new tags. Creating them is pointless, and replicating them is pointless. But they are needed! This is how we determine that it is safe to run a reprotect. So if we detect upon the recovery side procedures that there is a source pod that has been demoted, and we see a target pod with only puresra-failover tags, not puresra-demoted tags, we then tag the target volumes directly on the target side with puresra-demoted and the original source pod UUID. This allows the tags to be in the exact state we need to run a reprotect. Source is demoted. Target is promoted. Target volumes have the puresra-failover tag with the value of the target pod UUID and the puresra-demoted tag with the value of the source pod UUID.
Source pod is demoted
If the SRA finds the source pod already demoted, it will check to see if the target pod is promoted and has the puresra-failover tags. If so, it will know that the SRA has failed over the source side and will move on. If the volumes in the target pod do not have the puresra-demoted tags (meaning they were never applied on the source and therefore the target volumes did not inherit them upon promotion), they will be applied to the target volumes.
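The two pod-state scenarios above ("source pod is still promoted" and "source pod is demoted") share the same decision logic, which can be sketched as below. The function name and the returned action strings are illustrative only; the actual SRA internals are not documented here.

```python
def rerun_actions(source_promoted, target_promoted, target_tag_keys):
    """Return the array-side actions a re-run would take, per the text above.

    target_tag_keys is the set of tag keys found on the target pod's volumes.
    """
    actions = []
    sra_failed_over = target_promoted and "puresra-failover" in target_tag_keys
    if not sra_failed_over:
        return actions  # outside the scenarios described in this section
    if source_promoted:
        # Target already runs production, so source changes are not sent over.
        actions.append("demote source pod without quiesce")
    if "puresra-demoted" not in target_tag_keys:
        # Tags created on the source would never reach a promoted target.
        actions.append("tag target volumes with puresra-demoted")
    return actions


print(rerun_actions(True, True, {"puresra-failover"}))
```

In both scenarios the end state is the one needed for a reprotect: source demoted, target promoted, and both tags present on the target volumes.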
In order to run a reprotect operation, a recovery operation must be fully executed. If the workflow was run in disaster recovery mode and there were failures on the source site, resolve the issues and re-run the recovery operation until all operations complete successfully. This may require some manual intervention in the VMware environment depending on the failure.
Once complete, the reprotect operation will be available for execution.
Click on the Reprotect button. In the window that pops up, select the box to confirm the process and click Next.
Review the details and click Finish to start the reprotect.
The reprotection process for ActiveDR is fairly straightforward for the SRA.
The process is:
- Remove the puresra-demoted and puresra-failover tags from all of the volumes on the now source site
- Confirm that replication is enabled and in-sync to the formerly source, but now target, site
Prior to a reprotect, the volumes on both sides will have both the puresra-demoted tag and the puresra-failover tag. Here are the states.
The tags are removed on the source volumes during the step Configure storage to reverse direction.
But the tags are still on the target side after the reprotect is completed.
This is because the tags are removed from the source volumes directly. Removing the tags on the source volumes causes the removal to be propagated to the target volumes within minutes, so there is no need to remove them directly (the array also blocks any direct manipulation of volumes in a demoted pod).
Their automated disappearance can take up to 15 minutes to occur, though the update can be forced by promoting then demoting the target pod (either manually or by running a test recovery).
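The tag behavior during a reprotect can be illustrated with plain dictionaries. This is a toy model under stated assumptions: the helper names and the UUID placeholder values are hypothetical, and the periodic refresh of the demoted target pod is reduced to a single function call.

```python
SRA_TAG_KEYS = ("puresra-demoted", "puresra-failover")


def remove_sra_tags(tags):
    """Reprotect step: strip the SRA tags from a source volume's tags."""
    return {key: value for key, value in tags.items() if key not in SRA_TAG_KEYS}


def refresh_target(source_tags):
    """Demoted target pod picks up the source's current tags on refresh."""
    return dict(source_tags)


# State before reprotect: both sides carry both tags (values are placeholders).
source = {"puresra-demoted": "old-source-pod-uuid", "puresra-failover": "new-source-pod-uuid"}
target = dict(source)

source = remove_sra_tags(source)   # reprotect removes tags on the source only
print(sorted(target))              # target still shows both tags for now
target = refresh_target(source)    # periodic refresh, or promote then demote
print(target)                      # {}
```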
It is recommended to run a reprotect operation as quickly as possible to re-enable disaster recovery. If a disaster occurs while the environment is in a recovered but not yet reprotected state, SRM-controlled recovery will not be possible and manual intervention will be needed.
Failback is identical to failover in every way. No temporary objects are created and no cleanup is normally required. The main thing to note is that if an undo pod for the current source pod is identified by the SRA, it will not be able to demote the source pod and the recovery will fail. To enable a recovery, manually eradicate any undo pod listed under the source pod and retry.
An undo pod can exist for the source pod if there are quick recovery operations back and forth (in less time than the eradication window, which defaults to 24 hours). If it is not possible to eradicate the undo pod (for example, if safe mode is enabled), run the recovery as a disaster recovery operation and then re-run the recovery once the pod is eventually eradicated by the timer.
See below for additional information on issues you might encounter and their resolutions.
Troubleshooting ActiveDR Recovery Errors in Site Recovery Manager
Troubleshooting ActiveDR Test Recovery Errors in Site Recovery Manager
Troubleshooting ActiveDR Reprotect Errors in Site Recovery Manager