
SRM User Guide: Periodic Replication SRM Workflow Behavior


Periodic replication on the FlashArray is replication managed by FlashArray protection groups. The FlashArray SRA supports protection group replication from one FlashArray to another. The SRM workflows covered here are test recovery, recovery, reprotect, and failback.

Prerequisites

In order for SRM to run any operations on FlashArray-hosted datastores/RDMs there are a few requirements:

  1. Replication must be configured on those volumes
  2. Array managers must be configured
  3. Array pairs protecting those volumes must be enabled
  4. Volumes must be discovered by SRM
  5. The volumes must be added to a protection group
  6. The protection group must be in at least one recovery plan
  7. Hosts and host groups must be pre-created on the recovery FlashArray for the target recovery clusters
  8. Add volumes directly to the FlashArray protection groups used with SRM protection groups and recovery plans. Protecting volumes indirectly, by placing Hosts or Host Groups in the protection group, behaves inconsistently with SRM and is supported on a best-effort basis only. Pure Storage is working to improve these workflows in a future release of the SRA, but at this time Pure recommends avoiding Host or Host Group placement for FlashArray protection groups.

There is no need to pre-create any recovery volumes or pre-connect anything to the recovery hosts on the FlashArray(s). The SRA automates the process of creating recovery volumes and connecting them to the appropriate host resources.
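The checklist above can be tracked mechanically before a workflow is attempted. The following is a minimal sketch in Python and is not part of SRM or the SRA; the class and field names are hypothetical labels for the eight requirements, and the values would have to be filled in by whatever checks you perform yourself.

```python
from dataclasses import dataclass, fields


@dataclass
class SrmPrerequisites:
    """One flag per prerequisite in the list above (names are hypothetical)."""
    replication_configured: bool = False
    array_managers_configured: bool = False
    array_pairs_enabled: bool = False
    volumes_discovered: bool = False
    volumes_in_srm_protection_group: bool = False
    protection_group_in_recovery_plan: bool = False
    recovery_hosts_precreated: bool = False
    volumes_are_direct_pgroup_members: bool = False


def unmet(prereqs: SrmPrerequisites) -> list[str]:
    """Return the names of the prerequisites that are not yet satisfied."""
    return [f.name for f in fields(prereqs) if not getattr(prereqs, f.name)]


if __name__ == "__main__":
    status = SrmPrerequisites(replication_configured=True,
                              array_managers_configured=True)
    for name in unmet(status):
        print("still missing:", name)
```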

Test Recovery

One of the primary benefits of Site Recovery Manager is the ability to test recovery plans--coordinating the failover of the replication and recovery of virtual machines without affecting the configuration, state, protection, or connectivity of the original source VMs.

Best Practice: Test recovery and test it often. Do it on a schedule as well as right after any changes in your replicated environment.

The high-level process of a test recovery is:

  1. Issue a synchronization of the relevant FlashArray volumes to the target.
  2. Create new FlashArray volumes from the replicated snapshots on the target array.
  3. Connect the volumes to the appropriate hosts and/or host groups.
  4. Rescan the target cluster(s).
  5. Resignature and mount the datastores.
  6. Power-on the VMs and configure them according to the recovery plan.

Below is a recovery plan with 9 virtual machines:

clipboard_eb48d4425103c1675691fd7bb268297c7.png

These are spread across four datastores replicating from a FlashArray called flasharray-m50-1 to a FlashArray called flasharray-m20-1:

clipboard_e2d5ffa165111284412b158c58bcf79b0.png

These are all seen as replicated devices in SRM:

clipboard_eed4e3b7c0fad227cdbf6925ab5b01bfb.png

These volumes are in a FlashArray protection group called srm-PG01:

clipboard_e1179ae950d1293629d8e5a1e2c4a3a39.png

To initiate a test, click the Test button in SRM:

clipboard_e4a49ad705ce60ef71cfb1befc1ec1595.png

A wizard will appear, with a default option called Replicate recent changes to recovery site selected:

clipboard_e6c198df0e974a7d511b1fd415048f79b.png

If this is selected, the SRA will create a new replication point-in-time. If you deselect it, the SRA will use the latest point-in-time found for each volume. Note that during failover the SRA always uses the latest available point-in-time. If the SRA creates a new one for the test, and another point-in-time completes before the recovery volumes are created, that newer one will be used for recovery.
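The selection rule described above amounts to "most recent wins, regardless of who created it". A minimal sketch of that rule follows; the `PointInTime` type and the snapshot names are hypothetical and only illustrate the ordering, not how the SRA queries the array.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class PointInTime:
    name: str          # hypothetical snapshot name
    created: datetime  # when the replicated point-in-time completed


def pick_recovery_point(points: list[PointInTime]) -> PointInTime:
    """Choose the most recent point-in-time, whoever created it."""
    if not points:
        raise ValueError("no replicated point-in-time available")
    return max(points, key=lambda p: p.created)


if __name__ == "__main__":
    points = [
        PointInTime("scheduled-1", datetime(2024, 1, 1, 10, 0)),
        PointInTime("created-for-test", datetime(2024, 1, 1, 10, 5)),
        PointInTime("scheduled-2", datetime(2024, 1, 1, 10, 7)),
    ]
    # The scheduled point-in-time that landed after the test-initiated one is chosen.
    print(pick_recovery_point(points).name)
```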

Click Next.

clipboard_e4b7a6dac97dd22b7966456617b88b298.png

Confirm the test and click Finish.

If you kept "replicate recent changes" selected, the SRA will initiate a new point-in-time on the target array and will give the protection group snapshot created a name including a UUID and the suffix of -puresra:

clipboard_e960df59d69ba83498601adc93a982c60.png

The SRA will create a new replication point-in-time and will apply the retention policy specified in the protection group to it, so the protection group snapshot will be destroyed and eradicated according to the schedule. If preferred, you can manually destroy and eradicate it earlier.

The SRA will then create one new volume for each protected volume on the target array with the original name of the volume plus a suffix of -puresra-testFailover:

clipboard_eec29b0dc1a61914d5ab7a177c4e77d3e.png

They will then automatically be connected to the appropriate hosts and/or host groups. SRM will then rescan the cluster and the datastores will be resignatured and mounted:

clipboard_e46b562721ff043681b0ce1ddbd2eaba5.png

Note that SRM will apply a prefix of "snap-XXXXXXXX" to the datastore names. This prefix can be removed automatically by enabling the advanced SRM setting described here:

Site Recovery Manager Advanced Options

The VMs will be registered and configured according to your recovery plan.

clipboard_e6dd17ab3f84857a70ebd7ac9c41fe5d2.png

Do not rename the test failover volumes during the test--this will cause them to not be cleaned up and you will need to destroy them manually.
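The warning above follows from the naming convention: the test volumes carry the source name plus the -puresra-testFailover suffix, and a renamed volume no longer matches. The sketch below illustrates that idea under the assumption that matching is purely name-based; the function names are hypothetical and this is not the SRA's actual implementation.

```python
TEST_SUFFIX = "-puresra-testFailover"


def failover_test_name(source_volume: str) -> str:
    """Name expected for the test copy of a protected volume."""
    return source_volume + TEST_SUFFIX


def cleanup_candidates(target_volumes: list[str],
                       protected_sources: list[str]) -> list[str]:
    """Volumes on the target array that still carry an expected test name."""
    expected = {failover_test_name(v) for v in protected_sources}
    return sorted(v for v in target_volumes if v in expected)


if __name__ == "__main__":
    on_target = ["srm-DS1-puresra-testFailover", "my-renamed-copy"]
    # Only the volume that kept its generated name is matched; the renamed
    # one would have to be destroyed manually.
    print(cleanup_candidates(on_target, ["srm-DS1", "srm-DS2"]))
```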

Test Recovery Cleanup

Once you have verified the test, click the cleanup button.

clipboard_e41ae52503d2b4ba091114ce4ee84abde.png

The cleanup process will power off the VMs and unregister them. The datastores will be unmounted and detached from the hosts. The SRA will then disconnect each test volume from all hosts, destroy it, and then eradicate it.

This resets the environment to its original state. The only objects that remain after a cleanup are any point-in-time protection group snapshots that were created by the test. Those will be destroyed according to retention.
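If you want to remove those leftover snapshots before retention does, they can be recognized by the -puresra suffix. A minimal sketch of such a filter is shown below; the snapshot names in the example are invented for illustration, and the exact name layout on a real array is not assumed beyond the suffix.

```python
SRA_SNAPSHOT_SUFFIX = "-puresra"


def sra_created(snapshot_names: list[str]) -> list[str]:
    """Return the snapshot names that end with the SRA suffix."""
    return [n for n in snapshot_names if n.endswith(SRA_SNAPSHOT_SUFFIX)]


if __name__ == "__main__":
    names = [
        "srm-PG01.12",                    # hypothetical scheduled snapshot
        "srm-PG01.example-uuid-puresra",  # hypothetical SRA-created snapshot
    ]
    print(sra_created(names))
```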

Recovery 

Site Recovery Manager offers two main modes of failover: Planned Migration and Disaster Recovery. A planned migration will fail if any problems are encountered at the source or target site. A disaster recovery operation will tolerate up to a full failure of the source site resources and still recover the virtual machines. It is recommended to run a planned migration if possible, as this ensures the cleanest failover of the environment. Furthermore, if a disaster recovery is run, manual cleanup of the source site will likely be required once its resources are back online.

clipboard_e9a68ede07ef3f172a2011c88663dbf0b.png

The high-level process of a recovery is:

  1. Issue a synchronization of the relevant FlashArray volumes to the target.
  2. Shut down the production-side virtual machines, unregister the VMs, and unmount the datastores.
  3. Synchronize the relevant FlashArray volumes again to the target.
  4. Create new FlashArray volumes from the replicated snapshots on the target array.
  5. Connect the volumes to the appropriate hosts and/or host groups.
  6. Rescan the target cluster(s).
  7. Resignature and mount the datastores.
  8. Power-on the VMs and configure them according to the recovery plan.
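The difference between the two failover modes can be sketched as a step runner: a planned migration stops on any failure, while a disaster recovery tolerates failures of source-site steps and carries on. This is an illustrative model only; the `Step` type, the site labels, and the function name are assumptions, not SRM internals.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    site: str                   # "source" or "target" (hypothetical labels)
    action: Callable[[], None]  # raises an exception on failure


def run_recovery(steps: list[Step], disaster_recovery: bool) -> list[str]:
    """Run steps in order and return the names of tolerated failures.

    Planned migration: any failure stops the run.
    Disaster recovery: source-site failures are recorded and skipped.
    """
    skipped = []
    for step in steps:
        try:
            step.action()
        except Exception as exc:
            tolerated = disaster_recovery and step.site == "source"
            if not tolerated:
                raise RuntimeError(f"recovery stopped at '{step.name}'") from exc
            skipped.append(step.name)
    return skipped
```

Steps returned in the list are the ones that would need manual cleanup on the source site once it is back online.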

Below is a recovery plan with 9 virtual machines:

clipboard_eb48d4425103c1675691fd7bb268297c7.png

These are spread across four datastores replicating from a FlashArray called flasharray-m50-1 to a FlashArray called flasharray-m20-1:

clipboard_e2d5ffa165111284412b158c58bcf79b0.png

These are all seen as replicated devices in SRM:

clipboard_eed4e3b7c0fad227cdbf6925ab5b01bfb.png

These volumes are in a FlashArray protection group called srm-PG01:

clipboard_e1179ae950d1293629d8e5a1e2c4a3a39.png

To start a recovery, click on the Run button on the recovery plan:

clipboard_ea7e2e9af463e061477483c0b6f71a5bf.png

Confirm the type of the recovery and the details of the operation and click Next then Finish.

clipboard_efa0b4a2bcf900e2353a6987b92acd1b3.png

The data will be synchronized to the target FlashArray(s) twice: once before the VMs are shut down, and once after:

clipboard_e62c77980e1f78b1e68241db822991baa.png

clipboard_ee72a33cd6f7a43a15bc65a30d85969b4.png

On the corresponding FlashArray protection groups you can see a new protection group snapshot created for each synchronization with a -puresra suffix (preceded by a random UUID) in the snapshot name.

clipboard_e4e539b9e33188cd7a24db4c1041187f3.png

The second point-in-time will likely be the one used for recovery, but if a new point-in-time is created between the second synchronization and the subsequent step, the latest one will be used.

Upon the step called Configure recovery site storage, the replicated snapshots will be copied to new FlashArray volumes on the target FlashArray.

The volumes will be named with the same name as their source with a suffix of -puresra-Failover added:

clipboard_e1ebc20c6a6c4b6b0b159581c4ff9b02c.png

The original source volumes will be disconnected from their hosts and also renamed at this point. The SRA will add the suffix of -puresra-demoted to those volume names:

clipboard_ef11016b446b04847c2f43d64dbac3181.png

It is recommended not to delete or rename the source volumes (the volumes with -puresra-demoted in the name) after a failover and prior to a reprotect. If a volume is renamed during this window, the SRA will not be able to find the original protection of the source volume. It will instead create a default protection group called PureSRADefaultProtectionGroup with replication enabled back to the source array.

The device discovery screen will show the device pair(s) as Failover Complete.

clipboard_e11e0c18e4b1c2e75c52b771b521be15c.png

The recovery volumes are connected to the appropriate hosts or host groups on the recovery site, and SRM will resignature and mount them. The resignature process adds a prefix of snap-XXXXXXXX to the datastore names. SRM can be configured to remove the prefix automatically through the advanced configuration documented in Site Recovery Manager Advanced Options.

clipboard_e9c1aa6242604dcb02f3df61980916116.png

The virtual machines will be registered, configured, and then powered-on according to the recovery plan.

clipboard_e07863b91a85bddf58f9363728403af6d.png

Reprotect

A reprotect operation automates the replication of failed over volumes back to the original FlashArray. In order to run reprotect, a fully successful recovery must occur. If the disaster recovery operation was executed and skipped multiple steps due to failures, a reprotect might not be possible. In this case manual reprotection might be required--which is essentially the same process as setting up replication for the first time. If a reprotect fails, try again with the Force Cleanup option checked (it will only appear if a reprotect has failed once).

Once a recovery operation has completed, it is recommended to run the reprotect as soon as possible. This will ensure the data being generated on the now production site is being protected.

The reprotect operation does the following things:

  1. Sets up replication on the FlashArray(s) that are now running the VMs
  2. Reverses the SRM protection groups and recovery plans
  3. Initiates a synchronization of the data.

To start a reprotect, click on the desired SRM recovery plan and click Reprotect.

clipboard_ef815591acb49a7034d14209a2e7971b3.png

Confirm the action and click Next, then Finish.

clipboard_eda713cf06d3d0c3cc7586ab618e3f534.png

The SRA will then look for the FlashArray protection group or groups of the source volume.

For ease of explanation, FlashArray A will be the array that was failed over from and FlashArray B will be where the volume was failed over to and where the VMs are currently running before reprotection.


The process follows these rules:

  • The SRA will look for the original source volume on FlashArray A. If it cannot find it, the SRA will set up a new protection group on FlashArray B (see below for details). The source volume is identified by its volume name, so if it has been renamed on the source the lookup will fail.
    • Protection groups on FlashArray A that include the original source volume will be created on FlashArray B even if they do not have replication enabled and/or replicate to a different array
    • Protection group name matching is not case-sensitive. So if FlashArray A has a protection group called srm-pg and FlashArray B has one called SRM-PG, they will be considered the same and no new group will be created.
    • For protection groups created by the SRA, it will match the replication and local snapshot policy.
    • Protection groups that are created by the SRA during reprotect will only add the original FlashArray as a replication target--no matter how many targets were in the original protection group. If you would like the created protection group to also replicate to other arrays, add them manually as targets later.
  • If the original source volume on FlashArray A is in more than one protection group the SRA will re-create all protection groups on FlashArray B.
  • If a protection group with the same name already exists on FlashArray B, the SRA will not re-create the group and will put the volume in the existing group.
    • If the policy of a pre-existing protection group does not match the policy of the original source protection group, the group will still be used. The SRA will not update the pre-existing protection group with the policy from the original array.

Once the protection has been re-created on the now production side, the original source volume(s) will be removed from protection groups that replicate to the now production array. If the volume is in protection groups that do not replicate to the now production array, the volume will be left in those.

If the SRA cannot find the original source volume (and therefore cannot identify the correct protection groups) the SRA will instead create a protection group with the default replication policy with the name PureSRADefaultProtectionGroup replicating back to the original array. You may edit and change (or even remove) this protection group as desired afterwards. Just ensure that the desired volumes are still replicated by a protection group.

clipboard_eabc7f1c840c54f86930d6966a015d1b0.png
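The matching rules above can be summarized as a small planning function. The sketch below is a simplified model, not the SRA's code: it assumes protection group membership is available as plain name sets, treats "source volume not found" and "source volume in no protection group" the same way, and ignores policies and replication targets. All function and type names are hypothetical.

```python
from dataclasses import dataclass, field

DEMOTED_SUFFIX = "-puresra-demoted"
DEFAULT_PGROUP = "PureSRADefaultProtectionGroup"


@dataclass
class ReprotectPlan:
    create_on_b: list[str] = field(default_factory=list)  # groups to create on FlashArray B
    reuse_on_b: list[str] = field(default_factory=list)   # existing groups on B used as-is


def plan_reprotect(volume: str,
                   pgroups_on_a: dict[str, set[str]],
                   pgroups_on_b: set[str]) -> ReprotectPlan:
    """Decide which protection groups the failed-over volume joins on B.

    pgroups_on_a maps group name -> member volume names on FlashArray A.
    pgroups_on_b holds the group names already present on FlashArray B.
    """
    demoted = volume + DEMOTED_SUFFIX
    source_groups = [name for name, members in pgroups_on_a.items()
                     if demoted in members]
    if not source_groups:
        # Lookup by name failed: fall back to the default group.
        source_groups = [DEFAULT_PGROUP]

    # Group names are compared without regard to case.
    existing_by_lower = {name.lower(): name for name in pgroups_on_b}
    plan = ReprotectPlan()
    for name in source_groups:
        existing = existing_by_lower.get(name.lower())
        if existing is not None:
            plan.reuse_on_b.append(existing)
        else:
            plan.create_on_b.append(name)
    return plan
```

For example, with srm-DS1-puresra-demoted in srm-pg on FlashArray A and a group called SRM-PG already on FlashArray B, the plan reuses SRM-PG and creates nothing.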

The reprotect operation will then complete swapping directions of the SRM objects and reconfiguring protection of the virtual machines in the recovery plan.

clipboard_eab0a419a6a5235cd4e26fc8bb47cee69.png

The reprotect operation will rename the volume on the target array to remove the applied suffix of -puresra-failover.

clipboard_e713fb6936e10833eadb7598acf5539a4.png
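Because this rename drops the suffix, it cannot succeed if another volume on the same array already holds the unsuffixed name. A minimal pre-check sketch follows; it assumes you can obtain the list of volume names on the array yourself, and the function name is hypothetical. The suffix is matched without regard to case because it appears with different capitalization in different places.

```python
FAILOVER_SUFFIX = "-puresra-failover"


def rename_conflicts(volumes_on_target: set[str]) -> list[str]:
    """Return failed-over volumes whose unsuffixed name is already taken."""
    conflicts = []
    for name in sorted(volumes_on_target):
        if name.lower().endswith(FAILOVER_SUFFIX):
            original = name[: -len(FAILOVER_SUFFIX)]
            if original in volumes_on_target:
                conflicts.append(name)
    return conflicts


if __name__ == "__main__":
    volumes = {"srm-DS1", "srm-DS1-puresra-failover", "srm-DS2-puresra-failover"}
    print(rename_conflicts(volumes))
```

When building the input set, include destroyed volumes that are still awaiting eradication, since those names remain reserved.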

If failures occur, the Force Cleanup option will become available, but it is strongly recommended not to resort to that option immediately. Instead, attempt to identify and resolve the underlying problem, and re-run the reprotect without Force Cleanup selected until it succeeds.

Reprotect failures most often originate in the VMware environment (placeholders are missing, mappings are incorrect or missing, etc.), though some FlashArray conditions can cause them too:

  • Replication connection does not exist back to the array. If so, re-create this connection.
  • Array managers are configured incorrectly
  • Original source volume was renamed. It should be in the form of <volume name>-puresra-demoted. If it is not, rename it back. You will see an error like: "SRA command 'prepareReverseReplication' failed for device '<volume name>'. Cannot find the volume <volume name> on the array <arrayname>. Please make sure that replication setup is correct". If the reprotect fails in the prepareReverseReplication phase, the source volume has most likely been renamed manually or destroyed. Either rename it back or, if it is gone, run an SRM device discovery and then re-attempt the reprotect without Force Cleanup checked. Only if that fails should you retry with Force Cleanup checked.
  • Target volume was renamed. It should be in the form of <volume name>-puresra-failover. If it is not, rename it back. You will see an error like: SRA command 'reverseReplication' failed for device '<volume name>'. Cannot find the volume '<volume name>' on the array <arrayname>.
    • If this is the case, fix the name of the volume AND ensure that the original volume is still in a protection group replicating to the target. The reprotect operation will have removed it from any replication groups at this stage causing device discovery to fail.
  • A volume exists with the original name but no suffix. If the original volume was srm-DS1 and it was failed over to the second array, the failover volume will be called srm-DS1-puresra-failover. If a volume called srm-DS1 already exists on the failover array, the initial recovery will not fail, but the reprotect will: it tries to rename srm-DS1-puresra-failover to srm-DS1, and that name is already taken. This is unlikely, but it can occur. You will see an error upon reprotect like: Failed to reverse replication for device 'peer-of-53f027a7-828d-4b4d-a3a8-d4b2c8364507:srmDS-08'. SRA command 'reverseReplication' failed for device 'peer-of-53f027a7-828d-4b4d-a3a8-d4b2c8364507:srmDS-08'. Could not rename failed over volume srmDS-08 to peer-of-53f027a7-828d-4b4d-a3a8-d4b2c8364507:srmDS-08 on array flasharray-m20-1. If you cannot see the conflicting volume, note that it might be in the destroyed volumes folder awaiting eradication.
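The error texts quoted in the list above share a common shape, naming the SRA command and the device. A small helper can pull those two fields out of a log line and attach a first hint. This is a troubleshooting sketch only: the regular expression is derived from the example messages shown here, and the hint strings are paraphrases of this list, not output from the SRA.

```python
import re

# Hints paraphrased from the troubleshooting list; keys are the SRA command
# names that appear in the SRM error text.
HINTS = {
    "prepareReverseReplication":
        "source volume missing or renamed; expected <volume name>-puresra-demoted",
    "reverseReplication":
        "target volume renamed, or its unsuffixed name is already in use",
}

_COMMAND = re.compile(r"SRA command '(\w+)' failed for device '([^']+)'")


def diagnose(message: str):
    """Return (command, device, hint) for an SRM error, or None if unrecognized."""
    match = _COMMAND.search(message)
    if match is None:
        return None
    command, device = match.groups()
    return command, device, HINTS.get(command, "no FlashArray-specific hint")


if __name__ == "__main__":
    line = ("SRA command 'prepareReverseReplication' failed for device 'srm-DS1'. "
            "Cannot find the volume srm-DS1 on the array flasharray-m50-1.")
    print(diagnose(line))
```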
     

Failback

An SRM failback is essentially the same process as a recovery, except that it runs from site B to site A instead of from site A to site B. All comments in the recovery section therefore apply. The only real difference is that a failback implies at least one recovery from the opposite site has already been executed.

To the end user this does not change how the process is managed or reflected in SRM or in vSphere. That said, the FlashArray SRA will attempt to re-use the original volumes if they are still there. The SRA looks for volumes named after the volumes to be failed over, with a suffix of -puresra-demoted. If such a volume has been renamed or destroyed, it will not be re-used and the SRA will create a new volume with the correct name.
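The re-use decision comes down to a name lookup on the original array. A minimal sketch of that lookup is below, assuming the set of volume names on the original array is available; the function name and the returned labels are hypothetical.

```python
DEMOTED_SUFFIX = "-puresra-demoted"


def failback_action(volume: str, volumes_on_original_array: set[str]) -> str:
    """Return "reuse" if the demoted original still exists, else "create"."""
    demoted = volume + DEMOTED_SUFFIX
    return "reuse" if demoted in volumes_on_original_array else "create"


if __name__ == "__main__":
    original_array = {"srm-DS1-puresra-demoted", "srm-DS2-archived"}
    print(failback_action("srm-DS1", original_array))  # demoted volume still present
    print(failback_action("srm-DS2", original_array))  # renamed, so not found
```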