Pure Technical Services

Troubleshooting ActiveDR Test Recovery Errors in Site Recovery Manager


Below is a listing of a few scenarios that might cause an ActiveDR test recovery to fail in Site Recovery Manager.

Test Recovery Start Fails: Pod is already promoted 

If the target pod is already promoted when the test starts, the SRA fails the operation with the following error:

 Failed to sync data on replica devices. SRA command 'syncOnce' failed. The target pod <pod name> of the operation is already promoted. Make sure the target pod is demoted before running a planned migration or disaster recovery.

Demote the target pod and retry the test recovery.

Test Recovery Cleanup Fails: More than one type of device found in the pod

This error has several possible causes, all of which involve the SRA finding unexpected devices in the target pod during the cleanup of a test recovery.

The scenarios that can cause this are:

  • A non-SRM controlled volume has been added to the source pod.
  • A volume has been added to the target pod after the test recovery start and before the test recovery stop (cleanup) has been run.
  • The test recovery tags on the target volumes have been edited or removed after the test recovery start and before the test recovery stop (cleanup) has been run.

A Non-SRM controlled Volume is added to the source pod

As of the 4.0 release of the Pure Storage SRA, volumes that do not belong to the VMware environment controlled by SRM (that is, volumes not in use as a VMFS or RDM) are not supported in an ActiveDR pod. In other words, for a given ActiveDR pod, either all of the volumes must be in use as a VMFS or RDM and protected by the same SRM protection group, or none of them. Supporting "non-VMware" volumes is under investigation for a future release of the SRA.

If a volume is added to the source pod prior to a test:

clipboard_ef869900b4cf24e81c7b1e4f76a2436ef.png

And the 4th volume (in this case) is not in the protection group:

clipboard_e565d4b3ddc4b6db967cb522ecbdc8dae.png

The test recovery will still succeed, but the cleanup will fail. The test recovery will tell the SRA to recover the 3 volumes SRM knows about, and the SRA will tag them:

clipboard_e766814ecfe50cef80026328e082b5805.png

During the cleanup, the SRA will see a 4th volume which is untagged. The cleanup will fail with the following error:

clipboard_e55ceedce78bfbfb19c9bff0699869fe0.png

 Failed to delete snapshots of replica devices. SRA command 'testFailoverStop' failed. More than one type of devices found in the pod. Please verify that all the volumes are either untagged or tagged with same value in the "puresra" namespace.

This failure can also occur if someone manually deleted the tags on one of the expected volumes. The SRA fails in this situation because it will not demote a pod that contains an in-use volume: demotion causes all volumes in the pod to go NOT READY, meaning they can be neither written to nor read from. If some environment is using that 4th volume, it will lose access. In this situation, manual intervention is required.
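The condition named in the error message (all volumes either untagged or tagged with the same value) can be sketched as a small check. This is only an illustration of the rule as the error text states it, not the SRA's actual implementation; the function name, the input shape, and the volume names in the usage note are hypothetical.

```python
from typing import Dict, Optional


def pod_tag_state(volume_tags: Dict[str, Optional[str]]) -> str:
    """Classify a pod by the "puresra" tag value found on each volume.

    volume_tags maps a volume name to its tag value, or to None when the
    volume carries no tag in the "puresra" namespace (hypothetical input).
    Returns "untagged", "tagged", or "mixed".
    """
    distinct = set(volume_tags.values())
    if not distinct or distinct == {None}:
        return "untagged"  # no volume carries a test recovery tag
    if None not in distinct and len(distinct) == 1:
        return "tagged"    # every volume carries the same tag value
    return "mixed"         # the state the cleanup error complains about
```

For example, three tagged volumes plus one untagged volume, as in the scenario above, classify as "mixed": `pod_tag_state({"vol1": "uuid-a", "vol2": "uuid-a", "vol3": "uuid-a", "vol4": None})`.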

The options are:

  • Delete the non-VMware volume
  • Terminate the pod replica link, move the volume out of the pod, and then relink the pods (note this will cause replication to stop for the duration of the process!).

After taking one of these actions, re-run the test recovery cleanup. Alternatively, run the test recovery cleanup with the force cleanup option selected so that the SRA skips the demotion step. With the force option, the pod must then be demoted manually.

clipboard_eed9e98a72c0513f5c2ae936c4bd04569.png

Volume added to target pod during test

When the target pod is promoted during a test, the pod becomes available for provisioning. While it is possible to provision from a pod that is promoted for a test, it is not recommended. Upon demotion the pod syncs back up with the source pod, which does not contain the volume created in the temporarily promoted target pod, so that volume will be destroyed.

In this case there are two pods linked for ActiveDR, a source pod called srm-podA and a target pod srm-podB. During an SRM test, someone added a volume to srm-podB called volcreatedintarget:

clipboard_e846f4362f46d8372154de980312276f9.png

This volume is not in the source pod:

clipboard_e1470c9d687d54cb04e4285f1cbaa4976.png

The SRM protection group is not aware of it either:

clipboard_ef139a7fa333fcf5cc6253512c7322a0c.png

A cleanup will fail because that volume is unknown and therefore not tagged as being in test. The SRA will not continue on and demote the pod because that volume could be serving production data; someone might have accidentally provisioned it in the target pod while it was promoted. Doing so is a mistake, but accidents happen and the SRA is designed not to make the situation worse. Therefore, the cleanup will fail:

clipboard_e56d73bc8ce24c703bba5f59e26aaf9a9.png

 Failed to delete snapshots of replica devices. SRA command 'testFailoverStop' failed. More than one type of devices found in the pod. Please verify that all the volumes are either untagged or tagged with same value in the "puresra" namespace.

clipboard_edf421fd25a36b758b0628c9091a35f71.png

The options are:

  • Move the data off of the volume to a new FlashArray volume in a different pod via some host-based tool and then delete the original volume.
  • Terminate the pod replica link, move the volume out of the pod, and then relink the pods (note this will cause replication to stop for the duration of the process!).

After taking one of these actions, re-run the test recovery cleanup. Alternatively, run the test recovery cleanup with the force cleanup option selected so that the SRA skips the demotion step. With the force option, the pod must then be demoted manually.

clipboard_eed9e98a72c0513f5c2ae936c4bd04569.png

A Test Recovery tag was accidentally deleted

It is important to remember: DO NOT manually edit or remove tags in the puresra namespace unless you fully understand what you are doing. If an administrator or a faulty script deletes the tags during a test recovery, manual intervention will be necessary.

During the test recovery, all volumes in the target pod get tagged with three pieces of information:

  • The operation (like testFailover)
  • The timestamp of the test
  • The UUID of the pod they belong to

The tags look like this:

clipboard_eb58721c3bf13b38f1a1728c87e63dcba.png
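To make the three pieces of information concrete, here is a minimal sketch that splits a tag into its parts. It assumes the layout seen in the one example in this article (key `puresra-testFailover_1597691959294`, value holding the pod UUID) and assumes the numeric suffix is a Unix timestamp in milliseconds. Neither assumption is confirmed by the SRA documentation, and the names `RecoveryTag` and `parse_tag` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class RecoveryTag:
    operation: str     # e.g. "testFailover"
    started: datetime  # decoded from the numeric suffix (assumed epoch ms)
    pod_uuid: str      # the tag value


def parse_tag(key: str, value: str) -> RecoveryTag:
    """Split a puresra tag key of the assumed form 'puresra-<op>_<millis>'."""
    prefix = "puresra-"
    if not key.startswith(prefix):
        raise ValueError(f"not a puresra tag key: {key!r}")
    operation, _, stamp = key[len(prefix):].rpartition("_")
    if not operation or not stamp.isdigit():
        raise ValueError(f"unexpected tag key layout: {key!r}")
    # Assumption: the suffix counts milliseconds since the Unix epoch.
    started = datetime.fromtimestamp(int(stamp) / 1000, tz=timezone.utc)
    return RecoveryTag(operation, started, value)


tag = parse_tag("puresra-testFailover_1597691959294",
                "e184134e-5a94-9f76-5c53-ae33e6df93d5")
print(tag.operation, tag.started.date(), tag.pod_uuid)
```

Decoding the key this way is only useful for reading a tag; when re-adding a deleted tag, copy the key and value verbatim from another volume in the pod rather than constructing them.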

If someone deletes one of the tags, the cleanup process will fail. In this example, the third volume (activeDR-SRM3-RDM) was untagged:

clipboard_efb95aff61c709e0573470048943fcd56.png

clipboard_ee30b23111f73cedf2f966f99ac711b06.png

 Failed to delete snapshots of replica devices. SRA command 'testFailoverStop' failed. More than one type of devices found in the pod. Please verify that all the volumes are either untagged or tagged with same value in the "puresra" namespace.

 If you log into the target array CLI, you can query for the tags of the volumes in the pod:

clipboard_e7267003d4dff53d4ef7f1db6c496c1e6.png

If you add the target pod name with an asterisk at the end, you will only see volumes in that pod.

Compare that list to the list of volumes in the SRM protection group:

clipboard_e42e8a7ecadaaf13fa266ab8f51e77782.png

If there is a volume in the protection group that is missing in the tagged volume list (and that volume is in fact in the target pod), you know the tag was for some reason deleted.

If the two lists contain the same volumes, then someone added a volume to the target pod during the test. See the section "Volume added to target pod during test" above for how to handle that.
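The comparison described above can be sketched as a set difference. This is a hedged illustration only: the function names are hypothetical, the inputs are plain lists of names copied by hand from the CLI output and the SRM protection group, and the `pod::volume` naming is taken from the purevol example later in this article.

```python
from typing import Iterable, List


def strip_pod(name: str) -> str:
    """Drop a leading 'pod::' prefix, e.g. 'SRM-podB::vol' -> 'vol'."""
    return name.split("::", 1)[-1]


def volumes_missing_tag(protection_group: Iterable[str],
                        tagged_in_pod: Iterable[str]) -> List[str]:
    """Return protected volumes that do not appear in the tagged list.

    A non-empty result suggests a tag was deleted (after confirming the
    volume really is in the target pod). An empty result means the lists
    match, which points to a volume added to the target pod instead.
    """
    protected = {strip_pod(v) for v in protection_group}
    tagged = {strip_pod(v) for v in tagged_in_pod}
    return sorted(protected - tagged)
```

With hypothetical names, `volumes_missing_tag(["activeDR-SRM1", "activeDR-SRM3-RDM"], ["SRM-podB::activeDR-SRM1"])` returns `["activeDR-SRM3-RDM"]`, the volume whose tag needs attention.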

If a volume is missing the tag, you can either add the tag back or re-run the test recovery cleanup with the force option and then manually demote the pod. Demoting the pod erases all of the tags on the target volumes. Generally the latter is the preferred option, as manual tagging can be error prone.

To add the tag:

Look at the tags in the pod, copy the key and the value. Identify the volume that needs to be tagged and tag it as follows:

purevol tag SRM-podB::activeDR-SRM3-RDM --key puresra-testFailover_1597691959294 --value e184134e-5a94-9f76-5c53-ae33e6df93d5 --namespace puresra
clipboard_e0dbfd1eaac8c3005e670bacdb17bcf1c.png

The above shows listing the tags, adding the tag, then listing the tags again. You must use the puresra namespace--otherwise the tag will not work.
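Because a mistyped key, value, or namespace makes the tag ineffective, it can help to assemble the command from copied values instead of typing it by hand. The sketch below only builds the command string in the same form as the purevol example above; it does not contact the array, and the helper name is hypothetical.

```python
import shlex


def build_tag_command(pod: str, volume: str, key: str, value: str,
                      namespace: str = "puresra") -> str:
    """Return a 'purevol tag' command line in the form shown in this article."""
    return shlex.join([
        "purevol", "tag", f"{pod}::{volume}",
        "--key", key,
        "--value", value,
        "--namespace", namespace,  # must be puresra for the SRA to see the tag
    ])


print(build_tag_command(
    "SRM-podB", "activeDR-SRM3-RDM",
    "puresra-testFailover_1597691959294",
    "e184134e-5a94-9f76-5c53-ae33e6df93d5",
))
```

Paste the printed line into the target array CLI only after checking the key and value against the tags on the other volumes in the pod.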

Once done, re-run the test recovery cleanup.

 clipboard_e39183e25b57603d8b1cd2e6cac3214cc.png

The alternative (let's assume the same scenario as above) is to force the test recovery cleanup and then manually demote the pod.

clipboard_ec06f1b14dd0b6fa63979bf5596516485.png

Let the cleanup complete, then log in to the target FlashArray and identify the target pod. Click on the vertical ellipsis and choose Demote. CONFIRM that this is indeed the target pod, not the source pod. The replication direction should point towards this pod, as in the screenshot below:

clipboard_e3286161aa15723547a5272a684a6afd6.png

Click Demote.

clipboard_ee3c0f185eacf066e23908f65791b6b79.png

The tags will be removed:

clipboard_e4137fc461642a7d87f70831ec3c996f0.png

Once you confirm that the demotion caused no issues, you can eradicate the undo pod right away if you need to run another test. Otherwise, let the eradication timer remove it in due time.

clipboard_ebc80a66fd4f49f119dad971302280ef6.png

Test Recovery Cleanup Fails: Undo Pod already exists on target site 

If a target pod was manually promoted, or the SRA was unable to eradicate the undo pod (for instance, if SafeMode was enabled), a test recovery cleanup will fail because the pod cannot be demoted while an undo pod exists.

 Failed to delete snapshots of replica devices. SRA command 'testFailoverStop' failed. Array operation failed: "PureRestException: HttpStatusCode = 'BadRequest', RestErrorCode = 'InternalError', Details = '["context":"SRM-podB","message":"Cannot demote pod because the associated undo-demote pod has not been eradicated.","context":"PATCH","message":"https://flasharray-m20-2.purecloud.c..._quiesce=False"]', InnerException = ''". Array operation failed. Please see logs at /srm/sra/log/testFailoverStop_2020-08-17-16-35-16-3662578-45c6e077-144a-4f5b-8c8d-f46c61a40ecb.log.

To verify, log in to the target FlashArray and look under Destroyed and Undo Pods to see whether there is a destroyed "undo" pod for your target pod. If one exists, the test recovery cleanup will not be possible until it is fully eradicated.

There are three reasons that the undo pod might still exist:

  1. Someone manually promoted and then demoted the target pod but did not eradicate the undo pod.
  2. The credentials in the array manager within SRM are at the storage admin permission level. Storage admin permissions cannot eradicate a pod; this requires the highest level of FlashArray permissions (array admin). Either ask an array admin to eradicate it, wait for timed eradication, or update the SRM array manager credentials for that ActiveDR pair to array admin.
  3. SafeMode is enabled on the FlashArray and therefore no object can be manually eradicated.

Manually eradicate the undo pod, or wait for the eradication timer to remove it automatically according to the schedule, and then retry the test recovery cleanup.

Note that if you have enabled "SafeMode" on the FlashArray, manual eradication will not be possible. This also means that only one SRM test recovery can be run per pod pair per eradication window.

To resolve the failure, either eradicate the undo pod and re-run the cleanup, or re-run the cleanup with Force Cleanup selected.

If you force the cleanup, demote the pod afterwards, once the undo pod has been eradicated either manually or by the automatic eradication window. Note that a subsequent test recovery or recovery will not be possible until the pod is demoted.

Replication Link is Paused

There is a current known issue with the 4.0 release of the SRA with ActiveDR when the replication link is paused between the source and target pod.

clipboard_e9473b1fdfebfd85e903138bdceae5c07.png

The Synchronize Storage step will appear to hang at 100%:

clipboard_eb12428d486e1c2e6eb524f008edbca7a.png

Go to either FlashArray in the ActiveDR pod pair and navigate to the pod. In the Pod Replica Links box, click the vertical ellipsis and choose Resume:

clipboard_e67e7d6f1a70391bc87e7710c2f47e679.png

Confirm the resume operation:

clipboard_edda048b6bb32a80076f7701a3477a3c1.png

The test recovery will then complete once synchronized. If you prefer to not re-enable replication but still perform a test, cancel the test recovery and de-select the Replicate recent changes to recovery site option.

clipboard_e4d5df464c5a767cb97416acba24579cf.png

This will cause the synchronization to be skipped.

clipboard_e8399bc67e9daf5b76f32f1f3ab26d50b.png