Pure Technical Services

Troubleshooting ActiveDR Recovery Errors in Site Recovery Manager


The recovery process in Site Recovery Manager (SRM) is fairly simple from an SRA operation perspective, but certain situations can cause a recovery to fail. This KB details the situations that can cause an error or failure during an ActiveDR recovery in SRM. Pure Storage continues to work on ways for the SRA (or the FlashArray itself) to handle these situations automatically. If and when that happens, this article will be updated with new guidance.

Undo Pod Exists for Source Pod

 Failed to demote source consistency group 'e184134e-5a94-9f76-5c53-ae33e6df93d5'. SRA command 'prepareFailover' failed for consistency group 'e184134e-5a94-9f76-5c53-ae33e6df93d5'. Demoting the pod <pod name> failed. Error message from Purity: Message from Purity='ctx:<pod name>,msg:Cannot demote pod because the associated undo-demote pod has not been eradicated.' Message from Purity='ctx:PATCH,msg:https://flasharray-m20-2.purecloud.c..._quiesce=False'. Please verify that the source pod is setup correctly.

One of the first steps of a recovery operation is to demote the source pod. When a pod is demoted, an "undo" pod is created to store the state of the pod at the time of demotion. This undo pod can be eradicated manually; otherwise it is eradicated automatically once the eradication timer expires (24 hours by default). For a given pod, only one undo pod can exist at a time (though this might change in the future). If an undo pod exists for a pod, the pod cannot be demoted again until the undo pod is eradicated.
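To make the timing concrete, here is a minimal sketch of the arithmetic. It assumes the eradication timer starts at the moment of demotion and uses the 24-hour default mentioned above; the function names are hypothetical, and the delay configured on a given array may differ.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 24 hours is the default eradication delay mentioned in the article;
# the value configured on a particular array may be different.
DEFAULT_ERADICATION_DELAY = timedelta(hours=24)


def undo_pod_eradication_time(demoted_at, eradication_delay=DEFAULT_ERADICATION_DELAY):
    """Return when an undo pod created at `demoted_at` would be auto-eradicated.

    Assumes the timer starts at the moment of demotion.
    """
    return demoted_at + eradication_delay


def time_until_demotable(demoted_at, now, eradication_delay=DEFAULT_ERADICATION_DELAY):
    """Return how long until the pod can be demoted again (never negative)."""
    remaining = undo_pod_eradication_time(demoted_at, eradication_delay) - now
    return max(remaining, timedelta(0))


demoted_at = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
now = datetime(2024, 1, 1, 15, 30, tzinfo=timezone.utc)
print(time_until_demotable(demoted_at, now))  # 17:30:00
```

If the undo pod is eradicated manually, the wait drops to zero and the pod can be demoted again right away.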

So if the source pod has an undo pod when a recovery is attempted, you will see the plan quickly fail:

clipboard_e4d4d12d319ac64ede41ca6020cab1b9e.png

If this is the case, identify the undo pod and eradicate it:

clipboard_e3cd5bda29a8e57db21e87c2fdc23d63d.png

Now re-run the recovery, preferably in planned migration mode (or in disaster recovery mode if that is necessary for some other reason).

If you cannot eradicate it (for instance, safe mode is enabled and objects cannot be manually eradicated), you must re-run the recovery plan in disaster recovery mode:

clipboard_edaee0583ae986d200b46e83187fc2828.png

When the plan is run in disaster recovery mode and the source cannot be demoted, that step will fail, but the error will be skipped. As a result, the source pod is not demoted.

clipboard_e9145cf57f0bb9e2291786776702d0f93.png

Note that this allows the recovery to proceed, but the target side cannot be reprotected until the undo pod is gone. As soon as the undo pod is eradicated, run the recovery one last time to complete the process (demotion of the source pod), which also enables replication to restart from the new source (previously the target) pod.

clipboard_e016f4933482d52eba13e4bafbcea4065.png

Confirm the recovery:

clipboard_e06b142c8354b89718c45f0f207c2515c.png

This will complete that step and any other step that might not have been possible during the initial recovery.

clipboard_eba83b2b589334ae9c332989f654eeb9a.png
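The procedure in this section can be summarized as a small decision function. This is only an illustrative sketch of the steps described above, not something the SRA or SRM exposes; the function name, parameters, and step strings are hypothetical.

```python
def undo_pod_recovery_steps(undo_pod_exists, can_eradicate_manually):
    """Return the ordered steps described in this section for a source pod.

    undo_pod_exists: True if an undo pod currently exists for the source pod.
    can_eradicate_manually: False when, for example, safe mode is enabled.
    """
    if not undo_pod_exists:
        # Nothing blocks demotion of the source pod.
        return ["run recovery"]
    if can_eradicate_manually:
        return [
            "eradicate the undo pod",
            "re-run recovery (planned migration mode preferred)",
        ]
    # The undo pod cannot be removed yet, so the demotion step must be skipped.
    return [
        "re-run recovery in disaster recovery mode (source demotion fails and is skipped)",
        "wait until the undo pod is eradicated by the timer",
        "run recovery one last time to demote the source pod",
        "reprotect",
    ]


for step in undo_pod_recovery_steps(undo_pod_exists=True, can_eradicate_manually=False):
    print(step)
```

The last branch is the only one that leaves the source pod promoted for a while, which is why reprotection has to wait until the final recovery run.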

Target Pod is already Promoted

Operation Failed: Failed to sync data on replica devices. SRA command 'syncOnce' failed. The target pod <pod name> of the operation is already promoted. Make sure the target pod is demoted before running a planned migration or disaster recovery.
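When triaging a failed plan, the two error messages quoted in this article can be told apart by simple substring matching. The sketch below is a hypothetical helper, not part of the SRA; the matched phrases are taken from the messages shown in this article, and the category names are made up for illustration.

```python
def classify_recovery_error(message):
    """Map an SRM/SRA error message to one of the situations in this article."""
    text = message.lower()
    if "undo-demote pod has not been eradicated" in text:
        return "undo_pod_exists"
    if "is already promoted" in text:
        return "target_already_promoted"
    # Anything else is outside the scope of this sketch.
    return "unknown"


undo_error = (
    "Demoting the pod <pod name> failed. Error message from Purity: "
    "Cannot demote pod because the associated undo-demote pod has not been eradicated."
)
promoted_error = (
    "SRA command 'syncOnce' failed. The target pod <pod name> of the operation "
    "is already promoted."
)
print(classify_recovery_error(undo_error))      # undo_pod_exists
print(classify_recovery_error(promoted_error))  # target_already_promoted
```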

During a recovery, the source pod is demoted and the target pod is promoted. If the target pod is already promoted, the SRA needs to determine whether it promoted the pod itself (for example, a previous recovery got past the promotion step, then failed, and the recovery was restarted).

If a pod was promoted by the SRA, the volumes in it will have the puresra-failover tag. If the pod is promoted and its volumes do not have this tag, the SRA concludes that it did not promote the pod itself and throws the above error during the synchronization step.

clipboard_efc8fd7add6583adb94644d7afd94076a.png
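The check described above can be modeled roughly as follows. This is a sketch of the logic as this article describes it, not the SRA's actual code. Representing tags as a set of keys per volume, and requiring the tag on every volume, are assumptions made for illustration.

```python
FAILOVER_TAG = "puresra-failover"  # tag name as given in this article


def promoted_by_sra(volume_tags):
    """volume_tags maps each volume name in the pod to a set of tag keys.

    Assumption: the pod counts as SRA-promoted only if every volume has the tag.
    """
    return bool(volume_tags) and all(
        FAILOVER_TAG in tags for tags in volume_tags.values()
    )


def sync_step_outcome(target_promoted, volume_tags):
    """Return how the synchronization step would be expected to behave."""
    if not target_promoted:
        return "proceed"
    if promoted_by_sra(volume_tags):
        return "resume earlier recovery"
    return "error: target pod is already promoted"


tagged = {"pod1::vol1": {"puresra-failover"}, "pod1::vol2": {"puresra-failover"}}
untagged = {"pod1::vol1": set(), "pod1::vol2": set()}
print(sync_step_outcome(True, tagged))    # resume earlier recovery
print(sync_step_outcome(True, untagged))  # error: target pod is already promoted
```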

To recover from this, go to the target pod, demote it, and retry the recovery:

clipboard_eca3c8966803c40b87eb7f6d9a50c812f.png

clipboard_ea7e3b471f436322f95a303b69c11141c.png

There is one edge case worth noting, which occurs when all of the following are true:

  1. Safe mode is enabled (which means nothing can be manually eradicated)
  2. The target pod is already promoted (for testing reasons, for example)
  3. An undo pod still exists for the target pod

In this case, the target pod cannot be demoted manually until the undo pod is eradicated by the timer. If the time remaining on the timer would delay disaster recovery for too long, call Pure Storage Support.
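The edge case above can be expressed as a short decision sketch. The function name, parameters, and returned strings are hypothetical. The acceptable delay is a site-specific value you would choose yourself, and the branch for a manually eradicable undo pod is inferred from the rest of this article rather than stated in this section.

```python
from datetime import timedelta


def target_pod_action(safe_mode_enabled, target_promoted, undo_pod_exists,
                      timer_remaining, max_acceptable_delay):
    """Suggest the next action for an already-promoted target pod."""
    if not target_promoted:
        return "no demotion needed"
    if not undo_pod_exists:
        return "demote the target pod and retry the recovery"
    if not safe_mode_enabled:
        # Inferred: without safe mode the undo pod can be eradicated manually.
        return "eradicate the undo pod, demote the target pod, and retry the recovery"
    if timer_remaining > max_acceptable_delay:
        return "call Pure Storage Support"
    return "wait for the timer, then demote the target pod and retry the recovery"


print(target_pod_action(
    safe_mode_enabled=True,
    target_promoted=True,
    undo_pod_exists=True,
    timer_remaining=timedelta(hours=20),
    max_acceptable_delay=timedelta(hours=1),
))  # call Pure Storage Support
```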

Lastly, Pure Storage is working on improving the handling of scenarios such as this one, where safe mode is enabled.

Replication Link is Paused 

There is a known issue with the 4.0 release of the SRA with ActiveDR when the replication link between the source and target pods is paused.

clipboard_e9473b1fdfebfd85e903138bdceae5c07.png

The Synchronize Storage step will appear to hang at 100%:

clipboard_efa1a641d05f3e68d1d053ebe8f1fda5f.png

Go to either FlashArray in the ActiveDR pod pair, navigate to the pod, and in the Pod Replica Links box click the vertical ellipsis and choose Resume:

clipboard_e67e7d6f1a70391bc87e7710c2f47e679.png

Confirm the resume operation:

clipboard_edda048b6bb32a80076f7701a3477a3c1.png

The recovery will then complete once the pods are synchronized.