Troubleshooting ActiveDR Recovery Errors in Site Recovery Manager
The recovery process in Site Recovery Manager (SRM) is fairly simple from an SRA operation perspective, but certain situations can cause a recovery to fail. This KB details the situations that can cause an error or failure during an ActiveDR recovery in SRM. Pure Storage will continue to work to identify ways that the SRA (or the FlashArray itself) can handle these situations automatically. If and when that happens, this article will be updated with new guidance.
Undo Pod Exists for Source Pod
Failed to demote source consistency group 'e184134e-5a94-9f76-5c53-ae33e6df93d5'. SRA command 'prepareFailover' failed for consistency group 'e184134e-5a94-9f76-5c53-ae33e6df93d5'. Demoting the pod <pod name> failed. Error message from Purity: Message from Purity='ctx:<pod name>,msg:Cannot demote pod because the associated undo-demote pod has not been eradicated.' Message from Purity='ctx:PATCH,msg:https://flasharray-m20-2.purecloud.c..._quiesce=False'. Please verify that the source pod is setup correctly.
One of the first steps of a recovery operation is to demote the source pod. When a pod is demoted, an "undo" pod is created to store the state of the pod at the time of demotion. This undo pod can be eradicated manually; otherwise it is eradicated automatically once the eradication timer expires (24 hours by default). For a given pod, only one undo pod can exist at a time (though this might change in the future). If an undo pod exists for a pod, that pod cannot be demoted again until the undo pod is eradicated.
So if the source pod has an undo pod when a recovery is attempted, the plan fails quickly:
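The rule above can be sketched as a small model. This is an illustration only: the `Pod` and `UndoPod` classes and the `can_demote` and `demote` functions are hypothetical names invented here, not part of Purity, the SRA, or any Pure Storage SDK. The 24-hour default is taken from this article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class UndoPod:
    """Hypothetical stand-in for the undo pod created by a demotion."""
    created: datetime
    eradication_delay: timedelta = timedelta(hours=24)  # default per this article

    def eradicated_by(self) -> datetime:
        # Time at which the timer eradicates the undo pod automatically.
        return self.created + self.eradication_delay


@dataclass
class Pod:
    """Hypothetical pod record; only one undo pod can exist at a time."""
    name: str
    undo_pod: Optional[UndoPod] = None


def can_demote(pod: Pod) -> bool:
    # A pod can be demoted only when no undo pod exists for it.
    return pod.undo_pod is None


def demote(pod: Pod, now: datetime) -> None:
    # Mirrors the failure described above: demotion is refused while
    # an earlier undo pod has not been eradicated.
    if not can_demote(pod):
        raise RuntimeError(
            f"Cannot demote pod {pod.name}: undo pod has not been eradicated"
        )
    pod.undo_pod = UndoPod(created=now)
```

In this model a second call to `demote` on the same pod raises until `pod.undo_pod` is cleared, which corresponds to the undo pod being eradicated manually or by the timer.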
If this is the case, identify the undo pod and eradicate it:
Now re-run the recovery, preferably in planned migration mode (or in disaster recovery mode if that is necessary for some other reason).
If you cannot eradicate it (for instance, safe mode is enabled and objects cannot be manually eradicated), you must re-run the recovery plan in disaster recovery mode:
When the plan is run in disaster recovery mode and the source cannot be demoted, that step still fails, but the error is skipped and the source pod is left undemoted.
Note that this allows the recovery to proceed, but the target side cannot be reprotected until the undo pod is gone. As soon as the undo pod is eradicated, run the recovery one last time to complete the process (demotion of the source pod), which also enables replication to restart from the source (previously the target) pod.
Confirm the recovery:
This will complete that step and any other step that might not have been possible during the initial recovery.
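The choices in this section can be summarized as a small decision function. This is a sketch of the procedure described above, not SRA code; the function name `recovery_steps` and the step strings are made up for illustration.

```python
from typing import List


def recovery_steps(undo_pod_exists: bool, can_eradicate: bool) -> List[str]:
    """Return the ordered actions this article describes for each case."""
    if not undo_pod_exists:
        # Nothing blocks demotion of the source pod.
        return ["run recovery (planned migration)"]
    if can_eradicate:
        # Manual eradication is allowed, so clear the undo pod first.
        return ["eradicate undo pod", "re-run recovery (planned migration)"]
    # For instance, safe mode is enabled: skip the demotion error now,
    # then finish the demotion once the timer has removed the undo pod.
    return [
        "re-run recovery (disaster recovery)",
        "wait for undo pod eradication",
        "run recovery again to demote the source pod",
    ]
```

For example, `recovery_steps(True, False)` returns the three-step disaster recovery path, where the final run completes the demotion that was skipped.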
Target Pod is already Promoted
Operation Failed: Failed to sync data on replica devices. SRA command 'syncOnce' failed. The target pod <pod name> of the operation is already promoted. Make sure the target pod is demoted before running a planned migration or disaster recovery.
During a recovery, the source pod is demoted and the target pod is promoted. If the target pod is already promoted, the SRA needs to determine whether the SRA itself promoted it, which happens when an earlier recovery got past the promotion step, then failed, and was restarted.
If a pod was promoted by the SRA, the volumes in it will have the puresra-failover tag:
If the pod is promoted and its volumes do not have this tag, the SRA concludes that it did not promote the pod and throws the above error during the synchronization step.
To recover from this, go to the target pod, demote it, and retry the recovery:
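The check described above can be approximated as follows. This is a sketch, not the SRA's actual implementation. Only the tag name `puresra-failover` comes from this article. The function names, the representation of tags as a set of names per volume, and the choice to require the tag on every volume are assumptions made for illustration.

```python
from typing import Dict, Set

FAILOVER_TAG = "puresra-failover"  # tag name taken from this article


def promoted_by_sra(volume_tags: Dict[str, Set[str]]) -> bool:
    """Assumption: the SRA promoted the pod if every volume carries the tag."""
    return bool(volume_tags) and all(
        FAILOVER_TAG in tags for tags in volume_tags.values()
    )


def sync_precheck(target_promoted: bool, volume_tags: Dict[str, Set[str]]) -> str:
    if not target_promoted:
        return "proceed"  # normal case: target pod is still demoted
    if promoted_by_sra(volume_tags):
        return "resume"   # an earlier recovery got past the promotion step
    return "error"        # promoted outside the SRA: demote it and retry
```

For example, `sync_precheck(True, {"vol1": set()})` returns `"error"`, which corresponds to the failure in the synchronization step shown above.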
There is one edge case worth noting, which arises when all of the following are true:
- Safe mode is enabled (which means nothing can be manually eradicated)
- The target pod is already promoted, for testing reasons etc.
- There is an undo pod still in existence for the target pod
In that case, the target pod cannot be demoted manually until the undo pod is eradicated by the timer. If the time left on the timer would delay disaster recovery for too long, call Pure Storage Support.
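The edge case can be written out as a decision function. This is a sketch under stated assumptions: the function name, its parameters, and the `max_delay` threshold (how long you are willing to wait before recovering) are hypothetical, and the returned strings are plain descriptions rather than commands.

```python
from datetime import timedelta


def target_demote_action(
    safe_mode: bool,
    target_promoted: bool,
    undo_pod_exists: bool,
    time_left: timedelta,
    max_delay: timedelta,
) -> str:
    """Pick the next action for a target pod before retrying the recovery."""
    if not target_promoted:
        return "no demotion needed"
    if not undo_pod_exists:
        return "demote target pod"
    if not safe_mode:
        # The undo pod can be eradicated by hand, which unblocks demotion.
        return "eradicate undo pod, then demote target pod"
    # Safe mode: only the eradication timer can remove the undo pod.
    if time_left <= max_delay:
        return "wait for timer, then demote target pod"
    return "call Pure Storage Support"
```

With safe mode enabled, a promoted target, 20 hours left on the timer, and a tolerable delay of 1 hour, the function returns `"call Pure Storage Support"`.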
Lastly, Pure Storage is working on improving the handling of scenarios such as this where safe mode is enabled.
Replication Link is Paused
There is a current known issue with the 4.0 release of the SRA with ActiveDR when the replication link is paused between the source and target pod.
The Synchronize Storage step will appear to hang at 100%:
Go to either FlashArray in the ActiveDR pod pair, navigate to the pod, and in the Pod Replica Links box click the vertical ellipsis and choose Resume:
Confirm the resume operation:
The recovery will then complete once synchronized.
Consistency Group Warning in the Recovery Plan History Report
When you run a recovery (or test recovery) in Site Recovery Manager with ActiveDR, a warning similar to the following appears in the report:
Warning: | Expected consistency group '4eca7aff-f044-58a9-6dc7-f084e14d776c' not found in SRA's 'queryReplicationSettings' response. |
While this warning is innocuous, it can be concerning to see. It appears because an SRA feature called queryReplicationSettings is not implemented for ActiveDR operations. ActiveDR leverages the SRM consistency group feature to ensure that all volumes in the same pod are grouped. During a recovery operation, SRM issues queryReplicationSettings and expects consistency groups to be returned. Because the SRA rejects the call, SRM does not find the consistency group and reports the warning above; the missing consistency group is not the underlying problem but a consequence of the SRA not responding to this API. Since this is not a critical error, SRM raises a warning and does not terminate the plan operation.
Support for this API will be added in an upcoming release of the SRA so that this warning does not appear.
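Until then, if you review history reports in bulk, it may help to separate this known warning from anything unexpected. The sketch below assumes the report has been exported to plain text with one message per line, which is an assumption about your workflow rather than a documented SRM format. The pattern is built from the warning text quoted above, and the names `CG_WARNING` and `split_warnings` are made up for this example.

```python
import re
from typing import Iterable, List, Tuple

# Pattern derived from the warning text quoted in this article.
CG_WARNING = re.compile(
    r"Expected consistency group '([0-9a-f-]+)' not found in "
    r"SRA's 'queryReplicationSettings' response\."
)


def split_warnings(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate the known innocuous warning from all other lines."""
    known: List[str] = []
    other: List[str] = []
    for line in lines:
        (known if CG_WARNING.search(line) else other).append(line)
    return known, other
```

Lines in the second list are the ones that still deserve attention; the first list holds only occurrences of the consistency group warning, and `CG_WARNING.search(line).group(1)` yields the consistency group ID for each.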