
FlashArray ActiveDR: MongoDB Disaster Recovery


Disaster Recovery

Pure Storage® Purity 6.0, the latest version of the operating environment for FlashArray™, delivered a major new replication feature called ActiveDR™. ActiveDR is continuous, asynchronous replication between FlashArrays. Host write operations are acknowledged as soon as they are safely persisted on the replication source array, without waiting for an acknowledgement from the replication target array. This approach removes the potential write performance penalty related to replication network latency in synchronous replication environments, such as ActiveCluster™, while delivering a near-zero recovery point objective (RPO). ActiveDR provides a simple and reliable disaster recovery capability. ActiveDR also replicates more than just volumes: corresponding snapshot histories, snapshot schedules, volume settings (such as QoS limits), protection groups, and user-defined volume tags are also copied to the target array.

MongoDB also includes a built-in data availability solution: the replica set. In a MongoDB replica set, data is asynchronously copied from the primary to one or more secondary nodes. With its built-in protection against a single node failure, a replica set can also be deployed as a disaster recovery solution. In this article we explore deploying MongoDB on FlashArray with ActiveDR as a disaster recovery solution and investigate the associated benefits.

Environment

A MongoDB replica set requires at least three nodes. A three-member replica set can survive a single node failure without impacting database operations; however, if another node fails, the entire replica set switches to read-only mode. If the primary node fails, MongoDB automatically forces an election, which requires a majority of nodes to be available. For situations where a production site with two nodes fails, either a third data center or a cloud-based MongoDB instance is recommended. If a third data center or cloud MongoDB node is not available, the entire replica set must be recovered on the surviving site. A typical three-node replica set deployed across two sites is depicted in Figure 1.

figure1.png

Figure 1.
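
For reference, below is a minimal sketch of initiating such a three-member replica set from the mongo shell; the host names (mongo-1 through mongo-3) and replica set name (rs0) are hypothetical examples, not values from this environment.

mongo --host mongo-1 --eval '
  rs.initiate({
    _id: "rs0",
    members: [
      { _id: 0, host: "mongo-1:27017" },
      { _id: 1, host: "mongo-2:27017" },
      { _id: 2, host: "mongo-3:27017" }
    ]
  })'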

A MongoDB replica set can also be deployed on FlashArray. The benefits of running MongoDB on FlashArray, such as disk space savings through data reduction, fast replica set expansion, and rapid node recovery, are described here. Additionally, with ActiveDR, a disaster recovery environment can be easily configured without the need for MongoDB nodes located in a third data center or in the cloud. With FlashArray ActiveDR, potential data loss and recovery time are reduced, and the single-command storage failover process enables quick operational recovery. If the production site fails, data on the disaster recovery site becomes available almost instantly by promoting the replica pod on the target FlashArray. With MongoDB hosts configured and connected to the volumes on the disaster recovery site, the disaster recovery database instance can be quickly started on the replica volumes and immediately made available for production workloads.

Replication Lag

MongoDB and FlashArray both utilize asynchronous replication; however, given the same network conditions, hardware-based (ActiveDR) replication can be more efficient. Based on testing in the Pure Storage laboratory, the replication lag (RPO) delivered by FlashArray during heavy write operations (50% writes) was up to 33% lower than the MongoDB replication lag under the same workload. For testing purposes, YCSB (Yahoo! Cloud Serving Benchmark) with workload A (50% reads, 50% writes) was utilized. This type of workload simulates applications such as live tracking systems. The database was built on a three-node replica set. The MongoDB replication lag was collected from the output of rs.printSlaveReplicationInfo() executed every 15 seconds. For FlashArray, purepod replica-link list --lag --historical <time> was used with a 30-second reporting frequency. The chart in Figure 2 illustrates the maximum replication lag reported by FlashArray and MongoDB. Lower replication lag means less potential data loss (recovery point objective) in the event of a catastrophic data center failure.

figure2.png

Figure 2.
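
As a rough sketch of how such measurements can be collected, the loop below polls the MongoDB lag every 15 seconds, while the FlashArray lag is read from the array CLI; the host name is hypothetical, and <time> is the same placeholder used by the purepod command above.

# Poll MongoDB replication lag every 15 seconds
while true; do
  mongo --host mongo-1 --quiet --eval 'rs.printSlaveReplicationInfo()'
  sleep 15
done

# Read historical replication lag from the FlashArray CLI
purepod replica-link list --lag --historical <time>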

ActiveDR MongoDB Recovery

MongoDB replica set deployment with FlashArray ActiveDR replication is shown in Figure 3.

figure3.png

Figure 3.

Disaster Site Preparation

To ensure fast and efficient MongoDB replica set recovery at the disaster site, it is recommended to provide at least three stand-by servers with the ActiveDR replicated volumes connected (not mounted). MongoDB binaries should be installed as well. Host names or IP addresses, depending on how the replica set was configured, must match the production site configuration. The MongoDB version should be the same on all hosts (production and disaster recovery). Furthermore, the mongod user ID and group membership should be the same on the production and disaster recovery servers to ensure that the mongod process has the proper file system permissions.
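
A quick way to verify this parity is to run the same checks on every production and disaster recovery host and compare the output; the commands below are standard Linux and MongoDB utilities.

# The mongod account (UID, GID, and group membership) should be identical across sites
id mongod

# The installed MongoDB server version should be identical across sites
mongod --version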

ActiveDR Setup

ActiveDR setup consists of the following tasks:

  • Connect the source and target arrays

  • On the source array: create the source pod and move the corresponding MongoDB FlashArray volumes into it

  • On the source array: create a replica link to the target FlashArray and specify a new target pod name. Specifying a new target pod name during replica link creation eliminates the need to connect to the target array for any of the ActiveDR setup steps.

The process of creating a replica link is described in the ActiveDR Quick Start Guide.
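
As a condensed sketch, the setup could look as follows, with hypothetical pod and volume names (mongodb-pod, mongodb-data-1); the exact replica-link flags may vary by Purity release, so the ActiveDR Quick Start Guide remains the authoritative reference.

# On the source array: create the pod and move the MongoDB volumes into it
purepod create mongodb-pod
purevol move mongodb-data-1 mongodb-pod

# Create the replica link, specifying a new pod name on the target array
# (flag syntax is an assumption; confirm in the ActiveDR Quick Start Guide)
purepod replica-link create --remote <target_array> --remote-pod mongodb-pod-dr mongodb-pod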

Testing Failover

FlashArray ActiveDR provides the means for non-disruptive testing of the recovery process. The test failover command for ActiveDR is the same as the actual failover command, and the test does not impact production replication or the target site (DR site) RPO.

To perform test failover and database recovery:

  1. Disaster recovery site (on target array): promote the target pod

From the FlashArray command line execute:

purepod promote <pod_name> 

From FlashArray Graphical User Interface select (see Figure 4):

Storage ➤ Pods  ➤ <pod_name>  ➤ '⋮' (vertical ellipsis)  ➤ Promote 


figure4.png

Figure 4.

2. Disaster recovery site: mount the volumes and execute the required tests
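
As an illustrative sketch of this step (device path and mount point are hypothetical), each stand-by server could mount its promoted volume, start mongod, and confirm the replica set state:

# Mount the promoted volume and start mongod
mount /dev/mapper/mongodb-data-1 /var/lib/mongo
systemctl start mongod

# Confirm the replica set comes up from the replicated data
mongo --quiet --eval 'rs.status().members.forEach(function(m) { print(m.name, m.stateStr); })'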

After test completion:

3. Disaster recovery site: stop the test application, unmount the volumes on the target array, and then demote the target pod

From the FlashArray command line execute: 

purepod demote <pod_name>

From the FlashArray Graphical User Interface select (see Figure 5)

Storage ➤ Pods  ➤ <pod_name>  ➤ '⋮' (vertical ellipsis)  ➤ Demote

figure5.png

Figure 5.

Following the demotion of a pod, any data changed or written to the pod will be saved in a <pod_name>.undo-demote pod in the Destroyed and Undo Pods panel with a 24-hour eradication timer. When testing is finished and the pod on the target array is demoted, any content changes received by the target array from the production site FlashArray during testing will be applied to the target pod. Customers also have the option of cloning the <pod_name>.undo-demote pod in case data modified during the test period needs to be preserved.
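
A minimal sketch of such a clone is shown below; the clone name is hypothetical, and the purepod clone syntax should be confirmed for the Purity release in use.

purepod clone <pod_name>.undo-demote <pod_name>-testdata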


Failover (Production Site Failure)

When the production site becomes unavailable, the failover process must be manually initiated. The pod on the disaster recovery site should be promoted to allow writes. On the disaster recovery site (target array):

  1. Promote the target pod

From the FlashArray command line execute:

purepod promote <pod_name> 

From FlashArray Graphical User Interface select (see Figure 6):

Storage ➤ Pods  ➤ <pod_name>  ➤ '⋮' (vertical ellipsis)  ➤ Promote

figure6.png

Figure 6.

Even after the pod promotion, the pod's replica link status will remain “unhealthy” and the “Lag” will keep increasing. Once the production site is restored, the disaster recovery site can be either re-protected or failed back.

2. Disaster recovery site: mount the volumes and start MongoDB instances
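
This step mirrors the test failover; a sketch with hypothetical device and mount paths:

# On each disaster recovery host
mount /dev/mapper/mongodb-data-1 /var/lib/mongo
systemctl start mongod

# Verify that a primary has been elected and the replica set accepts writes
mongo --quiet --eval 'printjson(db.isMaster().ismaster)'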

Reprotecting After Failover

Reprotecting the original disaster recovery site reverses the direction of replication. Once the production site has been restored, the original source pod should be demoted. To reprotect the MongoDB instance now running at the target site (disaster recovery site):

  1. Original replication source site (production site): power on and validate the health of the failed production FlashArray

  2. Original production site: demote the pod on the restored FlashArray. Whether the demotion is performed using the FlashArray GUI or the command line, the --skip-quiesce option should be selected so that stale data on the failed source array is not unnecessarily replicated prior to completing the demote.

From command line:

purepod demote --skip-quiesce <pod_name>

From FlashArray GUI select (see Figure 7):

Storage ➤ Pods ➤ <pod> ➤ '⋮' (vertical ellipsis) ➤ Demote ➤ Skip Quiesce ➤ Demote

figure7.png

Figure 7.

Following the completed demotion of the original source pod, ActiveDR automatically reverses the direction of replication. New content created by MongoDB running at the recovery site is now replicated back to the original production site.
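
The reversed direction can be confirmed from the replica-link listing on either array, where the local and remote pod roles should now appear swapped relative to the original configuration:

purepod replica-link list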

Failback After Failover (Planned Failover)

Failback returns the replication relationship between the source and target pods to the original direction (production to disaster recovery). Before failing back, ensure that the MongoDB instances on hosts connected to the disaster recovery site are quiesced, then demote the disaster recovery site pod (the original target pod) with the --quiesce option to ensure that all data is replicated to the original production site before the pod is marked demoted, making the volumes within the pod read-only. Once the demote process is complete and the pod on the disaster recovery site shows a 'demoted' status, the pod on the original source site (production site) can be promoted.
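
One way to quiesce the MongoDB instances cleanly before unmounting, sketched with standard MongoDB and Linux commands, is a clean shutdown on each disaster recovery host:

# Cleanly shut down mongod (connects to the local instance)
mongo admin --eval 'db.shutdownServer()'

# Or, where mongod runs under systemd:
systemctl stop mongod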

To failback after failover:

  1. Production site (original): Restore FlashArray

  2. Disaster recovery site (currently running the disaster recovery applications): quiesce the disaster recovery applications (stop MongoDB) and unmount the target (disaster recovery) pod volumes

  3. Disaster recovery site: demote the pod using the --quiesce option

From the command line:

purepod demote --quiesce <pod_name>

From the FlashArray GUI (see Figure 8):

Storage ➤ Pods ➤ <pod> ➤ '⋮' (vertical ellipsis)  ➤ Demote ➤ Quiesce ➤ Demote

figure8.png

Figure 8.

4. Disaster recovery site: monitor the replica-link status; a quiesced replica-link state indicates that all data writes from the disaster recovery site have been replicated to the original production site. To check the replica-link status from the command line execute:

purepod replica-link list

From FlashArray GUI select Protection ➤ ActiveDR (see Figure 9)

figure9.png

Figure 9.

Summary

ActiveDR is well suited for customers seeking a highly reliable disaster recovery solution with a near-zero recovery point objective. Furthermore, ActiveDR eliminates the need for a third data center or a cloud-based MongoDB node. By combining the excellent protection against host failure provided by a MongoDB replica set with ActiveDR's fast, easy-to-configure-and-manage replication, customers can ensure database availability on the disaster recovery site with minimal RPO and RTO. Additionally, ActiveDR, as an integral part of the Purity operating system, is included with all FlashArrays that support Purity 6 and higher, and no additional license fees are required.

Appendix A

Test Environment and Description

MongoDB Version: 4.2.6

Replica Set Size: 3 nodes

FlashArray Model: FlashArray//X90

Purity Version: 6.0

Test software:

YCSB version: 0.17.0

YCSB workload A (50/50): 

recordcount = 500,000,000

operationcount = 100,000,000

Throughput = 22,345 ops/sec
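
For reference, a typical invocation of this workload might look as follows; the connection URL, database name, and thread count are hypothetical, while recordcount and operationcount match the values above.

# Load the dataset, then run workload A against the replica set
bin/ycsb load mongodb -s -P workloads/workloada \
    -p mongodb.url="mongodb://mongo-1,mongo-2,mongo-3/ycsb?replicaSet=rs0" \
    -p recordcount=500000000

bin/ycsb run mongodb -s -P workloads/workloada \
    -p mongodb.url="mongodb://mongo-1,mongo-2,mongo-3/ycsb?replicaSet=rs0" \
    -p operationcount=100000000 -threads 64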