Business critical applications and the virtual machines hosting them often need the highest possible resiliency to ensure that business operations do not stop in the case of a disaster—either localized or site-wide.
To ensure this, the data in use by those applications needs to be spread across two arrays, often in more than one geographic location; importantly, this data needs to be available in both sites at the same time. To achieve this, some arrays offer synchronous replication that provides the ability to write to the same block storage volume simultaneously from both sites while maintaining write order. This is traditionally called Active-Active replication.
The Pure Storage FlashArray introduced Active-Active replication, called ActiveCluster, in the Purity 5.0.0 release.
In VMware vSphere environments, a common use case for Active-Active replication is to pair it with the VMware vSphere High Availability offering. Together, this solution is called VMware vSphere Metro Storage Cluster (vMSC). The combination of these features provides the best possible Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for vSphere environments.
This paper overviews configuration and best practices for using the Pure Storage FlashArray ActiveCluster feature with the vSphere Metro Storage Cluster solution.
For specific best practices for individual features, please refer to Pure Storage or VMware documentation depending on the circumstance.
This paper is intended for storage, VMware, network or other administrators who plan to implement Pure Storage FlashArray ActiveCluster with VMware vSphere Metro Storage Cluster.
Before setting up ActiveCluster and VMware vSphere Metro Storage Cluster, it is important to read the ActiveCluster-specific documentation and VMware’s own documentation:
- VMware vSphere Metro Storage Cluster Recommended Practices
- ActiveCluster Best Practices and Requirements
- ActiveCluster Solutions Overview
Ensure that the above stated best practices have been followed. This document assumes that the aforementioned documentation has been read prior to reading this document.
After January 2016, VMware retired the vMSC solution program and replaced it with the Partner Verified and Supported Products (PVSP) listing.
Looking for Metro Storage Cluster (vMSC) solutions listed under PVSP?
- vMSC was EOLed in late 2015. You can find more information about vMSC EOL in this KB article.
- vMSC solution listing under PVSP can be found on our Partner Verified and Supported Products listing.
VMware KB on the program change is here.
Solution Overview and Introduction
The most robust, resilient, and automated solution for critical data protection and availability combines three technologies:
- VMware vSphere High Availability: VMware’s vCenter feature for automatically restarting virtual machines on another VMware ESXi host after a service interruption.
- Pure Storage FlashArray ActiveCluster: a simple, built-in, active-active synchronous storage replication solution for FlashArray block storage.
- VMware vSphere Metro Storage Cluster: a VMware vCenter solution combining active-active storage and ESXi servers spread across geographic areas in a single vCenter cluster.
VMware vSphere High Availability
VMware High Availability (HA) is a technology that provides cluster-based monitoring of virtual machines running on included ESXi hosts. If a failure of the storage, network, or host occurs, the remaining ESXi hosts coordinate to restart the affected virtual machines on unaffected hosts. Through a variety of network and storage-based heartbeating, the ESXi hosts can detect and respond to a variety of failure or isolation events to provide the fastest automated recovery of virtual machines.
VMware HA offers a solution to protect applications running inside virtual machines that do not offer application-based high-availability or application cluster configurations. VMware HA, however, does not preclude the use of those features if the applications offer them. They can actually offer additional benefits on top of one another, though a detailed discussion on that topic is beyond the scope of this paper.
VMware HA is made possible via shared storage. Without shared storage, VMs and their data cannot be seen by surviving hosts, and therefore a disaster restart operation is not an option. For this reason, it is a general best practice to provision storage identically and simultaneously to all hosts in a cluster, so that any host in the cluster can run any virtual machine in the cluster.
VMware vSphere Metro Storage Cluster
VMware vSphere Metro Storage Cluster (vMSC) is a feature that extends VMware HA with active-active stretched storage. VMware HA, as stated in the previous section, requires shared storage to be presented to all hosts in a cluster to enable restart of virtual machines in the event of a disaster. Currently, VMware only supports VMFS for vMSC.
In many scenarios, a cluster might have hosts in two entirely separate datacenters or geographical locations. For VMware HA to restart a VM on a host in the cluster that is in a different datacenter, that host must see that storage too. There are a few options to achieve this.
Scenario 1: Two-site host cluster, single-site storage
It is possible to cross-connect a storage array in one datacenter to hosts in its own datacenter and in a second datacenter. In this configuration, storage is provided by only one site. If the power goes out in the site that houses the array (site B in this example), no hosts will have access to the storage. Consequently, a single-site storage configuration provides little additional resiliency.
Figure 1. Scenario 1: Stretched host cluster with non-stretched/single-site storage
Scenario 2: Two-site host cluster, single-site storage in a third site
Another option is to put an array in a third site and give the hosts in the two other sites access to that storage. While this design has several drawbacks, the main one is that it does not protect against an array failure or a network partition of site C. Either of these failures will cause all hosts to lose access.
Figure 2. Scenario 2: Stretched host cluster with non-stretched/single-site storage in third site
Scenario 3: Two-site host cluster, dual-site storage
The third option is stretched storage. In this situation, a storage array is in site A and another array is in site B. Both sites have hosts that can see one or both arrays. One or more storage volumes are created on one array and an exact synchronous copy is created on the second array. All writes are synchronously mirrored between the arrays, so the volume appears identical in both sites and can be read from and written to simultaneously. If hosts in datacenter A fail (or the storage, or the entire datacenter), hosts in datacenter B can take over their workloads by using the array in datacenter B. This is the scenario that will be discussed in this paper.
Figure 3. Scenario 3: Stretched host cluster with stretched/dual-site storage
The combination of stretched storage and stretched ESXi clusters provides an extremely high level of resiliency to an infrastructure. When the automatic failover and recovery features offered by VMware vSphere HA are enabled on top of these two topologies, RTO is reduced even further, as VMware can quickly and intelligently respond to host, storage, or site failures to bring virtual machines back online.
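The difference between the three scenarios can be checked with a small model. The sketch below is illustrative only: the `Topology` class, the site labels, and the failure model (a failed site takes its hosts and its array down together, and every surviving host can reach every surviving array) are assumptions made for this example and are not part of any VMware or Pure Storage tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    """Hypothetical model: which sites hold hosts and which hold arrays."""
    name: str
    host_sites: frozenset
    array_sites: frozenset

    def surviving_hosts_with_storage(self, failed_site: str) -> set:
        """Host sites that are still up and can still reach at least one array."""
        arrays_up = self.array_sites - {failed_site}
        hosts_up = self.host_sites - {failed_site}
        # Without any surviving array, no surviving host has storage.
        return set(hosts_up) if arrays_up else set()


scenarios = [
    Topology("Scenario 1", frozenset({"A", "B"}), frozenset({"B"})),
    Topology("Scenario 2", frozenset({"A", "B"}), frozenset({"C"})),
    Topology("Scenario 3", frozenset({"A", "B"}), frozenset({"A", "B"})),
]

for topo in scenarios:
    sites = sorted(topo.host_sites | topo.array_sites)
    # Sites whose failure leaves no host anywhere with access to storage.
    fatal = [s for s in sites if not topo.surviving_hosts_with_storage(s)]
    print(f"{topo.name}: total outage if site fails -> {fatal or 'none'}")
```

Under these assumptions, only the dual-site storage layout has no single site whose loss removes storage access for every host.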
Pure Storage® Purity ActiveCluster is a fully symmetric active/active bidirectional replication solution that provides synchronous replication for RPO zero and automatic transparent failover for RTO zero. ActiveCluster spans multiple sites enabling clustered arrays and clustered hosts to be used to deploy flexible active/active datacenter configurations.
- Synchronous Replication - Writes are synchronized between arrays and protected in non-volatile RAM (NVRAM) on both arrays before being acknowledged to the host.
- Symmetric Active/Active - Read and write to the same volumes at either side of the mirror, with optional host-to-array site awareness.
- Transparent Failover - Automatic, non-disruptive failover between synchronously replicating arrays and sites, with automatic resynchronization and recovery.
- Async Replication Integration - Uses asynchronous replication for baseline copies and resynchronization. Convert asynchronous relationships to synchronous without resending data. Replicate data asynchronously to a third site for DR.
- No Bolt-ons & No Licenses - No additional hardware and no costly software licenses are required; just upgrade the Purity Operating Environment and go active/active.
- Simple Management - Perform data management operations from either side of the mirror: provision storage, connect hosts, create snapshots, create clones.
- Integrated Pure1® Cloud Mediator - Automatically configured passive mediator that allows transparent failover and prevents split-brain, without the need to deploy and manage another component.
Purity ActiveCluster is composed of three core components: The Pure1 Mediator, active/active clustered array pairs, and stretched storage containers.
- The Pure1 Cloud Mediator - A required component of the solution that is used to determine which array will continue data services should an outage occur in the environment.
- Active/Active Clustered FlashArrays - Utilize synchronous replication to maintain a copy of data on each array and present those as one consistent copy to hosts that are attached to either, or both, arrays.
- Stretched Storage Containers - Management containers that collect storage objects such as volumes into groups that are stretched between two arrays.
ActiveCluster introduces a new management object: Pods. A pod is a stretched storage container that defines a set of objects that are synchronously replicated together and which arrays they are replicated between. An array can support multiple pods. Pods can exist on just one array or on two arrays simultaneously with synchronous replication. Pods that are synchronously replicated between two arrays are said to be stretched between arrays.
Pods can contain volumes, protection groups (for snapshot scheduling and asynchronous replication) and other configuration information such as which volumes are connected to which hosts. The pod acts as a consistency group, ensuring that multiple volumes within the same pod remain write order consistent.
Pods also provide volume namespaces; that is, different volumes may have the same volume name if they are in different pods. In the image above, the volumes in pod3 and pod4 are different volumes than those in pod1, a stretched active/active pod. This allows migration of workloads between arrays, or consolidation of workloads from multiple arrays to one, without volume name conflicts.
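The namespace behavior described above can be sketched in a few lines. This is a minimal illustration, not the Purity object model: the `Pod` class, its methods, the array names, and the `pod::volume` qualified-name format are assumptions chosen for the example.

```python
class Pod:
    """Hypothetical model of a pod as a volume namespace."""

    def __init__(self, name: str, arrays):
        self.name = name
        self.arrays = set(arrays)  # one array = local pod, two arrays = stretched pod
        self.volumes = set()

    @property
    def stretched(self) -> bool:
        return len(self.arrays) == 2

    def create_volume(self, volume_name: str) -> str:
        # Names only need to be unique within a single pod.
        if volume_name in self.volumes:
            raise ValueError(f"{volume_name!r} already exists in pod {self.name!r}")
        self.volumes.add(volume_name)
        # The "pod::volume" qualified form is an assumption made for this sketch.
        return f"{self.name}::{volume_name}"


pod1 = Pod("pod1", ["arrayA", "arrayB"])  # stretched between two arrays
pod3 = Pod("pod3", ["arrayA"])            # exists on one array only

# The same short name in two pods yields two distinct volumes.
print(pod1.create_volume("vol1"), pod3.create_volume("vol1"))
print(pod1.stretched, pod3.stretched)
```

Because the pod name qualifies each volume, consolidating pod3 onto an array that already hosts pod1 would not produce a name collision in this model.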
Transparent failover between arrays in ActiveCluster is automatic and requires no intervention from the storage administrator. Failover occurs within standard host I/O timeouts similar to the way failover occurs between two controllers in one array during non-disruptive hardware or software upgrades.
ActiveCluster is designed to provide maximum availability across symmetric active/active storage arrays while preventing a split-brain condition from occurring; split-brain is the case where two arrays might serve I/O to the same volume, without keeping the data in sync between the two arrays.
Any active/active synchronous replication solution designed to provide continuous availability across two different sites requires a component referred to as a witness or voter to mediate failovers while preventing split-brain. ActiveCluster includes a simple to use, lightweight, and automatic way for applications to transparently failover, or simply move, between sites in the event of a failure without user intervention: The Pure1 Cloud Mediator.
The Pure1 Cloud Mediator is responsible for ensuring that only one array is allowed to stay active for each pod when there is a loss of communication between the arrays.
In the event that the arrays can no longer communicate with each other over the replication interconnect, both arrays will pause I/O and reach out to the mediator to determine which array can stay active for each sync replicated pod. This is called a race to the mediator. The first array to reach the mediator is allowed to keep its synchronously replicated pods online. The second array to reach the mediator must stop servicing I/O to its synchronously replicated volumes, in order to prevent split brain. The entire operation occurs within standard host I/O timeouts to ensure that applications experience no more than a pause and resume of I/O.
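The race to the mediator is, in essence, an atomic first-writer-wins decision per pod. The following sketch models that idea under stated assumptions: the `Mediator` class, the `request_to_stay_online` method, and the array and pod names are hypothetical and do not represent the Pure1 Cloud Mediator's actual interface or protocol.

```python
import threading


class Mediator:
    """Hypothetical stand-in for a mediator: the first array to ask wins each pod."""

    def __init__(self):
        self._lock = threading.Lock()
        self._winner = {}  # pod name -> name of the array allowed to stay online

    def request_to_stay_online(self, pod: str, array: str) -> bool:
        # Atomic check-and-set, so exactly one array can win for a given pod.
        with self._lock:
            winner = self._winner.setdefault(pod, array)
            return winner == array


def on_replication_link_loss(mediator, pod, array, results):
    # 1. The array pauses I/O for the pod (not modeled here).
    # 2. It races to the mediator.
    # 3. The winner resumes I/O; the other array keeps the pod offline.
    won = mediator.request_to_stay_online(pod, array)
    results[array] = "online" if won else "offline"


mediator = Mediator()
results = {}
threads = [
    threading.Thread(target=on_replication_link_loss,
                     args=(mediator, "pod1", name, results))
    for name in ("arrayA", "arrayB")
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # exactly one array reports "online"; which one depends on timing
```

Whichever thread reaches the lock first wins, which mirrors the property the text relies on: at most one array continues to serve a given pod, so split-brain cannot occur.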
The Pure1® Cloud Mediator
A failover mediator must be located in a third site that is in a separate failure domain from either site where the arrays are located. Each array site must have independent network connectivity to the mediator such that a single network outage does not prevent both arrays from accessing the mediator. A mediator should also be a lightweight, easy-to-administer component of the solution. The Pure Storage solution provides this automatically by utilizing an integrated cloud-based mediator. The Pure1 Cloud Mediator provides two main functions:
- Prevent a split-brain condition from occurring where both arrays are independently allowing access to data without synchronization between arrays.
- Determine which array will continue to service IO to synchronously replicated volumes in the event of an array failure, replication link outage, or site outage.
The Pure1 Cloud Mediator has the following advantages over a typical heavy-handed, non-Pure voter or witness component:
- SaaS operational benefits - As with any SaaS solution, the operational maintenance complexity is removed: nothing to install onsite, no hardware or software to maintain, nothing to configure and support for HA, no security patch updates and more.
- Automatically a 3rd site - The Pure1 Cloud Mediator is inherently in a separate failure domain from either of the two arrays.
- Automatic configuration - Arrays configured for synchronous replication will automatically connect to and use the Pure1 Cloud Mediator.
- No misconfiguration - With automatic and default configuration there is no risk that the mediator could be incorrectly configured.
- No human intervention - A significant number of issues in non-Pure active/active synchronous replication solutions, particularly those related to accidental split-brain, are related to human error. Pure’s automated non-human mediator eliminates operator error from the equation.
- Passive mediation - Continuous access to the mediator is not required for normal operations. The arrays maintain a heartbeat with the mediator; however, if the arrays lose connection to the mediator, they will continue to synchronously replicate and serve data as long as the replication link is active.
On-Premises Failover Mediator
Failover mediation for ActiveCluster can also be provided using an on-premises mediator distributed as an OVF file and deployed as a VM. Failover behaviors are exactly the same as described above. The on-premises mediator simply replaces the role of the Pure1 Cloud Mediator during failover events.
The on-premises mediator has the following basic requirements:
- The on-premises mediator can only be deployed as a VM on virtualized hardware. It is not installable as a stand-alone application.
- High Availability for the mediator must be provided by the hosts on which the mediator is deployed, for example by using VMware HA or Microsoft Hyper-V HA Clustering.
- Storage for the mediator must not allow the configuration of the mediator to be rolled back to previous versions. This applies to situations such as storage snapshot restores, or cases where the mediator might be stored on mirrored storage.
- The arrays must be configured to use the on-premises mediator rather than the Pure1 Cloud Mediator.
- The mediator must be deployed in a third site, in a separate failure domain that will not be affected by any failures in either of the sites where the arrays are installed.
- Both array sites must have independent network connections to the mediator such that a failure of one network connection does not prevent both arrays from accessing the mediator.
Pre-Election is an ActiveCluster feature that automatically engages when both FlashArrays lose connectivity to the Mediator to ensure stretched pod volumes will remain online should a subsequent loss of replication links occur.
What are the requirements of Pre-Election?
Enabled by default in Purity 5.3.x and higher.
What does Pre-Election do?
After both FlashArrays detect that the Mediator is unavailable, Pre-Election will:
- ensure the pre-elected array in the pod will stay online if the replication network fails
- ensure the pre-elected array in the pod will stay online if the non-elected array fails
- designate the winning FlashArray for each pod based on the pod failover preference, if set; otherwise it picks a FlashArray
After one or both FlashArrays detect that the Mediator is available, Pre-Election will:
- disengage and return to standard ActiveCluster transparent failover behavior
What does Pre-Election not do?
Pre-Election does not:
- override normal behavior: if one array still has mediator access, it must race to the mediator to stay online.
- provide a hard failover preference.
- force the non-elected side online if the pre-elected array fails.
- keep the pod online if both the mediator network and replication network fail at the same time.
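The Pre-Election rules above can be summarized as a small decision function. This is a sketch of the behavior as described in this section, not Purity code: the function names are hypothetical, and choosing the alphabetically first array when no failover preference is set is an arbitrary placeholder for the unspecified "picks a FlashArray" step.

```python
from typing import Optional


def pre_elect(arrays, failover_preference: Optional[str] = None) -> str:
    """Choose the pre-elected array once both arrays have lost the mediator."""
    if failover_preference in arrays:
        return failover_preference
    # Placeholder choice; the text does not say how the array is picked.
    return sorted(arrays)[0]


def online_after_second_failure(pre_elected: Optional[str],
                                failed_array: Optional[str] = None) -> set:
    """Arrays that keep the pod online after a further failure with no mediator.

    pre_elected is None when Pre-Election never engaged, for example when the
    mediator network and the replication network failed at the same time.
    failed_array is None when the second failure is a replication link loss.
    """
    if pre_elected is None:
        return set()  # no mediator and no pre-election: the pod goes offline
    if failed_array == pre_elected:
        return set()  # the non-elected side is not forced online
    return {pre_elected}


arrays = {"arrayA", "arrayB"}
elected = pre_elect(arrays, failover_preference="arrayB")
print(elected)
print(online_after_second_failure(elected))                         # replication link lost
print(online_after_second_failure(elected, failed_array="arrayA"))  # non-elected array fails
print(online_after_second_failure(elected, failed_array="arrayB"))  # pre-elected array fails
```

The last call returns an empty set, which reflects the point that Pre-Election does not provide a hard failover preference: if the pre-elected array itself fails, the pod is not kept online by the other side.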
Uniform and Non-Uniform Access
A host can be configured to see just the array that is local to it, or to see both arrays. The former option is called non-uniform vMSC; the latter is referred to as uniform vMSC.
A uniform storage access model can be used in environments where there is host-to-array connectivity of either FC or ethernet (for iSCSI), and array-to-array ethernet connectivity, between the two sites. When deployed in this way, a host has access to the same volume through both the local array and the remote array. The solution supports connecting arrays with up to 11ms of round trip time (RTT) latency between the arrays.
The image above represents the logical paths that exist between the hosts and arrays, and the replication connection between the two arrays in a uniform access model. Because a uniform storage access model allows all hosts, regardless of site location, to access both arrays, there will be paths with different latency characteristics. Paths from hosts to the local array will have lower latency; paths from each local host to the remote array will have higher latency.
For the best performance in an active/active synchronous replication environment, hosts should be prevented from using paths that access the remote array unless necessary. For example, in the image below, if VM 2A (running on host B) were to perform a write to volume A over the host-side connection to array A, that write would incur twice the latency of the inter-site link, once for each traverse of the network. With the maximum supported RTT of 11ms, the write would experience 11ms of latency for the trip from host B to array A and another 11ms of latency while array A synchronously sends the write back to array B.
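The arithmetic in this example can be written out explicitly. The function below is a back-of-the-envelope sketch with assumptions stated up front: it treats latency inside a site as negligible, counts one inter-site round trip for synchronous replication on every write, and uses a hypothetical function name and parameters that do not come from any Pure Storage tool.

```python
def write_latency_ms(inter_site_rtt_ms: float, path: str,
                     array_service_ms: float = 0.0) -> float:
    """Approximate latency added to one write in a stretched pod.

    "local"  path: the host writes to its local array, which then performs
                   one synchronous replication round trip to the other array.
    "remote" path: the host first crosses the inter-site link to reach the
                   remote array, which then replicates the write back across
                   the same link.
    """
    if path not in ("local", "remote"):
        raise ValueError("path must be 'local' or 'remote'")
    replication = inter_site_rtt_ms
    host_to_array = inter_site_rtt_ms if path == "remote" else 0.0
    return host_to_array + replication + array_service_ms


rtt = 11.0  # maximum supported round trip time between arrays, in ms
print("local path :", write_latency_ms(rtt, "local"), "ms")
print("remote path:", write_latency_ms(rtt, "remote"), "ms")
```

With an 11ms link, this model gives 11ms for a write on a local path and 22ms on a remote path, which is the doubling described in the text.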
With Pure Storage Purity ActiveCluster, there are no such management headaches. ActiveCluster does make use of ALUA to expose paths to local hosts as active/optimized paths and expose paths to remote hosts as active/non-optimized. However, there are two advantages in the ActiveCluster implementation.
- In ActiveCluster, volumes in stretched pods are read/write on both arrays. There is no such thing as a passive volume that cannot service both reads and writes.
- The optimized path is defined on a per host-to-volume connection basis using a preferred-array option; this ensures that regardless of what host a VM or application is running on, it will have a local optimized path to that volume.
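How a preferred-array setting translates into ALUA path states, and how a host then chooses paths, can be sketched as follows. This is a simplified model, not Purity or ESXi code: the function names and the host and array names are hypothetical, and treating every path as optimized when no preferred array is set is an assumption of this sketch.

```python
OPTIMIZED = "active/optimized"
NON_OPTIMIZED = "active/non-optimized"


def alua_state(preferred_arrays: set, array: str) -> str:
    """Path state reported to a host for paths through the given array."""
    # Assumption: with no preferred array configured, all paths are optimized.
    if not preferred_arrays or array in preferred_arrays:
        return OPTIMIZED
    return NON_OPTIMIZED


def paths_in_use(preferred_arrays: set, arrays_up: set) -> list:
    """Arrays a host sends I/O to: optimized paths first, others as fallback."""
    states = {array: alua_state(preferred_arrays, array) for array in arrays_up}
    optimized = sorted(a for a, s in states.items() if s == OPTIMIZED)
    # Non-optimized paths are used only when no optimized path remains.
    return optimized or sorted(states)


hosts = {"hostA": {"arrayA"}, "hostB": {"arrayB"}}
for host, preferred in hosts.items():
    print(host, paths_in_use(preferred, {"arrayA", "arrayB"}))

# If arrayA becomes unavailable, hostA falls back to its non-optimized paths.
print("hostA after arrayA failure:", paths_in_use({"arrayA"}, {"arrayB"}))
```

Because the preference belongs to the host rather than the volume in this model, a VM that moves from hostA to hostB picks up hostB's local optimized paths without any change to the volume.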
ActiveCluster enables truly active/active datacenters and removes the concern around what site or host a VM runs on; the VM will always have the same performance regardless of site. While VM 1A is running on host A and accessing volume A, it will use only the local optimized paths, as shown in the next image.
If the VM or application is switched to a host in the other site, with the data left in place, only local paths will be used in the other site as shown in the next image. There is no need to adjust path priorities or migrate the data to a different volume to ensure local optimized access.
A non-uniform storage access model is used in environments where there is host-to-array connectivity of either FC or ethernet (for iSCSI), only locally within the same site. Ethernet connectivity for the array-to-array replication interconnect must still exist between the two sites. When deployed in this way, each host has access to a volume only through the local array and not the remote array. The solution supports connecting arrays with up to 11ms of round trip time (RTT) latency between the arrays.
Hosts will distribute I/O across all available paths to the storage. Because hosts have no paths to the remote array in this model, all of those paths are local Active/Optimized paths.
ActiveCluster Glossary of Terms
The following terms, introduced with ActiveCluster, are used repeatedly in this document:
- Pod: a pod is a namespace and a consistency group. Synchronous replication can be activated on a pod, which makes all volumes in that pod present on both FlashArrays in the pod.
- Stretching: stretching a pod is the act of adding a second FlashArray to a pod. When a pod is stretched to another array, the volume data begins to synchronize, and when synchronization completes, all volumes in the pod are available on both FlashArrays.
- Unstretching: unstretching a pod is the act of removing a FlashArray from a pod. This can be done from either FlashArray. When an array is removed, the volumes and the pod itself are no longer available on the FlashArray that was removed.
- Restretching: when a pod is unstretched, the array that was removed from the pod keeps a copy of the pod in a pending-eradication state for 24 hours. If the pod is restretched to that array before the eradication completes, it can be resynchronized quickly without resending all of the data.
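The stretch, unstretch, and restretch terms can be tied together in a small state model. This is an illustrative sketch only: the `PodLifecycle` class, its method names, and the returned strings are hypothetical; the 24-hour window is the only value taken from the glossary above.

```python
from datetime import datetime, timedelta

ERADICATION_WINDOW = timedelta(hours=24)  # pending-eradication period from the text


class PodLifecycle:
    """Hypothetical model of stretching, unstretching, and restretching a pod."""

    def __init__(self, name: str, array: str):
        self.name = name
        self.arrays = {array}
        self.pending_eradication = {}  # removed array -> time it was unstretched

    def stretch(self, array: str, now: datetime) -> str:
        if len(self.arrays) >= 2:
            raise ValueError("a pod can be stretched across at most two arrays")
        removed_at = self.pending_eradication.pop(array, None)
        self.arrays.add(array)
        # A retained copy avoids resending all data, but only within the window.
        if removed_at is not None and now - removed_at < ERADICATION_WINDOW:
            return "quick restretch: resend only what changed"
        return "full stretch: send all data"

    def unstretch(self, array: str, now: datetime) -> None:
        if len(self.arrays) < 2 or array not in self.arrays:
            raise ValueError("the pod is not stretched to that array")
        self.arrays.remove(array)
        self.pending_eradication[array] = now


t0 = datetime(2024, 1, 1, 9, 0)
pod = PodLifecycle("pod1", "arrayA")
print(pod.stretch("arrayB", t0))
pod.unstretch("arrayB", t0 + timedelta(hours=1))
print(pod.stretch("arrayB", t0 + timedelta(hours=3)))
```

Passing the timestamp in explicitly keeps the example deterministic; a restretch attempted more than 24 hours after the unstretch falls back to a full stretch in this model.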