
Implementing vSphere Metro Storage Cluster With ActiveCluster: Failure Scenarios


The Table of Contents for this guide can be found here and is helpful for navigating to the rest of this guide.

Failure Scenarios

A core part of understanding vSphere HA and ActiveCluster is understanding how vSphere and ActiveCluster respond to failures. These failures are separated into two broad categories:

  1. Storage access lost. This could be due to the storage array failing, a pod going inactive, or loss of connectivity.
  2. Host failure. This could be due to the physical host failing, the hypervisor crashing, or network partitioning of the host.

vSphere HA Host Failure Response

A common failure is some type of host failure that renders ESXi unable to keep running virtual machines. This could be a power failure, a host kernel crash or something else.

vSphere HA will take over and restart virtual machines on other hosts when a host failure has been detected.

In this environment, there is a VM named VM02 running on host ac-esxi-a-01:

VM02OnA-01.png

A host failure occurs and is marked as lost:

ac110.png

At this point, vSphere HA kicks in and restarts the VM on another host. Since this environment has a rule that this VM should only run on hosts in host group A, it is restarted on an “A” host.

VM02Restarted.png

This behavior is not unique to vSphere HA and ActiveCluster—this is a property of shared storage in general when combined with vSphere HA. The true benefit of ActiveCluster with vSphere HA is when all of the compute goes down in the site local to a set of VMs.

Continuing with the failure above, the remaining three hosts in host group A fail too (the four of them comprise all hosts in datacenter A). Since there are no hosts left in host group A and the affinity rule is a “should” rule, vSphere HA has no choice but to restart the VMs on hosts in host group B, which are in the other datacenter.

VM02PoweredOnB.png
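To make the restart placement concrete, the following minimal Python sketch models how a “should” VM/Host affinity rule influences where a VM is restarted: a host from the preferred group is used when one is available, and any other host otherwise. This is an illustrative model only, not the actual vSphere HA placement algorithm, and the host names other than ac-esxi-a-01 are hypothetical.

```python
# Illustrative sketch of "should"-rule restart placement (not the real vSphere HA code).
def pick_restart_host(preferred_group, available_hosts, host_groups):
    """Prefer an available host from the VM's 'should' host group; otherwise use any host."""
    preferred = [h for h in available_hosts if h in host_groups[preferred_group]]
    candidates = preferred or available_hosts   # "should" rule: fall back when the group is empty
    return candidates[0] if candidates else None

# Hypothetical inventory: four "A" hosts and four "B" hosts.
host_groups = {
    "A": {f"ac-esxi-a-0{i}" for i in range(1, 5)},
    "B": {f"ac-esxi-b-0{i}" for i in range(1, 5)},
}

# One "A" host fails: the VM restarts on another "A" host.
up = host_groups["A"] - {"ac-esxi-a-01"} | host_groups["B"]
print(pick_restart_host("A", sorted(up), host_groups))   # an ac-esxi-a-* host

# All "A" hosts fail: the "should" rule is overridden and a "B" host is used.
up = host_groups["B"]
print(pick_restart_host("A", sorted(up), host_groups))   # an ac-esxi-b-* host
```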

Since ActiveCluster presents the same VMFS datastore to both datacenters from a FlashArray in each datacenter, the hosts in the remote datacenter can restart the VMs from the failed hosts with ease.
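The reason both sites can use the datastore interchangeably is that an ActiveCluster-stretched volume is presented with the same device identity (serial number) by both FlashArrays, so ESXi treats every path as a path to the same device and the VMFS is recognized without resignaturing. The sketch below illustrates that grouping; the path names, array names, and serial value are made up for illustration.

```python
from collections import defaultdict

# Each SCSI path reports the serial of the device behind it. A stretched volume
# reports the same serial through both FlashArrays, so all of these paths collapse
# into a single device (and therefore a single VMFS datastore) on the host.
paths = [
    # (path name, array, reported device serial) -- illustrative values only
    ("vmhba1:C0:T0:L1", "flasharray-a", "serial-0001"),
    ("vmhba1:C0:T1:L1", "flasharray-a", "serial-0001"),
    ("vmhba2:C0:T0:L1", "flasharray-b", "serial-0001"),
    ("vmhba2:C0:T1:L1", "flasharray-b", "serial-0001"),
]

devices = defaultdict(list)
for name, array, serial in paths:
    devices[serial].append((name, array))

for serial, device_paths in devices.items():
    arrays = sorted({array for _, array in device_paths})
    print(f"device {serial}: {len(device_paths)} paths via {arrays}")
# -> one device with four paths, reachable through both arrays
```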

The behavior of host failure is not impacted by whether or not the cluster is configured for Uniform or Non-Uniform connectivity.

vSphere HA and Storage Failure Response

In more severe failures, the underlying storage system or SAN can be lost entirely, leaving the ESXi hosts running but with no storage access. This is a different type of failure than a host failure: the host is still online, but it cannot continue to run its VMs because there are no available paths to the local FlashArray.

Depending on the failure and on the configuration of the stretched cluster (uniform or non-uniform), vSphere reacts differently and the failover mechanism changes.

Storage Failure Response with a Non-Uniform Stretched Cluster

As discussed previously in this guide, a non-uniform stretched cluster is a cluster of ESXi hosts which are split across two physical sites, usually half in one datacenter and half in the other. Each datacenter has a FlashArray, and those two FlashArrays have ActiveCluster enabled on one or more volumes, allowing the storage volume(s) to be presented simultaneously at both sites. The non-uniform portion of this configuration defines which paths to the storage the hosts see. In a non-uniform configuration, hosts only have paths to the volume (or volumes) via the FlashArray that is local to their datacenter.

In other words, if the FlashArray becomes unavailable, the ESXi hosts local to it no longer have access to the storage and a vSphere HA failover must occur to reboot any affected VMs on the remote hosts in the cluster that have access to the storage via the remote FlashArray.

In this scenario, essentially any storage-related failure will cause a vSphere HA failover such as:

  • Loss of local FlashArray
  • Accidental or intentional removal of the volume from host access by an administrator
  • Loss of SAN connectivity due to failure, power loss or administrative change
  • Failure of host bus adapters (HBAs) in host or connecting cables

These failures all lead to the same result in a non-uniform configuration: the host loses storage connectivity and a vSphere HA failover occurs.
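A simple way to reason about these failures is to ask, per host, whether any array that still hosts the volume is reachable. The sketch below models that check for uniform versus non-uniform connectivity; the host and array names are hypothetical and this is an illustration, not product logic.

```python
# Illustrative model (not product code): which hosts keep datastore access after an
# array failure, depending on whether the stretched cluster is uniform or non-uniform.
def hosts_with_access(hosts, surviving_arrays, uniform):
    """hosts: {host_name: local_array}; uniform hosts have paths to both arrays."""
    keep = []
    for host, local_array in hosts.items():
        reachable = set(surviving_arrays) if uniform else {local_array} & set(surviving_arrays)
        if reachable:
            keep.append(host)
    return keep

hosts = {f"esxi-a-0{i}": "array-a" for i in range(1, 5)}
hosts.update({f"esxi-b-0{i}": "array-b" for i in range(1, 5)})

# FlashArray "B" fails. Non-uniform: only the "A" hosts keep access, so vSphere HA
# must restart the affected VMs there. Uniform: every host keeps access.
print(hosts_with_access(hosts, surviving_arrays={"array-a"}, uniform=False))
print(hosts_with_access(hosts, surviving_arrays={"array-a"}, uniform=True))
```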

In the following environment, there are eight hosts total: four in site “A” and four in site “B”. There is a VMFS datastore presented to all hosts in the cluster, with a total of eight VMs on it.

VMFSDatastore.png

The datastore is hosted on a FlashArray volume that has been stretched to both FlashArrays using ActiveCluster.

VMFSVolume.png

Since this is a non-uniform configuration, the “A” hosts only have access to the volume via paths to the “A” FlashArray, and the “B” hosts only have access to the volume via paths to the “B” FlashArray.

In this cluster, half of the 8 virtual machines are running on “B” hosts and half are running on “A” hosts, so a storage failure on either side will cause 4 VMs to be restarted on the remaining hosts on the other side.

For this example, FlashArray “B” will experience a failure:

 

ac124.png

This causes the VMFS datastore to become inaccessible on the “B” hosts.

InaccessibleSideB.png

But because it is presented through both FlashArrays via ActiveCluster, the “A” hosts continue to have access to the datastore:

AccessibleSideA.png

Once the timeouts have been reached (dependent on the APD/PDL timeout configuration), and provided the APD/PDL responses have been enabled, ESXi shuts down the VMs on the hosts that lost storage access and restarts them on hosts with surviving storage connections to the volume. While they are being shut down, the VMs are briefly marked as inaccessible.

InaccessibleVMsSideA.png
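The timing of this response is governed by the APD/PDL settings. The sketch below models the APD timeline using commonly cited vSphere defaults (a 140-second APD timeout plus a 3-minute VM failover delay for VM Component Protection); these values are assumptions for illustration and should be verified against your own cluster configuration.

```python
# Rough timeline model of the APD response (VM Component Protection) -- an
# illustrative sketch only. The 140 s APD timeout and 3-minute failover delay are
# commonly cited vSphere defaults; confirm them in your own environment.
APD_TIMEOUT_S = 140          # Misc.APDTimeout: all paths down -> APD declared "timed out"
VMCP_FAILOVER_DELAY_S = 180  # additional delay before HA terminates and restarts the VMs

def apd_action(seconds_since_apd_start, response_enabled=True):
    if not response_enabled:
        return "no action: APD response disabled, VMs keep retrying I/O"
    if seconds_since_apd_start < APD_TIMEOUT_S:
        return "I/O retried on all paths; no VM action yet"
    if seconds_since_apd_start < APD_TIMEOUT_S + VMCP_FAILOVER_DELAY_S:
        return "APD timeout reached; waiting out the failover delay"
    return "VMs terminated on the APD host and restarted on hosts with storage access"

for t in (60, 150, 400):
    print(f"t={t:>3}s: {apd_action(t)}")
```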

The VMs will then come back online and be powered on elsewhere. When the FlashArray comes back online, the storage can be seen again by the “B” hosts. If there are host-affinity rules, the VMs will be moved back almost immediately by vSphere DRS. Otherwise, they will remain where they are until they are manually moved, another failure occurs, or resource usage demands that vSphere DRS rebalance the VMs across the cluster.

VMMigrationTask.png

Storage Failure Response with a Uniform Stretched Cluster

A uniform stretched cluster means that all hosts in the cluster have paths to a stretched volume through both FlashArrays servicing that volume via ActiveCluster. This configuration provides an additional level of resilience for virtual machines, as VMs can continue to run non-disruptively through the failure of an entire FlashArray, or the connectivity to it.

If all hosts in one site lose all storage access (those hosts can reach neither their local FlashArray nor the remote one), the failover process is identical to the non-uniform failover process shown in the previous section: vSphere HA restarts the affected virtual machines on the hosts in the remote site.

For uniform configurations, the case where just a single array fails is different. The VMs can continue to run on their hosts, because those hosts can still access the storage via paths to the remote FlashArray. This is therefore not a vSphere HA restart, which incurs downtime until the VM is rebooted, but simply a multipathing failover to the paths to the remote FlashArray. A multipathing failover is entirely non-disruptive, as no VM reboot is required.
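The distinction can be summarized as: if any path to the volume survives, multipathing handles the failure and the VMs keep running; only when every path is gone does vSphere HA restart them. Below is a minimal illustrative sketch (not product logic), with a hypothetical path name.

```python
# Minimal decision sketch: multipathing failover versus vSphere HA restart.
def response_for_host(surviving_paths):
    optimized = [p for p in surviving_paths if p["optimized"]]
    if optimized:
        return "no action: I/O continues on optimized (local) paths"
    if surviving_paths:
        return "multipathing failover: I/O continues on non-optimized (remote) paths, VMs keep running"
    return "all paths down: vSphere HA restarts the VMs on hosts that still have access"

# Uniform host after its local FlashArray fails: only remote paths remain.
remote_only = [{"name": "vmhba2:C0:T0:L1", "optimized": False}]
print(response_for_host(remote_only))

# Non-uniform host after its local FlashArray fails: no paths remain at all.
print(response_for_host([]))
```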

The below environment is configured in a uniform fashion, so all eight hosts have access to the VMFS protected via ActiveCluster through both FlashArrays:

ActiveIOActive.png

Furthermore, the preferred FlashArray is set on the host object, so that only the paths from a given host to its local FlashArray are in use (denoted by the (I/O) marker).

This ensures that, while those paths are available, reads and writes go down optimized paths to provide the best possible performance. The non-optimized paths are the paths to the VMFS via the remote FlashArray, and they are only used in the absence of any optimized paths.
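Conceptually, the preferred-array setting controls which paths are reported as active/optimized versus active/non-optimized. The following sketch models that mapping; the array and path names are hypothetical, and the function is an illustration rather than the FlashArray or ESXi implementation.

```python
# Illustrative sketch: how a "preferred array" setting maps to ALUA-style path states.
# Paths via the host's preferred (local) FlashArray are reported active/optimized and
# carry the I/O; paths via the remote array are active/non-optimized and are used only
# when no optimized path remains.
def classify_paths(preferred_array, paths_by_array):
    states = {}
    for array, array_paths in paths_by_array.items():
        state = "active/optimized" if array == preferred_array else "active/non-optimized"
        for path in array_paths:
            states[path] = state
    return states

paths_by_array = {  # hypothetical path names
    "flasharray-a": ["vmhba1:C0:T0:L1", "vmhba1:C0:T1:L1"],
    "flasharray-b": ["vmhba2:C0:T0:L1", "vmhba2:C0:T1:L1"],
}
for path, state in classify_paths("flasharray-a", paths_by_array).items():
    print(path, "->", state)
```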

After a failure of a FlashArray, the datastores from that array that are not protected by ActiveCluster go offline:

InaccessibleSideB.png

The ActiveCluster-enabled volume “vMSC-VMware::nelson-stretched-volume” stays online.

The paths to that volume through the failed FlashArray are gone, but the paths to the remote FlashArray remain available and become active.

PathsToArray.png

At this point, the virtual machines are running on non-optimized paths, meaning their I/Os are traversing the WAN to the remote FlashArray, which incurs greater latency than if the VMs were running on hosts local to their FlashArray. If the FlashArray failure is expected to be extended, it might be advisable to vMotion the VMs running on non-optimized paths to hosts in the other datacenter that still have paths to their local FlashArray. If the failed FlashArray is expected to recover soon (as in the case of a temporary loss of power), the simplest option may be to leave the VMs where they are; they will resume running on optimized paths as soon as the FlashArray comes back online.
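Whether to migrate is a judgment call. As a rough illustration, the sketch below chooses between leaving VMs in place and vMotioning them based on the expected outage duration; the one-hour threshold and VM names are arbitrary examples, not recommendations.

```python
# Judgment-call sketch (illustrative only): brief outage -> leave VMs on non-optimized
# paths; extended outage -> migrate them to hosts local to the surviving FlashArray.
def plan_for_outage(expected_outage_minutes, vms_on_non_optimized_paths, threshold_minutes=60):
    if expected_outage_minutes < threshold_minutes:
        return {"action": "leave in place", "vms": []}
    return {"action": "vMotion to hosts local to the surviving array",
            "vms": list(vms_on_non_optimized_paths)}

print(plan_for_outage(15, ["VM02", "VM05"]))
print(plan_for_outage(240, ["VM02", "VM05"]))
```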

FlashArray Storage Access

The following table shows access to stretched pod volumes, through Purity//FA 5.2 and with Purity//FA 5.3+, based on the state of each solution component (one array, the other array, the replication link, and the mediator):

| One Array | Other Array | Replication Link | Mediator | Purity//FA 5.2 | Purity//FA 5.3+ |
| --- | --- | --- | --- | --- | --- |
| UP | DOWN | UP | UP | Available on one array | Available on one array |
| UP | UP | DOWN | UP | Available on one array | Available on one array |
| UP | UP | UP | DOWN | Available on both arrays | Available on both arrays |
| UP | DOWN | DOWN | UP | Available on one array | Available on one array |
| UP | UP | DOWN* | DOWN* | Unavailable | Unavailable |
| UP | DOWN* | UP | DOWN* | Unavailable | Unavailable |
| UP (pre-elected) | UP | DOWN | DOWN** | Unavailable | Available on one array |
| UP (pre-elected) | DOWN | UP | DOWN** | Unavailable | Available on one array*** |

* Simultaneous failures of components 

** Pre-Election completes before second component failure.

*** Assumes the ‘Other Array’ was not the Pre-Elected array.  If the Pre-Elected array fails then the stretched pod volumes are unavailable. 

Note: If the Mediator becomes unavailable after an array failure or a replication link failure has already been sustained, access to the mediator is no longer required and access to storage remains available on one array.
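The rules in the table can be captured in a small decision function. The sketch below is an illustrative model of that logic (not Purity code): an array keeps a stretched pod online if synchronous replication is intact, if it wins the race to the mediator, or, on Purity//FA 5.3+, if it was pre-elected before the mediator became unavailable.

```python
# Illustrative model of the availability rules in the table above (not Purity code).
def pod_available_on(array_up, peer_up, repl_link_up, wins_mediator_race, pre_elected_here=False):
    """Does the stretched pod stay online on *this* array?"""
    if not array_up:
        return False                 # this array itself is down
    if peer_up and repl_link_up:
        return True                  # synchronous replication intact: online on both arrays
    # Peer unreachable (array or link down): stay online only if this array won the race
    # to the mediator, or (Purity//FA 5.3+) if it was pre-elected before the mediator was lost.
    return wins_mediator_race or pre_elected_here

# Other array fails, mediator reachable: available on the surviving array.
print(pod_available_on(True, False, True, wins_mediator_race=True))                          # True
# Replication link and mediator lost simultaneously, no pre-election: unavailable.
print(pod_available_on(True, True, False, wins_mediator_race=False))                         # False
# Mediator lost first, pre-election completed here, then the link fails: still available (5.3+).
print(pod_available_on(True, True, False, wins_mediator_race=False, pre_elected_here=True))  # True
```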

Host and Storage Network Failures

Failure scenario: Single or multiple host failure

Failure behavior: Applications can automatically fail over to other hosts in the same site, or to hosts in the other site connected to the other array.

This is driven by VMware HA, assuming clusters are stretched between sites.

Failure scenario: Stretched SAN fabric outage (FC or iSCSI); failure of the SAN interconnect between sites

Failure behavior: Host IO automatically continues on local paths in the local site.

Uniformly connected hosts:

  • experience some storage path failures for paths to the remote array and continue IO on paths to the local array.
  • in each site will maintain access to local volumes with no more than a pause in IO.

Non-uniformly connected hosts:

  • do not have a SAN interconnect between sites, so this scenario is not applicable.

Failure scenario: SAN fabric outage in one site

Failure behavior: Applications can automatically fail over to hosts at the other site connected to the other array.

This is driven by host cluster software (VMware HA, Oracle RAC, SQL clusters, and so on), assuming clusters are stretched between sites.

Uniformly connected hosts:

  • in the site without the SAN outage, experience some storage path failures for paths to the remote array and continue IO on paths to the local array.
  • in the site with the SAN outage, will experience total loss of access to volumes, and applications must fail over to the other site as mentioned above.

Non-uniformly connected hosts:

  • in the site without the SAN outage, will maintain access to local volumes.
  • in the site with the SAN outage, will experience total loss of access to volumes, and applications must fail over to the other site as mentioned above.

The tables above and below describe how ESXi hosts in a stretched cluster configuration respond to certain failures, including:

  • Host failures—what happens to VMs running on a host when it fails?
  • Array failure—what happens when a FlashArray fails?

The focus is not on why the array goes down, but on how ESXi and vSphere HA react if it does.

Array, Replication Network, and Site Failures

Failure scenario: Local HA controller failover in one array

Failure behavior: After a short pause for the duration of the local HA failover, host I/O will continue to both arrays without losing RPO-Zero.

Async Replication source transfers may resume from a different array than before the failover.

Failure scenario: Replication link failure

Failure behavior: After a short pause, host IO continues to volumes only on the array that contacts the mediator first. This is per pod.

Failover is automatic and transparent and no administrator intervention is necessary.

Uniformly connected hosts:

  • after a short pause in IO, continue IO to the array that won the race to the mediator.
  • experience some storage path failures for paths to the array that lost the race to the mediator.
  • in the mediator-losing site, will maintain access to volumes remotely across the stretched SAN to the mediator-winning site.

Non-uniformly connected hosts:

  • in the mediator-winning site, will maintain access to volumes with no more than a pause in IO.
  • in the mediator-losing site, will experience total loss of access to volumes.
  • use host cluster software to recover the apps to a host in the mediator-winning site. This may be automatic depending on the type of cluster.

Failure scenario: Mediator failure or access to mediator fails

Failure behavior: No effect. Host IO continues through all paths on both arrays as normal.

Failure scenario: Entire single array failure

Failure behavior: After a short pause, host IO automatically continues on the surviving array.

Failover is automatic and transparent and no administrator intervention is possible or necessary.

Uniformly connected hosts:

  • in the surviving array site, after a short pause in IO, continue IO to the surviving array that was able to reach the mediator.  
  • experience some storage path failures for paths to the failed array.
  • in the site where the array failed will do IO to volumes remotely across the stretched SAN to the surviving array.

Non-uniformly connected hosts:

  • in the surviving array site, after a short pause in IO, continue IO to the surviving array that was able to reach the mediator.  
  • in the failed array site will experience total loss of access to volumes.
  • use host cluster software to recover the apps to a host in the other site. This may be automatic depending on the type of cluster.

Failure scenario: Entire site failure

Failure behavior: After a short pause, host IO automatically continues on the surviving array.

Failover of the array is automatic and transparent, and no administrator intervention is possible or necessary.

Uniformly connected hosts:

  • in the surviving array site, after a short pause in IO, continue IO to the surviving array that was able to reach the mediator.
  • experience some storage path failures for paths to the array in the failed site.
  • use host cluster software to recover the apps to hosts in the surviving site. This may be automatic depending on the type of cluster.

Non-uniformly connected hosts:

  • in the surviving site, after a short pause in IO, continue IO to the surviving array that was able to reach the mediator.
  • in the surviving site will maintain access to local volumes with no more than a pause in IO.  
  • use host cluster software to recover the apps to hosts in the surviving site. This may be automatic depending on the type of cluster.

Failure scenario: Mediator failure followed, within 5 minutes, by a second failure of the replication link, of one array, or of one site (the second failure occurs while the mediator is unavailable)

Failure behavior: Host IO access is lost to sync rep volumes on both arrays.

This is a double failure scenario; data service is not maintained through failure of either array if the mediator is unavailable.

Options to recover:

  1. Restore access to either the mediator or replication interconnect and the volumes will automatically come back online, as per above scenarios.
  2. Clone the pod to create new volumes with different LUN serial numbers. The new LUN serial numbers will prevent hosts from automatically connecting to and using the volumes, avoiding split-brain. Then re-identify and reconnect all LUNs on all hosts.

 

Failure scenario: Mediator failure or access to mediator fails; 5 minutes or more later, replication network fails

Failure behavior: Pre-election will be used to keep the stretched pod volumes online on the FlashArray pre-determined by the pod failover preference.

Conclusion

This guide covered the combination of active-active replication on the FlashArray, called ActiveCluster, with vSphere High Availability, collectively referred to as a solution called vSphere Metro Storage Cluster (vMSC). The simplicity and flexibility of the FlashArray itself and the ActiveCluster feature allow administrators to configure and offer a highly available storage solution with ease.
