
Understanding Virtual Volumes and Failure Scenarios

Pure Storage supports vSphere Virtual Volumes (VVols) in Purity 5.0 and later, and with that support comes a new topology that separates the Data Path from the Control Path. Understanding how these two paths work, and what the related warnings and errors mean, will help identify the impact of a given situation. This KB covers some of the failure scenarios that can occur when using VVols.

 


Overview

A crucial step in understanding how VVols work with Pure Storage is knowing the terminology and the architecture. This Overview section covers the terminology used throughout the KB, as well as a high-level view of the VVol architecture.

Terminology Review

Name/Concept
Explanation

Protocol Endpoint (PE)
A PE is a volume of zero capacity with a special setting in its Vital Product Data (VPD) page that ESXi detects during a SCSI inquiry. It effectively serves as a mount point for VVols. It is the only FlashArray volume that must be manually connected to hosts to use VVols. The industry term for a PE is "Administrative Logical Unit".

VASA
vSphere APIs for Storage Awareness (VASA) is the VMware-designed API used to communicate between vSphere and the underlying storage.

Management Path / Control Path
This is the TCP/IP path between the compute management layer (vSphere) and the storage management layer (FlashArray). Instructions to create, delete and otherwise manage storage are issued on this path.

Data Path/Plane
The Data Path is the established connection from the ESXi hosts to the Protocol Endpoint on the FlashArray. This connection is established over the storage fabric. Today this means iSCSI or Fibre Channel.

SPBM
Storage Policy Based Management (SPBM) is a framework designed by VMware to provision and/or manage storage. Users can create policies of selected capabilities or tags and assign them to a VM or specific virtual disk. SPBM for internal storage is called vSAN; SPBM for external storage is called VVols. A vendor must support VASA to enable SPBM for their storage.

VASA Provider / Storage Provider
A VASA provider (sometimes called a storage provider) is an instance of the VASA service that a storage vendor offers and that a customer has deployed. For the FlashArray these are built into the FlashArray controllers, one in each.

Virtual Volume (VVol)
Virtual Volumes is the name for this full architecture. A specific VVol is any volume on the array that is in use by the VMware environment and managed by the VASA provider. A VVol-based volume is not fundamentally different from any other volume on the FlashArray. The main distinction is that when it is in use, it is attached as a sub-LUN via a PE instead of as a direct LUN.

VVol Datastore / VVol Storage Container
The VVol Datastore is not a LUN, file system or volume. A VVol Datastore is a target provisioning object that represents a FlashArray, a quota for capacity, and a logical collection of config VVols.

VVol Architecture

Here is a generic high level view of the VVol Architecture. Make note that the Control/Management Path is separate from the Data Path.
Picture1.png

 


Failure Scenarios

VVol failures fall into two categories: failures on the Management (Control) Path and failures on the Data Path.
Management Path failures generally do not impact the operation of running VVol VMs. Data Path failures generally do.

Impact to the Management Path

When an event impacts the existing connection on the Management Path, I/O from currently running VVol VMs is not disrupted. What is disrupted is the VASA communication between the ESXi hosts, vCenter and the storage array. While that communication is down, powered-off VVol VMs cannot be powered on, vSphere vMotion and Storage vMotion operations fail, VVol VM settings cannot be edited or updated, no new VMs can be created on the VVol Datastore, and the VVol Datastore shows as inaccessible in the vCenter UI. Essentially, no reconfiguration is possible until the Management Path comes back online. Examples of operations that fail include:

  • Creating new virtual disks
  • Resizing virtual disks
  • Deleting virtual disks
  • Assigning a storage policy to a VM or virtual disk 
  • Powering off a VM
  • Powering on a VM
  • Moving a VM

However, it is important to understand that existing VVol VMs that are powered on will stay powered on and continue to service I/O. As an analogy, in a non-VVol environment, what happens to your I/O if you cannot access the FlashArray web management interface? It of course continues to run, but you cannot make any changes to the storage configuration until you regain access. vSphere losing Management Path access is identical in concept: vSphere cannot make any configuration changes until VASA access is restored. If Management Path (VASA) access goes down, DO NOT PANIC and start rebooting VMs. Instead, log into a running VM to make sure the Data Path is not also down, then start troubleshooting why vSphere cannot access VASA.

One note before starting: re-registering storage providers has no impact on running VVol VMs, whether the providers are currently active or offline. In the video example towards the bottom of this page, the Storage Providers are in an offline state when they are re-registered. The same holds if the storage providers are removed while VVol VMs are running on that vCenter and are then registered again. Removing and registering storage providers is not impactful to the VVol Data Path.

Let's cover some possible events that lead to an impact on the connection of the ESXi Hosts or vCenter Server to the FlashArray VASA Providers. Then we'll review what this impact can look like.

Firewall or Security Rules

An example of this is port 8084 being blocked or restricted between the vCenter Server and ESXi Host management network and the FlashArray's management network for CT0 and CT1.
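As a first check for this kind of problem, a plain TCP connection test against port 8084 on each controller can show whether the path is open at all. The following is a minimal sketch, not a Pure Storage or VMware tool: the function name `vasa_port_reachable` is made up for illustration, and it only tells you about the path from the machine where it runs, so it is most useful when run from the same management network as vCenter or the ESXi hosts.

```python
import socket


def vasa_port_reachable(host: str, port: int = 8084, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This only proves the TCP path is open from where the script runs.
    It does not validate certificates or the VASA service itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

For example, calling `vasa_port_reachable("10.21.88.113")` would test the CT0 address that appears in the outputs in this KB. A `False` result from the vCenter side combined with `True` from elsewhere would point at a rule between those networks.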

Here is an example of a firewall change after the storage providers have been registered and VVols is currently in use.

Here we have the Storage Providers Online and Active:

vvol failure - 01 - firewall - 01.png

Then, looking at esxcli, we can see the protocol endpoint and storage container are accessible and the VASA provider is online:

[root@ESXi-6:~] esxcli storage vvol protocolendpoint list
naa.624a93702dcf29ad6aca49130002cba2
   Host Id: naa.624a93702dcf29ad6aca49130002cba2
   Array Id: com.purestorage:2dcf29ad-6aca-4913-b62e-a15875c6635f
   Type: SCSI
   Accessible: true
   Configured: true
   Lun Id: naa.624a93702dcf29ad6aca49130002cba2
   Remote Host:
   Remote Share:
   NFS4x Transport IPs:
   Server Scope:
   Server Major:
   Auth:
   User:
   Storage Containers: b487b73a-6128-3ad2-873b-8ddb1ba45d4f
   
[root@ESXi-6:~] esxcli storage vvol storagecontainer list
sn1-m20-c12-25-vvol-container
   StorageContainer Name: sn1-m20-c12-25-vvol-container
   UUID: vvol:b487b73a61283ad2-873b8ddb1ba45d4f
   Array: com.purestorage:2dcf29ad-6aca-4913-b62e-a15875c6635f
   Size(MB): 8589934592
   Free(MB): 8581017681
   Accessible: true
   Default Policy:
   
[root@ESXi-6:~] esxcli storage vvol vasaprovider list
sn1-m20-c12-25-ct0
   VP Name: sn1-m20-c12-25-ct0
   URL: https://10.21.88.113:8084/version.xml
   Status: online
   Arrays:
         Array Id: com.purestorage:2dcf29ad-6aca-4913-b62e-a15875c6635f
         Is Active: true
         Priority: 200
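When checking many hosts, it can be handy to turn this kind of listing into a structure that a script can inspect. Here is a minimal sketch that assumes the layout shown above (an unindented record name followed by indented "Key: Value" lines). The function name `parse_esxcli_listing` is hypothetical, and nested fields such as those under "Arrays:" are simply flattened into the same record, which is only adequate when a provider reports a single array.

```python
def parse_esxcli_listing(text: str) -> dict:
    """Parse `esxcli storage vvol ... list` style output into {name: {field: value}}.

    Pass only the command output, without the shell prompt line.
    """
    records = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        if not line[0].isspace():
            # An unindented line starts a new record (for example the provider name).
            current = records.setdefault(line.strip(), {})
        elif current is not None and ":" in line:
            # Split on the first colon only, so URLs and array IDs stay intact.
            key, _, value = line.strip().partition(":")
            current[key.strip()] = value.strip()
    return records
```

With the vasaprovider output above, `parse_esxcli_listing(output)["sn1-m20-c12-25-ct0"]["Status"]` would give the provider status as a string.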

After showing we have a good and healthy configuration, let's say someone made a firewall change or blocked port 8084 on an upstream switch....

Okay, now we can see that the VVol Datastore is inaccessible and the Storage Providers are offline:

vvol failure - 01 - firewall - 02.png

Checking with esxcli, we can see the vasaprovider is in 'syncError' and the Storage Container is inaccessible. Note that the protocol endpoint is still active and accessible, as the Data Path has not been impacted.

[root@ESXi-6:~] esxcli storage vvol storagecontainer list

sn1-m20-c12-25-vvol-container
   StorageContainer Name: sn1-m20-c12-25-vvol-container
   UUID: vvol:b487b73a61283ad2-873b8ddb1ba45d4f
   Array: com.purestorage:2dcf29ad-6aca-4913-b62e-a15875c6635f
   Size(MB): 0
   Free(MB): 0
   Accessible: false
   Default Policy:
   
[root@ESXi-6:~] esxcli storage vvol vasaprovider list
sn1-m20-c12-25-ct0
   VP Name: sn1-m20-c12-25-ct0
   URL: https://10.21.88.113:8084/version.xml
   Status: syncError
   Arrays:
         Array Id: com.purestorage:2dcf29ad-6aca-4913-b62e-a15875c6635f
         Is Active: true
         Priority: 200
         
 
[root@ESXi-6:~] esxcli storage vvol protocolendpoint list
naa.624a93702dcf29ad6aca49130002cba2
   Host Id: naa.624a93702dcf29ad6aca49130002cba2
   Array Id: com.purestorage:2dcf29ad-6aca-4913-b62e-a15875c6635f
   Type: SCSI
   Accessible: true
   Configured: true
   Lun Id: naa.624a93702dcf29ad6aca49130002cba2
   Remote Host:
   Remote Share:
   NFS4x Transport IPs:
   Server Scope:
   Server Major:
   Auth:
   User:
   Storage Containers: b487b73a-6128-3ad2-873b-8ddb1ba45d4f

In the vCenter Server's /var/log/vmware/vmware-sps/sps.log you will also find 'Connection reset' and 'Connection refused' errors:

2019-02-25T18:57:25.785Z [pool-9-thread-4] ERROR opId=sps-Main-119359-972 com.vmware.vim.sms.provider.vasa.event.EventDispatcher - Error occurred while polling events for provider: https://10.21.88.113:8084/version.xml
com.vmware.vim.sms.fault.VasaServiceException: org.apache.axis2.AxisFault: Connection reset
    at com.vmware.vim.sms.client.VasaClientImpl.getEvents(VasaClientImpl.java:338) 
    
2019-02-25T18:57:25.788Z [pool-9-thread-5] ERROR opId= com.vmware.vim.sms.provider.vasa.alarm.AlarmDispatcher - Error occurred while polling alarms for provider: https://10.21.88.113:8084/version.xml
com.vmware.vim.sms.fault.VasaServiceException: org.apache.axis2.AxisFault: Connection refused (Connection refused)
    at com.vmware.vim.sms.client.VasaClientImpl.getAlarms(VasaClientImpl.java:356)
    at sun.reflect.GeneratedMethodAccessor479.invoke(Unknown Source)

Having confirmed the above, the fix is to correct the firewall rules. After the firewall rules were fixed, the Storage Providers came back online and the VVol Datastore was accessible again. The important point is that throughout such a situation the running VMs stay online and continue to service their workload, while any management operation against a VVol VM fails and any powered-off VVol VM fails to power on.

Certificate Issues

Certificate issues can involve the vCenter Server, the ESXi hosts or the FlashArray. Likely causes include expired certificates, outdated certificate information, accidentally registering a different vCenter with the FlashArray's VASA Provider, or ESXi hosts being unable to sync with the vCenter Storage Provider certificate.

Here's an example in which a working vCenter Server is impacted when another, non-linked vCenter Server registers the FlashArray's VASA Provider. This causes the vCenter that was already registered with the FlashArray to have invalid certificates.

Here we have healthy Storage Providers:

vvol failure - 01 - cert - 01.png

After another vCenter has been registered with the Storage Providers they are no longer online and healthy:

vvol failure - 01 - cert - 02.png

From esxcli we can see that the storage container is offline and the VASA provider is out of sync:

[root@ESXi-6:~] esxcli storage vvol vasaprovider list
sn1-m20-c12-25-ct0
   VP Name: sn1-m20-c12-25-ct0
   URL: https://10.21.88.113:8084/version.xml
   Status: syncError
   Arrays:
         Array Id: com.purestorage:2dcf29ad-6aca-4913-b62e-a15875c6635f
         Is Active: true
         Priority: 200
         
[root@ESXi-6:~] esxcli storage vvol storagecontainer list
sn1-m20-c12-25-vvol-container
   StorageContainer Name: sn1-m20-c12-25-vvol-container
   UUID: vvol:b487b73a61283ad2-873b8ddb1ba45d4f
   Array: com.purestorage:2dcf29ad-6aca-4913-b62e-a15875c6635f
   Size(MB): 0
   Free(MB): 0
   Accessible: false
   Default Policy:

The relevant log lines from the ESXi host's /var/log/vvold.log and the vCenter Server's /var/log/vmware/vmware-sps/sps.log are as follows:

### ESXi Host's /var/log/vvold.log ###

2019-02-26T17:47:19.839Z info vvold[7764114] [Originator@6876 sub=Default] VasaSession::KillAllConnections VP (sn1-m20-c12-25-ct0), purged 1 connections, 1 currently active, new genId (2) (broadcast wakeup to all threads waiting for free connection)
2019-02-26T17:47:19.840Z warning vvold[7764114] [Originator@6876 sub=Default] VasaSession::SetState: VP sn1-m20-c12-25-ct0 [Connected -> AuthorizationError], state change locked!
2019-02-26T17:47:19.840Z info vvold[7764114] [Originator@6876 sub=Default] VasaSession::ReleaseConn VP (sn1-m20-c12-25-ct0), killing connection [14VasaConnection:0x000000b70f1da170] (size now 0/16)!
2019-02-26T17:47:19.840Z error vvold[7764114] [Originator@6876 sub=Default]
--> VasaOp::ThrowFromSessionError [#361]: ===> FINAL FAILURE getEvents, error (INVALID_LOGIN / SSL_ERROR_SSL
--> error:14090086:SSL routines:ssl3_get_server_certificate:certificate verify failed / SSL_connect error in tcp_connect()) VP (sn1-m20-c12-25-ct0) Container (sn1-m20-c12-25-ct0) timeElapsed=2 msecs (#outstanding 0)
2019-02-26T17:47:19.840Z error vvold[7764114] [Originator@6876 sub=Default] VasaSession::EventPollerCB VP sn1-m20-c12-25-ct0: getEvents failed (INVALID_LOGIN, SSL_ERROR_SSL
--> error:14090086:SSL routines:ssl3_get_server_certificate:certificate verify failed / SSL_connect error in tcp_connect()) [session state: AuthorizationError]

### vCenter Server's /var/log/vmware/vmware-sps/sps.log ###

2019-02-26T17:47:08.636Z [pool-9-thread-4] ERROR opId=sps-Main-119359-972 com.vmware.vim.sms.util.CustomSslSocketFactory - CompositeTrustManager could not validate certificate:

2019-02-26T17:47:08.637Z [pool-9-thread-4] ERROR opId=sps-Main-119359-972 com.vmware.vim.sms.util.CustomHostNameVerifier - [verify] Hostname verification failed for host: 10.21.88.113
javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated
    at sun.security.ssl.SSLSessionImpl.getPeerCertificates(SSLSessionImpl.java:440)
    
2019-02-26T17:47:08.637Z [pool-9-thread-4] ERROR opId=sps-Main-119359-972 com.vmware.vim.sms.provider.vasa.alarm.AlarmDispatcher - Error occurred while polling alarms for provider: https://10.21.88.113:8084/version.xml
com.vmware.vim.sms.fault.VasaServiceException: org.apache.axis2.AxisFault: Host name could not be verified!
    at com.vmware.vim.sms.client.VasaClientImpl.getAlarms(VasaClientImpl.java:356)
    at sun.reflect.GeneratedMethodAccessor479.invoke(Unknown Source)
    
2019-02-26T17:47:09.070Z [pool-9-thread-5] ERROR opId= com.vmware.vim.sms.util.CustomSslSocketFactory - CompositeTrustManager could not validate certificate:

2019-02-26T17:47:09.071Z [pool-9-thread-5] ERROR opId= com.vmware.vim.sms.util.CustomHostNameVerifier - [verify] Hostname verification failed for host: 10.21.88.114
javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated
    at sun.security.ssl.SSLSessionImpl.getPeerCertificates(SSLSessionImpl.java:440) 

The only way to correct this is to re-register the Storage Providers for the vCenter Server that is supposed to have access to VVols on the FlashArray. 
Once the Storage Providers are re-registered then everything comes back online and is healthy again.
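When troubleshooting certificate mismatches like this, it can help to record which certificate each controller's VASA endpoint is currently presenting, for example to compare it before and after re-registering. The sketch below uses only the Python standard library. The function names are illustrative, the default port 8084 is taken from the provider URLs shown in this KB, and the certificate is fetched without being verified, so this is an inspection aid rather than a trust check.

```python
import hashlib
import ssl


def der_fingerprint(der_bytes: bytes, algorithm: str = "sha256") -> str:
    """Return a colon-separated, upper-case fingerprint of DER-encoded bytes."""
    digest = hashlib.new(algorithm, der_bytes).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def presented_fingerprint(host: str, port: int = 8084, algorithm: str = "sha256") -> str:
    """Fetch the certificate the endpoint presents (unverified) and fingerprint it."""
    pem = ssl.get_server_certificate((host, port))
    return der_fingerprint(ssl.PEM_cert_to_DER_cert(pem), algorithm)
```

Calling `presented_fingerprint("10.21.88.113")` and `presented_fingerprint("10.21.88.114")` would report the certificate for each controller in the example environment. Pass `algorithm="sha1"` if the value you are comparing against is a SHA-1 thumbprint.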

For now, the FlashArray's VASA Provider certificates are stored locally on the FlashArray controllers. This means a controller's Storage Provider must be re-registered when that controller is replaced or upgraded. There have been instances where the storage providers were not re-registered after an NDU (non-disruptive upgrade) or controller replacement. Pure Storage is working to configure the VASA Providers in a way that persists across controller replacements and upgrades. This will be part of a forthcoming Purity release.

The following example will outline what happens when the controllers are not re-registered following a hardware NDU.

Here are the healthy storage providers:

vvol failure - 01 - cert - 03.png

Then both controllers are replaced/upgraded.  We can see that both providers are now offline:

vvol failure - 01 - cert - 04.png

Then looking at the vCenter Server's sps log, we can see the cert failures:

2019-02-26T18:15:24.877Z [pool-9-thread-5] ERROR opId= com.vmware.vim.sms.util.CustomSslSocketFactory - CompositeTrustManager could not validate certificate:
...
2019-02-26T18:15:24.877Z [pool-9-thread-5] ERROR opId= com.vmware.vim.sms.util.CustomHostNameVerifier - [verify] Hostname verification failed for host: 10.21.88.113
javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated
...
2019-02-26T18:15:34.880Z [pool-9-thread-1] ERROR opId=sps-Main-119359-972 com.vmware.vim.sms.provider.vasa.alarm.AlarmDispatcher - Error: org.apache.axis2.AxisFault: Host name could not be verified! occured as provider: https://10.21.88.113:8084/version.xml is offline

Then looking at the ESXi Host's vvold log, we can see the authentication and cert failures:

--> VasaOp::EventPollerCB [#515]: ===> Issuing 'getEvents' to VP sn1-m20-c12-25-ct0 (#outstanding 0/16) [session state: Connected]
2019-02-26T18:15:34.025Z info vvold[7764102] [Originator@6876 sub=Default] VasaSession::KillAllConnections VP (sn1-m20-c12-25-ct0), purged 1 connections, 1 currently active, new genId (30) (broadcast wakeup to all threads waiting for free connection)
2019-02-26T18:15:34.025Z warning vvold[7764102] [Originator@6876 sub=Default] VasaSession::SetState: VP sn1-m20-c12-25-ct0 [Connected -> AuthorizationError], state change locked!
2019-02-26T18:15:34.025Z info vvold[7764102] [Originator@6876 sub=Default] VasaSession::ReleaseConn VP (sn1-m20-c12-25-ct0), killing connection [14VasaConnection:0x000000b70f11df80] (size now 0/16)!
2019-02-26T18:15:34.025Z error vvold[7764102] [Originator@6876 sub=Default]
--> VasaOp::ThrowFromSessionError [#515]: ===> FINAL FAILURE getEvents, error (INVALID_LOGIN / SSL_ERROR_SSL
--> error:14090086:SSL routines:ssl3_get_server_certificate:certificate verify failed / SSL_connect error in tcp_connect()) VP (sn1-m20-c12-25-ct0) Container (sn1-m20-c12-25-ct0) timeElapsed=2 msecs (#outstanding 0)
2019-02-26T18:15:34.025Z error vvold[7764102] [Originator@6876 sub=Default] VasaSession::EventPollerCB VP sn1-m20-c12-25-ct0: getEvents failed (INVALID_LOGIN, SSL_ERROR_SSL
--> error:14090086:SSL routines:ssl3_get_server_certificate:certificate verify failed / SSL_connect error in tcp_connect()) [session state: AuthorizationError]
2019-02-26T18:15:34.025Z info vvold[7764102] [Originator@6876 sub=Default] VasaSession::EventPollerCB VP sn1-m20-c12-25-ct0: connection state changed. Raising alarm to recalculate best VP for all arrays managed by this VP
2019-02-26T18:15:34.025Z info vvold[7764102] [Originator@6876 sub=Default] SI:ProcessVpStateChangeEvent: Processing VP State change for sn1-m20-c12-25-ct0 (newstate AuthorizationError)
2019-02-26T18:15:34.025Z info vvold[7764102] [Originator@6876 sub=Default] SI:GetBestVpRefForArrayId: VP:sn1-m20-c12-25-ct0: priority:200, isActive:\x01, actvReqd:\x01, sessionState:AuthorizationError, haStatus:0
2019-02-26T18:15:34.025Z error vvold[7764102] [Originator@6876 sub=Default] VvolServiceInstance::GetBestVpRefForArrayId: No Best VP available for array (com.purestorage:2dcf29ad-6aca-4913-b62e-a15875c6635f)
...
2019-02-26T18:15:34.237Z warning vvold[7764117] [Originator@6876 sub=Default] vvol_ssl_auth_init: Will skip CRL check as env variable VVOLD_DO_CRL_CHECK is not set!
2019-02-26T18:15:34.237Z warning vvold[7764117] [Originator@6876 sub=Default] VasaSession:DoSetContext: Setting VASAVERSION cookie to "3.0"
2019-02-26T18:15:34.237Z info vvold[7764117] [Originator@6876 sub=Default] VasaSession::KillAllConnections VP (sn1-m20-c12-25-ct0), purged 0 connections, 0 currently active, new genId (31) (broadcast wakeup to all threads waiting for free connection)
2019-02-26T18:15:34.240Z error vvold[7764117] [Originator@6876 sub=Default] VasaSession::DoSetContext: setContext for VP sn1-m20-c12-25-ct0 (url: https://10.21.88.113:8084/vasa) failed [connectionState: AuthorizationError]: INVALID_LOGIN (SSL_ERROR_SSL
--> error:14090086:SSL routines:ssl3_get_server_certificate:certificate verify failed / SSL_connect error in tcp_connect())
2019-02-26T18:15:34.240Z info vvold[7764117] [Originator@6876 sub=Default]
--> VasaOp::EventPollerCB [#516]: ===> Issuing 'getEvents' to VP sn1-m20-c12-25-ct0 (#outstanding 0/16) [session state: TransportError]
2019-02-26T18:15:34.240Z error vvold[7764117] [Originator@6876 sub=Default]
--> VasaOp::ThrowFromSessionError [#516]: ===> FINAL FAILURE getEvents, error (INVALID_SESSION / Bad session state (TransportError)) VP (sn1-m20-c12-25-ct0) Container (sn1-m20-c12-25-ct0) timeElapsed=214 msecs (#outstanding 0)
2019-02-26T18:15:34.240Z error vvold[7764117] [Originator@6876 sub=Default] VasaSession::EventPollerCB VP sn1-m20-c12-25-ct0: getEvents failed (INVALID_SESSION, Bad session state (TransportError)) [session state: TransportError]

From a symptom perspective, this is essentially the same as the previous example. The resolution is the same as well: re-register the storage providers and everything comes back online and healthy.
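In both certificate examples, the key vvold.log entry is the VasaSession::SetState line showing the session moving from Connected to AuthorizationError. A small script can pull those transitions out of a long log. This is a minimal sketch that assumes the line format seen in the excerpts above; the pattern and the function name `session_state_changes` are illustrative, and other ESXi versions may log the transition differently.

```python
import re

# Matches lines such as:
#   <timestamp> warning vvold[7764114] [...] VasaSession::SetState: VP <name> [Connected -> AuthorizationError], ...
STATE_CHANGE = re.compile(
    r"^(?P<ts>\S+) \w+ vvold\[\d+\].*VasaSession::SetState: VP (?P<vp>\S+) "
    r"\[(?P<old>\w+) -> (?P<new>\w+)\]"
)


def session_state_changes(lines):
    """Yield (timestamp, provider, old_state, new_state) for each SetState line."""
    for line in lines:
        match = STATE_CHANGE.search(line)
        if match:
            yield (match.group("ts"), match.group("vp"),
                   match.group("old"), match.group("new"))
```

Feeding it the lines of /var/log/vvold.log gives a compact timeline of when each provider session left the Connected state.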

Changes to the Storage Provider Account

Examples of this are password changes, permission changes, or account deletion.  

When testing these scenarios, Pure found that changing the password or the user's permission level had no effect on the existing storage providers, and they did not go offline. The reason is that the username/password combination is only used when initially registering the storage providers; once that connection is made to the FlashArray's VASA Provider and the ESXi hosts have authenticated, there is no request to validate the password again. From that point on, the connection relies on the certificate and thumbprints being valid between the vCenter Server, ESXi hosts and the FlashArray VASA Provider. Any future attempt to register the storage provider with additional vCenters will fail if an incorrect password is used.

If the AD user or FlashArray local user that was used to register the storage provider is deleted, impact to the Management Path will be observed. The Storage Provider will be offline and the ESXi hosts will be in a sync error state with the VASA Provider. There are two solutions to this issue: either re-create the deleted FlashArray/AD user, or delete the registered storage providers in vCenter and then re-register each controller's storage provider with a new user.

Here is an example of what can be seen if the user that the storage provider was registered with is deleted or otherwise removed.  

In vCenter you will see that the Active Storage Provider is offline and that the standby is unable to take over for the failed provider:

vvol failure - 01 - user - 01.png

This is an error you'll see in the vCenter Server's /var/log/vmware/vmware-sps/sps.log:

2019-02-26T00:18:58.547Z [pool-9-thread-5] ERROR opId= com.vmware.vim.sms.provider.vasa.alarm.AlarmDispatcher - Error occurred while polling alarms for provider: https://10.21.88.113:8084/version.xml
com.vmware.vim.vasa.StorageFault: StorageFault
    at sun.reflect.GeneratedConstructorAccessor229.newInstance(Unknown Source)

2019-02-26T00:19:08.499Z [pool-9-thread-1] ERROR opId=sps-Main-119359-972 com.vmware.vim.sms.provider.vasa.alarm.AlarmDispatcher - Error occurred while polling alarms for provider: https://10.21.88.113:8084/version.xml
com.vmware.vim.vasa.StorageFault: StorageFault
    at sun.reflect.GeneratedConstructorAccessor229.newInstance(Unknown Source)

After adding back the user, just refresh the Storage Providers, rescan the FlashArray Storage Provider, and it'll be back online. 

Management Network Outage

Let's say an ESXi host or the vCenter Server loses access to the management network. The impact, and how it presents, will be very similar to what is observed with firewall changes or issues.

FlashArray VASA Service Connectivity

Connectivity issues can be seen if there is a FlashArray Controller replacement, failure or reboot (such as during a Purity Upgrade), if the service is unreachable on one or both controllers, or the service is stopped for any reason.

Should just one of the services fail, you will see it updated in the vCenter Server UI as offline.
The standby provider will take over as active should the failed provider have been the active provider:
vvol failure - 01 - service - 01.png
Should both of the services be disrupted, then both storage providers will show up as offline in the vCenter Server UI:

vvol failure - 01 - service - 02.png

In addition to the providers showing offline in the UI, in the /var/log/vmware/vmware-sps/sps.log there will be messages of Bad Gateway:

2019-02-26T00:34:00.601Z [pool-9-thread-5] ERROR opId= com.vmware.vim.sms.provider.vasa.alarm.AlarmDispatcher - Error occurred while polling alarms for provider: https://10.21.88.114:8084/version.xml
com.vmware.vim.sms.fault.VasaServiceException: org.apache.axis2.AxisFault: Transport error: 502 Error: Bad Gateway
    at com.vmware.vim.sms.client.VasaClientImpl.getAlarms(VasaClientImpl.java:356)
    at sun.reflect.GeneratedMethodAccessor479.invoke(Unknown Source)
    
2019-02-26T00:34:10.446Z [pool-9-thread-1] ERROR opId=sps-Main-119359-972 com.vmware.vim.sms.provider.vasa.alarm.AlarmDispatcher - Error: org.apache.axis2.AxisFault: Transport error: 502 Error: Bad Gateway occured as provider: https://10.21.88.113:8084/version.xml is offline
2019-02-26T00:34:10.598Z [pool-9-thread-5] ERROR opId= com.vmware.vim.sms.provider.vasa.alarm.AlarmDispatcher - Error: org.apache.axis2.AxisFault: Transport error: 502 Error: Bad Gateway occured as provider: https://10.21.88.114:8084/version.xml is offline
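Pulling together the sps.log messages shown in the scenarios on this page, a quick scan of the log can hint at which Management Path scenario you are looking at. The sketch below simply counts lines containing the substrings that appear in the excerpts in this KB. The grouping of substrings into scenarios is an assumption based only on those excerpts, the names `SIGNATURES` and `probable_causes` are made up for illustration, and a generic message such as StorageFault may well have other causes.

```python
from collections import Counter

# Substrings copied from the sps.log excerpts in this KB, paired with the
# scenario in which each was observed. The mapping is an assumption, not
# an exhaustive or authoritative list.
SIGNATURES = [
    ("Connection reset", "firewall or management network"),
    ("Connection refused", "firewall or management network"),
    ("could not validate certificate", "certificate"),
    ("Hostname verification failed", "certificate"),
    ("Host name could not be verified", "certificate"),
    ("StorageFault", "storage provider account"),
    ("502 Error: Bad Gateway", "VASA service"),
]


def probable_causes(log_lines):
    """Count log lines per scenario, using the first matching signature on each line."""
    counts = Counter()
    for line in log_lines:
        for needle, scenario in SIGNATURES:
            if needle in line:
                counts[scenario] += 1
                break  # count each line once
    return counts
```

Running `probable_causes(open("/var/log/vmware/vmware-sps/sps.log"))` on the vCenter Server and looking at the most common scenario is a reasonable starting point before digging into the individual entries.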

Video Example of Management Path Impact:

Here is a video demo of a failure in the management path and what the impact can be.

 

 

Impact to Data Path

Here we will review data path impact situations.
There are two main types of events that can impact the connection from the ESXi hosts to the FlashArray protocol endpoint and its subsidiary LUNs.

Storage Network Disruption

For example, a switch issue, a networking problem, or ports failing in the SAN.

Protocol Endpoint Disconnection

This results either from disconnecting the PE from the FlashArray Host Group or Host, or from an APD or PDL event on the PE.

Example of Data Path Impact:

The observed impact to the Data Path is similar whether all active data paths go down because of a switch or networking issue, or the protocol endpoint is removed from the host. Network connectivity issues result in an APD (All Paths Down) event, while disconnecting the backing device leads to a PDL (Permanent Device Loss) event. In both cases, connectivity to the Protocol Endpoint and all subsidiary LUNs is lost. The VMs will be in a hung or failed state and will be inaccessible, and the VVol Datastore will report as inaccessible.

Recall that when connectivity on the Management Path was lost, there was no impact to running VVol VMs, because the Data Path was not impacted during those events. The following example shows the opposite case, where the Data Path is impacted and the Management Path is not. We'll see that the VASA Provider stays online and healthy, but connectivity to the Protocol Endpoint is lost, which in turn impacts the VVol VMs and causes a true outage.

In the example that follows, the network interfaces on the FlashArray will be disabled and the impact from that loss will be shown.

Before disabling the iSCSI interfaces on the FlashArray, here is the healthy environment. There is active I/O going to the FlashArray.

vvol failure - 02 - data path - 01.png

The Storage Container, Protocol Endpoint and VASA Provider are all showing up as healthy:

[root@ESXi-6:~] esxcli storage vvol storagecontainer list
sn1-m20-c08-17-vvol-container
   StorageContainer Name: sn1-m20-c08-17-vvol-container
   UUID: vvol:79eb90a400c632c4-a47630fae8da5412
   Array: com.purestorage:801c80a2-dd56-4d38-86a2-c6fabc38cc91
   Size(MB): 8589934592
   Free(MB): 8589442761
   Accessible: true
   Default Policy:
   
[root@ESXi-6:~] esxcli storage vvol protocolendpoint list
naa.624a9370801c80a2dd564d380001104b
   Host Id: naa.624a9370801c80a2dd564d380001104b
   Array Id: com.purestorage:801c80a2-dd56-4d38-86a2-c6fabc38cc91
   Type: SCSI
   Accessible: true
   Configured: true
   Lun Id: naa.624a9370801c80a2dd564d380001104b
   Remote Host:
   Remote Share:
   NFS4x Transport IPs:
   Server Scope:
   Server Major:
   Auth:
   User:
   Storage Containers: 79eb90a4-00c6-32c4-a476-30fae8da5412
   
[root@ESXi-6:~] esxcli storage vvol vasaprovider list
sn1-m20-c08-17-ct0
   VP Name: sn1-m20-c08-17-ct0
   URL: https://10.21.203.31:8084/version.xml
   Status: online
   Arrays:
         Array Id: com.purestorage:801c80a2-dd56-4d38-86a2-c6fabc38cc91
         Is Active: true
         Priority: 200

The host connectivity looks good from vCenter as well.

vvol failure - 02 - data path - 02.png

Here is how the FlashArray iSCSI interfaces are being disabled:

root@sn1-m20-c08-17-ct0:~# purenetwork disable ct0.eth6 ct0.eth7 ct1.eth6 ct1.eth7
Name      Enabled  Subnet  Address       Mask           Gateway      MTU   MAC                Speed       Services  Subinterfaces
ct0.eth6  False    -       10.21.203.33  255.255.255.0  10.21.203.1  1500  90:e2:ba:2c:90:71  10.00 Gb/s  iscsi     -
ct0.eth7  False    -       10.21.203.34  255.255.255.0  10.21.203.1  1500  90:e2:ba:2c:90:70  10.00 Gb/s  iscsi     -
ct1.eth6  False    -       10.21.203.35  255.255.255.0  10.21.203.1  1500  90:e2:ba:7f:bc:3d  10.00 Gb/s  iscsi     -
ct1.eth7  False    -       10.21.203.36  255.255.255.0  10.21.203.1  1500  90:e2:ba:7f:bc:3c  10.00 Gb/s  iscsi     -

Now that the interfaces are disabled, let's see what the impact looks like. 

From the vCenter UI you will see that the VVol Datastore is inaccessible and the hosts are not connected to the VVol Datastore:

vvol failure - 02 - data path - 03.png

From the FlashArray GUI we can see that the workload stopped when the networking was lost (as would be expected).

vvol failure - 02 - data path - 04.png

From the esxcli we can see that the VASA Provider is still in a healthy state, but the storage container and protocol endpoint are inaccessible now.

[root@ESXi-6:~] esxcli storage vvol storagecontainer list
sn1-m20-c08-17-vvol-container
   StorageContainer Name: sn1-m20-c08-17-vvol-container
   UUID: vvol:79eb90a400c632c4-a47630fae8da5412
   Array: com.purestorage:801c80a2-dd56-4d38-86a2-c6fabc38cc91
   Size(MB): 8589934592
   Free(MB): 8589447423
   Accessible: false
   Default Policy:
   
[root@ESXi-6:~] esxcli storage vvol protocolendpoint list
naa.624a9370801c80a2dd564d380001104b
   Host Id: naa.624a9370801c80a2dd564d380001104b
   Array Id: com.purestorage:801c80a2-dd56-4d38-86a2-c6fabc38cc91
   Type: SCSI
   Accessible: false
   Configured: true
   Lun Id: naa.624a9370801c80a2dd564d380001104b
   Remote Host:
   Remote Share:
   NFS4x Transport IPs:
   Server Scope:
   Server Major:
   Auth:
   User:
   Storage Containers: 79eb90a4-00c6-32c4-a476-30fae8da5412
   
[root@ESXi-6:~] esxcli storage vvol vasaprovider list
sn1-m20-c08-17-ct0
   VP Name: sn1-m20-c08-17-ct0
   URL: https://10.21.203.31:8084/version.xml
   Status: online
   Arrays:
         Array Id: com.purestorage:801c80a2-dd56-4d38-86a2-c6fabc38cc91
         Is Active: true
         Priority: 200

Here are the notable entries from the ESXi host's vmkernel and vvold logs:

## vmkernel log ##
2019-02-27T23:15:53.177Z cpu37:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:699: vmhba64:CH:7 T:0 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
2019-02-27T23:15:53.177Z cpu37:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:700: Sess [ISID: 00023d000008 TARGET: iqn.2010-06.com.purestorage:flasharray.107f4b1e374f4d6b TPGT: 1 TSIH: 0]
2019-02-27T23:15:53.177Z cpu37:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:701: Conn [CID: 0 L: 10.21.203.188:32535 R: 10.21.203.36:3260]
2019-02-27T23:15:53.181Z cpu37:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:699: vmhba64:CH:6 T:0 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
2019-02-27T23:15:53.181Z cpu37:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:700: Sess [ISID: 00023d000007 TARGET: iqn.2010-06.com.purestorage:flasharray.107f4b1e374f4d6b TPGT: 1 TSIH: 0]
2019-02-27T23:15:53.181Z cpu37:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:701: Conn [CID: 0 L: 10.21.203.163:20319 R: 10.21.203.36:3260]
2019-02-27T23:16:02.287Z cpu31:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:699: vmhba64:CH:1 T:0 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
2019-02-27T23:16:02.287Z cpu31:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:700: Sess [ISID: 00023d000002 TARGET: iqn.2010-06.com.purestorage:flasharray.107f4b1e374f4d6b TPGT: 1 TSIH: 0]
2019-02-27T23:16:02.287Z cpu31:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:701: Conn [CID: 0 L: 10.21.203.188:21959 R: 10.21.203.33:3260]
2019-02-27T23:16:02.291Z cpu32:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:699: vmhba64:CH:0 T:0 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
2019-02-27T23:16:02.291Z cpu32:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:700: Sess [ISID: 00023d000001 TARGET: iqn.2010-06.com.purestorage:flasharray.107f4b1e374f4d6b TPGT: 1 TSIH: 0]
2019-02-27T23:16:02.291Z cpu32:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:701: Conn [CID: 0 L: 10.21.203.163:62519 R: 10.21.203.33:3260]
2019-02-27T23:16:04.569Z cpu32:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:699: vmhba64:CH:3 T:0 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
2019-02-27T23:16:04.569Z cpu32:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:700: Sess [ISID: 00023d000004 TARGET: iqn.2010-06.com.purestorage:flasharray.107f4b1e374f4d6b TPGT: 1 TSIH: 0]
2019-02-27T23:16:04.569Z cpu32:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:701: Conn [CID: 0 L: 10.21.203.188:57333 R: 10.21.203.34:3260]
2019-02-27T23:16:05.332Z cpu32:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:699: vmhba64:CH:5 T:0 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
2019-02-27T23:16:05.332Z cpu32:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:700: Sess [ISID: 00023d000006 TARGET: iqn.2010-06.com.purestorage:flasharray.107f4b1e374f4d6b TPGT: 1 TSIH: 0]
2019-02-27T23:16:05.332Z cpu32:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:701: Conn [CID: 0 L: 10.21.203.188:62535 R: 10.21.203.35:3260]
2019-02-27T23:16:05.337Z cpu32:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:699: vmhba64:CH:4 T:0 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
2019-02-27T23:16:05.337Z cpu32:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:700: Sess [ISID: 00023d000005 TARGET: iqn.2010-06.com.purestorage:flasharray.107f4b1e374f4d6b TPGT: 1 TSIH: 0]
2019-02-27T23:16:05.337Z cpu32:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:701: Conn [CID: 0 L: 10.21.203.163:25098 R: 10.21.203.35:3260]
2019-02-27T23:16:05.340Z cpu32:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:699: vmhba64:CH:2 T:0 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
2019-02-27T23:16:05.340Z cpu32:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:700: Sess [ISID: 00023d000003 TARGET: iqn.2010-06.com.purestorage:flasharray.107f4b1e374f4d6b TPGT: 1 TSIH: 0]
2019-02-27T23:16:05.340Z cpu32:4994511)WARNING: iscsi_vmk: iscsivmk_StopConnection:701: Conn [CID: 0 L: 10.21.203.163:26364 R: 10.21.203.34:3260]
2019-02-27T23:16:05.345Z cpu11:2097313)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1844: Could not select path for device "naa.624a9370801c80a2dd564d380001104b".
2019-02-27T23:16:05.345Z cpu11:2097313)ScsiDevice: 8707: No Handlers registered! (naa.624a9370801c80a2dd564d380001104b)!
2019-02-27T23:16:05.345Z cpu11:2097313)ScsiDevice: 5978: Device state of naa.624a9370801c80a2dd564d380001104b set to APD_START; token num:1
2019-02-27T23:16:05.345Z cpu11:2097313)StorageApdHandler: 1203: APD start for 0x43055b4d1f20 [naa.624a9370801c80a2dd564d380001104b]
2019-02-27T23:16:05.345Z cpu1:2097602)StorageApdHandler: 419: APD start event for 0x43055b4d1f20 [naa.624a9370801c80a2dd564d380001104b]
2019-02-27T23:16:05.345Z cpu1:2097602)StorageApdHandlerEv: 110: Device or filesystem with identifier [naa.624a9370801c80a2dd564d380001104b] has entered the All Paths Down state.
2019-02-27T23:16:07.188Z cpu28:2097313)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1844: Could not select path for device "naa.624a9370801c80a2dd564d380001104b".
2019-02-27T23:16:12.290Z cpu28:2097313)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1844: Could not select path for device "naa.624a9370801c80a2dd564d380001104b".
2019-02-27T23:16:12.294Z cpu28:2097313)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1844: Could not select path for device "naa.624a9370801c80a2dd564d380001104b".
2019-02-27T23:16:14.572Z cpu28:2097313)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1844: Could not select path for device "naa.624a9370801c80a2dd564d380001104b".
2019-02-27T23:16:15.335Z cpu28:2097313)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1844: Could not select path for device "naa.624a9370801c80a2dd564d380001104b".
2019-02-27T23:16:15.339Z cpu28:2097313)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1844: Could not select path for device "naa.624a9370801c80a2dd564d380001104b".
2019-02-27T23:16:15.343Z cpu28:2097313)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1844: Could not select path for device "naa.624a9370801c80a2dd564d380001104b".
2019-02-27T23:18:25.349Z cpu1:2097602)ScsiDevice: 8707: No Handlers registered! (naa.624a9370801c80a2dd564d380001104b)!
2019-02-27T23:18:25.349Z cpu1:2097602)StorageApdHandler: 609: APD timeout event for 0x43055b4d1f20 [naa.624a9370801c80a2dd564d380001104b]
2019-02-27T23:18:25.349Z cpu1:2097602)StorageApdHandlerEv: 126: Device or filesystem with identifier [naa.624a9370801c80a2dd564d380001104b] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will n$

## vvold log ##
2019-02-27T23:16:07.282Z info vvold[7746429] [Originator@6876 sub=Libs] LunImpl: IsAccessible = false, path state = 2, lun state 2
2019-02-27T23:16:07.282Z info vvold[7746429] [Originator@6876 sub=Default] HostManager::AddOrUpdatePE PE lookup (
--> SCSI PE, ID (host: naa.624a9370801c80a2dd564d380001104b, vasa: naa.624a9370801c80a2dd564d380001104b) (inaccessible, configured)
-->  Lun Id: naa.624a9370801c80a2dd564d380001104b)
2019-02-27T23:16:07.282Z info vvold[7746429] [Originator@6876 sub=Default] ProtocolEndpoint::InitializeFromInbandInfo Initializing SCSI PE (naa.624a9370801c80a2dd564d380001104b) using inband info
2019-02-27T23:16:07.282Z info vvold[7746429] [Originator@6876 sub=Default] HostManager::DiscoverSCSIPEs peVec.size=2, _peMap.size=2, setPEContextScheduled=false
2019-02-27T23:16:07.282Z info vvold[7746429] [Originator@6876 sub=Default] HostManager::RefreshPEMap PE (naa.624a9370801c80a2dd564d380001104b, naa.624a9370801c80a2dd564d380001104b) discovered inband, is inaccessible, will skip from setPEContext!
...
2019-02-27T23:16:07.300Z info vvold[7746429] [Originator@6876 sub=Default] VendorProviderMgr::SetPEContext: SCSI PE (uniqueIdentifier=naa.624a93702dcf29ad6aca491300089aff lunId=naa.624a93702dcf29ad6aca491300089aff ipAddress=__none__ serverMount=__none__ serverScope=__none__ serverMajor=__none__ authType=__none__ transportIpAddress=)
2019-02-27T23:16:07.300Z info vvold[7746429] [Originator@6876 sub=Default] HostManager::GetContainerPEAccessibility arrayId: com.purestorage:801c80a2-dd56-4d38-86a2-c6fabc38cc91, cid: 79eb90a4-00c6-32c4-a476-30fae8da5412, accessible: 0,checkAllPEsAccessibility: false containerType: SCSI, APD: 1
2019-02-27T23:16:07.300Z info vvold[7746429] [Originator@6876 sub=Default] StorageContainer::UpdateAccessibility: Accessibility of cId vvol:79eb90a400c632c4-a47630fae8da5412 (name: sn1-m20-c08-17-vvol-container) changed (Accessible -> Unaccessible), hostd was notified (VP: accessible)
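
When reviewing a longer vmkernel log, it can help to pull out only the APD start and APD timeout events and measure the gap between them. The sketch below is one way to do that in Python. The regular expressions are modelled on the two `StorageApdHandlerEv` messages shown above, and the function names `parse_timestamp` and `apd_windows` are hypothetical. For the sample lines, the gap matches the 140 seconds stated in the timeout message, which is also the default value of the ESXi `Misc.APDTimeout` setting.

```python
import re
from datetime import datetime

# Patterns modelled on the StorageApdHandlerEv messages in the vmkernel log
APD_START = re.compile(
    r"identifier \[([^\]]+)\] has entered the All Paths Down state\.")
APD_TIMEOUT = re.compile(
    r"identifier \[([^\]]+)\] has entered the All Paths Down Timeout state")


def parse_timestamp(line):
    """Parse the leading UTC timestamp of a vmkernel log line."""
    return datetime.strptime(line.split(" ", 1)[0], "%Y-%m-%dT%H:%M:%S.%fZ")


def apd_windows(lines):
    """Map device id -> {'start': datetime, 'timeout': datetime} for APD events."""
    events = {}
    for line in lines:
        for label, pattern in (("start", APD_START), ("timeout", APD_TIMEOUT)):
            match = pattern.search(line)
            if match:
                events.setdefault(match.group(1), {})[label] = parse_timestamp(line)
    return events


# Two lines copied from the vmkernel log above (the second is truncated in the source)
LOG = [
    "2019-02-27T23:16:05.345Z cpu1:2097602)StorageApdHandlerEv: 110: Device or "
    "filesystem with identifier [naa.624a9370801c80a2dd564d380001104b] has "
    "entered the All Paths Down state.",
    "2019-02-27T23:18:25.349Z cpu1:2097602)StorageApdHandlerEv: 126: Device or "
    "filesystem with identifier [naa.624a9370801c80a2dd564d380001104b] has "
    "entered the All Paths Down Timeout state after being in the All Paths Down "
    "state for 140 seconds. I/Os will n$",
]

for device, ev in apd_windows(LOG).items():
    if "start" in ev and "timeout" in ev:
        elapsed = (ev["timeout"] - ev["start"]).total_seconds()
        print(f"{device}: APD timeout {elapsed:.0f}s after APD start")
```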

In short, the host experiences an All Paths Down (APD) event for every VVol-related object on the FlashArray.
After the iSCSI interfaces on the FlashArray were re-enabled, the VVol Datastore and host connectivity recovered automatically.


Closing Thoughts

The goal of this KB was to walk through several VVol failure scenarios and show the results of an impact to either the VVol Data Path or the Management Path. A key takeaway is that an impact to the Management Path does not indicate a failure of the Data Path; i.e., the VMs do not go down and there is no outage in the environment. That said, while the VMs stay up and running when the Management Path is down, any VVol-related operations will fail. For example, powered-off VVol VMs cannot be powered on, and VVol VMs cannot be edited, vMotioned, or otherwise changed. A VVol VM that is powered off while the Management Path is down also cannot be powered back on. The lesson is not to panic when the Storage Provider is offline or the VVol Datastore is reported as inaccessible: first determine whether the Management Path or the Data Path is affected.
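
The distinction drawn in this KB can be written down as a small triage helper. This is only an illustrative sketch: the function name `triage` and its two boolean inputs (whether the Storage Provider is online, and whether the protocol endpoint is accessible from the host) are assumptions made for the example, and real environments will have more states than these two flags capture.

```python
def triage(storage_provider_online, protocol_endpoint_accessible):
    """Return short notes on the likely impact, following the scenarios in this KB."""
    notes = []
    if not protocol_endpoint_accessible:
        # Data Path scenario: the host loses its paths to the PE (APD)
        notes.append("Data Path impacted: VVol I/O stops until storage paths "
                     "to the protocol endpoint are restored.")
    if not storage_provider_online:
        # Management Path scenario: running VMs are unaffected
        notes.append("Management Path impacted: running VMs keep running, but "
                     "power-on, edit and vMotion of VVol VMs will fail.")
    if not notes:
        notes.append("No impact indicated by these two checks.")
    return notes


# Scenario from this section: provider online, protocol endpoint inaccessible
for note in triage(storage_provider_online=True, protocol_endpoint_accessible=False):
    print(note)
```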