
vVols Deep Dive: What Happens when a Single VASA Service is Unreachable?


The goal of this KB is to cover what happens when a Storage Provider (FlashArray VASA service) is unreachable from vCenter or any ESXi host, and to explore what behavior to expect from both vCenter and the ESXi hosts.

What is the Issue?

There are times when one of the FlashArray VASA services may be unreachable: for example, during a controller upgrade or replacement, after a firewall or network change, or when the VASA service is stopped on a controller.  What happens if one of these events affects the Storage Provider that is currently active?  This KB covers how vCenter and ESXi handle a Storage Provider going into an offline state.


How do vCenter and ESXi handle the Issue?

By default, vCenter and ESXi issue GetEvents and GetAlarms requests to the VASA Provider every 10 seconds, and these requests are sent to both the active and standby Storage Providers.  This means that if a storage provider becomes unreachable for any reason, a GetEvents or GetAlarms request against it will fail within at most 10 seconds.  In the example below the VASA service was stopped on the active storage provider at 23:48:20; the failed requests then show up in the vvold.log and sps.log.
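Before the log excerpts, one quick way to reproduce this check by hand is to request the provider's version.xml endpoint directly from a machine that can reach the array's management network (curl is assumed to be available there; the IP is the one from the log excerpts below).  A reachable provider returns the XML document; an unreachable one times out, refuses the connection, or returns an HTTP error such as the 502 seen in these logs.

# Probe the VASA provider endpoint directly (controller management IP from the examples below).
# A reachable provider returns its version.xml document; an unreachable one times out,
# refuses the connection, or returns an HTTP error such as 502 Bad Gateway.
curl -k --connect-timeout 10 https://10.21.88.114:8084/version.xml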

### ESXi vvold.log  ###

--> VasaOp::QueryVirtualVolumeInt [#89844]: ===> Issuing 'queryVirtualVolume' to VP sn1-m20-c12-25-ct1 (#outstanding 0/5) [session state: Connected]
2019-05-16T23:48:21.212Z error vvold[2099041] [Originator@6876 sub=Default]
--> VasaOp::ThrowFromSessionError [#89844]: ===> FINAL SUCCESS queryVirtualVolume VP (sn1-m20-c12-25-ct1) Container (b487b73a-6128-3ad2-873b-8ddb1ba45d4f) timeElapsed=154 msecs (#outstanding 0)
2019-05-16T23:48:21.212Z info vvold[2099041] [Originator@6876 sub=Default] SI::QueryVirtualVolumeInt Empty response

...

--> VasaOp::EventPollerCB [#89845]: ===> Issuing 'getEvents' to VP sn1-m20-c12-25-ct1 (#outstanding 0/5) [session state: Connected]
2019-05-16T23:48:39.001Z warning vvold[2099044] [Originator@6876 sub=Default] VasaSession::ConvertSoapError: [SESSION HEALTH] Connection-local error with VP sn1-m20-c12-25-ct1 [Connected -> TransportError], session appears healthy (last successful command 17788 msecs ago)
2019-05-16T23:48:39.001Z info vvold[2099044] [Originator@6876 sub=Default] VasaSession::ReleaseConn VP (sn1-m20-c12-25-ct1), killing connection [14VasaConnection:0x000000aa79df3100] (size now 4/5)!
2019-05-16T23:48:39.001Z error vvold[2099044] [Originator@6876 sub=Default]
--> VasaOp::ThrowFromSessionError [#89845]: ===> FINAL FAILURE getEvents, error (TRANSPORT_FAULT / HTTP/1.1 502 Bad Gateway / <html>
--> <head><title>502 Bad Gateway</title></head>
--> <body bgcolor="white">
--> <center><h1>502 Bad Gateway</h1></center>
--> <hr><center>nginx/1.10.3 (Ubuntu)</center>
--> </body>
--> </html>

### vCenter sps.log ###

2019-05-16T23:48:24.720Z [pool-10-thread-3] ERROR opId=sps-Main-86487-528 com.vmware.vim.sms.provider.vasa.alarm.AlarmDispatcher - Error occurred while polling alarms for provider: https://10.21.88.114:8084/version.xml
com.vmware.vim.sms.fault.VasaServiceException: org.apache.axis2.AxisFault: Transport error: 502 Error: Bad Gateway
    at com.vmware.vim.sms.client.VasaClientImpl.getAlarms(VasaClientImpl.java:356)
    at sun.reflect.GeneratedMethodAccessor577.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.vmware.vim.sms.client.VasaClientMethodInvoker.invokeMethod(VasaClientMethodInvoker.java:48)
    at com.vmware.vim.sms.client.VasaClientMethodInvoker.invoke(VasaClientMethodInvoker.java:35)
    at com.vmware.vim.sms.client.VasaClientHandler.invoke(VasaClientHandler.java:27)
    at com.sun.proxy.$Proxy115.getAlarms(Unknown Source)
    at com.vmware.vim.sms.provider.vasa.alarm.AlarmDispatcher.pollAlarms(AlarmDispatcher.java:88)
    at com.vmware.vim.sms.provider.vasa.alarm.AlarmManagerImpl$AlarmPollerTask.run(AlarmManagerImpl.java:116)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.axis2.AxisFault: Transport error: 502 Error: Bad Gateway
    at org.apache.axis2.transport.http.HTTPSender.handleResponse(HTTPSender.java:310)
    at org.apache.axis2.transport.http.HTTPSender.sendViaPost(HTTPSender.java:200)
    at org.apache.axis2.transport.http.HTTPSender.send(HTTPSender.java:76)
    at org.apache.axis2.transport.http.CommonsHTTPTransportSender.writeMessageWithCommons(CommonsHTTPTransportSender.java:400)
    at org.apache.axis2.transport.http.CommonsHTTPTransportSender.invoke(CommonsHTTPTransportSender.java:225)
    at org.apache.axis2.engine.AxisEngine.send(AxisEngine.java:435)
    at org.apache.axis2.description.OutInAxisOperationClient.send(OutInAxisOperation.java:402)
    at org.apache.axis2.description.OutInAxisOperationClient.executeImpl(OutInAxisOperation.java:229)
    at org.apache.axis2.client.OperationClient.execute(OperationClient.java:165)
    at com.vmware.vim.vasa.VasaServiceStub.getAlarms(VasaServiceStub.java:16976)
    at com.vmware.vim.sms.client.stub.VasaServiceStubImpl.getAlarms(VasaServiceStubImpl.java:39)
    at com.vmware.vim.sms.client.VasaClientImpl.getAlarms(VasaClientImpl.java:354)
    ... 16 more
    
2019-05-16T23:48:34.719Z [pool-10-thread-1] ERROR opId=sps-Main-86487-528 com.vmware.vim.sms.provider.vasa.alarm.AlarmDispatcher - Error: org.apache.axis2.AxisFault: Transport error: 502 Error: Bad Gateway occured as provider: https://10.21.88.114:8084/version.xml is offline

What happens after the provider is marked as unreachable?  Good question.  Both the ESXi host and vCenter first retry the failed requests and then fail over to the standby provider.  Overall this process takes about 20 to 25 seconds from the time the VASA service became unreachable.  Below are the log lines from when successful connections to the newly active VASA provider are established and updated.
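The same transitions can also be watched live on the ESXi host while the failover happens.  A minimal sketch, assuming the default vvold log location on ESXi, is to filter the log for the session-state and request-result messages shown in these excerpts:

# Follow VASA session state changes and request results on the ESXi host.
# /var/log/vvold.log is the default vvold log location; the grep patterns simply
# match the log messages quoted in this article.
tail -f /var/log/vvold.log | grep -E 'VasaSession::SetState|FINAL FAILURE|FINAL SUCCESS'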

### vCenter sps.log ###

2019-05-16T23:48:44.821Z [Thread-3019] INFO  opId=sps-Main-86487-528 com.vmware.pbm.capability.impl.KvSchemaService - Successfully registered capability metadata for providerInfo (pbm.capability.provider.CapabilityProviderInfo) {
   dynamicType = null,
   dynamicProperty = null,
   providerUuid = 47f33486-f9b8-414d-96ab-c7a2082f181f,
   schemaId = com.purestorage.storage.policy.2.0
}
2019-05-16T23:48:44.821Z [Thread-3019] INFO  opId=sps-Main-86487-528 com.vmware.sps.vp.VasaDataCollector - Persisting CBP and storage associations for provider 47f33486-f9b8-414d-96ab-c7a2082f181f with url https://10.21.88.113:8084/version.xml

### ESXi vvold.log ###

2019-05-16T23:48:44.002Z info vvold[2099043] [Originator@6876 sub=Default] HostManager::GetContainerPEAccessibility arrayId: com.purestorage:2dcf29ad-6aca-4913-b62e-a15875c6635f, cid: b487b73a-6128-3ad2-873b-8ddb1ba45d4f, accessible: 1,checkAllPEsAccessibility: false containerType: SCSI, APD: 0
2019-05-16T23:48:44.918Z info vvold[2737872] [Originator@6876 sub=Default] Came to SI::UpdateVvolContainer: Container b487b73a-6128-3ad2-873b-8ddb1ba45d4f with 1 Arrays and 1 VPs
2019-05-16T23:48:44.918Z info vvold[2737872] [Originator@6876 sub=Default] VendorProviderMgr::AddOrRef VP (sn1-m20-c12-25-ct0) Uid (com.purestorage:dd7ee7cb-9e89-315d-a997-65377e1a841c) URL (https://10.21.88.113:8084/version.xml)
2019-05-16T23:48:44.918Z info vvold[2737872] [Originator@6876 sub=Default] DlgSoapCtxPool: Created new DlgSoapCtxPool ([14DlgSoapCtxPool:0x000000aa79bed780]) for VP (sn1-m20-c12-25-ct0) session (000000aa79bf12f0)
2019-05-16T23:48:44.918Z info vvold[2737872] [Originator@6876 sub=Default] VendorProviderMgr:UpdateFirewallRules: Firewall operation add on rule id:VVOLD_8084 success
2019-05-16T23:48:44.919Z info vvold[2737872] [Originator@6876 sub=Default] VasaSession::GetEndPoint: with url https://10.21.88.113:8084/version.xml
2019-05-16T23:48:44.931Z info vvold[2737872] [Originator@6876 sub=Default] VasaSession::Parse xml: id=4, loc=/vasa, myVer=4, ver=0
2019-05-16T23:48:44.931Z info vvold[2737872] [Originator@6876 sub=Default] VasaSession::ParseVersionXml: Effective VP endpoint is https://10.21.88.113:8084/vasa (version 4)
2019-05-16T23:48:44.931Z info vvold[2737872] [Originator@6876 sub=Default] VasaSession::DoSetContext: Setting GUID: 'D2CCA827-514D-E611-0000-78654321007B'  VC UID: '8c5e8263-1cba-4337-a350-87009728fa20' (lastEventId was -1)
2019-05-16T23:48:44.931Z info vvold[2737872] [Originator@6876 sub=Default] GetVasaHostInitiators: Adding FC HBA  wwpn: 20:00:00:25:b5:22:00:6c wwnn: 20:00:00:25:b5:44:00:27
2019-05-16T23:48:44.931Z info vvold[2737872] [Originator@6876 sub=Default] GetVasaHostInitiators: Adding FC HBA  wwpn: 20:00:00:25:b5:22:00:7c wwnn: 20:00:00:25:b5:44:00:27
2019-05-16T23:48:44.931Z info vvold[2737872] [Originator@6876 sub=Default] GetVasaHostInitiators: Adding FC HBA  wwpn: 20:00:00:25:b5:66:00:6c wwnn: 20:00:00:25:b5:44:00:27
2019-05-16T23:48:44.931Z info vvold[2737872] [Originator@6876 sub=Default] GetVasaHostInitiators: Adding FC HBA  wwpn: 20:00:00:25:b5:66:00:7c wwnn: 20:00:00:25:b5:44:00:27
2019-05-16T23:48:44.931Z info vvold[2737872] [Originator@6876 sub=Default] VasaSession::InitSoap: Master soap client created successfully
2019-05-16T23:48:44.931Z info vvold[2737872] [Originator@6876 sub=Default] VasaSession::InitSoap: _masterSoap=000000aa79cd3ee0, iomode=33558544
2019-05-16T23:48:44.931Z info vvold[2737872] [Originator@6876 sub=Default] VasaSession::InitSoap VVold using 15 secs for soap connect timeout
2019-05-16T23:48:44.931Z info vvold[2737872] [Originator@6876 sub=Default] VasaSession::InitSoap VVold using 75 secs for soap receive timeout
2019-05-16T23:48:46.477Z warning vvold[2737872] [Originator@6876 sub=Default] vvol_ssl_auth_init: Will skip CRL check as env variable VVOLD_DO_CRL_CHECK is not set!
2019-05-16T23:48:46.477Z warning vvold[2737872] [Originator@6876 sub=Default] VasaSession:DoSetContext: Setting VASAVERSION cookie to "3.0"
2019-05-16T23:48:46.477Z info vvold[2737872] [Originator@6876 sub=Default] VasaSession::KillAllConnections VP (sn1-m20-c12-25-ct0), purged 0 connections, 0 currently active, new genId (1) (broadcast wakeup to all threads waiting for free connection)
2019-05-16T23:48:46.507Z warning vvold[2737872] [Originator@6876 sub=Default] VasaSession::SetState: VP sn1-m20-c12-25-ct0 [OutOfSync -> Connected], state change locked!
2019-05-16T23:48:46.507Z info vvold[2737872] [Originator@6876 sub=Default] VasaSession::DoSetContext: master soap's original keep_alive = 1
2019-05-16T23:48:46.507Z info vvold[2737872] [Originator@6876 sub=Default] VasaSession::DoSetContext: VP sn1-m20-c12-25-ct0 url https://10.21.88.113:8084/vasa: success eventTimer 5000000  maxBatchsize 1 maxConcurrentRequestsPerSession 5 _sessionId [4518202577269936316] _genId 1

One point discovered during testing was that neither the HTML5 nor the Flex/Flash vCenter Web Client automatically refreshes to reflect this change in the providers' state.  Even 30 seconds after the service became unreachable on the active Storage Provider, the Web Client still showed the old provider state.

vCenter-storage-provider-1.png

Only after the refresh button was clicked did the provider status correctly update.

vCenter-storage-provider-2.png

The same behavior was seen with the esxcli query esxcli storage vvol vasaprovider list: only after the host's requests failed and were retried against the other provider did the status update.

[root@ESXi-3:~] esxcli storage vvol vasaprovider list
sn1-m20-c12-25-ct1
   VP Name: sn1-m20-c12-25-ct1
   URL: https://10.21.88.114:8084/version.xml
   Status: online
   Arrays:
         Array Id: com.purestorage:2dcf29ad-6aca-4913-b62e-a15875c6635f
         Is Active: true
         Priority: 200
         
...

### In addition to waiting, an ls of the vVol datastore path will also prompt the standby provider to become active.  ###

[root@ESXi-3:~] ls /vmfs/volumes/vvol\:b487b73a61283ad2-873b8ddb1ba45d4f/
Test-VVol-VM-1                                rfc4122.360f5c13-58a9-48cc-93dd-5fb89a1d3244  rfc4122.a34fb1e2-4500-4641-ae58-45c562b6299c  vvol-1-w2k16-mass                             vvol-m20-1-VM-vmfs-cbt-01
W2K16-CL-4-VM-1                               rfc4122.4eb90929-eb11-4020-a47b-7891bf651558  rfc4122.aa59c3d8-103f-4276-ad87-63b67291ae47  vvol-1-w2k16-nor                              vvol-m20-1-VM-vmfs-cbt-02
W2K16-CL-4-VM-2                               rfc4122.52d953a6-d735-4cbb-8e00-eca13c65bab8  rfc4122.b7053b4e-a19c-4a6d-b016-ac3a8726f5c3  vvol-2-rhel-7-cbt                             vvol-m20-1-VM-vmfs-res-02
dev-vvol-w2k16-template                       rfc4122.60413684-44cf-406f-8200-a059184000cc  rfc4122.b8dbb226-fc6f-49bc-97a9-87bb919c8b55  vvol-2-rhel-7-mass                            vvol-m20-2-VM-W-cbt-04
m20-c12-vvol-vm-1                             rfc4122.651d5163-f5ef-4254-96a5-aa60370b7ea0  rfc4122.c7369198-d568-4717-b7e3-76ce9e930c88  vvol-2-rhel-7-nor                             vvol-m20-2-VM-W-nor-04
m20-c12-vvol-vm-2                             rfc4122.67d92a53-c09e-454a-84d7-25e1256504c8  rfc4122.ca313ba4-0bed-416e-b2c7-a3e546b80b91  vvol-2-rhel-7-res                             vvol-m20-2-VM-W-res-04
rfc4122.02d7f4cb-e9d8-4348-8e15-22f7b83b3c4b  rfc4122.71b95128-ebbc-4677-85d5-f6d33ff19e11  rfc4122.cd8b2c73-9bde-4a54-8b1f-5157d3f39ee7  vvol-2-w2k16-cbt                              vvol-m20-2-VM-vmfs-cbt-04
rfc4122.05493fe6-bb37-4fd5-9a45-3ea5cd9d8a39  rfc4122.71d76eb9-3d0e-4eda-8300-8c459bbf8dcb  rfc4122.cf9417b1-2917-4145-b447-6fc20c49caa5  vvol-2-w2k16-mass                             vvol-m20-2-VM-vmfs-res-04
rfc4122.0836c8a3-eead-4810-85f9-94dd9c9d7a2d  rfc4122.75015ddc-9516-4060-a5bb-1badac56eeb3  rfc4122.df6d741c-c814-4b38-97bf-e2adbd52afac  vvol-2-w2k16-nor                              vvol-rhel-template
rfc4122.1523fabe-3cce-4660-8c75-49f0cdd24383  rfc4122.78df0bb7-f7c4-457e-96ff-3024755018c3  rfc4122.f0bf0264-7705-469e-bba0-e548d452fc90  vvol-2-w2k16-res                              vvol-w2k16-template
rfc4122.29a6938d-a806-44e1-b5f3-47cd1b17ce40  rfc4122.8a04b988-62d8-4e8a-910d-923fd3884db6  rfc4122.f416774a-6c59-43ad-8c14-c4851e3e3f5e  vvol-m20-1-VM-W-cbt-02                        vvol-w2k16-vm-1
rfc4122.2b9690c8-1298-4ad6-925b-070d473ad87c  rfc4122.987d974e-5dcd-4156-b230-49d315c54a9e  rfc4122.f99a686c-310d-4c57-ae3c-52b724c021f5  vvol-m20-1-VM-W-nor-02
rfc4122.35faa1db-bda8-4d1e-b23f-b10f1b7467e2  rfc4122.9d265d55-1ece-466c-820a-349fa1237dd6  vvol-1-rhel-7-nor                             vvol-m20-1-VM-W-res-02

...

[root@ESXi-3:~] esxcli storage vvol vasaprovider list
sn1-m20-c12-25-ct0
   VP Name: sn1-m20-c12-25-ct0
   URL: https://10.21.88.113:8084/version.xml
   Status: online
   Arrays:
         Array Id: com.purestorage:2dcf29ad-6aca-4913-b62e-a15875c6635f
         Is Active: true
         Priority: 200
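
To watch this transition without re-running the command by hand, a small loop from the ESXi shell works as well.  This is only an illustrative sketch; the 5-second interval and the grep filter are arbitrary choices, not a documented procedure.

# Re-list the VASA providers every 5 seconds, showing only the fields of interest,
# so the change in Status and Is Active is easy to spot during a failover.
while true; do
    date
    esxcli storage vvol vasaprovider list | grep -E 'VP Name|Status|Is Active'
    sleep 5
done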

In either case, after about 20 to 25 seconds the provider status was updated to reflect the current state of the providers.  The example shown here used ESXi and vCenter 6.7 U2 with Purity 5.2.3.


Summary of Findings

Continued testing showed that the Storage Provider failover timing differs depending on how the active provider becomes unreachable.  The findings are summarized below.

Test that was Performed | Storage Provider Failover Timing
The VASA service is stopped on the FlashArray controller for the active storage provider. | 20 to 30 seconds
The FlashArray controller for the active storage provider is rebooted. | 2 to 3 minutes
The management interface on the FlashArray controller for the active storage provider is disabled. | 7 to 8 minutes

Essentially, disabling the management interface produced a "transient" transport error, while rebooting the controller produced an "unreachable" transport error.  There appear to be different timeout values in vCenter and ESXi for transient transport errors versus unreachable transport errors, and both differ from the case where the service is down or the connection is refused with a "Bad Gateway" error, in which the session is additionally marked as invalid.
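When trying to tell these cases apart in the field, it helps to check whether the controller's management address responds at all, versus whether only the VASA service behind it is down.  A rough sketch from the ESXi shell, using the management IP from this example and assuming the array's management network is reachable from a vmkernel interface:

# Check basic reachability of the controller's management IP. No response here lines up
# with the transient/unreachable transport errors; a responding IP combined with VASA
# calls failing with 502 Bad Gateway points at the VASA service itself being down.
vmkping -c 3 10.21.88.113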

Pure Storage is currently working with VMware to clarify the timeout values for these events and the expected storage provider failover timing.