Before diving into the specifics around Purity //FA 5.3 and the upgrade process, it is a good idea to review the KB written about understanding vVol Failure Scenarios. This will give a strong foundation for understanding what can be impacted overall in the vVols ecosystem.
What is Different about Purity //FA 5.3 for vVols?
Let's first take a closer look at what changed in Purity 5.3 for vVols. There are two major changes: how the VASA certificates are managed, and updates to the FlashArray's backend database services that improve performance at scale.
The changes to how the VASA certificates are managed are covered in another KB, which also details how to manage the VASA certificate with purecert. In short, prior to Purity 5.3, all management of the VASA certificates was handled at the root level. This meant that the end user couldn't do anything about the VASA certificates, and any troubleshooting or support required the end user to open a support case with Pure Storage. On Purity 5.3, any array admin on the FlashArray can manage the VASA certificates for each controller. The KB mentioned above goes into this in much more detail.
For the backend database service changes, one important thing to understand is how VASA communicates with the FlashArray. VASA (vSphere APIs for Storage Awareness) is the interface that allows the vSphere environment to communicate directly with the array. Through it, vSphere can ask the FlashArray to create, delete, snapshot, connect or disconnect volumes, along with the other operations that make vVols valuable. On the FlashArray, the VASA provider is a service running natively in Purity. So how does VASA forward its requests to Purity and the FlashArray itself? This is done through a backend database service that allows VASA to convert the SOAP calls from vSphere into API calls to the FlashArray. Prior to vVols being released, the average FlashArray would only service 10 to 15 of these API calls a day, but after vVols was released we've seen these requests increase, in some cases, to 50,000 a day. This helped drive Pure to identify areas to improve in the backend and in Purity itself, and one of the major improvements to this backend service was delivered in Purity //FA 5.3.
How does upgrading to Purity //FA 5.3 impact vVols?
Now that we know what is different about Purity 5.3, let's take a look at how the Purity upgrade can impact the vSphere Environment, and more specifically, vVols.
- When Purity is upgraded to 5.3, the VASA certificates currently being used are securely exported and then imported into Purity. Post upgrade, the VASA certificates will now show up in the purecert list command. There shouldn't be any issues with the certificates being exported/imported nor, at this time, has Pure Support observed any issues directly related to the certificates being managed by purecert post upgrade. (See this KB for more information on how to manage VASA certificates with purecert)
- When Purity 5.3 loads, the backend database service will be running a newer version than was previously run on Purity 5.0, 5.1 or 5.2. One of the problems this poses for vVols is that if the secondary controller is running Purity 5.3 and the primary is lower than 5.3, then the secondary controller won't be able to forward API requests to the primary controller. This means that if the Storage Provider for the controller on 5.3 is the active Storage Provider, any management request will fail. However, Pure has taken steps to reduce the chance that the controller running Purity 5.3 is the active Storage Provider in vCenter. Prior to the FlashArray controller reboot to 5.3, the VASA service will be stopped for 60 to 90 seconds. In the event that this controller is the active Storage Provider, vCenter should fail over to the standby Storage Provider. This should help minimize the chance of the controller coming back online as the active Storage Provider and then failing those management requests.
- Another behavior that Pure has observed is that during the upgrade and the final hop of that upgrade, the backend database service sometimes has a longer than expected startup time. Pure has found that during these long startup times, VASA was running but was not correctly communicating to vSphere that it was busy and that ESXi and vCenter would need to retry later. Instead, the ESXi host would get a failure from the array's VASA service and would not retry the request on those sessions. While there is a remediation for this situation, it did expose an issue that Pure Storage will correct. The issue of VASA not correctly communicating the busy/waiting state will be fixed in a future Purity 5.3 release.
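The mixed-version condition described in the second bullet can be written down as a small predicate. This is only an illustrative sketch: the function name, the tuple representation of Purity versions, and the `active_provider` argument are assumptions made for the example and are not part of Purity or vCenter.

```python
def management_requests_at_risk(primary_version, secondary_version, active_provider):
    """Return True when the active Storage Provider sits on a secondary
    controller that already runs 5.3 while the primary is still below 5.3.

    Versions are (major, minor) tuples such as (5, 2). active_provider is
    either "primary" or "secondary". All names here are hypothetical.
    """
    new_backend = (5, 3)  # first release with the newer backend database service
    secondary_upgraded = secondary_version >= new_backend
    primary_older = primary_version < new_backend
    return secondary_upgraded and primary_older and active_provider == "secondary"
```

For example, `management_requests_at_risk((5, 2), (5, 3), "secondary")` is true, which is the window the 60 to 90 second VASA stop is meant to shrink.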
What are some examples of management path related operations that might be impacted during the Purity upgrade itself?
This is a great question! Knowing the underlying changes in Purity //FA 5.3 is useful, but it's just as important to know how those changes might manifest in the vSphere environment. Here are some possible results that could be observed if there is a disruption on the management path of the FlashArray being upgraded.
- A vSphere vMotion for a vVols based VM may fail if the management path requests time out or fail
- A template may fail to deploy or a VM may fail to clone to a vVol Datastore
- A Managed Snapshot request may fail or time out for a vVols based VM
- A backup vendor's (such as Veeam, Rubrik, etc.) scheduled backup may fail if the managed snapshot is taken at the time of the Primary Controller rebooting
- A vVols based VM may fail to power on, as the binds would fail.
Essentially, these issues, while uncommon, could occur in the timeframe that the Management Path (the Storage Providers) is in an offline/down state or that the backend database service in Purity is not ready but VASA is up. Primarily these would be observed when the Primary controller is rebooted and the Secondary controller takes over as the Primary when running Purity 5.3 for the first time.
What to watch for during the upgrade?
We know what's going to happen during the Purity upgrade, so what should be watched while it runs?
During the upgrade, watch the Storage Providers' state and status to verify that when a controller is rebooted, the other Storage Provider is active. Beyond that, you'll want to make sure that the vVol datastore is in a healthy state and not inaccessible for any of the ESXi hosts. There are a couple of ways to monitor the Storage Provider states and transitions: in the vCenter UI or by tailing the VMware sps.log on vCenter.
From the vCenter UI, navigate to the Configure => Storage Providers page. Locate the Storage Providers for the FlashArray being upgraded. Confirm that both are healthy and online and that one is in an active state (the other will be in Standby). In the example below, CT0 is the active Storage Provider:
In the example screenshot below, CT0 was the secondary controller being rebooted for the upgrade. We can see that the VASA service is stopped and the Storage Provider for CT0 is offline. Note that the active Storage Provider is now CT1 and it is online, meaning that we are in a good state here and management path requests will still succeed:
From an ssh session to the vCenter Server, the sps.log can be tailed to identify when the storage provider is offline and when the standby takes over as active.
The log that needs to be tailed is the sps.log located in this path: /var/log/vmware/vmware-sps/sps.log. Here is an example of lines that show the Storage Provider unreachable and then failing over to the other storage provider:
When the VASA service is stopped on the controller prior to the reboot, you will see a bad gateway error, as the service is unreachable:
2020-01-21T16:38:27.275Z [pool-10-thread-2] ERROR opId=sps-Main-611046-139 com.vmware.vim.sms.provider.vasa.alarm.AlarmDispatcher - Error occurred while polling alarms for provider: https://10.21.149.40:8084/version.xml
com.vmware.vim.sms.fault.VasaServiceException: org.apache.axis2.AxisFault: Transport error: 502 Error: Bad Gateway
2020-01-21T16:38:37.275Z [pool-10-thread-3] ERROR opId=sps-Main-611046-139 com.vmware.vim.sms.provider.vasa.alarm.AlarmDispatcher - Error: org.apache.axis2.AxisFault: Transport error: 502 Error: Bad Gateway occured as provider: https://10.21.149.40:8084/version.xml is offline
Now that we see that the Provider is unreachable, we should see the other provider get polled and then updated as active:
2020-01-21T16:38:47.371Z [Thread-276] INFO opId=sps-Main-611046-139 com.vmware.sps.vp.NonLegacyVasaDataCollector - Received 2 capabilityObjectSchemas from provider 4b57c143-e2bc-4764-9cce-4f7ac216dcb7. Time taken 27ms
2020-01-21T16:38:47.381Z [Thread-275] INFO opId=sps-Main-611046-139 com.vmware.sps.vp.NonLegacyVasaDataCollector - Registering capability metadata with PBM for provider 4b57c143-e2bc-4764-9cce-4f7ac216dcb7
...
2020-01-21T16:38:47.388Z [Thread-275] INFO opId=sps-Main-611046-139 com.vmware.sps.vp.VasaDataCollector - Persisting CBP and storage associations for provider 4b57c143-e2bc-4764-9cce-4f7ac216dcb7 with url https://10.21.149.41:8084/version.xml
That gives us a couple of ways to monitor the status and health of the Storage Providers during the Purity upgrade to 5.3.
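If the sps.log has been copied off the vCenter Server, the two kinds of entries shown above can be pulled out with a short script. The following is a minimal sketch, assuming the log lines look like the samples in this article; the function name and the regular expressions are illustrative and are not a VMware-provided tool.

```python
import re

# Patterns modelled on the sample sps.log lines in this article.
# "occured" is spelled the way the log itself spells it.
OFFLINE_RE = re.compile(r"occured as provider: (https://\S+) is offline")
SYNCED_RE = re.compile(
    r"Persisting CBP and storage associations for provider (\S+) with url (https://\S+)"
)

def summarize_sps_log(lines):
    """Return (offline_urls, synced_urls) in the order they appear in the log."""
    offline, synced = [], []
    for line in lines:
        match = OFFLINE_RE.search(line)
        if match:
            offline.append(match.group(1))
            continue
        match = SYNCED_RE.search(line)
        if match:
            synced.append(match.group(2))
    return offline, synced
```

Passing the lines of /var/log/vmware/vmware-sps/sps.log to `summarize_sps_log` should list the provider URL that went offline, followed by the URL of the provider that vCenter synced with afterwards.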
How to troubleshoot and remediate an issue from the upgrade?
In the event that there is an issue during or after the upgrade, let's break down how to troubleshoot and correct these issues.
- After the upgrade is complete, the vVol Datastore is marked as inaccessible
- After the upgrade is complete, both of the Storage Providers are marked as offline
- After the upgrade is complete, vSphere vMotion, Clone, Backup or other vVols related jobs are failing
While uncommon, these are three of the issues that could be occurring post upgrade. More often than not, if there is an observed issue post upgrade, the quickest remediation will be to remove the Storage Providers in vCenter for the FlashArray that was just upgraded. After confirming with Pure Support that both the backend database service and VASA are running healthy on both controllers, re-registering those Storage Providers will likely help.
After the upgrade is complete, the vVol Datastore is marked as inaccessible
In the event that the vVol Datastore is marked as "inaccessible" post upgrade, check the following:
- Are the Storage Providers online with an active and standby Storage Provider?
  - If the Storage Providers are marked as offline or show up as a sync error:
    - Have Pure Support check that the DB service and VASA are online and healthy
    - If VASA and the DB service are healthy and the SPs are offline, then remove and re-register the storage providers (steps are in the next section)
  - If the Storage Providers are marked as online and one is active, check if the vVol Datastore is marked inaccessible for all or some hosts
    - If all the hosts are showing inaccessible, work to remove and re-register the storage providers as outlined below
- Is the vVol Datastore marked as inaccessible for one host or all hosts?
  - If the vVol Datastore is marked inaccessible for all or multiple hosts, remove and re-register the storage providers (steps are in the next section)
  - If the vVol Datastore is only disconnected for one ESXi host, log into that ESXi host and restart vvold
ssh to the ESXi host as root and then run /etc/init.d/vvold restart - This will be non-disruptive from an IO and Data Path perspective:
$ ssh email@example.com
Password:
[root@sn1-r720-b05-13:~]
[root@sn1-r720-b05-13:~] /etc/init.d/vvold restart
/etc/init.d/vvold restart, PID 2661158
Added 2661158 to /var/run/vmware/.vmware-vvol.lock-dir/vvold-lock-dir-pid (1)
watchdog-vvold: Terminating watchdog process with PID 2660881
vvold stopped.
Successfully cleared vvold memory reservation
PID 2661158 removed /var/run/vmware/vvold-done-calling-start
Removed /var/run/vmware/.vmware-vvol.lock-dir
Added 2661158 to /var/run/vmware/.vmware-vvol.lock-dir/vvold-lock-dir-pid (1)
vvold max reserve memory set to 200
WaitVvoldToComeUp /var/run/vmware/.vmware-vvol.started created
vvold started successfully!
PID 2661158 Created /var/run/vmware/vvold-done-calling-start
Removed /var/run/vmware/.vmware-vvol.lock-dir
[root@sn1-r720-b05-13:~] exit
In the event that the Storage Providers are online/healthy and there are still hosts marked as inaccessible to the vVol Datastore after running /etc/init.d/vvold restart, then the vCenter Root CA certificates and CRLs need to be refreshed on the ESXi hosts. Use the following PowerCLI workflow:
## Connect to the vCenter Server ##
Connect-VIServer -server vcenter-server
## Get the ESXi hosts and set it to a variable ##
$hosts = get-vmhost
## Start the Service Instance ##
$si = Get-View ServiceInstance
## Start the certificate Manager view ##
$certMgr = Get-View -Id $si.Content.CertificateManager
## Using the Cert Manager, refresh the ESXi hosts Certs ##
## This pushes all certificates in the TRUSTED_ROOTS store in the vCenter Server VECS store to the host. ##
$certMgr.CertMgrRefreshCACertificatesAndCRLs($Hosts.ExtensionData.MoRef)
## Now in vCenter the vvol datastore should be accessible for each of those hosts. No need to do the ssl_reset and restart on vvold again. ##
Here is an example of running this for a single ESXi Cluster:
PS C:\> Connect-VIServer -Server dev-vcsa
Name                           Port  User
----                           ----  ----
dev-vcsa                       443   ALEX\Carver
PS C:\> Get-Cluster -Name "Dev Cluster"
Name                           HAEnabled  HAFailover DrsEnabled DrsAutomationLevel
                                          Level
----                           ---------  ---------- ---------- ------------------
Dev Cluster                    True       1          True       FullyAutomated
PS C:\> $ESXi_Cluster = Get-Cluster -Name "Dev Cluster"
PS C:\> $ESXi_Cluster | Get-VMHost
Name                 ConnectionState PowerState NumCpu CpuUsageMhz CpuTotalMhz   MemoryUsageGB   MemoryTotalGB Version
----                 --------------- ---------- ------ ----------- -----------   -------------   ------------- -------
esxi-7.alex.pures... Connected       PoweredOn      16         151       38304          14.586         255.897   6.7.0
esxi-6.alex.pures... Connected       PoweredOn      20         141       43880          16.166         255.892   6.7.0
esxi-4.alex.pures... Connected       PoweredOn      20          94       43880           8.945         255.892   6.7.0
PS C:\> $hosts = $ESXi_Cluster | Get-VMHost
PS C:\> $hosts
Name                 ConnectionState PowerState NumCpu CpuUsageMhz CpuTotalMhz   MemoryUsageGB   MemoryTotalGB Version
----                 --------------- ---------- ------ ----------- -----------   -------------   ------------- -------
esxi-7.alex.pures... Connected       PoweredOn      16         151       38304          14.586         255.897   6.7.0
esxi-6.alex.pures... Connected       PoweredOn      20         141       43880          16.166         255.892   6.7.0
esxi-4.alex.pures... Connected       PoweredOn      20          94       43880           8.945         255.892   6.7.0
PS C:\> $si = Get-View ServiceInstance
PS C:\> $certMgr = Get-View -Id $si.Content.CertificateManager
PS C:\> $certMgr.CertMgrRefreshCACertificatesAndCRLs($Hosts.ExtensionData.MoRef)
PS C:\>
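The checks in this section can be condensed into a small decision helper. This is a sketch of the troubleshooting order described above, not a Pure Storage tool: the function name, its parameters, and the returned action labels are all hypothetical, and `backend_healthy` stands for "Pure Support has confirmed that the DB service and VASA are healthy".

```python
def next_step(providers_online, backend_healthy, inaccessible_hosts, vvold_restarted=False):
    """Suggest the next troubleshooting action for an inaccessible vVol Datastore.

    providers_online   -- True if one Storage Provider is active and the other standby
    backend_healthy    -- True once Pure Support confirms the DB service and VASA are healthy
    inaccessible_hosts -- number of ESXi hosts that report the datastore as inaccessible
    vvold_restarted    -- True if vvold was already restarted on the affected hosts
    """
    if inaccessible_hosts == 0:
        return "none"
    if not providers_online:
        # Offline or sync-error providers: confirm array-side health first.
        if not backend_healthy:
            return "check_with_support"
        return "reregister_providers"
    if vvold_restarted:
        # Providers are healthy and a vvold restart did not help.
        return "refresh_ca_and_crls"
    if inaccessible_hosts > 1:
        return "reregister_providers"
    return "restart_vvold"
```

For instance, `next_step(True, True, 1)` returns `"restart_vvold"`, while the same call with `vvold_restarted=True` points to the CA certificate and CRL refresh shown above.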
After the upgrade is complete, the Storage Providers are marked as offline
In the event that the Storage Providers are both marked as offline, the recommendation is to remove and re-register the storage providers for both controllers.
- Re-Register the Storage Providers with the vSphere Plugin
Navigate to the Storage Provider screen, locate the correct Storage Provider and then remove the Storage Provider:
Using the vSphere Plugin, select the correct FlashArray and then Register the Storage Provider:
Use a FlashArray Array Admin user/password and Register the Storage Providers:
- Re-Register the Storage Providers with PowerShell/PowerCLI
With PowerShell and PowerCLI you can remove the Storage Providers and then re-register the Storage Providers. This requires both the Pure Storage PowerShell SDK and the Pure Storage FlashArray VMware module.
## Connect to both the vCenter and FlashArray ##
PS C:\Users\powercli_1> $FA = New-PfaConnection -endpoint sn1-m20r2-c05-36.purecloud.com -credentials (Get-Credential) -nonDefaultArray
cmdlet Get-Credential at command pipeline position 1
Supply values for the following parameters:
Credential
PS C:\Users\powercli_1> $FA
Disposed   : False
EndPoint   : sn1-m20r2-c05-36.purecloud.com
UserName   : vvol-admin
ApiVersion : 1.16
Role       : ArrayAdmin
PS C:\Users\powercli_1> Connect-VIServer -Server ac-vcenter-1.purecloud.com
Name                           Port  User
----                           ----  ----
ac-vcenter-1.purecloud.com     443   PURECLOUD\alex
## Then remove the Storage Providers for the Given FlashArray ##
PS C:\Users\powercli_1> Remove-PfaVasaProvider -flasharray $FA
Confirm
Are you sure you want to perform this action?
Performing the operation "Unregister FlashArray VASA Provider" on target "sn1-m20r2-c05-36-CT0".
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [S] Suspend  [?] Help (default is "Y"): a
PS C:\Users\powercli_1>
## Connect to both the vCenter and FlashArray ##
## If already connected from the steps to remove the Storage Providers, you can skip this step as you already have the connections ##
PS C:\Users\powercli_1> $FA = New-PfaConnection -endpoint sn1-m20r2-c05-36.purecloud.com -credentials (Get-Credential) -nonDefaultArray
cmdlet Get-Credential at command pipeline position 1
Supply values for the following parameters:
Credential
PS C:\Users\powercli_1> $FA
Disposed   : False
EndPoint   : sn1-m20r2-c05-36.purecloud.com
UserName   : vvol-admin
ApiVersion : 1.16
Role       : ArrayAdmin
PS C:\Users\powercli_1> Connect-VIServer -Server ac-vcenter-1.purecloud.com
Name                           Port  User
----                           ----  ----
ac-vcenter-1.purecloud.com     443   PURECLOUD\alex
## Register the Storage Providers for the given FlashArray and vCenter ##
PS C:\Users\powercli_1> New-PfaVasaProvider -flasharray $FA -credentials (Get-Credential)
cmdlet Get-Credential at command pipeline position 1
Supply values for the following parameters:
Credential
Name                  Status  VasaVersion  LastSyncTime          Namespace        Url
----                  ------  -----------  ------------          ---------        ---
sn1-m20r2-c05-36-CT0  online  3.0          1/21/2020 2:59:03 PM  com.purestorage  https://10.21.149.61:8084
sn1-m20r2-c05-36-CT1  online  3.0                                com.purestorage  https://10.21.149.62:8084
PS C:\Users\powercli_1>
- Re-Register the Storage Providers manually in vCenter
After removing the previous Storage Providers (this is covered in step 1), manually add them by clicking the +Add option. From there, enter the Storage Provider information for VASA-CT0 and then for VASA-CT1: Note that port 8084 must be used in the URL and that both -ct0 and -ct1 are registered with an Array Admin user.
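Because a missing port is an easy mistake when typing the provider URLs by hand, a quick sanity check can help before submitting them. The following is a minimal sketch that only verifies the scheme and port seen in the provider URLs elsewhere in this article (https and 8084); the function name is hypothetical and the check does not contact the array.

```python
from urllib.parse import urlsplit

VASA_PORT = 8084  # port the article says must be used in the provider URL

def looks_like_provider_url(url):
    """Return True if the URL uses https and explicitly names port 8084."""
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError when the port is not a valid number
    except ValueError:
        return False
    return parts.scheme == "https" and port == VASA_PORT
```

With this check, `https://10.21.149.61:8084` passes, while the same address without `:8084` does not.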
If re-registering the storage providers fails, then do the following:
- Log into the FlashArray as an array admin user, and run purecert list
Running purecert list should get you the output for the current status of the VASA certificates for each controller. You should see the common name set to the IP address for ct0.eth0 and ct1.eth0. The org and org unit should both be Pure Storage as well. If you are seeing this, then your certificates should be in a good state for registering the Storage Providers.
pureuser@sn1-405-c12-21> purecert list
Name        Status    Key Size  Issued To                     Issued By        Valid From               Valid To                 Country  State/Province  Locality       Organization  Organizational Unit    Email                           Common Name
management  imported  2048      sn1-405-c12-21.purecloud.com  Online-Sub-CA-2  2019-12-10 23:45:36 UTC  2021-12-09 23:45:36 UTC  US       California      Mountain View  Pure Storage  Solutions Engineering  firstname.lastname@example.org  sn1-405-c12-21.purecloud.com
vasa-ct0    imported  2048      10.21.88.117                  CA               2020-01-07 00:02:02 UTC  2021-01-07 00:02:02 UTC  US       -               -              Pure Storage  Pure Storage           -                               10.21.88.117
vasa-ct1    imported  2048      10.21.88.118                  CA               2020-01-07 00:02:06 UTC  2021-01-07 00:02:06 UTC  US       -               -              Pure Storage  Pure Storage           -                               10.21.88.118
- Once purecert list looks healthy, wait a few minutes and then try to re-register the storage providers.
- In the event that purecert is not returning anything for vasa-ct0 or vasa-ct1, confirm that your array is running Purity //FA 5.3. From there, follow this KB for resetting your VASA certificates. Please be sure to reset the certificate rather than creating a CSR and importing a certificate. The purpose of this is to first get everything into a healthy state. You can generate a CSR and import a signed certificate at a later time.
- In the event that the purecert command is failing to update/reset the certificates or the Storage Providers continue to fail, open a Remote Assist session on the FlashArray and have Pure Support check that the internal database service and VASA are running healthy on both controllers.
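The health criteria for the VASA certificates (common name equal to the controller's eth0 IP address, organization and organizational unit both set to Pure Storage) can be expressed as a simple check. This sketch assumes the purecert list rows have already been turned into dictionaries keyed by the column headings; that representation, the function name, and the `expected_ips` argument are assumptions for illustration only.

```python
def vasa_certs_look_healthy(rows, expected_ips):
    """Check the vasa-ct0/vasa-ct1 rows against the criteria in this article.

    rows         -- list of dicts keyed by purecert list column names
    expected_ips -- mapping such as {"vasa-ct0": "<ct0.eth0 IP>", "vasa-ct1": "<ct1.eth0 IP>"}
    """
    by_name = {row["Name"]: row for row in rows}
    for name, ip in expected_ips.items():
        cert = by_name.get(name)
        if cert is None:
            return False  # purecert returned nothing for this controller
        if cert["Common Name"] != ip:
            return False
        if cert["Organization"] != "Pure Storage":
            return False
        if cert["Organizational Unit"] != "Pure Storage":
            return False
    return True
```

A False result here corresponds to the cases above where the certificates should be reset before trying to register the Storage Providers again.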
After the upgrade is complete, vSphere vMotion, Clone, Backup or other vVols related jobs are failing
These types of symptoms post upgrade have been the least common, but given the impact they could have, it is worth reviewing how to check for, troubleshoot and remediate the issue.
First, what could possibly be the issue? When the Primary Controller is rebooted and the Secondary Controller takes over as primary, this will trigger a Storage Provider failover in the vSphere environment. When this happens, a request, setPEContext, is issued from the ESXi hosts to the VASA Provider. This request wants to get information about the Protocol Endpoints and the VASA Provider. In the event that the backend database services are taking longer than expected, this request will then fail. In the /var/log/vvold.log file on the ESXi host, the following logging can be observed:
2020-01-21T20:55:20.207Z info vvold [Originator@6876 sub=Default] HostManager::RefreshPEMap needSetPEContext=true
--> VasaOp::SetPEContextInt [#1]: ===> Issuing 'setPEContext' to VP sn1-m20r2-c05-36-ct0 (#outstanding 0/5) [session state: Connected]
--> VasaOp::ThrowFromSessionError [#1]: ===> FINAL FAILURE setPEContext, error (STORAGE_FAULT / Error setting PE Context org.apache.http.ConnectionClosedException: Connection closed / ) VP (sn1-m20r2-c05-36-ct0) Container (sn1-m20r2-c05-36-ct0) timeElapsed=14 msecs (#outstanding 0)
2020-01-21T20:55:21.167Z error vvold [Originator@6876 sub=Default] VasaSession::SetPEContextInt: for url https://10.21.149.61:8084/version.xml (#PEs 1) VP failed to setPEContext: STORAGE_FAULT (Error setting PE Context org.apache.http.ConnectionClosedException: Connection closed / )
2020-01-21T20:55:21.167Z info vvold [Originator@6876 sub=Default] VasaSession::SetPEContextInt: SCSI PE (uniqueIdentifier=__none__ lunId=naa.624a9370ef9d69657e164d4600011535 ipAddress=__none__ serverMount=__none__)
The problem here is that ESXi does not natively retry this request for this VASA session unless the ESXi host's vvold service is restarted, another Storage Provider failover occurs in vCenter, or the Storage Providers are removed and re-registered. After any of these, the vvold.log should show that the SetPEContext is successful:
2020-01-21T21:02:15.086Z info vvold [Originator@6876 sub=Default] HostManager::RefreshPEMap needSetPEContext=true
--> VasaOp::SetPEContextInt [#1]: ===> Issuing 'setPEContext' to VP sn1-m20r2-c05-36-ct0 (#outstanding 0/5) [session state: Connected]
--> VasaOp::ThrowFromSessionError [#1]: ===> FINAL SUCCESS setPEContext VP (sn1-m20r2-c05-36-ct0) Container (sn1-m20r2-c05-36-ct0) timeElapsed=16 msecs (#outstanding 0)
2020-01-21T21:02:15.839Z info vvold [Originator@6876 sub=Default] VasaSession::SetPEContextInt: for url https://10.21.149.61:8084/vasa (#PEs 1): success
2020-01-21T21:02:15.839Z info vvold [Originator@6876 sub=Default] VasaSession::SetPEContextInt: SCSI PE (uniqueIdentifier=__none__ lunId=naa.624a9370ef9d69657e164d4600011535 ipAddress=__none__ serverMount=__none__)
Likely the quickest way to ensure that the request is reissued to all ESXi hosts in the vCenter is to remove and re-register the storage providers. In the event that only one or two hosts have this issue, you could instead ssh as root to each host and restart the vvold service, though overall it may just be easier to re-register the storage providers. Both procedures are covered in the previous sections. Before re-registering the storage providers, check with Pure Support to ensure that the backend database service and VASA are running healthy on both controllers.
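To decide which hosts still need attention, the vvold.log from each ESXi host can be scanned for the most recent setPEContext outcome. The following is a minimal sketch, assuming the log entries match the FINAL FAILURE and FINAL SUCCESS samples above; the function name and regular expression are illustrative rather than a VMware-provided utility.

```python
import re

# Matches the "FINAL FAILURE/SUCCESS setPEContext ... VP (<name>)" entries shown above.
FINAL_RE = re.compile(r"FINAL (SUCCESS|FAILURE) setPEContext\b.*?VP \(([^)]+)\)")

def last_setpecontext_result(lines):
    """Map each VASA provider name to its most recent setPEContext result."""
    results = {}
    for line in lines:
        match = FINAL_RE.search(line)
        if match:
            outcome, provider = match.group(1), match.group(2)
            results[provider] = outcome  # later entries overwrite earlier ones
    return results
```

A host whose latest result for the active provider is still FAILURE is a candidate for a vvold restart or for the Storage Provider re-registration described above.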
With all of the above said, Purity 5.3 marks a significant improvement to the overall vVols experience on the FlashArray. While uncommon, some end users reported issues with Storage Providers being offline, vVol Datastores being unavailable or Management Path requests failing post upgrade. Pure Storage Engineering investigated the reported issues and found that there was room to improve on a few different fronts: what could be done immediately, what could be done in the next Purity releases and what can be done long term.
- What could be done now? Pure updated the upgrade process to minimize the chance that the secondary controller continues to be the active storage provider. Along with this, Pure fixed an issue in which IO Balances with long sample times caused the backend database service to take much longer to satisfy VASA requests.
- What can be done in the next couple of Purity releases? Pure Engineering is looking into further ways to improve the upgrade experience when the secondary controller is initially rebooted. Along with this, Pure will be improving the response when the ESXi host sends a SetPEContext so that the ESXi host will not immediately fail the request, but will retry after another 90 seconds. Additionally, Engineering is collaborating further with VMware Engineering on how to improve this behavior.
- What is going to be done long term by Pure? Along with making any additional improvements and enhancements based on the above investigations, Pure Engineering is continuing to improve the overall backend database service performance and startup times. There is additional work that will help the backend database service take over after a planned or unplanned failover. Combined with the improvements to VASA's response to the SetPEContext when not ready, this should greatly improve the upgrade process to Purity //FA 5.3.
An important item to reiterate is that these issues have only been observed when upgrading from Purity 5.2 and lower to Purity 5.3. Once on Purity 5.3, Pure Storage does not expect to see any further issues related to the backend database service being upgraded. If you have any questions, concerns or feedback, please leave feedback in the KB, open a Pure support case or reach out to Cody Hosterman or Alex Carver in the Pure Storage Community Slack.