
Troubleshooting: VASA Operations Fail with Vendor Specific Error - Duplicate Key Found


Pure Storage and VMware have discovered an issue when running Purity//FA versions 5.3.12 and earlier, 6.0.0 through 6.0.4, or 6.1.0, where VASA operations fail with a vendor-specific error stating that duplicate keys were found for the request.  This issue, triggered by a behavior change introduced in vSphere 7.0 U3, will cause operations such as Managed Snapshots, VM Power On, vSphere vMotion, or Storage vMotion to fail.  Customers will need to work with Pure Support to correct the issue.

What is the Issue?

After upgrading to vSphere 7.0 U3, once a VM Managed Snapshot has been taken, operations and workflows for that VM fail with a vendor error.  When the operation is issued against the virtual disks, the vendor error is "duplicate key found".  When the operation is issued against the snapshot, the vendor error will say the snapshot ID was not found.  That is what the error and failure look like, but why is this happening?  What changed with vSphere 7.0 U3 that is causing it?  Which Purity versions are exposed to this?

Why is this happening?  These failures are caused by the way managed snapshots work with vVols, combined with an issue in the Pure VASA service.  In older versions of Purity, VASA would return all metadata for a virtual volume as part of the query, clone, and create workflows.  This included Pure-specific metadata, which the VASA service requires for vVols to function correctly on the FlashArray.  Generally, this is not a problem because VMware does not set non-VMware metadata key value pairs, but vSphere 7.0 U3 changed how managed snapshots operate.

vSphere managed snapshots of vVols-based VMs work in the following way:

  1. vSphere issues a request (prepareToSnapshotVirtualVolume) for VASA to get ready to snapshot the virtual volumes for that VM.  When VASA receives this request, a new volume is created, named after the data vVol with a -snap suffix.  In the response back to vSphere, VASA returns the metadata key value (KV) pairs for the parent (source data vVols) and the data-snap vVol ID.
  2. vSphere uses the information returned with the completed prepareToSnapshotVirtualVolume and structures the snapshotVirtualVolume request metadata based on that information.
  3. vSphere pauses the VM so that no IO is issued during the snapshot process and then issues a request (snapshotVirtualVolume) to VASA.  VASA then copies out the data vVol and overwrites the -snap data vVols created during the prepare phase.  Once that is completed, VASA returns success to vSphere, and once all virtual disks have a successful snapshotVirtualVolume request, the VM is un-paused and the managed snapshot is complete.

Knowing the process by which the managed snapshot is taken helps outline why the issue happens.  The VASA provider returns a Pure vendor KV (key value) pair as part of the parent info in the prepareToSnapshotVirtualVolume response.  Prior to vSphere 7.0 U3, VMware wouldn't do anything with that KV pair and wouldn't use it in the snapshotVirtualVolume request.  A change in vSphere 7.0 U3 started using all of the parent info from the prepareToSnapshotVirtualVolume response in the snapshotVirtualVolume request, meaning that the Pure KV pair holding the parent vVol ID is then applied to the volume used in snapshotVirtualVolume.  This causes both the data vVol and the data-snap vVol to have the same vVol ID on the FlashArray.  Now that there are two volumes on the FlashArray with the same vVol ID, several operations that would be issued to VASA for the VM will fail.
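
To make this concrete, here is a small conceptual sketch in Python of the metadata flow described above.  The dictionaries and key names (for example, PURE_VVOL_ID) are illustrative stand-ins only; they are not the real VASA API objects or the actual Pure metadata keys.

# Conceptual sketch only -- these dictionaries stand in for the VASA
# request/response payloads; the key names are hypothetical.

# prepareToSnapshotVirtualVolume: VASA creates the "-snap" volume and returns
# the parent (source data vVol) metadata along with the new data-snap vVol id.
prepare_response = {
    "snapshotVvolId": "rfc4122.aaaa-bbbb-snap",
    "parentInfo": {
        "VMW_VVolName": "Data-3d3b2fdf",           # VMware-owned key
        "PURE_VVOL_ID": "id-of-parent-data-vvol",  # Pure vendor key (hypothetical name)
    },
}

# Pre-7.0 U3 behavior (simplified): vSphere built the snapshotVirtualVolume
# request without carrying over vendor-specific keys from parentInfo.
snapshot_request_pre_u3 = {
    "vvolId": prepare_response["snapshotVvolId"],
    "metadata": {"VMW_VVolName": prepare_response["parentInfo"]["VMW_VVolName"]},
}

# 7.0 U3 behavior (simplified): all parentInfo keys are carried over, so the
# Pure vendor key holding the *parent* vVol id is now also set on the data-snap
# vVol.  Two volumes on the FlashArray end up with the same vVol id, and later
# VASA lookups against that id fail with "duplicate key found".
snapshot_request_u3 = {
    "vvolId": prepare_response["snapshotVvolId"],
    "metadata": dict(prepare_response["parentInfo"]),
}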

Here are examples of some of the operations that will fail when the VM has a duplicate key:

  • VM PowerOn requests
  • vSphere vMotion requests
  • Storage vMotion requests
  • VM Managed Snapshot creation requests
  • VM Clone or linked clone requests
  • VM vVol resize requests

What versions of Purity are affected?  FlashArrays running Purity 5.0.x, 5.1.x, 5.2.x, 5.3.0 through 5.3.12, 6.0.0 through 6.0.4, or 6.1.0 are exposed to the duplicate key problem.

What versions of vSphere are affected?  While the problem could potentially occur with other vSphere versions, Pure has only seen it triggered with vSphere 7.0U3 and 7.0U3a.

What does the issue look like?

There are a few different failures that you may see from the vSphere UI or the ESXi host logs.  Here are some examples; while they don't cover every case, they may help narrow down where the issue is and whether you are hitting duplicate key errors.

VM Power On Failure

When trying to power on the VM with duplicate keys found on the data vVol, the failure may look like this in the vSphere UI.

power-on-failure.png

The underlying problem will not be as clear from the ESXi vvold log, though.

2021-11-12T18:38:29.479Z info vvold[2099707] [Originator@6876 sub=Default] VasaOp::QueryVirtualVolumeInt [#308]: ===> Issuing 'queryVirtualVolume' to VP [flasharray-m50-2-ct0:Connected (Outstanding 0/5)]
2021-11-12T18:38:29.487Z info vvold[2099707] [Originator@6876 sub=Default] VasaOp[#308] ===> FINAL SUCCESS queryVirtualVolume VP (flasharray-m50-2-ct0) Container (10c2cbcb-16a6-32d4-b2fd-7c8036f60f64) timeElapsed=7 msecs (#outstanding 0)
2021-11-12T18:38:29.487Z info vvold[2099707] [Originator@6876 sub=Default] SI::QueryVirtualVolumeInt Empty response
2021-11-12T18:38:31.373Z info vvold[2099446] [Originator@6876 sub=Default opID=kvwp3lxz-1019-auto-sc-h5:70000597-fd-01-01-01-01-46-0736] Came to SI::ForceUnbindVVols: vvolIdVec.size = 1
2021-11-12T18:38:31.373Z info vvold[2099446] [Originator@6876 sub=Default opID=kvwp3lxz-1019-auto-sc-h5:70000597-fd-01-01-01-01-46-0736] Doing BatchUnbind on VP Pure Storage VASA Provider, Batch Size 4 with 2 concurrent batch
2021-11-12T18:38:31.373Z info vvold[2099446] [Originator@6876 sub=Default opID=kvwp3lxz-1019-auto-sc-h5:70000597-fd-01-01-01-01-46-0736] Waiting for 1 pending unbind threads..
2021-11-12T18:38:31.373Z info vvold[2099024] [Originator@6876 sub=Default opID=kvwp3lxz-1019-auto-sc-h5:70000597-fd-01-01-01-01-46-0736] Starting batch unbinding for 1 devices
2021-11-12T18:38:31.374Z info vvold[2099024] [Originator@6876 sub=Default opID=kvwp3lxz-1019-auto-sc-h5:70000597-fd-01-01-01-01-46-0736] VVolDevice:PrepareDeviceForBatchUnbind Preparing device for unbind rfc4122.e0dfca0e-2750-46c2-bcc2-aebb1109c5cd
2021-11-12T18:38:31.374Z info vvold[2099024] [Originator@6876 sub=Libs opID=kvwp3lxz-1019-auto-sc-h5:70000597-fd-01-01-01-01-46-0736] 2099024:VVOLLIB : VVolLib_UserLockVVol:10215: Successfully acquired user lock on VVol rfc4122.e0dfca0e-2750-46c2-bcc2-aebb1109c5cd (curTime:1596 startTime:1596 timeoutSec:5)
2021-11-12T18:38:31.374Z info vvold[2099024] [Originator@6876 sub=Default opID=kvwp3lxz-1019-auto-sc-h5:70000597-fd-01-01-01-01-46-0736] Batching 1 devices for unbindVirtualVolume
2021-11-12T18:38:31.374Z info vvold[2099024] [Originator@6876 sub=Default opID=kvwp3lxz-1019-auto-sc-h5:70000597-fd-01-01-01-01-46-0736] VasaOp::BatchUnbindVirtualVolume [#309]: ===> Issuing 'unbindVirtualVolume' to VP [flasharray-m50-2-ct0:Connected (Outstanding 0/5)]
2021-11-12T18:38:31.398Z error vvold[2099024] [Originator@6876 sub=Default opID=kvwp3lxz-1019-auto-sc-h5:70000597-fd-01-01-01-01-46-0736] VasaOp[#309] ===> FINAL FAILURE unbindVirtualVolume, error (STORAGE_FAULT / Duplicate key vvol-fa-m50-2-vvol-01-vm-6505489e-vg/Data-3d3b2fdf) VP (flasharray-m50-2-ct0) Container (flasharray-m50-2-ct0) timeElapsed=24 msecs (#outstanding 0)
2021-11-12T18:38:31.398Z error vvold[2099024] [Originator@6876 sub=Default opID=kvwp3lxz-1019-auto-sc-h5:70000597-fd-01-01-01-01-46-0736] Batch unbind call failed with error STORAGE_FAULT: Duplicate key vvol-fa-m50-2-vvol-01-vm-6505489e-vg/Data-3d3b2fdf
2021-11-12T18:38:31.398Z error vvold[2099024] [Originator@6876 sub=Default opID=kvwp3lxz-1019-auto-sc-h5:70000597-fd-01-01-01-01-46-0736] Batch call failed for Unbind  rfc4122.e0dfca0e-2750-46c2-bcc2-aebb1109c5cd, with error 22
2021-11-12T18:38:31.398Z info vvold[2099024] [Originator@6876 sub=Default opID=kvwp3lxz-1019-auto-sc-h5:70000597-fd-01-01-01-01-46-0736] VVolDevice::FinishBatchUnbind Finishing unbind for rfc4122.e0dfca0e-2750-46c2-bcc2-aebb1109c5cd with VASA result 22
2021-11-12T18:38:31.398Z warning vvold[2099024] [Originator@6876 sub=Default opID=kvwp3lxz-1019-auto-sc-h5:70000597-fd-01-01-01-01-46-0736] VVolDevice::GetBindStatusForResult Non-retryable error 22 occurred 1 times on unbind, for device rfc4122.e0dfca0e-2750-46c2-bcc2-aebb1109c5cd, bind status is unknown.
2021-11-12T18:38:31.399Z warning vvold[2099024] [Originator@6876 sub=Default opID=kvwp3lxz-1019-auto-sc-h5:70000597-fd-01-01-01-01-46-0736] Device may still be bound, unbind is not done.

There are unbind failures with a duplicate key error, and there are also queryVirtualVolume requests that return an empty response.  The combination of not being able to find the managed snapshot vVols and the failing unbinds on the data vVol prevents the VM from powering on.  The telling sign here is any "FINAL FAILURE" with a duplicate key error as part of the workflow.
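
A quick way to confirm this pattern is to scan the host's vvold log for the failure signatures shown above.  Below is a minimal Python sketch; it assumes a copy of the ESXi vvold log is available (commonly /var/log/vvold.log on the host, adjust the path for your environment) and simply searches for FINAL FAILURE / STORAGE_FAULT lines that mention a duplicate key.

#!/usr/bin/env python3
"""Scan one or more vvold log files for VASA duplicate-key failures."""
import re
import sys

# Matches the failure signatures shown in the log excerpts in this article.
PATTERN = re.compile(r"(FINAL FAILURE|STORAGE_FAULT).*Duplicate key")

def scan(path: str) -> int:
    hits = 0
    with open(path, errors="replace") as log:
        for lineno, line in enumerate(log, 1):
            if PATTERN.search(line):
                hits += 1
                print(f"{path}:{lineno}: {line.rstrip()}")
    return hits

if __name__ == "__main__":
    total = sum(scan(path) for path in sys.argv[1:])
    print(f"\nFound {total} duplicate-key failure line(s).")

For example: python3 scan_vvold.py vvold.log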

VM Snapshot Failure

When a duplicate key exists for a VM's data vVol, any managed snapshot request will fail.  The request fails because the snapshotVirtualVolume request is issued against a data vVol with a duplicate key, and VASA rejects it.  The failure will look like this in the vSphere UI.

create-snap-failure.png

The failure will be clear in the ESXi host's vvold log, with a FINAL FAILURE showing a storage fault.

2021-11-12T18:37:11.563Z info vvold[2101178] [Originator@6876 sub=Default] VasaOp::ExecuteBatch [#278]: ===> Issuing 'snapshotVirtualVolume' to VP [flasharray-m50-2-ct0:Connected (Outstanding 0/5)]
2021-11-12T18:37:11.576Z error vvold[2101178] [Originator@6876 sub=Default] VasaOp::IsSuccessful [#278]: snapshotVirtualVolume transient failure: 22 (STORAGE_FAULT / Duplicate key vvol-fa-m50-2-vvol-01-vm-6505489e-vg/Data-3d3b2fdf)
2021-11-12T18:37:11.576Z warning vvold[2101178] [Originator@6876 sub=Default] VasaOp[#278] ===> Transient failure snapshotVirtualVolume VP (flasharray-m50-2-ct0) retry=false, batchOp=false container=10c2cbcb-16a6-32d4-b2fd-7c8036f60f64 timeElapsed=13 msecs (#outstanding 0)
2021-11-12T18:37:11.576Z error vvold[2101178] [Originator@6876 sub=Default] VasaOp[#278] ===> FINAL FAILURE snapshotVirtualVolume, error (STORAGE_FAULT / Duplicate key vvol-fa-m50-2-vvol-01-vm-6505489e-vg/Data-3d3b2fdf) VP (flasharray-m50-2-ct0) Container (10c2cbcb-16a6-32d4-b2fd-7c8036f60f64) timeElapsed=13 msecs (#outstanding 0)
2021-11-12T18:37:11.576Z error vvold[2101178] [Originator@6876 sub=Default] VASA_RETRY_BATCH_LOOP op (snapshotVirtualVolume) container (10c2cbcb-16a6-32d4-b2fd-7c8036f60f64) terminated with Exception (STORAGE_FAULT: Duplicate key vvol-fa-m50-2-vvol-01-vm-6505489e-vg/Data-3d3b2fdf)
2021-11-12T18:37:11.576Z error vvold[2101178] [Originator@6876 sub=Default] Batch snapshot call failed with error STORAGE_FAULT: Duplicate key vvol-fa-m50-2-vvol-01-vm-6505489e-vg/Data-3d3b2fdf
2021-11-12T18:37:11.576Z info vvold[2101178] [Originator@6876 sub=Default] BatchSnapshot:ExecuteBatch Operation completed, Thread signalled..
2021-11-12T18:37:11.576Z info vvold[2101178] [Originator@6876 sub=Default] BatchSnapshot:STATS Duration- 13ms, Total-1, Batched-1, Passed-0, Failed-1
2021-11-12T18:37:11.576Z info vvold[2099445] [Originator@6876 sub=Default] Wait for batch operation completed true
2021-11-12T18:37:11.576Z error vvold[2099445] [Originator@6876 sub=Default] Batch snapshot call failed with error 22: STORAGE_FAULT: Duplicate key vvol-fa-m50-2-vvol-01-vm-6505489e-vg/Data-3d3b2fdf

Here the vendor fault is reported as a duplicate key found for the data vVol.

vSphere vMotion Failure

A vSphere vMotion request will fail because the destination host is unable to bind the data vVols for the VM. The failure will look like this in the vSphere UI.

migrate-vm-failure.png

To find the issue in the ESXi vvold log, look at the host that was the VM's destination for the vMotion request and view its vvold log to see the failure.

2021-11-10T17:53:59.832Z info vvold[2099090] [Originator@6876 sub=Default] VasaOp::ExecuteBatch [#14231]: ===> Issuing 'bindVirtualVolume' to VP [flasharray-m50-2-ct0:Connected (Outstanding 0/5)]
2021-11-10T17:53:59.855Z error vvold[2099090] [Originator@6876 sub=Default] VasaOp::IsSuccessful [#14231]: bindVirtualVolume transient failure: 22 (STORAGE_FAULT / Duplicate key vvol-fa-m50-2-vvol-01-vm-6505489e-vg/Data-3d3b2fdf)
2021-11-10T17:53:59.855Z warning vvold[2099090] [Originator@6876 sub=Default] VasaOp[#14231] ===> Transient failure bindVirtualVolume VP (flasharray-m50-2-ct0) retry=false, batchOp=false container=10c2cbcb-16a6-32d4-b2fd-7c8036f60f64 timeElapsed=23 msecs (#outstanding 0)
2021-11-10T17:53:59.855Z error vvold[2099090] [Originator@6876 sub=Default] VasaOp[#14231] ===> FINAL FAILURE bindVirtualVolume, error (STORAGE_FAULT / Duplicate key vvol-fa-m50-2-vvol-01-vm-6505489e-vg/Data-3d3b2fdf) VP (flasharray-m50-2-ct0) Container (10c2cbcb-16a6-32d4-b2fd-7c8036f60f64) timeElapsed=23 msecs (#outstanding 0)
2021-11-10T17:53:59.855Z error vvold[2099090] [Originator@6876 sub=Default] VASA_RETRY_BATCH_LOOP op (bindVirtualVolume) container (10c2cbcb-16a6-32d4-b2fd-7c8036f60f64) terminated with Exception (STORAGE_FAULT: Duplicate key vvol-fa-m50-2-vvol-01-vm-6505489e-vg/Data-3d3b2fdf)
2021-11-10T17:53:59.855Z error vvold[2099090] [Originator@6876 sub=Default] Batch bind call failed with error STORAGE_FAULT: Duplicate key vvol-fa-m50-2-vvol-01-vm-6505489e-vg/Data-3d3b2fdf
2021-11-10T17:53:59.855Z info vvold[2099090] [Originator@6876 sub=Default] BatchBind:OnBatchCompletion Operation completed, Thread signalled..
2021-11-10T17:53:59.855Z info vvold[2099090] [Originator@6876 sub=Default] BatchBind:STATS[14208536544347684866] Duration- 23ms, Total-1, Batched-1, Passed-0, Failed-1

The failure will be on the destination host after the bind against the data vVol fails with the "duplicate key found" vendor fault.


Workarounds or Fixes?

Are there any fixes for this issue, or any workarounds if you are affected by it?  If you have duplicate keys, the first step is to contact Pure Support immediately and work with them to correct the duplicate keys and develop an action plan moving forward.  The primary reason is that Pure Support is able to correct the duplicate keys on the array.

Work with Pure Support

In the event that there will be a delay before Pure Support can be engaged, and a VM is down and needs to be back up ASAP, there are a couple of methods to power on the VM. Please note that if you contact Pure Support about a production VM that is unable to power on, it is treated as a severity one issue because of the production impact.

  1. Delete and consolidate the VM's managed snapshots, then destroy and eradicate the VM's data-snap volumes on the FlashArray
  2. Create a placeholder VM with the same VM settings and virtual disks, then on the FlashArray copy and overwrite the placeholder's data vVols with the source VM's data vVols

Delete and Consolidate VM Snapshots

This is likely the quickest workflow to get a powered-off VM corrected and back into a state where it can power on.  Essentially, this workaround removes the duplicate keys on the FlashArray by destroying and eradicating the VMware managed snapshot vVols.  The important part, though, is that the managed snapshot is first deleted in vCenter and the VM's snapshots are consolidated.  This ensures that the VMDK names are updated and future issues aren't encountered with the VM.

  1. The first step is to delete the Managed Snapshot in vSphere and then Consolidate all Snapshots afterwards.
    delete-consolidate-snapshot.png
    This screenshot was taken after the Delete All Snapshots task had completed, and a consolidation-needed status is now triggered for the VM.
  2. Then, from the FlashArray, go to the VM's volume group and locate the data-snap volumes.

    Screen Shot 2021-11-10 at 12.11.52 PM.png
  3. While this step is optional, it's a good idea to make a copy of the data-snap, naming it something like (vm-name-vmdk-1-snap-temp).

  4. Destroy and eradicate the data-snap for the VM's deleted managed snapshot.

    Screen Shot 2021-11-10 at 12.12.09 PM.png
    eradicate-snap.png
  5. Navigate back to the vCenter Server and power the VM back on, and check that the guest OS fully starts.
  6. If the data-snap was copied out to a temp volume and is no longer needed, go ahead and destroy it.

This is a generalized workflow, but the key to the process is making sure that the VM's VMware managed snapshots are deleted in vCenter before attempting to do anything on the array with those data-snap vVols.
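
For reference, below is a rough scripted sketch of this workflow, assuming the pyvmomi and legacy purestorage Python packages are installed.  All hostnames, credentials, the VM name, and the data-snap volume names are placeholders that must be replaced with values from your environment, and the FlashArray steps should only run after the managed snapshots have been deleted and consolidated in vCenter.

"""Sketch of workaround 1: delete/consolidate managed snapshots in vCenter,
then optionally copy, and finally destroy/eradicate the VM's data-snap
volumes on the FlashArray.  Placeholder values throughout -- adjust before use."""
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim
import purestorage

VCENTER, VC_USER, VC_PASS = "vcenter.example.com", "administrator@vsphere.local", "***"
FLASHARRAY, API_TOKEN = "flasharray-m50-2", "***"
VM_NAME = "vvol-01-vm"
DATA_SNAP_VOLS = ["vvol-01-vm-vg/Data-3d3b2fdf-snap"]   # data-snap vVols of the deleted snapshot

# Step 1: delete all managed snapshots and consolidate the VM's disks.
si = SmartConnect(host=VCENTER, user=VC_USER, pwd=VC_PASS,
                  disableSslCertValidation=True)  # recent pyvmomi; lab use only
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == VM_NAME)
    WaitForTask(vm.RemoveAllSnapshots_Task(consolidate=True))
    if vm.runtime.consolidationNeeded:
        WaitForTask(vm.ConsolidateVMDisks_Task())
finally:
    Disconnect(si)

# Steps 2-4: optional safety copy, then destroy and eradicate the data-snaps.
array = purestorage.FlashArray(FLASHARRAY, api_token=API_TOKEN)
for vol in DATA_SNAP_VOLS:
    array.copy_volume(vol, vol + "-temp")   # optional temp copy (step 3)
    array.destroy_volume(vol)               # step 4: destroy...
    array.eradicate_volume(vol)             # ...and eradicate
array.invalidate_cookie()

# Step 5: power the VM back on from vCenter and verify the guest OS boots.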

Copy Data vVols to Placeholder VM

This workflow creates a new placeholder VM and then copies the source VM's data vVols over the placeholder's data vVols, overwriting them.

Detailed workflows for doing this (albeit with a VM's VMware managed snapshots) can be found in this KB.  Summarized, the workflow looks something like this (a scripted sketch of the copy step follows the list):

  1. From vCenter Server, create a new vVol VM with the same configuration (CPU, Memory and VMDKs) as the affected VM that can't power on
  2. From the FlashArray, navigate to the new VM's volume group, get the volume name for each data volume
  3. From the FlashArray, navigate to the affected VM's volume group, select the data volume for VMDK 1
  4. From the FlashArray, copy the data volume for VMDK 1 to overwrite the data volume on the placeholder VM's volume group for VMDK 1
  5. Repeat for any additional VMDKs as needed
  6. From vCenter Server, select the new VM that was created and power on the VM
  7. Access the VM after it's powered on (RDP or Console) and confirm the Application or Files are all there
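
As referenced above, here is a minimal sketch of the FlashArray copy step using the legacy purestorage Python client.  The array name, API token, and volume names are placeholders; read the real data vVol names from the placeholder VM's and the affected VM's volume groups first (steps 2 and 3) -- the sketch then performs steps 4 and 5.

"""Sketch of steps 4-5: copy the affected VM's data vVols over the
placeholder VM's data vVols with overwrite.  Placeholder names throughout."""
import purestorage

array = purestorage.FlashArray("flasharray-m50-2", api_token="***")

# source data vVol (affected VM) -> destination data vVol (placeholder VM), one per VMDK
DISK_MAP = {
    "affected-vm-vg/Data-3d3b2fdf": "placeholder-vm-vg/Data-aaaa1111",
    # add one entry per additional VMDK
}

for src, dst in DISK_MAP.items():
    # overwrite=True replaces the placeholder's empty data vVol contents while
    # keeping the placeholder volume (and its clean vVol metadata) in place
    array.copy_volume(src, dst, overwrite=True)

array.invalidate_cookie()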

This would be the last resort if you are unable to work with support and deleting the managed snapshots doesn't work.  Please work with Pure Support if VMs are affected by duplicate keys.


Recommendations?

The recommendations differ based on whether duplicate keys are found, whether Purity is on a fixed release, and whether vSphere has been upgraded to 7.0U3.  These are the four situations you may be in; the recommendation for each depends on whether duplicate keys are found or failures like those covered in this KB already exist.

  1. What to do if vSphere has not been upgraded to 7.0U3 and Purity is not on a fixed release
  2. What to do if vSphere has been upgraded to 7.0U3, but Purity is not on a fixed release
  3. What to do if vSphere has not been upgraded to 7.0U3, but Purity is on a fixed release
  4. What to do if vSphere has been upgraded to 7.0U3 and Purity is on a fixed release

Duplicate Keys are found

This means that there are data vVols whose managed snapshot data-snap vVols carry the duplicated key.  These could be orphaned objects if the affected VM has been deleted or removed from inventory.  Either way, if data vVols with duplicate keys are on the array, what are the recommendations?

  1. This is the least likely situation to occur, but if vSphere has not been upgraded to 7.0U3 and Purity is not on a fixed release, first work with Pure Support to correct the duplicate keys, then schedule a Purity upgrade before upgrading vSphere to 7.0U3.
  2. This is the most likely situation to occur and needs to be addressed in two parts.  First, fix the existing duplicate key issues with Pure Support, and then work to upgrade the FlashArray to a fixed Purity release.  After the FlashArray has been upgraded, Pure Support will do an additional scan of the vVols to make sure that duplicate keys were not created during the upgrade process.
  3. This is almost as unlikely as 1, but could still happen.  When duplicate keys are found on arrays that are on a fixed Purity version, contact Pure Support and work with them to correct the duplicated keys.  Once they are fixed, there is no way for vSphere to set Pure key value pairs in the future.
  4. This is likely to have occurred when an upgrade of vSphere to 7.0U3 was done right before an upgrade of Purity was completed.  During the time that vSphere was on 7.0U3 and before Purity was upgraded, the duplicated keys were created.  In these cases, contact Pure Support to fix the duplicate keys; the issue should not happen again moving forward.

Overall, for all of these cases, contact Pure Support ASAP to work on correcting the duplicate keys.  If vSphere is not yet on 7.0U3, hold off on upgrading to 7.0U3.

Duplicate Keys are not found

If there are no duplicate keys found, what are the recommendations based on the four scenarios above?

  1. If vSphere is not on 7.0U3 and Purity is not on a fixed release, DO NOT upgrade vSphere to 7.0U3 until AFTER Purity has been upgraded.
  2. If vSphere is on 7.0U3 and Purity is not on a fixed release, minimize or disable managed snapshots from being taken on vVols based VMs until the Purity upgrade to a fixed release can be completed.
  3. If vSphere has not been upgraded to 7.0U3 and Purity is on a fixed release, there is nothing to worry about and an upgrade to vSphere 7.0U3 is fine.
  4. If vSphere has been upgraded to 7.0U3 and Purity is on a fixed release, there is nothing to worry about.

Overall, if no duplicate keys are found, the first thing to do is check whether Purity is on a fixed release, and then make sure that Purity is upgraded to a fixed release if it isn't already.