Troubleshooting: CBT Enabled vVols VMs Fail QueryChangedDiskAreas when FlashArray SafeMode is Enabled

The CBT compatibility issues with SafeMode and vVols have been addressed with the release of Purity//FA 6.1.3 and higher.  SafeMode no longer causes CBT enabled managed snapshots to fail while building out change tracking, Storage vMotions from vVols to VMFS no longer fail, and queryChangedDiskAreas works correctly.  However, object counts will be higher if SafeMode is enabled on a FlashArray using vVols.

While CBT and SafeMode no longer have these issues, please keep the following important points in mind when using SafeMode with vVols:

  • Swap and Data-Snap vVols will not be auto-eradicated, which will lead to higher volume object counts in the following events (one way to monitor pending eradication counts is sketched after this list):
    • Taking and deleting vSphere managed snapshots
    • Powering the same virtual machines on and off
    • Frequent vSphere vMotions between ESXi hosts
  • In Purity 6.1.x, SafeMode prevents FlashArray Protection Group schedules from being increased beyond the default value.  This means that VASA is unable to adjust Protection Group snapshot or replication schedules.  When storage policies that have existing replication groups are used, the auto pgroup option will not work.
  • Additionally in Purity 6.1.x, when SafeMode is enabled, vVols that are placed in FlashArray protection groups cannot be removed from those protection groups.  VASA is therefore unable to move volumes between protection groups when storage policy replication groups are changed in vSphere.  Once a storage policy and replication group are applied to a vVols based VM, they cannot be changed while SafeMode is enabled.
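
Where the higher object counts are a concern, the number of destroyed volumes awaiting eradication can be watched via the FlashArray REST API.  Below is a minimal PowerShell sketch, assuming a Purity REST 1.x endpoint (version 1.19 here), a placeholder array address and API token, and PowerShell 7 or later for -SkipCertificateCheck; it is an illustration, not an official Pure Storage tool.

    # Minimal sketch: count FlashArray volumes pending eradication (Purity REST 1.x).
    $array    = "flasharray.example.com"   # placeholder array address
    $apiToken = "<api-token>"              # placeholder API token

    # Exchange the API token for a session cookie.
    Invoke-RestMethod -Method Post -Uri "https://$array/api/1.19/auth/session" `
        -Body (@{ api_token = $apiToken } | ConvertTo-Json) `
        -ContentType "application/json" `
        -SessionVariable session -SkipCertificateCheck | Out-Null

    # pending_only=true returns only destroyed volumes awaiting eradication;
    # with SafeMode enabled these accumulate and count against object limits.
    $pending = Invoke-RestMethod -Method Get `
        -Uri "https://$array/api/1.19/volume?pending_only=true" `
        -WebSession $session -SkipCertificateCheck

    Write-Host ("Volumes pending eradication: {0}" -f @($pending).Count)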

Please see the following KBs for more information.

What is the Issue?

Pure Storage has identified an issue that prevents some VASA calls from completing when the FlashArray has SafeMode enabled.  SafeMode, introduced with Purity//FA 5.3, prevents the VASA service from eradicating objects ahead of the eradication timer on the FlashArray; eradication timer adjustment was introduced with 5.3.7.  Customers need to work with Pure Storage Support to enable this feature.

Why does this impact vVols and VASA?  One of the core features of vVols is the automation of storage tasks by VASA.  Some examples of these tasks are volume creation, connection, snapshot creation, volume deletion, snapshot deletion and disconnection.  When SafeMode is enabled, the VASA service is prevented from eradicating volumes on the FlashArray.

There are a few instances in which VASA will automatically eradicate destroyed volumes:

  1. When a vSphere managed snapshot is deleted via vCenter Server, VASA will destroy and eradicate the managed snapshot objects on the FlashArray for the vVol VM.
  2. When a testFailoverReplicationGroupEnd API call is issued to VASA, all test failover volume groups, volumes and snapshots will be destroyed and eradicated on the target FlashArray.  This happens as part of the Test Failover Cleanup job with SRM, or when manually running through a full test failover workflow for a vVols replication group in vRO or with PowerCLI.
  3. When Changed Block Tracking (CBT) is enabled on a vVols based virtual disk (see the sketch after this list), the required VASA calls will fail, which causes errors in building out the block tracking for VMware.  Specifically, the allocatedBitmapVirtualVolume op will fail because VASA automatically copies out and eradicates snapshots/volumes as part of this bitmap scanning process.  Since the eradication fails, the allocated bitmap calls fail.
  4. Similar to when CBT is enabled, VMware will issue allocatedBitmapVirtualVolume VASA calls when a vVols based VM is Storage vMotioned to another array or to a VMFS datastore on the same array.  To satisfy these calls, VASA will copy out snapshots to difference from, then destroy and eradicate them once the differencing is done.  Since the eradication fails, the allocated bitmap call will fail.
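
For reference, here is a minimal PowerCLI sketch of enabling CBT on a VM, the operation that triggers the allocatedBitmapVirtualVolume workflow described in item 3.  The vCenter address is a placeholder and the VM name is borrowed from the log examples later in this KB; this is an illustration, not an official procedure.

    # Minimal PowerCLI sketch: enable CBT on a VM.
    Connect-VIServer -Server vcenter.example.com   # placeholder vCenter

    $vm = Get-VM -Name "test-b-VM-light-0001"      # placeholder VM name

    # CBT is toggled through a VirtualMachineConfigSpec reconfigure.
    $spec = New-Object VMware.Vim.VirtualMachineConfigSpec
    $spec.ChangeTrackingEnabled = $true
    $vm.ExtensionData.ReconfigVM($spec)

    # On a powered-on VM the change takes effect at the next stun/unstun,
    # so a snapshot create/delete cycle is commonly used to apply it.
    New-Snapshot -VM $vm -Name "cbt-enable" | Remove-Snapshot -Confirm:$false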

What is the impact on these workflows when the volume eradication fails?

  1. From vCenter Server, the managed snapshot deletion will succeed and no errors will be seen in vCenter.  However, on the FlashArray the managed snapshot objects will be destroyed and pending eradication.  This means that the object count can increase drastically if there are a lot of managed snapshot creation/deletion processes.  By default, the pending eradication timer is 24 hours, but if the timer has been changed to greater than 24 hours, these pending eradication objects count against the object count for a longer duration.  This can lead to the FlashArray hitting its object count limits due to pending eradication objects still existing on the array.
  2. The testFailoverReplicationGroupEnd call may succeed, but the objects that are expected to be cleaned up may still exist, which increases the object count on the target FlashArray.  At this time, no other failover replication group calls have been observed to fail, but testing of these calls with SafeMode enabled is not complete.  Should any other APIs fail, Pure Storage will update this KB accordingly.
  3. When CBT is enabled for the vVol VM and a managed snapshot is taken, all of the allocated bitmap calls will fail, which prevents VMware from building out the tracking files required for CBT to work correctly.  The managed snapshot will still succeed from the VMware perspective, but CBT will not be correctly or fully enabled.
  4. Storage vMotion from vVols to either VMFS or to vVols on another array will always fail.

What does this look like from a logging perspective in VMware?

In the VM's logging, errors will be found when the DiskLibGetAllocatedSectorChunksInRangeInt call is issued.  This will cause failures to get the allocated sectors for the virtual disk.

The vVol VM's vmware.log
  • Here is what you would see in the VM's vmware.log when the first snapshot is taken after CBT has been enabled.  The log is located on the ESXi host that the VM is powered on and can be found at this path: /vmfs/volumes/vvol-datastore-name/vm-name/vmware.log
    [/vmfs/volumes/FlashArray-B-vVol-DS/test-b-VM-light-0001/vmware.log]
    2021-01-22T18:51:22.503Z| worker-2455929| I125: DISKLIB-CBT   : Initializing ESX kernel change tracking for fid 4669518.
    2021-01-22T18:51:22.503Z| worker-2455929| I125: DISKLIB-CBT   : Successfuly created cbt node 47404e-cbt.
    2021-01-22T18:51:22.503Z| worker-2455929| I125: DISKLIB-CBT   : Opening cbt node /vmfs/devices/cbt/47404e-cbt
    2021-01-22T18:51:22.504Z| worker-2455929| I125: 2455929:VVOLLIB : VVolLib_BlockSizeVVol:6346: Successful
    2021-01-22T18:51:22.504Z| worker-2455929| I125: 2455929:VVOLLIB : VVolLib_GetSoapContext:379: Using 30 secs for soap connect timeout.
    2021-01-22T18:51:22.504Z| worker-2455929| I125: 2455929:VVOLLIB : VVolLib_GetSoapContext:380: Using 200 secs for soap receive timeout.
    2021-01-22T18:51:24.506Z| worker-2455929| I125: 2455929:VVOLLIB : VVolLibConvertSoapFault:1745: client.VvolLibDoRetryCheckAndUpdateError failed with a fault
    2021-01-22T18:51:24.506Z| worker-2455929| I125: 2455929:VVOLLIB : VVolLibConvertVvolStorageFault:1157: Storage Fault STORAGE_FAULT (22): 403: eradication is disabled. /
    2021-01-22T18:51:24.506Z| worker-2455929| I125: OBJLIB-VVOLOBJ : VVolObjGetAllocatedBitmap: Failed GetAllocatedBitmap VVOL rfc4122.e4988b27-7db1-498c-8f86-516da4e9ddd3 error The VVol target encountered a vendor specific error.
    2021-01-22T18:51:24.506Z| worker-2455929| I125: DISKLIB-LIB_MISC   : DiskLibGetAllocatedSectorChunksInRangeInt: failed to get allocated sector bitmap with 'The VVol target encountered a vendor specific error' (150999371).
    2021-01-22T18:51:24.506Z| worker-2455929| W115: DISKLIB-CBT   : ChangeTrackerESX_MarkAllUsedAreas: Failed to get allocated sectors: The VVol target encountered a vendor specific error.
    2021-01-22T18:51:24.506Z| worker-2455929| I125: DISK: Change tracking for disk 'scsi0:0' is now enabled.
    2021-01-22T18:51:24.507Z| worker-2455929| I125: UTIL: Change file descriptor limit from soft 4326,hard 4326 to soft 4441,hard 4441.
    

In the ESXi host's vvold log, the error will appear after an allocatedBitmapVirtualVolume call is issued to the VASA Provider.  The error returned will say that eradication is disabled.

The ESXi host's vvold.log
  • Here is what you would see in the ESXi host's vvold log when the first snapshot is taken after CBT has been enabled.  The log can be found on the ESXi host at this path: /var/log/vvold.log
    [/var/log/vvold.log]
    --> VasaOp::AllocatedBitmapVirtualVolume [#535445]: ===> Issuing 'allocatedBitmapVirtualVolume' to VP sn1-x70-c05-33-ct0 (#outstanding 0/5) [session state: Connected]
    2021-01-22T18:59:07.488Z error vvold[2189541] [Originator@6876 sub=Default] VasaOp::IsSuccessful [#535445]: allocatedBitmapVirtualVolume transient failure: 22 (STORAGE_FAULT / 403: eradication is disabled. / )
    --> VasaOp::DoRetry [#535445]: ===> Transient failure allocatedBitmapVirtualVolume VP (sn1-x70-c05-33-ct0) retry=false, batchOp=false container=de292959-b980-3e58-b1dd-a343d2610cf9 timeElapsed=159 msecs (#outstanding 0)
    --> VasaOp::ThrowFromSessionError [#535445]: ===> FINAL FAILURE allocatedBitmapVirtualVolume, error (STORAGE_FAULT / 403: eradication is disabled. / ) VP (sn1-x70-c05-33-ct0) Container (de292959-b980-3e58-b1dd-a343d2610cf9) timeElapsed=159 msecs (#outstanding 0)
    

Here is where the impact is directly seen after CBT has not been correctly enabled.  When the queryChangedDiskAreas API is issued via VADP, or manually in the MOB or with PowerCLI, the request will fail on any CBT enabled virtual disk.
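
This is a minimal PowerCLI sketch of issuing the call manually; it mirrors the arguments shown in the vpxd log below and assumes the VM (name again a placeholder) has a current snapshot and that deviceKey 2000 is the CBT enabled disk.

    # Minimal PowerCLI sketch: issue queryChangedDiskAreas by hand.
    $vm   = Get-VM -Name "test-b-VM-light-0001"         # placeholder VM name
    $snap = $vm.ExtensionData.Snapshot.CurrentSnapshot  # MoRef of the current snapshot

    # changeId "*" requests all allocated areas since CBT was enabled;
    # with SafeMode enabled this fails with the FileFault shown below.
    $vm.ExtensionData.QueryChangedDiskAreas($snap, 2000, 0, "*")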

The vCenter Server's vpxd service log
  • Here is what is seen in the vCenter vpxd log when queryChangedDiskAreas is issued.  This log file can be found on the vCenter Server at this path: /var/log/vmware/vpxd/vpxd.log
    [/var/log/vmware/vpxd/vpxd.log]
    2021-01-22T20:50:07.517Z info vpxd[05544] [Originator@6876 sub=vpxLro opID=28a7cb11] [VpxLRO] -- BEGIN lro-68734258 -- vm-276802 -- vim.VirtualMachine.queryChangedDiskAreas -- 52bcdd88-f19e-59cd-5aa3-68ec18544b29(523a838a-8fc1-5c53-ad3a-db744fce3686)
    2021-01-22T20:50:07.555Z info vpxd[05544] [Originator@6876 sub=vpxLro opID=28a7cb11] [VpxLRO] -- FINISH lro-68734258
    2021-01-22T20:50:07.555Z info vpxd[05544] [Originator@6876 sub=Default opID=28a7cb11] [VpxLRO] -- ERROR lro-68734258 -- vm-276802 -- vim.VirtualMachine.queryChangedDiskAreas: vim.fault.FileFault:
    --> Result:
    --> (vim.fault.FileFault) {
    -->    faultCause = (vmodl.MethodFault) null,
    -->    faultMessage = (vmodl.LocalizableMessage) [
    -->       (vmodl.LocalizableMessage) {
    -->          key = "vim.hostd.vmsvc.cbt.cannotGetChanges",
    -->          arg = (vmodl.KeyAnyValue) [
    -->             (vmodl.KeyAnyValue) {
    -->                key = "path",
    -->                value = "/vmfs/volumes/vvol:de292959b9803e58-b1dda343d2610cf9/rfc4122.812ccfc2-feae-403e-bd88-bfdfb6be73cc/test-b-VM-light-0001.vmdk"
    -->             },
    -->             (vmodl.KeyAnyValue) {
    -->                key = "reason",
    -->                value = "Unknown change epoch"
    -->             }
    -->          ],
    -->          message = "Cannot compute changes for disk /vmfs/volumes/vvol:de292959b9803e58-b1dda343d2610cf9/rfc4122.812ccfc2-feae-403e-bd88-bfdfb6be73cc/test-b-VM-light-0001.vmdk: Unknown change epoch."
    -->       }
    -->    ],
    -->    file = "/vmfs/volumes/vvol:de292959b9803e58-b1dda343d2610cf9/rfc4122.812ccfc2-feae-403e-bd88-bfdfb6be73cc/test-b-VM-light-0001.vmdk"
    -->    msg = "Received SOAP response fault from [<cs p:00007fea001bb7d0, TCP:ac-esxi-b-13.purecloud.com:443>]: queryChangedDiskAreas
    --> Received SOAP response fault from [<cs p:0000006ee554fbd0, TCP:localhost:8307>]: queryChangedDiskAreas
    --> Error caused by file /vmfs/volumes/vvol:de292959b9803e58-b1dda343d2610cf9/rfc4122.812ccfc2-feae-403e-bd88-bfdfb6be73cc/test-b-VM-light-0001.vmdk"
    --> }
    --> Args:
    -->
    --> Arg snapshot:
    --> 'vim.vm.Snapshot:snapshot-276911'
    --> Arg deviceKey:
    --> 2000
    --> Arg startOffset:
    --> 0
    --> Arg changeId:
    --> "*"
    

What is the impact of queryChangedDiskAreas failing?  This will directly impact incremental backups with vendors such as Veeam and Rubrik.  During an incremental backup, the backup software will still have to scan/read the entire virtual disk.  The transfer size may still be small/incremental depending on the change rate, but the scanned size will be larger than expected.

Here is what the difference looks like for backup jobs that report CBT as enabled on all VMs, first without SafeMode enabled and then with SafeMode enabled.

Here is the first backup taken for this backup job.  The total data read is about 60 GB and the transferred size is about 33 GB.  The full backup job took just under 7 minutes to complete.

[Screenshot: backup job summary for the first full backup]

SafeMode is enabled on the FlashArray that these vVols based VMs are located on.  Here is an incremental backup where the queryChangedDiskAreas calls are failing.

The data read is slightly more than in the first "full" backup.  Additionally, the backup job duration is actually longer than the full/baseline transfer.  The amount transferred is much lower than in the first full backup, though, as only a small amount of changes were made to the VMs.

[Screenshot: incremental backup job summary with SafeMode enabled]

SafeMode has been disabled on the FlashArray for this backup.  Similar changes were made to the VMs as in the previous test.

There is a large decrease in the amount of data read and in how long the backup job took to complete.  This is because the queryChangedDiskAreas API calls succeeded and the backup appliance did not need to read through the entire virtual disk to find the differences and copy them out to the backup target.  This is what a true incremental backup is expected to look like when leveraging CBT with VADP.

[Screenshot: incremental backup job summary with SafeMode disabled]

While the backup jobs succeeded in each case, the time it took to complete the "incremental backup" with SafeMode enabled was much longer than when SafeMode was disabled.


Workarounds or Fixes? 

From the FlashArray and VASA provider perspective, there are currently no short term fixes for VASA being unable to auto-eradicate objects on the FlashArray, and there is no timeline for Pure Storage granting VASA that ability.  A longer term fix is for Pure Storage to change the workflow that VASA uses to satisfy the allocatedBitmapVirtualVolume calls so that it no longer fails when an eradication fails.  Once a Purity and VASA update has been released, this KB will be updated to reflect which release to be on.

From the vSphere perspective, there is no workaround to get CBT truly enabled.  The impact is that CBT will appear to be enabled, but any attempt to query for changed blocks or areas will fail.  The only option is to disable CBT on vVols based VMs if SafeMode is enabled (see the sketch below) and plan accordingly for increased object counts on the FlashArray and longer backup times with Veeam/Rubrik/etc.
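
Disabling CBT uses the same reconfigure pattern as enabling it; here is a minimal PowerCLI sketch, with the VM name again a placeholder.

    # Minimal PowerCLI sketch: disable CBT on a vVols based VM.
    $vm = Get-VM -Name "test-b-VM-light-0001"   # placeholder VM name

    $spec = New-Object VMware.Vim.VirtualMachineConfigSpec
    $spec.ChangeTrackingEnabled = $false
    $vm.ExtensionData.ReconfigVM($spec)

    # As with enabling, a snapshot create/delete cycle applies the change
    # to a powered-on VM.
    New-Snapshot -VM $vm -Name "cbt-disable" | Remove-Snapshot -Confirm:$false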

There is no workaround for Storage vMotions failing.  The only way to correct this is to disable SafeMode and then execute the Storage vMotion, as sketched below.
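
Once SafeMode has been disabled on the FlashArray, the migration can be retried.  A minimal PowerCLI sketch, with a hypothetical target datastore name:

    # Minimal PowerCLI sketch: retry the Storage vMotion after SafeMode is disabled.
    $vm = Get-VM -Name "test-b-VM-light-0001"                          # placeholder VM name
    Move-VM -VM $vm -Datastore (Get-Datastore -Name "VMFS-Target-DS")  # placeholder target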


Recommendations?

The primary recommendation is to upgrade to Purity//FA 6.1.8 or higher.  

Should an upgrade not be possible, there are two recommendations that Pure Storage can make for customers that want SafeMode enabled and are using vVols, or plan to use vVols, while SafeMode is enabled.

The first recommendation applies if SafeMode is required and cannot be disabled.  In this case, please take the following into account and plan accordingly if an upgrade cannot be scheduled promptly.

  • VMs that require CBT to be enabled should be left on a VMFS datastore rather than migrated to vVols.
  • VMs that have CBT enabled but do not require it can remain on vVols, with the understanding that they will not be able to fully leverage incremental backups.
  • Expect that Storage vMotions from vVols to any target will fail while SafeMode is enabled on the FlashArray.
  • Object scale limits count destroyed and pending eradication objects against the limit.

The second recommendation applies if SafeMode is not required and can be disabled if needed.  In particular, SafeMode should be disabled if any of the following are true:

  • vVols based VMs must have CBT enabled.
  • Storage vMotion from vVols to VMFS is needed.
  • FlashArray object limits are close to being reached.
  • Backup Software that leverages VADP and managed snapshots is being used.