Troubleshooting: CBT Enabled vVols VMs Fail QueryChangedDiskAreas when FlashArray SafeMode is Enabled
The CBT compatibility issues with SafeMode and vVols have been addressed in Purity//FA 6.1.3 and higher. SafeMode no longer causes CBT-enabled managed snapshots to fail when building out the tracking, storage vMotions from vVols to VMFS no longer fail, and queryChangedDiskAreas works correctly. However, object counts will be higher if SafeMode is enabled on a FlashArray that uses vVols.
While CBT and SafeMode no longer have these issues, please keep in mind these important points when using SafeMode with vVols:
- Swap and Data-Snap vVols will not be auto-eradicated.
This leads to higher volume object counts in the following events:
- Taking and deleting vSphere Managed Snapshots
- Powering On and Powering Off the same Virtual Machines
- Frequent vSphere vMotions between ESXi Hosts
- In Purity 6.1.x, SafeMode prevents FlashArray Protection Group schedules from being increased beyond the default value. This means that VASA is unable to adjust the Protection Group snapshot or replication schedules. When Storage Policies that have existing replication groups are used, the auto pgroup option will not work.
- Additionally, in Purity 6.1.x, when SafeMode is enabled, vVols that are placed in FlashArray protection groups cannot be removed from those protection groups. VASA is therefore unable to move volumes between protection groups when storage policy replication groups are changed in vSphere. Once a storage policy and replication group are applied to a vVol-based VM, they cannot be changed with SafeMode enabled.
Please see the following KBs for more information
What is the Issue?
Pure Storage has identified an issue that prevents some VASA calls from completing when the FlashArray has SafeMode enabled. SafeMode is a feature that prevents the VASA Service from eradicating objects outside the eradication timer on the FlashArray. It was introduced with Purity//FA 5.3, with eradication timer adjustment introduced in 5.3.7, and customers need to work with support to enable it.
Why does this impact vVols and VASA? One of the core features of vVols is the automation of storage tasks by VASA. Some examples of these tasks are volume creation, connection, snapshot creation, volume deletion, snapshot deletion and disconnects. Enabling SafeMode prevents the VASA Service from eradicating volumes on the FlashArray.
There are a few instances when VASA will automatically eradicate destroyed volumes.
- When a vSphere Managed Snapshot is destroyed via vCenter Server, VASA will destroy and eradicate managed snapshot objects on the FlashArray for the vVol VM.
- When a testFailoverReplicationGroupEnd API call is issued to VASA, all test failover volume groups, volumes and snapshots will be destroyed and eradicated on the target FlashArray. This is part of the Test Failover Cleanup job with SRM, or of manually running through a full test failover workflow for a vVols replication group in vRO or with PowerCLI.
- When Changed Block Tracking (CBT) is enabled on a vVols-based virtual disk, required VASA calls will fail, which causes errors in building out the block tracking for VMware. Specifically, the allocatedBitmapVirtualVolume operation will fail because VASA automatically copies out and then eradicates snapshots/volumes for this bitmap scanning process. Since the eradication fails, the allocated bitmap calls fail.
- Similarly, VMware uses allocatedBitmapVirtualVolume VASA calls when a vVols-based VM is storage vMotioned to another array or to a VMFS datastore on the same array. To satisfy these calls, VASA copies out snapshots to difference against, then destroys and eradicates them once the differencing is done. Since the eradication fails, the allocated bitmap call fails.
What is the impact to these workflows when the volume eradication fails?
- From the vCenter Server, the managed snapshot destruction will succeed and no errors will be seen in vCenter. However, on the FlashArray the managed snapshot objects will be destroyed and pending eradication. This means that the object count can increase drastically if there are many managed snapshot creation/deletion processes. By default, the pending eradication timer is 24 hours; if the timer has been changed to greater than 24 hours, these pending eradication objects count against the object count for a longer duration. This can lead to the FlashArray hitting its object count limits due to pending eradication objects still existing on the array.
- The testFailoverReplicationGroupEnd call may succeed, but the objects that are expected to be cleaned up may still exist, which increases the object count on the target FlashArray. At this time, no other failover replication group calls have been observed to fail, but testing of these calls with SafeMode enabled is not fully complete. Should any other APIs fail, Pure Storage will update this KB accordingly.
- When CBT is enabled for the vVol VM and a managed snapshot is taken, all of the allocated bitmap calls will fail and this will prevent VMware from building out the tracking files required for CBT to work correctly. The managed snapshot will still succeed from the VMware perspective, but CBT will not be correctly enabled or fully enabled.
- Storage vMotion from vVols to either VMFS or vVols on another array will always fail.
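The object count growth described above can be estimated with simple arithmetic. The following is a minimal sketch, assuming objects are destroyed at a steady rate and that each one stays on the array for the full pending eradication timer. The function name and the example rates are hypothetical and are not taken from any Pure Storage tooling.

```python
def pending_eradication_objects(destroyed_per_day: float,
                                eradication_timer_hours: float = 24.0) -> float:
    """Estimate how many destroyed objects are awaiting eradication at steady state.

    Assumption: objects are destroyed at a constant daily rate and each one
    remains on the array for the whole eradication timer before it is removed.
    """
    if destroyed_per_day < 0 or eradication_timer_hours < 0:
        raise ValueError("rate and timer must be non-negative")
    # Steady state: arrival rate multiplied by the time each object lingers.
    return destroyed_per_day * (eradication_timer_hours / 24.0)


# Hypothetical workload: 500 managed snapshot objects destroyed per day.
default_timer = pending_eradication_objects(500, 24)   # 500.0
extended_timer = pending_eradication_objects(500, 72)  # 1500.0
```

With the default 24 hour timer the backlog equals one day of destroyed objects; tripling the timer triples the backlog that counts against the object limit.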
What does this look like from a logging perspective in VMware?
In the VM's logging, errors will be found when the call/request DiskLibGetAllocatedSectorChunksInRangeInt is issued. This will cause failures to get allocated sectors for the virtual disk.
|The vVol VM's vmware.log|
In the ESXi host's vvold log, the error will appear after an allocatedBitmapVirtualVolume call is issued to the VASA Provider. The error returned will say that eradication is disabled.
|The ESXi host's vvold.log|
Here is where the impact is directly seen when CBT has not been correctly enabled. When the queryChangedDiskAreas API is issued via VADP or manually in the MOB, the request will fail on any CBT-enabled virtual disk.
|The vCenter Server's vpxd service log|
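When triaging a collected log bundle, it can help to count how often the calls named above appear in each log before reading the surrounding lines. The following is a minimal sketch that only looks for the three identifiers mentioned in this KB; it does not assume any particular log line format, and the function name is hypothetical. A match only shows that the call was logged, not that it failed, so the matching lines still need to be reviewed.

```python
from collections import Counter
from typing import Iterable, Sequence

# Identifiers named in this KB; matched case-insensitively because logs
# may capitalize the API names differently.
MARKERS = (
    "DiskLibGetAllocatedSectorChunksInRangeInt",  # vmware.log
    "allocatedBitmapVirtualVolume",               # vvold.log
    "queryChangedDiskAreas",                      # vpxd log
)


def count_marker_lines(lines: Iterable[str],
                       markers: Sequence[str] = MARKERS) -> Counter:
    """Count the lines that mention each marker (hypothetical helper)."""
    counts: Counter = Counter()
    for line in lines:
        lowered = line.lower()
        for marker in markers:
            if marker.lower() in lowered:
                counts[marker] += 1
    return counts


# Usage sketch (the path is an example only):
# with open("vmware.log", errors="replace") as handle:
#     print(count_marker_lines(handle))
```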
What is the impact of queryChangedDiskAreas failing? This directly impacts incremental backups with vendors such as Veeam and Rubrik. During an incremental backup, they still have to scan/read the entire virtual disk. The transfer size may still be small/incremental depending on the change rate, but the scanned size will be larger than expected.
Here is what the difference looks like for backup jobs that report CBT as enabled on all VMs, with and without SafeMode enabled on the FlashArray.
|Here is the first backup taken for this backup job. The total data read is about 60 GB and the transferred size is about 33 GB. The full backup job took just under 7 minutes to complete.|
SafeMode is enabled on the FlashArray that these vVol-based VMs are located on. Here is an incremental backup where the queryChangedDiskAreas calls are failing.
The data read is slightly more than the first "full" backup. Additionally, the backup job duration is actually longer than the full/baseline transfer. The amount transferred is much lower than the first full backup though, as only a small amount of changes to the VMs were made.
SafeMode has been disabled on the FlashArray for this backup. Changes similar to those in the previous test were made to the VMs.
There is a large decrease in the amount of data read and how long the backup job took to complete. This is because the queryChangedDiskAreas API calls succeeded and the backup appliance did not need to read through the entire virtual disk to see what the differences were and to copy out the differences to the backup target. This is what a true incremental backup would be expected to look like when leveraging CBT with VADP.
While the backup jobs succeeded in each case, the "incremental backup" took much longer to complete with SafeMode enabled than with SafeMode disabled.
Workarounds or Fixes?
From the FlashArray and VASA provider perspective, there are currently no short-term fixes for VASA being unable to auto-eradicate objects on the FlashArray, and there is no timeline for if or when Pure Storage will allow VASA to do this. A longer-term fix is for Pure Storage to change the workflow that VASA uses to satisfy the allocatedBitmapVirtualVolume calls so that they no longer fail if an eradication workflow fails. Once a Purity and VASA update has been released, this KB will be updated to reflect which release to be on.
From the vSphere perspective, there is no workaround to get CBT truly enabled. The impact is that CBT will appear to be enabled, but any attempt to query for changed blocks or areas will fail. The only option is to disable CBT on vVols-based VMs if SafeMode is enabled, and to plan accordingly for increased object counts on the FlashArray and longer backup times with Veeam/Rubrik/etc.
There is no workaround for storage vMotions failing. The only way to correct this is to disable SafeMode and then execute the storage vMotion.
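Because CBT can appear enabled while the changed-area queries fail, one way to check a given disk is to issue a single query and see whether it completes. The following is a minimal sketch, assuming a pyVmomi `vim.VirtualMachine` object and a snapshot managed object have already been obtained elsewhere. `QueryChangedDiskAreas` is the vSphere API method name; the helper name, the broad exception handling and the returned message format are illustrative assumptions.

```python
from typing import Any, Tuple


def cbt_query_succeeds(vm: Any, snapshot: Any, device_key: int) -> Tuple[bool, str]:
    """Issue one QueryChangedDiskAreas call and report whether it completed.

    Assumptions: `vm` behaves like a pyVmomi vim.VirtualMachine, `snapshot`
    is a snapshot of that VM, and `device_key` is the key of the virtual
    disk to check.
    """
    try:
        # changeId "*" requests the allocated areas of the disk, and
        # startOffset 0 starts the query at the beginning of the disk.
        info = vm.QueryChangedDiskAreas(
            snapshot=snapshot,
            deviceKey=device_key,
            startOffset=0,
            changeId="*",
        )
    except Exception as exc:  # broad on purpose: this is only a probe
        return False, f"{type(exc).__name__}: {exc}"
    areas = len(info.changedArea)
    return True, f"{areas} area(s) reported from offset {info.startOffset}"
```

A `False` result from this probe on a CBT-enabled vVols disk matches the symptom described in this KB, but the message should still be compared against the vpxd and vvold logs to confirm the cause.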
The primary recommendation is to upgrade to Purity//FA 6.1.8 or higher.
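When checking several arrays against this recommendation, comparing version strings as text is error-prone (for example, "6.1.10" sorts before "6.1.8" as a string). The following is a minimal sketch of a numeric comparison, assuming the Purity version is available as a plain dotted numeric string; the function names are hypothetical, and how the version string is retrieved from the array is outside the scope of this example.

```python
from typing import Tuple


def parse_purity_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as '6.1.8' into (6, 1, 8).

    Assumption: the string holds only dot-separated integers; anything
    else raises ValueError.
    """
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(current: str, minimum: str = "6.1.8") -> bool:
    """Return True when `current` is at or above `minimum`."""
    return parse_purity_version(current) >= parse_purity_version(minimum)


# Tuples compare element by element, so 6.1.10 is correctly newer than 6.1.8.
print(meets_minimum("6.1.10"))  # True
print(meets_minimum("6.1.7"))   # False
```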
Should an upgrade not be possible, there are two recommendations that Pure Storage can make for customers that want SafeMode enabled and are using vVols, or plan to use vVols, while SafeMode is enabled.
The first recommendation applies if SafeMode is required and cannot be disabled. In this case, please take the following into account and plan accordingly if an upgrade cannot be scheduled promptly.
- VMs that require CBT to be enabled should be left on a VMFS Datastore rather than migrating to vVols.
- VMs that have CBT enabled, but do not need it to be enabled, can remain on vVols but will not be able to fully leverage incremental backups.
- Expect Storage vMotions from vVols to any target to fail when SafeMode is enabled on the FlashArray.
- Destroyed and pending eradication objects count against the FlashArray object scale limits.
The second recommendation applies if SafeMode is not required and can be disabled if needed. In particular, SafeMode should be disabled if any of the following are true:
- vVols based VMs must have CBT enabled.
- Storage vMotion from vVols to VMFS is needed.
- FlashArray object limits are close to being reached.
- Backup Software that leverages VADP and managed snapshots is being used.