VMware KBs on WSFC
VMware has two useful KBs covering WSFC configuration on two different vSphere versions. Please have the customer validate that the settings described in the KB for their version are configured correctly:
1. Microsoft Windows Server Failover Clustering (WSFC) with shared disks on VMware vSphere 6.x: Guidelines for supported configurations
2. Microsoft Windows Server Failover Clustering (WSFC) with shared disks on VMware vSphere 7.x: Guidelines for supported configurations
When running validation tests on RDM volumes presented directly to a Windows Server guest VM running on ESXi, the validation test fails with SCSI-3 persistent reservation failures. There may be many different failure modes, but as of this writing we are aware of two:
1. When running this test, the array sends a unit attention back down a path to let the path and host know that there is a change; we fail the next operation, forcing the Windows Server host to retry the command 4 times, each time down a different path. If the customer has a Round Robin multipathing policy set up on ESXi to the array and has more than 4 paths to the array, no IO will succeed down any path for the duration of the test. This is not normally a concern because there is almost always other IO going down each path during regular operations. Because the Windows disk is offline during the test and there is no other IO on it, only the test IO goes down each of the paths, so the Windows Server sees nothing but failures down every path.
2. If the customer is using vVols and running the Cluster Validation tests, there is a known issue in ESXi versions earlier than 6.7 U2 (6.7 U1 and lower) that prevents the test from completing successfully.
PR 2250697: Windows Server Failover Cluster validation might fail if you configure Virtual Volumes with a Round Robin path policy
If during the Windows Server Failover Cluster setup you change the default path policy from Fixed or Most Recently Used to Round Robin, the I/O of the cluster might fail and the cluster might stop responding. This issue is resolved in this release.
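The first failure mode can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not a model of the actual array or ESXi code: every path is assumed to hold one pending unit attention after the reservation change, the first command down a path consumes it and fails, and the Windows host is assumed to make one initial attempt plus 4 retries. The class and function names are hypothetical.

```python
class RoundRobinDevice:
    """Toy model of one LUN behind a Round Robin path policy (illustrative only)."""

    def __init__(self, num_paths, iops_per_path=1):
        self.num_paths = num_paths
        self.iops_per_path = iops_per_path      # commands sent down a path before switching
        self.current_path = 0
        self.sent_on_current = 0
        self.ua_pending = [False] * num_paths   # one pending unit attention flag per path

    def reservation_changed(self):
        # Assumption: a reservation change leaves a unit attention pending on every path.
        self.ua_pending = [True] * self.num_paths

    def send(self):
        """Send one command; return True on success, False if it hit a unit attention."""
        if self.sent_on_current >= self.iops_per_path:
            self.current_path = (self.current_path + 1) % self.num_paths
            self.sent_on_current = 0
        self.sent_on_current += 1
        if self.ua_pending[self.current_path]:
            self.ua_pending[self.current_path] = False  # the UA is reported once, then cleared
            return False
        return True


def command_succeeds(num_paths, iops_per_path, retries=4):
    """True if the initial attempt or any of the retries succeeds."""
    device = RoundRobinDevice(num_paths, iops_per_path)
    device.reservation_changed()
    return any(device.send() for _ in range(1 + retries))


if __name__ == "__main__":
    for paths, iops in [(4, 1), (8, 1), (8, 2)]:
        print(f"paths={paths} iops={iops} succeeds={command_succeeds(paths, iops)}")
```

Under these assumptions, 8 paths with an IO count of 1 never lets an attempt reach a path whose unit attention has already been cleared, while an IO count of 2 lets the second attempt reuse the first path and succeed.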
Basic troubleshooting steps
The issue usually manifests itself as a SCSI-3 reservation failure during the Cluster Validation test in the Microsoft Failover Clustering management pane.
If everything looks OK with the VM, get the Cluster Validation log and report from the Windows Server where the validation was run. These are in C:\Windows\Cluster\Reports; the files are ValidateStorage.txt and Validation Report (time the validation test was run). Before running a validation test, it's useful to rename the current ValidateStorage.txt file to something like ValidateStorage.old so that the new text file contains only the information relevant to that test.
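The rename step can be scripted so it is not forgotten between runs. The following is a minimal sketch, assuming Python is available on the Windows Server; the function name and the timestamped file name are illustrative choices (a timestamp is used instead of a plain ValidateStorage.old so that earlier logs are not overwritten).

```python
from datetime import datetime
from pathlib import Path

# Directory named in the text above.
REPORTS_DIR = Path(r"C:\Windows\Cluster\Reports")


def rotate_validation_log(reports_dir=REPORTS_DIR, now=None):
    """Rename ValidateStorage.txt so the next validation run starts with a fresh log.

    Returns the new path, or None if there was no log to rename.
    """
    log = Path(reports_dir) / "ValidateStorage.txt"
    if not log.exists():
        return None
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    target = log.with_name(f"ValidateStorage.{stamp}.old")
    log.rename(target)
    return target


if __name__ == "__main__":
    renamed = rotate_validation_log()
    print(f"Renamed to {renamed}" if renamed else "No ValidateStorage.txt found")
```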
1. If the customer is not using vVols and is getting this error, they might need to change the number of IOs sent down each path by Round Robin from 1 to 2 and run the test again. If they will be running this test often, they can leave the value at 2. If they only plan to run the test once, they can change the value back to 1 or leave it at 2. Please open a Jira and validate that they are hitting the same issue as we see in PURE-140182.
2. We recommend that the customer be on ESXi 6.7 U2 or higher.
3. In 5.3.6+, we've also fixed PURE-148458:
Enhances handling of unit attentions for SCSI persistent reservations (PURE-148458)
Now properly establishes unit attention conditions for SCSI persistent reservations to prevent unit attentions from being returned for initiator-target (I_T) nexuses that should have not returned, especially during Microsoft Cluster Service cluster validation tests.
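For step 1, the Round Robin IO count is typically changed per device with esxcli on the ESXi host. The helper below is a sketch that only builds the command line so it can be reviewed before being run; the function name and the device identifier in the example are placeholders, and the esxcli syntax should be confirmed against the customer's ESXi version.

```python
import shlex
from typing import List


def round_robin_iops_command(device_id: str, iops: int) -> List[str]:
    """Build the esxcli argv that sets the Round Robin IO count for one device.

    device_id is the device identifier as shown by ESXi; the caller supplies it.
    """
    if iops < 1:
        raise ValueError("iops must be at least 1")
    return [
        "esxcli", "storage", "nmp", "psp", "roundrobin", "deviceconfig", "set",
        f"--device={device_id}",
        "--type=iops",
        f"--iops={iops}",
    ]


if __name__ == "__main__":
    # "naa.EXAMPLE" is a placeholder, not a real device identifier.
    print(shlex.join(round_robin_iops_command("naa.EXAMPLE", 2)))
```

Running the same helper with `iops=1` produces the command to revert the setting after a one-off validation run.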