ES-70342 - Example Jira
- Host lost MPIO path during failover or controller reboot
- File system health of one or more of the Cluster Disks was affected
- Event 5120 was triggered on the CSV layer, accompanied by other errors such as Event 5142
I/O packets (IRPs) in the resubmit queue are stamped with the physical disk state both when they are enqueued and when they are dequeued. If a path to the disk fails, the IRP is placed on the resubmit queue. When the disk or path comes back online, the IRP is dequeued and dispatched to the disk. To prevent an IRP from being requeued to the same failed path indefinitely, the driver compares the IRP's enqueued and dequeued physical disk states and issues the IRP only if they differ; otherwise the IRP is discarded.
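The stamp-and-compare check can be sketched as a small model. This is an illustrative Python simulation, not the MPIO driver code: the `Irp` class, the helper functions, and the state names `STATE_NORMAL` and `STATE_FAILED` are hypothetical and exist only to show the comparison.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical state names for illustration; the real driver's values may differ.
STATE_NORMAL = "NORMAL"
STATE_FAILED = "FAILED"


@dataclass
class Irp:
    name: str
    enqueued_state: Optional[str] = None
    dequeued_state: Optional[str] = None


def enqueue(queue: List[Irp], irp: Irp, disk_state: str) -> None:
    """Stamp the IRP with the current disk state and park it on the resubmit queue."""
    irp.enqueued_state = disk_state
    queue.append(irp)


def dequeue_and_dispatch(queue: List[Irp], disk_state: str) -> str:
    """Stamp the IRP again on dequeue; issue it only if the disk state changed."""
    irp = queue.pop(0)
    irp.dequeued_state = disk_state
    if irp.enqueued_state != irp.dequeued_state:
        return "dispatched"
    return "discarded"
```

In this model, an IRP queued while the disk state is `STATE_FAILED` is dispatched if the state is `STATE_NORMAL` at dequeue time, and discarded if the state is unchanged.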
When a disk is resumed from the throttled state but immediately put back into it, the IRP's enqueued and dequeued physical disk states are both set to MPIO_STATE_NORMAL. The previous design had a dedicated state for throttling, with which this check did not fail. In the latest set of changes, the dedicated state MPIO_STATE_THROTTLED was removed, but the code still compared the physical disk states before dispatching. Because the enqueued and dequeued physical disk states are now identical (MPIO_STATE_NORMAL), the packet is never dispatched to the disk and keeps getting dropped.
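The effect of removing the dedicated throttled state can be illustrated with a simplified model. This is a sketch under stated assumptions, not the driver's logic: the names `MPIO_STATE_NORMAL` and `MPIO_STATE_THROTTLED` come from this note, but their string values and the functions `reported_state` and `is_dispatched` are hypothetical.

```python
# State names are taken from the note; the string values are placeholders.
MPIO_STATE_NORMAL = "MPIO_STATE_NORMAL"
MPIO_STATE_THROTTLED = "MPIO_STATE_THROTTLED"


def reported_state(throttled: bool, has_throttled_state: bool) -> str:
    """State stamped on an IRP. Without a dedicated throttled state,
    a throttled disk is assumed to report MPIO_STATE_NORMAL."""
    if throttled and has_throttled_state:
        return MPIO_STATE_THROTTLED
    return MPIO_STATE_NORMAL


def is_dispatched(throttled_at_enqueue: bool,
                  throttled_at_dequeue: bool,
                  has_throttled_state: bool) -> bool:
    """Apply the enqueued-vs-dequeued comparison: dispatch only if the stamps differ."""
    enqueued = reported_state(throttled_at_enqueue, has_throttled_state)
    dequeued = reported_state(throttled_at_dequeue, has_throttled_state)
    return enqueued != dequeued


# Previous design: the stamps differ, so the IRP is issued.
old_design = is_dispatched(True, False, has_throttled_state=True)
# After the state was removed: both stamps are MPIO_STATE_NORMAL, so the IRP is dropped.
regressed = is_dispatched(True, False, has_throttled_state=False)
```

In this model, the same throttle-and-resume sequence is dispatched when the dedicated state exists and dropped once it has been removed, because the comparison can no longer tell the two stamps apart.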
Corruption may be detected on one or more files and can be fixed with CHKDSK. This kind of corruption is rare and difficult to track. In this case (ES-70342), however, Microsoft can see that the customer's MPIO version is not up to date and may be exposing the servers to a regression introduced in the version they are using (10.0.14393.2097), which has been available since 02/2018.
The KB containing the driver code fix was published in 05/2019. Microsoft recommends installing at least KB4499177 or a newer update to get the fix.