A 25-second pause in I/O during a single link loss with iSCSI
A question frequently asked is: “Why do I see a pause in I/O to back-end storage when I have multiple paths available and only a single path was taken offline?”
You have taken the time to set up an environment in a redundant and resilient manner, and then suddenly a single path drops and I/O isn’t serviced for 20-35 seconds.
So let’s dig into this. We will discuss the following things throughout this post:
- What are ‘No Operation’ (NOP) PDUs?
- What happens if NOPs are not responded to?
- What is the iSCSI Recovery Timeout value?
- How can I lower the I/O pause if 20-35 seconds is just too high?
NOTE: While we will be referencing VMware vSphere-specific values in this post, the concepts are the same for other operating systems. Their parameters may have slightly different names, but the functionality is the same.
What are ‘No Operation’ (NOP) PDUs?
In the world of iSCSI, initiators and targets need a way to verify that connections and/or sessions are still active and functioning. That is what a NOP is: essentially a “ping” or “echo” request at the iSCSI layer that does just that, confirming everything is in a healthy state. Either the target or the initiator can perform this operation, as it is important for both to be up to date on the status of connectivity.
There are two different types of NOP requests:
- NOP-In: a No Operation request sent from the target to the initiator.
- NOP-Out: a No Operation request sent from the initiator to the target.
Each of the above can be either a request or a response to the other’s request. Remember, either a target or an initiator can send these requests, and each must be responded to. So if a target sends a NOP-In request, it expects a NOP-Out response to ensure all is well, and the same applies from initiator to target.
The next logical questions here seem to be, “how often can a NOP request be issued,” “how long does the responder have to reply to the request,” and “what happens if the requestor doesn’t get a response?”
Let’s start with the first question: how often can a target or initiator issue a NOP request? While there are other reasons for a NOP request to be issued, it is usually timer-based, triggered when the link is considered “idle.” Again, this is not a hard and fast rule, but it is generally what you will see. Within ESXi this interval is controlled by the “NoopOutInterval” parameter and is set to 15 seconds by default.
Next, how long does the responder have to reply? This is controlled by the NOP-Out timeout value (NoopOutTimeout) in ESXi and is 10 seconds by default. This means that once the NOP-Out request is sent to the target, the initiator will wait 10 seconds for a response before any action is taken. Ten seconds is generous these days and more than ample time to wait for a response.
This leaves us with the final question: what happens if the requestor doesn’t get a response? If no response is received, the initiator (or target) assumes something is wrong with the current session, and the session is torn down in an attempt to recover. Upon a successful tear-down, iSCSI attempts to recover the session; if recovery is unsuccessful, the failure is reported back to the SCSI layer, which works with the Native Multipathing Plugin (NMP) to determine where to retry the outstanding I/O requests.
An iSCSI session recovery is deemed failed after 10 seconds by default in ESXi; this is the RecoveryTimeout value, which can be set as low as 1 second or as high as 120 seconds.
There are some exceptions where uni-directional NOPs would be used, but those are beyond the scope of this document.
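All three of these timers can be inspected on an ESXi host with esxcli. Here is a quick sketch; the adapter name vmhba64 is only an example, so substitute the name of your own iSCSI adapter:

    # Find the name of your iSCSI adapter
    esxcli iscsi adapter list

    # Show the adapter's iSCSI parameters, including NoopOutInterval,
    # NoopOutTimeout, and RecoveryTimeout (vmhba64 is an example name)
    esxcli iscsi adapter param get -A vmhba64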
Putting it all together
Now that we have a basic understanding of how iSCSI liveness is determined, let’s put it together and see why a single cable pull, a back-end storage array reboot, a switch being taken down, etc., results in this lengthy pause.
Conceptually, this isn’t difficult to understand: based on what we know, we can do some pretty simple math. If we pull a cable and an iSCSI session is unexpectedly lost, the following events transpire:
- A NOP-Out request is sent from the initiator to the target port (up to 15 seconds after the loss, depending on where in the NoopOutInterval it occurs).
- The initiator waits for a response from the target (10 seconds, the NoopOutTimeout).
- No response is received from the target, so session recovery kicks in (10 seconds, the RecoveryTimeout).
- Session recovery fails, and the outstanding I/O is handed to the SCSI layer to be failed over and sent down the remaining healthy paths.
So we have: 15 + 10 + 10 = up to 35 seconds of paused I/O.
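To make the math reusable, here is a minimal shell sketch; the variable names are ours, not ESXi’s, and the values are the ESXi defaults, so you can plug in tuned values to see their effect:

    # Worst-case I/O pause = NoopOutInterval + NoopOutTimeout + RecoveryTimeout
    NOOP_OUT_INTERVAL=15
    NOOP_OUT_TIMEOUT=10
    RECOVERY_TIMEOUT=10
    echo "Worst-case pause: $((NOOP_OUT_INTERVAL + NOOP_OUT_TIMEOUT + RECOVERY_TIMEOUT)) seconds"
    # Prints: Worst-case pause: 35 seconds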
I have found that on average with ESXi the paused I/O time is roughly 25-28 seconds, but I have seen occasions where 30+ seconds was experienced and caused unwanted behavior in applications.
So why does all I/O pause for that amount of time when a single path is lost? Since applications are often reliant on previous I/O completing, the initiator pauses all I/O to the target, down all paths, until the state of the suspect path(s) and all pending I/Os is determined. There is a chance the path could recover and I/O is just completing slower than normal, so the initiator doesn’t want to prematurely fail or retry until it is certain things are down. After all, recovering an environment prematurely can cause just as many problems. That is what the timers are for: they dictate safe windows for specific actions to be taken.
How can I lower the I/O pause if 20-35 seconds is just too high?
With all of the information above, it is pretty easy at this point to make an educated decision on how to lower the overall pause in I/O when a link (or multiple links) is unexpectedly lost.
Speaking specifically to VMware ESXi, the options are fairly limited, as one of the three configurable values is already set to its minimum by default:
    Name             Current  Default  Min  Max  Settable  Inherit
    ---------------  -------  -------  ---  ---  --------  -------
    NoopOutInterval       15       15    1   60  true      false
    NoopOutTimeout        10       10   10   30  true      true
    RecoveryTimeout       10       10    1  120  true      true
Since the NoopOutTimeout value is already set to its lowest available value (10 seconds), that leaves two values we can modify to lower the length of time I/O is paused: NoopOutInterval and RecoveryTimeout. Modifying these two values will help lower failover times and get I/O back up and running more quickly.
What is most effective for the majority of environments and what we have tested
For the majority of environments, we have found that the defaults work just fine. Since ESXi is able to queue I/O and abstract some of the underlying storage problems away from VMs and applications, environments often come through this I/O pause unscathed. More often than not, the concern arises because somebody saw the pause in I/O and worried something was impacted or was going to be impacted. So don’t panic if you see this pause; simply assess your environment and verify whether everything survived. If it did, great! You know these values work for you.
If, on the off chance, it didn’t survive, there are luckily ways to fix this. Below are the values we have found to be safe for environments that needed slightly better failover times, followed by a sketch of how to apply them:
- NoopOutInterval: 5 seconds
- NoopOutTimeout: 10 seconds
- RecoveryTimeout: 10 seconds
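Given the defaults shown in the table above, only NoopOutInterval actually has to change to land on these values; the other two are already at 10 seconds. A sketch of applying them with esxcli (again, vmhba64 is an example adapter name):

    # Lower the NOP-Out interval from 15 to 5 seconds
    esxcli iscsi adapter param set -A vmhba64 -k NoopOutInterval -v 5

    # NoopOutTimeout and RecoveryTimeout default to 10, but they can be
    # set explicitly the same way
    esxcli iscsi adapter param set -A vmhba64 -k NoopOutTimeout -v 10
    esxcli iscsi adapter param set -A vmhba64 -k RecoveryTimeout -v 10

    # Verify the result
    esxcli iscsi adapter param get -A vmhba64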
This would give you a maximum I/O pause of 25 seconds, but we have found that it often results in 15-16 seconds of failover time, which has been more than sufficient for the customers we have worked with in times past.
Remember that modifying these values requires two things:
1. Restarting the iSCSI sessions and/or rebooting the ESXi host for the change to take effect (see the sketch after this list).
2. Thorough testing in your own environment to ensure the change both meets your needs and doesn’t cause additional unwanted side effects.
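One way to restart the sessions without a full host reboot, sketched with esxcli (vmhba64 remains an example adapter name, and this is best done during a maintenance window):

    # Log out the adapter's iSCSI sessions so they are re-established
    # with the new parameter values
    esxcli iscsi session remove -A vmhba64

    # Rescan the adapter; sessions are re-created during the rescan
    esxcli storage core adapter rescan -A vmhba64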
While the values above have worked well for us and for customers we have worked with in the past, that doesn’t guarantee they will work for you. The most important thing is to ensure they work for your environment.