The question of the day: Can (or should) I use Adaptive Queueing (AQDA) or Storage I/O Control (SIOC) when QoS Rate Limiting is enabled on a FlashArray volume or vgroup?
We will not go in depth on everything under the covers, but we will provide enough of an overview that you understand what each technology is trying to accomplish.
Note: If you are not familiar with queue depth, then reading my colleague's blog post called Understanding VMware ESXi Queueing and the FlashArray will help you understand why we reference queue depth below.
What is the Adaptive Queue Depth Algorithm (AQDA)?
The first VMware feature we want to discuss today is called “Adaptive Queueing”, also known as the “Adaptive Queue Depth Algorithm”. Throughout this article we will refer to this feature as AQDA.
AQDA was first introduced in ESXi 3.5U4 (over 12 years ago in 2008!). This technology was released to help overloaded storage arrays (or storage ports) by throttling how much I/O is allowed to be sent at once. AQDA accomplishes this task by dynamically changing (reducing) the queue depth on a per datastore level for any device that is reporting congestion.
By reducing the queue depth on the impacted datastores, AQDA limits the amount of outstanding I/O that can be sent to the storage array. In theory, slowing down incoming I/O requests should give the array time to “catch up” to the workload. If nothing else, it should at least produce more consistent bandwidth, throughput, and latency, which means there “should be” less impact to the environment. The “should be” is in quotes because there are obviously some situations in which it may not help, or may not help enough. Hopefully you get the general idea of its purpose by this point, though.
One of the most important takeaways with AQDA is that the congestion has to be reported to the ESXi host from the storage array before it kicks in. It is not simply based off of latency. The storage array informs the ESXi host of congestion by sending specific SCSI Status Codes back to the host as I/O requests are received.
The specific SCSI codes required to trigger AQDA are:
- TASK_SET_FULL (0x28)
- BUSY (0x8)
As the ESXi host monitors the I/O responses, it determines which datastore is reporting the congestion and keeps a count of how many of those responses are tagged as BUSY or TASK_SET_FULL. Once a specific number of these congestion responses has been received (we’ll say 16), the ESXi host begins to reduce the queue depth and throttle the I/O being sent to the impacted datastore.
AQDA does this by cutting the queue in half each time as congestion continues to be reported (e.g., 64 > 32 > 16 > 8 > 4 > 2 > 1). We have confirmed through testing that queue depth can indeed drop all the way down to 1 with AQDA. This only happened in the most extreme of circumstances, though, when QoS was set very low on the volume and a very high workload was triggered.
As time goes on, and the storage array is able to successfully complete I/O requests, the ESXi host keeps track of these completions (GOOD status, 0x00). Again, once a specific number of successful completions has been received from storage (we’ll say 8), it slowly starts to increase the queue depth back toward its original value from before the congestion event. Queue slots are returned one at a time; the queue does not double on the way up the way it halves on the way down. So if the queue dropped from 64 down to 32, AQDA would need a total of 256 successful I/O completions (32 slots * 8 completions per slot) to get back to the original queue depth of 64.
Important Note: If TASK_SET_FULL or BUSY responses are received while the queue is being replenished, then the queue slots will stop filling until there are 8 successful completions in a row. If the threshold for BUSY or TASK_SET_FULL responses is met, then not only will the queue stop increasing but it will begin to decrease again to slow the flow of I/O.
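To make the halving and slot-by-slot recovery described above a little more concrete, here is a minimal toy model in Python. It is only a sketch of the behavior as described in this post, not VMware's implementation: the class name, method name, and the way the counters reset are all assumptions made for illustration, and the values 16 and 8 are the same "we'll say" example values used above.

```python
GOOD = 0x00
BUSY = 0x08
TASK_SET_FULL = 0x28


class AdaptiveQueue:
    """Toy model (hypothetical): halve on congestion, regain one slot at a time."""

    def __init__(self, max_depth=64, sample_size=16, threshold=8):
        self.max_depth = max_depth
        self.depth = max_depth
        self.sample_size = sample_size  # congestion responses needed before halving
        self.threshold = threshold      # consecutive GOOD responses per slot regained
        self.congested = 0
        self.good_streak = 0

    def on_status(self, status):
        """Feed one SCSI status into the model and return the resulting depth."""
        if status in (BUSY, TASK_SET_FULL):
            # Any congestion response breaks the run of successful completions.
            self.good_streak = 0
            self.congested += 1
            if self.congested >= self.sample_size:
                # Assumption: the congestion counter restarts after each halving.
                self.depth = max(1, self.depth // 2)
                self.congested = 0
        elif status == GOOD:
            self.good_streak += 1
            if self.good_streak >= self.threshold and self.depth < self.max_depth:
                self.depth += 1
                self.good_streak = 0
        return self.depth


q = AdaptiveQueue()
for _ in range(16):
    q.on_status(TASK_SET_FULL)
depth_after_congestion = q.depth  # 32: one halving from 64

completions = 0
while q.depth < q.max_depth:
    q.on_status(GOOD)
    completions += 1
# completions is now 256, matching the 32 * 8 arithmetic above
```

The asymmetry is the point of the sketch: 16 congestion responses are enough to lose 32 slots, while winning them back takes 256 clean completions.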
How do you enable this feature?
Adaptive Queue Depth is disabled by default. So in order to enable this feature you need to change the following settings from “0” to your preferred values:
– QFullSampleSize – The number of TASK_SET_FULL or BUSY responses needed before AQDA kicks in and begins to throttle.
Maximum value that should be set here is: 64.
– QFullThreshold – The number of GOOD (successful completion) responses needed before AQDA begins to increase the queue depth, continuing until it is back to its original value.
Maximum value that can be set here is: 16.
Enabling this feature does not require a reboot.
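As a rough illustration of what setting these values could look like, the Python sketch below only builds (it does not run) a per-device esxcli command. The helper name and the placeholder device ID naa.EXAMPLE are hypothetical, and the esxcli flags are written as I understand them from VMware KB 1008113, so please confirm them against the KB for your ESXi version before using anything like this. The values 16 and 8 are simply the example numbers used earlier in this post, not recommendations.

```python
MAX_SAMPLE_SIZE = 64  # upper bound quoted above for QFullSampleSize
MAX_THRESHOLD = 16    # upper bound quoted above for QFullThreshold


def build_aqda_command(device, sample_size, threshold):
    """Return an esxcli argument list that sets the AQDA values for one device.

    Hypothetical helper: it validates the ranges described in this post and
    assembles the arguments. Passing 0 for both values corresponds to
    disabling the feature.
    """
    if not 0 <= sample_size <= MAX_SAMPLE_SIZE:
        raise ValueError(f"sample_size must be between 0 and {MAX_SAMPLE_SIZE}")
    if not 0 <= threshold <= MAX_THRESHOLD:
        raise ValueError(f"threshold must be between 0 and {MAX_THRESHOLD}")
    return [
        "esxcli", "storage", "core", "device", "set",
        "--device", device,
        # Flag names assumed from KB 1008113; verify before use.
        "--queue-full-sample-size", str(sample_size),
        "--queue-full-threshold", str(threshold),
    ]


cmd = build_aqda_command("naa.EXAMPLE", 16, 8)
print(" ".join(cmd))
```

Building the argument list separately from running it makes it easy to review the exact command for every host in the cluster before anything is changed.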
How do you disable this feature?
You can disable AQDA by simply setting both the QFullSampleSize and QFullThreshold back to a value of “0”.
Disabling this feature does not require a reboot.
Important Note: One issue that we have been able to reproduce in our environment (on 6.7U3) is that disabling AQDA may result in the ESXi host setting the queue depth to a value of “1”. This means that once you disable AQDA you may notice that your I/O throughput and latency get worse! If you were to look in ‘esxtop’ you would see the queue depth (DQLEN) reporting a value of “1” rather than the default of (likely) 64 or 32.
In order to resolve this you can either:
- Re-enable AQDA and then disable it again immediately after.
- Reboot the ESXi host
Obviously, re-enabling and disabling again is quicker and less impactful to the environment. The only caution to note here is that I don’t know whether any other underlying problems are created as a result. I personally haven’t noticed any issues, but if you prefer to take the safe route, then a reboot may be best for your environment.
If you decide that AQDA should be used in your environment it is very important that you enable it on ALL hosts within the cluster. If you do not then you could make things even worse during times of congestion.
For more information you can review the following VMware KB:
Controlling LUN queue depth throttling in VMware ESX/ESXi (1008113)
What is Storage I/O Control (SIOC)?
Storage I/O Control (referred to as SIOC moving forward) is a feature created by VMware that is also triggered during times of increased congestion. This is actually quite a powerful feature (if used properly) and we could spend quite a long time explaining every detail. Since this is more about helping you understand what it does and what conditions trigger this feature, we won’t give you all the details, but will provide links to places you can go further into detail if you so desire.
Some people may ask, “If we already have AQDA, why create another feature for congestion?” This is a fair and reasonable question, with an answer that is relatively simple in concept but a little trickier under the hood.
We know that AQDA is engaged only when the storage array says, at a SCSI level,
“Hey, I am a little overloaded over here… mind slowing down your I/O requests so I can catch up?”
Once that signal is received the ESXi host begins to reduce the queue and slow things down until the signal received from storage says,
“Okay, all caught up”,
at which point the queue slots are slowly replenished.
This is a great feature, right? But what about times the storage array isn’t reporting problems because it isn’t responsible for the congestion? Or maybe the storage array simply doesn’t send those notifications? What if congestion is happening within the environment due to other limitations? I know we all love to blame storage, but storage isn’t always the problem! Other things that could cause congestion are an oversaturated network, failing switches and/or paths, a rogue VM (or VMs), and the list goes on… and this is where SIOC comes into play and really shines.
Unlike AQDA, SIOC is engaged at times of increased latency on individual datastores rather than waiting for SCSI codes from an array. This means that SIOC doesn’t have to wait for storage to say there is a problem before it is able to kick in. It can engage and start trying to mitigate (or lessen the impact of) the problem without that signal.
In current versions of ESXi there are two options available for determining when SIOC should engage:
– Percentage of Peak Throughput – This option indicates the estimated latency threshold when the datastore is using that percentage of its estimated peak throughput. (Default is 90%)
– Manual – This allows the end-user to manually determine what the latency should be before SIOC engages (Default: 30ms)
Once engaged, SIOC works very similarly to adaptive queueing (with slight differences), in the sense that it slows the rate of I/O by adaptively changing (reducing) the queue depth of the impacted datastore. As soon as latency subsides then the queue depth will be increased to allow for full rate of exchange between the hosts and storage.
Okay, great, but if AQDA and SIOC both work by reducing queues, then what more is SIOC providing beyond simply engaging on increased latency rather than on SCSI-triggered events?
In a single (important) word: fairness. The best part about SIOC is that it takes into account all of the VMs residing on the impacted datastore and is cluster aware, which means it doesn’t matter which ESXi hosts are housing the VMs running on the impacted datastore. This means that when I/O throttling does need to be enforced, SIOC will ensure that every VM has the same opportunity to send I/O to the storage array. This will ensure that no single VM can “steal” all of the storage queues during the time of lessened storage resources (unless you specifically configure it to do so).
This is something that AQDA does not offer and is one of the big differentiators between the two.
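One way to picture that fairness is as a proportional split of whatever queue slots remain once throttling kicks in. The Python sketch below is a deliberately simplified illustration and not how SIOC is actually implemented: the function name, the VM names, and the share values are all made up for the example.

```python
def split_queue_by_shares(total_slots, vm_shares):
    """Divide total_slots among VMs in proportion to their share values.

    Hypothetical illustration only; the real SIOC scheduler is more involved.
    """
    total_shares = sum(vm_shares.values())
    if total_shares <= 0:
        raise ValueError("at least one VM needs a positive share value")
    # Whole-slot allocation first, rounding down.
    alloc = {vm: total_slots * s // total_shares for vm, s in vm_shares.items()}
    # Hand any leftover slots to the VMs with the largest fractional remainder.
    leftover = total_slots - sum(alloc.values())
    by_remainder = sorted(
        vm_shares,
        key=lambda vm: (total_slots * vm_shares[vm]) % total_shares,
        reverse=True,
    )
    for vm in by_remainder[:leftover]:
        alloc[vm] += 1
    return alloc


# Equal shares: every VM gets the same opportunity, regardless of which host it is on.
equal = split_queue_by_shares(16, {"vm-a": 1000, "vm-b": 1000, "vm-c": 1000, "vm-d": 1000})

# A VM explicitly configured with more shares gets a larger slice.
weighted = split_queue_by_shares(16, {"vm-a": 2000, "vm-b": 1000, "vm-c": 1000})
```

With equal shares each of the four VMs ends up with 4 of the 16 slots; in the weighted case vm-a gets 8 and the other two get 4 each, which is the "unless you specifically configure it to do so" case mentioned above.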
Several other key distinctions about SIOC should be noted:
- SIOC statistics are evaluated every 4 seconds to determine what action should be taken based on the latency and incoming I/O.
- SIOC uses the average latency of a datastore across all connected hosts. This means that if a single host is seeing higher latency for some reason, but all the others are not, SIOC will not engage unless the average still exceeds the latency threshold.
- SIOC will not lower the queue depth below a value of 4.
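The averaging behavior and the queue depth floor from the list above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function names and host names are hypothetical, the average here is a plain unweighted mean (the real calculation may well weight hosts differently), and 30 ms is the manual default mentioned earlier.

```python
from statistics import mean

MIN_QUEUE_DEPTH = 4  # SIOC floor quoted in the list above


def sioc_should_engage(host_latencies_ms, threshold_ms=30.0):
    """Return True when the datastore-wide average latency exceeds the threshold.

    Assumption: a simple unweighted mean across all connected hosts.
    """
    return mean(host_latencies_ms.values()) > threshold_ms


def clamp_depth(requested_depth):
    """Never let a throttled queue depth fall below the SIOC floor."""
    return max(MIN_QUEUE_DEPTH, requested_depth)


# One slow host, three healthy ones: average is 15 ms, so SIOC stays out of the way.
one_slow_host = {"esx-1": 45.0, "esx-2": 5.0, "esx-3": 6.0, "esx-4": 4.0}
engage_one_slow = sioc_should_engage(one_slow_host)

# Latency is high across the board: average is 35 ms, so SIOC engages.
all_slow = {"esx-1": 45.0, "esx-2": 40.0, "esx-3": 35.0, "esx-4": 20.0}
engage_all_slow = sioc_should_engage(all_slow)
```

Note the contrast with AQDA here: in this post's testing AQDA was seen dropping a queue all the way to 1, whereas clamp_depth(2) in this sketch would return 4.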
As stated previously, this is a somewhat simplistic view of what SIOC does under the covers. There is more complexity around ensuring proper shares are set for more critical VMs, how it can be tuned, etc., and that can be found in the resources listed below (and many more, I am sure).
- Storage I/O Control, the basics – Duncan Epping Blog Post
- vSphere 6.5 what’s new – Storage IO Control – Duncan Epping Blog Post
- Managing I/O Resources – VMware Official Documentation
Quality of Service (QoS) on the Pure Storage FlashArray
Phew, now that we got through the sections on AQDA and SIOC we can talk a little bit about the FlashArray and why those distinctions above were necessary!
Before we go into discussions about this I want to make a quick clarification about the two different types of QoS that are available on the FlashArray:
- QoS Fairness (Always on QoS) – Array wide setting that ensures the FlashArray itself is not overrun by all attached hosts / initiators. This isn’t configured by end-users and is simply “Always On” as the name indicates.
- QoS Rate Limiting – This is configurable by end-users and is enabled on individual volumes or volume groups. Depending on the version of Purity these restrictions can be set based on bandwidth or IOPS limits.
We will be specifically referring to QoS Rate Limiting here but some of the concepts will apply to both. If you are new to QoS Rate Limiting, and want to learn more, you can read this Pure Storage KB for additional information.
When a bandwidth or IOPS limit is set on a FlashArray volume or volume group, it is up to the FlashArray to enforce it and to inform the connected host(s) that the limit has been reached.
So how does the FlashArray do that? Well, it does it the way we discussed up above in the adaptive queue depth section, by sending SCSI code 0x28 (TASK_SET_FULL) responses to the host(s) to inform them that a limit has been reached. The FlashArray continues to send these responses as long as the host(s) attempting to send I/O are exceeding the configured limit.
One of the challenges here, specifically for VMFS datastores, is that the FlashArray has absolutely no way to specify which VMs or I/O requests should be prioritized over another. This is because the FlashArray has no real insight into that VMFS datastore and thus no way to make an educated decision on what should have priority. In this sense, the FlashArray handles I/O with a “first come, first served” mindset: it processes what it can and rejects the rest with a TASK_SET_FULL response so the host can retry the request. This means it is up to the initiator (host) sending the data to determine what should be processed first and send that data accordingly.
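The “first come, first served” behavior can be sketched in Python as follows. This is purely illustrative and is not how Purity enforces limits internally: the function name, the one-second window, and the VM labels are assumptions made for the example. The only details taken from this post are that requests over the limit are answered with TASK_SET_FULL (0x28) and that the array cannot tell VMs on a VMFS datastore apart.

```python
GOOD = 0x00
TASK_SET_FULL = 0x28


def apply_iops_limit(requests, iops_limit):
    """Return (request, status) pairs for one second's worth of arrivals.

    Hypothetical model: requests are served strictly in arrival order, and
    everything past the limit is rejected so the host can retry it later.
    """
    results = []
    for position, request in enumerate(requests):
        status = GOOD if position < iops_limit else TASK_SET_FULL
        results.append((request, status))
    return results


# Six I/Os from a busy VM happen to arrive ahead of two from a critical VM.
# The labels are for the reader only; the array sees one undifferentiated stream.
arrivals = ["busy-vm"] * 6 + ["critical-vm"] * 2
outcome = apply_iops_limit(arrivals, iops_limit=5)

rejected = [request for request, status in outcome if status == TASK_SET_FULL]
```

In this run the first five busy-vm requests complete and everything else, including both critical-vm requests, is rejected, which is exactly why the prioritization has to happen on the host side.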
With vVols, QoS on the FlashArray is more granular and, in theory, simpler. Since each VMDK is a standalone volume on the FlashArray, you can not only determine which VMs should be throttled, but you can even determine which VMDK(s) you want throttled for specific VMs. This can be done by setting individual volume restrictions (VMDK level) or vgroup restrictions (VM level), depending on the level of granularity you need.
Now that we have a better understanding of what each of these features provides, we should be able to more accurately answer our original question depending on your situation:
Can (or should) I use Adaptive Queueing (AQDA) or Storage I/O Control (SIOC) when QoS Rate Limiting is enabled on a FlashArray volume or vgroup?