Virtual Machine and Guest Configuration
This section reviews recommended settings and configurations for virtual machines and their guest operating systems. In general, refer to VMware recommendations for configuration of virtual guests, but Pure Storage does have some additional recommendations for certain situations.
As always, configure guest operating systems in accordance with the corresponding vendor installation guidelines.
Virtual Disk Choice
Storage provisioning in virtual infrastructure involves multiple steps of crucial decisions. VMware vSphere offers three virtual disks formats: thin, zeroedthick and eagerzeroedthick.
To quickly review the types:
- Thin—thin virtual disks only allocate what is used by the guest. Upon creation, thin virtual disks only consume one block of space. As the guest writes data, new blocks are allocated on VMFS, then zereod out, then the data is committed to storage. Therefore there is some additional latency for new write
- Zeroedthick (lazy)— zeroed thick virtual disks allocate all of the space on the VMFS upon creation. As soon as the guest writes to a specific block for the first time in the virtual disk, the block is first zeroed, then the data is committed. Therefore there is some additional latency for new writes. Though less than thin (since it only has to zero—not also allocate), there is a negligible performance impact between zeroedthick (lazy) and thin.
- Eagerzeroedthick—eagerzeroedthick virtual disks allocate all of their provisioned size upon creation and also zero out the entire capacity upon creation. This type of disk cannot be used until the zeroing is complete. Eagerzeroedthick has zero first-write latency penalty because allocation and zeroing is done in advance, and not on-demand.
Prior to WRITE SAME support, the performance differences between these virtual disk allocation mechanisms were distinct. This was due to the fact that before an unallocated block could be written to, zeroes would have to be written first causing an allocate-on-first-write penalty (increased latency). Therefore, for every new block written, there were actually two writes; the zeroes then the actual data. For thin and zeroedthick virtual disks, this zeroing was on-demand so the penalty was seen by applications. For eagerzeroedthick, it was noticed during deployment because the entire virtual disk had to be zeroed prior to use. This zeroing caused unnecessary I/O on the SAN fabric, subtracting available bandwidth from “real” I/O.
To resolve this issue, VMware introduced WRITE SAME support. WRITE SAME is a SCSI command that tells a target device (or array) to write a pattern (in this case, zeros) to a target location. ESXi utilizes this command to avoid having to actually send a payload of zeros but instead simply communicates to any array that it needs to write zeros to a certain location on a certain device. This not only reduces traffic on the SAN fabric, but also speeds up the overall process since the zeros do not have to traverse the data path.
This process is optimized even further on the Pure Storage FlashArray. Since the array does not store space-wasting patterns like contiguous zeros on the array, the zeros are discarded and any subsequent reads will result in the array returning zeros to the host. This additional array-side optimization further reduces the time and penalty caused by pre-zeroing of newly-allocated blocks.
With this knowledge, choosing a virtual disk is a factor of a few different variables that need to be evaluated. In general, Pure Storage makes the following recommendations:
- Lead with thin virtual disks. They offer the greatest flexibility and functionality and the performance difference is only at issue with the most sensitive of applications.
- For highly-sensitive applications with high performance requirements, eagerzeroedthick is the best choice. It is always the best-performing virtual disk type.
- In no situation does Pure Storage recommend the use of zeroedthick (thick provision lazy zeroed) virtual disks. There is very little advantage to this format over the others and can also lead to stranded space as described in this post.
With that being said, for more details on how these recommendations were decided upon, refer to the following considerations. Note that at the end of each consideration is a recommendation but that recommendation is valid only when only that specific consideration is important. When choosing a virtual disk type, take into account your virtual machine business requirements and utilize these requirements to motivate your design decisions. Based on those decisions, choose the virtual disk type that is best suitable for your virtual machine.
- Performance—with the introduction of WRITE SAME (more information on WRITE SAME can be found in the section Block Zero or WRITE SAME) support, the performance difference between the different types of virtual disks is dramatically reduced—almost eliminated. In lab experiments, a difference can be observed during writes to unallocated portions of a thin or zeroedthick virtual disk. This difference is negligible but of course still non-zero. Therefore, performance is no longer an overridingly important factor in the type of virtual disk to use as the disparity is diminished, but for the most latency-sensitive of applications eagerzeroedthick will always be slightly better than the others. Recommendation: eagerzeroedthick.
- Protection against space exhaustion—each virtual disk type, based on its architecture, has varying degrees of protection against space exhaustion. Thin virtual disks do not reserve space on the VMFS datastore upon creation and instead grow in 1 MB blocks as needed. Therefore, if unmonitored, as one or more thin virtual disks grow on the datastore, they could exhaust the capacity of the VMFS. Even if the underlying array has plenty of additional capacity to provide. If careful monitoring is in place that provides the ability to make proactive resolution of capacity exhaustion (moving the virtual machines around or grow the VMFS) thin virtual disks are a perfectly acceptable choice. Storage DRS is an excellent solution for space exhaustion prevention. While careful monitoring can protect against this possibility, it can still be of a concern and should be contemplated upon initial provisioning. Zeroedthick and eagerzeroedthick virtual disks are not susceptible to VMFS logical capacity exhaustion because the space is reserved on the VMFS upon creation. Recommendation: eagerzeroedthick.
- Virtual disk density—it should be noted that while all virtual disk types take up the same amount of physical space on the FlashArray due to data reduction, they have different requirements on the VMFS layer. Thin virtual disks can be oversubscribed (more capacity provisioned than the VMFS reports as being available) allowing for far more virtual disks to fit on a given volume than either of the thick formats. This provides a greater virtual machine to VMFS datastore density and reduces the number or size of volumes that are required to store them. This, in effect, reduces the management overhead of provisioning and managing additional volumes in a VMware environment. Recommendation: thin.
- Time to create—the virtual disk types also vary in how long it takes to initially create them. Since thin and zeroedthick virtual disks do not zero space until they are actually written to by a guest they are both created in trivial amounts of time—usually a second or two. Eagerzeroedthick disks, on the other hand, are pre-zeroed at creation and consequently take additional time to create. If the time-to-first-IO is paramount for whatever reason, thin or zeroedthick is best. Recommendation: thin.
- Space efficiency—the aforementioned bullet on “virtual disk density” describes efficiency on the VMFS layer. Efficiency on the underlying array should also be considered. In vSphere 6.0, thin virtual disks support guest-OS initiated UNMAP to a virtual disk, through the VMFS and down to the physical storage. Therefore, thin virtual disks can be more space efficient as time wears on and data is written and deleted. For more information on this functionality in vSphere 6.0, refer to the section, In-Guest UNMAP in ESXi 6.x, that can be found later in this paper. Recommendation: thin.
- Storage usage trending—A useful metric to know and track is how much capacity is actually being used by a virtual machine guest. If you know how much space is being used by the guests, and furthermore, at what rate that is growing, you can more appropriately size and project storage allocations. Since thick type virtual disks reserve all of the space on the VMFS whether or not the guest has used it, it is difficult to know, without guest tools, how much the guest has actually written. Often it is not known until the application has used its available space and the administrator requests more. This leads to abrupt and unplanned capacity increases. Thin virtual disks only reserve what the guest has written, therefore will grow as the guest adds more data. This growth can be monitored and trended. This will allow VMware administrators to plan and predict future storage needs. Recommendation: thin.
BEST PRACTICE: Use thin virtual disks for most virtual machines. Use eagerzeroedthick for virtual machines that require very high performance levels.
Do not use zeroedthick.
No virtual disk option quite fits all possible use-cases perfectly, so choosing an allocation method should generally be decided upon on a case-by-case basis. VMs that are intended for short term use, without extraordinarily high performance requirements, fit nicely with thin virtual disks. For VMs that have higher performance needs eagerzeroedthick is a good choice.
Virtual Hardware Configuration
Pure Storage makes the following recommendations for configuring a virtual machine in vSphere:
Virtual SCSI Adapter—the best performing and most efficient virtual SCSI adapter is the VMware Paravirtual SCSI Adapter. This adapter has the best CPU efficiency at high workloads and provides the highest queue depths for a virtual machine—starting at an adapter queue depth of 256 and a virtual disk queue depth 64 (twice what the LSI Logic can provide by default). The queue limits of PVSCSI can be further tuned, please refer to the Guest-level Settings section for more information. The virtual NVMe adapter is supported by both Pure and VMware, but at this time there is no significant benefit to its use over PVSCSI. In the future, that will likely change, but as of ESX 7.0 U1 the recommendation (not requirement though) remains PVSCSI.
Virtual Hardware—it is recommended to use the latest virtual hardware version that the hosting ESXi hosts supports.
VMware tools—in general, it is advisable to install the latest supported version of VMware tools in all virtual machines.
CPU and Memory - provision vCPUs and memory as per the application requirements.
VM encryption—vSphere 6.5 introduced virtual machine encryption which encrypts the VM’s virtual disk from a VMFS perspective. Pure Storage generally recommends not using this and instead relying on FlashArray-level Data-At-Rest-Encryption. Though, if it is necessary to leverage VM Encryption, doing so is fully supported by Pure Storage—but it should be noted that data reduction will disappear for that virtual machine as host level encryption renders post-encryption deduplication and compression impossible.
IOPS Limits—if you want to limit a virtual machine or a particular amount of IOPS, you can use the built-in ESXi IOPS limits. ESXi allows you to specify a number of IOPS a given virtual machine can issue for a given virtual disk. Once the virtual machine exceeds that number, any additional I/Os will be queued. In ESXi 6.0 and earlier this can be applied via the “Edit Settings” option of a virtual machine.
In ESXi 6.5 and later, this can also be configured via a VM Storage Policy:
BEST PRACTICE: Use the Paravirtual SCSI adapter for virtual machines for best performance.
In general, template configuration is no different than virtual machine configuration. Standard recommendations apply. That being said, since templates are by definition frequently copied, Pure Storage recommends putting copies of the templates on FlashArrays that are frequent targets of virtual machines deployed from a template. If the template and target datastore are on the same FlashArray, the copy process can take advantage of VAAI XCOPY, which greatly accelerates the copy process while reducing the workload impact of the copy operation.
BEST PRACTICE: For the fastest and most efficient virtual machine deployments, place templates on the same FlashArray as the target datastore.
Prior to Full Copy (XCOPY) API support, when virtual machines needed to be copied or moved from one location to another, such as with Storage vMotion or a virtual machine cloning operation, ESXi would issue many SCSI read/write commands between the source and target storage location (the same or different device). This resulted in a very intense and often lengthy additional workload to this set of devices. This SCSI I/O consequently stole available bandwidth from more “important” I/O such as the I/O issued from virtualized applications. Therefore, copy or move operations often had to be scheduled to occur only during non-peak hours in order to limit interference with normal production storage performance. This restriction effectively decreased the ability of administrators to use the virtualized infrastructure in the dynamic and flexible nature that was intended.
The introduction of XCOPY support for virtual machine movement allows for this workload to be offloaded from the virtualization stack to almost entirely onto the storage array. The ESXi kernel is no longer directly in the data copy path and the storage array instead does all the work. XCOPY functions by having the ESXi host identify a region of a VMFS that needs to be copied. ESXi describes this space into a series of XCOPY SCSI commands and sends them to the array. The array then translates these block descriptors and copies/moves the data from the described source locations to the described target locations. This architecture therefore does not require the moved data to be sent back and forth between the host and array—the SAN fabric does not play a role in traversing the data. This vastly reduces the time to move data. XCOPY benefits are leveraged during the following operations:
- Virtual machine cloning
- Storage vMotion
- Deploying virtual machines from template
During these offloaded operations, the throughput required on the data path is greatly reduced as well as the ESXi hardware resources (HBAs, CPUs etc.) initiating the request. This frees up resources for more important virtual machine operations by letting the ESXi resources do what they do best: run virtual machines, and lets the storage do what it does best: manage the storage.
On the Pure Storage FlashArray, XCOPY sessions are exceptionally quick and efficient. Due to FlashReduce technology (features like deduplication, pattern removal and compression) similar data is never stored on the FlashArray more than once. Therefore, during a host-initiated copy operation such as with XCOPY, the FlashArray does not need to copy the data—this would be wasteful. Instead, Purity simply accepts and acknowledges the XCOPY requests and creates new (or in the case of Storage vMotion, redirects existing) metadata pointers. By not actually having to copy/move data, the offload duration is greatly reduced. In effect, the XCOPY process is a 100% inline deduplicated operation. A non-VAAI copy process for a virtual machine containing 50 GB of data can take on the order of multiple minutes or more depending on the workload on the SAN. When XCOPY is enabled this time drops to a matter of a few seconds.
XCOPY on the Pure Storage FlashArray works directly out of the box without any configuration required. Nevertheless, there is one simple configuration change on the ESXi hosts that will increase the speed of XCOPY operations. ESXi offers an advanced setting called the MaxHWTransferSize that controls the maximum amount of data space that a single XCOPY SCSI command can describe. The default value for this setting is 4 MB. This means that any given XCOPY SCSI command sent from that ESXi host cannot exceed 4 MB of described data. Pure Storage recommends leaving this at the default value, but does support increasing the value if another vendor requires it to be. There is no XCOPY performance impact of increasing this value. Decreasing the value from 4 MB can slow down XCOPY sessions somewhat and should not be done without guidance from VMware or Pure Storage support. For this reason, Pure Storage recommends setting the transfer size to the maximum value of 16 MB.
In general, standard operating system configuration best practices apply and Pure Storage does not make any overriding recommendations. So, please refer to VMware and/or OS vendor documentation for particulars of configuring a guest operating system for best operation in VMware virtualized environment.
That being said, Pure Storage does recommend two non-default options for file system configuration in a guest on a virtual disk residing on a FlashArray volume. Both configurations provide automatic space reclamation support. While it is highly recommended to follow these recommendations, it is not absolutely required.
- For Linux guests in vSphere 6.5 or later using thin virtual disks, mount filesystems with the discard option
- For Windows 2012 R2 or later guests in vSphere 6.0 or later using thin virtual disks, use a NTFS allocation unit size of 64K
Refer to the in-guest space reclamation section for a detailed description of enabling these options.
-  Note that there are VMware-enforced caveats in certain situations that would prevent XCOPY behavior and revert to legacy software copy. Refer to VMware documentation for this information at www.vmware.com.
High-IOPS Virtual Machines
As mentioned earlier, the Paravirtual SCSI adapter should be leveraged for the best default performance. For virtual machines that host applications that need to push a large amount of IOPS (50,000+) to a single virtual disk, some non-default configurations are required. The PVSCSI adapter allows the default adapter queue depth limit and the per-device queue depth limit to be increased from the default of 256 and 64 (respectively) to 1024 and 256.
In general, this change is not needed and therefore not recommended for most workloads. Only increase these values if you know a virtual machine needs or will need this additional queue depth. Opening this queue for a virtual machine that does not (or should not) need it, can expose noisy neighbor performance issues. If a virtual machine has a process that unexpectedly becomes intense it can unfairly steal queue slots from other virtual machines sharing the underlying datastore on that host. This can then cause the performance of other virtual machines to suffer.
BEST PRACTICE: Leave virtual machine queue depth limits at the default unless performance requirements dictate otherwise.
If an application does need to push a high amount of IOPS to a single virtual disk these limits must be increased. See VMware KB here for information on how to configure Paravirtual SCSI adapter queue limits. The process slightly differs between Linux and Windows.
Refer to this blog post for more information:
A few general recommendations:
- Only increase these limits when needed
- If you change this limit it is required to change queue depth limits in ESXi as well, otherwise changing these values will have no tangible affect
- A good rule of knowing if you need to change these values is if you are not getting the IOPS you expect and the latency is high in the guest, but not reported as high in ESXi or on the FlashArray volume