# FlashArray //C performance testing expectations vs. FlashArray //X

FlashArray //X and FlashArray //C were designed with different performance expectations. While there are many common elements, FlashArray //C was never designed to meet the needs of Tier1 applications that demand low latency and high throughput.

FlashArray //C was designed to provide customers with large-capacity arrays that deliver consistent, but higher, latency than FlashArray //X. The higher latency profile of the system is due to modifications made to accommodate native QLC NAND while still providing high reliability.

Since the platforms were designed for different use cases, they should not typically be compared directly; however, we have found that some customers want to understand the differences between the systems when testing.

The purpose of this document is to explain the performance expectations of FlashArray //C compared to FlashArray //X and to demonstrate how to design a test that achieves maximum throughput.

When measuring throughput, customers tend to focus on IOPS, but IOPS is only half of the throughput equation: IOPS x block size = throughput. Maximizing the IOPS for a given block size therefore maximizes the throughput.
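As a quick sketch of that arithmetic (in Python, using the 32K block size and the //C60 IOPS estimate that appear later in this document):

```python
# Throughput = IOPS x block size.
iops = 84_000                          # estimated //C60 IOPS at 32K (see table below)
block_size_bytes = 32 * 1024           # 32K block size

throughput_mb_s = iops * block_size_bytes / (1024 ** 2)
print(f"{throughput_mb_s:.0f} MB/s")   # → 2625 MB/s
```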

A system's capabilities are determined by the processing power, latency, and the throughput capabilities of the architecture. Each system has a distinct number of IOPS that it can process for a specific block size. But what it takes to reach that number may differ between systems based on the latency profile.

Let’s compare the IOPS for a 70/30 (read/write) workload at a 32K block size with a 1:1 data reduction ratio (DRR) for FlashArray //C60 vs. FlashArray //X50R3. The following are the estimated values:

| Product | IOPS | Read latency | Write latency |
| --- | --- | --- | --- |
| FlashArray //C60 | 84,000 | 3.5ms - 4.0ms | 3.0ms - 3.5ms |
| FlashArray //X50R3 | 120,000 | 0.800ms - 0.900ms | 0.400ms - 0.600ms |

Based on the IOPS alone, one could expect FlashArray //C60 to perform at ~70% of what a FlashArray //X50R3 would, and that expectation holds true, if there are sufficient IO threads sent to each array.

The key number to compare is latency. Latency is the time it takes for an IO between a host and the array to complete, and it directly impacts the number of IOPS per thread: the more time each IO takes, the fewer IOs can complete in one second.

Comparing the read latencies, one would expect a single thread on FlashArray //C60 to deliver ~23% of the IOPS of a thread on FlashArray //X50R3 (0.800ms / 3.5ms ≈ 0.23).
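A minimal check of that ratio (per-thread IOPS scales inversely with latency; the latencies are the estimates from the table above):

```python
# Per-thread IOPS is inversely proportional to latency, so the
# //C60 : //X50R3 per-thread ratio is the inverse latency ratio.
x_read_latency_ms = 0.800   # FlashArray //X50R3 read latency
c_read_latency_ms = 3.5     # FlashArray //C60 read latency

per_thread_ratio = x_read_latency_ms / c_read_latency_ms
print(f"{per_thread_ratio:.0%}")   # → 23%
```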

Let’s take a look at an example of this behavior.

Using vdbench at 80% of max IOPS with a 32K block size, a 70/30 r/w mix, 4 LUNs, and 8 threads per LUN on a FlashArray //C, one could expect results similar to this:

```
Sep 30, 2020   interval   i/o      MB/sec   bytes  read   resp   read   write  read   write  resp    queue  cpu%   cpu%
                          rate     1024**2  i/o    pct    time   resp   resp   max    max    stddev  depth  sys+u  sys
17:05:15.061   avg_2-30   9136.9   285.53   32768  69.78  3.490  3.830  2.704  28.30  6.28   0.816   31.9   14.8   8.1
17:05:15.760   Vdbench execution completed successfully. Output directory: /vdbench/output
```
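For reference, a vdbench parameter file for a run of this shape might look like the sketch below. The device paths are placeholders, and because the exact 80%-of-max rate cap used above is not given here, this sketch uses `iorate=max`; treat all values as assumptions, not as the configuration of the run above:

```
* 4 LUNs, 8 threads each, 32K blocks, 70/30 read/write, random access.
sd=sd1,lun=/dev/sdb,threads=8,openflags=o_direct
sd=sd2,lun=/dev/sdc,threads=8,openflags=o_direct
sd=sd3,lun=/dev/sdd,threads=8,openflags=o_direct
sd=sd4,lun=/dev/sde,threads=8,openflags=o_direct
wd=wd1,sd=sd*,xfersize=32k,rdpct=70,seekpct=100
rd=run1,wd=wd1,iorate=max,elapsed=300,interval=1
```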

Latency = queue depth / IO rate (Little's Law)

The total IO rate for this is 9136.9 IO/s.

The queue depth (8 threads* 4 LUNs) = 32

32 / 9136.9 IO/s ≈ 0.0035s

0.0035s * 1000ms/s = 3.5ms latency
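That arithmetic can be checked directly (a quick sketch using the numbers from the run above):

```python
# Little's Law: latency = queue depth / IO rate.
io_rate = 9136.9        # total IO/s from the vdbench run
queue_depth = 8 * 4     # 8 threads * 4 LUNs = 32

latency_ms = queue_depth / io_rate * 1000
print(f"{latency_ms:.1f} ms")   # → 3.5 ms
```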

Given the relationship between latency, IO rate and queue depth, we should be able to predict the IOPS if we know the latency and the queue depth.

If Latency = queue depth/IO Rate

Then IO rate = queue depth / Latency

For the same workload on //X, let's assume a latency of ~0.800ms. What is the expected IOPS with a queue depth of 32?

First, convert milliseconds into seconds: 0.800ms / 1000ms/s = 0.000800s

IO rate = 32/0.000800s

IO rate = 40,000 IOPS
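The same rearranged formula as a sketch:

```python
# Rearranged Little's Law: IO rate = queue depth / latency.
queue_depth = 32
latency_s = 0.800 / 1000           # assumed //X read latency, 0.800 ms

io_rate = queue_depth / latency_s
print(f"{io_rate:.0f} IOPS")       # → 40000 IOPS
```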

9,136.9 IOPS is ~23% of 40,000 IOPS. This matches the per-thread latency difference between the platforms.

But the maximum IOPS comparison shows that FlashArray //C60 can do ~70% of what //X can do. How can this be achieved?

The Maximum IOPS for a system is reached by having enough concurrent IO threads to reach that number. For FlashArray //C60 it will take more threads (queue depth) because of the increased latency. Given the IOPS and latency we can determine how many threads are required.

Queue depth = IO rate * latency

FlashArray //C60 IO rate = 84,000 latency = 3.5ms

3.5ms/1000ms/s = 0.0035s

Queue depth = 84,000 IO/s * 0.0035s

Queue depth ~ 294

FlashArray //X IO rate = 120,000 latency = 0.800ms

0.800ms/1000ms/s = 0.0008s

Queue depth = 120,000 IO/s * 0.0008s

Queue depth ~ 96

This means that if you were testing with 4 LUNs, you would need ~24 threads per LUN (queue depth of 96) to reach the peak IOPS for FlashArray //X50R3 and ~74 threads per LUN (queue depth of ~294) to reach the peak IOPS for FlashArray //C60.
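The queue-depth arithmetic above can be wrapped into a small helper (a sketch; the peak IOPS and latency figures are the estimates from earlier in this document):

```python
# Queue depth required to sustain peak IOPS: queue depth = IO rate * latency.
def required_queue_depth(peak_iops: float, latency_ms: float) -> float:
    return peak_iops * (latency_ms / 1000)

luns = 4
for name, peak_iops, latency_ms in [
    ("FlashArray //C60", 84_000, 3.5),
    ("FlashArray //X50R3", 120_000, 0.800),
]:
    qd = required_queue_depth(peak_iops, latency_ms)
    print(f"{name}: queue depth ~{qd:.0f} (~{qd / luns:.0f} threads per LUN)")
```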