Pure Technical Services

FlashArray //C performance testing expectations vs. FlashArray //X


 

FlashArray //X and FlashArray //C were designed with different performance expectations.  While there are many common elements, FlashArray //C was never designed to meet the needs of Tier1 applications that demand low latency and high throughput.   

 

FlashArray //C was designed to provide customers with large-capacity arrays that deliver consistent, but higher, latency.  The higher latency profile of the system is due to modifications made to accommodate native QLC NAND while still providing high reliability.

 

Since the platforms were designed for different use cases, they should not typically be compared directly, but we have found that some customers want to understand the differences between the systems when testing.

 

The purpose of this document is to explain the performance expectations of FlashArray //C compared to //X and demonstrate how to fashion a test that achieves maximum throughput.

 

When measuring throughput, customers tend to focus on IOPS, but IOPS is only half of the throughput equation: throughput = IOPS x block size.  For a given block size, maximizing the IOPS maximizes the throughput.
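As a minimal sketch of the throughput equation, the following Python snippet converts an IO rate and block size into MiB/s. The function name and the 10,000 IOPS input are illustrative assumptions, not figures for either array.

```python
def throughput_mib_per_s(iops: float, block_size_bytes: int) -> float:
    """Throughput in MiB/s = IOPS x block size, scaled from bytes to MiB."""
    return iops * block_size_bytes / (1024 ** 2)


# Illustrative input only: 10,000 IOPS at a 32 KiB block size.
print(throughput_mib_per_s(10_000, 32 * 1024))  # 312.5
```

Doubling either the IO rate or the block size doubles the result, which is why IOPS figures are only comparable at the same block size.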

 

A system's capabilities are determined by the processing power, latency, and throughput capabilities of the architecture.  Each system has a maximum number of IOPS that it can process for a specific block size, but what it takes to reach that number may differ between systems based on the latency profile.

 

Let’s compare the IOPS for a 70/30 (r/w) workload @ 32k block size with a 1:1 DRR for FlashArray //C60 vs FlashArray //X50R3.  The following are the estimated values:

 

Product               IOPS       Read latency         Write latency
FlashArray //C60      84,000     3.5ms - 4.0ms        3.0ms - 3.5ms
FlashArray //X50R3    120,000    0.800ms - 0.900ms    0.400ms - 0.600ms

 

Based on the IOPS, one could expect FlashArray //C60 to perform at ~70% of what a FlashArray //X50R3 would.  That expectation holds, but only if sufficient IO threads are sent to each array.

 

The key number to compare is the latency value.  Latency is the time it takes for an IO between a host and the array to complete.  This time directly determines the number of IOPS each thread can achieve: the longer an IO takes, the fewer IOs a thread can complete in 1 second.

 

Comparing the read latency, one would expect a single thread against FlashArray //C60 to achieve ~22% of the IOPS of a single thread against FlashArray //X50R3.
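The per-thread comparison can be sketched as follows, assuming each thread issues one IO at a time and waits for it to complete. The function name is hypothetical, and the inputs are the low ends of the read latency ranges in the table above.

```python
def iops_per_thread(latency_ms: float) -> float:
    """IOs one synchronous thread can complete per second at a given latency."""
    return 1000.0 / latency_ms


c60 = iops_per_thread(3.5)   # FlashArray //C60 read latency, low end of range
x50 = iops_per_thread(0.8)   # FlashArray //X50R3 read latency, low end of range

# Per-thread IOPS for each array, then the //C60 figure as a share of //X50R3.
print(f"{c60:.0f} {x50:.0f} {c60 / x50:.1%}")
```

With these inputs the ratio comes out a little under a quarter, in line with the ~22% figure.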

Let’s take a look at an example of this behavior.  

 

Using vdbench at 80% of max IOPS for a 32K block size and a 70/30 r/w mix, with 4 LUNs and 8 threads per LUN on a FlashArray //C, one could expect results similar to this:

Sep 30, 2020    interval        i/o   MB/sec   bytes   read     resp     read    write     read    write     resp  queue  cpu%  cpu%
                               rate  1024**2     i/o    pct     time     resp     resp      max      max   stddev  depth sys+u   sys
17:05:15.061    avg_2-30     9136.9   285.53   32768  69.78    3.490    3.830    2.704    28.30     6.28    0.816   31.9  14.8   8.1

17:05:15.760 Vdbench execution completed successfully. Output directory: /vdbench/output 

 

Latency =  queue depth/IO rate

 

The total IO rate for this is 9136.9 IO/s.

 

The queue depth (8 threads* 4 LUNs) = 32

 

32 / 9136.9 IO/s = ~0.0035 s

~0.0035 s * 1000 ms/s = ~3.5 ms latency
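The same arithmetic can be checked with a short Python sketch. The function name is an assumption for illustration; the inputs are the thread count, LUN count, and IO rate from the vdbench run above.

```python
def latency_ms(queue_depth: int, io_rate: float) -> float:
    """Latency in milliseconds = queue depth / IO rate, converted from seconds."""
    return queue_depth / io_rate * 1000.0


threads_per_lun = 8
luns = 4
queue_depth = threads_per_lun * luns  # 32 outstanding IOs

print(round(latency_ms(queue_depth, 9136.9), 2))  # 3.5
```

The result agrees closely with the 3.490 ms average response time that vdbench reported for the run.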

 

Given the relationship between latency, IO rate and queue depth, we should be able to predict the IOPS if we know the latency and the queue depth.  

 

If  Latency = queue depth/IO Rate

 

Then IO rate = queue depth / Latency

 

For the same workload with //X, let's assume a latency of ~0.800ms.  What is the expected IOPS with a queue depth of 32?

 

First convert milliseconds into seconds: 0.800ms / 1000ms/s = 0.000800s

 

IO rate = 32/0.000800s

 

IO rate = 40,000 IOPS

 

9,100 IOPS is ~ 22% of 40,000 IOPS.  This matches the latency differences between the platforms.
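A minimal sketch of this prediction, assuming the queue is kept full for the whole run. The function name is hypothetical; 0.8 ms is the latency assumed for //X above and 9136.9 is the measured //C IO rate.

```python
def predicted_iops(queue_depth: int, latency_ms: float) -> float:
    """IO rate = queue depth / latency, with latency converted to seconds."""
    return queue_depth / (latency_ms / 1000.0)


x_iops = predicted_iops(32, 0.8)
print(round(x_iops))                 # 40000
print(f"{9136.9 / x_iops:.1%}")      # 22.8%
```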

 

 

But the maximum IOPS comparison between the two shows that FlashArray //C60 can do ~70% of what //X can do.  How can this be achieved?

 

The Maximum IOPS for a system is reached by having enough concurrent IO threads to reach that number.  For FlashArray //C60 it will take more threads (queue depth) because of the increased latency.  Given the IOPS and latency we can determine how many threads are required.

 

Queue depth = IO rate * latency

 

FlashArray //C60  IO rate = 84,000  latency = 3.5ms

 

3.5ms/1000ms/s = 0.0035s

Queue depth = 84,000 IO/s * 0.0035s

Queue depth ~ 294 (or ~319 at 3.8ms, within the 3.5ms - 4.0ms read latency range above)

 

FlashArray //X  IO rate = 120,000 latency = 0.800ms

0.800ms/1000ms/s = 0.0008s

Queue depth = 120,000 IO/s * 0.0008s

Queue depth ~ 96

 

This means that if you were testing with 4 LUNs, you would need ~24 threads per LUN on the initiator to reach the peak IOPS for FlashArray //X50R3 and ~80 threads per LUN on the initiator to reach the peak IOPS for FlashArray //C60.
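The sizing calculation can be sketched in Python as follows. The function names are hypothetical, and the 3.8 ms row is an assumed value inside the 3.5ms - 4.0ms read latency range quoted for //C60; it is included because it yields a queue depth of ~319 and ~80 threads per LUN, while 3.5 ms yields ~294.

```python
import math


def required_queue_depth(io_rate: float, latency_ms: float) -> float:
    """Queue depth = IO rate * latency, with latency converted to seconds."""
    return io_rate * latency_ms / 1000.0


def threads_per_lun(queue_depth: float, luns: int) -> int:
    """Threads needed on each LUN, rounded up to a whole thread.

    The small tolerance guards against floating-point noise pushing an
    exact result up to the next integer.
    """
    return math.ceil(queue_depth / luns - 1e-9)


cases = [
    ("FlashArray //C60", 84_000, 3.5),
    ("FlashArray //C60", 84_000, 3.8),   # assumed mid-range latency
    ("FlashArray //X50R3", 120_000, 0.8),
]

for name, iops, lat_ms in cases:
    qd = required_queue_depth(iops, lat_ms)
    print(f"{name} @ {lat_ms} ms: queue depth ~{qd:.0f}, "
          f"~{threads_per_lun(qd, 4)} threads per LUN with 4 LUNs")
```

Because the required queue depth scales linearly with latency, the thread count for //C60 depends heavily on where in its latency range the array is operating during the test.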