Storage Performance Basics for Deep Learning

Introduction

When production systems are not delivering expected levels of performance, root-causing the issue can be challenging and time-consuming. This is especially true in today’s complex environments, where a workload comprises many software components and libraries and relies on virtually all of the underlying hardware subsystems (CPU, memory, disk IO, network IO) to deliver maximum throughput.

In the last several years we have seen a huge upsurge in a relatively new type of workload that is evolving rapidly and becoming a key component of business-critical computing: Artificial Intelligence (AI) derived from deep learning (DL). NVIDIA GPU technology has been the technology of choice for running computationally intensive deep learning workloads across virtually all vertical market segments. The software ecosystem built on top of NVIDIA GPUs and NVIDIA’s CUDA architecture is experiencing unprecedented growth, driving a steady increase in deep learning deployments in the enterprise data center.

The complexity of the workloads plus the volume of data required to feed deep-learning training creates a challenging performance environment. Deep learning workloads cut across a broad array of data sources (images, binary data, etc), imposing different disk IO load attributes, depending on the model and a myriad of parameters and variables. Minimizing potential stalls while pulling data from storage becomes essential to maximizing throughput. Especially in GPU-driven environments running DL jobs, where AI derived from batch training workloads is processed to drive realtime decision making, ensuring a steady flow of data from the storage subsystem into the GPU jobs is essential for enabling optimal and timely results.

Given the complexity of these environments, collecting baseline performance data before rolling into production, verifying that the core system — hardware components and operating system — can deliver expected performance under synthetic loads, is essential. Microbenchmarking uses tools specifically designed to generate loads on a particular subsystem, such as storage. In this blog post, we use a disk IO load generation tool to measure and evaluate disk IO performance.

Tools of The Trade

Fortunately, when it comes to tools and utilities for microbenchmarking, much of the development work has already been done. Gone are the days when you need to roll up your sleeves and start coding up a synthetic workload generator. There is almost certainly something already available to meet your requirements.

  • iperf is the tool of choice for verifying network IO performance.
  • fio has emerged as the go-to tool for generating a storage workload in Linux.
  • Vdbench is also an extremely powerful and flexible storage load generator.

Both fio and vdbench work well. Both facilitate the creation of a run file that has a predefined syntax for describing the workload you wish to generate, including the target devices, mountpoints, etc. Let’s take a look at a few microbenchmarking experiments using fio. Bundled Linux utilities (iostat(1)), combined with the data generated by either fio or vdbench, are well-suited for determining whether or not your storage meets performance expectations.

What to Expect

Before discussing methods and results, let’s align on key terms directly related to all performance discussions.

  • Bandwidth: how big. Theoretical maximum throughput, typically expressed as bytes per second, e.g. MB/sec, GB/sec, etc. Bigger numbers are better.
  • Throughput: how much. How much data is really moving, also expressed as bytes per second. The larger the number, the better.
  • OPS: how many. Operations per second, device dependent. For disks, reads per second and writes per second, typically expressed as IOPS (IO operations per second). Again, bigger is better.
  • Latency: how long. Time to complete an operation, such as a disk read or write. Expressed as a unit of time, typically in the millisecond range (ms) for spinning disks and the microsecond range (us) for SSDs. With latency, smaller is better.
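
Throughput and IOPS are tied together by the IO size: throughput is operations per second multiplied by bytes per operation. The following is a minimal Python sketch of that relationship. The function names are ours, the inputs are illustrative rather than measured, and it assumes 4k means 4096 bytes and 1MB means 1024 x 1024 bytes.

```python
KIB = 1024
MIB = 1024 * KIB


def throughput_mb_per_s(iops, io_size_bytes):
    """Throughput in MB/sec (1 MB = 1024 * 1024 bytes) for a given IOPS rate."""
    return iops * io_size_bytes / MIB


def iops_for_throughput(mb_per_s, io_size_bytes):
    """IOPS needed to move mb_per_s MB/sec at the given IO size."""
    return mb_per_s * MIB / io_size_bytes


# Illustrative inputs, not measurements.
# 500 large 1MB reads per second move 500 MB/sec ...
print(throughput_mb_per_s(500, 1 * MIB))
# ... while 90,000 small 4k reads per second move only about 352 MB/sec.
print(round(throughput_mb_per_s(90_000, 4 * KIB), 1))
```

This is why a device can look idle in terms of MB/sec while being fully busy in terms of IOPS, and why both numbers need to be checked against the spec.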

Determining what to expect requires assigning values to some of the terms above. The good news is that, when it comes to things like storage subsystems and networks, we can pretty much just do the math. Starting at the bottom layer, the disks, it’s very easy to get the specs on a given disk drive.

For example, looking at the target system configuration:

  • We typically see 500MB/sec or so of sequential throughput and about 80k-90k of small, random IOPS (IO operations per second) with modern Solid State Disks (SSD). Aggregate the numbers based on the number of installed disks.
    • NVMe SSDs provide substantially higher IOPS and throughput than SATA drives.
  • If we’re going to put a bunch of those in a box, we need only determine how that box connects to our system.
    • If it’s DAS (Direct Attached Storage), what is the connection path – eSATA, PCIe expansion, etc?
    • If it’s NAS (Network Attached Storage), we need to now factor in the throughput of the network – is the system using a single 10Gb link, or multiple links bonded together, etc.

Looking at the system architecture and understanding what the capabilities are of the hardware helps inform us regarding potential performance expectations and issues. Let’s take a look at some concrete examples.

A Single SSD

We begin with the simplest possible example, a single SSD installed in an NVIDIA DGX Station. First, we need to determine what the correct device name is under Linux:

# lsscsi
[2:0:0:0] disk ATA Mobius H/W RAID0 0962 /dev/sda
[3:0:0:0] disk ATA Samsung SSD 850 3B6Q /dev/sdb
[4:0:0:0] disk ATA SAMSUNG MZ7LM1T9 204Q /dev/sdc
[5:0:0:0] disk ATA SAMSUNG MZ7LM1T9 204Q /dev/sdd
[6:0:0:0] disk ATA SAMSUNG MZ7LM1T9 204Q /dev/sde
[7:0:0:0] disk ATA SAMSUNG MZ7LM1T9 204Q /dev/sdf

We’re specifically looking at the Samsung 850 EVO, device /dev/sdb (/dev/sda is an eSATA-connected desktop RAID box that we will test later). The other four disks are the internal SSDs that ship with the DGX Station. As we have disk model information, it’s a simple matter to look up the specs for the device. Samsung rates the 850 EVO at roughly 98k random reads/sec, 90k random writes/sec, and a bit over 500MB/sec for large, sequential IO. The second thing to check is the speed of the SATA link. This requires searching around a bit in the output of dmesg(1) and locating the link initialization information:

[1972396.275689] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[1972396.275910] ata4.00: supports DRM functions and may not be fully accessible
[1972396.276209] ata4.00: disabling queued TRIM support
[1972396.276211] ata4.00: ATA-9: Samsung SSD 850 EVO 1TB, EMT03B6Q, max UDMA/133
[1972396.276213] ata4.00: 1953525168 sectors, multi 1: LBA48 NCQ (depth 31/32), AA

Similar information can be derived by reading various subdirectories in /sys/class/. We see the drive on ata4, configured for 6.0 Gbps, or about 600MB/sec. So now we have an expectation for sequential throughput – the drive spec indicates 510-540MB/sec, and the link is sufficient to enable achieving full drive throughput.
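
The step from 6.0 Gbps to about 600MB/sec follows from SATA’s 8b/10b line encoding, in which every 8 bits of data occupy 10 bits on the wire. The sketch below shows that arithmetic. The function name is ours, it uses decimal megabytes, and it ignores any further protocol overhead, so treat the result as an upper bound rather than an achievable figure.

```python
def sata_payload_mb_per_s(line_rate_gbps):
    """Upper bound on MB/sec for a SATA link after 8b/10b encoding overhead."""
    bits_per_s = line_rate_gbps * 1e9
    payload_bits_per_s = bits_per_s * 8 / 10   # 8 data bits per 10 line bits
    return payload_bits_per_s / 8 / 1e6        # bits -> bytes -> MB


for rate in (6.0, 3.0, 1.5):
    print(f"{rate} Gbps -> {sata_payload_mb_per_s(rate):.0f} MB/sec")
```

The same function applies to any link speed reported in the dmesg(1) output, which makes it easy to sanity-check an expectation before running a benchmark.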

Let’s fio! We’ll create a basic fio run file which generates simple large sequential reads:

# cat seq_r.fio
[seq-read]
ioengine=psync
rw=read
bs=1024k
filename=/dev/sdb
runtime=180

Here’s a snippet from the output of fio during the run in which we observe throughput (519.0MB) and IOPS (519/0/0 iops) during the fio execution:

# fio -f seq_r.fio
seq-read: (g=0): rw=read, bs=1M-1M/1M-1M/1M-1M, ioengine=psync, iodepth=1
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1): [R(1)] [7.2% done] [519.0MB/0KB/0KB /s] [519/0/0 iops] [eta 02m:47s]
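
When collecting many runs, it can help to pull the numbers out of the status line programmatically. The following is a rough Python sketch that parses the status line format shown above for fio-2.2.10; other fio versions may print a different format. The helper names are ours, and treating the K suffix on IOPS as 1000 is an assumption.

```python
import re

# Matches "[<read>/<write>/<trim> /s] [<read>/<write>/<trim> iops]" as printed
# in the fio-2.2.10 status line shown above.
STATUS_RE = re.compile(
    r"\[(?P<bw>[^\]]+?) /s\]\s+\[(?P<iops>[^\]]+?) iops\]"
)
SUFFIX = {"": 1, "K": 1e3, "M": 1e6}  # assumption: K means 1000 for IOPS


def parse_iops(text):
    """Convert an IOPS field such as '519' or '25.4K' to a float."""
    m = re.fullmatch(r"([0-9.]+)([KM]?)", text.strip())
    if m is None:
        raise ValueError(f"unrecognized IOPS value: {text!r}")
    return float(m.group(1)) * SUFFIX[m.group(2)]


def parse_status(line):
    """Return raw bandwidth strings and numeric IOPS for read/write/trim."""
    m = STATUS_RE.search(line)
    if m is None:
        raise ValueError("no fio status fields found")
    bw = m.group("bw").split("/")
    iops = [parse_iops(v) for v in m.group("iops").split("/")]
    return {"bw": bw, "iops": iops}


line = ("Jobs: 1 (f=1): [R(1)] [7.2% done] [519.0MB/0KB/0KB /s] "
        "[519/0/0 iops] [eta 02m:47s]")
print(parse_status(line))
```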

We recommend that you monitor the output of iostat(1) in conjunction with the output of whatever disk IO load generator tool is being used. This allows us to validate the various metrics provided by the load generator tool as well as ensure that we’re seeing disk IO and not cache IO. Iostat captures data at the Linux kernel block layer and thus reflects actual physical disk IO.

We suggest using the command line ‘iostat -cxzk 1’. Let’s look at a sample line from the output:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00 2077.00    0.00 531712.00     0.00   512.00     1.79    0.86    0.86    0.00   0.48 100.00

We see very quickly that the SSD under test indeed delivers the expected throughput, at about 519MB/sec, as reported by fio. A couple of bits of iostat(1) data pique our interest:

  • Difference in reported throughput: 519MB/sec (fio) versus 532MB/sec (iostat, the 531712.00 value under the rkB/s heading). This 13MB/sec variance is only about 2.5%, so we don’t consider it an issue. Part of it may simply be units: if iostat’s kB are 1024 bytes and fio’s MB are 1024 x 1024 bytes, then 531712 / 1024 is about 519, which matches fio. Beyond that, small differences are expected because fio calculates throughput in the user context of the running process whereas iostat captures results several layers down in the kernel.
  • The reported IOPS (IO operations per second) rates look more interesting. Fio reports 519 IOPS as opposed to 2077 for iostat (r/s – reads-per-second value). This is again attributable to differences in kernel versus user-space metrics. Fio issues 1MB reads, which apparently get decomposed somewhere down the code path into smaller IOs, which are actually what gets sent down to the disk.
    • The avgrq-sz value is 512; this is the average size, in sectors, of the IOs being issued to the device.
    • Since a disk sector is 512 bytes, 512 x 512 = 256k. Therefore each 1MB read issued by fio is being decomposed into four 256k reads at the block layer.
    • The reads-per-second reported by iostat (2077) is roughly 4X the IOPS reported by fio (519). We will get back to this in just a bit.
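
The avgrq-sz arithmetic in the list above can be checked with a few lines of Python. This sketch pairs the iostat column headings with the sample values shown earlier and derives the request size and the split factor. The helper name is ours, and it assumes the 512-byte sector size stated above.

```python
SECTOR_BYTES = 512  # avgrq-sz is reported in 512-byte sectors

header = ("Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz "
          "await r_await w_await svctm %util")
sample = ("sdb 0.00 0.00 2077.00 0.00 531712.00 0.00 512.00 1.79 "
          "0.86 0.86 0.00 0.48 100.00")


def parse_iostat(header_line, data_line):
    """Pair each column heading with its value; the device name stays a string."""
    names = header_line.split()
    fields = data_line.split()
    row = {names[0]: fields[0]}
    row.update({n: float(v) for n, v in zip(names[1:], fields[1:])})
    return row


row = parse_iostat(header, sample)
request_kib = row["avgrq-sz"] * SECTOR_BYTES / 1024   # size of each block-layer IO
split_factor = 1024 / request_kib                     # pieces per 1MB fio read
fio_equivalent_iops = row["r/s"] / split_factor       # reads/sec as fio counts them
print(request_kib, split_factor, fio_equivalent_iops)
```

Dividing the iostat reads-per-second by the split factor gives 519.25, which lines up with the 519 IOPS that fio reported.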

We also observed expected results of about 515MB/sec for the sequential mixed read/write results, which we choose not to show due to space constraints. The bottom line: we’re getting expected throughput with large sequential reads and writes to/from this SSD.

Sequential throughput is just one performance attribute. Random IO performance, measured as IOPS (IO operations per second), is also extremely important. And it’s not just the actual IOPS rate that matters, but observed latency as well, as reported in the iostat(1) r_await column for reads and the w_await column for writes.

Let’s tweak the fio run file to specify random IO and a much smaller IO size – 4k in this case for the IOPS test. Here is the random read fio run file:

# cat rand_r.fio
[random-read]
ioengine=psync
rw=randread
bs=4k
filename=/dev/sdb
runtime=180

Now running 4k random reads:

# fio -f rand_r.fio
random-read: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1): [r(1)] [16.7% done] [98.91MB/0KB/0KB /s] [25.4K/0/0 iops] [eta 02m:30s]

And here’s the matching iostat output:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00 26314.00    0.00 105256.00     0.00     8.00     0.76    0.03    0.03    0.00   0.03  76.00

We see about 25k random read IOPS. This sounds pretty good, but a new 850 EVO SSD should get around 98k random 4k reads per second, so this is substantially less than expected. Before we write off the SSD as the problem, let’s make sure the workload is sufficient to maximize the performance of the target device. The speed of modern SSDs means we often need concurrent load (multi-process or multi-thread) to extract maximum performance. (The same holds true for high-speed networks.)

The fio run file includes an attribute called iodepth, which determines the number of IOs that are queued up to the target. The iostat data, avgqu-sz, shows that the queue depth to the device is typically less than 1 (0.76). Let’s try queuing up more IO’s.

We can use the Linux lsblk(1) utility to take a look at the kernel’s request queue size for the disk devices:

# lsblk -o "NAME,TRAN,RQ-SIZE"
NAME   TRAN   RQ-SIZE
sda    sata       128
sdb    sata       128
sdc    sata       128

This means the kernel maintains a queue depth of 128 for each device. Let’s try tweaking the iodepth parameter in the fio run file by adding iodepth=32 to the random IO run file shown above.

# fio -f rand_r.fio
random-read: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=32
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1): [r(1)] [31.1% done] [100.1MB/0KB/0KB /s] [25.9K/0/0 iops] [eta 02m:04s]

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb              0.00     0.00 25218.00    0.00 100872.00     0.00     8.00     0.68    0.03    0.03    0.00   0.03  68.00

We show both the fio output and iostat output samples above. With the change in the iodepth from 1 to 32, we observe no improvement in IOPS. Also, in both cases, the avgqu-sz (average queue size during the sampling period) remained less than 1. It seems the iodepth value change did not result in a larger number of IO’s queued to the device. Let’s take a closer look at the man page for fio(1). Zeroing in on the iodepth parameter description, the man page tells us:

“Note that increasing iodepth beyond 1 will not affect synchronous ioengines…”.

Thus, we need to use a different ioengine parameter for fio. Of the available ioengines, libaio seems the logical choice, so we change the run file by replacing ioengine=psync with ioengine=libaio. After the change to libaio, fio generated the same IOPS result, about 24k, and the queue depth (avgqu-sz) reported by iostat still showed less than 1 IO in the queue. A trip back to the man page reveals a further caveat in the description of the iodepth parameter:

“Even async engines may impose OS restrictions causing the desired depth not to be achieved. This may happen on Linux when using libaio and not setting direct=1, since buffered IO is not async on that OS.”

Let’s add direct=1 to the fio run file, and try again.

# fio -f rand_r.fio
random-read: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1): [r(1)] [25.0% done] [383.8MB/0KB/0KB /s] [98.3K/0/0 iops] [eta 02m:15s]
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00 98170.00    0.00 392680.00     0.00     8.00    31.35    0.32    0.32    0.00   0.01 100.00

This time we observe 98k random 4k reads per second from the device, which aligns well with the specification for the device. The latency (r_await) is excellent at 320us (sub-millisecond disk IO). Also, the avgqu-sz value sits right around 32, which aligns with the iodepth value in the run file. The random write and mixed random read/write results also all aligned well with what is expected from the device, now that we have a correct run parameter file.
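
The relationship between IOPS, latency, and queue depth in these samples can be cross-checked with Little’s Law, which says the average number of requests in flight equals the arrival rate multiplied by the time each request spends in the system. The sketch below applies it to the iostat samples shown in this section. The function name is ours, and because iostat rounds r_await to two decimals the estimates are only approximate.

```python
def expected_queue_depth(ops_per_s, await_ms):
    """Little's Law: average IOs in flight = arrival rate * time in system."""
    return ops_per_s * (await_ms / 1000.0)


# (r/s, r_await in ms, avgqu-sz reported by iostat) from the samples above
samples = {
    "psync, iodepth=1": (26314.00, 0.03, 0.76),
    "libaio, iodepth=32, direct=1": (98170.00, 0.32, 31.35),
}

for name, (rps, await_ms, reported) in samples.items():
    estimate = expected_queue_depth(rps, await_ms)
    print(f"{name}: estimated {estimate:.2f}, iostat reported {reported}")
```

The estimates come out close to the reported avgqu-sz in both cases. Read the other way around, the law also suggests why a single synchronous job stalls near 25k IOPS: with roughly one IO in flight, the IOPS rate cannot exceed about one divided by the per-IO latency.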

The key lesson learned from this simple experiment: ensure you understand your load generation tool. For reference, here is the fio run file used for the random 4k read load:

# cat rand_r.fio
[random-read]
ioengine=libaio
rw=randread
bs=4k
filename=/dev/sdb
runtime=180
iodepth=32
direct=1
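
The two man page caveats we ran into lend themselves to a simple pre-flight check on a run file. The following is a hedged Python sketch, not part of fio: check_job_file is a hypothetical helper, the list of synchronous engines is an assumption for illustration, and it only understands the simple key=value run files used in this post.

```python
import configparser

# Engines treated as synchronous here; this short list is an assumption for
# illustration, so check the fio man page for the engines in your version.
SYNC_ENGINES = {"sync", "psync", "vsync", "pvsync"}


def check_job_file(text):
    """Return warnings for option combinations that silently cap queue depth."""
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    parser.read_string(text)
    warnings = []
    for job in parser.sections():
        opts = parser[job]
        engine = opts.get("ioengine", "sync")  # assumed default engine
        depth = int(opts.get("iodepth", "1"))
        direct = opts.get("direct", "0") == "1"
        if depth > 1 and engine in SYNC_ENGINES:
            warnings.append(
                f"{job}: iodepth={depth} has no effect with ioengine={engine}")
        if depth > 1 and engine == "libaio" and not direct:
            warnings.append(
                f"{job}: libaio without direct=1 uses buffered IO, "
                "which is not async on Linux")
    return warnings


first_attempt = "\n".join([
    "[random-read]", "ioengine=psync", "rw=randread", "bs=4k",
    "filename=/dev/sdb", "runtime=180", "iodepth=32", ""])
for warning in check_job_file(first_attempt):
    print(warning)
```

Running it against our first attempt flags the psync and iodepth=32 combination, while the final run file above passes without warnings.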

File Systems

Testing on the raw block device has benefits because it’s the simplest code path through the kernel for doing disk IO, thus making it easier to test the capabilities of the underlying hardware. But file systems are a fact of life, and it’s important to understand performance when a file system is in the mix. Let’s create an ext4 file system on /dev/sdb, using all defaults (no parameter tweaks, straight out of the box), and run another series of fio tests.

Here is the fio run file for file system large sequential reads:

# cat srfs.fio
[seq_fs_read]
ioengine=psync
rw=read
directory=/myfs/fio_dir
size=1g
blocksize=1024k
runtime=180
numjobs=1
time_based=1

First, sequential reads:

# fio -f srfs.fio
seq_fs_read: (g=0): rw=read, bs=1M-1M/1M-1M/1M-1M, ioengine=psync, iodepth=1
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1): [R(1)] [92.3% done] [5433MB/0KB/0KB /s] [5433/0/0 iops] [eta 00m:14s]

Just a single job doing 1MB sequential reads generates sustained throughput of 5433MB/sec, or just over 5.4GB/sec. Given that the SSD sits on a SATA link capable of a maximum of 600MB/sec, we can assume this workload is running out of the kernel page cache: in other words, system memory. Running iostat confirms there is no physical disk IO happening. As an experiment, we add concurrency to this load, using fio’s numjobs parameter, setting numjobs=4 in the run file.

# fio -f srfs.fio
seq_fs_read: (g=0): rw=read, bs=1M-1M/1M-1M/1M-1M, ioengine=psync, iodepth=1
...
fio-2.2.10
Starting 4 processes
Jobs: 4 (f=4): [R(4)] [22.7% done] [22016MB/0KB/0KB /s] [22.2K/0/0 iops] [eta 02m:20s]

With 4 processes running concurrently, throughput increases about 4x, to 22016MB/sec, or 22GB/sec. It’s great that the system can pull a lot of data out of the page cache (memory), but we’re still not measuring disk IO performance.

Tracking disk IO via iostat at the start of this fio test shows that the system does perform actual disk reads, with fio reporting throughput of about 520MB/sec. After the files have been read once, all subsequent reads come from the page cache, at which point the throughput reported by fio increases substantially, depending on the IO size, number of jobs, etc.

As an aside, using Intel’s Memory Latency Checker, mlc, we verified that this system is capable of just over 70GB/sec read throughput from memory, so the 22GB/sec we observed with 4 jobs running falls well within the capabilities of the hardware.

Linux includes direct IO support, using the O_DIRECT flag passed to the kernel via the open(2) system call, instructing the kernel to do unbuffered IO. This bypasses the page cache. Adding direct=1 to the run file enables this. After setting this flag in the run file, we noticed a sustained 520MB/sec throughput during the sequential read test — similar to performance with the block device and consistent with the SSD performance specification. With direct=1, we also observed physical disk reads that aligned with the metrics reported with fio.

An interesting side-note regarding direct=1: the sequential read experiments on the raw block device resulted in a disparity between the IOPS reported by fio and the reads-per-second reported by iostat. Recall that 1MB reads issued by fio broke down into four 256k reads in the kernel block layer. Thus, iostat reported 4X more reads-per-second than fio. That disparity goes away when setting direct=1 in the run file for the block device. The IOPS reported by fio and iostat then align, just as we observed with the file system. Using direct=1 on block devices changes the disk IO behavior.

We observed a similar disparity for the 4k random read load depending on whether we set direct=1. Without direct IO, and after the initial file read, fio reported 6124k IOPS (roughly 6.1M) with 8 jobs running, well beyond what a single SSD can do. Increasing the number of concurrent jobs from 8 to 16 brought the IOPS number to 10.2M: not linear, but an interesting data point nonetheless for getting a sense of small random reads from memory.

The 10.2M IOPS may be a limitation in fio, a limitation in the kernel for a single ext4 file system, or several other possibilities. We emphatically do not assert that 10.2M 4k reads-per-second is a ceiling/limit – this requires more analysis and experimentation. We’re focused on what we can get from physical storage so we’ll defer chasing the cached random read IOPS ceiling for another article.

Finally, setting direct=1 for the random 4k read test resulted in device-limited results – about 96k IOPS. Therefore, the hardware can sustain device-level performance with no noticeable issues when doing direct IO with a file system in place. We also see a substantial increase in throughput and IOPS when our load is reading from the page cache (kernel memory).

Evaluating write performance with file systems in the mix gets a little trickier because of synchronous write semantics and page cache flushing. For a quick first test, we have an fio run file configured to do random 4k writes with one job and the psync engine. In this test, fio reported a rate of about 578k IOPS:

# fio -f rwfs.fio
random_fs_write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1): [w(1)] [16.6% done] [0KB/2259MB/0KB /s] [0/578K/0 iops] [eta 02m:31s]

We know the SSD can handle peak writes of around 90k IOPS; clearly, these writes were writes to the page cache, not the SSD itself. Watching iostat data confirms this, as nowhere near that number of disk writes is seen. We do observe a regular burst of writes to the disk, on the order of 1k write IOPS every few seconds. This is the Linux kernel periodically flushing dirty pages from the page cache to disk.

In order to force bypassing the page cache, add direct=1 to the run file, and start again.

# fio -f rwfs.fio
random_fs_write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1): [w(1)] [51.4% done] [0KB/135.3MB/0KB /s] [0/34.7K/0 iops] [eta 01m:28s]
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     1.00    0.00 32360.00     0.00 129444.00     8.00     0.66    0.02    0.00    0.02   0.02  65.60

The fio output now reports 34.7k IOPS, and the iostat samples indicate this is all physical IO going to disk. Adding the direct flag bypasses the page cache, so writes go directly to disk. The 34.7k write IOPS falls below the stated spec for the drive, and our previous experiment tells us we need to add more processes (concurrency) to max out the SSD. Adding numjobs=4 increased write IOPS to 75k; increasing to numjobs=8 boosted results slightly to 85k write IOPS, pretty close to the 90k specification. Since this microbenchmark is hitting the ext4 file system, we have a longer code path through the kernel, even with direct IO. We intentionally used fio’s numjobs instead of iodepth in this example to illustrate that there are multiple ways of increasing IO load.

We can also force synchronous writes in fio using sync=1 in the run file. In this example, direct=1 was replaced with sync=1.

# fio -f rwfs.fio
random_fs_write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
...
fio-2.2.10
Starting 8 processes
Jobs: 8 (f=8): [w(8)] [42.0% done] [0KB/2584KB/0KB /s] [0/646/0 iops] [eta 01m:45s]

Write IOPS dropped to 646 with sync=1, a substantial drop in write performance. In general, this is not unusual when enforcing synchronous write semantics because each write needs to be committed to non-volatile storage before it returns success to the calling application.

We need to keep several key points in mind when evaluating write performance:

  • Concurrency is required for maximum IOPS.
  • Enforcing synchronous writes substantially reduces write IOPS and increases latency.
  • Using direct=1 and sync=1 on write loads with fio has two very different effects on the resulting performance. Direct enforces bypassing the page cache, but writes may still get cached at the device level (NVRAM in SSDs, track buffers, etc), whereas sync=1 must ensure writes are committed to non-volatile storage. In the case of SSDs, that will be the backend NAND storage, where things like write amplification and garbage collection can get in the way of write performance.
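
To get a feel for synchronous write semantics outside of fio, the sketch below times the same number of 4k writes with and without the O_SYNC open flag, which, as far as we understand, is what sync=1 maps to for most fio engines. This is an illustration only: the writes are sequential rather than random, the helper name is ours, and on a memory-backed temporary directory the two timings may be nearly identical.

```python
import os
import tempfile
import time

BLOCK = b"\0" * 4096


def timed_writes(path, count, extra_flags=0):
    """Write `count` 4k blocks and return (seconds elapsed, bytes written)."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra_flags
    fd = os.open(path, flags, 0o600)
    try:
        start = time.perf_counter()
        written = sum(os.write(fd, BLOCK) for _ in range(count))
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed, written


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "probe.dat")
    buffered = timed_writes(target, 256)
    # O_SYNC is not defined on every platform, so fall back to 0 (buffered).
    synced = timed_writes(target, 256, getattr(os, "O_SYNC", 0))
    print(f"buffered: {buffered[0]:.4f}s  O_SYNC: {synced[0]:.4f}s")
```

To see a meaningful difference, point the target path at a file system backed by the device under test instead of the temporary directory.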

External RAID Box

Let’s now examine a desktop storage system that implements hardware RAID and can be configured with up to 5 SSDs. We’ll configure the device as RAID 0 to maximize potential performance. The RAID system employs five Samsung 850 EVO SSDs, the same make and model used previously, so we can do the math to determine expected performance levels. Each drive offers about 520MB/sec of sequential throughput, which translates to 2.6GB/sec in aggregate. We know we will never get close to that, however, because the box connects via eSATA v3, which limits throughput to about 600MB/sec.

Theoretically, we should see roughly 90k IOPS per device for random IO, or 450k random IOPS for the array. In reality, the system will never achieve that due to the eSATA limit, since 450k IOPS at 4k per IO would generate about 1.8GB/sec of throughput, substantially more than the eSATA link can sustain. Doing the math (600MB/sec divided by 4k per IO), our random IOPS performance will top out at roughly 146k IOPS.
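
The expectation math for the RAID box can be written down in a few lines. This is a sketch under stated assumptions: it takes 4k to mean 4096 bytes, uses decimal megabytes for the link, and assumes the link runs at the full 600MB/sec. The exact results shift a little with those conventions, so treat them as approximate.

```python
def capped(aggregate, link_limit):
    """The slower of the devices and the link sets the ceiling."""
    return min(aggregate, link_limit)


DRIVES = 5
SEQ_MB_PER_DRIVE = 520        # per-drive sequential figure from the text
IOPS_PER_DRIVE = 90_000       # per-drive random IOPS figure from the text
LINK_MB_PER_S = 600           # assumed 6.0 Gbps eSATA link
IO_SIZE_BYTES = 4096          # assumes "4k" means 4096 bytes
MB = 1_000_000                # decimal megabytes, to match the link figure

seq_expected = capped(DRIVES * SEQ_MB_PER_DRIVE, LINK_MB_PER_S)
link_iops_limit = LINK_MB_PER_S * MB / IO_SIZE_BYTES
iops_expected = capped(DRIVES * IOPS_PER_DRIVE, link_iops_limit)
array_random_mb = DRIVES * IOPS_PER_DRIVE * IO_SIZE_BYTES / MB

print(seq_expected)                # sequential MB/sec, limited by the link
print(round(link_iops_limit))      # 4k IOPS the link can carry
print(round(array_random_mb))      # MB/sec the drives could generate at 4k
```

In both the sequential and the random case the link, not the drives, sets the expectation for this configuration.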

# lsscsi
[2:0:0:0]    disk    ATA      Mobius H/W RAID0 0962  /dev/sda
[3:0:0:0]    disk    ATA      Samsung SSD 850  3B6Q  /dev/sdb
[4:0:0:0]    disk    ATA      SAMSUNG MZ7LM1T9 204Q  /dev/sdc
. . .

There’s our RAID device, /dev/sda.

Once again, we begin with large, sequential reads.

# fio -f seq_r.fio
seq-read: (g=0): rw=read, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1): [R(1)] [17.7% done] [256.0MB/0KB/0KB /s] [256/0/0 iops] [eta 02m:29s]

The fio results come in at around 256MB/sec, well below expectations. We expected to saturate the eSATA link at near 600MB/sec. The iostat data aligned with this, showing around 260MB/sec from /dev/sda. Our fio run file reflects what we learned the first time around, thus we have iodepth=32 and ioengine=libaio. It looks like we’re not getting what we expect in terms of IO’s getting queued to the device based on the iostat data:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00 1020.00    0.00 261120.00     0.00   512.00     1.86    1.82    1.82    0.00   0.98 100.40

The avgqu-sz field is less than 2 (1.86), which is inconsistent with our earlier test. When we previously set iodepth=32, the number of IOs queued to a single device increased substantially. Earlier, we solved a similar issue by setting direct=1.

# fio -f seq_r.fio
seq-read: (g=0): rw=read, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1): [R(1)] [11.6% done] [262.0MB/0KB/0KB /s] [262/0/0 iops] [eta 02m:40s]
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00  277.00    0.00 267264.00     0.00  1929.70    33.94  122.06  122.06    0.00   3.61 100.00

Setting direct=1 in the run file definitely changes things, even for a block device. Throughput rises only slightly, from 256MB/sec to 262MB/sec, but the IO queue depth now sits at the expected value of just over 32 (avgqu-sz of 33.94). The IOPS picture also changes. In the first case (direct=0, the default), fio reported 256 read IOPS while iostat reported 1020 reads per second, four times as many. As with the single SSD, the 1MB reads issued by fio were being split somewhere in the kernel code path into 256k IOs (512 sectors x 512 bytes) at the block layer; at one fourth of the IO size, we see four times as many IOPS. With direct=1, the IOPS reported by fio (262) and iostat (277) are virtually the same, and the avgrq-sz of 1929.70 reflects an average IO size of about 1MB. (Remember, this data reflects 277 data points averaged over 1-second intervals, so the math will not align precisely.)

Determining why using direct IO on a block device changes things requires some research and likely an excursion through the source code, which we may cover in a future post.

That 262MB/sec result presents a problem. Why is our desktop RAID box delivering substantially less sequential read throughput than a single SSD? Let’s verify the link speed first by looking at kernel messages via dmesg(1):

[   10.174282] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[   10.174756] ata3.00: ATA-7: Mobius H/W RAID0, 0962, max UDMA/133
. . .

Now we see the problem: the external SATA reports 3.0Gbps, not 6.0Gbps, which implies a maximum theoretical throughput of about 300MB/sec. We’re getting about 262MB/sec which is probably realistic given protocol overhead.

If our production workload needs maximum available sequential read performance, this is not a good solution to pair with a DGX Station. But there’s still random IOPS; perhaps it delivers on a different workload. The appropriate changes are made to the fio run file to generate 4k random reads with libaio as the ioengine, and an iodepth of 32.

# fio -f rand_r.fio
random-read: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1): [r(1)] [6.6% done] [62396KB/0KB/0KB /s] [15.6K/0/0 iops] [eta 02m:49s]

Only 15k random 4k reads per second looks pretty abysmal — much worse than the first go-round with only a single SSD.

Let’s crank up the iodepth to 128 since we have multiple physical SSD’s behind this RAID controller, all striped up as a RAID 0 device. This allows us to see if more IO’s in the queue yield a better IOPS result.

# fio -f rand_r.fio
random-read: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=128
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1): [r(1)] [17.1% done] [62096KB/0KB/0KB /s] [15.6K/0/0 iops] [eta 02m:30s]

Even pumping up iodepth results in no improvement. Checking for random writes, we observed pretty much the same level of performance. It’s looking like the RAID box under test offers notably substandard performance. Results such as these need to be a key component of the decision-making process when selecting a storage solution.

Conclusion

A few key points come to mind:

  • Microbenchmarking with a synthetic load generator before going into production is relatively easy and yields big benefits.
  • It’s important to analyze and understand the tool(s) and the data being reported. Sometimes not getting expected values is a load issue, not a target issue.
  • Disk IO has several potential variables that can impact performance in subtle or not-so-subtle ways
    • An active page cache potentially masks actual file system performance.
    • The direct IO flag apparently affects IO to both block devices and file systems. We’ll dive deeper on this in a future article.
    • Synchronous writes used to ensure data integrity are expensive in terms of performance.

It seems that the particular external RAID box we examined isn’t well-suited for deep learning applications if any level of sustained IO performance is important to the workload. The time invested in performance testing really paid off when evaluating the performance of the storage hardware. Had we just started using this storage for production work, the poor performance may have been blamed on the overall system.

Both of these examples represent simple and small configurations, but the same methodology can be applied to much larger data center environments. For example, if you’re deploying DGX-1 servers with NAS storage connected via the DGX-1 10Gb ethernet ports, you know peak performance tops out at about 1.2GB/sec using one port, or 2.4GB/sec using both ports concurrently. That will be your max deliverable throughput (wire speed of the network link).

If you’re doing small (for example, 4k) random IOs, you’ll hit the throughput limit on the wire at about 300k IOPS for a single 10Gb link, or twice that if you’re aggregating across both ports. Five SSDs employed in a current-generation NAS system will easily saturate dual 10GbE links with large sequential IO. A six-or-seven drive array will saturate dual 10GbE connections with random IOs.
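
The drive counts above follow from dividing the link capacity by the per-drive figures used earlier in this post. The following sketch makes that explicit. The function name is ours, and the per-drive numbers are the approximate SATA SSD figures from this post rather than measurements of any particular NAS system.

```python
import math


def drives_to_saturate(link_capacity, per_drive):
    """Smallest whole number of drives whose combined rate reaches the link."""
    return math.ceil(link_capacity / per_drive)


DUAL_10GBE_MB_PER_S = 2400      # two ports at about 1.2GB/sec each
DUAL_10GBE_4K_IOPS = 600_000    # two ports at about 300k 4k IOPS each
SEQ_MB_PER_DRIVE = 520          # per-SSD sequential figure used earlier
IOPS_PER_DRIVE = 90_000         # per-SSD random IOPS figure used earlier

print(drives_to_saturate(DUAL_10GBE_MB_PER_S, SEQ_MB_PER_DRIVE))
print(drives_to_saturate(DUAL_10GBE_4K_IOPS, IOPS_PER_DRIVE))
```

With faster drives such as NVMe SSDs the per-drive figures rise, so correspondingly fewer drives are needed to reach the limit of the wire.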

Summary

NVIDIA GPUs provide massive computational power for deep learning workloads. The massive parallelism of NVIDIA GPUs means deep learning jobs can run at very high rates of concurrency – thousands of threads processing data on NVIDIA GPU cores. Completing those jobs in a timely manner requires high sustained rates of data delivery from the storage. Understanding the throughput capabilities is critical to properly assessing performance and capacity requirements for DL in the enterprise.

High rates of low-latency IOPS may be just as important as throughput, depending on the type of source data (image files, binary files, etc) and how it is stored. Low latency minimizes wait time for data, whether feeding training jobs, writing files as a result of a transformation process prior to DL training, or generating results for use by data scientists and business analysts, maximizing customer ROI when using NVIDIA Deep Learning accelerators. Spending a little time running benchmarks on your storage subsystem will likely reap significant returns in performance.
