I've been working a bit on ZFS benchmarking recently, and I'd like to share some of my experience in tracking down various bottlenecks.
Benchmarking the hardware
I have a system with 34 3TB SAS hard drives, two LSI SAS 9200-8e HBAs (SAS controllers), and a SuperMicro disk shelf. So, what sort of throughput should I expect?
Raw disk throughput
First of all, it's a good idea to benchmark a single drive. You can start by writing raw data directly to a drive. BEWARE: IF YOU WRITE TO A DRIVE THAT IS IN USE, YOU MIGHT LOSE DATA! MAKE SURE YOU KNOW WHAT YOU ARE DOING! But here's what I did:
# dd if=/dev/zero of=/dev/rdsk/<device> bs=4k count=1024
# dd if=/dev/zero of=/dev/rdsk/<device> bs=1M count=100

At this stage, you should probably be seeing speeds around 50-70MB/s for a decent 7200RPM hard drive. If you're way under, something is probably wrong. If your numbers are way higher, you've probably got 15kRPM or SSD drives. I initially got very low speeds, but this turned out to be caused by an MPxIO issue with the LSI HBAs, and was resolved by changing the load balance technique from "round-robin" to "logical-block". Once that was resolved, I got proper results in these and the following tests.
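If you'd rather not point dd at a raw device while exploring, a safe variant of the same sequential-write test is to write to a scratch file and compute the rate yourself. This is just a sketch: the file path is an example, and it assumes a GNU userland (`date +%s%N`):

```shell
#!/bin/sh
# Safe sequential-write benchmark: writes zeroes to a scratch file and
# reports MB/s. Assumes GNU date (+%s%N); the file path is an example.
bench_write() {
    count=$1                          # number of 1MB blocks to write
    testfile=/tmp/ddbench.$$
    start=$(date +%s%N)
    dd if=/dev/zero of="$testfile" bs=1M count="$count" 2>/dev/null
    end=$(date +%s%N)
    rm -f "$testfile"
    # MB written divided by elapsed seconds
    awk -v mb="$count" -v ns="$((end - start))" \
        'BEGIN { printf "%.1f MB/s\n", mb / (ns / 1e9) }'
}

bench_write 64
```

Note that this goes through the file system cache, so it answers a different question than the raw-device test above; it's mainly useful as a quick, non-destructive smoke test.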
File system throughput
The next step is to test a file system on a single drive. The raw writes to the disk above were unbuffered, so now we'll test buffered IO and also see whether the file system itself imposes any limitations. I created a pool made up of a single drive, and then ran this command. At this point, make sure you haven't enabled compression or deduplication, otherwise you won't be measuring anything interesting!
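For reference, setting up such a single-drive test pool might look like this (the pool name and device name are just examples; the property checks make sure dd measures the disk and not the compressor):

```shell
# Create a throwaway pool on a single drive ("singledrive" and the
# device name are example values).
zpool create singledrive c0t5000C500ABCD1234d0

# Make sure compression and dedup are off before benchmarking.
zfs set compression=off singledrive
zfs set dedup=off singledrive
zfs get compression,dedup singledrive   # verify
```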
# dd if=/dev/zero of=/volumes/singledrive/output bs=1M count=10240

At this stage, I got about 135MB/s to the single drive. That's impressive, but keep in mind that a drive is faster in some areas than others, so the speed across the whole drive will probably vary from this maximum down to about 70MB/s in the slowest areas. But we're definitely where we should be.
Next we want to benchmark a more complex pool structure. I started off with a pool made up of a single vdev, 5 drives in a raidz configuration. In simplified terms, this means 4 drives with data and one drive with parity. So what sort of numbers do we get? Let's check:
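Before looking at the numbers, it's worth writing down what we expect: for streaming writes, a raidz vdev should top out at roughly the per-drive speed times the number of data drives, since the parity drive is written in parallel. A quick sanity-check helper (the 125MB/s per-drive figure is an assumption based on the single-drive test above):

```shell
#!/bin/sh
# Rough expected sequential throughput for one raidz vdev:
# (data drives) x (per-drive streaming speed); parity writes in parallel.
expected_raidz_mbs() {
    drives=$1        # total drives in the vdev
    parity=$2        # parity drives (1 for raidz1)
    per_drive=$3     # MB/s per drive
    echo $(( (drives - parity) * per_drive ))
}

expected_raidz_mbs 5 1 125   # prints 500
```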
# dd if=/dev/zero of=/volumes/z5-1/output bs=1M count=102400

Now I get about 500MB/s to my pool. That's good, and roughly equates to 125MB/s to each of the four data drives. I'm pleased with that!
OK, so let's see what happens when we add more vdevs to the pool! I repeated this test with 2, 3, 4, 5 and 6 vdevs in the pool, where each vdev was a raidz of 5 drives. I ran the command above again on each, i.e.:
# dd if=/dev/zero of=/volumes/z5-2/output bs=1M count=102400
# dd if=/dev/zero of=/volumes/z5-3/output bs=1M count=102400
# dd if=/dev/zero of=/volumes/z5-4/output bs=1M count=102400
# dd if=/dev/zero of=/volumes/z5-5/output bs=1M count=102400
# dd if=/dev/zero of=/volumes/z5-6/output bs=1M count=102400

The numbers I got were very interesting! ZFS uses dynamic striping, so the stripe width is adjusted so that we always stripe across all vdevs. That means we should add 500MB/s of throughput every time we add another vdev of 5 drives with raidz. What I got was:
- 1 vdev: approx 500MB/s
- 2 vdevs: approx 1000MB/s
- 3 vdevs: approx 1400MB/s
- 4 vdevs: approx 1500MB/s
- 5 vdevs: approx 1400MB/s
- 6 vdevs: approx 1400MB/s
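The pattern above is easy to model: linear scaling at ~500MB/s per vdev, capped once some shared channel saturates. A small sketch using the figures measured above (the 1500MB/s ceiling is simply the observed saturation point, not a spec value):

```shell
#!/bin/sh
# Predicted pool throughput with dynamic striping: each raidz vdev adds
# ~500MB/s until a shared channel (HBA, cable, backplane) saturates.
predicted_mbs() {
    vdevs=$1
    per_vdev=500     # MB/s per 5-drive raidz vdev, from the single-vdev test
    ceiling=1500     # MB/s, the observed saturation point
    linear=$(( vdevs * per_vdev ))
    if [ "$linear" -lt "$ceiling" ]; then echo "$linear"; else echo "$ceiling"; fi
}

for n in 1 2 3 4 5 6; do
    echo "$n vdevs: ~$(predicted_mbs $n) MB/s"
done
```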
Each HBA has a theoretical limit of 6Gbit/s per SAS2 lane, and 4000MB/s on the PCIe side. I'm using MPxIO with load balancing, so both HBAs are in use at the same time. The cable from each HBA to the disk chassis carries 4 SAS2 lanes, and my disk chassis has a backplane with 4 SAS2 lanes, which the disks are connected to. So the 4-lane backplane appears to be my limiting factor. Further benchmarking with iozone shows read speeds of up to approximately 2.2GB/s under optimal conditions. The theoretical maximum for the 4-lane backplane is 4 * 6Gbit/s = 24Gbit/s = 3GB/s, so this looks pretty good. Write throughput tends to be lower, presumably because parity/mirror data also has to cross the same channels, "stealing" effective bandwidth once you've hit the maximum.
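The back-of-the-envelope conversion used above (bits to bytes, 8 bits per byte) as a tiny helper:

```shell
#!/bin/sh
# Convert N SAS2 lanes at 6Gbit/s each into GB/s (1 byte = 8 bits).
lanes_to_gbs() {
    awk -v lanes="$1" -v gbit=6 'BEGIN { printf "%.1f\n", lanes * gbit / 8 }'
}

lanes_to_gbs 4    # 4 x 6Gbit/s = 24Gbit/s = 3.0 GB/s
```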
Benchmarking for specific applications
When benchmarking ZFS with iozone, it's interesting to experiment with different record sizes (the -r flag) and compare them with ZFS's own recordsize (check with "zfs get recordsize"; the default is 128K on the systems I've tested). You'll probably find that speeds are similar for many operations, but operations like rewrite and "mixed workload" (re-read and re-write at random offsets) can show dramatic speed differences depending on whether the record size used by iozone matches ZFS's recordsize. This is why you might need to tune ZFS's recordsize to match your application's record size, especially with databases. If you don't do lots of rewrites, you'll probably be better off with the higher throughput you get from the default 128K recordsize.
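As a concrete sketch, tuning a dataset for a database and then benchmarking at the same record size might look like this (the dataset name "tank/db" and the 8K page size are example values chosen for illustration):

```shell
# Match the dataset's recordsize to the application's IO size.
zfs set recordsize=8K tank/db

# Then benchmark at the same record size:
# -r record size, -s file size, -i selects tests
# (0 = write/rewrite, 1 = read/re-read, 8 = mixed workload).
iozone -r 8k -s 4g -i 0 -i 1 -i 8 -f /volumes/tank/db/iozone.tmp
```

Note that recordsize only affects files written after the change, so create the test file (or reload the database) after setting it.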
posted at: 16:11 | path: /2011/11/28 | permanent link to this entry