by Mike Riley, Director of Strategy & Technology, Americas
In doing some tweeting, e-mailing - God forbid we pick up a phone - some colleagues of mine started talking about how we best position our Flash Cache technology especially in the face of all of the automated storage tiering offerings out there. All too often we - and I mean customers and vendors alike - get caught up in the "Well, ask them why their product doesn't do this or doesn't do that?!" We go back and forth trading barbs and FUD and if something seems to stick, heck, we just ride it regardless of whether we know or even suspect that it's not true. Really sad. All of this BS has come roaring back to life with the introduction of solid state.
I don't think anyone is disputing the performance potential of solid state, but it also raises the question for customers of how best to deliver on price as well as performance. Per usual, the storage industry's age-old answer to this age-old question is storage tiering (as if we don't already know how this movie ends). As you would hopefully come to expect, different vendors have different approaches (some vendors have several). EMC, Compellent, 3Par and others have a handful of different Automated Storage Tiering (AST) solutions. NetApp has marketed a Virtual Storage Tiering (VST) strategy. Both have merit because both rely on the historical strengths of the companies that back them. AST is an extension of the ILM messaging that companies such as EMC promoted in the past. VST builds on NetApp's fellowship ring, WAFL.
Where this falls apart for a customer is when we ask Vendor A to compare its technology/solution to Vendor B's. Wrong question. (Sorry - the customer is not always right. Heresy!) The right question is, "Tell me how you solve my problem." I suggest that it's O.K. to step back and answer that question for the customer vs. getting into some technology feud with another vendor.
To best understand how and why NetApp VST addresses the price/performance question for customers, it's good to know a little history on how NetApp arrays work. Why? Well, I figure if you know what questions NetApp has already answered with ONTAP, it lays out for you the next logical step NetApp would take with their technology. One question (or accusation depending on who is doing the speaking) that comes up quite a bit is "Heh, Flash Cache doesn't cache writes!" I think it's actually a great lead-in for VST.
John Fullbright is one of our top Professional Services Architects and does an amazing amount of personal research in his "spare" time. As we were having this conversation around VST, John started to talk about some personal testing he was doing to characterize WAFL write acceleration. I asked John if he would mind writing up his findings. He was gracious enough to share the results, and I think they are a great empirical way to ground VST/AST discussions. I'll turn the rest of this blog over to John.
by John Fullbright, Public Cloud Architect, MSBU
NetApp’s Data ONTAP does something uniquely different from the majority of other storage vendors’ products; it’s optimized for writes. Indeed, write optimization was one of the original design criteria for Data ONTAP back in 1992. Dave Hitz himself explained this many years ago in TR-3001 (since updated). In brief, Data ONTAP eliminates the “Disk Parity Bottleneck” through its use of WAFL to coalesce a group of temporally located write IOs; pre-emptively “defrag” if you will this group of I/Os based upon the best possible allocation unit or “tetris” available; calculate parity for the entire lot while in memory, and stripe the lot of them across all available drives during the next write event (aka consistency point, CP).
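To put a rough number on the "Disk Parity Bottleneck", here is a toy Python model that counts disk I/Os for small writes under two schemes: in-place read-modify-write parity updates, and buffered full-stripe writes with parity computed in memory. It is only a back-of-the-envelope sketch, not how Data ONTAP is implemented; the function names and the 22 data + 2 parity layout are illustrative assumptions, and the model ignores caching, metadata and partial-stripe effects.

```python
def rmw_disk_ios(num_writes: int, parity_disks: int = 1) -> int:
    """Disk I/Os for small in-place writes with read-modify-write parity.

    Each write reads the old data block and every old parity block, then
    writes the new data block and every new parity block.
    """
    return num_writes * 2 * (1 + parity_disks)


def full_stripe_disk_ios(num_writes: int, data_disks: int, parity_disks: int = 1) -> int:
    """Disk I/Os when writes are gathered in memory and flushed as full stripes.

    Parity is computed before anything hits disk, so no reads are needed:
    one write per data block plus one parity write per parity disk per stripe.
    """
    stripes = -(-num_writes // data_disks)  # ceiling division
    return num_writes + stripes * parity_disks


# Hypothetical example: 2,200 random 4K writes on 22 data disks + 2 parity disks.
in_place = rmw_disk_ios(2200, parity_disks=2)
coalesced = full_stripe_disk_ios(2200, data_disks=22, parity_disks=2)
print(in_place, coalesced)
```

In this simplified model the in-place scheme costs six disk I/Os per write with double parity, while the coalesced scheme costs barely more than one, which is the gap that write coalescing is meant to close.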
Several properties of WAFL make NetApp stand out from legacy arrays – “unified” or otherwise: WAFL is “RAID-aware” which allows engineers to introduce features like RAID-DP with zero performance impact. Snapshots are simply a preserved, specially named CP and part of the WAFL DNA. That’s why NetApp performance with or without Snapshots is the same and left on by default vs. after-market bolt-on snapshots that we see turned off by default and rarely, if ever, featured in 3rd party benchmarks. WAFL does a great job of mitigating the largest source of latency in any array: disk access. Essentially, Hitz and Lau conceived of a system that solved the megabytes-per-second-to-miles-per-hour equation more efficiently than anyone did before or has since. For more details, I recommend reading John Martin’s blog where he describes WAFL as the ultimate write accelerator. Kostadis Roussos, in his WAFL series on Extensible NetApp, provides great detail about how the process works as well.
Still, there are those who fail to get it. In this world of hypercompetitive storage vendors, it’s common to hear things like “WAFL degrades over time” or “Flash Cache doesn’t cache writes”. I like to think that these statements are made out of misunderstanding, not malice. To help demonstrate what WAFL means for write performance, I decided to run some tests using 100% random write workloads on my own FAS3050.
The Test Platform:
- ONTAP 7.3.4
- (2) DS14MK2-AT drive shelves – dual looped
- (28) 320GB Maxtor MaxLine II 5400RPM SATA drives
  - Storage for Checksums ~8%
  - Right-sized (for hot-swap) = ~2% across the industry
  - WAFL Formatted = 26.8 GB per drive (reserved for storage virtualization)
  - 11ms average seek time
  - 2MB buffer
  - Internal data rate 44MBps
  - Drive transfer rate of 133 MBps
- Storage Efficiency/Virtualization Technologies Employed:
  - Volume-level Deduplication
  - Thin Provisioning
  - RAID-DP RG size = 24
- Tuning options:
  - optimize_write_once = off
  - read_reallocate = on
  - options raid.raid_dp.raidsize.override on (Updated: 2/14/2011. See comments below.)
As you can see, a FAS3050 running two shelves of 5400 RPM MaxLine II drives each with a whopping 2MB buffer doesn’t exactly qualify as “state of the art” but it will do and actually help prove a point in the following test.
Like any customer, I wanted to make sure I could squeeze as much usable space out of my configuration as possible. Starting with a total of 28 drives I used:
- 3 of the drives for the aggregate containing the root volume
- 1 for a hot spare
- 24 disks in a single RAID-DP raid group for my test aggregate.
This filer is not part of a Metrocluster. I’m not using Synchronous SnapMirror. This allowed me to remove the 5% aggregate reserve to gain that space back. This leaves me with 5.18 TB (5.18 TB = 5304 GB = 22 drives * 241.2 GB) usable in the test aggregate. This filer also supports my home test environment, so I do have 26 Hyper-V VMs stored on it as well as a CIFS share that I put my ISO images on. I thin provision and de-duplicate all of my volumes and LUNs, so the 4.1 TB I have presented to my Hyper-V servers actually only consumes 52 GB from my test aggregate. I use iSCSI to connect my Hyper-V VMs to the storage.
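For anyone who wants to check the capacity math, here is a minimal Python sketch. It assumes binary units (1 TB = 1024 GB) and takes the 241.2 GB per-drive figure at face value; the variable names are mine. The product comes out a couple of GB above the 5304 GB quoted, presumably because the per-drive figure is rounded.

```python
RAID_GROUP_DISKS = 24          # single RAID-DP raid group in the test aggregate
PARITY_DISKS = 2               # RAID-DP is double parity
USABLE_GB_PER_DRIVE = 241.2    # per-drive figure quoted above

data_drives = RAID_GROUP_DISKS - PARITY_DISKS
usable_gb = data_drives * USABLE_GB_PER_DRIVE
usable_tb = usable_gb / 1024   # assumption: 1 TB = 1024 GB

print(f"{data_drives} data drives, {usable_gb:.1f} GB = {usable_tb:.2f} TB usable")
```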
For this test I created three volumes, each containing a 1.5 TB LUN connected via iSCSI to the VM running Iometer. I created a 100% random, 100% 4K write workload for this test. In the Iometer Access Specification, I also ensure that the IOs are aligned on a 4K boundary. Iometer creates a single test file on each LUN that it writes to. I created a separate worker for each LUN, for a total of three workers. The goal was to fill the filer up to 90% or so and then see how well both write allocation and segment cleansing work under these conditions. The test duration was 10 days in order to ensure that I would overwrite the data several times.
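To make the access pattern concrete, the sketch below generates the kind of offsets a 100% random, 4K-aligned write workload touches. This is not Iometer's code or configuration format, just a Python illustration; the function name, the seed parameter and the use of the full 1.5 TB LUN size (in binary units) as the target range are assumptions.

```python
import random

BLOCK = 4 * 1024               # 4K transfer size, also the alignment boundary
LUN_BYTES = 1536 * 1024**3     # 1.5 TB LUN, assuming 1 TB = 1024 GB


def random_aligned_offsets(count, lun_bytes=LUN_BYTES, block=BLOCK, seed=None):
    """Return `count` byte offsets chosen uniformly at random, each on a block boundary."""
    rng = random.Random(seed)
    total_blocks = lun_bytes // block
    return [rng.randrange(total_blocks) * block for _ in range(count)]


# Five sample offsets; every one is a multiple of 4096 and fits inside the LUN.
for offset in random_aligned_offsets(5, seed=1):
    print(offset, offset % BLOCK)
```

Because every offset is an independent uniform draw, consecutive writes almost never land next to each other, which is what makes this a worst case for an array that writes in place.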
Access Specification used for test
From the beginning of the test, I was hitting in excess of 10K IOPS with ~4 ms latency.
IOPS Achieved ~15 hours into the test
Into the third day, by this time with the entire test data set overwritten at least twice, the IOPS had increased to 12063 IOPS, and the latency was slightly less than 4 ms.
IOPS Achieved ~60 hours into the test
By the time the test was complete, we had written over 36 TB to the 4.5 TB of LUNs. That 4.5 TB of LUNs for the test, combined with the 52 GB of other data in my test aggregate, filled about 88% of the usable space. Over the 10 days the test ran, the achieved IOPS increased roughly 18% while the latency remained essentially flat at 4 ms.
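These totals can be sanity-checked from the numbers already given. The Python sketch below bounds the data written over 240 hours using the lowest and highest IOPS figures reported, then recomputes the fill level and the number of overwrites. It assumes binary units throughout and a constant rate for each bound; the names are illustrative.

```python
BLOCK_BYTES = 4 * 1024         # 4K writes
SECONDS = 240 * 3600           # 10-day run
TB = 1024**4                   # assumption: binary terabytes


def tb_written(iops):
    """Data written over the whole run at a constant IOPS rate, in TB."""
    return iops * BLOCK_BYTES * SECONDS / TB


low = tb_written(10_000)       # rate near the start of the test
high = tb_written(12_063)      # rate reported on the third day

fill = (4.5 * 1024 + 52) / (5.18 * 1024)   # test LUNs plus other data vs usable space
overwrites = 36 / 4.5                      # reported total vs LUN capacity

print(f"written: between {low:.1f} and {high:.1f} TB")
print(f"aggregate fill: {fill:.0%}, overwrites: {overwrites:.0f}x")
```

The reported 36 TB falls inside the range these bounds give (roughly 32 to 39 TB), the fill level works out to about 88%, and the LUNs were overwritten around eight times.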
OPS/Time 240 hour run
So, what’s the point of a test based on 100% random writes? This represents only a small fraction of workloads, right? The test is an example of how, by transforming random writes into temporally grouped sequential writes, Data ONTAP is truly optimized for write performance. With writes already optimized, if you’re in NetApp’s shoes, the next logical question to ask yourself is how would you optimize for reads? I think Kostadis said it well here:
So why do we need a lot of IOPS? Not to service the writes, because those are being written sequentially, but to service the read operations. The more read operations the more IOPS. In fact if you are doing truly sequential write operations, then you don’t need that many IOPS …
NetApp’s solution: reduce read latency by reducing the number of reads that go to disk with De-dupe aware Flash Cache. Why doesn’t Flash Cache cache writes? The write latency issue has already been solved. It makes no sense to cache writes if it only serves to put a bump in the wire. For write caching in Flash to make sense, it would either need to reduce latency at the same level of load or maintain the current latency at a higher level of load. In a highly write optimized environment, it does neither. In fact, it doesn’t appear to do much except put a bump in the wire for traditional arrays.
This was a test run on two-generation-old hardware to examine the impact of 100% random write workloads on the write allocation and segment cleaning processes in Data ONTAP. Although the ONTAP version is fairly new, it's certainly not the latest. NetApp has been refining these processes for nearly 20 years. Many trends are working in NetApp's favor:
- With each new ONTAP release, the algorithms improve.
- With each new iteration of hardware, there are faster CPUs, more cores, more RAM, faster hard drives, and so on. That’s more time and resources to make those improved algorithms work even better.
- As caching becomes ubiquitous at all levels of the stack, from the application to host to the network to the storage array, this tends to “wring the reads” out of the IO workload. Workloads increasingly have a higher percentage of writes – a demonstrated strength for WAFL.
- NetApp writes data in a manner which preserves temporal locality. This tends to make the job of the read caching algorithm a bit easier.
Going forward, it will be interesting to see what the result is when I start mixing reads into the workload and examining the impact of read cache. That, however, will be a future blog.