In a recent post, I dove into NetApp Flash Cache technology, the PCIe Flash modules installed in FAS and V-Series storage controller slots to accelerate read performance. In this blog we’ll jump right into the deep end of NetApp Flash Pool technology, the SSD companion to Flash Cache.
NetApp Flash Pools refer to a combination of SSDs and HDDs in what some call a “hybrid aggregate,” available in all FAS and V-Series storage systems. NetApp offers a 100GB SLC SSD as well as a newly released 200GB eMLC SSD. Using a combination of SSD and HDD within a single aggregate allows hot data to be tiered automatically between HDD and SSD for optimal performance. As with Flash Cache, hot data is not “moved” to SSD; rather, a copy of the data is created on SSD. The reason for this is twofold: first, copying data is a faster I/O process than moving it, and second, once the hot data is ejected from the SSD, there is no need to re-write the data back to HDD.
Unlike Flash Cache, however, both reads and writes can be cached in Flash Pool SSDs. When a write operation occurs on a Flash Pool-enabled volume, logic is used to determine whether it would be faster to write to both SSD and HDD together or just to HDD. Conversely, when a read operation occurs, logic determines whether data should be cached to SSD or read directly from HDD without caching. This determination is primarily based on whether the data pattern is random or sequential. In all cases, Flash Pool algorithms are speed-optimized.
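To make the random-versus-sequential idea concrete, here is a minimal sketch of that kind of caching decision. This is not NetApp’s actual algorithm; the function names, the single-block lookahead test, and the policy of caching only random I/O are all simplifying assumptions for illustration:

```python
# Illustrative sketch only: decide whether an I/O should be copied into the
# SSD tier of a hybrid aggregate, based on its access pattern.

def is_sequential(prev_block, block):
    """Treat an access as sequential if it immediately follows the
    previously accessed block (a deliberately naive heuristic)."""
    return prev_block is not None and block == prev_block + 1

def should_cache_on_ssd(op, prev_block, block):
    """Random reads and random overwrites benefit from SSD caching;
    sequential streams are served efficiently by HDD and left uncached."""
    sequential = is_sequential(prev_block, block)
    if op in ("read", "write"):
        return not sequential  # cache only the random pattern
    return False
```

A real implementation would track access history per extent and weigh many more signals, but the shape of the decision, pattern detection feeding a cache/bypass choice, is the same.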
Making SSDs Reliable
When using Flash-based SSDs, reliability is a major concern. See my blog, “The Flash Memory Evolution Revolution,” for more information about Flash’s vulnerability. When NetApp decided to take on SSD, we knew that our customers wouldn’t tolerate products that were prone to failure, and we also knew that, just like our HDDs, we’d need to back our SSDs with a 5-year warranty.
So how did we ensure this reliability? We approached SSD quality on several fronts. For one thing, you may have noticed that NetApp was not the first to jump blindly into SSDs; instead, we worked closely with SSD vendors to make sure their architectures had adequate safeguards built in, and then we took the time to do our own extensive testing to validate SSD designs.
Next, we changed the way SSDs are handled in our systems as compared to HDDs, specifically with regard to two internal Data ONTAP features: Disk Sanitize and Maintenance Center. Think of Sanitize as a low-level formatting and overwrite utility. It’s great for scrubbing data from HDDs but very slow for SSDs, requiring far too many erasure cycles for Flash to endure. So for SSDs we came up with a new way to scrub data without using traditional HDD methods, preserving those valuable erasure cycles.
Maintenance Center is another routine within Data ONTAP, discussed in great detail in my book Evolution of the Storage Brain. Suffice it to say that when an HDD begins behaving badly, it is sent to Maintenance Center, has diagnostics run on it to correct its behavior, and if it responds successfully, we place the HDD back into service without human hands ever touching it. The primary purpose for this is to prevent a high number of NTF (no trouble found) returns at the factory.
We treat SSDs differently. When an SSD acts up, we do not put it into Maintenance Center; instead, we fly it directly to NetApp HQ and perform extensive failure analysis. Why do we do this? Because we want to learn more about why the SSD failed, and work with our vendors to identify and correct any trouble early. Fortunately for customers, our SSD failure rate has been extremely low to date, but as the field population grows and matures, we want to make absolutely sure we are putting the best possible product into our customers’ hands. Early failure analysis is crucial.
Finally, we developed advanced SSD reporting and made it available to customers through our AutoSupport capability. Four new parameters were added to SSDs and are displayed in AutoSupport logs:
- percent rated life used (estimate)
- percent spare blocks consumed
- percent spare blocks consumed limit
- power-on hours
Taken together, these parameters provide customers a means for estimating, in calendar time, the projected use-based lifetime remaining for each SSD. This allows the customer to proactively plan for SSD replacements.
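The lifetime projection described above can be sketched with simple arithmetic. This is a hypothetical illustration, not the formula AutoSupport actually uses: it assumes wear accumulates linearly over the power-on period, and the function and parameter names are mine:

```python
# Hypothetical sketch: project calendar-time SSD life remaining from two of
# the counters described above (percent rated life used, power-on hours),
# assuming wear continues at its historical linear rate.

def estimated_days_remaining(rated_life_used_pct, power_on_hours):
    """Extrapolate: if rated_life_used_pct of life was consumed over
    power_on_hours, estimate hours until 100% and convert to days."""
    if rated_life_used_pct <= 0:
        return float("inf")  # no measurable wear recorded yet
    hours_per_pct = power_on_hours / rated_life_used_pct
    remaining_hours = (100 - rated_life_used_pct) * hours_per_pct
    return remaining_hours / 24
```

For example, an SSD that has consumed 10% of its rated life over 2,400 power-on hours (100 days) would project roughly 900 days of remaining life at the same workload. The spare-blocks counters serve as a second, independent limit on usable life.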
As you can see, there is more to SSD than meets the eye. I hope this information helps in understanding the technical nuances of NetApp Flash Pool. I’ve always felt the best customer is the one that takes time for deep dives into emerging technologies.
For general information about Flash Pool, click here
For a good Tech OnTap article, click here
For my SearchStorage article on Flash Cache and Flash Pool, click here
Data Storage Matters,