Only two things in life are sure: death and taxes. This applies not only to life, but also to storage arrays. Drives die, data is written to the wrong place, and data gets lost; however we, like most other storage vendors, do all we can to make sure our customers are not affected by these events. This is done via a variety of practices, most of which involve taking away some of the "Raw" capacity in order to provide for some level of redundancy. For that reason I've decided to call them taxes. My rationale for this is that while none of us like taxes, most of us value the services that they pay for. I could have called them reserves, or something less scary and more marketyish (I just made that word up, I kind of like it), but as a label I think it's sufficiently descriptive.
In this post, I'll only be covering those areas that I think should be measured in Base-10 SI units, and only include those things over which the customer has little or no choice (much like real taxes), or where there are default or best practice recommendations that are implemented most of the time. If you think the breakdown is wrong, confusing, or misleading, or that I've left something out, let me know, and I'll try to address it in subsequent posts.
Whole Disk Taxes
Disks fail, often at inconvenient times, which is why we have RAID, which I'll discuss later. Unfortunately, when one of the disks fails, the RAID group in question runs in a "degraded" mode. Depending on the RAID configuration, this degraded mode may have a negative performance impact and may leave the RAID group unprotected. Although neither of these two things is true of RAID-DP, we, like all other vendors, strongly recommend that some disks or disk capacity be reserved so that the data on the failed disk can be reconstructed quickly and easily from the data contained in the rest of the RAID group. For this reason it's usually a good idea to have two disks of each type available for reconstruction, as you never want to be left without at least one hot spare if you can avoid it. Now in theory, if you've got dual parity RAID and a fairly short delivery time for a replacement disk, then you should be able to get by without any hot spares at all, but this is hardly what I'd call "Best Practice" and goes against NetApp's engineering approach, which emphasizes reliability and preservation of data above all else.
The "Raw" Capacity of the disks allocated to Hot Spares
SI Gigabytes or SI Terabytes i.e. 1 Gigabyte = 1,000,000,000 bytes
Some architectures don't use dedicated disks for hot spares, but instead allocate spare areas on a number of disks to fulfill the same function. As far as I know, EVA and XIV both fall into this category; however, the same amount of disk space is allocated to hot spare space as would be done for dedicated physical disks, so logically it ends up the same.
Other “Whole Disk” Taxes
An example of this would be where a NetApp customer uses dedicated disks for root volumes. This may be considered a "desirable" configuration where there are a significant number of spindles in the overall configuration. For those of you who are interested, the reasons for using (or not using) dedicated disks for root volumes can be found at http://media.netapp.com/documents/tr-3437.pdf
Another example of this is the disks used by CLARiiON to hold the FLARE operating system and to act as a location to dump uncommitted writes from cache in case of a complete power failure. I've seen configurations where these disks were dedicated to this purpose and the customer would never place any production loads on them. I'm not sure if this is typical; however, I assume it too would be considered "desirable", if not a best practice.
The "Raw" capacity of whole disks allocated to "Vendor only" functions other than "hot spares"
Unit of Measurement
SI Gigabytes or SI Terabytes i.e. 1 Gigabyte = 1,000,000,000 bytes
I wasn't sure whether to put RAID under data protection, or whole disk taxes, or under a category all of its own. For the most part traditional RAID is a kind of whole disk tax, but while this holds true for many vendors, there are examples like LeftHand and NetApp who mix RAID and cross-site replication, and others who use de-clustered RAID schemes where RAID groups are not built out of whole disks. Because of that, and because it's such a well known aspect of storage efficiency, I think it deserves a category of its own. Given its close ties to the physical disk infrastructure, I feel that it should also be measured in "Raw" capacity (Base-10) units, just like the physical disks it protects.
The amount of "Raw" capacity allocated to RAID Protection.
e.g. a "5+1" RAID-5 group made up of six 300GB disks has 300GB of capacity allocated to RAID protection, whereas a RAID-10 group made of the same six 300GB disks has 900GB of capacity allocated to RAID protection.
SI Gigabytes or SI Terabytes i.e. 1 Gigabyte = 1,000,000,000 bytes
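To make the arithmetic behind that example explicit, here's a quick sketch (my own illustration; the function and layout names are not from any vendor tool):

```python
# Raw capacity allocated to RAID protection for the two example layouts.
def raid_tax(disk_gb, num_disks, layout):
    """Return GB of raw capacity consumed by RAID protection."""
    if layout == "raid5":
        # single parity: one disk's worth of capacity per RAID group
        return disk_gb
    if layout == "raid10":
        # mirroring: half of all disks hold mirror copies
        return disk_gb * num_disks // 2
    raise ValueError("unknown layout")

print(raid_tax(300, 6, "raid5"))   # 300, matching the "5+1" example
print(raid_tax(300, 6, "raid10"))  # 900, the RAID-10 example
```

The same six disks cost you three times as much raw capacity under RAID-10 as under single-parity RAID-5, which is why the RAID scheme chosen dominates this particular tax.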
Again, after a fair amount of thought, I'm combining a couple of things into the rightsizing tax: the first is "homogenizing" the disk drives so that all disks of a certain "size" end up having exactly the same number of usable blocks, and the other is converting from 512 to 520 byte sectors.
Homogenising the disks
If you look at the following output from the OnTap command sysconfig -r, you'll see an entry for one kind of "144GB" drive:
Device Used (MB/blks) Phys (MB/blks)
------------ ... -------------- --------------
2a.20 136000/278528000 138959/284589376
The thing I'd like to focus on for the moment is the number of physical blocks reported, which is the 284589376 number on the far right hand side. That is the number of 520 byte formatted sectors reported by that particular drive type. If you do the math, you'll see that this "144GB drive" actually has 147,986,475,520 bytes, so it's very nearly a 148GB drive. So how big is this drive? 147.98GB, like it reports? 146GB, as EMC and many other vendors would sell it? Or 144GB, as NetApp would sell it?
The answer is none of the above, at least from a NetApp perspective. What we do is standardize every drive by saying we will only use 278528000 of its 520 byte blocks, regardless of how many might actually be on there. This works out to 144,834,560,000 bytes, or 144.83GB raw. This is why we sell our drives as 144GB drives. I'm pretty sure that many other vendors also do similar kinds of rightsizing; it allows them to get drives from multiple vendors and provides some resiliency to slight changes in technology within the same drive vendor. As a customer, this rightsizing simplifies purchasing and design decisions.
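The rightsizing arithmetic is easy to check; here's a small sketch using the block counts from the sysconfig output above (nothing else in it is official):

```python
# Both block counts come from the sysconfig -r output; every formatted
# sector on these drives is 520 bytes.
SECTOR_BYTES = 520

physical_blocks = 284589376    # sectors this particular drive reports
rightsized_blocks = 278528000  # sectors OnTap actually uses

physical_bytes = physical_blocks * SECTOR_BYTES      # 147,986,475,520
rightsized_bytes = rightsized_blocks * SECTOR_BYTES  # 144,834,560,000

print(physical_bytes / 1e9)    # 147.98... GB, as the drive reports it
print(rightsized_bytes / 1e9)  # 144.83... GB, sold as a "144GB" drive
```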
The next thing you'll notice is that I've just said this disk reports as being 147.98GB, so how does that work out to 138959 MB? There are two reasons for this. The first is that although each sector is 520 bytes long, 8 bytes of each of these sectors are reserved for checksum overheads and data integrity purposes. The other reason is that the MB column is actually a Base-2 number, not Base-10, and IMHO should read MiB, but more on that later. The same factors also explain how we get 136000MB (really MiB) out of the 144GB "rightsized capacity". So what's up with this checksum overhead?
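Both conversions can be reproduced with a couple of lines (my sketch: 512 usable bytes per 520-byte sector, and Base-2 MiB rather than SI MB):

```python
# 8 of every 520 bytes go to checksums, leaving 512 usable bytes per
# sector; sysconfig's "MB" columns are really MiB (Base-2).
USABLE_BYTES = 512
MIB = 1024 * 1024

phys_blocks = 284589376   # physical sectors on the drive
used_blocks = 278528000   # rightsized sectors OnTap uses

print(phys_blocks * USABLE_BYTES // MIB)  # 138959, the "Phys" column
print(used_blocks * USABLE_BYTES // MIB)  # 136000, the "Used" column
```

Note that both effects are at work at once: the checksum overhead shrinks each sector from 520 to 512 usable bytes, and the switch to Base-2 units shrinks the number again.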
Adding in checksum information
Most storage vendors use some form of checksums to improve data integrity at the block level. For most vendors, this involves reformatting the disks to change the block size from 512 bytes per sector to 520 bytes per sector. For SATA disks, where the block size is fixed and cannot be changed, a variety of other techniques such as slip mask checksums or zoned checksums can be used, with varying capacity and performance tradeoffs.
NetApp has two different methods of adding checksum information. The first approach we used was called Zone Check Sums (ZCS); the subsequent, and now generally preferable, method is called Block Check Sums (BCS). There has been a bit of confusion about both of these approaches, and numerous explanations. One of the best can be found in a response by Steve Strange on John Toigo's blog here (http://www.drunkendata.com/?p=385), which I've edited slightly for readability and included below.
“ZCS works by taking every 64th 4K block in the filesystem and using it to store a checksum on the preceding 63 4K blocks. We originally did it this way so we could do on-the-fly upgrades of WAFL volumes (from not-checksum-protected to checksum-protected). Clearly, reformatting each drive from 512-byte sectors to 520 would not make for an easy, on-line upgrade. One of the primary drawbacks to ZCS is performance, particularly on reads. Since the data does not always live adjacent to its checksum, a 4K read from WAFL often turns into two I/O requests to the disk. Thus was born the NetApp 520-byte-formatted drive and Block Checksums (BCS); this is the preferred checksum method. Note that a volume cannot use a combination of both methods — a volume is either ZCS or BCS.
When ATA drives came along, we were stuck with 512-byte sectors. But we wanted to use BCS for performance reasons. So rather than going back to using ZCS, we use what we call an “8/9ths” scheme down in the storage layer of the software stack (underneath RAID). Every 9th 512-byte sector is deemed a checksum sector that contains checksums for each of the previous 8 512-byte sectors (which is a single 4K WAFL block). This scheme allows RAID to treat the disk as if it were formatted with 520-byte sectors, and therefore they are considered BCS drives. And because the checksum data lives adjacent to the data it protects, a single disk I/O can read both the data and checksum, so it really does perform similarly to a 520-byte sector FC drive (modulo the fact that ATA drives have slower seek times and data transfer/rotational speeds).”
Now one thing about BCS is that for FC drives you lose around 1% of your available space to checksums; on SATA, which uses 8/9ths BCS, that figure is a little over 11% (ouch!). If we use ZCS, the net loss from checksums is a little under 2%, which is consumed from the WAFL reserve (explained later), so the net loss is zero. So why do we bother? Why add another 11% tax if it's not necessary?
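As a sanity check on those percentages, here is the back-of-the-envelope arithmetic (my own, derived from the schemes described above, not official NetApp figures):

```python
# Fraction of raw capacity consumed by each checksum scheme.
bcs_fc = 8 / 520     # 8 checksum bytes per 520-byte sector
bcs_sata = 1 / 9     # every 9th 512-byte sector holds checksums
zcs = 1 / 64         # every 64th 4K block holds checksums

print(f"BCS on FC:  {bcs_fc:.1%}")    # ~1.5%
print(f"8/9ths BCS: {bcs_sata:.1%}")  # ~11.1%
print(f"ZCS:        {zcs:.1%}")       # ~1.6%
```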
Well, for a start, BCS is one of the technologies that helps us maintain a high level of performance, including high performance for FAS deduplication. This means that while you might lose 11% from BCS on SATA, you'll probably save a lot more than that with dedup, so overall you're ahead of the game. Another thing about maintaining high performance for SATA is that when you combine what can be thought of as "wide striping" via flexvols on large aggregates, high performance dual parity RAID, and intelligent caching, we can start using SATA for workloads previously reserved for RAID-10 on high speed FC drives. It's this combination of efficiency technologies that makes the big difference, but more on that later.
The other reason for using BCS is that we store a lot of interesting metadata inside those 8 bytes, not just CRC checksums. That metadata allows us to do some cool things, the first of which is "lost write protection", something that is, as far as I'm aware, unique to OnTap. I'm going to quote Steve here again, as this is one of the better explanations of it.
“Though it is rare, disk drives occasionally indicate that they have written a block (or series of blocks) of data when in fact they have not. Or, they have written it in the wrong place! Because we control both the filesystem and RAID, we have a unique ability to catch these errors when the blocks are subsequently read. In addition to the checksum of the data, we also store some WAFL metadata in each checksum block, which can help us determine if the block we are reading is valid. For example, we might store the inode number of the file containing the block, along with the offset of that block in the file, in the checksum block. If it doesn't match what WAFL was expecting, RAID can reconstruct the data from the other drives and see if that result is what is expected. With RAID-DP, this can be done even if a disk is currently missing!”
We've also found ways of leveraging this lost write capability to safely and transparently move blocks from one part of a disk to another part of the same, or even a completely different, disk. This is used in a number of ways to maintain and improve the performance of a FAS via mechanisms such as the read_realloc volume option.
The amount of "Raw" capacity reserved by the array for the purposes of homogenizing and checksumming disks
Data Layout Taxes
Modern storage arrays all have some form of data layout engine where the storage presented as a single logical LUN to a host is assigned to a number of physical disks within the array. The methods for doing so can be broadly categorized in the following ways, as defined in the SNIA dictionary:
Algorithmic mapping: If a volume is algorithmically mapped, the physical location of a block of data may be calculated from its virtual volume address using known characteristics of the volume (e.g., stripe depth and number of member disks).
Dynamic mapping: A form of mapping in which the correspondence between addresses in the two address spaces can change over time.
Tabular mapping: A form of mapping in which a lookup table contains the correspondence between the two address spaces being mapped to each other. If a mapping between two address spaces is tabular, there is no mathematical formula that will convert addresses in one space to addresses in the other.
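A toy illustration of the difference (entirely mine, not SNIA's, with made-up stripe parameters): an algorithmic map needs only a formula and the volume's fixed characteristics, while a tabular map needs a persistent lookup table, and storing that table is where the metadata cost comes from.

```python
# Algorithmic mapping: (disk, offset) is computed from the virtual
# block address using fixed volume characteristics -- no metadata.
STRIPE_DEPTH = 4   # hypothetical blocks per disk per stripe
NUM_DISKS = 5      # hypothetical member disks

def algorithmic_map(vblock):
    stripe, within = divmod(vblock, STRIPE_DEPTH)
    disk = stripe % NUM_DISKS
    offset = (stripe // NUM_DISKS) * STRIPE_DEPTH + within
    return disk, offset

# Tabular mapping: the correspondence lives in a lookup table that
# must itself be stored somewhere on disk -- the mapping metadata.
block_map = {0: (3, 17), 1: (0, 2)}   # vblock -> (disk, offset)

def tabular_map(vblock):
    return block_map[vblock]   # no formula can reproduce this table
```

The tabular version can point any virtual block at any physical block (which is what enables thin provisioning and non-disruptive data movement), but only at the price of keeping the table safe and consistent.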
Most array vendors are beginning to move more towards dynamic and tabular mapping methods because this allows them to provide functionality such as thin provisioning and to non-disruptively allocate more spindles to existing workloads. While this is relatively new for most array vendors, NetApp has been using tabular mapping since its inception. This allows us to make some substantial savings later on; however, keeping this table metadata requires that some area of disk capacity be dedicated to the system. In NetApp's case the amount of space reserved is 10% of the rightsized disk capacity, as this provides space for the metadata and also makes the write allocator's job a lot easier (and hence faster).
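Chaining this together with the earlier numbers gives a rough per-drive picture (my arithmetic, using the example "144GB" FC drive; the order in which the deductions apply is my assumption):

```python
# Rough capacity chain for the example "144GB" FC drive:
# rightsized raw -> after BCS checksums -> after the 10% reserve.
rightsized_gb = 144.83                  # from the rightsizing section
after_bcs = rightsized_gb * 512 / 520   # 8 of 520 bytes go to checksums
reserve_gb = rightsized_gb * 0.10       # 10% of rightsized capacity
after_reserve = after_bcs - reserve_gb

print(round(after_reserve, 1))  # roughly 128 GB left of 144.83 raw
```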
I've seen a lot of announcements regarding thin/dynamic/virtual provisioning from various vendors, but little if any disclosure on how much space needs to be reserved to keep the mapping information. It's possible the overheads are negligible, but in the interests of full disclosure and apples-for-apples comparisons, I think this is a general category that should be included, or explicitly excluded, in any efficiency comparisons/claims that are made.
Just because approaches that provide a mathematical formula for finding the location of a requested block have no need to keep metadata doesn't exclude them from this category. The requirement to have rigidly defined stripe sizes and widths etc. often leads to areas of storage that cannot be allocated to users. Any wastage/tax required for algorithmic mapping should also be included.
The amount of storage hidden from the user by the array in the process of creating a map between the physical storage on the array and the logical storage presented to the users.
Either SI Units (GB / TB) for Base-10 or IEC Units (GiB / TiB) depending on whatever makes for the fairest comparison or most understandable calculation provided that the units are explicitly stated.
No More Taxes
Ok, no more taxes. From here on I'll start talking about what we can do with the high performing, highly available, self repairing storage that array vendors create out of the same disk drives that you can buy at your local computer store. If there are taxes you think I've left out, or areas where I've been unclear, please let me know.
Consulting Systems Engineer - ANZ