Over the years, I've described the design of NetApp deduplication to hundreds of co-workers, customers, prospects, resellers, and anybody else who would listen - to the point where I distilled my summary down to about 10 minutes in front of a whiteboard. For the benefit of those who haven't heard this mini-lecture, here is a simple description of exactly how NetApp dedupe works.
The Building Blocks of NetApp Deduplication
If you think of a NetApp storage system in terms of three main modules, it becomes easier to understand how we designed deduplication. As the diagram below shows, the three modules of a NetApp FAS or V-Series system are:
1) An operating system (Data ONTAP)
2) A filesystem (Write Anywhere File Layout, or WAFL)
3) The actual stored data objects (WAFL Blocks).
As with any other filesystem, WAFL consists of a hierarchy of superblocks, inode pointers, and associated metadata. It is important to understand that WAFL does not know or care what application wrote the data or what protocol sent it. Whether a data block comes from a database, a Word doc, or a medical image is irrelevant, as is the protocol that delivered it: CIFS, NFS, iSCSI, or FC-SAN. None of that matters to WAFL. It just knows it has received a 4K chunk of data that it will store as a file within its directory structure.
Designing Deduplication into Data ONTAP
The first step in designing deduplication is to create a method of comparing data objects and figuring out which objects are unique and which are duplicate. This generally involves the creation of a hash, or fingerprint, which is a small digital representation of a larger data object, and NetApp deduplication is no exception. Fortunately, this fingerprint already exists in Data ONTAP. Each time a WAFL block is created, a checksum character is generated for the purpose of consistency checking. NetApp deduplication simply "borrows" a copy of this checksum and stores it in a catalog of all fingerprints, as shown in the diagram below.
As the diagram above illustrates, each time a system write occurs, the deduplication process interrupts the I/O stream, requests a copy of the checksum, and stores it in its catalog as a fingerprint. Although customers tell me they don't see any measurable performance impact during this process, we've measured approximately 7% write performance overhead in our labs.
The other thing shown in the diagram is that it is possible to scan existing data and pull those fingerprints into the catalog. In fact, the first time you run dedupe, you'll get a message reminding you that you should do this.
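For readers who think better in code, here is a minimal Python sketch of the fingerprint-gathering idea. It is an illustration only, not Data ONTAP code: the `FingerprintCatalog` class and its method names are hypothetical, and `zlib.crc32` simply stands in for the WAFL block checksum, whose actual algorithm isn't covered here.

```python
import zlib

BLOCK_SIZE = 4096  # WAFL works in 4K chunks


def fingerprint(block: bytes) -> int:
    """Stand-in for the per-block checksum (CRC32 is an assumption)."""
    return zlib.crc32(block)


class FingerprintCatalog:
    """Hypothetical catalog of (fingerprint, block_number) entries."""

    def __init__(self):
        self.entries = []

    def record_write(self, block_number: int, block: bytes) -> None:
        # Called on each new write: keep a copy of the block's checksum.
        self.entries.append((fingerprint(block), block_number))

    def scan_existing(self, blocks: dict) -> None:
        # Pull in fingerprints for data stored before dedupe was enabled.
        known = {block_number for _, block_number in self.entries}
        for block_number, block in blocks.items():
            if block_number not in known:
                self.record_write(block_number, block)


# Usage: two blocks already on disk, then one new write.
existing = {0: b"A" * BLOCK_SIZE, 1: b"B" * BLOCK_SIZE}
catalog = FingerprintCatalog()
catalog.scan_existing(existing)
catalog.record_write(2, b"A" * BLOCK_SIZE)
```

Note that nothing is deduplicated here. Blocks 0 and 2 have the same fingerprint in the catalog, but both still sit on disk.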
We Ain't Deduping Yet
At this point we have enabled deduplication and gathered information in the form of digital fingerprints. However, no deduplication has actually occurred yet. NetApp deduplication uses a "post-processing" routine, which means that deduplication is run at intervals after the data is stored.
In NetApp's case, the actual deduplication process is triggered in one of three ways:
1) It can be started manually from the CLI or GUI
2) It can be scheduled to run at predetermined times and intervals
3) It can run automatically based on a data growth threshold being crossed
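The three triggers can be pictured as a simple decision function. This is a rough sketch under my own assumptions: the function name, its parameters, and the way growth is measured (new data written since the last run as a fraction of stored data) are all hypothetical, and the threshold value is left to the caller rather than taken from any NetApp default.

```python
from datetime import datetime
from typing import Optional


def dedupe_trigger(manual_request: bool,
                   now: datetime,
                   next_scheduled: Optional[datetime],
                   new_bytes: int,
                   stored_bytes: int,
                   growth_threshold: float) -> Optional[str]:
    """Return which trigger fires, or None if dedupe should not start."""
    if manual_request:                      # 1) started by hand
        return "manual"
    if next_scheduled is not None and now >= next_scheduled:
        return "schedule"                   # 2) scheduled time reached
    if stored_bytes > 0 and new_bytes / stored_bytes >= growth_threshold:
        return "growth"                     # 3) growth threshold crossed
    return None


# Usage: no manual request, schedule not yet due, but data grew by 25%.
reason = dedupe_trigger(
    manual_request=False,
    now=datetime(2024, 1, 1, 12, 0),
    next_scheduled=datetime(2024, 1, 2, 0, 0),
    new_bytes=250,
    stored_bytes=1000,
    growth_threshold=0.2,
)
```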
Regardless of how deduplication is started, the figure below describes the process.
The NetApp Deduplication Process
There is a lot going on in this diagram, so let’s break it down step-by-step:
1) The fingerprint catalog is sorted and searched for identical fingerprints
2) When a fingerprint "match" is made, the associated data blocks are retrieved and compared byte-for-byte to confirm they really are identical (as shown by the green boxes in the diagram above)
3) Assuming successful validation, the inode pointer metadata of the duplicate block is redirected to the original block (as shown by the two arrows pointing to the same WAFL block)
4) The duplicate block is marked as "Free" and returned to the system, eligible for re-use
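The four steps above can be sketched in a few lines of Python. Treat this as a toy model rather than the real implementation: the `dedupe_pass` function and its data structures (a list for the catalog, a dict of block contents, and a dict of pointers standing in for inode pointer metadata) are all assumptions made for illustration.

```python
from itertools import groupby


def dedupe_pass(catalog: list, blocks: dict, pointers: dict) -> set:
    """Run one dedupe pass and return the set of freed block numbers.

    catalog:  list of (fingerprint, block_number)
    blocks:   block_number -> block contents (bytes)
    pointers: logical location -> block_number
    """
    freed = set()
    catalog.sort()  # step 1: sort so identical fingerprints are adjacent
    for _, group in groupby(catalog, key=lambda entry: entry[0]):
        kept = []  # blocks retained as originals for this fingerprint
        for _, block_no in group:
            # Step 2: a fingerprint match is only a hint, so compare bytes.
            original = next(
                (k for k in kept if blocks[k] == blocks[block_no]), None)
            if original is None:
                kept.append(block_no)  # unique data (or a collision)
                continue
            # Step 3: redirect every pointer at the duplicate to the original.
            for location, target in list(pointers.items()):
                if target == block_no:
                    pointers[location] = original
            # Step 4: free the duplicate block for re-use.
            del blocks[block_no]
            freed.add(block_no)
    catalog[:] = [entry for entry in catalog if entry[1] not in freed]
    return freed


# Usage: blocks 0 and 2 are identical; block 3 shares their fingerprint
# by (contrived) collision but holds different data, so it must survive.
blocks = {0: b"A" * 4096, 1: b"B" * 4096, 2: b"A" * 4096, 3: b"C" * 4096}
catalog = [(11, 0), (22, 1), (11, 2), (11, 3)]
pointers = {("file1", 0): 0, ("file1", 1): 1,
            ("file2", 0): 2, ("file2", 1): 3}
freed = dedupe_pass(catalog, blocks, pointers)
```

The byte-for-byte check in step 2 is what makes the contrived collision on block 3 harmless: a matching fingerprint alone never causes a block to be freed.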
That's It - NetApp Dedupe In A Nutshell!
Data Storage Matters,