During a recent conversation with a senior IT strategist about NetApp's technology solutions and capabilities, I noticed a theme that inspired me to write this blog. The strategist indicated that he was eager to learn about our solutions, yet he kept asking, “Why are NetApp's technologies better than other companies’?” He was particularly interested in snapshots. “Everyone claims to have snapshots,” he commented. “What makes NetApp’s snapshot solutions different from, and perhaps better than, other vendors’?”
In this blog we’ll discuss snapshot basics: what a snapshot is, how it works, and how NetApp implements snapshots based on redirect-on-write (RoW), compared with competing implementations that are based on a copy-on-write (CoW) architecture. We’ll explore key architectural and operational considerations and highlight important differences. This information is meant as a guide for IT practitioners (IT managers, architects, system administrators) and anyone involved in evaluating and testing snapshots as the basis for designing a backup, replication, and operational/disaster recovery strategy.
What is a snapshot?
In general terms, a snapshot is a locally retained, read-only, point-in-time virtual copy of a file system or volume. Most snapshots are time- and space-efficient. When properly implemented, they enable faster operational recovery (OR) and help meet tighter recovery point objectives (RPOs), recovery time objectives (RTOs), and service level agreements (SLAs). Snapshots are not a replacement for backups but can be foundational to implementing a solid backup strategy. Research conducted by International Data Corporation (IDC) shows that enterprises are increasingly relying on disk-based backup/restore software to meet their shrinking backup windows and application availability requirements. Interestingly, 46% of backup and restore implementations are disk-based, followed by tape at 38% (according to IDC), which is a dramatic change from the past. This change makes it necessary to understand the options: the capabilities and limitations of the different snapshot implementations.
Are all snapshots the same?
No. Snapshots differ in architectural design and implementation, which affects space utilization, performance, reliability, scalability, ease of operations, and restoration capabilities. It’s important to understand the similarities and differences between snapshot implementations, because they may affect your ability to meet your business and technical requirements.
RoW and CoW snapshot technologies both create time- and space-efficient snapshots… it’s in what happens next, when handling changes, that the clear differentiation begins.
What are the two primary snapshot implementations?
Next, we’ll cover the two most widely used and adopted snapshot implementations: redirect-on-write and copy-on-write.
1. Redirect-on-Write snapshots (NetApp)
At the core of NetApp snapshots is WAFL (Write Anywhere File Layout), which is built into Data ONTAP, the software that runs on FAS storage controllers. NetApp developed the WAFL file system to enable high-performance, high-integrity storage systems. The file system uses a set of pointers (metadata) to the individual blocks of data to track where everything is. By copying those pointers, and not the data, it can capture an instantaneous image of the entire file system. (Figure 1)
WAFL leverages the “redirect-on-write” technique to keep track of changes made after a snapshot is taken. Redirect-on-write (RoW) is similar to copy-on-write (CoW) in that it’s time- and space-efficient. By design, a RoW snapshot is optimized for write performance, so any changes/updates are redirected to new blocks. CoW must write one copy of the original data to a reserved snapshot space (cache, LUN reserve, or snapshot pool; the name changes according to the vendor) in addition to writing the changed data. RoW writes only the changed data to new blocks.
Creation of a snapshot is space-efficient (a few KBs) and time-efficient (less than a second); only volume metadata is copied to the snapshot. Snapshots track changes to the original volume; read requests are satisfied from the original volume.
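To make the pointer-copy idea concrete, here is a minimal sketch in Python. It is a toy model, not WAFL: the names `physical_blocks`, `active_map`, `take_snapshot`, and `read` are hypothetical and chosen only for illustration.

```python
# Toy model (hypothetical names, not NetApp code): a volume is a block map
# (logical block name -> physical block id) plus a store of physical blocks.
physical_blocks = {0: "data-A", 1: "data-B", 2: "data-C", 3: "data-D"}
active_map = {"A": 0, "B": 1, "C": 2, "D": 3}


def take_snapshot(block_map):
    """Copy only the pointers (metadata); no data block is touched."""
    return dict(block_map)


def read(block_map, name):
    """Follow the pointer for a logical block and return its content."""
    return physical_blocks[block_map[name]]


snapshot = take_snapshot(active_map)

# The snapshot is a separate set of pointers to the same physical blocks.
print(snapshot is not active_map)   # True
print(read(snapshot, "B"))          # data-B
print(len(physical_blocks))         # 4 (nothing was duplicated)
```

Because only the map is copied, the cost of creating the snapshot in this sketch depends on the amount of metadata, not on the amount of data in the volume.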
Any changes/updates to the original volume are performed as follows:
Step 1: The filesystem writes updates to new blocks. WAFL keeps track of available blocks, which allows changes to be made very efficiently. For example, as data blocks (B, C) are changed/updated, pointers in the active file system are redirected to new blocks (B’, C’); the snapshot pointers, however, still point to the original blocks to preserve that point-in-time image. (Figure 2)
In summary, a write to a volume/LUN takes:
• 1 write (1x write I/O)
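The redirect-on-write step can be sketched as a small Python class that counts I/Os. This is an illustrative toy under simplifying assumptions (one I/O per block, no freeing of blocks, no caching); the class `RowVolume` and its methods are hypothetical and do not represent how WAFL is implemented.

```python
class RowVolume:
    """Toy redirect-on-write volume (hypothetical, for illustration only)."""

    def __init__(self, data):
        self.blocks = {}      # physical block id -> content
        self.active = {}      # logical block name -> physical block id
        self.next_free = 0    # next unused physical block id
        self.reads = 0        # I/O counters for updates after creation
        self.writes = 0
        for name, content in data.items():
            self.active[name] = self._allocate(content)

    def _allocate(self, content):
        """Place content in a fresh physical block and return its id."""
        block_id = self.next_free
        self.next_free += 1
        self.blocks[block_id] = content
        return block_id

    def snapshot(self):
        """Copy the pointers only."""
        return dict(self.active)

    def write(self, name, content):
        """Redirect the active pointer to a new block: one write, no read."""
        self.active[name] = self._allocate(content)
        self.writes += 1

    def read(self, block_map, name):
        return self.blocks[block_map[name]]


vol = RowVolume({"A": "a", "B": "b", "C": "c"})
snap = vol.snapshot()
vol.write("B", "b'")
vol.write("C", "c'")

print(vol.reads, vol.writes)        # 0 2
print(vol.read(snap, "B"))          # b   (snapshot still sees the original)
print(vol.read(vol.active, "B"))    # b'  (active file system sees the update)
```

The old blocks stay in place because the snapshot still points at them. A real system would reclaim them once no snapshot references them; this sketch omits that step.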
It is important to understand the limitations of non-NetApp implementations of snapshot technology. Competitive offerings typically read the old data and write it to a new location before writing out the new data. This is often explained as a feature called “copy-on-write” (next section), but this feature adds considerably to the system overhead. For each block of data changed for the first time in the copy-on-write process, there is one read and two writes, compared to a single write for NetApp.
2. Copy-on-Write snapshots (Other vendors’ snapshots)
When “copy-on-write” snapshots are first created, only the metadata about where the original data is stored is copied. No physical copy of the data is done at the time the snapshot is created. Therefore, the creation of the snapshot is time- and space-efficient.
As blocks on the original volume change, the original data is copied (moved over) into the pre-designated space (reserved storage capacity) set aside for the snapshot prior to the original data being overwritten. The original data blocks are copied just once at the first write request (after the snapshot was taken; this technique is also called copy-on-first-write). This process ensures that snapshot data is consistent with the exact time the snapshot was taken, and is why the process is called "copy-on-write."
After the initial creation of a snapshot, the snapshot copy tracks the changing blocks on the original volume as writes to the original volume are performed. The implementation of “copy-on-write” snapshots requires the configuration of a pre-designated space (typically 10-20% of the size of the volume/LUN) to store the snapshots. A snapshot cache/reserve pool is initialized; read requests are satisfied from the original volume. (Figure 3)
Any changes/updates to the original volume are performed as follows:
Step 1: The filesystem reads in original data blocks (1 x read I/O) in preparation for the copy. In this example blocks B and C will be updated with new data. (Figure 4)
Step 2: Once the original data (B, C) has been read by the production filesystem/LUN, it is copied (1 x write I/O) into the designated storage pool that is set aside for the snapshot before the original data is overwritten, hence the name “copy-on-write”. (Figure 5)
Step 3: Write the new and modified data blocks (B’, C’) to the original data block locations (1 x write I/O) and re-link the blocks to the original snapshot. (Figure 6)
In summary, a write (change/update) to a volume/LUN takes:
• 1 read (1 x read I/O) and
• 2 writes (2x write I/O)
Note that original data blocks are copied only once into the snapshot storage, when the first write request is received; subsequent writes to the modified block are not copied to the snapshot reserved area (until a new snapshot is created). Copy-on-write snapshots will impact performance on the original volume for as long as the snapshot exists, because write requests to the original volume must wait while the original data is being "copied out" to the snapshot reserved pool. CoW snapshots require the original copy of the data to be valid, as do RoW implementations.
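The three copy-on-write steps, including the copy-on-first-write behavior, can be sketched the same way. This is a generic toy model with hypothetical names (`CowVolume`, `reserve`), not any specific vendor's implementation, and it assumes a single snapshot and one I/O per block.

```python
class CowVolume:
    """Toy copy-on-write volume (hypothetical, for illustration only)."""

    def __init__(self, data):
        self.blocks = dict(data)  # logical block name -> content, updated in place
        self.reserve = {}         # snapshot reserve: originals of changed blocks
        self.snapshot_active = False
        self.reads = 0            # I/O counters for updates
        self.writes = 0

    def snapshot(self):
        """Start a new snapshot; nothing is copied yet."""
        self.reserve = {}
        self.snapshot_active = True

    def write(self, name, content):
        if self.snapshot_active and name not in self.reserve:
            original = self.blocks[name]      # Step 1: read the original block
            self.reads += 1
            self.reserve[name] = original     # Step 2: copy it to the reserve
            self.writes += 1
        self.blocks[name] = content           # Step 3: overwrite in place
        self.writes += 1

    def read_snapshot(self, name):
        """Changed blocks come from the reserve, unchanged ones from the volume."""
        if name in self.reserve:
            return self.reserve[name]
        return self.blocks[name]


vol = CowVolume({"A": "a", "B": "b", "C": "c"})
vol.snapshot()

vol.write("B", "b'")                # first write after the snapshot
print(vol.reads, vol.writes)        # 1 2

vol.write("B", "b''")               # repeat write: no second copy-out
print(vol.reads, vol.writes)        # 1 3

print(vol.read_snapshot("B"))       # b
print(vol.read_snapshot("A"))       # a  (served from the original volume)
```

Note that reading unchanged blocks through the snapshot depends on the original volume, which is why the original copy of the data must remain valid.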
Architectural considerations:
- Plan for appropriate storage capacity: copy-on-write-based snapshots require the pre-allocation and provisioning of dedicated, reserved storage capacity for snapshots. This space is off-limits to other workloads, thus reducing usable storage on the system. RoW-based systems (NetApp) allow for more flexibility, and “snapshot reserve” is an optional setting.
- Consider RPO, RTO, and SLA requirements when evaluating the appropriate snapshot implementation. RoW-based snapshots are designed to meet stringent RPO/RTO requirements.
- Consider the performance impact of frequent snapshots on write-intensive workloads (OLTP). Tight RPOs mean frequent snapshots (multiple times during the day, or even hourly), which in turn may have a performance impact on the systems. CoW-based snapshots are sensitive to write-intensive workloads.
- Consider the “data change rate” of targeted workloads. This is important to understand, as a high data change rate will mean high I/O (read/write) overhead on any changes/updates after snapshots are taken. Again, CoW-based snapshots are sensitive to high data change rates.
- Scalability of the solution: an important question to ask storage vendors is the maximum number of snapshots they support. This number will be important when designing your data protection scheme. Most vendors support up to 64 snapshots; NetApp supports 255 snapshots.
- Consider how your snapshot design will fit into your overall data protection-and-recovery process and all its downstream processes (data replication, backup, DR…). Ideally you want fewer components and moving parts.
- Consider application awareness and integration. For example, does it integrate with my Oracle DB infrastructure? How about Exchange? SQL Server? For additional information about Oracle, refer to this NetApp Technical Report (TR): http://media.netapp.com/documents/tr-3858.pdf
- Compatibility with other systems: How well does this integrate with my existing environment? For example: can I leverage my existing backup infrastructure? How about replication? DR?
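For the capacity and data-change-rate points above, a back-of-the-envelope estimate can help during evaluation. The sketch below uses only the per-write I/O counts and the 10-20% reserve range given in this blog. The function names and the `first_write_fraction` parameter are hypothetical, and real systems will differ because of caching, block sizes, and vendor-specific optimizations.

```python
def snapshot_write_ios(changed_blocks, first_write_fraction=1.0):
    """Estimate I/Os for block updates made after a snapshot is taken.

    first_write_fraction is the share of updates that hit a block for the
    first time since the snapshot (an assumed workload property).
    CoW: first write = 1 read + 2 writes; repeat write = 1 write.
    RoW: every write = 1 write.
    """
    first_writes = round(changed_blocks * first_write_fraction)
    repeat_writes = changed_blocks - first_writes
    return {
        "cow": first_writes * 3 + repeat_writes,
        "row": changed_blocks,
    }


def cow_reserve_gb(volume_gb, reserve_percent):
    """Size of a CoW snapshot reserve, given as a percentage of the volume."""
    return volume_gb * reserve_percent / 100


print(snapshot_write_ios(1000))          # {'cow': 3000, 'row': 1000}
print(snapshot_write_ios(1000, 0.25))    # {'cow': 1500, 'row': 1000}
print(cow_reserve_gb(500, 10))           # 50.0
print(cow_reserve_gb(500, 20))           # 100.0
```

The estimate shows why frequent snapshots matter for CoW: each new snapshot turns the next write to every block back into a first write.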
IT Operational considerations:
- Snapshot management: It’s important to consider the management overhead of any snapshot implementation. For example, it’s fair to aim for a unified snapshot solution for both blocks and files. IT departments need one way to create and manage snapshots, whether it's files or blocks.
- Monitoring and alerting is another key area to consider. Make sure to implement the appropriate level of monitoring and alerting for the dedicated “snapshot” volume in CoW-based implementations; lack of space will prevent the creation of additional snapshots.
- Clean-up: Creating and defining a snapshot policy should include clean-up and removal of old snapshots. Avoid “snapshot sprawl”.
- License: Do you need a separate license for snapshots? If so, consider cost and maintenance.
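As one way to picture the clean-up and monitoring points above, here is a small sketch of a retention rule and a reserve-space check. Everything in it is hypothetical: the function names, the `(name, created_at)` tuple format, and the default thresholds are assumptions for illustration, not settings from any storage product.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(snapshots, keep_last=24, max_age=timedelta(days=7), now=None):
    """Return names of snapshots that fall outside the retention policy.

    snapshots: list of (name, created_at) tuples.
    A snapshot is removed if it is not among the newest keep_last
    snapshots or if it is older than max_age.
    """
    now = now or datetime.now()
    newest_first = sorted(snapshots, key=lambda snap: snap[1], reverse=True)
    expired = []
    for index, (name, created_at) in enumerate(newest_first):
        if index >= keep_last or now - created_at > max_age:
            expired.append(name)
    return expired


def reserve_needs_alert(used_gb, reserve_gb, threshold_percent=80):
    """True when the snapshot reserve usage reaches the alert threshold."""
    return used_gb * 100 >= reserve_gb * threshold_percent


snaps = [
    ("snap-1", datetime(2024, 1, 1)),
    ("snap-2", datetime(2024, 1, 8)),
    ("snap-3", datetime(2024, 1, 9)),
    ("snap-4", datetime(2024, 1, 10)),
]
print(snapshots_to_delete(snaps, keep_last=2, now=datetime(2024, 1, 10)))
# ['snap-2', 'snap-1']
print(reserve_needs_alert(85, 100))   # True
print(reserve_needs_alert(50, 100))   # False
```

In practice the snapshot list and the space figures would come from the storage system's own management tools; the sketch only shows the policy logic.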
In conclusion, you can see there are clear and significant differences between the CoW- and RoW-based snapshot implementations. It’s important to understand these differences and evaluate which solution provides the most value and benefit in achieving your business and technical requirements.