Ask Dr Dedupe in NetApp Blogs


Today marks the final post for Ask Dr Dedupe.  After five years and a hundred or so posts, deduplication has become commonplace, and there are many other emerging and interesting data storage technologies to explore.  So take a look at my new blog, “About Data Storage,” where I’ll be discussing all aspects of emerging data storage technologies.


So, for this, my final post, I thought it would be interesting to take a look at the current state of array-based deduplication in the industry, among both the incumbents and the startups.  First, I’ll summarize the six dominant storage vendors that provide over 80% of all networked enterprise storage in the world today:



Dell

Dell acquired Ocarina in 2010 and adapted its technology into a pseudo-dedupe offering.  I use the word pseudo because I still have doubts about whether image compression constitutes true deduplication.  But they advertise it as such, so I’ll give them the benefit of the doubt.  The DR4000 is Dell’s only storage array with any type of deduplication capability.

Dedupe Grade: D
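Since the compression-versus-deduplication distinction comes up repeatedly below, here is a toy sketch of block-level deduplication in Python.  This is illustrative only and nothing like any vendor's actual implementation (real arrays add variable-length chunking, hash-collision handling, and reference counting).  The point: deduplication fingerprints blocks and stores each unique block once, even when the duplicates come from different files, whereas compression shrinks each object independently and cannot exploit redundancy across objects.

```python
import hashlib

def dedupe_store(data, block_size=4096):
    """Fingerprint fixed-size blocks; keep one copy of each unique block."""
    store = {}    # fingerprint -> block bytes (each unique block stored once)
    recipe = []   # ordered fingerprints, enough to rebuild the original data
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        fp = hashlib.sha256(block).hexdigest()
        store.setdefault(fp, block)   # duplicate blocks are not stored again
        recipe.append(fp)
    return store, recipe

# Two "files" that share three of their four blocks:
file_a = b"A" * 4096 * 3 + b"B" * 4096
file_b = b"A" * 4096 * 3 + b"C" * 4096

store, recipe = dedupe_store(file_a + file_b)
# Eight logical blocks land on disk as only three unique blocks (an 8:3 ratio).
```

Rehydration is just `b"".join(store[fp] for fp in recipe)`, which reproduces the original bytes exactly.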



EMC

EMC’s product line is a conglomeration of four architectures, so we’ll take each one separately:


  1. Symmetrix V-Max – no deduplication currently offered.
  2. Isilon – no deduplication currently offered.  In an interesting side note, one Isilon blogger pondered whether deduplication is just a fad (like rock and roll?)
  3. VNX – deduplication of static files only.  This seems to be a half-hearted attempt at deduplication for primary or archival data.  It is performance-intensive and not recommended for busy systems.
  4. Data Domain – full deduplication offered.  As any old-timer like me might remember, Data Domain invented the concept of deduplication-embedded storage arrays.  Data Domain products are wholly focused on D2D backups, and they are the market leader in the space.  Extra credit granted to EMC for Data Domain's contribution.

Dedupe Grade: C


Hitachi Data Systems

It’s a bit difficult to keep up with the revolving door of storage arrays offered by HDS, but I’ll take a shot here:


  1. Virtual Storage Platform – no mention of deduplication.
  2. Unified Storage VM – no mention of deduplication.
  3. Unified Storage VM 100 – no mention of deduplication.
  4. Content Platform – no mention of deduplication.
  5. NAS Platform Family (BlueArc) – no mention of deduplication.
  6. Adaptable Modular Storage 2000 Family – no mention of deduplication.


Dedupe is apparently not spoken at HDS.

Dedupe Grade: F



HP

HP announced StoreOnce deduplication in 2010.  In the announcement, they promised dedupe everywhere, with no need to ever rehydrate data once it’s stored (hence the name StoreOnce).  Unfortunately, this idea didn’t even sound good on paper, and with sparse indexing as the cornerstone of HP’s deduplication, it was impossible in practice.  Predictably, HP has relegated deduplication to D2D backup appliances only.  HP’s rising star, 3PAR Utility Storage, offers no indication of ever bringing deduplication into its portfolio.  Despite the early hype about dedupe everywhere, the idea remains locked in the minds of HP.  Because they promoted false dedupe hopes, HP receives a one-grade-point deduction.

Dedupe Grade:  D
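For readers curious why sparse indexing confines HP to backup workloads: rather than keeping every chunk fingerprint in memory, a sparse index samples a small fraction of fingerprints ("hooks") and uses them to find previously stored segments that resemble an incoming one.  That works well on highly self-similar backup streams but gives weak guarantees on random primary I/O.  The sketch below is my own loose illustration of the idea, not HP's implementation; the 4 KB chunk size and roughly 1-in-16 sampling rate are arbitrary assumptions.

```python
import hashlib

HOOK_MASK = 0b1111  # a hash is a "hook" when its low bits are zero: ~1/16 sampled

def chunk_hashes(data, size=4096):
    """Fingerprint fixed-size chunks of an incoming segment."""
    return [hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)]

def hooks(hashes):
    # Cheap, deterministic sampling of the chunk fingerprints.
    return {h for h in hashes if h[0] & HOOK_MASK == 0}

class SparseIndex:
    """Maps sampled hook fingerprints -> ids of stored segments containing them."""
    def __init__(self):
        self.index = {}      # hook -> set of segment ids
        self.segments = {}   # segment id -> full chunk-hash set (on disk in reality)

    def store(self, seg_id, data):
        hs = chunk_hashes(data)
        self.segments[seg_id] = set(hs)
        for h in hooks(hs):                      # only hooks enter the RAM index
            self.index.setdefault(h, set()).add(seg_id)

    def candidates(self, data):
        # Rank stored segments by how many hooks they share with the incoming one;
        # real dedupe would then be performed against only the top few candidates.
        votes = {}
        for h in hooks(chunk_hashes(data)):
            for seg in self.index.get(h, ()):
                votes[seg] = votes.get(seg, 0) + 1
        return sorted(votes, key=votes.get, reverse=True)
```

Because only sampled hooks are indexed, memory stays small; but a segment with no hook overlap is simply not found, which is why the technique leans on the strong chunk locality of repeated backup streams.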



IBM

IBM has taken a hybrid approach to deduplication and, to their credit, has published a detailed document describing their dedupe strategy.  For D2D backup, IBM includes deduplication with the Diligent ProtecTIER backup appliance.  For NAS, the N-Series appliance, OEM’d from NetApp, naturally contains all the dedupe attributes of NetApp, discussed below.  For primary storage, IBM acquired Storwize in 2010 and apparently decided that compression is its preferred route to efficiency.  The midrange V7000 includes Storwize real-time compression but no deduplication.  IBM’s high-end storage arrays (SONAS, XIV, and DS8000) include neither compression nor deduplication.

Dedupe Grade: C



NetApp

NetApp continues to be the only major storage provider offering fully featured deduplication across its entire line of general-purpose storage systems.  All FAS and V-Series arrays share the same deduplication architecture.  Deduplicated data can be moved efficiently between arrays, either by transferring it intact or by automatic re-deduplication after transfer.  Third-party SAN systems, including the ones mentioned above, can be dedupe-enabled with the V-Series Open Storage Controller.  NetApp deduplication operates seamlessly regardless of storage protocol, application, or media type.

Dedupe Grade: A


So, for the big storage incumbents, it appears that the final course has been set with regard to deduplication.  NetApp has it; Dell, EMC, HP, and IBM sort of have it; and HDS doesn’t have it at all.


Next, let’s look at the plethora of emerging storage array startups.  These companies have a tiny combined market share, but of course all are vying to be the next big thing in storage.  Here the story is a little different, as these vendors all offer deduplication (or have plans to include it).  I won’t attempt to rank these companies, since viability can be a fleeting quality with any startup.  Instead, I’ve included a link that best describes each company’s commitment to deduplication.


GreenBytes – VDI-focused storage

Nimbus Data – Flash-based storage

Pure Storage – Flash-based storage

Skyera – Flash-based storage

SolidFire – Flash-based storage

StorSimple – Cloud-integrated storage (recently acquired by Microsoft)

Tegile – Virtual storage

Tintri – VM-aware storage

Violin Memory – Flash-based storage

Whiptail – Flash-based storage


These ten storage array startups all include deduplication in their portfolios.  This should be a wake-up call to any incumbent storage vendor that doesn’t, or that has only limited capabilities.  History tells us that some of these startups will fold, some will be acquired (already happening), and others will rise to the top.  In the next generation of storage technology, deduplication will be requisite, not merely desirable.


In short, the state of deduplication is that it has proven itself to be viable in the data storage industry.  Now it’s up to all data storage vendors to prove that they can deliver it.


Signing Off-



Recently, Nigel Poulton asked me to be a guest on his Technical Deep Dive podcast, to which I graciously agreed (some of you may know this podcast by its old name, Infosmack).  The premise of the podcast was for me to explain some of the technical points of Data ONTAP 8, including why it took so long for us to release it, why clustering matters, and, for that matter, why Data ONTAP matters.  All in all, I thought Nigel and co-host Rick Vanover did a fair job asking questions and assessing my responses - you'll have to listen to the podcast for all the Q's and A's.  I must admit, however, that I was thrown off by a couple of Nigel's questions, in particular 1) Why have you stayed so long at NetApp? and 2) How are decisions made there?  It's not that I couldn't answer the questions; it's just that I couldn't recall anyone ever asking them before.  If the job of a podcast moderator is to dig into previously undisturbed soil, then Nigel and Rick did a great job as moderators.


In that light, I thought I'd talk a little more about my NetApp experience in this blog post.  I joined NetApp six years ago as a product marketing manager (PMM).  The main purpose of a PMM is to work with the product managers (PMs) for the products you are responsible for and figure out how to promote those goods so that people eventually buy them.  The PM mostly works with engineers on things like roadmaps and bug fixes, while the PMM mostly works as the outside voice to the buying community.  I was lucky to work with some really good PMs in those days; they invited me to their engineering meetings and patiently answered my never-ending stream of questions.  This gave me great insight into the inner workings of both the technology and the company itself.


I was also lucky when I was given one of my first assignments - to promote NetApp deduplication, or A-SIS as it was known at the time.  Back in 2006, no one really thought much about NetApp deduplication, including a lot of people working at NetApp.  As I watched the other PMMs go about their business, I found that each had his or her own style, and there seemed to be no recipe for a good PMM at NetApp.  So I decided to create my own style, that of an educator, as a way to convince people that our deduplication was worthy of their attention - mostly by explaining how it works and thus demystifying NetApp deduplication.  My effort started with a couple of whitepapers and some webcasts, which eventually led to this blog and several published articles.  I founded an industry consortium of deduplication vendors at the Storage Networking Industry Association (SNIA) and briefed countless prospective customers, both at NetApp HQ and on the road.


The thing I appreciated most about NetApp at that time was that the people around me kept encouraging me to keep stretching and trying new ways to reach my audience - some of which we knew would work and some we knew wouldn't.  The trouble is, we didn't know which things would fall into which category until we actually tried them.  I am grateful that I could be innovative in my marketing approach and that no one ever said "we don't do things that way here."  I am not sure every company would have given me the freedom that NetApp did, and I think this freedom still exists for every PMM at NetApp.


Once dedupe became an inarguable success, in 2008 I was asked to focus my attention on NetApp Storage Efficiency.  Working with many supportive PMMs and PMs, we developed a list of seven criteria required for true storage efficiency: Snapshots, RAID-DP, SATA Drives, Thin Provisioning, Thin Replication, Virtual Cloning, and of course Deduplication.  Once we had this list, I fell back into my role of educator and produced material to help both internal and external folks understand how these features worked and why they were so important in reducing the storage footprint.  And again I was allowed to extend the traditional boundaries of marketing, which this time included publishing the book "Evolution of the Storage Brain" - I wonder how many companies would have allowed me to undertake something so strenuous that it took me away from many of my normal day-to-day tasks?  I can testify that NetApp offered me some amazing support once again.


Now, in my role of Senior Technologist, I have a new and exciting challenge.  With Data ONTAP 8, NetApp is once again changing the way people think about storage.  An agile data infrastructure that is Intelligent, Immortal, and Infinite is within our grasp.  Based on nine technical criteria, NetApp is the only storage provider with a singular architecture that supports the most diverse set of workloads in IT history.  The ultimate in storage virtualization, Data ONTAP 8 allows you to start small and grow to petabytes without disruption and without constantly requiring new skill sets.  For this effort, I'll again take a non-traditional path in educating people on the value of agility, confident in the knowledge that NetApp will support this direction.


So...Over the next few months I'll be appearing on this blog educating you on each of the nine data agility points, and why they are so important in today's world of data-driven businesses.  I'll also blog on many other topics, such as the important work currently being done at UC-San Diego to develop an Enterprise Data Taxonomy and Data Growth Index.


Stay tuned!





Webster defines an infrastructure as “the underlying foundation or basic framework (as of a system or organization).”   Infrastructures in IT are getting plenty of attention these days, and for good reason.  As data grows inexorably and complexity increases, IT infrastructures are needed to maintain order.  When thinking of IT infrastructures, an analogy can be made with U.S. commercial aviation, a 100-year-old industry that’s seen its share of change.  In the history of commercial aviation, there have been three distinct phases:


1) In the early days of commercial aviation, air routes were spotty and disconnected.  Equipment was fragile and prone to delays from even slightly inclement weather.  Early air travel was also very expensive, reserved for the wealthy.  From the 1920s until the 1960s, the system of airline reservations, ticketing, and air travel itself could hardly be defined as an infrastructure.


2) In the 1960s, things began to change.  Commercial aircraft became more reliable, and navigational aids allowed flights to venture into and around severe weather.  Online ticketing was introduced, but it required specialized equipment - signaling the importance of the travel agent’s computer.  The airlines pursued operational efficiencies to reduce travel costs.  As a result, more people began to fly, and a true infrastructure began to emerge.


3) In 1978, the U.S. government deregulated the commercial aviation industry, heralding a new era of competitiveness among the airlines.  A hub-and-spoke model was designed, allowing for efficiency in routing and further reducing travel costs.  Frequent flyer programs were introduced in 1979 to entice passengers, and web-based reservations arrived in 1999.


Commercial aviation has advanced through these three phases into a sophisticated infrastructure that includes things visible to us, such as up-to-the-minute flight status, automatic boarding, and an interactive user experience.  This infrastructure also operates seamlessly in the background as flight crews, aircraft, and replacement parts are dispatched on demand to accommodate unexpected delays and a constantly shifting timetable.  While we’ve all experienced travel frustrations at various times, it’s remarkable to think that thousands of airline flights are tracked daily in the skies over the U.S., with the vast majority seamlessly flowing from departure to destination.


Interestingly, the evolution of Information Technology can also be broken into three distinct phases that closely parallel those above:


1) IT (or MIS, as it was called in those days) began in the 1960s with small organizations consisting of mainframes, computer rooms, and software programmers.  Using a “build it yourself” model, custom applications were designed focusing on research, accounting, and manufacturing – if you weren’t working in one of those areas, you never saw any data at all.  Equipment outages were common; I can attest that each computer room was equipped with a row of cabinets containing spare boards, motors, relays, power supplies, and other assorted parts that a small army of engineers regularly replaced – I was one of the soldiers in this army!  In those days, infrastructure was one of the last words we would have used to describe the environment.  Mainframe computers had no ability to communicate with one another, and custom-built applications had no provision to easily share data.


2) Then in 1982, along came Sun Microsystems, whose CEO Scott McNealy famously trumpeted “the network is the computer!”  Mainframes became servers, and computer rooms became data centers.  While, yes, the network was the computer, the advent of open client/server computing brought something even more important – standards.  Standards for network protocols, standards for device interfaces, and standardized software in the form of commercial off-the-shelf applications.  An IT infrastructure was emerging!   Major system vendors such as IBM and HP jumped on the open server bandwagon.  Software giants such as Oracle and SAP were born.  Microsoft, never given much of a chance to succeed beyond simple DOS PCs, eventually forced its way into the data center with its NT server technology, dismissing McNealy’s infamous quip that NT stood for “Nice Try.”  Communication between applications, servers, and clients had emerged – the foundation of an IT infrastructure.  Storage devices, however, were at this point simply along for the ride - tethered to individual servers and for the most part seen as “dumb” devices.


3) Storage technologies began a drastic change in the 1990s with the advent of SAN and NAS storage networking, as more and more intelligence was pushed into the storage layer.  Adding to this intelligence was the 2000s IT trend toward “virtualize everything.”  In fact, the 2000s could be called the decade of virtualization, and it signaled IT’s third phase - virtualization techniques were crucial in creating today’s sophisticated IT infrastructures.  Standardizing on TCP/IP Ethernet packets, a virtualized network infrastructure layer was the first to be put into place in IT.  Next, server virtualization crept into the data center – many organizations today are well on their way to becoming 100% virtualized.  The final emerging leg of modern IT infrastructures is the data infrastructure, which allows data to exist as its own intelligent entity in the data center, no longer tethered to any particular server or application.


This new "agile" data infrastructure offers compelling value to IT.  Data sets can exist on any storage array within the infrastructure, or can span multiple arrays.  Workloads and individual data chunks in the infrastructure can glide between performance-designed resources and capacity-designed resources based on need, without the knowledge of users and applications.  Storage resources can be added to or removed from the infrastructure without any downtime.  Policy-based data provisioning and data protection can be enforced without the need for human intervention.  In this era of monumental data growth and associated complexity, the 2010s are destined to become the decade of data agility.


If you’d like to learn more from industry visionaries and storage architects, I’d encourage you to join our June 21st webcast on the NetApp agile data infrastructure. Click here for details of this webcast.


See you online!



Last week I posted a guest blog on the Clouds OnCommand blog site, stressing the importance of enterprise data management.  For those new to NetApp, OnCommand is the umbrella term we use to describe our data management software suite.  Data management has come a long way in the six years I’ve been at NetApp.  Back in 2006 we had FilerView, Operations Manager, Protection Manager, and some SnapManagers - that’s it.  Then in 2008 we acquired Onaro and their SANscreen product, followed in 2010 by Akorri and their BalancePoint product.  These acquisitions, along with the internal development of Provisioning Manager and System Manager, have resulted in a comprehensive suite of storage and data management tools - arguably the most comprehensive ever seen in the industry.  Why did NetApp invest so much money and energy broadening our management portfolio?  I'd say it’s because data management at scale is the biggest problem facing IT today and will continue to be in the near future.  This is something we realized many years ago, and history shows that we took some thought-leading steps to prepare for it.


In my book Evolution of the Storage Brain, I predicted that by the year 2040 the average IT shop would be storing one exabyte of online data.  I based this number on the average annual data growth rate of 35.94% witnessed from 1980 to 2010.  In retrospect, this number seems to have been conservative – most industry pundits are now projecting closer to a 50% annual growth rate.  In any event, we are on the cusp of an era of monumental data growth.  As I also mentioned in my book, storing this magnitude of data will not be the problem – managing it will be.  This brings us back to the subject of automation.
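The arithmetic behind that prediction is simple compound growth.  The sketch below uses an assumed 100 TB baseline for the "average IT shop" in 2010 - my round number for illustration, not a figure from the book:

```python
# capacity after n years = starting capacity * (1 + rate) ** n
rate_historical = 0.3594   # average annual growth observed 1980-2010
rate_pundit = 0.50         # the ~50% annual rate now commonly projected
start_tb = 100             # assumed 2010 baseline in TB (illustrative)
years = 2040 - 2010

projected_tb = start_tb * (1 + rate_historical) ** years
print(f"2040 projection at 35.94%/yr: {projected_tb / 1e6:.2f} EB")  # ~1 EB

# At 50% per year the same shop blows past an exabyte well before 2040:
projected_50_tb = start_tb * (1 + rate_pundit) ** years
```

A 35.94% rate compounds roughly 10,000x over 30 years, which is what carries a 100 TB shop to about an exabyte; at 50% per year the multiplier is closer to 190,000x.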


Automation is becoming a necessity in IT.  Why? The avalanche of files, objects, volumes, LUNs and aggregates is moving beyond the grasp of human comprehension.  How many applications is your organization managing now?  What is the capacity growth of each one?  How many new apps will you be supporting next year? What is the backup schedule for each?  Which are being replicated?  How many of your storage arrays will be retired this year?  When can you schedule downtime to migrate the apps onto new arrays?  This is just a small sampling of the questions plaguing storage architects every day.


Now, imagine an agile data infrastructure that automatically provisions storage capacity, establishes all backup and replication connections and policies, and allows you to transparently move applications between arrays when maintenance, upgrades, or replacements are needed.  An infrastructure that consists of a virtual pool of storage that can grow from 10TB to 50PB – managed as a single entity.   An infrastructure that monitors performance bottlenecks, capacity growth, and out-of-policy conditions - notifying you of the condition with a recommended corrective action.  This infrastructure excels at something that humans don't do very well - repeating mundane tasks over and over, while leaving us with something we are quite good at - solving problems.


The era of monumental data growth should not necessitate a legion of storage administrators.  The intelligence of Data ONTAP, evolved over 20 years, and the automated management capabilities of OnCommand, built through six years of investment, form a potent combination to help manage data at never-before-seen scale.  For a first-hand view of how data agility is helping IT today, be sure to join our June 21 webcast, where I’ll be interviewing several storage architects about the challenges they face - and taking your questions live via chat.


See you on the 21st!  Might be the best hour you'll spend all year…at work anyway.



