Nearly twenty years ago, NetApp created a ground-breaking family of storage server technology which affectionately became known as “filers”. While the term Hadooplers doesn’t quite roll off the tongue as smoothly as filers, it does embody the quirky sense of humor within the Hadoop community while representing an important element of maturing Big Data (specifically Big Analytics) infrastructure.
The Evolution of Big Data
Big Data started as a disruptive economic and technical force that enabled product and service development teams to overcome the constraints of legacy compute and storage. By exploiting a growing universe of privately generated data and publicly available data from external sources, these teams have ushered in a new era of business models that is generating substantial new revenue streams and competitive advantage for Big Data customers.
The year 2011 represents an important inflection point: IT departments worldwide are beginning to inherit many successful Hadoop pilot projects conceived within the lines of business they support. In the first phase, those lines of business valued infrastructure flexibility above all, in response to the experimental nature of initial Big Data projects. In the second, project start-up costs and scalability placed a careful focus on CapEx. The final, inevitable phase brings what I often refer to as “the burden of success”, where OpEx rises to the fore. Dave Hitz charmingly calls this the “arrival of the green visor people”. Other industry luminaries such as Werner Vogels and James Hamilton confirm the importance of OpEx by emphasizing data center operating costs (i.e. power and cooling) as the dominant factors in the successful deployment of scalable infrastructure.
Division of Labor
Historians and economists will attest that every successful society or industry powered past early obstacles by dividing Big Problems into their associated sub-tasks and conquering each in turn. Big Data itself is composed of Big Content and Big Analytics.
NetApp views Big Content as fundamentally a storage function and therefore an opportunity to directly add value to the task of high-performance, high-scale content creation and ingest, pre- and post-processing, and finally distribution and archive.
Big Analytics on the other hand is fundamentally a distributed (parallel) processing function with a vibrant (host-based) ecosystem of mature and emerging value-add players. NetApp’s highly successful business model is predicated on strong alliances with these best-of-breed solution partners.
The Hadoop technology stack is no different. Early installations of Hadoop are almost universally monolithic in nature, meaning the Java and Linux host layers are entirely responsible for both the compute and storage components. Over time, imbalances between compute-centric and storage-centric workloads create islands of underutilized assets in each area. Moreover, because disk drives are mechanical, they tend to dominate the root causes of Hadoop hardware failure, resulting in ever-increasing resilvering of object replicas during cluster recovery. These effects yield unpredictable and sometimes unacceptably low service levels for Hadoop clusters at the same time as those clusters are supporting an increasing number of business- and mission-critical jobs.
As a result, NetApp research has identified numerous Hadoop workloads which will benefit from cost-effective, high-performance offloading of storage processing and management to a dedicated layer. Enter the new NetApp E-Series family of server-attached storage arrays. This new Hadoop Storage Solution from NetApp will help you stand up a Hadoop cluster in hours versus weeks and scale your cluster simply and predictably. The NetApp E-Series Hadoop Storage Solution will also help you control "Hadoop cluster sprawl" with Enterprise-class management tools to provision infrastructure, monitor jobs and SLAs, and optimize utilization of Hadoop infrastructure.
Introducing Shared DAS
Shared DAS addresses the inevitable storage capacity growth requirements of Hadoop nodes in a cluster by placing disks in an external shelf shared by multiple directly attached hosts (aka Hadoop compute nodes). The connectivity from host to disk can be SATA, SAS, SCSI or even Ethernet, but always in a direct rather than networked storage configuration. Therefore Shared DAS does not use a storage switch.
Most Hadoop deployments start with cluster nodes that each consist of a rack-mount server with internal disk. This regular DAS node configuration minimizes up-front cost but sacrifices long-term SLAs and total cost of ownership because it requires multiple data copies for high availability. Hadoop TCO also suffers from the unpredictable service levels of large Hadoop clusters during the frequent background replication tasks needed to recover from multiple disk failures. The three dimensions of Shared DAS benefit are therefore:
- NetApp E-Series Shared DAS solutions can dramatically reduce the number of background replication tasks by employing highly efficient RAID configurations to offload post-disk-failure reconstruction tasks from the Hadoop cluster compute nodes and cluster network,
- When compared against the single-disk I/O configuration of regular Hadoop nodes, NetApp E-Series Shared DAS enables significantly higher disk I/O bandwidth at lower latency due to wide striping within the shelf, and finally,
- NetApp E-Series Shared DAS improves storage efficiency by reducing the number of object replicas within a rack using low-overhead, high-performance RAID. Fewer replicas mean fewer disks to buy or more objects stored within the same infrastructure.
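The storage-efficiency point can be made concrete with some back-of-the-envelope arithmetic. The sketch below is illustrative only: the three-copy figure is the stock HDFS default replication factor, while the two-copy count and the 6+1 RAID 5 group for the Shared DAS case are assumptions chosen for the example, not a published NetApp E-Series configuration.

```python
def raw_tb_per_usable_tb(replicas: int, data_disks: int, parity_disks: int) -> float:
    """Raw capacity that must be purchased to store 1 TB of user data.

    replicas:      number of full copies the cluster keeps of each object
    data_disks:    data disks per RAID group (use 1 with parity 0 for no RAID)
    parity_disks:  parity disks per RAID group
    """
    raid_overhead = (data_disks + parity_disks) / data_disks
    return replicas * raid_overhead


# Regular DAS node: three copies (the HDFS default), no RAID underneath.
plain_das = raw_tb_per_usable_tb(replicas=3, data_disks=1, parity_disks=0)

# Shared DAS, hypothetical layout: two copies on RAID 5 groups of 6 data + 1 parity.
shared_das = raw_tb_per_usable_tb(replicas=2, data_disks=6, parity_disks=1)

# Fraction of raw capacity saved by the hypothetical Shared DAS layout.
savings = 1 - shared_das / plain_das

print(f"Regular DAS: {plain_das:.2f} raw TB per usable TB")
print(f"Shared DAS:  {shared_das:.2f} raw TB per usable TB")
print(f"Raw capacity saved: {savings:.0%}")
```

Under these assumed numbers the regular DAS layout needs 3.00 raw TB per usable TB against roughly 2.33 for Shared DAS, a saving of about 22 percent. Different replica counts or RAID geometries will shift the result, so treat the function as a calculator rather than a sizing guide.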
Cloud & Big-Data Proven
Prior to being known as NetApp E-Series, the Engenio division of LSI sold into many of the industry’s premier commodity Public Cloud and Big Data installations via popular OEM partners such as Dell, IBM, Oracle, Rackable/SGI & Teradata.
The new NetApp E-Series solution will continue to be sold by these OEM partners as part of their respective commodity Cloud & Big Data solutions. NetApp itself will focus on targeted Big Data opportunities via specific integrated bundles of software and hardware. Storage for Enterprise-scale Hadoop deployments is the first of many NetApp E-Series solution bundles for Big Analytics. Later this year we also expect to announce Internet-scale Hadoop storage configurations based on what we have learned from some of the largest Hadoop cluster deployments in the world.
Full-motion video ingest and processing is another solution category we’ve announced today for Big Content. For proper context it's also worth noting the existing NetApp StorageGRID solution continues to optimize Big Content archiving.
Avoiding Hadoop Hubris
Some hardware providers tend to create their own proprietary distributions of Hadoop in order to enter the market. This is highly reminiscent of the confused vendor strategy in the Cloud marketplace over 2 years ago, when some storage suppliers thought they could double-dip: selling to Cloud Infrastructure and Storage service providers while also competing against them with services of their own.
NetApp’s noteworthy consistency (see #1) regarding our Cloud Strategy serves as a model for our industry as well as our own Hadoop strategy. We will not compete against our best Hadoop alliance partners for service or support revenue and will continue our partner-centric approach to adding value.
(Ironically, after a 2-year NetApp head start, it’s rumored that EMC will finally be following NetApp's lead by announcing their own Cloud Service Provider partner program at EMC World this week. One therefore has to wonder how long it will take them to abandon their proprietary distribution of Hadoop in the same manner they abandoned their proprietary Cloud Services.)
NetApp is committed to the open Apache Distribution of Hadoop which we believe will serve as a long-term unifying force in the Hadoop community and the foundation for durable future innovation in the Big Data ecosystem.
The NetApp Big Data Conversation
We’re going to harness all the obvious energy and excitement around NetApp solutions for Big Data via various on-line communities such as YouTube, Facebook, Twitter and Blogs. I invite you to join us at these linked sites as well as below via comments to share your thoughts about #BigData and what you’d like to see from NetApp in this thriving market!