Big Data Bingo in NetApp Blogs


See my post on the NetApp 360 blog here:


NetApp Extends Software-Defined Storage for a More Responsive IT Organization

These days, when speaking with customers about scalable storage infrastructure, the conversation often boils down to three aspects of scalability: capacity, performance, and operational scale. You can read more about them in Mike McNamara’s blog:


Today, I want to focus on one aspect of scale-out – capacity. Clustered Data ONTAP enables customers to scale to 69PB of capacity in a single cluster. With Infinite Volume – which we first introduced in 2012 with Data ONTAP 8.1.1 – you can now scale to 20PB in a single volume. So Infinite Volume gives you a big, 20PB bucket to store data, along with the operational scale and performance you need for large content repositories. The unique thing, however, is that you don’t lose key functionality that enterprise IT departments expect. Essentially, Infinite Volume works just like any other volume from an end-user or application perspective, and in many ways from an administration perspective as well.


Let’s take a closer look:


Administration – you create an Infinite Volume just like any other volume. Typically you would use a GUI wizard, but you can use the command line, too.


Efficiency – deduplication and compression are supported. You can even steer data via policies into separate storage classes that compress or deduplicate data, so you get to choose how data should be handled. Or let it all go into a single repository – your choice.


Scalability – well, 20PB of capacity and 2 billion files, all with only 10 nodes. So you don’t need dozens (or 144) of nodes to reach this capacity, which saves you a lot of configuration and equipment headaches. This gives you a highly capable scale-out NAS solution, with support for NFS, pNFS and SMB/CIFS for data access.


Availability – the 99.999% uptime you get from Data ONTAP, even while you expand or manage an Infinite Volume or perform software updates. High-performance snapshots and replication via SnapMirror are also supported to increase data availability.


Multi-workloads – you are not limited to having Infinite Volume for large-scale content repositories in a cluster. You can easily add other workloads such as virtualized servers and desktops, enterprise applications and many other items while the system is running. Infinite Volume supports NFS (including pNFS) and SMB/CIFS for data access, but you can add other protocols like FC, iSCSI, FCoE etc. into the cluster for other applications. Great for large organizations and – with secure multi-tenancy supported – service providers that want to share some of that storage infrastructure with various customers.


In summary, with Infinite Volume you have the ability to easily create a scale-out NAS content repository for up to 20PB of data, while retaining efficiency, availability and many other aspects of clustered Data ONTAP – without excluding other workloads.


So don’t create another silo – leverage clustered Data ONTAP for all your workloads, even those where in the past you might have chosen another, often inferior, scale-out NAS solution.

We are a society that lives for choices. In fact, I believe we thrive on choice. I recently read that Chipotle has 65,000 menu combinations! Honestly, that’s about 3,000 too many options for me personally, but by any measure, that’s a lot of choices. By definition the NetApp Cisco FlexPod solution line is all about choice – hence the “Flex” in FlexPod. We are expanding that “Flex” to include Hadoop.


So, what are we announcing?


FlexPod Select Family. The FlexPod Select solutions are designed for dedicated workloads like Big Data, high performance computing (HPC), video services (production, rendering), oil and gas applications and more. The first of these dedicated workloads that FlexPod addresses is Big Data with our Hadoop solutions.


FlexPod Select with Hadoop - Validated for Cloudera and Hortonworks Distributions


Hadoop is a powerful and essential tool that has extended business analytics to handle big data. However, there are challenges with Hadoop, as there are with any new technology.


Some of these are:


  • It can take significant time to become productive with Hadoop
  • It is not easy to use or operate with existing staff or skill sets
  • Hadoop NameNode reliability: if the NameNode fails, the Hadoop cluster goes down, so addressing this single point of failure is important
  • Standard Hadoop requires three copies of the data, which consumes additional storage capacity and reduces cluster throughput and performance
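The triple-replication overhead called out above is governed by HDFS’s `dfs.replication` setting. As a sketch, a typical `hdfs-site.xml` entry looks like this (3 is Hadoop’s stock default; deployments whose storage layer already provides hardware protection may choose a lower value to reclaim capacity):

```xml
<!-- hdfs-site.xml: number of copies HDFS keeps of every block -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```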


Realizing the benefits of Hadoop – and of analytics in general – requires an underlying infrastructure of storage, servers, and networking that can handle big data. That infrastructure needs to:


  • Provide a reliable and robust foundation for analytical solutions and platforms
  • Support high-performance Hadoop clusters
  • Be built on an open, partner-based ecosystem to reduce the risk of adopting emerging or new tools
  • Store Hadoop cluster data efficiently
  • Scale compute and storage independently and quickly as data grows in volume and ages


Our FlexPod Select with Hadoop configurations include:


  • Cisco UCS C-Series servers
  • High-density, high-performance NetApp E-Series storage for the datastore
  • Direct-attached storage connectivity: the E-Series arrays connect to the Cisco UCS C-Series servers over SAS
  • SAS as the data protocol, providing interoperability with existing systems
  • Targeted at data-intensive workloads (think customer patterns, video streams, fraud detection) that require Hadoop
  • Highly reliable NetApp FAS storage to protect the NameNode store (not the datastore)
  • Either Cloudera Enterprise or Hortonworks Data Platform as the Hadoop distribution


And businesses or departments get:



A little more on our E-Series storage systems and why they’re great for Big Data


  • Connectivity with Fibre Channel, InfiniBand, SAS, iSCSI interfaces
  • Up to 384 SAS drives, 1.44PB
  • Up to 6,000MBps of sustained throughput


From a storage perspective, the E5460 is used in FlexPod Select with Hadoop. Outstanding bandwidth performance is a critical driver for creating bigger, faster solutions. The extreme density of the E-Series 60-drive enclosure saves floor space and helps lower operational costs for capacity-intensive environments. The E5400’s modular flexibility enables custom configurations that can be tailored for any requirement. Add to this bulletproof reliability, availability, and serviceability designed to ensure continuous, high-speed data delivery and you’ve got a Big Data platform that is ready to go.

Does Hadoop need a veteran technology presence, like SQL, to "make it" in the enterprise? Darned if I know. But the folks on our upcoming panel just might. As part of NetApp's partnership with HiveData, we are very proud to host another meetup on the NetApp campus, August 7th at 6pm. This meetup will focus on the Hadoop & SQL history, opportunity, positioning and landscape. All details are below, and the link to registration is here. This will sell out, so don't wait to register.


Also, HiveData's media partner, O'Reilly Strata will be giving away books and a ticket for the NY Strata Conference! Make sure to bring your business cards for the raffle. If you don't win the raffle, you can save 20% for the conference with the discount code HIVE.


Ping me on Twitter @thebillp with any questions.



Hadoop & SQL


  •    Wednesday, August 7, 2013

    6:00 PM to 9:00 PM 

  • NetApp, Building 3

    439 E. Java Drive, Sunnyvale, CA  (map)


    Hadoop's usefulness was extended beyond programming in the MapReduce paradigm early on with initiatives such as Hive and Pig. These tools have enabled a much larger and varied set of practitioners on the Hadoop stack, many of whom are demanding SQL as the language of choice. Within enterprises, SQL is a prerequisite for integration with enterprise frameworks, but even in internet companies, where Hadoop first made its mark, most MapReduce jobs are generated by translation from Hive or Pig. This same force is pushing SQL to be more interactive, real-time, and act more and more like a traditional RDBMS.



    The first generation of SQL implementations on Hadoop relied on translating SQL queries into MapReduce. However, it is now recognized that this approach is not adequate for achieving good performance, especially as we try to address demands for interactive and real-time queries. A number of vendors are in the fray, and are making different tradeoffs. What does the emerging SQL-on-Hadoop landscape look like? How will it be positioned relative to traditional SQL and BI tools?
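To make the translation concrete, here is a minimal, self-contained sketch (in plain Python, not actual Hadoop) of the map and reduce steps that a first-generation engine would compile a simple Hive aggregation – something like `SELECT word, COUNT(*) ... GROUP BY word` – down to:

```python
# Pure-Python simulation of the MapReduce paradigm that Hive and Pig
# abstract away. Not Hadoop code -- just the shape of the computation.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle: sort by key so equal keys are adjacent.
    pairs = sorted(pairs, key=itemgetter(0))
    # Reduce phase: sum the counts for each key.
    return {key: sum(count for _, count in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

docs = ["big data big insights", "data drives insights"]
pairs = [kv for line in docs for kv in mapper(line)]
counts = reduce_phase(pairs)
print(counts)  # -> {'big': 2, 'data': 2, 'drives': 1, 'insights': 2}
```

Every query paying the cost of this batch-oriented sort-and-aggregate cycle is exactly why the newer engines discussed by the panel bypass MapReduce for interactive workloads.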

    Meet the experts

    Justin Erickson, Director of Product Management, Cloudera

    Justin Erickson is the director of product management at Cloudera for Hadoop storage (HDFS and HBase), query (Cloudera Impala and Hive), and performance. Prior to Cloudera he worked at Microsoft, where he led the development of the new high availability and disaster recovery solution for Microsoft SQL Server 2012. Justin is a graduate of Stanford University, where he earned a B.S. with honors in Computer Science.

    Tomer Shiran, Vice President of Product Management, MapR

    Tomer Shiran heads the product management team at MapR, where he is responsible for product strategy, roadmap and requirements, and is a PMC member and committer for the open source Apache Drill project. Prior to MapR, Tomer held numerous product management and engineering roles at Microsoft, most recently as the product manager for Microsoft Internet Security & Acceleration Server (now Microsoft Forefront). He is the founder of two websites that have served tens of millions of users, and received coverage in prestigious publications such as The New York Times, USA Today and The Times of London. Tomer is also the author of a 900-page programming book. He holds an MS in Computer Engineering from Carnegie Mellon University and a BS in Computer Science from Technion - Israel Institute of Technology.


    Alan Gates, co-founder, Hortonworks

    Alan is a co-founder of Hortonworks and an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. Alan also designed HCatalog and guided its adoption as an Apache Incubator project. Alan has a BS in Mathematics from Oregon State University and an MA in Theology from Fuller Theological Seminary. He is also the author of Programming Pig, a book from O’Reilly Press.


    Gavin Sherry, Director of Engineering, Pivotal

    Gavin Sherry is Chief Strategist, Pivotal Data Fabric. In this role, he leads the technology roadmap, architecture and R&D for Pivotal's data processing technologies, including HAWQ, Pivotal's industry-leading SQL engine for Hadoop. Before Pivotal, Gavin contributed significantly to the state of the art of database technology and to open source projects such as PostgreSQL.


    Priyank Patel, Director, Product Management, Teradata Aster

    Priyank Patel is Director, Product Management at Teradata Aster. In this role, he is responsible for product management of Teradata’s Hadoop portfolio of products. Priyank joined Aster as its third employee, helping build out the core product on the engineering team. Since then he has held various roles in engineering management and field engineering, and is currently responsible for product management of Aster’s SQL-MapReduce framework and the analytical libraries built on it. Before joining Aster, Priyank held engineering roles at Microsoft Corporation, where he worked on the Windows OS. Priyank holds a master’s degree in Computer Science from Stanford University and a bachelor’s in Computer Engineering from Gujarat University.


    Raghu Ramakrishnan, CTO of Information Services, Microsoft (moderator)

    Raghu Ramakrishnan is a Technical Fellow in the Server and Tools Business (STB) at Microsoft Corp. He focuses his work on big data and integration between STB’s cloud offerings and the Online Services Division’s platform assets. He has more than 15 years of experience in the fields of database systems, data mining, search and cloud computing.



    6:00-7:00pm: Registration and pre-party (with food & beverages)
      7:00-7:10pm: Introduction
      7:10-8:10pm: Panel discussion
      8:10-8:30pm: Q&A session
      8:30-9:00pm: Networking session

    Cannot attend in person? Join the online event here.




Re-post from NetApp 360 blog


For years we mobile workers have struggled with getting access to “our” corporate data, always hoping that IT departments would figure out a way to make our lives easier – or at least trying not to get caught using our favorite tools (I’m looking at you, Dropbox). You know who you are: you spend more time in airports, meeting rooms and hotels than at home. When you walk into a bar in a Silicon Valley hotel, the bartender pours your favorite beer before you’ve ordered, knows you by name, and can name your favorite vacation destination.


Today, with its sights clearly set on us mobile users, NetApp announced a new product that enables secure, mobile access to corporate data: NetApp Connect.


NetApp Connect essentially consists of two components.


First – a server component that runs in your organization’s data center (on-premises) and takes care of authentication (integrated with your existing infrastructure, so no new passwords to remember), streaming of content (like office docs and intranet applications), and policy enforcement (what you are allowed to access, and how).


Second – a client app that runs on your mobile device and provides the sleek, native app experience we mobile users have come to expect.


So what do you do with NetApp Connect? I use it to access all of my corporate content from my iPad mini and iPhone. Everything I need sits in my home directory, on file shares or SharePoint sites, or behind intranet applications (such as intranet pages, CRM and ERP tools, etc.). These are all supported data sources in NetApp Connect. I tap on a file to open it and I can review/present/annotate/comment as I please (as long as I have the right permissions), or use the built-in browser to get to my intranet sites without needing clunky VPN access.


A typical use case for me is presenting slides on the road. I now only bring my iPad and a display adapter to plug it into a projector. NetApp Connect provides pixel-perfect rendering, which means that I can easily present my slides without the awkward mis-renders that you usually get from native viewers (you know what I am talking about – diagrams all messed up, wrong fonts, bullets using Cyrillic for some reason and the like). With pixel-perfect rendering everything looks exactly right, even when I zoom into a spreadsheet or Word document.


The best part is that all of “my” content is always stored and brought back to where it should sit – storage systems inside the corporate data center. I don’t have to worry about outdated data or running backups. I can have offline copies when needed (don’t you wish every plane had WiFi?), but everything is brought back to the corporate data center when I get back online.


NetApp Connect combines the ease-of-use of your typical mobile experience with the control, data protection and compliance enforcement desired by IT.


More details on NetApp Connect here:


And watch the product in action here:


Safe travels!




Follow me on Twitter: @ingofuchs