Kamesh Raghavendra is an Advanced Product Manager working in my team for the Chief Technology Office at NetApp. He covers New Application trends including Big Data and NoSQL/NewSQL. After our big Agile Data Center launch last week, he decided to put some of his work into this new context.
Can you match Extreme Transactional Scale with Infrastructure Agility?
The advent of cloud computing and the ubiquity of mobile platforms have caused a paradigm shift in the scale of enterprise business operations, not only in web services (messaging, gaming, social, and so on) but also in the retail, financial services, media, telecom, cloud service provider, public sector, healthcare, and utilities verticals. To provide a competitive quality of service to their customers, these enterprises must sustain unprecedented demands for performance, availability, and agility while accommodating fast-growing, global-scale operations. This has led to the genesis of a new breed of super-agile applications that service transactions at this scale by breaking out of the limits of relational data models. They are called NoSQL (Not Only SQL) applications.
Although relational models provide rich query capabilities and powerful normalization of data, their consistency characteristics are too rigid to allow the read/write availability these workloads demand (a consequence of the CAP theorem). As business operations get hammered by Internet scale and multi-geo reach, latency SLAs get squeezed dramatically, leading to extreme demands for availability while maintaining only the bare minimum level of consistency. This scale is also growing at a tremendous rate, forcing vendors to switch to scaled-out NoSQL applications, where one can simply add more instances or nodes and scale instantly without impacting uptime or performance, for extreme agility.
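The availability-versus-consistency tradeoff above is often made tunable through replica quorums. As a minimal sketch (the function name and the N/R/W framing are illustrative, not from any specific NoSQL product), a system with N replicas is strongly consistent only when its read quorum R and write quorum W overlap, i.e. R + W > N; smaller quorums buy availability and latency at the cost of possibly stale reads:

```python
def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """With N replicas, a read quorum R and a write quorum W are
    guaranteed to overlap (so every read sees the latest write)
    only when R + W > N. Anything less trades consistency for
    availability and latency."""
    return r + w > n

# N=3 replicas: R=2, W=2 quorums overlap; R=1, W=1 do not.
print(is_strongly_consistent(3, 2, 2))  # True  (consistent)
print(is_strongly_consistent(3, 1, 1))  # False (eventually consistent)
```

Dropping to R=1 is what gives these applications their low read latency; the versioning frameworks discussed below exist to clean up the resulting staleness.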
The multi-DC/multi-geo requirements preclude the use of traditional file systems (which replicate at the volume level) and favor HTTP/RESTful-interfaced systems that scale at key-value pair or object granularity.
This new species of applications is thus getting neatly wedged between RDBMSs and traditional file systems.
The key mantras of this species of applications are:
- Replicate transactions across the wire to contain the fault domain of individual server failures
- Cache as much data in memory as possible to improve read and write latency (cross-wire replication avoids the need to persist to disk immediately). Some applications go as far as storing all data in page files mapped to disk, so they need only a virtual memory store manager rather than a complete traditional file service
- Keep the node level capacity low (<2 TB) to contain the drain on the network during rebuilds
- Use versioning frameworks to achieve tunable/eventual consistency (some applications use quorum, while some use vector clocks for parallel version tracks)
- Extremely simple scaling out: you can just introduce a new node into the cluster, and the application will automatically rebalance.
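The vector clocks mentioned in the versioning bullet above can be sketched in a few lines. This is the standard compare/merge algorithm; the dict-based representation and function names are my own illustration, not any particular product's API:

```python
def vc_increment(clock: dict, node: str) -> dict:
    """Bump this node's counter after a local write (returns a new clock)."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def vc_merge(a: dict, b: dict) -> dict:
    """Element-wise maximum: used when reconciling replica versions."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def vc_compare(a: dict, b: dict) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent'. 'Concurrent'
    means parallel version tracks: a conflict the application (or a
    quorum policy) must resolve."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"

# Two replicas accept writes independently -> concurrent versions.
v1 = vc_increment({}, "node-A")
v2 = vc_increment({}, "node-B")
print(vc_compare(v1, v2))  # concurrent
```

This is why such systems can stay available during partitions: writes proceed on any replica, and causality is sorted out afterwards rather than blocked up front.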
Where does Hadoop fit in?
These applications are very different from Hadoop, which is only remotely connected to this world:
- Hadoop is relevant in DSS/DW workloads involving batch processing of unstructured data faster than relationally modeled ETL processes can manage, whereas NoSQL/in-memory/K-V stores are relevant in tier-2 business processing/OLTP workloads that involve unprecedented scale and multi-geo conditions with very stringent per-transaction latency SLAs. In other words, these applications play in the IOPS tier (and hence replace traditional RDBMSs), while Hadoop replaces traditional DW/ETL processes.
- Hadoop involves compute-intensive, MapReduce-friendly processes, whereas these applications target latency-sensitive, OLTP-style queries and transactions. Hadoop's ingest performance is in no way acceptable here.
The role of Intelligent Storage
However, these applications pose new problems to customers and hence opportunities for IT vendors like NetApp to provide value differentiation:
- The cross-wire replication of every transaction cannot be sustained for long, even with 10GbE networks: the growth in transaction scale far outpaces the growth in network bandwidth
- Failure rebuilds drain the network badly; the only remedy is to over-provision the number of nodes (to contain the capacity per node). The average CPU utilization I have seen is around 5%, so there is great demand to run on fewer, fatter nodes.
- These applications create a volume on every node (as they are mostly deployed on internal HDDs), and it is a nightmare to perform DR/backup across large clusters. Recoverability is a very tough problem for customers (as opposed to dealing with individual server failures, which is easy).
- Memory provisioning is another issue, as latency performance is a function of the total memory in the cluster (some customers cite "DGM", or "disk greater than memory", as a concern). I have seen in-production clusters with 10 TB of memory provisioned for a 50 TB working set.
- Their replication strategy is brute force, so there is no snapshot-like functionality for test/dev clusters.
These applications can thus be bundled with best of breed external storage in smart ways to bring the following value differentiation:
- Use faster and more reliable disks to reduce the amount of replication and enable rebuilds that require no network I/O
- Use host side presence to pass hints to the external storage for meaningful data management/check-pointing
- Simplify DR/backup by condensing the number of data storage volumes with external storage
- Reduce the amount of memory provisioned for a given performance SLA, and hence reduce the number of CPUs provisioned (yielding higher CPU utilization)
- De-link object/K-V management from storage management (through the host-side framework), opening up multiple storage architecture options beyond internal HDDs, including NFS/CIFS.
As smartphones continue to proliferate and more business operations leapfrog to web scale, outpacing even Moore's law, IT infrastructures need to match the agility of newly evolved application paradigms. At NetApp, we have a strong track record of providing our customers the most flexible data center solutions in the industry. We continue to work with our enterprise customers to build agile data centers that empower their competitive edge through this burst of web-scale operations.