
Part of: Learning NetApp - A storage admin's guide


What is the difference between an outage and a disaster?


'An outage is blamed on the networking team and a disaster is blamed on me.' says Brendon.


While not that funny a joke, it does describe a typical scenario within a large IT team: an event happens, nobody is sure exactly what has happened, but 'something' has to be behind this service interruption and the incident update must be emailed out ASAP.  In my experience, whatever is labelled as 'at fault' during the event will be remembered in the wider 'business memory' as the actual cause, despite whatever is later shown to be true.  It used to be said: 'What is a network?  A method of synchronising tea breaks throughout the business.'  And the engineers in the network team had to take the blame for every 'unknown' issue.


Today, shared storage has become the fall-guy for any unknown problem within the IT infrastructure, especially performance issues.  This is because the systems are complex and storage admins may not be ready to fight their corner when the crash teams get together.  The key tool in your armoury is information, and this comes directly from the monitoring, alerting and reporting systems.


Possible solutions


Meeting of doom

As storage admins, we will be expected to say what exactly is in our environment, how it links together and what the current status is.  We are paid to react to problems and even predict them!


'No one can predict the future'




When you look at the graph, at what point do you think someone should have identified the trend and raised the alarm?  Imagine what the boss thinks when he sees this graph in mid-November...


As this is a guide to NetApp storage, I will continue this blog using NetApp's OnCommand 'Operations Manager' tool.  The main NetApp documentation is here, covering versions 3 and 5 (DFM).






Step 1



Tune the system!  Every false alarm establishes the trend of ignoring the system, and this is exactly how the storage team in the diagram managed to get to November without fixing the issue.  The default status on the dashboard for every day should be GREEN!  Sure, a system can degrade on Wednesday and the business will not let you fix it until the weekend (so it will be broken), but make the effort to manage the alert dashboard.  This way, people will react to alarms and not ignore them.  For example, does the Operations Manager status diagram mean that the Exchange, Oracle and VMware systems are all on the way out?  Or does it mean that they've been like that every day for the last 10 months, so don't worry about it...?



A Microsoft MOM engineer once advised me to fix the underlying faults which were constantly alerting - genius!  If a fault is identified on a NetApp storage system by the NetApp monitoring solution, maybe you should sort it out...


However, there will be good reasons to go your own way.  For example, if the filer is used for archiving, capacity is far more important than performance - so why not fill up your aggregates to 98% if the data is at rest?  Tuning can be done at a global or individual item level via the Operations Manager interface or the CLI on the DFM server (refer to the manuals for how to do this).  But remember that global settings apply to every system monitored - not such an issue in small estates, but this will need to be managed and agreed with the wider teams as the scale increases.
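As a rough sketch of the CLI route, something like the following adjusts the global aggregate thresholds for the archiving scenario above.  The exact option names vary between Operations Manager / DFM releases, so confirm them against `dfm options list` on your own server before trusting this:

```shell
# Sketch only - option names differ between DFM releases;
# check what your server actually exposes first.
dfm options list | grep -i threshold

# Relax the aggregate-full thresholds globally.
# Remember: this affects EVERY monitored system, not just the archive filer.
dfm options set aggrNearlyFullThreshold=95
dfm options set aggrFullThreshold=98
```

For a single archive filer in a larger estate, a per-object threshold override is the safer choice than a global change - agree it with the wider teams either way.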


Step 2

Develop the data held in the DFM database so that it becomes useful information in terms of reporting.  There is no future in just responding to alarms - the real 'storage administrator' money is in being able to say what is required, i.e. capacity planning.  Once it was enough to simply say how many GB of free space would be required, but these days, quality storage admins will be able to say how 'hot' the systems in their care are running.  This information can be used to balance workloads, change schedules, remove hot spots and maximise throughput.  CPU is an interesting metric to monitor on HA pairs: if both filers have an average CPU busy of 60%, what happens in the event of a cf takeover?  I hope the CPU has a magic 120% mode...
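The takeover arithmetic is simple enough to sketch.  This is a deliberately naive model (it assumes the partner's workload moves across unchanged and ignores takeover overheads, so treat the result as optimistic), but it makes the 120% problem concrete:

```python
def takeover_headroom(cpu_a, cpu_b, ceiling=100.0):
    """Estimate the surviving node's CPU load after a cf takeover.

    cpu_a, cpu_b: average CPU busy (%) of each node in the HA pair.
    Naive sketch: assumes the partner's workload transfers unchanged.
    Returns (combined_load, headroom) in percent.
    """
    combined = cpu_a + cpu_b
    return combined, ceiling - combined

# Two nodes each averaging 60% busy: the survivor would need 120% CPU.
load, headroom = takeover_headroom(60, 60)
print(load, headroom)   # 120 -20.0 -> the pair cannot survive a takeover
```

Anything that leaves headroom below zero is the risk described above: the pair runs fine day to day, but a takeover takes the service down with it.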


I identified this risk to one business, and they created DR plans to reduce the load on storage in the event of a problem.  Meanwhile, my next boss went and used the information to support the case for a filer head upgrade.  Both are valid solutions to the problem - however, the key is to identify the problem before it happens and allow managers to manage.


'Lies, damned lies, and statistics'

'Drain the swamps and don't waste your time killing gators' was a phrase my first IT manager was very fond of saying.  Once the background noise from alerts and repeat offenders has been sorted out, start looking to establish trends. My favourites are:


  • Growth rate of data on disk
  • Latency of data access times
  • Processor utilisation
  • Average 'data' disk busy for each aggregate
  • Faults


Calculating the trend is simple with OnCommand.  However, the trick is understanding what the numbers mean!  For example, if the volume growth has increased above the normal trend, does this mean the business has opened a new office?  Or that an orphaned snapshot hasn't been removed, and that's why the volume is full?  Simple fix or spending cash?  Identify the issue, understand what's causing it, and remove the problems one at a time.  Why was the storage requirement for the new office not identified?  Which system failed so that an orphaned snapshot was created?  These are the real questions that need to be asked here, and the rewards are well worth the effort.  I was once able to finish work on the 23rd Dec, not have a single call-out for storage, and everything was still OK when I returned to work on the 3rd Jan.  And this was an environment with 16 filers and almost a PB of storage.  Oh happy days.
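The 'above the normal trend' test can be automated as a first filter.  This is a crude stand-in for the judgement call described above (the function and its threshold are my own illustration, not an OnCommand feature) - it flags any month whose growth jumps well past the established average, so a human can then ask the new-office-or-orphaned-snapshot question:

```python
def flag_growth_spikes(monthly_gb, factor=2.0):
    """Flag months whose growth jumps above the established trend.

    monthly_gb: list of month-end volume sizes in GB.
    A month is flagged when its growth exceeds `factor` times the
    average growth of all preceding months.
    Returns the (1-based) indices of suspicious months.
    """
    flags = []
    for i in range(2, len(monthly_gb)):
        prior = [monthly_gb[j + 1] - monthly_gb[j] for j in range(i - 1)]
        baseline = sum(prior) / len(prior)
        growth = monthly_gb[i] - monthly_gb[i - 1]
        if baseline > 0 and growth > factor * baseline:
            flags.append(i + 1)
    return flags

# Steady ~50 GB/month, then a 300 GB jump in month 6:
sizes = [1000, 1050, 1100, 1150, 1200, 1500]
print(flag_growth_spikes(sizes))   # [6]
```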




The diagram shows the average LUN latency for three sampled LUNs over a period of three years.  Every month, I would take the information from OnCommand Performance Manager and use Excel to average out the average latency for each LUN.  The resulting number was then recorded in a spreadsheet and the trend established.  (This average value is only the baseline, and is of limited value in resolving performance problems.  Perfstat is the tool you need to understand for a performance issue.)
Throughout 2009, the system was 'new' and everything was good.  Feb 2010 was when the trend was understood to be a problem, and multiple 'fixes' were applied.  But the business had doubled in size by this point, and the little FAS3070 had given all it had.  The diagram enabled me to justify a head upgrade, which solved the problem before end users became aware.

Now, another interesting sub-story in this diagram is that the disks were the same 300GB 15k units throughout the sample period.  Notice the average latency for random reads is somewhere between 5 and 10 ms (the red line is log files and therefore sequential) on a healthy system.  See how the values all started to climb together when the whole system was stressed and - my personal favourite - see the effect of the PAM cards: spinning rust vs. data from cache memory.  (I'll write a blog about this later.)
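The monthly averaging step I did in Excel can be sketched in a few lines.  The LUN names and sample values here are illustrative, not from the real estate - the point is just turning a month of samples into one baseline figure per LUN for the trend spreadsheet:

```python
from statistics import mean

def monthly_baseline(samples):
    """Average the sampled latencies for each LUN into one monthly figure.

    samples: {lun_name: [latency_ms, ...]} collected over the month,
    e.g. exported from OnCommand Performance Manager.
    The result is only a trending baseline - not a performance-debug tool.
    """
    return {lun: round(mean(values), 2) for lun, values in samples.items()}

month = {
    "db_data": [6.1, 7.4, 5.9, 8.2],  # random reads: healthy at 5-10 ms
    "db_logs": [1.2, 1.5, 1.1, 1.4],  # sequential log writes: much lower
}
print(monthly_baseline(month))  # {'db_data': 6.9, 'db_logs': 1.3}
```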


Managers love colour...

The diagram is an Excel pivot table of the data taken from DFM, and is part of what I report to management so that they can understand system capacity in terms of what the storage system can deliver.  With care in how you use the tools, the monthly report should take no more than a single hour to generate, even when you scale up to my current estate of 30+ PB of storage.  For example, it is possible to pull the information directly into Excel via ODBC, which can make the process easier.  Or why not bring the data into SQL and automate the generation of historical trending information too?
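As a sketch of the SQL route: in production you would pull the rows from the DFM database over ODBC, but sqlite3 keeps this self-contained.  The table and column names here are illustrative stand-ins, not the real DFM schema:

```python
import sqlite3

# Stand-in for the DFM database (real setup: ODBC to the DFM server).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vol_usage (month TEXT, volume TEXT, used_gb REAL)")
conn.executemany(
    "INSERT INTO vol_usage VALUES (?, ?, ?)",
    [("2013-01", "vol_mail", 400), ("2013-02", "vol_mail", 430),
     ("2013-03", "vol_mail", 465)],
)

# Month-end sizes in date order, then month-on-month growth for the report.
rows = conn.execute(
    "SELECT month, used_gb FROM vol_usage "
    "WHERE volume = 'vol_mail' ORDER BY month"
).fetchall()
growth = [b - a for (_, a), (_, b) in zip(rows, rows[1:])]
print(growth)  # [30.0, 35.0]
```

Once the data is in SQL, scheduling this as a job gives you the historical trending for free each month.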


Fighting the good fight

The average latency baseline and growth rate are your primary weapons in fighting your corner when it comes to stopping people pointing the finger at shared storage when their application is running slow or a volume has suddenly filled up.  If you can quickly produce a trusted report which states the latency is normally about 9ms and the value is currently 9ms, it becomes VERY difficult for the 'problem' business application owner to say storage is causing the problem.  When there is no easy scapegoat available, the tiger team will have to focus on real root cause analysis, which will discover that 'undocumented' change that would normally be sneaked past in the chaos.
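That first-response check is simple enough to script.  The 25% tolerance below is an arbitrary illustration, not a NetApp recommendation - pick a band that matches the noise in your own baseline:

```python
def storage_within_baseline(current_ms, baseline_ms, tolerance=0.25):
    """Quick first-response check: is latency within its normal band?

    Compares current average latency against the trusted monthly
    baseline; True means storage looks normal and the tiger team
    should look elsewhere.  tolerance=0.25 is an arbitrary choice.
    """
    return abs(current_ms - baseline_ms) <= tolerance * baseline_ms

print(storage_within_baseline(9.0, 9.0))   # True - storage is not the culprit
print(storage_within_baseline(22.0, 9.0))  # False - worth investigating
```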


Once you have the basics sorted, you need to take reporting to the next level, and this link by Adaikkappan Arumugam should be in your bookmarks.


Holy Grail of Reporting - making the data 'owner' responsible

As storage admins, we cannot say what data should be removed.  I cannot think of any examples of people who have been fired for not removing 'junk' data from a system, but I do know people who have run into trouble when 'junk' data was no longer available when required.  The data owner must be made to take responsibility and identify which data can safely be removed.  So easy to write...


This is how I cracked the nut:


  • Selected volumes to report for each ‘owner’ and group by:
    • Primary
    • Standby (Snapmirror)
    • Backup (Snapvault)
  • Forecasted based on the last 12 months and on upcoming changes - new offices opening, system upgrades, etc.
  • Divided the forecast growth by four, to review each quarter
  • If the review was under or equal to the forecast, fine.  But I'd warn them if over.  On the second ‘over forecast’, I'd quote for the extra storage costs and send them in to the FD to ask for more…
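The quarterly review step above can be sketched as a simple check (the function is my own illustration; tracking the second overrun and the costing that follows is left to whatever record you keep per data owner):

```python
def quarterly_review(annual_forecast_gb, actual_growth_gb):
    """Apply one quarter's review against the agreed forecast.

    annual_forecast_gb: forecast growth for the 12 months, split
    evenly into four quarters as in the process above.
    actual_growth_gb: growth actually consumed this quarter.
    Returns 'ok' when within forecast, 'warn' on an overrun.
    """
    quarterly_budget = annual_forecast_gb / 4
    return "ok" if actual_growth_gb <= quarterly_budget else "warn"

print(quarterly_review(1200, 290))  # ok   (budget is 300 GB/quarter)
print(quarterly_review(1200, 340))  # warn (over forecast - owner notified)
```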


The application manager went off into the money man's office to try and explain that they were simply 'too busy' to manage the data growth, and that the storage team would need $$$ (unbudgeted) to make the problem go away.  Like a person trying to feed a hungry lion a steak by hand, this manager had not considered the full consequences of their actions.  Anyhow, the manager was now very keen to take ownership of the data, and this was the primary factor in the solution going forward.


My role at Proact is internal, and I look after our cloud systems - but they do let me out of the data centre to attend the Virtual Machine User Group events in the UK.  If you fancy a chat about storage or virtualisation, come along and say hi.  The next event is in Leeds on the 30th April, with London, Edinburgh and Manchester later in the year - follow this link for more information.




