Data Protection

Posted by chriskranz Sep 4, 2009

We are currently going through a fairly large project internally, and part of this is a “risk register” against the business. Now this includes a lot more information than just simply data on disk, but also people, reputation and so on. For me, now, this is what data protection is all about.

 

 

It’s an interesting topic, and something that I’d like to share with you at this early stage in my own project as it makes you look at the storage aspects in a different light.

 

 

What affects a piece of data’s risk class?

 

      1. Who has access to it?
      2. How confidential is it?
      3. Does it have a tangible value?
      4. How portable is it?
      5. Could it potentially damage the business reputation?
      6. Is it protected?
      7.   … probably a lot more!

 

 

Some of these are questions we have already asked about our data sets, since we need to define snapshot, replication and tape policies, but data protection goes a lot further than this. Interestingly the plugin for my blog has linked “data protection” with “Information Privacy”, which is a key point!

 

 

Who has access to it?

 

 

This is not just about front-end authorised access, although you do need to know this. Take payroll, for instance: generally only HR and Accounting would have access to it, but is there a mechanism for anyone else to gain access? If so, is there any audit control to check who has been granted access, or who has gained access? The audit control is almost more important than the security in the first place. Security can and will always be broken, but if you can prove it was broken, then you can fix it!

 

 

How confidential is it?

 

 

Most of us have a fair grasp on what is confidential and what is not. Employee data, Customer data, Payroll and Accounting are all obvious candidates for highly confidential data. But other things are still confidential, even if they are not classified as highly confidential. External IP schemas and low-level system passwords, for instance, may be freely accessible to the technical teams but are not available to the secretaries, which makes them confidential in some way. Are they marked as confidential? Other than applying common sense, have you ever told anyone that they shouldn’t email a domain administrator password around, for instance?

 

 

Does it have a tangible value?

 

 

This is a very difficult one to play, and the analysts will love it! Some things have a real, immediate tangible value: a purchase order or a signed contract, maybe. There is scope for defining a cost scoring system against data, but it is very difficult to calculate. Some things cost money in very indirect ways; for instance, if something damages the company’s reputation, it could cause loss of revenue. This should really be assessed in other areas, rather than spending time putting a tangible value on every piece of data (I want to finish my project this year!!!).

 

 

How portable is it?

 

 

With the age of Virtual Machines, portability is very important, and very dangerous. Someone can literally just walk off with an entire database system now on a portable hard drive! How do you protect against this? Is there any way to bind key systems, or police raw access to them? As much as technology and WAN speeds have come along, it’s still fairly unreasonable to assume you could email an entire system. However it is very easy to email spreadsheets and documents around. Preventing this from happening can be restrictive on day-to-day running of a business, so we fall back to auditing and monitoring. There are a lot of bases to cover, portable media, email, file shares, etc.

 

 

Could it damage the business reputation?

 

 

This is a good one, and not something you might immediately think of. It is not just “dirt” on the business; perhaps the business has a key technology or system that makes it unique in the market. If this is leaked to another company, it could damage the reputation as others could then start doing the same. Could the business reputation be damaged if the data were absent? If a key system was offline for a period of time, how would the business reputation be damaged? (Take a look at some start-up Cloud companies!) A damaged reputation could sink a company. Naturally, poor company ethics and business practices are a good way of destroying a reputation. I have many friends that still won’t buy Nestle!

 

 

Is it protected?

 

 

And this is an amalgamation of all of the above really. The questions above help you to define the business value on a particular data asset. So if it has a high value, how protected is it? How protected should it be? How long can the business survive while it is being recovered? How much would data loss actually cost the company?

 

 

The risk class and business value will greatly affect the protection and auditing you deploy around it.

 

 

Putting this into action

 

 

I’d love to hear from people about how you put the above into action. We use NetApp ourselves (definitely practice what we preach), and this gives me a great level of control over my data sets and the protection we employ. Protection goes further than just snapshots and tape backup however; we need to protect it from more than just data loss. While NetApp have some great tools for protecting against data loss, there is a requirement to help with the other areas of data protection, and I’d love to see NetApp build on this space.

 

 

I have some experience using tools like Varonis, Acopia, Northern Storage Suite, NTP QFS, TekTools, to name a few, and these have all helped us in the past in deploying a complete solution. I have said on several occasions, and it’s something I really believe in so I’ll say it again; a complete solution is a combination of many different technologies that complement each other.

 

 

I’d like to revisit this topic again in a few months when I have progressed my project further, but I’d like to hear from the field to see what other people are doing to gain complete data protection.

 

 

 


Visual Cheat Sheet

Posted by chriskranz Jul 9, 2009

I am actually very proud of these. I did them a while ago and I still refer to them quite a lot. The idea is simple: if you only configure a filer once or twice a month, the process might not stick in your head exactly and you may easily miss out a step. Rather than reams and reams of documentation, the idea is to have a couple of these pinned around your monitor, so you have a quick visual guide on how you should configure things and it's easy to keep to a certain standard.

 

I'm going to try to produce more of these for various guides I do; they are much easier to understand quickly when you're in a hurry.

The SnapDrive one here is a little dated, so please don't hold that against the content.

 

I should also add, they've not scaled well on here, but I think you get the idea. Not wanting to plug myself externally, but I think the images came out better over here - http://www.wafl.co.uk/visual-cheat-sheetvisual-cheat-sheet/

 

 

Presenting Storage to VMware.png

 

Presenting Storage via SnapDrive.png

 

SnapMirror Relationships.png

 

Storage Setup Requirement.png


I have met a lot of people that have a fear of Operations Manager. I've had a fair play with this now, and once you get to grips with the interface and the thinking behind it all, it is actually quite straightforward. I did a quick guide for one of my customers who wanted to be able to schedule reports and also make some custom ones. This was based on 3.7, so I'm not sure how much this has changed recently, but I will try to update it for later versions.

 

 

Management Group

Chargeback

Generate Pre-Built Report

Volume Growth

User usage

Build a Custom Report

Scheduled Reports

 

Management Group

Creating custom reports from within Ops Mgr is relatively simple. The first step is to create some Management Groups so that we can run reports on certain areas of storage rather than across the entire system.

 

Next to Groups, click "Edit Groups"

 

 

Now give the new group a name. If some groups already exist, you can place the new group inside another if needed. Here we are going to add a new group at the global level. Then click "Add".

 

 

The new group appears at the bottom, but you'll notice the "Group Membership" reads "Empty". Click on "Empty" so we can add some filesystem members to this group.

 

 

From the drop down on the next page, choose the volumes (or other areas if you wish) that you want to add to this reporting group (use shift or ctrl to select multiples) and use the ">>" button to add them. The page will refresh each time you click this, so it's easier to select and add multiples. Here we have added all the VMware volumes.

 

 

There is no commit button or save button. When you click ">>" the selected items are added to the group straight away. On the left-hand side we can see this new group, and any related warnings are passed down to it.

 

 

Chargeback

 

When you have management groups set up, you can then set up charging mechanisms per group. This is incredibly useful to help work out the TCO of a NetApp system and also, obviously, how to charge departments back for their usage (or even just show them how much their usage is costing).

How to calculate what your cost per GB should be is outside the scope of this doc, but it should take into account several aspects of filer management, user admin costs and also the basic costs of storage. To get a full costing, all aspects need to be reviewed and adjusted for (power, cooling, space, admin costs, management, and so on). For this example we are simply using 5 per GB (although 50 may be closer to reality on a large system). 5 can be $, £, or whatever you want. Because the £ symbol is not a standard ASCII symbol, we have to enter the HTML character code for it. This is "&pound;" (or the numeric form "&#163;"). The currency can be set in "Setup", "Options", "Chargeback". You can also set the defaults here.

 

Go back into "Edit Groups"

 

And choose the group we have just set up (here we set up a new group for the User Home Directories).

 

 

At the bottom of the page we have an "Annual Charge Rate (per GB)" and you'll see the default we've now set. You can alter this for different areas, so VMware may have a different cost to users' home directories, or FC disk may have a different cost to SATA disk.

 

Chargeback works on averages, so if this is a fresh system or you have just initialised the quotas, then this may not give any useful information just yet.
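As a rough illustration of the arithmetic, here is a small Python sketch. It assumes the charge is simply the average space used multiplied by the annual rate per GB; the function name and the sample figures are made up for illustration and are not taken from Operations Manager.

```python
def annual_chargeback(samples_gb, annual_rate_per_gb):
    """Average the usage samples (in GB) and multiply by the annual rate per GB."""
    if not samples_gb:
        return 0.0  # no samples collected yet, so nothing to charge
    average_gb = sum(samples_gb) / len(samples_gb)
    return average_gb * annual_rate_per_gb

# Hypothetical usage samples for one group, charged at 5 per GB per year
print(annual_chargeback([100, 120, 140], 5))  # 600.0
```

With only one or two samples the average says very little, which is why a freshly configured system will not show meaningful figures.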

 

Generate Pre-Built Report

Volume Growth

 

Now we can create some reports based on this. Click on "Reports" and "All" and we can generate a pre-built report based on our new management group.

 

 

We want to create a growth report here, so on the left hand side, under "Logical Objects" choose "Volumes".

 

 

Then choose "Volume Growth"

 

 

At the bottom under "Using Resource", click "Browse".

 

 

Then under "Resources" and "Groups" drill down to our newly created group. Just select the top level group and not the individual members. OK this.

 

 

The selection at the bottom changes. Click on "Show" and we'll get our volume growth for this management group.

 

 

The example below had only recently been set up, so the growth rates have not had a chance to be registered.

 

 

User usage

 

For this to work successfully, you need to have some quotas set up on the users' home directories. These don't have to be hard quotas; in this example we will be setting up some soft quotas on the filer, and then reporting against these.

 

From FilerView, go to "Volumes", "Quotas", "Add".

 

 

Here we have a single home directory volume and we are going to add a default User quota based on this.

 

 

In this particular volume we have a qtree setup for "normal" users. In this example this is actually just one qtree for all users, but this can be setup to report on multiple qtrees if needed.

 

 

For this example we don't actually want to enforce quotas, we simply want to report against our users. So we are going to use a soft quota of 5 GB. I have also put a soft quota on 50,000 (50k) files. All other entries are left blank to prevent any hard limits or other limits being enforced.

 

 

 

Remember to enable this quota we need to go back to the manage page, check it, and click "on" at the bottom. If it has already been enabled, click "resize" to reload the quota details.

 

 

 

The filer will now scan the volume, if it is large it may take some time…

 

 

Once this has completed, verify that it is working as expected by looking at "Quotas" and "Report". We can see the soft limits in place for our example here.

 

 

I have also set up a "Tree" quota, again with soft limits, on a departmental share with separate Qtrees for each department. We can see the results of that quota below.

 

 

Now back to Operations Manager. By default the monitoring periods for Qtrees and Quotas are 8 hours and 1 day respectively. If we are demo'ing this, or setting this up for a customer, we may want to reduce these to allow us to show the immediate effects.

 

Go to "Setup", "Discovery".

 

 

Then go to "Monitoring"

 

 

And update the Qtree Monitoring and User Quota Monitoring intervals. Here I have changed them to 5 minutes and had a quick coffee break.

 

 

We will need a new management group for this, so I have created a new group called "User Areas" and added the Qtree "/home/normal" to the list of members.

 

 

We can check that this has successfully picked up our new quotas and is recognising the users by looking at the "Group Status" and then "Quotas" tab.

 

 

We now see a list of all the users and the status of their level. Users with green status will be under quota, those with yellow will be over quota.

 

 

Now we can finally report on this usage! Under "Reports", "All", then under "Monitoring" choose "User Quotas".

 

 

Choose "User Quotas, All" and make sure you are "Using Resource" for the management group we just set up.

 

 

This report is almost identical to the one shown within FilerView, but it is a little easier to understand and is sortable.

 

 

Users that no longer exist on the domain will appear with their SID rather than their username.

 

The limitation here is that Quotas are entirely driven by ownership. So if a user does not own their home directory, the above process will not work, and you will find that "Administrator" owns a large proportion of user files. This should be addressed, as it is best practice to make sure all users own their own files anyway.

 

Build a Custom Report

 

If we want to build a custom report, maybe comparing 2 data fields that don't normally exist on the same report, we can do so. Under "Reports" click "Custom".

 

 

You'll be taken through to the "Create a Report" page by default. There is a veritable banquet of options for custom reports, so have a careful think before simply diving in to create or show this feature off. It may be best to base what you require on an existing report. Most things are available in the pre-built reports, so it may be that you just want to customise this slightly.

 

For this example, I want a single report to display the username, disk space used, files used, chargeback, daily growth rate % and days to full.

Whatever report you create must have at least one field taken from the "Base Catalog", so choose the "Base Catalog" that best suits what you are trying to report on. For this example it is obviously going to be "UserQuota".

 

 

The display tab is quite important as this shows where in Operations Manager this report will be visible from a drop down menu. When you are on different pages, Operations Manager gives you different lists that you can choose from. Again, on this occasion it makes sense to include this report in the "Quotas" tab.

 

 

Next we need to start defining what information we want to include in this report. So from what I defined at the start, I will start filling out my report. Remember you need at least 1 field from the top level.

 

Daily Growth Rate %
Days to Full
Disk Space Used
Files Used
Username (note I have to drill down a level to get this info)
Chargeback

Then I need to order this in the way I defined for my report.

 

And finally click "Create" and we see our new Custom report at the bottom.

 

 

To see this report in action, go back to "Home", then click on "User Home Directories" (my custom group), "Quotas" and choose this from the "Report" drop down at the top right.

 

 

As mentioned earlier, there are some blanks in this as the report hasn't been collecting data for long, so there is no data for the growth rate or the chargeback amounts.

 

 

If you want to tweak this report, you can go back into "Reports", "Custom" and select "Edit" from the list at the bottom of the page.

 

Scheduled Reports

 

Once you know which reports are useful, or have perhaps pulled certain aspects out of different reports to create your own custom one, you may want to schedule these reports and get them emailed to you. Go to "Reports" and "Schedule".

 

 

First we want to create a schedule, so click on the "Schedule" tab and click "Add New Schedule"

 

 

Create a schedule based on what your requirements are. Here I want a report generated every Sunday at 8pm.

 

 

Now back to "Report Schedules" and we can "Add New Report Schedule".

 

 

From here I can now create my Scheduled Report and the email recipients. I am going to get my new custom report (User Custom Usage) to run against my new management group (User Home Directories) to get generated per my new schedule (Weekly Sunday Evening) and emailed to myself. I'm going to keep the standard HTML formatting and all the other standard settings.

 

 

Make sure that a valid mail server has been configured before generating any reports. Go to "Setup", "Options", "Events and Alerts". Here I have also updated the purge interval as I don't want alerts older than 1 week.

 

 

To test this schedule, I want to run it now. Check the checkbox of my new schedule and click "Run Selected".

 

 

The report will be attached in a zip file.


snapmirror.conf file

Posted by chriskranz May 1, 2009

This question seems to come up quite a lot, so I thought I’d cover it quickly. I’m going to steer away from covering SnapMirror as a whole, and just look at the format of the snapmirror.conf file. I will also steer away from Synchronous SnapMirror as I’m not a huge fan, I prefer SyncMirror!

 

First, you can find it in /etc/snapmirror.conf. Edit it using either rdfile / wrfile, or map to /vol/etc and edit it with your favourite text editor (but not Windows Notepad please!). Once you get used to the formatting, you’ll be writing these with your eyes closed! Having said that, I usually need to refer to something for reference!

 

The basic layout is…

source_filer:volume_name destination_filer:volume_name options min hour dom dow

This drops it down into very simple terms, a good reference to start with. The “volume_name” can of course be a QTree if you are doing QSM, but I will concentrate on VSM for now.

 

The options section is often left at its defaults. Any field you leave unset, from options through the schedule, is filled in with “-”. So if you are setting up SnapManager for Exchange or SQL, you would create a relationship with these “-” entries, do a baseline, then get SMx to manage the replication. Leaving the options field as a single “-” means you accept the default for all settings. If you define one setting, the others stay at their defaults.

 

The options you can choose from are…

  • “kbs=” to limit the transfer speeds to whatever number you define here. This is in kilobytes, so remember to convert it for WAN speeds.
  • “restart=” to set how a transfer is resumed when it suffers a communication failure. “never” always starts the SnapMirror transfer from the beginning, and never sets a checkpoint. “always” will try to resume from a checkpoint during the transfer. “default” means that transfers are updated and checkpointed providing this doesn’t conflict with a schedule.

 

If you need to set both these options, then separate them with a comma, but not a space. If you set one, but not the other, the other is assumed to be the default.
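As a quick aid for the “kbs=” conversion, here is a minimal Python sketch. It assumes the WAN link is quoted in megabits per second, that a kilobyte is taken as 1024 bytes, and that you only want SnapMirror to use a share of the link; the function name and the 50% share are illustrative only.

```python
def kbs_for_link(link_mbit_per_s, share=1.0):
    """Convert a link speed in megabits/s into a kilobytes/s cap,
    taking only `share` (0.0 to 1.0) of the link."""
    bytes_per_s = link_mbit_per_s * 1_000_000 / 8
    return int(bytes_per_s * share / 1024)

# Half of a 10 Mbit/s WAN link, combined with a restart setting
limit = kbs_for_link(10, share=0.5)
print(f"kbs={limit},restart=always")  # kbs=610,restart=always
```

The printed string follows the comma-without-space rule for combining the two options.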

 

Schedules can be defined in a variety of ways…

  • Simple numbers, or comma-delimited numbers. In the hour field, “0” for midnight or “0,12” for midnight and noon.
  • Timeframes. You can run updates during a period of time. “8-18” in the hour field covers 8am to 6pm for office hours.
  • A mix of timeframes and numbers. “4,8-18,22” to do a replica at 4am, then during office hours, and then again at 10pm.
  • Frequency. So if you want to run every three hours during the day, “0-23/3”. Perhaps more commonly used, every 5 minutes: “0-59/5”.
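To check what a schedule field will actually match, it can help to expand it by hand. The following Python sketch does this for the formats listed above. It is only an illustration of the notation, not how the filer parses the file, and treating “-” as "never matches" is an assumption.

```python
def expand_field(field, low, high):
    """Expand one schedule field, e.g. "4,8-18,22" or "0-23/3", into sorted values.
    `low` and `high` bound the field (0-23 for hours, 0-59 for minutes)."""
    if field == "-":
        return []  # assumption: "-" means the field never matches
    values = set()
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            start, end = low, high
        elif "-" in part:
            start_text, end_text = part.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        values.update(range(start, end + 1, step))
    return sorted(values)

print(expand_field("4,8-18,22", 0, 23))
print(expand_field("0-23/3", 0, 23))  # [0, 3, 6, 9, 12, 15, 18, 21]
```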

 

One thing to watch out for is if you are looking for vague scheduling. So perhaps every midnight, or every Monday, you might do one of the following…

filer:vol filer:vol - * * * 1
filer:vol filer:vol - * 0 * *

The first will actually run every single minute of every hour of every Monday (but no other time). The second will not run once at midnight, but will in fact run an update every minute between midnight and 00:59. The * is not a default, so if you want midnight, don’t forget to include 0 minutes past the hour! However, if you want to run a schedule every Monday, you need to make sure the dom is left as a *, or it’ll only ever run if the Monday falls on that day of the month! So always try to review your schedules.
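These two pitfalls are easy to check for mechanically. Here is a small Python sketch that flags them, assuming each entry has the seven whitespace-separated fields from the basic layout (source, destination, options, min, hour, dom, dow). The function name and warning texts are my own.

```python
def schedule_warnings(line):
    """Return warnings for one snapmirror.conf entry laid out as:
    source destination options min hour dom dow"""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError("expected 7 whitespace-separated fields")
    minute, _hour, dom, dow = fields[3:]
    warnings = []
    if minute == "*":
        warnings.append("minute is *: an update runs every minute of each matching hour")
    if dom not in ("*", "-") and dow not in ("*", "-"):
        warnings.append("dom and dow are both set: the update only runs when both match")
    return warnings

for entry in ("filer:vol filer:vol - * * * 1",
              "filer:vol filer:vol - * 0 * *",
              "filer:vol filer:vol - 0 0 * 1"):
    print(entry, "->", schedule_warnings(entry) or "ok")
```

Only the last entry, with an explicit 0 in both the minute and hour fields, passes without a warning.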

 

A nice little trick is setting up multi-pathing for SnapMirror. Out of the box SnapMirror will only really be able to use one path. By setting up SnapMirror with multiple paths you can make full use of what bandwidth you have. This is really useful if you are doing a large baseline and the two systems are in the same datacentre.

 

At the top of the snapmirror.conf file, declare the following…

name = mode (src_filer1, dest_filer1) (src_filer2, dest_filer2)

The “name” is the connection you want to refer to, which would then be used every time you define a replica schedule instead of the source filer name. The mode would be defined as “multi” or “failover”.

 

As an example, I want to multi-path from filer1 to filer2. Both filers have 2 VIFs defined that use separate paths between the systems (vif1, vif2). To define this, I would declare this at the top of my snapmirror.conf with something like…

filer1 = multi (filer1_vif1, filer2_vif1) (filer1_vif2, filer2_vif2)

One last thing that is very important. When you are setting up SnapMirrors for the first time, make sure you have a console open to both the primary and the secondary systems. A lot of errors get displayed to the console, which makes it very easy to troubleshoot. Some errors are also only displayed on one end of the relationship, so it is useful to have both open.

 

For reference I would strongly recommend reading through the “Data Protection: Online Backup and Recovery Guide” available on the NOW site  (Section 4, page 77 with the OnTap 7.2.6.1 version). There should be a Red Book version of this also. This is a great guide and covers almost every area of SnapMirror that you should need to know.


OnTap Configs

Posted by chriskranz Apr 27, 2009

I think it's very important to save a config of a good setup. Firstly it's a great reference if you ever need to go back and refer to things, secondly it's a great way to show what you did was actually correct and that you did configure things correctly from the start (especially for me as a consultant!).

 

There is a handy tool provided within OnTap (and you get this functionality from Operations Manager also) to do entire config dumps, compares and restores. This is limited to the filers base configuration and doesn't necessarily include areas like volume setup or various config files within "/etc".

b2net-filer01> config
Usage:
        config clone <filer> <remote_user>
        config diff [-o <output_file>] <config_file1> [ <config_file2> ]
        config dump [-f] [-v] <config_file>
        config restore [-v] <config_file>

The command is very simple and straightforward. You start by dumping out the configuration from the filer. This automatically goes into /etc/configs. From here you can then clone the config if needed, or compare (diff) the config. Running diff is a very good way of comparing a config between 2 points in time if you aren't sure what has changed. It is equally useful if you are doing a filer upgrade and copy the config files between the 2 systems, or if you want to have primary and DR as identical as possible. And finally you can also use the restore feature, although I have personally never used it, and I'm not sure what other changes would need to be made within the /etc folder to do a system restore (potentially I guess you could also do a snap restore on vol0 or copy across a backup vol0).
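The built-in "config diff" is the natural tool on the filer itself. If you have copied two dumps off the filer, a rough offline comparison can be sketched in Python with the standard difflib module. This assumes the dump files are plain text; the file names and sample lines below are placeholders, not real dump contents.

```python
import difflib

def diff_configs(old_text, new_text, old_name="old.cfg", new_name="new.cfg"):
    """Return unified-diff lines between two config dumps held as strings."""
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile=old_name, tofile=new_name, lineterm=""))

# Made-up sample content standing in for two dumps (e.g. primary and DR)
primary = "setting.one=on\nsetting.two=off\n"
dr = "setting.one=on\nsetting.two=on\n"
for line in diff_configs(primary, dr, "primary.cfg", "dr.cfg"):
    print(line)
```

In practice you would read the two files from wherever you copied them and pass their contents in; an empty result means the dumps are identical.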

 

Overall a very useful command. I use this most for taking backups of filer configs and comparing them between similar systems (for instance primary and DR), or even comparing configs over time.


      

PerfStat is a great way to get some quite detailed performance information out of the filer when you have a performance or other issue that you can't quite put your finger on. You need to have access to the PerfStat Viewer, or get someone to process this output for you, and then you need to trawl through it.

 

Operations Manager, and more specifically Performance Advisor, is brilliant and 99% of the time gives you the counters you need to diagnose the problem. Once you've found your way round it, it is completely indispensable!

 

But what if you don't have Operations Manager, or you just want to quickly pull out information on one area of the system?

 

The first thing you want to look at is sysstat. It is everyone's best friend and a great way of seeing "Is my system busy?". Whenever you run sysstat, make sure to throw it the "-s" modifier so that you get a summary at the end of the output. If you don't define a number of iterations (-c <num>), then use ctrl+c to break the output. "-x" is great for giving all areas of output, but it can be a little wide sometimes. "-u" is my favourite as it gives you utilisation readings and these are usually the most useful when troubleshooting.

 

Most of the columns are fairly self-explanatory. CPU is % busy; NFS, CIFS, HTTP, FCP and iSCSI are all protocol operations counters. Net kB/s in and out are obvious (for reference a single gigabit interface will happily sustain around 80MB/s, but can stretch to 110/120MB/s). Then there is Disk and Tape in and out. Watch the cache age when it gets really low, but there are better counters for that. Cache hit is a counter you want as close to 100% as possible. The more data is getting read from cache the better! CP Type is Consistency Points; I won't go into detail as to what these are, as there is a very good KB article on this already (KB23471). And finally Disk Utilisation, which seems to cause some confusion. This is the reading from the single busiest disk in the system, and not an average. This reading can interestingly go above 100% (much like CPU can too), and this simply means the disks are doing more than they should!

 

So sysstat is a great way to get a high level view of "Is my system busy" and also gives you a rough idea of where the bottleneck is. If the CPU is really high, but nothing else, then this is what is holding back the system. If the disk utilisation is very high, then again, here is the problem. But these aren't conclusive figures, and don't point directly at a culprit. For instance if disk utilisation is very high, you may need to run a wafl reallocate as you have added some new disks and these aren't holding any data yet. If your CPU is very high, it may be that you are doing a lot of other processing like A-SIS and SnapVault, or it could be very random IO so the CPU is working harder at trying to make calculations around this.

 

The next step may be to look at statit. This is a "priv set advanced" command, and not for the faint-hearted, but it is a great command to get a snapshot of details over a period. Simply run "statit -b" at the start of the monitoring period, and then "statit -e" at the end. Make sure to log your output window as you'll get a lot from statit (more than the standard Windows and PuTTY buffer will show). There is a lot of statit output, and I won't go into too much detail on it all here (perhaps another day).

 

This brings me onto the real reason for this article in the first place. One of my favourite commands, and certainly a largely overlooked one, is “stats”. This has a lot of information at its fingertips: pretty much anything you can see in Performance Advisor and anything you can report on in PerfStat is available in the stats command. And possibly a lot more! “stats” works very similarly to sysstat in that it reports counters based on the iterations. If you simply run it, it’ll report what the system is doing at that exact time. If you tell it to run every 5 seconds, it’ll report what happened over those 5 seconds.

 

So first up, don’t just jump in and run “stats show” without having a few minutes to spare. The output is very complete! First you want to see what counters are available. Stats is split into “Objects”, “Instances” and “Counters”. To show each, we can use “stats list …”

 

b2net-filer01> stats list objects
Objects:
      dump
      logical_replication_source
      logical_replication_destination
      vfiler
      qtree
      aggregate
      iscsi
      fcp
      cifs
      volume
      lun
      target
      nfsv3
      ifnet
      processor
      disk
      system

b2net-filer01> stats list instances ifnet
Instances for object name: ifnet
      B2net
      Storage-101

b2net-filer01> stats list counters ifnet
Counters for object name: ifnet
      recv_packets
      recv_errors
      send_packets
      send_errors
      collisions
      recv_data
      send_data
      recv_mcasts
      send_mcasts
      recv_drop_packets

As the example above shows, I can list all the objects available to me, I can query all the networking instances I have set up (2 VIFs, 1 with a VLAN), and I can see what counters I can report on. So putting this together…

 

b2net-filer01> stats show ifnet:Storage-101:collisions

ifnet:Storage-101:collisions:0/s

Great, my storage interface doesn’t have any network collisions for the period this has run! That’s good news for me!

If I want to run this over several iterations, I can feed it some more options. Note: The options must go before the counter information!

 

b2net-filer01> stats show -n 5 -i 1 ifnet:Storage-101:collisions

Instance collisions
               /s
Storage-101        0
Storage-101        0
Storage-101        0
Storage-101        0
Storage-101        0

Great, so over a period of 5 seconds I’m still not getting collisions!

 

You’ll notice from above that there are a lot of performance counters available, and not all of them have the most verbose names. You can query any of these by running “stats explain counters”.

 

b2net-filer01> stats explain counters ifnet collisions

Counters for object name: ifnet

Name: collisions

Description: Collisions per second on CSMA interfaces

Properties: rate

Unit: per_sec

So let’s take another example: I want to look at latency readings on my Exchange system…

 

b2net-filer01> stats show -n 5 -i 1 volume:exch01_db:read_latency volume:exch01_db:write_latency volume:exch01_logs:read_latency volume:exch01_logs:write_latency

Instance read_latency write_latenc

               ms         ms

exch01_db          0          0

exch01_logs          0          0

exch01_db          0          0

exch01_logs          0          0

exch01_db          0          0

exch01_logs          0          0

exch01_db          0          0

exch01_logs          0          0

exch01_db          0          0

exch01_logs          0          0

 

It’s 8 in the morning, none of the sales team is awake yet! The column headings get a bit skewed, but we can see read latency in the first column, and write latency in the second.
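If you want to post-process this multi-iteration output rather than read it by eye, a minimal Python sketch along these lines might help. It assumes the layout shown above (a header line, a units line, then one row per instance per iteration); the function name is hypothetical, and the truncated column names (such as “write_latenc”) are kept exactly as the filer prints them.

```python
def parse_stats_table(text):
    """Parse multi-iteration 'stats show' output into {instance: [sample, ...]}.

    Assumes the first non-blank line is the header, the second is the units
    line, and every later line is one instance followed by its values.
    """
    lines = [line.split() for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected a header line and a units line")
    header, units = lines[0], lines[1]
    counters = header[1:]  # the first header field is the literal "Instance"
    if len(units) != len(counters):
        raise ValueError("units line does not match the header")
    samples = {}
    for fields in lines[2:]:
        instance, values = fields[0], fields[1:]
        if len(values) != len(counters):
            raise ValueError("unexpected row: %r" % (fields,))
        # Column names are used as printed, even where the filer truncates them.
        row = {name: float(value) for name, value in zip(counters, values)}
        samples.setdefault(instance, []).append(row)
    return samples
```

Each instance ends up with one dictionary per iteration, so working out a maximum or an average over the run is a one-liner from there.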

 

One of my biggest complaints about sysstat is what happens if I want to keep this running over a period of time and log the output? Well, I can change “options autologout” and leave my laptop plugged in, but that’s never a good idea. “stats” gives you the ability to pipe all stats output direct to a file. Brilliant news!

 

b2net-filer01> stats show -n 5 -i 1 -o /etc/stats.txt volume:exch01_db:read_latency volume:exch01_db:write_latency volume:exch01_logs:read_latency volume:exch01_logs:write_latency

b2net-filer01> rdfile /etc/stats.txt

Instance read_latency write_latenc

                 ms         ms

exch01_db          0      16.00

exch01_logs          0          0

exch01_db          0          0

exch01_logs          0          0

exch01_db          0       8.00

exch01_logs          0          0

exch01_db          0          0

exch01_logs          0          0

exch01_db          0       1.00

exch01_logs          0          0

 

Unfortunately this doesn’t free up the console, so scripting this from RSH or SSH may be the best bet, but be careful how long you run the iterations for!
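As a rough idea of what the scripted approach could look like, here is a minimal Python sketch. It assumes SSH is enabled on the filer and that key-based login already works from the machine running the script; the function names, the host string and the log file name are all hypothetical. The iteration count is always passed explicitly so the run is bounded.

```python
import subprocess


def build_stats_command(counters, iterations=5, interval=1):
    """Build the 'stats show' command line; options go before the counters."""
    if not counters:
        raise ValueError("at least one counter is required")
    return " ".join(
        ["stats", "show", "-n", str(iterations), "-i", str(interval)]
        + list(counters)
    )


def run_on_filer(host, command, timeout):
    """Run one command on the filer over SSH and return its output.

    Assumption: SSH is enabled on the filer and key-based login is set up.
    """
    result = subprocess.run(
        ["ssh", host, command],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return result.stdout


def collect(host, counters, iterations, interval, logfile):
    """Run a bounded collection and append the raw output to a local file."""
    command = build_stats_command(counters, iterations, interval)
    # Allow the full sampling window plus some slack before giving up.
    output = run_on_filer(host, command, timeout=iterations * interval + 30)
    with open(logfile, "a") as handle:
        handle.write(output)
    return output
```

A call such as `collect("root@b2net-filer01", ["volume:exch01_db:read_latency"], 60, 1, "exch_latency.log")` would then sample for a minute and keep the log on the admin host rather than on the filer.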

 

Another nice feature is that you can have some presets. So if you have 4 Exchange servers each with 3 databases, then you can load all the volume:<vol_name>:read/write_latency commands into a file and issue this direct from the stats command. The presets files are XML files, so they take a little thought in the writing, but if you have seen XML before, then it’s not that tricky.

 

My XML file looks like this…

 

<?xml VERSION = "1.0" ?>

<preset>

      <object name="volume">

         <instance name="exch01_db">

                 <counter name="read_latency">

                 </counter>

                 <counter name="write_latency">

                </counter>

         </instance>

         <instance name="exch01_logs">

                 <counter name="read_latency">

                 </counter>

                 <counter name="write_latency">

                 </counter>

         </instance>

   </object>

</preset>

 

Once saved within /etc/stats/presets as an “.xml” file (my file is exchange.xml), I can call it directly from the stats command.

 

b2net-filer01> stats show -p exchange -i 1 -n 5

Instance read_latency write_latenc

                 ms         ms

exch01_db          0          0

exch01_logs          0          0

exch01_db          0          0

exch01_logs          0          0

exch01_db          0       0.13

exch01_logs          0       0.12

exch01_db          0       0.00

exch01_logs          0       0.00

exch01_db          0          0

exch01_logs          0          0
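With a lot of volumes, writing the preset by hand gets tedious. The sketch below generates a preset in the same layout as my hand-written file. It is only an illustration: the function name is hypothetical, and I am assuming the filer is happy with a generated file as long as it matches the structure shown earlier.

```python
from xml.sax.saxutils import quoteattr


def build_preset(object_name, instances, counters):
    """Return preset XML text in the same layout as the hand-written file."""
    # The declaration line is copied from the hand-written preset as-is.
    lines = ['<?xml VERSION = "1.0" ?>', "<preset>"]
    lines.append("  <object name=%s>" % quoteattr(object_name))
    for instance in instances:
        lines.append("    <instance name=%s>" % quoteattr(instance))
        for counter in counters:
            lines.append("      <counter name=%s>" % quoteattr(counter))
            lines.append("      </counter>")
        lines.append("    </instance>")
    lines.append("  </object>")
    lines.append("</preset>")
    return "\n".join(lines) + "\n"
```

For example, `build_preset("volume", ["exch01_db", "exch01_logs"], ["read_latency", "write_latency"])` produces the equivalent of my exchange.xml, and the list of volume names can just as easily come from a longer list of Exchange databases.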

 

The possibilities are huge for this, but this opens up something even better. We can now use “stats start” and “stats stop” to trigger this reporting and I get my console back!

 

b2net-filer01> stats start -p exchange

Stats identifier name is 'Ind0x6920b2f0'

b2net-filer01> stats show -I Ind0x6920b2f0

StatisticsID: Ind0x6920b2f0

volume:exch01_db:read_latency:0ms

volume:exch01_db:write_latency:5.14ms

volume:exch01_logs:read_latency:0ms

volume:exch01_logs:write_latency:0.00ms

b2net-filer01> stats stop -I Ind0x6920b2f0

StatisticsID: Ind0x6920b2f0

volume:exch01_db:read_latency:0ms

volume:exch01_db:write_latency:5.36ms

volume:exch01_logs:read_latency:0ms

volume:exch01_logs:write_latency:0.00ms

 

Hopefully you are starting to realise why I like this command, and why the possibilities for using this are huge, and that it is very powerful indeed!

 

Another powerful feature is that you can use wildcards in the “stats show” command. So to pull out all counters for my Exchange database…

 

b2net-filer01> stats show volume:exch01_db:*

volume:exch01_db:avg_latency:0.00ms

volume:exch01_db:total_ops:3/s

volume:exch01_db:read_data:0b/s

volume:exch01_db:read_latency:0ms

volume:exch01_db:read_ops:0/s

volume:exch01_db:write_data:12288b/s

volume:exch01_db:write_latency:0.00ms

volume:exch01_db:write_ops:3/s

volume:exch01_db:other_latency:0ms

volume:exch01_db:other_ops:0/s

 

Or to show all the read_latency for all my volumes…

 

b2net-filer01> stats show volume:*:read_latency

volume:vol0:read_latency:0ms

volume:exch01_db:read_latency:0ms

volume:home:read_latency:0ms

volume:backup:read_latency:0ms

volume:share:read_latency:0ms
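The single-shot output is one “object:instance:counter:value” entry per line, which makes it easy to pick apart in a script. Here is a minimal Python sketch, assuming the value is always a number followed by an optional unit as in the examples above; the function name is hypothetical.

```python
import re

# A numeric value followed by an optional unit such as "ms", "/s" or "b/s".
_VALUE = re.compile(r"^(-?\d+(?:\.\d+)?)(.*)$")


def parse_stats_line(line):
    """Split 'object:instance:counter:value' into a dictionary."""
    try:
        obj, rest = line.strip().split(":", 1)
        # Split from the right so an instance name containing ':' survives.
        instance, counter, raw = rest.rsplit(":", 2)
    except ValueError:
        raise ValueError("unexpected stats line: %r" % line)
    match = _VALUE.match(raw)
    if match is None:
        raise ValueError("unexpected value: %r" % raw)
    return {
        "object": obj,
        "instance": instance,
        "counter": counter,
        "value": float(match.group(1)),
        "unit": match.group(2),
    }
```

Run over the wildcard output, this gives a list you can filter, for example to report only the volumes whose read_latency is above a threshold you care about.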

 

 

One final thing to add: there are a lot of counters available by default in the normal privilege mode, but try switching to advanced, or even diag, and see how many counters are available then! This is overwhelming, but with a bit of digging, very powerful.

 

If you have any specific questions, or you want to query how to get specific counter information from the system, feel free to send me over a question. Hope this is useful for everyone!

2,461 Views 3 Comments Permalink Tags: latency, stats, statit, perfstat, performance

I seem to get questioned about Fractional Reservation at least once a week, and find myself explaining it over and over. I have now found quite a simple way of explaining it; unfortunately much of the documentation doesn't make it quite so simple to understand. I’ve got a much better understanding of what it actually is now. It makes more sense as NetApp have changed its description in some places: in Operations Manager 3.7 it’s now referred to as “Overwrite Reserved Space”.

 


This is easiest to explain with pictures. We should all have seen a standard snapshot graphic. When we snapshot the filesystem, we take a copy of the file allocation tables and this locks the data blocks in place. Any new or changed data is written to a new location, and the original blocks are preserved on disk.

 

snap00285.bmp


So basically a snapshot locks the data blocks of the data referenced by it in place. This means that any new or changed blocks (D1 in the above graphic) in the active file-system are written to a different location. This concept is fundamentally the same as what Fractional Reservation is.

 

As the LUN gets filled up with data, we take a snapshot and that data is locked in place. Potentially all this data could change, and we need to guarantee not only this existing data, but also the potential that we need to write totally new data blocks. Any changed data gets written into the Fractional Reservation area rather than into the area that the existing LUN data is in. (I know that in reality this is spread across all the disks and these areas don’t actually exist, but it makes it easier to visualise and understand when explained this way.) As changed data blocks are written, old data blocks get preserved in the snapshot reservation area. Fractional Reservation is reserving space for the maximum rate of change we could potentially get.

 

snap00286.bmp

Don't confuse this with the snapshot reservation area. The snapshot reservation includes saved data blocks from previous snapshots, whereas the Fractional Reservation is protecting your Active File System (AFS in the above graphic) from its own potential rate of change.

 

 

So the reason a LUN may be switched offline if the fractional reservation area is set to 0, is that the filer needs to protect the existing data that is locked between the active file system and the most recent snapshot, plus any additional changes that happen to the active file system. If the volume / LUN / frac-res and snap reserve are full, then this space is not available and the filer needs to take action to prevent these writes from failing. The filer guarantees no data loss, but with no space free and nowhere to write the new data, it has to offline the LUN to prevent the writes from failing.

 

So fractional reservation is in constant use by the filer as an over-write area for the LUN. Without it, you need to make sure that sufficient space is free to allow for the maximum rate of change you would expect. The defaults are good, but if you trim them down you need to monitor the rate of change and make sure the worst case scenario fits within the buffer of free space that you allow. If you reduce the Fractional Reservation to 0, you need to make sure the rate of change is within the volume size, or you need to make sure the volume can auto-grow when required or even snap auto-delete to reduce the reserved blocks and free up space (although I am not a huge fan of snap auto-delete for various reasons).
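To put some rough numbers on this, here is a small Python sketch of the arithmetic. It is a deliberately simplified model of my own, not a NetApp formula: it assumes a single, fully written LUN in the volume, treats the overwrite reserve as a straight percentage of the LUN size, and takes the snapshot change as a figure you have measured yourself. The function and parameter names are hypothetical.

```python
def volume_space_needed(lun_size_gb, fractional_reserve_pct, snapshot_change_gb):
    """Estimate the volume space needed for one LUN with snapshots.

    Simplified model (an assumption, not a NetApp formula):
      LUN itself + overwrite reserve + blocks held by snapshots.
    """
    if lun_size_gb < 0 or snapshot_change_gb < 0:
        raise ValueError("sizes must not be negative")
    if not 0 <= fractional_reserve_pct <= 100:
        raise ValueError("fractional reserve is a percentage between 0 and 100")
    # Worst case: the LUN is full, so the reserve is a percentage of its size.
    overwrite_reserve_gb = lun_size_gb * fractional_reserve_pct / 100.0
    return lun_size_gb + overwrite_reserve_gb + snapshot_change_gb
```

Under this model a 100 GB LUN with 20 GB of snapshot change needs about 220 GB at 100% and about 120 GB at 0%. The 100 GB difference is exactly the guarantee you give up, which is why the rate of change has to be monitored once the reserve is reduced.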

 

And that is Fractional Reservation!

 

Quick last thoughts... A-SIS won’t make any difference to the Fractional Reservation area as such. It can help, as the data blocks within the LUN will get de-duped, but the Fractional Reservation area per se is still required, as you need this LUN over-write area for changing data. If you reduce the footprint of the non-changing data with A-SIS, you reduce the potential reservation area required. Space savings aren't apparent when you have things thick provisioned. Reducing Fractional Reservation and thin provisioning can be a dangerous game.

 

The most important rule is to monitor and understand your data. If you understand your rate of change, you can tweak a lot of areas of the storage system.

 

 

 

3,851 Views 32 Comments Permalink Tags: fractional_reservation, lun_overwrite

Storage vs. Rock'n'Roll

Posted by chriskranz Feb 18, 2009

http://www.wafl.co.uk/wp-content/uploads/2009/02/1485194801_l1-150x150.jpg I have had a bit of a rock’n’roll past, I spent several years driving round with a motley bunch of pirates/musicians (The Klopeks) in my much loved LDV minibus playing in bars and basements up and down the UK. One thing that I’ve noticed with music is that people are easily led, if there is one person in the crowd that openly cheers and gets involved, everyone wants a part of it. Everyone knows that music has a scene, but I think that storage does too.

I find storage vendors and storage technology are very much like live gigs and a music scene. If customers already see someone else using something, they’ll be more likely to jump on that bandwagon. If someone is already having a good time with it, surely I will too? Even if it doesn’t actually fit! There is definitely a ‘scene’ to the storage industry.

This can work absolutely fantastically, and this is one reason above most that I try to work above and beyond for my existing customers. If they have an issue, question or problem, it doesn’t matter if it’s not my problem, I need to help them out. All too often I hear from other companies “that support expired 2 days ago I’m afraid”, or “technically that would be vendor x and not us I’m afraid”, and so the customer goes unloved. When we built our support infrastructure I said to our call centre “One rule above all others, never decline to take a support call”. Take the bull by the horns, get to the bottom of the issue and resolve it for the customer, or they will not purchase again!

This is a very hard working way of keeping my customers happy though, and I find myself with less and less spare time, and more and more time spent answering emails and problem solving. To be honest that’s one of the reasons I’ve started blogging some of the issues and questions I encounter. If I can resolve this for one customer, why not share it with everyone so that everyone can benefit from my hard work?

What if you don’t or can’t manage to keep this customer cheering and singing? Conversely, you can end up with a front row heckler. They’ve had a few Stellas, they want to get lairy, and they think it’ll be funny to start shouting (naturally it couldn’t be because I’ve snapped my E-string again and my guitar is totally out of tune!). Well we can shut up the heckler by carefully timing a stage dive, but I like my guitar too much and I’m a bit squeamish of blood. Best face up to the heckler and get them involved. Even if my guitar is horribly out of tune and I can’t hear the drums because the levels are all wrong. A live show is about a stage performance, about getting the most out of it, not necessarily about it being technically brilliant. I’ve seen technically brilliant bands play live and it can be the most boring thing in the world.

And so with the customer. If I have a customer that is having a particularly bad experience and they want to take it out on me, all good, but I need to work with the customer, I need to get them involved. Forget the shouting, let’s cut this back to the basics. I’m not here to sell you as many products as possible, I’m here to make what you’ve got today work, and potentially offer options to improve this. I’m not a sales guy, I have no interest in you buying anything as I am not on commission!

This works pretty well most of the time, and people have to just recognise my T K Maxx suit and thrashed out Alfa to realise I really am not lying! I need to make this work for the customer, that is my job. Even if the solution is not a technically brilliant one, or maybe there are gaping holes in the entire solution, there are always ways to make it work and make it fit. I am often tasked with resolving customer issues when we weren’t involved with the original design, and this is often the case. A solution can always work, it just depends how you are willing to work with it.

Sure, this isn’t ideal or clandestine, but then if everything was, there would only be 1 storage vendor, and only 1 solution. There is choice because there are a million different vendors offering a million different architectures that can be built. Just because I have a personal preference does not make it always right or always fit.

And so I find myself relating back to my rock’n’roll past life with the storage industry. Customers want to have a good time, they want to enjoy it, but they might need someone to get them started, they almost always need a little direction to know what to do. Hecklers are often turned into the driving force of a wild stage show if they are handled correctly, you can’t just use a rough hand. My favourite live shows have always been the ones with a rowdy audience, you don’t get better without a little abuse, it just depends how you use that abuse!

This is why I enjoy working with NetApp technologies. The systems give the flexibility for the solution to change almost full circle and still provide flexibility and functionality. This is why I like working with complementary technologies like VMware, Riverbed and Acopia to name but a few. These give you the flexibility to do things any way you want. So long as it works, it is not an incorrect solution.

I have found that solutions are like opinions, none of them are technically incorrect or wrong, it just depends on your point of view and what you are trying to achieve. One way or another, everything can be done, and you can achieve anything, it just depends how flexible you can be.

http://www.wafl.co.uk/wp-content/uploads/2009/02/557505574_l-150x150.jpg

468 Views 0 Comments Permalink Tags: flexible_storage, rock'n'roll

OnTap Upgrades

Posted by chriskranz Feb 13, 2009

There are several different ways to upgrade OnTap, but to be honest I ended up discovering my own, and found it to be the most reliable! The key is to get the OnTap software onto the filer. Oddly, the filer doesn’t recognise the Unix format of the download, so you need to download and copy across the Windows version of OnTap. You need to copy the software across into /vol/vol0/etc/software.

To copy it onto the filer you have several options.

CIFS

Simply copy it over from your Windows desktop or server. If you don’t have CIFS licensed, just connect to c$ anyway as you are allowed admin access using CIFS. (\\filer_name\c$\etc\software). Double check the qtree permissions on vol0 and make sure it’s NTFS.

NFS

If you have NFS licensed, then you can use your *nix system to mount vol0. You might need to modify the exports (rw=workstation_ip,root=workstation_ip) so you have permission to copy the files across. Double check the qtree permissions on vol0 and make sure it’s UNIX.

Web Server

If you can put the OnTap executable on a webserver that the filer can access, then the filer can download the files itself. This is actually a really nice way of deploying an upgrade to multiple filers. I usually run a web server on my laptop, so this works quite well, I just keep my apache install with the latest OnTap versions in the root, and point a filer at my laptop.

ndmpcopy

If you have already placed the OnTap executable on another filer, then you can copy it across.

ndmpcopy -sa root:password -da root:password filer01:/vol/vol0/etc/software/ontap.exe filer02:/vol/vol0/etc/software/ontap.exe

ftp

You can enable ftp on the filer, copy the OnTap executable onto the filer, and then copy this into vol0. Obviously you shouldn’t allow ftp to access vol0 directly (for so many reasons), but you can copy it into another volume, then use mv (within priv set advanced), or ndmpcopy to copy it across.

Once the OnTap executable has been placed into /vol/vol0/etc/software you can get the filer to install it. This is a 3 stage process, and 2 of the stages can be done with relatively low disruption.

software install ontap_version_x.exe

This unpacks the Windows self extracting file and places it within vol0. This can be done with no service disruption.

download

This takes the newly extracted files and commits them to the filer, overwriting the boot kernel and other files. Although this can be done with no disruption to any services, you may find the Filer View could break afterwards if this has been upgraded in the new OnTap version.

reboot / cf takeover

Depending on what you are trying to achieve, if this is a stand-alone system, or you have scheduled downtime, you can do a reboot. If you are looking to do an NDU (Non-Disruptive Upgrade), then you would get the partner filer to issue a “cf takeover”. This causes the upgraded filer to reboot, but the services are failed over, so you shouldn’t lose any access. Once the filer has rebooted and is “waiting for giveback”, you can issue “cf giveback” to finish booting into the new OnTap version.

If you do have scheduled downtime, or this is a stand-alone system, then you can issue “software update ontap_version_x.exe” to perform all 3 above steps in one move. This can be really nice if you don’t mind the outage, and you can even point this command at a web-server (“software update http://server_name/ontap_version_x.exe”) and the whole thing becomes quite smooth!
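The two routes described above can be summarised in a small Python sketch that simply lists the commands in order. This is only a planning aid under my own assumptions: the function name is hypothetical, it covers one filer of a pair only, it does not run anything, and it deliberately leaves out the pre-requisite checks.

```python
def upgrade_steps(image, non_disruptive):
    """Return (where, command) pairs in the order described in the post.

    'filer' is the system being upgraded; 'partner' is its cluster partner.
    """
    if not image:
        raise ValueError("an OnTap image file name is required")
    if not non_disruptive:
        # Stand-alone system or scheduled downtime: one command does it all.
        return [("filer", "software update %s" % image)]
    return [
        ("filer", "software install %s" % image),  # unpack into vol0
        ("filer", "download"),                     # commit the new kernel files
        ("partner", "cf takeover"),                # upgraded filer reboots
        ("partner", "cf giveback"),                # once it is waiting for giveback
    ]
```

Printing the list for each filer before the change window is a cheap way to make sure nobody types “download” on the wrong head.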

Gotchas

There are, as always, a few things to look out for.

NDU is always a bit of a challenge and you need to check all the pre-reqs and documentation very carefully. If you want all your systems to stay up, you need to do some research. NDU can work great if planned, I’ve done it many times, and the only times I’ve seen it cause problems is when it hasn’t been planned, or there’s an extra system that wasn’t planned for.

Upgrade all dependent systems first. If you have SnapDrive, SMAI, etc, upgrade these first. These are almost always backward compatible, but often not forward compatible. This usually requires more planning than just the OnTap upgrade itself!

Upgrade firmware first! Very important if you want a smooth upgrade. When the filer boots it always checks the firmware folders (“/vol/vol0/etc/disk_fw” and “/vol/vol0/etc/shelf_fw”). If these contain newer versions than what it finds already applied to the system, it will stop the boot and apply these first. Shelf firmware can take 20-30 minutes, so your smooth upgrade has just been delayed by 20 minutes! In NDU, this is catastrophic!!!

If you upgrade disk and shelf firmware, it can generally be done totally NDU (with the exception of SATA, but that is a lot less disruptive than a boot upgrade). So it is well worth taking the time to upgrade this first.

1,174 Views 0 Comments Permalink Tags: upgrade, ndu, ontap

I like to know how things work, and I like to know how things break. I spend a lot of time with different customers who are trying to achieve different things in different ways. I'm going to try to commit myself to regularly updating this blog with my experiences and findings.

 

I have been working with NetApp storage for over 4 years now, but I come from a web-development background. This has given me a definite can-do attitude. I like to make things work, and anything can work, it just depends how much effort you are willing to put into the solution.

 

I am very proud to be the first non-NetApp employee to be given a blog, and hopefully I will do this honour justice. Hopefully this will give my input here an independent view on the solutions, and less of an evangelical view.

 

Hopefully this will be a useful resource for people and I always welcome honest criticism and comments.

360 Views 1 Comments Permalink