NetApp for Microsoft Environments in NetApp Blogs

Recent Blog Posts

Alex Jauch -- Architect, Microsoft Private Cloud

     

While we have had many discussions in this blog about Dynamic vs. Fixed VHD’s, we still get questions about this topic from time to time.  There is a misconception by some that there isn’t a significant performance difference between dynamic and fixed VHD file formats.  We’ve recently put together a demonstration that shows this difference more clearly. 

    

When talking about dynamic VHD’s, it is very important to note that Microsoft does not recommend them for production:

    

“We recommend that you use fixed size virtual hard disks for virtual machines that are running on a production environment.”

http://technet.microsoft.com/en-us/library/dd183729(WS.10).aspx

 

This is a very clear statement and there should be no further debate.  However, some folks would really like to see the difference for themselves.  In the demonstration below we show both the impact of on-demand file allocation and the impact of steady-state random I/O on the underlying SAN:

    

http://youtu.be/qfidu_1DYyQ

 

 

Reena Gupta - Reference Architect, Microsoft Business Unit

 

In many SharePoint 2010 deployments, there is an ongoing struggle between the infrastructure team and the end users. Users complain about errors, timeouts, and slow performance while reading or writing large files. The infrastructure team may not realize that a SharePoint deployment that stores all documents and files in SQL databases is itself the bottleneck. Guidance on using Remote BLOB Storage with SharePoint to improve performance has been available for some time, but infrastructure teams need proof points.

We recently ran performance tests of RBS (Remote BLOB Storage) on SharePoint 2010 SP1. The objective was to compare SharePoint throughput and file I/O performance with all documents in SQL content databases versus all BLOBs in Remote BLOB Storage and only metadata in the SQL content databases. The results clearly show the value of using RBS with SharePoint deployments and support Microsoft's recommendations on file size.

Microsoft recommends RBS when the documents stored in SharePoint are larger than 1MB, and recommends keeping documents smaller than 256KB in the SQL databases. To demonstrate the value of RBS, I created a SharePoint 2010 test environment with two web applications: ‘SQLNative’, which keeps all documents entirely in the content DBs, and ‘SQLRBS’, which keeps the BLOB for any document >1KB in Remote BLOB Storage on an SMB share, with the metadata stored in the content DBs. I built a SharePoint corpus of 1TB of data for each webapp, consisting of files of 100KB, 1MB, 10MB, and 100MB. Each webapp had 5 site collections with one content DB each.
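The size guidance quoted above can be written down as a small rule. This is only a sketch: the function name and the "sql" / "rbs" / "either" labels are made up for illustration, binary units (1KB = 1024 bytes) are an assumption, and the post states no preference for sizes between 256KB and 1MB.

```python
KB = 1024
MB = 1024 * KB


def storage_recommendation(size_bytes):
    """Map a document size to the storage location suggested by the guidance above."""
    if size_bytes < 256 * KB:
        return "sql"      # small documents: keep in the content database
    if size_bytes > 1 * MB:
        return "rbs"      # large documents: externalize the BLOB
    return "either"       # 256KB to 1MB: no preference stated in this post


# File sizes used in the test corpus described above
corpus = {"100KB": 100 * KB, "1MB": 1 * MB, "10MB": 10 * MB, "100MB": 100 * MB}
for label, size in corpus.items():
    print(f"{label:>6}: {storage_recommendation(size)}")
```

Note that the test environment deliberately used a much lower externalization threshold than this rule, so that files of every size in the corpus could be compared in both configurations.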

Test Environment

Servers:

1.       4-5 Web Servers (Web Front End servers) running SharePoint 2010 SP1,

2.       1 Visual Studio 2010 Load Test Controller,

3.       6 Visual Studio 2010 Load Agents,

4.       1 SQL server running SQL Server 2008 R2,

5.       2 Media Servers running Windows Server 2008 R2 SP1.

Software:

1.       NetApp SnapDrive 6.3R1 to manage the database and log LUNs,

2.       NetApp SnapManager for SQL 5.1 to manage the SQL databases,

3.       NetApp SnapManager for SharePoint 6.0 to manage SharePoint and provide the RBS provider.

Note: SnapManager 6.0 for SharePoint Architecture has been described in my previous blog.

All the servers were virtualized using Hyper-V and distributed evenly across 4 physical servers, with the SQL server hosted on a dedicated physical server. The backend storage was a NetApp FAS3170 controller, with 600GB SAS disk drives hosting the SQL content DBs and logs for both the ‘SQLNative’ and ‘SQLRBS’ webapps and 1TB SATA disk drives hosting the SMB shares for Remote BLOB Storage. The network between the servers and the storage controllers was 10GbE for all data communication, and the management network was 1GbE.

The baseline test cases included a mix of read-write operations with a ratio of 77% reads and 23% writes. Some percentage of the operations was metadata intensive.

Test Mix:

RBS_perf1.png

RBS Performance Results

 

  1. RBS Reduces SQL Server CPU Usage: As RBS removes some I/O operations from SQL Server, the overall CPU load that SQL Server generates with BLOBs going directly to RBS is significantly less than the same workload running against a native SQL Server BLOB store.

RBS_perf2.png

2. RBS Relieves SQL Lock Pressure: Since SharePoint writes all documents for a given content database to a single table, we see significant SQL Server Wait time due to locking of the tables in workloads that include concurrent document uploads. Our base case assumes a 77%/23% read/write ratio. Customers with lower write ratios will see less locking than customers with higher write profiles.

RBS_perf3.png

 

3. RBS Reduces the Read-Write Test Time for a Mixed Workload: Microsoft recommends RBS mainly for files larger than 1MB, where it delivers the most benefit. Our read and write test durations for specific file sizes were in line with that recommendation. The read test duration for 100KB files was very close in both cases, but starting at the 1MB file size, read tests on the RBS sites were much faster than on SQL Native. Similarly, write tests for files of 1MB or larger showed a significant performance improvement on the RBS-enabled site.  The following charts show the read/write results in the mixed workload use case for 500 and 400 user thread counts.  Please note that the charts show the total time to read or write all the files of a given size in the corpus, not the time for a single file.

 

RBS_perf4.png

 

RBS_perf5.png

 

4. RBS Improves the File Download and Upload Time: We also performed read-only and write-only tests as per the test mixes shown in the table earlier in this blog. With RBS, the average download or upload time was lower than or close to that of the SQLNative sites for the smaller file sizes, and significantly lower for 10MB and 100MB file operations. The following charts show the read-only test for a 500 user thread count and the write-only test for a 75 user thread count. Please note that the write-only workload is quite intensive across this range of file sizes.

 

RBS_perf6.png

RBS_perf7.png

 

5. RBS Increases SharePoint Scalability: In the baseline tests, the aggregate average throughput (requests per second) across all the SharePoint WFEs was measured through Perfmon counters. Throughput with RBS clearly exceeded throughput with SQL Native databases, which means SharePoint could handle more concurrent user requests with RBS than with SQL Native.  The following chart shows the throughput with 4 WFEs:

RBS_perf8.png 

 

Please note that adding more web servers to the configuration will improve throughput until the web servers saturate or the backend storage infrastructure cannot serve more requests.

In addition to the higher throughput, RBS sites were able to perform the baseline tests with lower latency (avg. test time reported from Visual Studio), see the chart below.

 

RBS_perf9.png

Overall RBS provided higher scalability for SharePoint 2010 by increasing the SharePoint throughput while taking less time for all the tests. There will be more tests performed for RBS in different scenarios and different kinds of use cases; we’ll keep you posted as we get new information.

Alex Jauch -- Architect, Microsoft Private Cloud

     

As we have previously discussed on this blog, the PowerShell Toolkit provides the ability to create “Thin-Fixed” VHDs.  To refresh your memory, these are fixed-size VHDs whose zeros have been unmapped on the storage controller.  This gives us the flexible provisioning of a dynamic VHD with the performance of a fixed VHD.  There are a couple of interesting benefits to this technique.  First, these VHDs are inherently space efficient.  No more “captive” white space in your .VHD files.  The storage controller only allocates storage for them as they are written to.  This is exactly what we do for a thin LUN, just at file-level granularity.  Second, these files are much faster to create than traditional fixed VHDs, because the zero unmap operation is much more efficient than the zero-write operation traditionally used to create them.

    

In the table below, we see a comparison of file creation times between Fixed VHD’s and Thin-Fixed (in seconds):

   

 

Size        Fixed (s)             Thin-Fixed (s)
-------     ---------             --------------
1 GB        3.4                   0.05
10 GB       33.6                  0.7
100 GB      323                   2.4
1000 GB     3354 (about 1 hr)     16

 

 

 

 

As you can see, Thin-Fixed VHD creation can be more than 100x faster, and over 200x faster in the 1000 GB case.  What is probably more important is that it brings the creation of fixed VHDs down to a time interval short enough that there is really no reason to use Dynamic VHDs any longer.  This will improve the overall performance of your storage system and the Hyper-V systems it supports.
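For readers who want the ratios behind that statement, here is a minimal sketch that divides the Fixed time by the Thin-Fixed time for each row of the table above. The variable and function names are made up for illustration; the numbers are the ones in the table.

```python
# Creation times in seconds, copied from the table above: (fixed, thin_fixed)
creation_times = {
    1: (3.4, 0.05),
    10: (33.6, 0.7),
    100: (323, 2.4),
    1000: (3354, 16),
}


def speedup(fixed_s, thin_fixed_s):
    """How many times faster the Thin-Fixed creation was."""
    return fixed_s / thin_fixed_s


speedups = {size_gb: speedup(*times) for size_gb, times in creation_times.items()}

for size_gb, ratio in speedups.items():
    print(f"{size_gb:>5} GB: {ratio:6.1f}x faster")
```

The ratio grows with file size because the fixed VHD creation time scales roughly linearly with size, while the Thin-Fixed time grows much more slowly.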

 

To highlight this tool and the performance advantage we offer, we created a demo on YouTube.  Please let us know what you think!

  

http://www.youtube.com/watch?v=wOz3yV97dCo

 

We had previously talked about the next release being 2.0.  Well, we’re not quite done with our Cluster-Mode cmdlets.   The 2.0 release of the Toolkit will be a complete Cluster-Mode implementation.  That’s not to say we didn’t make any progress.  1.7 brings our Cluster-Mode content to a very respectable place with 375 cmdlets, but we’re still only 80% done with the Cluster-Mode ZAPI coverage.  So the question remains: why the interim release?  Well, we’ve been busy bees in the bowels of our RTP facility, and have developed a couple of cmdlets that we just couldn’t rightfully keep to ourselves any longer.  We’ve quietly been showing off some of these capabilities to select customers for a couple of months now, and if their excitement is any indication I think you’ll be pleased we decided to ship!

 

First and foremost, we cracked a particularly un-crackable nut. 1.7 introduces a new and improved Invoke-NaHostVolumeSpaceReclaim cmdlet that supports CSVs!  We had to take a totally unique approach with CSVs.  The NTFS APIs that we use with traditional LUNs aren’t supported on CSVs, so our engineering team went back to the drawing board.  The result is the ability to perform ONLINE space reclamation of any CSV in direct I/O mode!  No need to take the CSV offline or even enter redirected mode.

 

Honestly that ability alone would have been sufficient for us to ship, but it wasn’t the only one.  As the industry gets its collective head around Windows 8, we’ve noticed an increased interest in Hyper-V.  This got us thinking, and with a little experimentation we developed a pair of V2V cmdlets. There are already various tools available to convert disk images between the VHD and VMDK formats.  But performance can be vastly improved if the data blocks are cloned in-place on the storage controller rather than copied via a Windows host.  Toolkit 1.7 adds two cmdlets, ConvertTo-NaVhd and ConvertTo-NaVmdk, that can convert between these formats in a matter of seconds.  These measurements were taken on the same Windows 2008 R2 server using the same source VMDK file under identical conditions:

 

Convert a 32 GB flat VMDK file to fixed VHD using a popular commercial tool     46 minutes
Convert a 32 GB flat VMDK file to fixed VHD using ConvertTo-NaVhd               63 seconds
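To put those two measurements on the same scale, the sketch below converts them to seconds and to an effective rate in MB/s. It assumes binary units for the 32 GB file, and the names are made up for illustration. The cloned figure is an effective rate only: the blocks are cloned in place on the controller, so that data is not actually moving through the host.

```python
SIZE_MB = 32 * 1024        # 32 GB source VMDK, assuming 1 GB = 1024 MB
commercial_s = 46 * 60     # 46 minutes with the commercial tool
toolkit_s = 63             # 63 seconds with ConvertTo-NaVhd


def effective_rate(size_mb, seconds):
    """Source file size divided by elapsed time, in MB/s."""
    return size_mb / seconds


conversion_speedup = commercial_s / toolkit_s

print(f"host-based copy : {effective_rate(SIZE_MB, commercial_s):7.1f} MB/s effective")
print(f"in-place cloning: {effective_rate(SIZE_MB, toolkit_s):7.1f} MB/s effective")
print(f"speedup         : {conversion_speedup:7.1f}x")
```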

 

Now these two cmdlets in particular have a couple of caveats we’ll handle in a follow-on post.  While we were cracking open VHDs we took a stab at solving alignment. Virtual disk alignment has been the bane of all storage since the inception of virtualization.  In fact it had gotten so bad that NetApp developed MBRScan and MBRAlign to address the problem.  We’ve received countless requests for a Hyper-V/PowerShell version of those popular tools.  In true NetApp fashion we didn’t just convert the existing tools, we improved them!

 

Get-NaVirtualDiskAlignment reads the first sector of a fixed VHD file to determine if any of the partitions on the VHD are misaligned.

 

PS C:\Toolkit\1.7.0> Get-NaVirtualDiskAlignment M:\linux.vhd


   VirtualDisk: M:\linux.vhd

IsBootable    AbsoluteStartingLba             Size       IsExtendedBootRecord  IsAligned
----------    -------------------             ----       --------------------  ---------
True                           64           102 MB                      False       True
False                      208848            15 GB                      False       True
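The idea behind this check can be sketched in a few lines of Python. This is not the cmdlet's implementation, just an illustration of the standard MBR layout: four 16-byte partition entries starting at byte 446, with the starting LBA stored as a little-endian 32-bit value at offset 8 of each entry. The 512-byte sector size and the 4 KB alignment boundary (the WAFL block size) are assumptions, and the sector counts in the synthetic sample are illustrative values rather than figures from the output above.

```python
import struct

SECTOR_BYTES = 512   # assumed logical sector size
BLOCK_BYTES = 4096   # assumed alignment boundary (WAFL uses 4 KB blocks)


def parse_mbr(sector):
    """Return (is_bootable, start_lba, sector_count) for each used MBR slot."""
    if len(sector) < 512 or bytes(sector[510:512]) != b"\x55\xaa":
        raise ValueError("missing MBR boot signature")
    partitions = []
    for slot in range(4):
        offset = 446 + 16 * slot          # partition table starts at byte 446
        status = sector[offset]           # 0x80 marks the bootable partition
        part_type = sector[offset + 4]    # 0x00 marks an unused slot
        start_lba, count = struct.unpack_from("<II", sector, offset + 8)
        if part_type != 0:
            partitions.append((status == 0x80, start_lba, count))
    return partitions


def is_aligned(start_lba):
    """True when the partition's first byte falls on a block boundary."""
    return (start_lba * SECTOR_BYTES) % BLOCK_BYTES == 0


def build_sample_mbr():
    """Build a synthetic first sector using the starting LBAs shown above."""
    sector = bytearray(512)
    entry = "<B3xB3xII"   # status, CHS (skipped), type, CHS (skipped), start LBA, sectors
    struct.pack_into(entry, sector, 446, 0x80, 0x83, 64, 208784)
    struct.pack_into(entry, sector, 462, 0x00, 0x83, 208848, 31457280)
    sector[510:512] = b"\x55\xaa"
    return bytes(sector)


for bootable, lba, count in parse_mbr(build_sample_mbr()):
    print(f"bootable={bootable} start_lba={lba} aligned={is_aligned(lba)}")

# A partition starting at LBA 63, the classic legacy default, is not 4 KB aligned.
print("LBA 63 aligned:", is_aligned(63))
```

Both starting LBAs from the output above (64 and 208848) are multiples of 8 sectors, which is why they land on 4 KB boundaries under these assumptions.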

 

Repair-NaVirtualDiskAlignment will correct misalignments in fixed VHD files that are formatted as MBR-style disks.  Unlike some other Toolkit cmdlets such as Copy-NaHostFile or ConvertTo-NaVhd, correcting alignment issues is not possible using WAFL block cloning (because the cloned blocks would also be misaligned), so a data copy is required.  For data in LUNs, Data ONTAP 7.3.5+ performs this operation rapidly by offloading the copy operation to the storage controller.  Data ONTAP 8 does not yet provide this capability, so this cmdlet falls back to a slower host-based copy as needed.  For CIFS shares, copy offload is used by this cmdlet for all versions of Data ONTAP.

 

In our judgment these new enhancements more than warranted a release, as did the other 232 cmdlets!

 

New cmdlets

Data ONTAP PowerShell Toolkit 1.7, not including the Cluster-Mode set:

  • ConvertTo-NaVhd
  • ConvertTo-NaVmdk
  • Get-NaVirtualDiskAlignment
  • Repair-NaVirtualDiskAlignment
  • Enable-NaStorageAdapter
  • Get-NaStorageAdapter
  • Get-NaStorageAdapterInfo
  • Get-NaControllerError
  • ConvertTo-SerializedString
  • ConvertFrom-SerializedString

Categories of new cmdlets in the Cluster-Mode set:

  • Cifs (32 cmdlets)
  • Clone (1 cmdlet)
  • Cluster peer (6 cmdlets)
  • Disk (10 cmdlets)
  • Exports (9 cmdlets)
  • Fc (4 cmdlets)
  • Fcp (20 cmdlets)
  • File (9 cmdlets)
  • Igroup (10 cmdlets)
  • Iscsi (30 cmdlets)
  • Net (27 cmdlets)
  • Nfs (13 cmdlets)
  • Portset (5 cmdlets)
  • Quota (9 cmdlets)
  • Security (13 cmdlets)
  • Sis (10 cmdlets)
  • Snapmirror (16 cmdlets)
  • Storage adapter (3 cmdlets)

Issues fixed in Toolkit 1.7

  • Fixed null handling in 7-Mode cmdlet wildcard patterns.
  • Fixed error handling in Get-NaVol.
  • Fixed Invoke-NaHostVolumeSpaceReclaim to support LUNs larger than 4 TB.

 

If you haven’t already, jump on over and check out the new release.  Over the coming weeks we’ll do our best to familiarize you with some of the exciting new capabilities.  As always, if you have any feedback on the toolkit, head on over to the communities!

 

Alex Jauch -- Architect, Microsoft Private Cloud

 

As an architect of our private cloud solution, I spend quite a bit of time working on making the components of that solution “Cloud Ready.”  What’s funny about this work is that there really isn’t a definition of that term.  What makes infrastructure “Cloud Ready” anyway?  What types of infrastructure are better or worse for private cloud?

    

One way to think about this is to go back to core principles.  What is our definition of cloud?  As we have discussed before in this blog, we use the NIST model of cloud computing for our cloud definition framework.  Those characteristics are on-demand self-service, broad network access, resource pooling, rapid elasticity and measured service.  This means that our infrastructure must have features that support those five characteristics.  Taking a look one layer deeper into actual implementations, the infrastructure that supports these characteristics has to have all the normal infrastructure characteristics.  At Microsoft we used to call these the “abilities”: Supportability, Scalability, Stability, etc.

 

However, there are implications to aspects like self-service, rapid elasticity and measured service that require additional feature sets that not all infrastructure has.  For example, you cannot perform self-service without some degree of automation.  There is an assumption in the architectural model that the infrastructure provisioning process can to some extent be automated.   In addition, the rapid elasticity and resource pooling characteristics assume an ability to operate at scale.  That is to say, you are not going to run a cloud on one server.  You’re going to have dozens or hundreds.  This also implies remote operations.  Anything that requires you to login to a local console is going to be just too cumbersome.

  

So, this implies that there are some critical features that you need to build a cloud.  Here are five tests to see how well your infrastructure supports these key characteristics:

 

  1. Automation.  One of the foundations of cloud is self-service.  This implies automation.  Any infrastructure you deploy in support of your cloud efforts will benefit from a high degree of automation.
  2. Measured Service.  Because cloud solutions must be carefully managed to ensure you don’t over-subscribe, you need to be sure that your infrastructure components support a robust performance management interface.  Ideally, all the components of your solution use the same management infrastructure so you can establish a single view of your cloud.
  3. Rapid Provisioning.  Because clouds need to appear infinite, you really need to be able to perform provisioning tasks quickly.  Some of this is achieved via the automation bullet above, but in some cases expensive operations like copying a large file can slow things down enough to make the solution a poor candidate for cloud.
  4. Scale.  Because of the complexity of cloud style architectures, there is a lower sizing boundary below which you really don’t want to go. Below that boundary, the cost per VM just gets too high and you can’t really justify all the complex systems needed to support things like self-service provisioning.  That threshold varies depending on your business requirements, but it’s safe to say that a solution that supports fewer than 100 VMs is going to have a tough time producing a positive ROI.
  5. Availability.  I only put this one last because it’s really nothing new.  Yes, this is hugely important but in most cases it’s already a consideration for your infrastructure.  Most DC’s already operate on a “no single point of failure” rule but I will repeat it here because it is so vital.

 

As an example, let’s take this set of rules and apply them to our own Private Cloud offering.  How well do we eat our own dog food?

 

 

  1. Automation.  Our solution has two automation options.  First, 100% of core server and storage operations can be performed via PowerShell.  This is the automation tool most commonly used by Windows shops, so we focus a great deal of our solution development on PowerShell.  Second, many of our customers are also moving to System Center Orchestrator, which is Microsoft’s orchestration platform.  For this reason, we also support a native OIP for System Center Orchestrator, which allows common operations like mounting a LUN or sub-LUN clone to be done without writing code.
  2. Measured Service.  As part of our solution, we include OnCommand Plug-In for Microsoft (OCPM) 3.0.  One of the core components we get from that product is tight integration with Microsoft’s System Center Operations Manager (SCOM).  This tight SCOM integration allows us to present a “single pane of glass” view to administrators that includes information about Windows Servers, NetApp storage controllers and our Cisco UCS blade centers.  This complete view gives admins a much richer picture of their total service offering than separate management tools.
  3. Rapid Provisioning.  Our solution includes several examples of this.  For VM provisioning, we use Sub-LUN cloning to allow administrators to create a new VM clone very quickly in support of a user request.  We also support FlexClone based boot LUNs for the UCS chassis, which means we can provision a new blade into the environment very quickly and via a fully scripted process.  These features, along with the rest of the solution offering, allow us to support very fast SLAs for provisioning tasks that aren’t possible with traditional methods such as WDS or other PXE-based on-demand deployment solutions.
  4. Scale. Our default configuration targets about 1000 VMs.  We have a minimum size of four blades and a maximum size of sixteen blades in our core infrastructure.  This, along with the storage, gives us enough capacity to support the requirements of most mid-sized companies and is of sufficient scale to provide a full cloud experience.
  5. Availability.  As per NetApp best practice, our solution has no single point of failure.  The storage has fully redundant controllers, redundant storage network paths, redundant network switches, redundant blade chassis, etc.  For NetApp, this is a normal requirement for all solutions, but it applies to private cloud very nicely as well.

 

As you can see, we are very focused on ensuring that our solutions are a full, complete platform for private cloud.  We will continue to improve and there is some great stuff in store for us this year in this space, but we feel that we’re proceeding from a very strong technical foundation that is enabling our future offerings and technologies.  I would encourage you to examine your private cloud plans and measure your planned or existing infrastructure against this set of requirements.
