We are still rolling out our NetApp SAN. My question relates to best practices when over-committing in a virtual environment.
We currently have 2 aggregates with matching volumes and LUNs that match the aggregate GB for GB (1TB aggregate w/ 1TB volume w/ 1TB LUN). We are seeing only about 25% utilization of this space despite the fact that the virtual host servers using the LUNs think they have nearly fully used them. We have a stable environment (no expected storage growth) and I would like to overcommit.
Question: Is it better to add LUNs and overcommit the volume, or to add volumes and overcommit the aggregate?
Thanks for your feedback and any thoughts you may have.
NetApp Storage Best Practices for VMware vSphere says the following:
When you enable NetApp thin-provisioned LUNs, NetApp recommends deploying these LUNs in FlexVol® volumes that are also thin provisioned with a capacity that is 2x the size of the LUN. By deploying a LUN in this manner, the FlexVol volume acts merely as a quota. The storage consumed by the LUN is reported in FlexVol and its containing aggregate.
Information is on Page 16. TR-3749 found here...
Peter, I'm trying to do the same thing you are, and I've read TR-3749 too, but I'm confused as to why this is the recommendation. When my engineer was here installing my FAS, he told me that I should put thin LUNs into thin volumes to get max savings, and make the LUNs 2x the size of current data. So, if I have 500GB worth of VMs, I would make a thin volume with a thin LUN @ 1TB. What this TR article is saying is that the thin volume should then be 2TB for 500GB worth of data. Why?
I should mention that I'm going to be really tight on space in the short term until I can get more disks, so the emphasis on everything that I've planned has been to save space w/o shooting myself in the foot.
The formula I'm thinking of employing is:
thin LUN = current data x2 rounded to nearest 100GB.
thin volume w/snapshots = thin LUN x 120%
thin volume w/o snapshots = thin LUN
Doesn't that make sense?
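To make the numbers concrete, here is a minimal Python sketch of the formula above next to the TR-3749 recommendation quoted earlier (volume = 2x the LUN). The function names are made up for illustration, and "rounded to nearest 100GB" is assumed to mean round-half-up; none of this is NetApp tooling.

```python
def thin_lun_gb(current_data_gb):
    """Thin LUN size: current data x2, rounded to the nearest 100 GB (half rounds up)."""
    doubled = current_data_gb * 2
    return int((doubled + 50) // 100 * 100)


def thin_volume_gb(lun_gb, snapshots):
    """Thin volume size per the formula in this post: LUN x 120% with snapshots, else LUN."""
    return lun_gb * 120 // 100 if snapshots else lun_gb


def tr3749_volume_gb(lun_gb):
    """Thin volume size per the TR-3749 quote: 2x the size of the LUN."""
    return lun_gb * 2


lun = thin_lun_gb(500)
print("thin LUN:", lun, "GB")
print("volume with snapshots:", thin_volume_gb(lun, snapshots=True), "GB")
print("volume without snapshots:", thin_volume_gb(lun, snapshots=False), "GB")
print("volume per TR-3749:", tr3749_volume_gb(lun), "GB")
```

Because everything here is thin provisioned, these sizes are ceilings rather than space actually consumed, so the difference between the two approaches is how much headroom the volume allows before the LUN runs out of room to write.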
Using http://media.netapp.com/documents/tr-3483.pdf as a reference (see page 21), I was thinking of having it look like this:
Guarantee = none
LUN reservation = disabled (thin provisioned)
fractional reserve = 0%
Snap reserve = 0%
autodelete = volume/oldest first
autosize = off
try_first (if applicable) = delete snap
I'd appreciate it if one of the pros could weigh in on this. I'm not a storage admin, so my thinking could be way off, but if so I have no way of knowing. Please help!
"What this TR article is saying is that the thin volume should then be 2TB for 500GB worth of data. Why?"
The TR isn't actually saying anything about the "data", just the size of the LUN.
TR 3348 - Page 6, in the section "Volume Sizing and LUN Creation", walks through each of the values used to calculate the total size of a volume. You can plug in your entries above and see what the results are. It takes the guesswork out of it.
Although this subject is outside the scope of technical support, I did manage to find someone that was willing to chat with me about this.
This is what I learned. I would welcome thoughts and contradictions on this.
I was told that it is best to have 1 LUN per Volume (1-to-1 match).
We also discussed the behavior of the LUN in an overcommitted environment
The NetApp Filer sets a "high water mark" in an overcommitted environment.
That is to say, it identifies how much space must always be available in the LUN.
Once a high water mark is hit, I was told that the actual usage from the LUN will not drop below that despite what happens in the LUN.
A 1 TB Aggregate is created and a 1TB Volume is created with a 1TB LUN.
After the Server using the LUN thinks it has used nearly all of the 1TB LUN, actual usage may be far less. Let’s say 25% of that (250GB). Even if storage is freed up in the Server environment and the Server using the LUN thinks it is only using 150GB of the LUN, the “high water mark” in the SAN has already been set and will not change. It will remain at 25% (250GB)
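The "high water mark" behaviour as described here can be sketched as a running maximum. This is only a toy model of the description in this post, not of how Data ONTAP tracks blocks internally: it assumes the array-side usage simply equals the largest amount the host has ever had written, and the function name is hypothetical.

```python
from itertools import accumulate


def array_side_usage(host_written_gb):
    """Return array-side usage over time, given host-side usage over time.

    Toy assumption: the array never learns about deletes inside the LUN,
    so its reported usage is the running maximum of what the host has written.
    """
    return list(accumulate(host_written_gb, max))


# Host usage climbs to 250 GB, then files are deleted down to 150 GB.
host_view = [100, 250, 150]
print(array_side_usage(host_view))
```

In this model the last value stays at 250 even though the host reports 150, which is the situation described above.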
Could someone pipe in on this? This true?
When overcommitting, care must be taken to anticipate growth.
If an additional Volume and LUN are created on the aggregate to take advantage of thin-provisioning, and for whatever reason space is quickly used up, there is really no way of decreasing the amount of space used in the LUN (remember the "high water mark" will not go down) unless a LUN is deleted/recreated.
Question I still have:
It’s my understanding that de-duplication WOULD actually reduce the amount of space used in the volume. This true?
"Could someone pipe in on this? This true?"
Yes and no. You are on the right track, but space can be reclaimed: you can decrease the amount of space used in the LUN by using the space reclamation built into SnapDrive.
Technical report 3483 details step by step the items mentioned in your example and implications. It starts on page 4 and is under "LUN Space Usage". It goes on to talk about space reclamation on page 7.
"Question I still have: It's my understanding that de-duplication WOULD actually reduce the amount of space used in the volume. This true?"
No, that statement is not true.
Props to Bob Charlier for the explanation.
The short answer is that deduplication will still think that data is written to those blocks, just as if the data has never been deleted. Because of that, deduplication won't give the best results, and won't free up any additional space.
The reason is that NTFS controls the list of files that are on the system and the list of where the free data blocks are. The NetApp doesn't have any insight into this. Space Reclaimer syncs the Windows/NTFS "free block list" with the NetApp free block list, and that is where you end up with unused space on the storage controller.
This is an example:
A file is written and takes x data blocks, and gets one entry in the NTFS "master file table". === The NetApp gets a request to write x blocks of data anywhere it can, plus a few extra blocks to update the NTFS file system.
The file is deleted from Windows.
For speed, the only thing that is changed from the NTFS/Windows side is the master file table entry. The entry in the MFT gets marked as deleted, and the list of blocks that it was using gets added to a list that NTFS keeps about free space on the file system. === On the NetApp, a couple of blocks are updated to note the master file table changes.
The only thing that knows these data blocks are free is NTFS. The old data is still present in the same block locations; it hasn't been zeroed out or released, and it stays there until it is overwritten by other data.
If you run deduplication at this point, it is going to treat those blocks as used.
If you run Space Reclaimer, the list of blocks that NTFS thinks are free is sent to the NetApp, and those corresponding blocks are released into the volume's free space pool.
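The walkthrough above can be sketched as a small Python simulation. It is a toy model only: the class and method names are hypothetical, a "block" is just an integer, and the real Space Reclaimer and NTFS are far more involved. The point it illustrates is that a delete only changes the NTFS view, and the array's view only shrinks when the free list is handed over.

```python
class ToyLun:
    """Toy model of one NTFS file system sitting on a thin LUN."""

    def __init__(self):
        self.ntfs_files = {}          # file name -> set of block numbers (NTFS view)
        self.ntfs_free = set()        # blocks NTFS knows are free again
        self.array_allocated = set()  # blocks the storage controller counts as used
        self._next_block = 0

    def write_file(self, name, n_blocks):
        blocks = set()
        # NTFS reuses blocks from its own free list first, then takes new ones.
        while self.ntfs_free and len(blocks) < n_blocks:
            blocks.add(self.ntfs_free.pop())
        while len(blocks) < n_blocks:
            blocks.add(self._next_block)
            self._next_block += 1
        self.ntfs_files[name] = blocks
        self.array_allocated |= blocks  # the array sees every block that is written

    def delete_file(self, name):
        # Only the NTFS bookkeeping changes; the array is not told anything.
        self.ntfs_free |= self.ntfs_files.pop(name)

    def reclaim_space(self):
        # Hand the NTFS free list to the array so it can release those blocks.
        released = self.ntfs_free & self.array_allocated
        self.array_allocated -= released
        return len(released)


lun = ToyLun()
lun.write_file("a.vhd", 10)
lun.write_file("b.vhd", 5)
lun.delete_file("a.vhd")
print("array blocks used after delete:", len(lun.array_allocated))
print("blocks released by reclaim:", lun.reclaim_space())
print("array blocks used after reclaim:", len(lun.array_allocated))
```

Between the delete and the reclaim, the array still counts all 15 blocks as used, which is why deduplication run at that point treats the deleted file's blocks as live data.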
I thought Deduplication was on the Volume, not on the LUN... Am I reading something wrong?
I had professional services out and I was told:
Aggregate = all disks of one type and size
Volume = The largest amount you can present, thin provisioned, dedup on, auto grow on
LUN = same size as the volume
When a past co-worker set this up originally, he said the best way to take advantage of the dedupe was to create a large volume and over-provision the LUNs.
Since VMware will not know the volume is deduped, you need to over-provision it for maximum utilization. The theory is that you can fit more system drives on the same volume and get a larger dedup saving.
Volume = 5 TB, auto grow on, dedup on
LUNs = 4 at 2 TB (rough estimate)
This way, the data on all the LUNs is deduped.
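As a rough check on that layout, here is a short Python sketch of the overcommit arithmetic. The function names are hypothetical, and it ignores auto grow, snapshots, and metadata overhead; it only shows how much dedup saving the volume would need if every LUN were written full.

```python
def overcommit_ratio(volume_tb, lun_sizes_tb):
    """Provisioned LUN capacity divided by the volume size."""
    return sum(lun_sizes_tb) / volume_tb


def required_savings_if_full(volume_tb, lun_sizes_tb):
    """Fraction of data dedup must remove for fully written LUNs to fit in the volume."""
    provisioned = sum(lun_sizes_tb)
    return max(0.0, 1 - volume_tb / provisioned)


luns = [2, 2, 2, 2]  # 4 LUNs at 2 TB each
print("overcommit ratio:", overcommit_ratio(5, luns))
print("savings needed if all LUNs fill:", required_savings_if_full(5, luns))
```

With 8 TB of LUNs on a 5 TB volume, the volume is overcommitted 1.6x, so the scheme depends on the LUNs not filling up or on dedup and auto grow covering the gap.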