Now that version 1.2 is out with support for vSphere 5.1 and VMotion, I'm preparing to deploy it, but before I do, there's one thing that I haven't been able to find an answer to in the documentation - how does it handle cache device failures? That is, if I give it just one SSD (rather than RAID1 or RAID10), and that SSD fails, will I simply get performance degradation, or will my cached VMs (or worse, the entire host, including non-cached VMs) crash? The same goes for PCIe devices, which can't be RAIDed in the first place.
I know the reads will come from spindles; my question is, how gracefully is the failure itself handled when the entire cache device disappears from the host, or starts throwing weird errors? Will the VMs bluescreen? Will the host crash? Will either the VMs or the host require a reboot? Can Flash Accel trigger an automatic VMotion to hosts where the cache is still alive? When I replace the faulty device, will I need a reboot to get the cache back online?
That's a good question...
I just kicked off an SQLIO run on a test VM with the Flash Accel cache enabled, and after a few minutes I removed the datastore that holds our pRDMs for Flash Accel on the host where this VM lives, to simulate the loss of a cache device...
The VM stayed alive and kept running the SQLIO test after the Flash Accel cache had started to kick in... the VM was fine.
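For reference, an SQLIO run of the sort described above would look something like this. This is a sketch only - the test file path, duration, thread count, and queue depth are my assumptions, not the poster's exact parameters:

```shell
# Hypothetical SQLIO load generation inside the Windows test VM.
# -kR/-kW = read/write workload, -s = seconds, -frandom = random access,
# -o = outstanding I/Os per thread, -b = block size in KB, -t = threads,
# -LS = capture system latency stats. File path is a placeholder.
sqlio -kR -s120 -frandom -o8 -b8 -t4 -LS E:\sqlio_testfile.dat
sqlio -kW -s120 -frandom -o8 -b8 -t4 -LS E:\sqlio_testfile.dat
```

Running the random-read pass a second time is what lets you see the cache warming up in the hit-ratio graph.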
[Screenshot: Flash Accel hit ratio %]
When I look at the Flash Accel home page, it now also says: [screenshot]
To repair this, I migrated the VM to another host that had access to the existing pRDM datastore, disabled the cache for the VM, re-enabled the cache, and now it's working again... no restart required!
If you ask me... that's pretty cool!
Hope that helps you.
I presented the datastore to each ESXi host via iSCSI, and I ran my test against an iSCSI LUN presented inside a Windows host configured with clustered file services (this was a test HA SQL environment).
I'm not in a position to do any re-testing against the operating system drive hosted on a VMDK - which may be the difference here?
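For anyone wanting to reproduce the host-side part of this setup, presenting an iSCSI datastore to an ESXi host can be done from the CLI roughly like this. Adapter name and target address below are placeholders, not values from my environment:

```shell
# Enable the software iSCSI initiator on the ESXi host
esxcli iscsi software set --enabled=true

# Point the initiator at the filer's iSCSI target (placeholder adapter/address)
esxcli iscsi adapter discovery sendtarget add \
    --adapter=vmhba33 --address=192.168.0.10:3260

# Rescan so the new LUN shows up, then verify the device is visible
esxcli storage core adapter rescan --adapter=vmhba33
esxcli storage core device list
```

The in-guest LUN for the MSCS disks is a separate path entirely - that one goes through the Windows iSCSI software initiator inside the VM, not through the host.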
Message was edited by Chris Anders: Added LSI card to B200M3 spec.
Wait one - I was under the impression that the current version of Flash Accel doesn't support MSCS. I have a few environments similar to what you tested (Windows Server 2008 R2 on top of vSphere with in-guest iSCSI LUNs used for SQL Server 2008 R2 on MSCS) which could benefit from Flash Accel (the filers are FAS2040/2220/2240, so no option of Flash Cache), but when I asked whether or not MSCS is supported in a recent NetApp/LSI webcast about Flash Accel, I was told that it's not supported in 1.2 and may be added in 1.3. Was that incorrect?
So from the Flash Accel GUI I was able to see the mapped LUNs on both hosts; however, only one of the hosts had the LUNs mounted and was writing to them.
10 GB of cache was given to both hosts and migration was enabled, which meant I burnt 20 GB of cache on both blades - I had each SQL host on a separate blade.
I did some simple testing whereby I ran some IO and watched the cache do its job; I then failed over to the other node, re-ran some tests, and watched the second cache do its job.
(The screenshots above don't represent that test - I just pulled them now, and the server has since been restarted.)
The cache was cold as I migrated between SQL hosts, but that was to be expected. To be honest, I didn't even check whether this configuration was supported; I just tested it since 1.2 supports iSCSI within the host, and to my surprise it did the job!
I'm not saying it's "supported", but it certainly passed the "wow, this is cool... let's try this in UAT!" test.
My environment looks like this:
Only one ESXi host is involved in the test, but it is part of a cluster configured with HA and DRS.
To make Flash Accel work, I presented an iSCSI LUN to the ESXi host to store the pRDM file. All other VM disks are on an NFS datastore.
Everything worked OK until I took the iSCSI LUN offline. At that moment ESXi throws an error that it cannot find the raw disk and shuts down the VM to restart it on another host. No other host has the iSCSI datastore (for the moment), so the VM remains powered off.
I think it's expected behavior for ESX HA to try to restart the VM on another host in the cluster when it loses its connection to the LUN, but that is not the way I wish it would react.
Anyway, losing the iSCSI datastore is not a realistic scenario, as the NetApp is active-active, so there's no problem making iSCSI redundant here. I will do some more tests, but this time I will actually fail the Fusion-io card to see the result.
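For what it's worth, the LUN-offline test above can be driven from the filer side rather than by unpresenting the datastore. This is a sketch assuming a 7-Mode filer; the volume and LUN names are made up:

```shell
# On the NetApp (7-Mode): take the pRDM LUN offline (placeholder path)
lun offline /vol/flashaccel_vol/prdm_lun0

# On the ESXi host: check how the device is now reported -
# it should show as dead/inaccessible in the device list
esxcli storage core device list

# After bringing the LUN back online on the filer, rescan all adapters
esxcli storage core adapter rescan --all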
So the main difference I see, apart from the ESXi version, is that you're using the Fusion-io card and I'm using the LSI card, which I believe is presented to the ESXi host differently.
I'm not sure how else you can simulate a card failure without physically pulling it out, but I'm interested to see how you go.
In response to your comment:
"I think it's expected behavior for ESX HA to try to restart the VM on another host in the cluster when it loses its connection to the LUN, but that is not the way I wish it would react."
That seems a tad odd to me, considering I have seen datastores go missing numerous times, especially when demonstrating NFS failover on 7-Mode installs to customers; instead of the VM dying, it will pause while trying to resolve the missing datastore.
The VM will pause while ESX tries to restore the NFS datastore for a given amount of time - if you use NetApp VSC, those settings (the NFS timeouts) are applied by VSC - but eventually it will HA to a different host.
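The NFS heartbeat tunables that VSC manages can also be inspected or set by hand with esxcli. The values below are the commonly cited NetApp recommendations - verify them against your own VSC version before changing anything:

```shell
# Show the current value of one of the NFS heartbeat tunables
esxcli system settings advanced list -o /NFS/HeartbeatFrequency

# Apply the typical NetApp-recommended values (check your VSC docs first)
esxcli system settings advanced set -o /NFS/HeartbeatFrequency -i 12
esxcli system settings advanced set -o /NFS/HeartbeatMaxFailures -i 10
esxcli system settings advanced set -o /NFS/HeartbeatTimeout -i 5
```

These settings control how long the host keeps retrying the NFS datastore before giving up, which is exactly the "pause" window described above.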
For the Fusion-io card I've used a dedicated driver, so I think I'm OK on that part.
Did you enable the cache on the OS disk as well, or only on the disk presented as an iSCSI LUN directly in the VM (Windows iSCSI software initiator)?