I've been working with NetApp products for several years but have just inherited a stretch MetroCluster installation and have never had experience with them or SyncMirror before.
I've read TR-3548 and have a pretty good insight now but I do have a couple of questions which, if answered, would help me understand what I'm dealing with!
I have a DR test soon and would just like to test standard cluster failover of the controllers. Having read section 3 on failure handling, my questions are:
1. I know I can do a normal cf takeover at the DR site and have the primary controller failover to the DR site. As I understand it, a stretch MetroCluster is just like a normal cluster but geographically dispersed. If the cluster failover has happened, what would happen if I then turned off the disk shelves at the original "failed" site? My understanding is that nothing would happen in terms of data service, the aggregate would continue to be available from the failed over controller and the plex at the DR site, but the aggregate would show as broken because the plexes at the original site have failed. Is my thinking correct here? It appears to be addressed in section 3.6 of TR-3548 (Rolling Failures) but I want to be sure that I'm getting it right.
2. What would happen if I then powered off the controller at the original "failed" site? My thinking is - nothing else, aside from what has already happened as this would be part of a rolling failure scenario. However, I'm not sure about that.
3. If 2. above is correct, how does that differ from doing (or having to do) a cf forcetakeover -d?
4. (Last one) I have the raid.mirror_read_plex_pref set to alternate which has improved read speed in normal operation. If it was set to local, and the controller then failed over to the DR site, would it still continue to read from the same plex at the original site? i.e. Would it still consider the plex at the primary site to be local?
Any advice gratefully received!
P.S. I'm in the UK so may not reply straight away as it's bedtime here now!
1. Yes, you are correct.
2. Again, you are correct. Nothing further happens; the partner controller is not up anyway.
3. In your case, the state of each filer is known at every step. In a real disaster, it is unknown whether the partner or the interconnect is down.
4. Good question. I'd like to know the answer too.
Thanks for your reply, it's helped me get clear in my mind what will happen in a failover so it's much appreciated.
As for the last point, I've also asked NetApp, so if they tell me, I'll let you know! In my case it's a FAS3160, so it uses software disk ownership. I suppose that the original primary site plex will still be local because of this, but I'd like to find out for sure, out of curiosity more than anything else.
I said I would update with the answer about the raid.mirror_read_plex_pref setting when a failover occurs!
The answer, from asking NetApp support, is that this option is set at filer level, not aggregate level, so it will use the setting in the global options of the filer that took over.
The local and remote plexes are determined by whether the plex is in pool0 or pool1. Pool0 and pool1 will not change in a takeover, so the plexes will keep the same local/remote designation.
Therefore, if the option is set to remote on both filers, then it will continue to read from the original remote plex, which will now be on the same site as the controller that took over.
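To make the pool-based rule concrete, here is a rough Python model of it. It is only a sketch of the behaviour as support described it, not anything Data ONTAP actually runs: the Plex class, the plex_to_read function, the plex names and the site labels are all made up for illustration, and the "alternate" setting is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plex:
    name: str   # illustrative plex name
    pool: int   # 0 or 1; assumed not to change during a takeover
    site: str   # physical location, illustrative label only


def plex_to_read(plexes, pref):
    """Pick the plex to read from, given a filer-level preference.

    Hypothetical model: 'local' means the pool0 plex and 'remote' means
    the pool1 plex, regardless of which controller is serving the data.
    """
    wanted_pool = {"local": 0, "remote": 1}[pref]
    return next(p for p in plexes if p.pool == wanted_pool)


# Mirrored aggregate owned by the primary-site filer.
aggr_plexes = [
    Plex("plex0", pool=0, site="primary"),
    Plex("plex1", pool=1, site="dr"),
]

# After takeover, the DR-site filer serves this aggregate and applies its
# own global option. With 'remote' it still reads the pool1 plex, which is
# physically at the DR site, next to the controller that took over.
chosen = plex_to_read(aggr_plexes, "remote")
print(chosen.name, chosen.site)
```

In this model, the preference follows the filer that took over, while the local/remote labels stay attached to the pools.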
Hope that makes sense! Also, I hope I have reduced by one the number of those really annoying threads where someone asks the question you want to know the answer to but never actually posts the solution.