OK, so I've downloaded the ESX VSA demo and set up two nodes.
I have a single site, and a single cluster containing both nodes.
The cluster has a virtual IP.
I have a couple of volumes, each set to 2-way replication.
I can continuously ping the virtual IP so long as the node that is the virtual manager is up and running. If I power that node off (either via the CMC or to simulate the failure of a link or node), I lose the ability to ping the virtual IP, and of course my test server loses access to its iSCSI volumes.
I believe I need to move the virtual manager, but the CMC doesn't seem to give any option to do this if it thinks the virtual manager is running on the failed/down node. It just says the manager is offline, and I seem to go around in a circle: I can't make the other node the virtual manager until I stop the existing virtual manager, which of course I can't do as that node could be down/on fire, which is the whole point :-)
I guess I'm doing something wrong, but as much as I read the manual I can't work out what.
The Virtual Manager is a manual failover mechanism: in case of a failure, you start it by hand on a surviving node.
I recommend using the FOM (Failover Manager) running on another host (could be Virtual Server or VMware Workstation or VMware Player). The FOM runs all the time, in place of the Virtual Manager. It is the decision maker in case of a failure and works automatically.
Under normal operating circumstances you should not have a virtual manager running. It should be started manually on the surviving node when a failure occurs, thus restoring quorum and access to the volumes.
Using a Failover Manager is definitely the best solution for your environment as it will run all of the time and will maintain quorum and access to the volumes during the periods when one of your nodes is offline for whatever reason.
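To make the quorum point concrete, here is a rough sketch of the arithmetic. It only illustrates a strict-majority rule and is not how the product itself is implemented; the function name `has_quorum` is made up for the example.

```python
def has_quorum(reachable_managers: int, total_managers: int) -> bool:
    """Strict majority: more than half of all configured managers must be reachable."""
    return 2 * reachable_managers > total_managers


# Two storage nodes, each running a manager, no FOM: one node fails.
print(has_quorum(1, 2))  # False: 1 of 2 is not a majority

# Same two nodes plus a FOM (3 managers in total): one node fails.
print(has_quorum(2, 3))  # True: 2 of 3 is a majority
```

With only two managers, losing either one drops you below a majority, which is why the third manager (the FOM) is what keeps the volumes online.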
Thanks guys, a little after posting I worked out that a FOM is what I needed, so I installed one and it works seamlessly (well, a few seconds delay whilst it figures out what's happening but near as dammit seamless).
I guess this question applies to any sort of redundant storage that is seen as a single addressable "cluster" by ESX:
Suppose you have two sites, A and B. Each contains some nodes making up a storage cluster. Each contains some servers, ESX most likely. A and B are linked by a fast LAN link.
Let's say you lose the link between A and B, but the kit in each location is up and running.
You now have "highly available" storage that is still available in both locations.
You have ESX servers that can each still see the storage local to them, but can't see the other servers in the ESX cluster as the link has gone.
So don't you end up with the same VMs now running in both locations, as each ESX box's HA would kick in and each ESX box can still access its shared storage because your SAN is resilient?
I'm sure I'm overlooking something obvious here as I can only think about it right now vs. actually do it.
A split-brain scenario is prevented by requiring a majority of the storage managers to be reachable in order to maintain quorum.
In a perfect world you would have a 3rd site (that is connected to the first 2 sites) that hosts your failover manager. If that configuration is not possible, you then run your failover manager at your primary site that you want to stay up if your site link goes down.
If you have 4 nodes (2 at each site), only the site that has access to 3 managers (the 2 local nodes, plus the failover manager) will have quorum (and thus access to storage) in the event of a link failure.
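The 4-node example can be sketched like this. It is a simplified model that assumes the inter-site link is down and each site can only reach its own managers; the function name `sites_with_quorum` and the site labels are hypothetical.

```python
from collections import Counter


def sites_with_quorum(manager_sites: list[str]) -> list[str]:
    """Return the sites that still hold a strict majority of all managers
    when every site is cut off from the others."""
    total = len(manager_sites)
    per_site = Counter(manager_sites)
    return [site for site, count in per_site.items() if 2 * count > total]


# Two nodes at each site, with the failover manager hosted at site A.
print(sites_with_quorum(["A", "A", "B", "B", "A"]))  # ['A']

# Same four nodes with no failover manager: neither side has a majority.
print(sites_with_quorum(["A", "A", "B", "B"]))  # []
```

At most one side of a split can ever hold more than half of the managers, so the two sites can never both stay online at once.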
Thanks, I'd actually tested failover using the FOM in this exact scenario and I clearly wasn't thinking when I posted the question. Only the site that has quorum will have the cluster IP, so you can't actually have two sites in "split brain".