We ran across the same problem that others have reported: a Microsoft NLB cluster would not converge on VMware virtual guests (which happened to be Microsoft Dynamics 4 hosts). The product versions involved were Windows Server 2008 NLB and vSphere 4.1.
The hardware solution was based on HP Blades acting as VMware hosts. The blades were housed in a few new chassis, added to give additional capacity to the existing VMware-based hosting environment. The design called for two VMs using NLB for load-balanced capacity, with no use of VMware HA/DRS, as there are plenty of warnings against mixing the two. The design had the VMs hosted on separate blades, each in a different chassis in a different rack. This gave a degree of resilience, if you ignore a datacentre failure or a power failure of a whole rack row. The high-level design did not document whether the NLB cluster would use multicast (each VM retains its own MAC for its real IP while also answering for the cluster VIP, so the cluster members can still talk to each other) or unicast (a single cluster MAC shared by all hosts, which prevents intra-cluster comms).
Clients were encountering random connectivity issues. It appeared to be something to do with the clustering, but the initial efforts to correct it hadn't helped, so while most clients used the NLB VIP for access, we temporarily had clients run directly against the real IP of the Dynamics hosts rather than the VIP if they encountered any problems.
The support team had moved the VMs so that they were on the same host, i.e. removing all hardware variables from the problem. However, the problem recurred, so no luck there. On initial investigation it became apparent that the cluster was running in unicast mode, which is the default installation setting. Multicast had been tried in the pre-production environment but the convergence problem manifested itself there; unicast appeared to work, so it went live in that mode.
We came across a VMware article that explained that while unicast will work, it can generate too much load on the switch: in unicast mode NLB masks the cluster MAC so the switch never learns it, and every frame addressed to the VIP is flooded out of all ports in the VLAN. Typically switches will silently drop packets when out of resources. This could explain the symptoms we were seeing, i.e. as we increased the number of users the problem appeared, which implied a load problem. We believed that port flooding on the switch was happening as the load increased.
Thus we made the decision to reconfigure the cluster to multicast mode and add the recommended static ARP mapping for the cluster VIP address. The latter is needed because the switch will not learn the mapping on its own: the MAC address returned in the ARP reply is a multicast MAC while the IP address is unicast, and Cisco switches discard ARP replies that map a unicast IP to a multicast MAC. The static ARP entry gives the switch the VIP-to-multicast-MAC resolution it would otherwise refuse to learn. This was confirmed by SMEs for another account/client which had a similar setup.
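For reference, the multicast cluster MAC that NLB advertises is derived deterministically from the VIP: 03-BF followed by the four octets of the virtual IP address. A minimal sketch of that derivation (the example VIPs are hypothetical, not our production addresses):

```python
def nlb_multicast_mac(vip: str) -> str:
    """Derive the default Microsoft NLB multicast cluster MAC for a VIP.

    NLB builds the multicast MAC as 03-BF followed by the four octets
    of the cluster's virtual IP address, each rendered in hex.
    """
    octets = [int(o) for o in vip.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not an IPv4 address: {vip!r}")
    return "03-bf-" + "-".join(f"{o:02x}" for o in octets)

# Hypothetical VIPs, for illustration only
print(nlb_multicast_mac("192.168.1.50"))   # 03-bf-c0-a8-01-32
print(nlb_multicast_mac("10.10.10.100"))   # 03-bf-0a-0a-0a-64
```

This is the MAC you put in the static ARP entry on the switch.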
The question was where to put the static ARP mapping for the multicast address. Each chassis connected to top-of-rack access switches, which in turn connected directly into the datacentre core Cisco 6500 switches. The normal wisdom is to put the ARP mapping on the switches 'closest' to the guests, but these were running in an unmanaged configuration, i.e. there was no management IP to connect to in order to configure them.
We made the decision to put the static ARP mapping on the core 6500 switches, as we could manage those. From there we worked out which VLAN the ports were on; in that VLAN we had a number of ports connecting to physical servers, some to the top-of-rack chassis switches, and a couple used for switch trunking. Some VMware articles and forums say that a static MAC address-table entry may also be required in addition to the static ARP, but in our case it wasn't, as the core switches were running EtherChannel (presumably configured with the correct channel load-balancing configuration).
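On the 6500 the configuration looks something like the following. The VIP, VLAN, and interfaces here are hypothetical, and the exact syntax varies slightly across IOS versions (newer releases use `mac address-table` without the hyphen), so treat this as a sketch rather than a drop-in config:

```
! Map the hypothetical cluster VIP 10.10.10.100 to its NLB multicast MAC
arp 10.10.10.100 03bf.0a0a.0a64 ARPA

! Optional: pin the multicast MAC to the ports facing the NLB hosts to
! suppress flooding (the static MAC entry we found we did not need)
mac-address-table static 03bf.0a0a.0a64 vlan 10 interface GigabitEthernet1/1 GigabitEthernet1/2
```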
We applied the static ARP to the VLAN and it all worked out for us. The browser clients no longer experienced any drop-outs.
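A quick sanity check from a Windows client (again using a hypothetical VIP) is to ping the VIP and then confirm that the local ARP cache shows the 03-bf multicast MAC rather than a normal unicast one; the output will be along these lines:

```
C:\> ping 10.10.10.100
C:\> arp -a 10.10.10.100
  Internet Address      Physical Address      Type
  10.10.10.100          03-bf-0a-0a-0a-64     dynamic
```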