After seeing a mention on Scott Lowe’s blog (blog.scottlowe.org) and on the Storage Monkeys blog (blogs.storagemonkeys.com), I’ve decided to discuss the issues I’ve come across when NFS locking is disabled with the NFS.LockDisable=1 setting.
Although the problem can arise from many different circumstances, most of the feedback I’m receiving points to a VMware HA failover (either intentional or unintentional) as the trigger. So I would like to discuss VMware HA and how it works (based on my experience and knowledge).
But before that, let me mention that the end result of having NFS.LockDisable set to 1 is that Virtual Machines can become corrupt. Windows VMs blue screen and give NTFS errors; Linux guests are more resilient and could potentially be fixed by an fsck, but you should always have a good backup regardless. The corruption happens because, with locking disabled, multiple ESX hosts can start the same VMX at the same time. Ok, let’s continue…
From what I can see, when you configure VMware HA the first (4) nodes configured are marked as primary, and every host after the fourth is considered a backup node. In the event of an HA failover, the primary nodes will all attempt to start the VMs that were running on the failed node. It appears they rely on VM locking to determine whether the VM is actually down or not. What this means is that, regardless of Isolation Response, the VM can actually be powered on multiple times. In fact, the couple of times this has happened to me I had the same running VM on up to (3) hosts at once.
You can also see some strange behavior in VirtualCenter, such as the number of Virtual Machines registered on each host jumping up and down within seconds. I would look at the summary of one of my hosts and see the Virtual Machine count go from 20 to 35 to 28 to 40 and so on.
The only reliable way to clear this up is to do the following from the service console:
- Run a vmware-cmd -l on each of your hosts within the cluster.
- Output this data to a file so you can sort it later (e.g. vmware-cmd -l > host1).
- Sort those host files together into one master file (e.g. sort host1 host2 host3 > masterVMlist).
- View the master VM list and determine which VMX files are registered multiple times.
- Now the tricky part: if you have tons of hosts within a cluster, it will take some time to find where the VMs are really located, but you do know which ones are registered multiple times. Knowing the list of multi-registered VMX files, you could create a script that SSHes to each of your ESX hosts, runs vmware-cmd -l, greps for the VMX file, and returns a code telling you whether it’s there or not. Since I only had (4) nodes in the cluster that failed, this wasn’t necessary for me.
- You can run a ps aux | grep VMX-FILE on the hosts where they are registered to determine the PID.
- Use kill -9 PID to remove the running VM. Magically it will become unregistered on the invalid hosts.
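The sort-and-compare steps above can be sketched as a small shell function. This is only a rough sketch, not a tested tool: it assumes you saved each host’s vmware-cmd -l output to a file named after that host (host1, host2 and so on are placeholder names), and the function name find_duplicate_vmx is made up for illustration. Naming the files after the hosts also takes care of the tricky part, since each duplicate is printed next to the listings it appears in.

```shell
#!/bin/sh
# Sketch: report VMX files registered on more than one host.
# Assumes each argument is a file holding one host's `vmware-cmd -l` output,
# named after that host (host1, host2, ... are placeholder names).
find_duplicate_vmx() {
    # De-duplicate within each file first, so a path only counts as a
    # duplicate when it shows up in two or more different listings.
    for listing in "$@"; do
        sort -u "$listing"
    done | sort | uniq -d | while IFS= read -r vmx; do
        [ -n "$vmx" ] || continue
        # -l: print matching file names, -x: whole-line match, -F: fixed string
        hosts=$(grep -lxF -- "$vmx" "$@" | tr '\n' ' ')
        printf '%s: %s\n' "$vmx" "${hosts% }"
    done
}
```

Running find_duplicate_vmx host1 host2 host3 prints one line per multi-registered VMX file: the path, a colon, and the names of the listing files that contain it. You still have to decide by hand which host should keep the VM before killing the other copies.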
Ok, so in closing, I do not want to put all the blame on VMware HA. The potential corruption actually comes from the combination of HA and NFS.LockDisable=1. The same result can occur by manually registering and starting the same VMX on multiple hosts, since disabling locking removes that added layer of security.
It is extremely important that you enable NFS Locking by changing NFS.LockDisable back to the default setting of 0. You should also install VMware Patch ESX350-200808401-BG. I discuss the fix of this issue in another posting, which can be found here.
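For reference, here is a rough sketch of checking and resetting the setting from the service console with esxcfg-advcfg. The function names and the ADVCFG variable are made up for illustration, and the parsing assumes the tool prints the value as the last word of its output (something like "Value of LockDisable is 0"), so verify that on your own hosts before relying on it.

```shell
#!/bin/sh
# Hypothetical knob: lets you point the functions at a different command
# (for example a stub) instead of the real esxcfg-advcfg.
ADVCFG=${ADVCFG:-esxcfg-advcfg}

# Print the current NFS.LockDisable value.
# Assumption: the value is the last word on the first line of output.
nfs_lock_disable_value() {
    "$ADVCFG" -g /NFS/LockDisable | awk 'NR == 1 { print $NF }'
}

# Set NFS.LockDisable back to 0 unless it is already 0.
enable_nfs_locking() {
    current=$(nfs_lock_disable_value)
    if [ "$current" = "0" ]; then
        echo "NFS locking already enabled"
    else
        "$ADVCFG" -s 0 /NFS/LockDisable >/dev/null &&
            echo "NFS locking re-enabled (was LockDisable=$current)"
    fi
}
```

Call enable_nfs_locking on each ESX host in the cluster; the same change can also be made under the host’s Advanced Settings in the VI Client.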