HA fails to restart a virtual machine error...

One of my customers recently had a host throw a PSOD.  It's a large environment with appropriate spare capacity, so the outage itself wasn't a major issue.  We never want a host to go down, but we designed the environment to accommodate that situation and it largely responded well... except for two VMs.  They errored out with a message saying that "vSphere HA unsuccessfully failed over this virtual machine... ...Reason: An error occurred during host configuration".  We found a related event that said, "Operation failed, diagnostics report: Failed to open file /vmfs/volumes/<Datastore UUID>/.dvsData/<DVS UUID>/<Port Number> Status (bad0003)= Not found".  True to the error message, that file was not there.
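For reference, the path that HA is complaining about can be reconstructed from the VM's own configuration.  Here's a minimal PowerCLI sketch of that idea (it assumes an active Connect-VIServer session; "MyVM" is a placeholder name, and the on-disk .dvsData folder name is derived from the switch UUID shown here):

```powershell
# Hedged sketch: derive the .dvsData path components that HA expects for each
# VDS-backed NIC on a VM.  Assumes PowerCLI is loaded and connected to vCenter;
# "MyVM" is a placeholder VM name.
$vm = Get-VM -Name "MyVM"
foreach ($nic in Get-NetworkAdapter -VM $vm) {
    $backing = $nic.ExtensionData.Backing
    if ($backing -is [VMware.Vim.VirtualEthernetCardDistributedVirtualPortBackingInfo]) {
        # HA looks for: /vmfs/volumes/<Datastore UUID>/.dvsData/<DVS UUID>/<Port Number>
        $dvsUuid = $backing.Port.SwitchUuid
        $portKey = $backing.Port.PortKey
        Write-Output ("{0} ({1}): .dvsData/{2}/{3}" -f $vm.Name, $nic.Name, $dvsUuid, $portKey)
    }
}
```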

VMware has a KB article about this issue, but it's written for vSphere 5.0, where Storage vMotion fails to move a file that HA requires for tracking the VM on the Distributed vSwitch.  This customer is on vSphere 5.5 (and has been for a significant time), but the error messages lined up exactly and were very specific.  We went ahead and ran the script from that article, and it identified both of the VMs that HA had failed to restart, as well as a handful of other VMs in the environment.  We decided to operate on the assumption that these files had been lost somehow, but that the underlying bug was resolved in the current version, and so we initiated a Storage vMotion on one of the affected VMs.  That regenerated the file, so we were able to easily resolve the problem... but it left us wondering about the other sites and whether they might be affected as well.
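The remediation itself is a one-liner in PowerCLI.  A hedged sketch of what we ran (the VM and datastore names are placeholders; moving the VM to any other datastore is what regenerates the file):

```powershell
# Hedged sketch: Storage vMotion one affected VM so its .dvsData file gets
# recreated.  "ProblemVM" and "OtherDatastore" are placeholder names; assumes
# an active PowerCLI connection to vCenter.
Get-VM -Name "ProblemVM" | Move-VM -Datastore (Get-Datastore -Name "OtherDatastore")
```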

So, we decided to go ahead and run that check script at our other sites.  It's a great script, but it is designed to output everything to the console.  For a small set of VMs, this isn't an issue (and it even helpfully color codes the output)... but when you have dozens of vCenters with a few thousand VMs at each site, it doesn't scale well.  This is especially true because the script uses Write-Host (presumably to allow for that nice color coding) rather than Write-Output, meaning that you can't easily redirect the output to a file.

I went ahead and made a small modification to it.  I added an -outFile parameter to the Test-VDSVMIssue function.  If that parameter is populated with a valid file path, the script will log each VM/network adapter that has the problem and store it in that file as a CSV.  This change allowed me to simply fire the script off in each vCenter and then go back and look at those log files to get the list of affected VMs (rather than parsing through the console output).
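The shape of that change is simple.  Here's a hedged sketch of the pattern (the function name and CSV column names are mine for illustration, not the original author's):

```powershell
# Hedged sketch of the pattern, not the original script: report a problem
# VM/adapter to the console (keeping the color coding) and, when -outFile is
# set, append the same record to a CSV.  Write-ProblemRecord is a made-up
# helper name used only for this illustration.
function Write-ProblemRecord {
    param(
        [string]$VMName,
        [string]$AdapterName,
        [string]$MissingPath,
        [string]$outFile
    )
    Write-Host "$VMName / $AdapterName is missing $MissingPath" -ForegroundColor Red
    if ($outFile) {
        [pscustomobject]@{
            VM      = $VMName
            Adapter = $AdapterName
            Missing = $MissingPath
        } | Export-Csv -Path $outFile -NoTypeInformation -Append
    }
}
```

Export-Csv with -Append (PowerShell 3.0+) means each vCenter run just tacks its findings onto the same file, which is what makes the "fire it off and come back later" workflow practical.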

Quick note - we chose to identify the problem VMs and then remediate them individually rather than using the -Fix switch that the original author created.  We chose to do this because the customer uses Ephemeral port groups, and the -Fix option appears to assume that Static port groups are in use.  Rather than attempting to re-engineer that solution and apply it to production machines, we decided to go with the (more resource intensive) known-safe solution of initiating a Storage vMotion for the affected VMs.

Anyway, as with all scripts that you find on the internet, please read through it and make sure that you understand what it's doing before you try it out.  Also, make sure that you test it thoroughly before trying it in production; just because it worked in this situation doesn't mean that it will work in yours!  Here's my modified version; I execute it like this: Get-VM | Sort-Object | Test-VDSVMIssue -outFile C:\temp\problems.csv

