Invalid VDS PortID Preventing vMotion

One of my customers had an issue where a bunch of VMs were not able to vMotion, despite the hosts being configured correctly in all regards (other VMs using the same VDS Port Groups, for example, could vMotion onto and off of the host where these VMs were running).  When DRS (or an administrator) attempted a vMotion, a generic "A general system error occurred: vim.fault.NotFound" error message would be displayed.

When I took a look at these VMs, I noticed something interesting (besides the fact that they were all on the same host); their VDS Port numbers were universally high, like in the 5000s.  This was particularly interesting because when I looked at the VDS itself, the highest numbered port on it was 4378.  I supposed that these ephemeral ports had somehow been assigned invalid port numbers, which was causing vMotion to fail when the new destination was unable to reserve that invalid number on the VDS.  Interestingly, all of these VMs were communicating just fine on the network, despite this odd configuration.
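One way to check for this kind of mismatch from PowerCLI, rather than by eye, is to compare the port key recorded on each VM NIC against the port keys the VDS reports. The sketch below is only an illustration: the function names `Find-InvalidPortKey` and `Get-VdsNicPortKey` are made up for this example, and it assumes that each NIC's `ExtensionData.Backing.Port.PortKey` holds the port number shown in the client.

```powershell
# Pure helper (hypothetical name): return the NIC records whose PortKey is
# not among the port keys that the switch itself knows about.
function Find-InvalidPortKey {
    param(
        [Parameter(Mandatory)] [object[]] $NicRecord,    # objects with VM, Nic, PortKey
        [Parameter(Mandatory)] [string[]] $ValidPortKey  # keys reported by the VDS
    )
    $NicRecord | Where-Object {
        # Skip NICs with no distributed port backing (PortKey is empty for those).
        $_.PortKey -and ($ValidPortKey -notcontains $_.PortKey)
    }
}

# Collector (hypothetical name): needs PowerCLI and a vCenter connection.
function Get-VdsNicPortKey {
    param([Parameter(Mandatory)] [string] $SwitchName)
    Get-VDSwitch -Name $SwitchName | Get-VM | Get-NetworkAdapter | ForEach-Object {
        [pscustomobject]@{
            VM      = $_.Parent.Name
            Nic     = $_.Name
            # Assumption: this property carries the port number seen in the UI.
            PortKey = $_.ExtensionData.Backing.Port.PortKey
        }
    }
}

# Example usage against a live environment (not executed here):
# $keys = (Get-VDSwitch -Name switchName).ExtensionData.FetchDVPorts($null).Key
# Find-InvalidPortKey -NicRecord (Get-VdsNicPortKey -SwitchName switchName) -ValidPortKey $keys
```

Keeping the comparison in its own function means it can be tried against made-up records before pointing it at a real vCenter.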

I decided that I needed to figure out how widespread this issue was.  In PowerShell (PowerCLI), the following gave me a list of all VMs in the environment that were using that switch, including the VMs with invalid port assignments:

    $all = get-vdswitch switchName | get-vm

My next step was to get a list of all VMs that the VDS itself reported as connected to one of its ports (the affected VMs did not show up as assigned to any specific port), which I did with this ugly command:

    $valid = get-view ((get-vdswitch switchName).extensiondata.fetchdvports($null) | ? {$_.connectee.ConnectedEntity.Type -eq "VirtualMachine"}).connectee.connectedentity | sort name -unique

Once I had my list of all VMs on the VDS and my list of all VMs with valid port assignments, I compared them to find the VMs that were associated with the VDS but were not assigned a port:

    $ProblemVMNames = (compare-object $ $ | ? {$_.sideindicator -eq "<="}).inputObject

I then used that list of VM names to get the actual VM objects, filtering on PowerState because powered-off VMs showed up as false positives:

    $problemVMs = $all | ? {$_.powerstate -eq "PoweredOn" -and $problemVMNames -contains $} | select name,@{N="VMHost";E={$}},@{N="PortGroup";E={(get-view $}}

I then saved $problemVMs as a CSV and moved on.
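The comparison step can be sketched in a form that runs against plain name lists. This is an illustration under assumptions, not the exact commands used above: the function name `Get-UnassignedVMName` and the CSV path are made up, and it assumes the two lists are compared by VM name. The PortGroup column is left out because its expression is not recoverable from the text.

```powershell
# Hypothetical helper: names present in $AllName but missing from $ValidName.
function Get-UnassignedVMName {
    param(
        [string[]] $AllName,    # every VM on the switch
        [string[]] $ValidName   # VMs the switch reports as connected to a port
    )
    if (-not $AllName)   { return }            # nothing to check
    if (-not $ValidName) { return $AllName }   # Compare-Object rejects empty input

    Compare-Object -ReferenceObject $AllName -DifferenceObject $ValidName |
        Where-Object { $_.SideIndicator -eq '<=' } |   # only on the "all" side
        Select-Object -ExpandProperty InputObject
}

# Example usage with the variables from the post (not executed here):
# $problemVMNames = Get-UnassignedVMName -AllName $all.Name -ValidName $valid.Name
# $problemVMs = $all |
#     Where-Object { $_.PowerState -eq 'PoweredOn' -and $problemVMNames -contains $_.Name } |
#     Select-Object Name, @{N='VMHost'; E={$_.VMHost.Name}}
# $problemVMs | Export-Csv -Path .\problemVMs.csv -NoTypeInformation
```

The `<=` side indicator marks items that appear only in the reference list, which here is the list of all VMs on the switch.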

Now that I had a list of all of the VMs that were experiencing the issue, I could get to work cleaning it up (we still don't know what caused it, but the customer had experienced some other problems around the same time, so we suspect that this was fallout).  Since the customer was using ephemeral ports on their VDS, I supposed that moving the VMs onto a different port group and back would result in them being assigned new PortIDs.  If that other port group was created on the same VLAN that the VM was already using, the move should be non-disruptive.

So, I wrote a script to do exactly that.  It takes a list of VMs, figures out which port groups they're using, creates temporary copies of those port groups, then bounces each NIC on each VM over to the appropriate temporary port group and back again.  I fired off that script and it worked exactly as intended.  Each of the VMs bounced between the appropriate port groups, dropping at most one reply on a continuous ping per transfer.  After the script completed, we found that all of the VMs were able to vMotion once again!
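What follows is a minimal sketch of that approach, written as an illustration rather than the exact script that was run. The function names and the `-tmp-rebind` suffix are made up for this example. It assumes that `New-VDPortgroup -ReferencePortgroup` produces a copy with the same VLAN and port binding settings as the original, which is worth confirming in a lab first.

```powershell
# Hypothetical helper: name used for the temporary copy of a port group.
function Get-TempPortGroupName {
    param([Parameter(Mandatory)] [string] $Name)
    "$Name-tmp-rebind"
}

# Hypothetical function: needs PowerCLI and a vCenter connection.
function Invoke-PortGroupBounce {
    param(
        [Parameter(Mandatory)] [string]   $SwitchName,
        [Parameter(Mandatory)] [string[]] $VMName
    )
    $vds = Get-VDSwitch -Name $SwitchName
    $tempGroups = @{}   # original port group name -> temporary copy

    foreach ($nic in (Get-VM -Name $VMName | Get-NetworkAdapter)) {
        $pgName   = $nic.NetworkName
        $original = Get-VDPortgroup -VDSwitch $vds -Name $pgName -ErrorAction SilentlyContinue
        if (-not $original) { continue }   # NIC is not on this VDS

        # Create one temporary copy per port group, on first use.
        if (-not $tempGroups.ContainsKey($pgName)) {
            $tempGroups[$pgName] = New-VDPortgroup -VDSwitch $vds `
                -Name (Get-TempPortGroupName -Name $pgName) `
                -ReferencePortgroup $original
        }

        # Bounce the NIC to the temporary port group and straight back.
        $moved = Set-NetworkAdapter -NetworkAdapter $nic -Portgroup $tempGroups[$pgName] -Confirm:$false
        Set-NetworkAdapter -NetworkAdapter $moved -Portgroup $original -Confirm:$false | Out-Null
    }

    # Clean up the temporary port groups once every NIC is back home.
    foreach ($tmp in $tempGroups.Values) {
        Remove-VDPortgroup -VDPortgroup $tmp -Confirm:$false
    }
}

# Example usage (not executed here):
# Invoke-PortGroupBounce -SwitchName switchName -VMName $problemVMs.Name
```

The sketch has no error handling: if the move back fails, the NIC stays on the temporary port group and the cleanup step will fail for that group, so a real run would want a check after each move.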

As with all scripts found on the internet, this is posted as is for educational purposes with no implied guarantee.  Just because it worked for me in my situation is no guarantee that it will work for you in yours, so test thoroughly and make sure that you understand what it's doing before you ever execute a script.

