Unidesk 2.6 Upgrade Problem

One of my customers was upgrading their Unidesk environment to 2.6 and we ran into a problem.  It proved to be an incredibly specific problem and not at all tied to Unidesk (just tied to the way they push this particular upgrade), but the troubleshooting process was very interesting and so I think it's worth putting this knowledge out there.

As part of the upgrade, the Unidesk Management Appliance needs to install an OVF of itself (which is an uncommon but not unprecedented behavior for virtual appliances).  That is the step that was failing.  The most common source of that failure is a firewall; if port 443 is blocked between the Management Appliance and the ESXi host, that deployment will fail.  We went round and round with the network team checking firewalls and couldn’t find any record of ports being blocked, yet we still couldn’t communicate with the ESXi hosts (although we could communicate with the vCenter server just fine).
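A quick way to confirm the symptom from any machine on the appliance's network is to attempt a plain TCP connection to port 443 on the host.  This is only a minimal sketch, not anything Unidesk ships; the function name and the host name in the usage note are placeholders of mine.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake with host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, resets, timeouts, DNS failures,
        # and unreachable networks alike.
        return False
```

Calling `can_connect("esxi01.example.com")` (a made-up name) against the host and then against vCenter would reproduce what we saw: one fails, the other succeeds.  Note that a check like this only tells you the handshake failed, not why, which is why a packet capture was still needed.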

Eventually, we ran tcpdump on the ESXi host’s management port to try to figure out what was happening, and we got some weird results.  We saw the SYN come in from the client, followed almost immediately (less than 1 ms later) by a RST.  Interestingly, we didn’t see the host sending out any reply whatsoever, and so were quite confused (looking back, this was the first major clue to what was actually happening).  We found that no device on the Management Appliance network could communicate with the ESXi hosts, although they could all communicate with vCenter (which is on the same network as the ESXi hosts), and so we used a Windows server on that network to do some Wireshark work.

We found that the client was sending out the SYN, was receiving a SYN,ACK, and was then sending the RST that we’d seen on the host.  We were baffled as to why our client would immediately send a RST after getting the expected SYN,ACK, so we dug deeper.

I noticed that the acknowledgment number in the SYN,ACK didn’t correspond to our SYN’s sequence number (a valid SYN,ACK acknowledges the SYN’s sequence number plus one).  That explained why our client was sending the RST, as it never sent the SYN that the ESXi host was acknowledging.  In fact, going back to that ESXi host’s tcpdump capture, it didn’t look like that host was sending the SYN,ACK at all.  So, we dug deeper.
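The check I was doing by eye can be written down in a couple of lines.  This is just an illustration of the TCP handshake rule, assuming you read the raw (absolute) sequence numbers out of the capture rather than Wireshark's relative ones; the function name is mine.

```python
SEQ_SPACE = 2**32  # TCP sequence numbers are 32-bit and wrap around


def synack_matches_syn(syn_seq: int, synack_ack: int) -> bool:
    """Return True if a SYN,ACK's acknowledgment number answers the given SYN.

    A SYN consumes one sequence number, so the peer must acknowledge
    syn_seq + 1 (modulo 2**32).
    """
    return synack_ack == (syn_seq + 1) % SEQ_SPACE
```

When this comes back False, the client is looking at a reply to a SYN it doesn't recognize as its own, and answering with a RST is the expected behavior.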

The source MAC address in the weird SYN,ACK was a VMware MAC address (00:50:56:x:x:x), so I decided to try to track down what device was sending that SYN,ACK (I was expecting it to be a vShield component or something).  I wrote a quick PowerCLI script to search a vCenter inventory for a specific VM by MAC address, and it turned up no VMs with that MAC (despite successful test runs with other VMs' MAC addresses).  That’s when I remembered that VMkernel (VMK) interfaces are technically virtual interfaces as well, and so I checked the VMK interfaces on the host.
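The lesson here is that a MAC search has to cover host VMkernel adapters, not just VM NICs.  The sketch below is not my PowerCLI script; it is a small Python illustration that assumes you have already exported the inventory as (owner, MAC) pairs.  The function names and the sample records are hypothetical, and the OUI set lists prefixes registered to VMware.

```python
# OUI prefixes registered to VMware (first three octets of the MAC).
VMWARE_OUIS = {"00:50:56", "00:0c:29", "00:05:69", "00:1c:14"}


def normalize_mac(mac: str) -> str:
    """Convert a MAC in any common notation to lowercase aa:bb:cc:dd:ee:ff."""
    digits = "".join(c for c in mac.lower() if c in "0123456789abcdef")
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def is_vmware_mac(mac: str) -> bool:
    """Return True if the MAC's OUI belongs to VMware."""
    return normalize_mac(mac)[:8] in VMWARE_OUIS


def find_owners(mac: str, inventory) -> list:
    """Return the owners whose MAC matches, given (owner, mac) pairs."""
    target = normalize_mac(mac)
    return [owner for owner, m in inventory if normalize_mac(m) == target]


# Hypothetical export that includes both VM NICs and host VMK interfaces.
sample_inventory = [
    ("vm: web01 / Network adapter 1", "00:50:56:9A:10:01"),
    ("host: esxi01 / vmk0", "00:50:56:61:AA:01"),
    ("host: esxi01 / vmk1", "00-50-56-61-AA-02"),
]
```

With that sample data, `find_owners("00:50:56:61:aa:02", sample_inventory)` points at the host's `vmk1`, which a VM-only search would never find.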

It turned out that the host had a secondary VMK interface that was directly on the network that had our Management Appliance.  So, the Management Appliance communicated with that host by its DNS name, which resolved to its primary VMK interface.  Since the host had an interface that was directly on that client’s network, though, it replied through that interface instead of the one the request had arrived on.
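To see why the reply left through the wrong interface, it helps to model how a host picks an outgoing interface: a directly connected network that contains the destination wins over the default gateway.  This is a deliberately simplified sketch of that rule, not ESXi's actual routing code, and the interface names and addresses are made up.

```python
import ipaddress


def reply_interface(interfaces: dict, default_if: str, dest_ip: str) -> str:
    """Pick the interface a reply to dest_ip would leave through.

    interfaces maps an interface name to its address in CIDR form.
    A directly connected network containing dest_ip beats the default
    route; among several matches, the longest prefix wins.
    """
    dest = ipaddress.ip_address(dest_ip)
    best = None
    for name, cidr in interfaces.items():
        net = ipaddress.ip_interface(cidr).network
        if dest in net and (best is None or net.prefixlen > best[1].prefixlen):
            best = (name, net)
    return best[0] if best else default_if


# Hypothetical host: vmk0 is management, vmk1 sits on the appliance's network.
host_interfaces = {"vmk0": "10.0.10.21/24", "vmk1": "10.0.50.21/24"}
```

Here a client at `10.0.50.40` sends its SYN to `vmk0` (via the router), but `reply_interface(host_interfaces, "vmk0", "10.0.50.40")` selects `vmk1`, so the request and the reply take different paths.  A client on any other network gets its reply through the default interface and never sees the problem, which matches what we observed with vCenter.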

We had found a situation where we had asymmetric routes, which is a bad thing.  When we moved that secondary VMK interface onto the storage network where it belonged, everything started working as intended.

