Unidesk 2.6 Upgrade Problem
One of my customers was upgrading their Unidesk environment to 2.6 and we ran into a problem. It proved to be an incredibly specific problem and not at all tied to Unidesk (just tied to the way they push this particular upgrade), but the troubleshooting process was very interesting and so I think it's worth putting this knowledge out there.
As part of the upgrade, the Unidesk Management Appliance
needs to install an OVF of itself (which is an uncommon but not unprecedented
behavior for virtual appliances). That is the step that was
failing. The most common source of that failure is a firewall; if port
443 is blocked between the Management Appliance and the ESXi host, that
deployment will fail. We went round and round with the network team
checking firewalls and couldn’t find any record of ports being blocked, yet we
still couldn’t communicate with the ESXi hosts (although we could communicate with
the vCenter server just fine).
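A quick way to confirm this kind of symptom from any machine on the client network is a plain TCP connect test against port 443. The following is only a minimal sketch, not something we ran at the time; the function name `can_connect` and its default timeout are my own choices for illustration.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within the timeout."""
    try:
        # create_connection performs the full three-way handshake;
        # a reset, a refusal, or a timeout all surface as OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_connect` once with an ESXi host name and once with the vCenter name would be expected to reproduce the split we saw: the first fails while the second succeeds. It only tells you that the handshake did not complete, not why.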
Eventually, we ran tcpdump on the ESXi host’s management
port to try to figure out what was happening, and we got some weird results.
We saw the SYN come in from the client, followed almost immediately (less than
1 ms later) by a RST. Interestingly, we didn’t see the host sending out
any reply whatsoever, and so were quite confused (looking back, this was the
first major clue to what was actually happening). We found that no device
on the Management Appliance network could communicate with the ESXi hosts,
although they could all communicate with vCenter (which is on the same network as
the ESXi hosts), and so we used a Windows server on that network to do some
Wireshark work.
We found that the client was sending out the SYN, was
receiving a SYN,ACK, and was then sending the RST that we’d seen on the
host. We were baffled at why our client would immediately send a RST
after getting the expected SYN,ACK, so we dug deeper.
I noticed that the acknowledgment number in the SYN,ACK didn’t
correspond to our SYN’s sequence number at all (it should have been our
sequence number plus one). That explained why our client
was sending the RST, as it never sent the SYN that the ESXi host was
acknowledging. In fact, going back to that ESXi host’s tcpdump capture, it didn’t
look like that host was sending the SYN,ACK at all. So, we dug
deeper.
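The check I was doing by eye can be written down in a couple of lines. In a valid handshake, the SYN,ACK's acknowledgment number is the SYN's sequence number plus one, modulo 2^32. This is just a sketch; the function name is made up for illustration, and it assumes you feed it the raw sequence numbers from the capture rather than the relative ones Wireshark displays by default.

```python
SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def synack_matches_syn(syn_seq: int, synack_ack: int) -> bool:
    """Return True if a SYN,ACK's ack number acknowledges the given SYN."""
    # The SYN flag consumes one sequence number, so the peer must
    # acknowledge syn_seq + 1 (wrapping at 2**32).
    return synack_ack == (syn_seq + 1) % SEQ_SPACE
```

When this returns False for a SYN,ACK that arrives right after your SYN, the client has every reason to answer with a RST, which is what we were seeing.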
The source MAC Address in the weird SYN,ACK was a VMware MAC
address (00:50:56:x:x:x), so I decided to try and track down what device was
sending that SYN,ACK (I was expecting it to be a vShield component or
something). I wrote a quick PowerCLI script to search a vCenter inventory
for a specific VM by MAC address, and it turned up no VMs with that MAC (despite successful test runs with other
VMs' MAC addresses). That’s when I remembered that VMK interfaces are
technically virtual interfaces as well, and so I checked the VMK interfaces on
the host.
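The lesson here is that a MAC search has to cover the hosts' VMK interfaces as well as the VMs. The sketch below is not the PowerCLI script I used; it is a plain Python illustration that runs over an already-exported inventory. The data layout, the names in `SAMPLE_INVENTORY`, and the function names are all hypothetical.

```python
VMWARE_OUI = "00:50:56"  # the VMware prefix seen in the capture


def normalize_mac(mac: str) -> str:
    """Normalize a MAC such as 00-50-56-AA-BB-CC to 00:50:56:aa:bb:cc."""
    digits = "".join(ch for ch in mac.lower() if ch in "0123456789abcdef")
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def is_vmware_mac(mac: str) -> bool:
    """Return True if the MAC starts with the 00:50:56 prefix."""
    return normalize_mac(mac).startswith(VMWARE_OUI)


def find_mac_owners(inventory, mac):
    """Return (kind, name) for every inventory entry that owns the MAC."""
    target = normalize_mac(mac)
    return [
        (entry["kind"], entry["name"])
        for entry in inventory
        if target in {normalize_mac(m) for m in entry["macs"]}
    ]


# Hypothetical export: VM adapters AND host VMK interfaces in one list.
SAMPLE_INVENTORY = [
    {"kind": "vm", "name": "app01", "macs": ["00:50:56:aa:bb:01"]},
    {"kind": "vmk", "name": "esxi01/vmk0", "macs": ["00:50:56:6a:00:01"]},
    {"kind": "vmk", "name": "esxi01/vmk1", "macs": ["00-50-56-6A-00-02"]},
]
```

Searching only the `vm` entries would have come up empty for the second VMK's MAC, which is exactly the dead end I hit before thinking to look at the host itself.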
It turned out that the host had a secondary VMK interface
that was directly on the network that had our Management Appliance. So,
the Management Appliance communicated to that host by its DNS name, which
resolved to its primary VMK interface. Since the host had an interface
that was directly on that client’s network though, it replied through that interface.
We had found a situation where we had asymmetric routes (requests arriving on one interface and replies leaving through another), which is a bad thing. When we moved that secondary VMK interface onto
the storage network where it belonged, everything started working as intended.
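To make the failure mode concrete, here is a small sketch of how you could spot this condition on paper before it bites you. It uses a deliberately simplified model of routing: a host replies through whichever of its interfaces is directly connected to the client's subnet (longest prefix wins), and otherwise through its default interface. The interface names, addresses, and function names are invented for illustration, and real ESXi behavior depends on how its TCP/IP stacks are configured.

```python
import ipaddress


def reply_interface(host_interfaces, client_ip, default_interface):
    """Pick the interface a reply would leave through in this simplified model.

    host_interfaces maps an interface name to its address in CIDR form,
    for example {"vmk0": "10.0.10.21/24"}.
    """
    client = ipaddress.ip_address(client_ip)
    best = None
    for name, cidr in host_interfaces.items():
        network = ipaddress.ip_interface(cidr).network
        # A directly connected network beats the default route;
        # among connected networks, the longest prefix wins.
        if client in network and (best is None or network.prefixlen > best[1].prefixlen):
            best = (name, network)
    return best[0] if best else default_interface


def is_asymmetric(host_interfaces, client_ip, target_interface, default_interface):
    """Return True if the reply would leave through a different interface
    than the one the client addressed."""
    return reply_interface(host_interfaces, client_ip, default_interface) != target_interface
```

With a hypothetical host that has `vmk0` on `10.0.10.21/24` and `vmk1` on `10.0.20.21/24`, a client at `10.0.20.50` that connects to the `vmk0` address gets flagged as asymmetric, because the reply leaves through `vmk1`. A client on a third subnet does not, because both directions go through the default interface.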