Recovering Unidesk Desktops from a Catastrophic Host Failure
I've recently had the opportunity to deploy Unidesk at one of my customer sites for their VDI solution. I was preparing to write up a post about the install process, but it was actually very simple and is extremely well documented on their site (which is sadly behind an authentication wall, so it isn't Google indexed). If I had written such an article, it would have ended up being a big love letter and wouldn't have been particularly interesting to anyone (well, with the possible exception of the Unidesk sales team). Instead, I'm going to write this article about some of the things that I've broken and fixed, and some of the interesting things that you can do from a troubleshooting perspective.
First, a brief overview of Unidesk. It's a layering technology. A given desktop is basically a collection of read-only vmdk files that are all stacked up together by a specialized driver in Windows to make them look like a single hard drive. One of those vmdk files has the OS installed on it, and all of the others have applications (sometimes a single application on a layer, sometimes a collection... whatever makes sense for your organization). Sitting on top of that stack is a "User Experience Package" (UEP), which is actually a pair of writable vmdk files where any user customization goes. When you apply updates, you simply adjust the desired layer and tell the system to rebuild the desktops with the updated layer. All other layers (including the UEP layer) are unaffected and remain in place, so the user has a consistent desktop experience. Pretty cool, when it comes right down to it. It's like having the flexibility of application virtualization without the pain of creating the packages.
So, the other night we had a catastrophic host failure. Because we're in the middle of transitioning from a pure View VDI solution to a View + Unidesk solution, we have a lot of overhead in our cluster (we're basically running two instances of several of our desktop pools). Instead of being at our end-goal N+2 state, we're actually a lot closer to something like N-1 right now. It's a bad time for a catastrophic host failure. Our specific failure was a bad NIC in an ESXi host, but as fate would have it, that NIC was only being used for management and not for VM traffic, so the desktops kept running without the users being aware of the issue. Since our cluster was so oversubscribed, HA didn't respond to the isolation event, and we finished the day without issue. Because our VM traffic goes through a Distributed vSwitch, we couldn't simply add a VMkernel port to it to get the server back online (although now I'm considering standing up emergency Port Groups on the ESXi management network on all of my teams, just in case). After hours, we decided to try to fix this.
Once our outage window rolled around, we took all of our NICs off of the Distributed vSwitch and put them on a standard vSwitch with a VMkernel interface. Unfortunately, our management VLAN didn't seem to be passed along that trunk (despite earlier requests), and our network guy had a family emergency and so was unavailable. Fortunately, all of these VMs were on shared storage, so we could shut them down from the CLI and then just re-register them on another host (which we hastily stood up to replace the failed host). We did just that, with mixed initial results and, in the long term, broad failure.
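For anyone who hasn't done that CLI shuffle before, here's roughly what it looks like from the ESXi shell. This is a sketch rather than our exact session: the VM name "Desktop-01", the Vmids, and the datastore path are all hypothetical placeholders, and you'd repeat the sequence per desktop.

```shell
# On the failed host (SSH / ESXi Shell): find the VM, shut the guest down,
# and drop it from this host's inventory. The name and Vmid are placeholders.
vim-cmd vmsvc/getallvms | grep Desktop-01   # first column is the Vmid
vim-cmd vmsvc/power.shutdown 42             # graceful guest OS shutdown
vim-cmd vmsvc/unregister 42                 # removes it from inventory; files stay on the datastore

# On the replacement host: register the same .vmx from shared storage and power it on.
vim-cmd solo/registervm /vmfs/volumes/shared-ds/Desktop-01/Desktop-01.vmx
vim-cmd vmsvc/power.on 43                   # Vmid as assigned by the new host
```

Note that `power.shutdown` relies on VMware Tools running in the guest; `power.off` is the hard-power fallback if the guest won't respond.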
When you remove a Unidesk-created VM from your vCenter inventory and then re-register it (even if you keep the ID), things break from Unidesk's perspective. There might be a way to fix it, but the association between Unidesk and that VM is basically severed; Unidesk can no longer interact with the VM from a layer creation/modification perspective (update: thanks to Illinois Guy for the tip about "Synchronize Infrastructure" - that's how you solve this without going through the restore steps). I mentioned that we had mixed initial results: some of our re-registered desktops had BSODs, whereas others came up just fine. Given the inconsistent results and the long-term issue, we decided to go another route.
Unidesk has a built-in backup mechanism that just needs to be enabled. Since all of your OS and application layers are read-only, all it has to back up is the UEP layer. From there, it can rebuild the desktop, complete with the user's changes. After removing the original desktops, we began this restore process. It worked great. All of the restored machines powered right up and registered as available in View, with only one slight problem: since Unidesk is technically rebuilding the VMs, it didn't remove the old entries from View. That meant that the users were still assigned to their original desktop instances in View. After a few minutes of cleanup, we'd removed the stale entries (they were the ones that were not connected and were not associated with any datastore) and reassigned the users to the restored VMs. No problem.
That's only half the story – that all went fine – but I like to poke at and explore things. Rather than restoring my own desktop, I decided to leave the failed box around so that I could take a look at the wreckage... and it was cool. Remember how the UEP layer is actually two vmdk files? Well, if you browse your datastore, you can mess around with those.
I stood up a clean OS from a template (just a normal VM), then added those existing hard drives to it. At first, I was surprised to find that they weren't in my desktop VM's folder on the datastore, but in hindsight it makes sense. Those vmdk files are ultimately associated with the CachePoint – the Unidesk system that manages which layers get attached to which machine. The vmdk file in the VM's folder, if I'm inferring correctly, is simply used to boot that machine and attach the requisite layers. To find my UEP files, I had to browse the CachePoint/UnideskLayers/User/<My Desktop> folder. In there rested the pair of vmdk files that represent my UEP.
To see what I might get, I went ahead and mounted them on my throwaway VM, then assigned them drive letters. Browsing them through Windows Explorer was awesome: I was able to see exactly which files are stored in each vmdk! Unidesk describes the two vmdks of the UEP as basically a Configuration/Application vmdk and an "Everything Else" vmdk, and by browsing them in this fashion, I could see exactly what data was going onto each one. I was able to infer which was which by simply looking for .exe files, which I know go to the Configuration/Application vmdk. This is super cool for a few reasons. First, it can be used for sizing planning, to help determine how large your UEP layers must be (since you can granularly determine where various types of data reside). Second, it gives me access to the UEP layers of the failed desktops, so if my backup were stale, I could retrieve updated versions of files!
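If you'd rather quantify it than eyeball it in Explorer, a quick script can total the bytes per file extension on each mounted vmdk. This is a generic sketch for a Linux helper VM with the disks mounted (nothing Unidesk-specific, and the mount point is a placeholder); on a Windows throwaway VM the same idea works with whatever scripting you prefer.

```shell
# Sum file sizes per extension under a mount point (one UEP vmdk at a time),
# to see which kind of data lands on which disk. Pass the mount point as $1.
size_by_ext() {
  find "$1" -type f -printf '%s\t%f\n' |   # GNU find: "size<TAB>basename" per file
  awk -F'\t' '{
    n = split($2, p, ".")
    ext = (n > 1) ? p[n] : "none"          # text after the last dot, if any
    bytes[ext] += $1
  } END { for (e in bytes) printf "%d %s\n", bytes[e], e }' |
  sort -rn                                 # biggest consumers first
}

# Example (placeholder path): size_by_ext /mnt/uep-config
```

Run it once per vmdk and compare the two lists; the one topped by .exe totals should be the Configuration/Application disk.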
If anyone is curious, my desktop's UEP utilization (which includes some application installers in the profile's "Downloads" folder) comes to 2.11 GB of "Configuration/Application" data and 532 MB of "Everything Else" data after about three weeks of use. My Trend Micro virus definitions seem to have gone to the "Configuration/Application" drive as well. Those numbers are, of course, from a single admin user, so they aren't indicative of what a normal user might have.