Recovering Unidesk Desktops from a Catastrophic Host Failure
I've recently had the opportunity to deploy Unidesk at one
of my customer sites for their VDI solution.
I was preparing to write up a post about the install process, but it was
actually very simple and is extremely well documented on their site (which is
sadly behind an authentication wall, so it is not indexed by Google). If I did end up writing such an article, it
would end up just being a big love letter and wouldn’t be particularly
interesting to anyone (well, with the possible exception of the Unidesk sales
team). Instead, I’m going to write this
article about some of the things that I've broken and fixed, and some of the
interesting things that you can do from a troubleshooting perspective.
First, a brief overview of Unidesk. It’s a layering technology. A given desktop is basically a collection of
read-only vmdk files that are stacked together by Unidesk’s specialized
Windows driver so that they look like a single hard drive. On one of those vmdk files you’ll have an OS
installed and on all of the others you’ll have applications (sometimes a single
application on a layer, sometimes a collection… whatever makes sense for your
organization). Sitting on top of that
stack is a “User Experience Package” which is actually a pair of writable vmdk
files where any user customization goes.
When you apply updates, you simply adjust the desired layer and tell the
system to rebuild the desktops with the updated layer. All other layers (including the UEP layer)
are unaffected and remain in place, so the user has a consistent desktop
experience. Pretty cool, when it comes
right down to it. It’s like having the
flexibility of application virtualization without the pain of creating the
packages.
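To make the layering idea concrete, here is a toy Python model. It is only a conceptual sketch: the file paths, layer contents, and the `compose` helper are made up for illustration, the lookup priority between layers is an assumption, and it says nothing about how Unidesk's driver actually merges the vmdk files.

```python
from collections import ChainMap
from types import MappingProxyType

# Hypothetical layer contents (path -> file version), for illustration only.
os_layer = MappingProxyType({r"C:\Windows\notepad.exe": "os v1"})
app_layer_v1 = MappingProxyType({r"C:\Program Files\App\app.exe": "app v1"})
app_layer_v2 = MappingProxyType({r"C:\Program Files\App\app.exe": "app v2"})

# The writable user layer (the UEP in this toy model).
uep = {}

def compose(uep, app_layers, os_layer):
    """Stack the layers into one view of the disk.

    ChainMap reads fall through the maps from left to right, and writes
    always land in the first map, which here is the writable UEP.
    """
    return ChainMap(uep, *app_layers, os_layer)

# Build the desktop and let the user save a file.
desktop = compose(uep, [app_layer_v1], os_layer)
desktop[r"C:\Users\me\notes.txt"] = "user data"

# "Apply an update": rebuild the stack with the new application layer.
# The UEP dictionary is reused as-is, so the user's file survives.
desktop = compose(uep, [app_layer_v2], os_layer)

print(desktop[r"C:\Program Files\App\app.exe"])
print(desktop[r"C:\Users\me\notes.txt"])
```

The read-only layers are wrapped in `MappingProxyType` so that nothing can write to them by accident; only the UEP dictionary ever changes.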
So, the other night we had a catastrophic host failure. Because we’re in the middle of transitioning
from a pure View VDI solution to a View + Unidesk solution, we have a lot of
overhead in our cluster (we’re basically running 2 instances of several of our
desktop pools). Instead of being at our
end goal of N+2, we’re actually a lot closer to something like N-1 right
now. It’s a bad time for a catastrophic
host failure. Our specific failure was a
bad NIC in an ESXi host, but as fate would have it, that NIC was only being
used for management and not for VM traffic, so the desktops kept running
without the users being aware of the issue.
Since our cluster was so oversubscribed, HA didn't respond to the
isolation event and we finished the day without issue. Because our VM traffic is going through a Distributed
vSwitch, we couldn’t simply add a VMK port to it to get the server back online
(although now, I’m considering standing up emergency Port Groups on the ESXi
Management network on all of my teams, just in case). After hours, we decided to try to fix this.
Once our outage window rolled around, we took all of our
NICs from the Distributed vSwitch and put them on a standard vSwitch with a VMK
interface. Unfortunately, our management
VLAN didn't seem to be passed along that trunk (despite earlier requests) and
our network guy had a family emergency and so was unavailable. Well, all of these VMs were on Shared
Storage, so we could shut them down from the CLI and then just re-register them
on another host (which we hastily stood up to replace the failed host). We did just that, but with mixed initial
results and, in the long term, broad failure.
When you remove a Unidesk-created VM from your vCenter
inventory and then re-register it (even if you keep the ID), things break from
Unidesk's perspective. There might be a
way to fix it, but the association between Unidesk and that VM is basically
severed; it can no longer interact with the VM from a layer
creation/modification perspective (update: thanks to Illinois Guy for the tip about "Synchronize Infrastructure" - that's how you solve this without going through the restore steps). I
mentioned that we had initial mixed results; some of our re-registered desktops
had BSODs, whereas others came up just fine.
Due to the inconsistent results and the long-term issue, we decided to
go another route.
Unidesk has a built-in backup mechanism that just needs to
be enabled. Since all of your OS and
Application layers are read-only, all that it has to back up is the UEP
layer. From there, it can rebuild the
desktop, complete with the user’s changes.
After removing the original desktops, we began this restore
process. It worked great. All of the restored machines powered right up
and registered as available in View, with only a slight problem. Since Unidesk is technically rebuilding the
VMs, it didn't remove the old entries from View. That meant that the users were still assigned
to their original desktop instances in View.
After a few minutes of cleanup, we'd removed the stale versions (they
were the ones that were not connected and were not associated with any datastore)
and reassigned the users to the restored VMs.
No problem.
That’s only half the story – that all went fine – but I like
to poke and explore things. Rather than
restoring my desktop, I decided to leave the failed box around so that I could
take a look at the wreckage… and it was cool.
Remember how the UEP layer is actually 2 VMDK files? Well, if you browse your datastore, you can
mess around with those.
I stood up a clean OS from a template (just a normal VM),
then added those existing hard drives to my VM.
At first, I was surprised to find that they weren’t in my desktop VM’s
folder on the datastore, but in hindsight it makes sense. Those vmdk files are ultimately associated
with the CachePoint – the Unidesk system that manages which layers get attached
to which machine. The vmdk file in the
VM's folder, if I am inferring things correctly, is simply used to boot that
machine and attach the requisite layers.
To find my UEP files, I had to browse my Cachepoint/UnideskLayers/User/<My
Desktop> folder. In there rested the
pair of vmdk files that represent my UEP.
To see what I might get, I went ahead and mounted them to my
throw-away VM, then assigned them drive letters. Browsing them through Windows Explorer was
awesome, as I was able to see exactly which files are stored in each vmdk! Unidesk describes the two vmdks of the UEP as
basically a Configuration/Application vmdk and an “Everything Else” vmdk. By browsing the vmdks in this fashion, I
could see exactly what data was going onto each one! I was able to infer which was which by simply
looking for .exe files, which I know go to the Configuration/Application
vmdk. This is super cool for a few
reasons. Firstly, it can be used for
sizing planning to help determine how large your UEP layers must be (since you
can granularly determine where various types of data reside). Secondly, it gives me access to the UEP
layers of the failed desktops, so if my backup was stale, I could retrieve
updated versions of files!
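If you would rather not eyeball the drives in Explorer, the same inspection can be scripted. The sketch below walks a mounted volume, totals the bytes per file extension, and counts .exe files, which is the clue I used to tell the two vmdks apart. The `summarize_volume` name and the `E:\` and `F:\` drive letters are placeholders of my own; substitute whatever letters you assigned.

```python
import os
from collections import Counter

def summarize_volume(root):
    """Walk a mounted volume and return (total_bytes, bytes_by_extension, exe_count)."""
    total = 0
    by_ext = Counter()
    exe_count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # skip files we cannot read (locked, broken links, etc.)
            ext = os.path.splitext(name)[1].lower()
            total += size
            by_ext[ext] += size
            if ext == ".exe":
                exe_count += 1
    return total, by_ext, exe_count

if __name__ == "__main__":
    # Hypothetical drive letters for the two mounted UEP vmdks.
    for drive in ("E:\\", "F:\\"):
        total, by_ext, exe_count = summarize_volume(drive)
        print(f"{drive} {total / 1024**2:.1f} MB total, {exe_count} .exe files")
        for ext, size in by_ext.most_common(5):
            print(f"  {ext or '(no extension)'}: {size / 1024**2:.1f} MB")
```

The volume with the .exe files should be the Configuration/Application vmdk, and the per-extension totals give a rough starting point for sizing.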
If anyone is curious, my desktop's UEP utilization (which has
some application installers in the profile’s “downloads” folder) comes to 2.11
GB of “Configuration/Application” data and 532 MB of “Everything Else” data
after about 3 weeks of use. My Trend
Micro virus definitions seem to have gone to the “Configuration/Application”
drive as well. Those numbers are, of
course, from a single admin user, so they aren't indicative of what a normal
user might have.
nice review Jason,
One thing of note is that if you unregister and re-register a VM you are DEAD ON RIGHT that it breaks some of the logical links between the desktop and Unidesk.
You must have been on RC Code for 2.0 as the release code has our "Synchronize Infrastructure" button in the System tab that allows you to fix those links for the situation you ran into.
We are indeed on RC 2 for 2.0. I didn't realize that that Synchronize would also resolve those issues; I thought it just determined what user accounts were associated with each desktop. Thanks for the pointer!
Great looking blog Jason! Thanks for the link!
Jason -
Thanks for posting your real-world experience with Unidesk ... and for really digging deep to understand what's going on "under the covers." It will be incredibly valuable for people evaluating Unidesk to get these insights. Thanks, also, for your great "What is Unidesk" overview.
We look forward to working with you at this client and others in the future.
Best regards,
Don Bulens, Unidesk CEO