Recovering Unidesk Desktops, Deleted from View/vSphere


Update: I just posted a follow-up to this post with an easier process for recovering a lost Unidesk Application Layer VMDK.  This post still has a lot of good information about what goes wrong with the system when one of those VMDK files is deleted, so it's still probably worth a read.

We had a bit of a PEBCAK issue recently where an administrator (read: me) was cleaning up some older View-only desktop pools to free up resources for more Unidesk desktops.  Inadvertently, a Unidesk desktop pool was deleted from the View Administrator with the option to delete all VMs from disk selected.  Oops.  By and large, you could probably do this and get away without any issues (aside from needing to restore those desktops), unless you deleted all of the desktops that are using a particular instance of an Application Layer.  Let’s look at what happens in that situation (based on what happened when we accidentally did just that).

When you open up the View Administrator and tell it to delete a VM from the infrastructure, it does just that.  If that VM is a Composer Linked Clone, it knows to follow a special cleanup procedure, but otherwise it just looks at that VM and deletes all of the files that are associated with it.  That means that it’s going to remove the VMX configuration file, the swap file and all associated VMDK files.  In Unidesk, each CachePoint has its own clone of each Application Layer (and OS Layer) that its desktops are using.  Every desktop on that CachePoint points back to that same VMDK file.  When View Administrator tries to delete the VM, it tries to delete all of the read-only OS and Application Layer VMDKs that are managed by the CachePoint, not just the UEP and Boot Layers that belong to that individual VM!

Generally, this probably won’t hurt anything.  vSphere (should that “v” be upper case because it’s starting a sentence?) is smart enough to prevent you from deleting a VMDK that is in use by a powered-on virtual machine, and View has to play by those same rules.  Since those Layers are common between every desktop in the environment, they’re probably going to be locked by other, powered-on VMs.  That is, unless you delete every VM that is using a particular Application or OS Layer.  When you do that, by the time View comes around to deleting the final VM, there are no other VMs holding that VMDK open, and so View happily removes it from the file system.  It even looks like it helpfully removes the (now empty) folder that contained that VMDK.
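
To make that "last one out deletes the shared disk" behavior concrete, here is a toy Python model of it.  This is only a sketch: it does not call vSphere or View, and the `Datastore` class, the file paths and the desktop names are all made up for illustration.  It assumes a VMDK can only be removed once no other powered-on VM holds it.

```python
# Toy model only: this is not the vSphere or View API. All names are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Datastore:
    # Maps each VMDK path to the set of powered-on VMs holding a lock on it.
    locks: dict = field(default_factory=dict)

    def attach(self, vm, vmdks):
        """Register a powered-on VM as a user of each VMDK."""
        for path in vmdks:
            self.locks.setdefault(path, set()).add(vm)

    def delete_vm(self, vm, vmdks):
        """Release this VM's locks, then delete every VMDK nobody else holds.

        Returns the list of files that were actually removed.
        """
        removed = []
        for path in vmdks:
            holders = self.locks.get(path, set())
            holders.discard(vm)      # this VM releases its own lock
            if not holders:          # no other VM is using the file
                self.locks.pop(path, None)
                removed.append(path)
        return removed


shared = "cachepoint/app_layer.vmdk"   # the layer every desktop points to
ds = Datastore()
desktops = {
    name: [shared, f"{name}/uep.vmdk", f"{name}/boot.vmdk"]
    for name in ("desk1", "desk2", "desk3")
}
for name, disks in desktops.items():
    ds.attach(name, disks)

# Delete the whole pool, one desktop at a time.
results = {name: ds.delete_vm(name, disks) for name, disks in desktops.items()}
for name, removed in results.items():
    print(name, "removed", removed)
```

In this model the first two deletions only remove each desktop's own UEP and Boot disks, because the shared layer is still held by the remaining desktops.  The third deletion finds no other holder and takes the shared layer with it.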

Of course, Unidesk isn’t aware of this change (the fact that one of its CachePoints has lost a VMDK file that it knows should be there), so things come a bit unglued.  When you try to deploy a VM with that Layer on that CachePoint (the instance on the Master CachePoint is still fine, of course), Unidesk thinks that the Layer already exists there and so tries to attach it.  When it can’t read from it, it throws the error message: “The CachePoint Appliance could not create the boot image.  Error is: A Layer appears to be corrupted.”

So, the fastest way to get affected desktops back online is to restore them to a different CachePoint.  Since the issue is related to that specific instance of the Application Layer, which is contained on a single CachePoint, you can get those desktops up and running without any issues on any other CachePoint in the environment (one that did not have that Layer forcibly removed).  Of course, that’s just the step to take to swiftly get the users back to a state where they can work; it doesn’t actually solve the problem.  The problem is the database entry that causes the CachePoint to think that it has a Layer that it doesn’t actually have.

I expect that a support call with Unidesk would have yielded some direct way of solving the issue, but we have enough access to solve it indirectly.  What I did was just create a new Version of the affected Layer.  No changes necessary, just a new Version number.  I then updated the affected desktops to use the new version.  Since Unidesk knows that it’s a new version that that CachePoint has never seen before, it knows that it has to copy it onto that device.  This means that the desktops power up fine with the new version.  To remove the references to the old, missing version, just get the whole environment updated to this new version (a low priority action, just required for cleanup) and delete the old version of the Layer from the Unidesk Manager.  This will signal all of the CachePoints to delete the VMDKs associated with that Version and to remove all references to it.
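
Here is a toy Python model of why the new-Version trick works.  It is only a sketch under my own assumptions: Unidesk's real database and APIs are not shown here, and the `CachePoint` class, its `db` and `disk` sets, and the "Office" layer name are all hypothetical.  The one assumption that matters is the one described above: a CachePoint only copies a Layer version it has no record of.

```python
# Toy model only: not Unidesk's real data model or API. All names are hypothetical.
class CachePoint:
    def __init__(self):
        self.db = set()     # (layer, version) pairs the CachePoint believes it holds
        self.disk = set()   # (layer, version) pairs that really exist as VMDKs

    def deploy(self, layer, version):
        """Attach a layer version to a desktop, copying it in only if unknown."""
        key = (layer, version)
        if key not in self.db:       # never seen: copy from the Master CachePoint
            self.db.add(key)
            self.disk.add(key)
        if key not in self.disk:     # the record exists but the file does not
            raise RuntimeError("A Layer appears to be corrupted.")
        return key

    def retire(self, layer, version):
        """Delete a version: drop the VMDK (if any) and every reference to it."""
        self.db.discard((layer, version))
        self.disk.discard((layer, version))


cp = CachePoint()
cp.deploy("Office", 1)
cp.disk.discard(("Office", 1))       # View deletes the VMDK behind Unidesk's back

try:
    cp.deploy("Office", 1)           # stale record: no copy is attempted
    stale_error = None
except RuntimeError as exc:
    stale_error = str(exc)

new_key = cp.deploy("Office", 2)     # a new version forces a fresh copy
cp.retire("Office", 1)               # cleanup removes the stale record
print(stale_error, new_key, cp.db == cp.disk)
```

Redeploying version 1 fails because the record says the file is already there, while version 2 has no record and so gets copied in fresh.  Retiring version 1 afterwards brings the record and the disk back into agreement.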

As I implied, I didn’t actually end up calling Support on this one, so there might be a more elegant method to solve it (I tried the Synchronize Infrastructure button this time, but as you might expect, it didn’t go deep enough to detect and resolve this).  That said, this process was pretty easy to apply and everything seems to be running smooth as clockwork since.  Plus, solving it myself gave me a great opportunity to poke around under the covers and learn some more about what ties this whole thing together.  It’s perfectly logical, so when you see a bit of the inner workings you can almost always figure out what’s going on!

Comments

  1. Jason, we are looking at Unidesk to manage our VDI environment. How do you like the product, and what are the problems or "gotchas" that you've experienced?

  2. I like it quite a bit and recommend it to almost all of my VDI customers. If you're rolling out VDI to more than 10% of an organization, you need some sort of flexible application delivery tool. Of the ones that I've used, Unidesk is my favorite. It also has the side effect of giving you persistent desktops while not sacrificing centralized manageability; pretty cool.

    There aren't too many serious gotchas; proper design and operation will protect you from them. That said, I'll draw your attention to two things:

    1) Storage Density
    Each LUN will have a Unidesk CachePoint on it, which performs operations on the VDI desktops on that LUN. Each CachePoint is limited to 4 concurrent operations. On a VMFS volume, you shouldn't go past the 60-70 VM mark; a CachePoint takes about 2 hours (depending on the applications) to rebuild that many machines. Since you have 1 CachePoint per 60-70 machines, your environment takes about 2 hours to completely rebuild (for example, in the event of an OS patch). In an NFS environment, where you might have 250+ VMs per datastore, the rebuild times correspondingly go up. Then again, if you've got awesome storage, the rebuild times go down, so it's a factor that must be considered during the design phase.

    2) Update Procedures
    It's possible to put updates anywhere; please put them on the layer of the application that is being updated, unless you've got a really good reason to put it elsewhere. That rule is especially important for OS updates. While it is generally possible to put Windows updates onto an Application Layer (and I've heard of some people doing it successfully), it complicates things and can lead to some unexpected behaviors as Windows may need to shift kernels mid boot. By applying OS updates to your OS Layer, things just work, which makes everyone happy.

    Oh, and one bonus one - in any VDI solution, use KMS licenses rather than MAKs; it just makes life a lot easier.

    Replies
    1. Thanks Jason for those tips. We've yet to actually play with the product and are still trying to evaluate other solutions including VMware Mirage.

      I don't suppose I can contact you via email with other questions?

    2. Sure - I don't want to post my email here (due to spam-bots), but I'm a consultant with http://www.ens-inc.com/ and we use the standard (first initial)(lastname)@(domain) naming convention for our email addresses. I suspect that Blogger has some other, more elegant private messaging capability, but my goal with this blog is to invest as much of my time in content and as little time in blog maintenance as possible ;)

  3. Jason - I've been using Unidesk lately and love it. Thanks for the great tip about the missing application layer. Very smart.
