Validating HA by Inducing a PSOD

After installing an ESXi cluster, of course you want to validate its functionality.  vMotion/DRS are easy to validate, but how about HA?  Just pull the power cable out of the back of one of your ESXi hosts, right?

Well, that will do it... but it's also a bit risky.  I've seen host configurations fail to come back from an unexpected power failure and I've heard stories of hardware damage stemming from such an event.  So, a power failure is probably not the best option... so what else can you try?

You could always induce a Host Isolation event.  Simply disconnect all of the network cables from the ESXi host, which will trigger the specified HA response.  That works, but it's a bit of a pain, as you're likely going to need to unplug several cables and keep track of exactly where they belong.  If you're remote, you'll need to involve the network team to get them to shut down specific ports (and you'd better hope that there's no miscommunication regarding which ports you're working with).

I'm not a big fan of that approach either.  Fortunately, there is an easy option that I've had good experience with: inducing a Purple Screen of Death.  From the command line of an ESXi host (either the local console or an SSH session), you can run a command that causes a PSOD.  This is awesome, as it allows you to both verify that HA is working and validate that your scratch/core dumps are being written to persistent storage.  The process is pretty simple, but before you do this, make sure that you've got some way to restart the server (out-of-band management or someone on site), as it is going to drop off the network ;)

I like to run two commands, after disabling DRS.  First, I use:

esxcli vm process list

This lists every VM currently running on the ESXi host, along with details such as its World ID and config file path.  This is where I make sure that I don't have any production workloads running on the ESXi host and that I have the expected test VMs on there.  Once I'm confident that this is the correct ESXi server and that I'm not going to impact production VMs, I use this command to cause the actual PSOD:

vsish -e set /reliability/crashMe/Panic 1

And that's that.  After issuing that command, you'll either see your PSOD (if you're at the console) or your SSH session will terminate.  In about 30 seconds, your VMs will start failing over to the remaining hosts within your cluster and they should be back online within a few minutes.  Once you're sure that everything's good, reset the "failed" ESXi host and then it's back to business as usual.
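
If you'd rather not rely on eyeballing the VM list before crashing a host, the pre-check can be scripted.  What follows is only a rough sketch under a couple of assumptions: that each VM block in the esxcli vm process list output carries a "Display Name:" line, and that your test VMs share a naming prefix.  The "test-" prefix and the function names non_test_vms and safe_to_crash are made up for illustration.

```shell
#!/bin/sh
# Sketch: refuse to proceed if any running VM does not look like a test VM.
# Assumption: `esxcli vm process list` prints a "Display Name: <name>" line
# for each running VM. The prefix passed in is a hypothetical naming scheme.

# Reads `esxcli vm process list` output on stdin and prints the display
# name of every VM that does NOT start with the given prefix.
non_test_vms() {
    prefix="$1"
    sed -n 's/^[[:space:]]*Display Name: //p' | while IFS= read -r name; do
        case "$name" in
            "$prefix"*) ;;                  # expected test VM, ignore it
            *) printf '%s\n' "$name" ;;     # anything else is unexpected
        esac
    done
}

# Returns 0 only when stdin lists no unexpected VMs; otherwise prints the
# offending names to stderr and returns 1.
safe_to_crash() {
    unexpected=$(non_test_vms "$1")
    if [ -n "$unexpected" ]; then
        echo "Unexpected VMs on this host:" >&2
        printf '%s\n' "$unexpected" >&2
        return 1
    fi
    return 0
}

# Example use on the host (left commented out on purpose):
# esxcli vm process list | safe_to_crash "test-" \
#     && vsish -e set /reliability/crashMe/Panic 1
```

Chaining the panic command behind the check means a stray production VM stops the test instead of getting crashed along with the host.  It doesn't replace confirming that you're on the right ESXi server in the first place.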

