Restarting VMs after a Datacenter Down Event
One of my customers recently had a catastrophic thermal event in one of their datacenters and had to shut down all of their infrastructure at that site. After the cooling issue was resolved, we were asked to help them get their infrastructure back online. Fortunately, we include several small details as best practices in our vSphere designs, and one of those really paid off for us: we always create a VM-to-Host affinity rule that keeps one Domain Controller, the vCenter Server, its PSC, and its database (if external) on a known host in the management cluster.
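For anyone who wants to set up a similar rule from PowerCLI, here is a minimal sketch rather than the exact rule from our designs. Every cluster, host, VM, group, and rule name in it is a placeholder, and it assumes a PowerCLI release that includes the DRS group cmdlets plus an open Connect-VIServer session. The choice of a "should" rule is also an assumption on my part for this example.

```powershell
# Sketch only: every name below is a placeholder, not taken from a real environment.
$cluster = Get-Cluster -Name 'Management'

# Group the core infrastructure VMs (Domain Controller, vCenter, PSC, external database).
$vmGroup = New-DrsClusterGroup -Name 'Core-Infra-VMs' -Cluster $cluster `
    -VM (Get-VM -Name 'dc01', 'vcenter01', 'psc01', 'vcdb01')

# Group containing the single "known" host.
$hostGroup = New-DrsClusterGroup -Name 'Core-Infra-Host' -Cluster $cluster `
    -VMHost (Get-VMHost -Name 'esx-mgmt-01.example.local')

# ShouldRunOn is a soft preference; MustRunOn would make the placement mandatory.
New-DrsVMHostRule -Name 'Core-Infra-On-Known-Host' -Cluster $cluster `
    -VMGroup $vmGroup -VMHostGroup $hostGroup -Type 'ShouldRunOn'
```

A soft rule still lets HA restart these VMs on another host if the known host itself fails, which is usually what you want for vCenter and a Domain Controller.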
So, after the SAN was powered back on and we restarted the physical ESXi servers, we knew exactly what to do. I fired up the vSphere Client and logged in to that ESXi host in the management cluster as root. From there, I was able to easily find those core infrastructure VMs and power them all on. Once they were running, I logged in to vCenter... and found that I had an interesting challenge.
We needed to turn the rest of the VMs back on, but we all knew that quite a few VMs in this environment had not been running before the event, and no one wanted them to be running now. How to solve this?
Well, it was actually pretty easy. I wrote up a quick PowerCLI script that checked the vCenter event database for VM Shutdown or Power Off events (some of the VMs did not have VMware Tools installed and so were hard powered off). Once that list of VMs is collected, the script loops through it and attempts to power on any VM that isn't already running (since, in this situation, some admins had been powering on high-impact VMs by hand as soon as vCenter was back online).
It's a pretty simple script, but it worked great for us! Please bear in mind that this was written in a single morning (under a bit of stress...), so while it worked for us there's no guarantee that it'll work for you. As always, test thoroughly, use this for educational purposes and, if you fix something, please let me know!
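To make the approach concrete, here is a minimal sketch of the logic described above. It is an illustration written for this post's readers, not the original script: the function names are hypothetical, the time window is something you supply yourself, and it assumes PowerCLI is loaded with an open Connect-VIServer session. Matching on the `VmGuestShutdownEvent` and `VmPoweredOffEvent` type names is my assumption about how to pick out the shutdown and power-off events.

```powershell
# Sketch only: assumes PowerCLI is loaded and Connect-VIServer has already been run.

# Returns the unique names of VMs with a guest shutdown or power-off event in the window.
function Get-StoppedVMName {
    param(
        [datetime]$Start,
        [datetime]$Finish = (Get-Date)
    )
    # vSphere event types for a guest OS shutdown request and a completed power-off.
    $eventTypes = 'VmGuestShutdownEvent', 'VmPoweredOffEvent'

    Get-VIEvent -Start $Start -Finish $Finish -MaxSamples ([int]::MaxValue) |
        Where-Object { $eventTypes -contains $_.GetType().Name } |
        ForEach-Object { $_.Vm.Name } |
        Sort-Object -Unique
}

# Powers on each of those VMs unless it is already running or no longer exists.
function Start-StoppedVM {
    param(
        [datetime]$Start,
        [datetime]$Finish = (Get-Date)
    )
    foreach ($name in (Get-StoppedVMName -Start $Start -Finish $Finish)) {
        Get-VM -Name $name -ErrorAction SilentlyContinue |
            Where-Object { $_.PowerState -eq 'PoweredOff' } |
            Start-VM -Confirm:$false | Out-Null
    }
}
```

Something like `Start-StoppedVM -Start (Get-Date).AddDays(-2)` would then cover a two-day window. VMs that were already off before the event have no shutdown or power-off event inside the window, so they are left alone.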
Hi Jason, nice script; I really appreciate your efforts. I have a question: if we have a huge infrastructure and try to boot all of the VMs at the same time, we may get hit with a boot storm on the storage. Maybe pulling the information from the DB and creating batches would help. Please correct me if I am wrong.
You are absolutely correct. In this case, the admins had already turned on all of the high-priority systems (they got about 50% of their VMs) while I was putting this script together, so it only had to turn on the remainder. In order to avoid a boot storm, just put "Start-Sleep 15" or something similar in the Power On VMs foreach loop; that'll add a 15-second delay between each VM power-on operation.
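As a rough illustration of both ideas in this exchange (a delay between power-ons, and batching), here is a minimal sketch. The function name `Start-VMThrottled` and the default batch size are hypothetical, and it assumes PowerCLI is loaded with an open vCenter session; only the 15-second delay comes from the reply above.

```powershell
# Sketch only: assumes PowerCLI is loaded and a vCenter session is open.
function Start-VMThrottled {
    param(
        [string[]]$VMName,        # VM names gathered from the event query
        [int]$BatchSize = 5,      # hypothetical default; tune to what the storage tolerates
        [int]$DelaySeconds = 15   # the delay suggested in the reply above
    )
    for ($i = 0; $i -lt $VMName.Count; $i += $BatchSize) {
        # Index of the last VM in this batch, clamped to the end of the list.
        $last  = [Math]::Min($i + $BatchSize, $VMName.Count) - 1
        $batch = $VMName[$i..$last]

        # Power on only the VMs in this batch that are still powered off.
        Get-VM -Name $batch -ErrorAction SilentlyContinue |
            Where-Object { $_.PowerState -eq 'PoweredOff' } |
            Start-VM -Confirm:$false | Out-Null

        # Pause between batches, but not after the final one.
        if ($last -lt $VMName.Count - 1) {
            Start-Sleep -Seconds $DelaySeconds
        }
    }
}
```

Calling it with `-BatchSize 1` gives the one-VM-at-a-time behaviour with a pause between each power-on; a larger batch size trades some storage load for a shorter overall recovery time.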
Perfect 👍