Non-Responsive ESXi Hosts from the HP ESXi ISO
One of my customers called me for some help troubleshooting a backup issue. For some inexplicable reason, their VM-based backup solution was failing for a bunch of VMs on a specific ESXi host. When I got there, the first thing that I checked was the host's tasks and events. It listed a whole bunch of failed vMotion attempts for one particular VM with no VMware Tools installed, so I thought that I'd take a peek at the VM console to see what I could see. That failed with a fairly generic message: "Unable to connect to the MKS: Connection terminated by server"
9 times out of 10, that error indicates that there's a firewall between the ESXi host and the client. It turns out that this was the 1 time out of 10, because some subsequent network troubleshooting revealed that there was nothing odd going on in that space. My next troubleshooting step was pretty obvious: check the ESXi host logs to see if anything stood out. So, I logged into the local console of the ESXi server and tried to turn on the local shell. It looked like it worked, but it didn't; when I went back into the troubleshooting section, it was disabled once again.
At this point, things were looking a little bit odd. So, I decided to restart the management agents on the host... and that's when things quickly went from odd to ugly. When I tried restarting the management agents, the message "can't fork" started scrolling endlessly on the screen; my host's local console was stuck in a loop, and both SSH and the local command line were disabled. And vMotion wasn't working. Yikes.
One of my coworkers, Jeff Green, joined in on the troubleshooting effort and tracked down an interesting VMware KB article. It turns out that there's an issue with some versions of the HP AMS (Agentless Management Service) that's installed with the HP ESXi ISO. Long story short: it's got a memory leak. HP's got a fix out, but you have to know about the problem before you can know to apply the fix. There's a workaround (stop the HP-AMS service), but you need CLI access to the ESXi host to do it... which we couldn't get because this bug was preventing us from turning it on. Yuk.
Fortunately, we found an interesting quirk of this issue. Chris Chua posted his own experiences with this AMS memory leak issue, and he noticed that you could still vMotion off of an afflicted ESXi host, just not onto one. At this site, we lucked out; one ESXi host in the cluster was still functional. We logged into it very quickly and stopped the service (by following VMware's process). Fortunately, this cluster was less than 50% utilized, so we were able to evacuate all of the VMs from one of the other hosts onto this functioning one, bounce the broken system, and then remediate it. Repeating that process one host at a time let us work through all of the ESXi hosts in the environment.
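If you'd rather script the evacuation than drag each VM around in the client, something like the following PowerCLI sketch works, since vMotion off an afflicted host still functions. This is only a rough illustration; the host names are placeholders, and it assumes an existing Connect-VIServer session.
# Placeholder host names; assumes you're already connected with Connect-VIServer
$brokenHost = Get-VMHost -Name "esxi-broken.example.local"
$goodHost = Get-VMHost -Name "esxi-good.example.local"
# vMotion every VM off the afflicted host onto the healthy one
Get-VM -Location $brokenHost | Move-VM -Destination $goodHost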
This customer decided that they'd rather just disable the AMS service across their environment. I'm stealing some of Jeff's scripting thunder here, but since we had a lot of ESXi hosts to make this change on, he put together a PowerShell script that generates a plink command for each ESXi host in the environment. We then put each host into maintenance mode, ran the corresponding plink command that his script generated, and rebooted it (so that the service could be fully removed). Anyway, here's his script - thanks for your help, Jeff!
# Find all HP-manufactured hosts in the connected vCenter
$x = get-vmhost | ? {$_.Manufacturer -match "HP"}
$x | % {
    $fqdn = $_.name
    $shortname = $fqdn.split(".")[0]
    # Edit this to be the hosts' root password
    # Remember to escape special characters
    $pass = ("`"Password`"")
    # Emit a plink one-liner that stops hp-ams, confirms it's stopped, and removes the hp-ams VIB
    write-output ("plink root@$fqdn -pw $pass `"/etc/init.d/hp-ams.sh stop && /etc/init.d/hp-ams.sh status && esxcli software vib remove -n hp-ams`"")
} | sort
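For the per-host wrap-around (maintenance mode in, run the generated plink line, reboot), PowerCLI can handle the vCenter side as well. This is just a sketch of what we did by hand, assuming $fqdn holds the host name used in the script above and that the cluster can absorb the evacuated VMs:
# Rough per-host remediation flow; $fqdn is the host's FQDN from the script above
$vmhost = Get-VMHost -Name $fqdn
# Enter maintenance mode (in a fully automated DRS cluster, the VMs get evacuated for you)
Set-VMHost -VMHost $vmhost -State Maintenance | Out-Null
# Run the plink one-liner that the script generated for this host at this point,
# then reboot so the hp-ams removal fully takes effect
Restart-VMHost -VMHost $vmhost -Confirm:$false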