VM NIC Hardware Failure Issue
A few customers have been hit by an intermittent issue where virtual machines seem to reject their network adapters. In Windows, this shows up as the guest OS reporting a hardware failure on the NIC (which, given that the NIC is virtual, is a bit of a hard sell). So, while the VM has a network adapter attached to it and that adapter is connected to the network, the VM doesn’t get any network access. If you open up Device Manager, the General tab for the network adapter will show an error (Code 10, if memory serves). When you try to update the driver on the NIC, the installation fails. I’ve only ever seen this with the vmxnet3 adapter, but I’ve heard that it can affect the e1000 as well.
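If you’d rather not click through Device Manager on every guest, you can check for the error from inside the Windows guest with something like this (a minimal sketch, assuming PowerShell 3.0 or later in the guest; Code 10 is Device Manager’s “this device cannot start”):
#Run inside the Windows guest: list network adapters reporting a Device Manager error code
Get-CimInstance Win32_NetworkAdapter -Filter "ConfigManagerErrorCode <> 0" | Select-Object Name, ConfigManagerErrorCode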
Typically, the VM’s network connectivity can be restored by some combination of removing the vNIC from the VM and adding it back (well, it’s technically a new one, as it will have a new MAC address), reinstalling VMware Tools, and/or removing non-present NICs from the guest OS. There doesn’t seem to be any consistency to that process, though; sometimes only one of those steps is required, and other times none of them seem to alleviate the symptoms. We’ve tried all sorts of things to prevent this issue from occurring, but it’s an extremely difficult one to track down.
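For the non-present NIC cleanup, the old ghost-device trick still works. A quick sketch, run from an elevated PowerShell session inside the guest (Device Manager inherits the variable because it’s launched from the same session):
#Expose non-present ("ghost") devices, then enable View > Show hidden devices
#in Device Manager and uninstall the stale NICs
$env:devmgr_show_nonpresent_devices = 1
devmgmt.msc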
Finally, one of my customers tied it to their implementation of vShield Endpoint + Deep Security, and working with VMware Support eventually exposed the problem. Apparently, there’s a memory leak associated with vShield that will eventually fill up an ESXi host’s netGP heap, resulting in all sorts of erratic behavior… including “hardware” failures on virtual NICs.
If you think that you’re suffering from this problem, you can log into the CLI on an ESXi host (either through SSH or a local console) to look at your heap usage. You’ll need to use vsish and then just cat a file (remember, use tab completion to save yourself some typing on that path, since the hex suffix in the heap name varies from host to host):
vsish
cat /system/heaps/netGPHeap-0x4100013cc000/stats
This will spit out a whole bunch of information. There are a few lines that you’re probably interested in:
Maximum heap size: #####
Percent free of max size: ##
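If you’d rather not scroll through the full stats dump, you can run vsish non-interactively and filter for just those lines (assuming your ESXi build’s vsish supports the -e one-shot mode; the ls line finds the per-host heap name mentioned above):
vsish -e ls /system/heaps/ | grep netGP
vsish -e cat /system/heaps/netGPHeap-0x4100013cc000/stats | grep -iE 'maximum heap size|percent free'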
If your “percent free of max size” is 0, you’ve probably got a problem. So, how do you clear out the clutter from that heap… well, just like you would with any other memory leak. Yeah, sorry, but that’s going to be a host restart scenario. If you’re lucky (and most of the time, you will be), you can just throw that host into maintenance mode and restart it without causing an outage. I have seen hosts drop out of vCenter and off the network entirely when that heap fills up, so it can be bad (although that’s been very rare in my experience).
So, why are we interested in that “maximum heap size” line? Well, it turns out that we can grow that heap. It starts out with a maximum of either 64 MB or 80 MB (I don’t recall at the moment and am not in a position to check), but it can go up to 128 MB. The pain of this issue can be mitigated (somewhat) by increasing that heap size. Given the amounts of memory that are likely to be in an ESXi server, I don’t even bat an eyelash at increasing an allocation from 64 MB to 128 MB. At the default heap size, I’ve seen a host with an uptime of only 9 days fill up its heap.
In order to make that change, just go into the host’s Advanced Settings and set the VMkernel.Boot.netGPHeapMaxSize value to 128. Or, you can use this simple PowerCLI script (you’ll probably recognize the skeleton of it) to do it en masse. As always, this script is for educational purposes only, and you should test everything thoroughly before using it in a production environment. Please bear in mind that, in order for the change to take effect, the host must be rebooted after the setting is changed.
#Edit this Get-VMHost command to limit the scope of the script. For example, "Get-VMHost *test*" will only target hosts with "test" in their name
$AllHosts = Get-VMHost
#No need to edit anything below this line
#====================================
foreach ($ThisHost in $AllHosts){
    #$ThisHost is already a VMHost object, so it can be passed straight to -Entity
    Get-AdvancedSetting -Entity $ThisHost -Name VMkernel.Boot.netGPHeapMaxSize | Set-AdvancedSetting -Value 128 -Confirm:$false
}
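Since the new maximum only applies after a reboot, here’s a hedged sketch for draining, restarting, and then verifying a single host (the host name is hypothetical; Set-VMHost will wait on the vMotions if DRS isn’t fully automated):
#Hypothetical host name; edit for your environment
$HostName = "esx01.example.com"
#Drain the host (DRS will vMotion the VMs off if the cluster allows it), then restart it
Set-VMHost -VMHost (Get-VMHost $HostName) -State Maintenance | Out-Null
Restart-VMHost -VMHost (Get-VMHost $HostName) -Confirm:$false
#Once the host has reconnected, confirm the new maximum took effect
Get-VMHost $HostName | Get-AdvancedSetting -Name VMkernel.Boot.netGPHeapMaxSize | Select-Object Entity, Value
And if a host has already dropped out of vCenter entirely, the same drain-and-restart can be done from its local console or an SSH session (assuming the shell still responds; powered-on VMs must be moved or shut down before maintenance mode will complete):
esxcli system maintenanceMode set --enable true
esxcli system shutdown reboot --reason "netGPHeap exhausted"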