PSODs and the iovDisableIR Setting

One of my customers recently came across an issue where their ESXi hosts were randomly crashing with a PSOD.  They had recently applied the latest SPP from HP and the latest ESXi 6.0 patches, and were now occasionally seeing these crashes with messages like "LINT1/NMI (motherboard nonmaskable interrupt), undiagnosed.  This may be a hardware problem..."

As the PSOD message suggested, they had called HP support for help, but weren't making much progress.  I did some googling and found a really interesting blog post from Jason Whitelock about a recent ESXi update causing HP servers to PSOD.  He had come across the exact same issue and had tracked it down to the value of the iovDisableIR setting, which had changed in this latest ESXi update.  When he set it back to its original value, the PSOD issue went away.

As VMware explains it, Interrupt Remapping (the technology that's affected by this setting) enables more efficient IRQ routing and thus improves performance.  Unfortunately, not all hardware supports it well, and some platforms have issues with it, so VMware published a KB article that describes how to identify those issues and how to disable Interrupt Remapping.  As Jason found, HP published a customer advisory stating that their hardware does not suffer from that issue, and that disabling the feature can in fact lead to PSOD crashes.  That advisory includes details on how to turn the feature back on (by setting iovDisableIR to FALSE via ESXCLI).  Since Interrupt Remapping has historically been enabled by default (that is, iovDisableIR defaulted to FALSE), this has only ever been a minor issue and hasn't gathered much attention.

The problem is that VMware recently changed that default.  Jason found that the November ESXi 6.0 patch ESXi600-201611401-BG changed the default (and therefore the effective setting on every ESXi host that was using the default... which is pretty much every ESXi host) to disable Interrupt Remapping.  This put my customer (with all of their HP hardware) into a bad position, as they were now in a configuration that was known to be unstable on HP hardware.

Fortunately, the fix is really easy: just use ESXCLI to change the iovDisableIR setting to FALSE on every affected ESXi host.  As you can probably guess, I wasn't about to SSH into every ESXi host in this environment, so this was a good opportunity to pull out PowerCLI and the Get-EsxCli cmdlet.

So, I wrote a script.  It's pretty simple (and very situational).  It pulls a list of all ESXi hosts that are in maintenance mode, then goes through that list, setting iovDisableIR to "false" on each host and rebooting it (so that the change can take effect).  I added a "reportOnly" switch that causes it to spit out a short report showing the configured iovDisableIR value and the current runtime value per host, to validate that hosts are set correctly after the reboot.  Because there's a reboot built into the script, I hard coded it to only target hosts that are in maintenance mode, which makes the reporting aspect less useful for people who are trying to figure out whether their environment might be affected by this issue.  That said, you can pretty clearly see how that reporting bit works, so put it in a more aggressive forEach loop if you want to check out the whole environment at once ;)
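For readers who just want the shape of that logic, here is a minimal sketch of the approach described above; it is not the original script.  It assumes VMware PowerCLI is loaded and that you are already connected with Connect-VIServer.  The `option` argument name and the `Configured`/`Runtime` property names are my assumptions about how the v2 Get-EsxCli interface maps `esxcli system settings kernel list`, so verify them against your PowerCLI version before relying on this.

```powershell
# Sketch only: assumes PowerCLI is loaded and Connect-VIServer has already been run.
param(
    [switch]$ReportOnly   # mirrors the "reportOnly" switch described in the post
)

# Only target hosts that are already in maintenance mode, because the fix needs a reboot.
$targetHosts = Get-VMHost -State Maintenance

foreach ($thisHost in $targetHosts) {
    $esxcli = Get-EsxCli -VMHost $thisHost -V2

    if ($ReportOnly) {
        # Assumed mapping of: esxcli system settings kernel list -o iovDisableIR
        $setting = $esxcli.system.settings.kernel.list.Invoke(@{ option = 'iovDisableIR' })

        # Property names (Configured, Runtime) are assumed from the esxcli output columns.
        [pscustomobject]@{
            VMHost     = $thisHost.Name
            Configured = $setting.Configured
            Runtime    = $setting.Runtime
        }
        continue
    }

    # Set iovDisableIR to FALSE, which re-enables Interrupt Remapping.
    $hostArgs = $esxcli.system.settings.kernel.set.CreateArgs()
    $hostArgs.setting = 'iovDisableIR'
    $hostArgs.value   = 'FALSE'
    $esxcli.system.settings.kernel.set.Invoke($hostArgs) | Out-Null

    # The new value only takes effect after a reboot.
    Restart-VMHost -VMHost $thisHost -Confirm:$false | Out-Null
}
```

To survey the whole environment rather than just the hosts in maintenance mode, one option is to copy only the report branch into a loop over a plain `Get-VMHost`, leaving the set and reboot steps out.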

As always, this script is provided as is with no guarantees.  While it worked for me in my particular situation, that's no guarantee that it'll work for you in yours.  Test thoroughly and always make sure that you understand what a script is doing before executing it.

2/15 Update:  Ariel Sánchez, in the comments below, pointed out that VMware has published a KB article about this issue.

Comments

  1. Hi! found your post when trying to find the powercli. nice script! Here is the KB https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2149043

    1. Thanks for the pointer to that KB; I'll add a link to it in the blog post.

  2. I must be missing something because I can't get CreateArgs() working for set... PowerCLI 6.3R1

    [PS]> $hostArgs = $esxcli_v2.system.settings.kernel.set.CreateArgs()
    Method invocation failed because [System.Management.Automation.PSMethod] does not contain a method named 'CreateArgs'.
    At line:1 char:1
    + $hostArgs = $esxcli_v2.system.settings.kernel.set.CreateArgs()
    + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo : InvalidOperation: (:) [], RuntimeException
    + FullyQualifiedErrorId : MethodNotFound

    [PS]> $esxcli_v2.system.settings.kernel.set

    OverloadDefinitions
    -------------------
    void Set(int , System.Object )

    1. I'd be happy to help you with your script, but you're gonna have to post more of it. From the error message, I'd say that it looks like you didn't specify the -v2 parameter on your get-esxcli command (although given your variable name, I'm gonna guess that you did).

      $esxcli = get-esxcli -VMHost $thisHost -v2

      When I just call .set on a v2 esxcli, I get back a list of methods, including CreateArgs()

    2. Nevermind, figured it out :D

