Thursday, April 20, 2017

PowerShell String Manipulation of Formatted Text in Columns

Every now and then, I find myself needing to use a utility like plink in order to interface with a system, such as a switch or a chassis, during a script.  If I'm just sending configuration commands (and am taking it on faith that they worked...), then it's nice and easy, but if I actually want to extract information from the device, then I've got a bit of a challenge, because those devices (via plink) are not going to give me back an object that PowerShell understands.

For example, if I use get-vm in PowerShell, I will get back a vm object that has a bunch of properties, which I can easily access using dot notation.  If I use plink to pull a brocade switch configuration, all I'm going to get back (from PowerShell's perspective) is a great big long string with lots of New Line characters, tabs and spaces.  So, how do I extract data from a formatted text string, in order to more easily work with it in PowerShell?  Well, there's a lot of different tricks available, but here's some that I've used recently.
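
As a quick sketch of that starting point (the hostname, credentials and command here are all made up for illustration), calling plink from PowerShell and capturing its output looks something like this:

```powershell
# Run a command on a remote device via plink and capture the output.
# -batch suppresses interactive prompts; swap in your own host, account
# and command.  The call operator (&) returns an array of lines, so -join
# rebuilds the one great big string (with embedded newlines) described above.
$raw = & plink.exe -ssh -batch -l admin -pw $password switch01 "show running-config"
$bigString = $raw -join "`n"
```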

First, you need to understand the format of the string that you want to parse.  Does it use tabs to create the impression of columns?  Maybe each line is a separate piece of data using a colon to delineate what the property is from the actual value.  Maybe it's something else entirely!  The approach that you use will depend entirely on what that data looks like, but I generally like to create a new PowerShell object with appropriate properties that I can then work with.
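
For instance, if each line turns out to be a "Property: Value" pair, a minimal sketch of that object-building approach (the sample data here is invented) might look like this:

```powershell
# Sample formatted text: one "Property: Value" pair per line
$text = "Name: Jason`nDate: Today`nMood: Happy"

# Build a hashtable from each line, splitting on the first colon only,
# then promote it to a PowerShell object with real properties
$props = @{}
foreach ($line in ($text -split "`n")) {
    $name, $value = $line -split ":", 2
    $props[$name.Trim()] = $value.Trim()
}
$obj = New-Object PSObject -Property $props
# $obj.Name, $obj.Date and $obj.Mood are now accessible via dot notation
```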

Today, I'm going to talk about how I deal with false columns.  This is the worst case, as you also have to understand how the output handles data that is too wide for a column, as well as how it spaces the columns.  Does it just misalign all further columns, or does it "wrap" to the next line instead?  My general approach is to first split the string on the New Line character, so that I can take it line by line (if it's not already neatly split), like this:

$a = $a -split("`n")

Then I look at the columns to figure out the spacing.  Is it something simple, like a tab delimited list?  If so, I can then split each line on the tab character in order to work with the data.  Check out this example:

$a = "Name`tDate`tMood`nJason`tToday`tHappy"

That will create some sample formatted text that looks like this:

Name    Date    Mood
Jason   Today   Happy

In this case, because we're dealing with tabs between the columns, we can easily convert this to a PowerShell object like this:

$a | convertFrom-csv -delimiter "`t"

Which will turn it into a nice object, complete with Name, Date and Mood parameters.  What do you do if it uses spaces to create the columns?  Well, then you've got a couple of options.  Let's look at this example formatted text:

$a = "Name   Date   Mood`nJason  Today  Happy"

which generates output like this (this example might get a little messed up because of the HTML conversion and the spaces):

Name   Date   Mood
Jason  Today  Happy

In this case, I think that the best approach is to insert delimiting characters to each line, then use that same convertFrom-csv cmdlet to turn it into an object.  Try this:

$b = $a -split("`n") | % {$_.insert(14,"`t").insert(7,"`t").replace(" ","")}
$b | convertfrom-csv -Delimiter "`t"

The first command there will split the input into an array based on the new line character, then will insert tabs into each line at the 14th and 7th positions (inserting at the higher index first, so that the first insertion doesn't shift the position of the second).  Finally, it deletes any spaces that it found, so that your object parameters won't have troublesome spaces in their names.  The next command simply interprets that newly formatted text as a tab delimited CSV and then creates a PowerShell object from it.

So, what do you do if there's line wrapping in the table?  Well, that's even more difficult.  I'd start by using one of the above techniques in order to generate a delimited csv, then I'd start parsing through that CSV to correct for the unneeded line breaks.  The first thing to determine is how to detect if there's been a wrap.  One way might be if one or more fields is blank in a given column, as that could indicate that this line should be considered a continuation of the line before.  Here's how I'd correct for that.

First, let's prepare an example:
$a = "Name`tDate`tMood`nJason`tToday`tHappy`nColeman`t`t`nJeff`tYesterday`tExuberant"
$b = $a | convertfrom-csv -delimiter "`t"

In this case, "Coleman" is really a continuation of data in the "name" property from the line above, although it has been presented to PowerShell as if it were a new item in the array.

Name    Date      Mood
----    ----      ----
Jason   Today     Happy
Coleman
Jeff    Yesterday Exuberant

How do we detect this and handle this situation?  Here's a solution that I put together that assumes that any line with an empty parameter must be a continuation of the previous line.  In some situations, that could be a big assumption, so make sure that is valid with your dataset before using this technique!  If that's the case, you'll need to figure out some other way of detecting a continuation, but can still use this framework for concatenating those lines once they're identified.

# Gather the property names from the first object; we'll use these to inspect each line
$properties = ($b[0] | Get-Member -MemberType Properties).name
for ($i=0;$i -lt $b.count;$i++){
    # If any property on this line is an empty string, treat it as a continuation
    if (($properties | % {$b[$i].$_ -eq ""}) -contains $TRUE){
        # Append each populated property to the same property on the previous line
        $properties | % {if ($b[$i].$_){$b[$i - 1].$_ = $b[$i - 1].$_ + " " + $b[$i].$_}}
        # Overwrite this line with the merged previous line, creating a duplicate
        $b[$i] = $b[$i - 1]
    }
}
# Remove the adjacent duplicates that the merge left behind
$b = $b | get-unique -AsString

Obviously, this is a bit more complex.  What's it do?  The first line gathers a list of all of the properties from these PowerShell objects, which we'll use later to evaluate the contents of each line.

Next, it enters a for loop, going through each object in the array.  I used this technique instead of a ForEach because I'm going to be manipulating objects based on their index location within the loop.

Next, there's a complex If statement.  The evaluation that's happening there looks at all of the properties discovered in the first step and checks the current object to see if any of them are empty strings (depending on the source data, you may need to check for $NULL or something else entirely!).  If any of them are empty strings (that's what the -contains $TRUE is checking), it adds that object's property's contents to the previous object's property in the array.  After that, it overwrites the current object with duplicate data from that previous object.  The final command removes any duplicate lines from the array.

Is that a comprehensive list of techniques for parsing formatted text output?  Of course not!  But, hopefully these techniques will come in handy and save someone some headaches.

Monday, April 3, 2017

HP c7000 Chassis Administration Tips and Tricks

Several of my customers use HP C7000 Blade Chassis for their ESXi hosts.  I've picked up a few tips and tricks for working with that chassis over the years, so I figured that I'd post them here.

The Virtual Connect (the blade chassis's networking component) has a feature that can prevent pause frames from flooding a network by disconnecting a blade that is sending an excessive number of them.  Unfortunately, every now and then, it detects an ESXi host's uplink as sending such a number of pause frames and so disconnects that network adapter.  Fortunately, it's really easy to allow traffic to flow through that port once again.  Just SSH into the Virtual Connect (you can get the address by looking at the "Virtual Connect Manager" link in the Onboard Administrator interface).  Once you're connected, use the show port-protect command to see if there are any ports that are in a blocking state.  If so, you can use the reset port-protect command to reset the pause flood blocking (it's a global thing), and your link should come right back up.

Another issue that I see occasionally is when the server loses its access to its SD card.  Disconnecting power from the blade and then powering it back on will frequently restore access, but that's not easy to do if the blade isn't physically near you.  Fortunately, there is a way to remotely trigger the e-fuse on the blade and cause such a hard reset.  I shut down the blade first, then SSH into the Onboard Administrator.  Once I'm signed in, I use show server status X, where X is the blade slot that I'm interested in.  I verify that the power is off for that blade, just to double check that I'm really working on the server that I think that I'm working on.  Once I'm sure that X is really the blade that I want, I use reset server X to trip the e-fuse.  Within a minute, I see the blade drop out of the OA and then return, and it will usually bring back the SD card with it.
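
If you find yourself doing this across a lot of chassis, those same steps can be scripted with plink (the OA address, account and slot number below are placeholders):

```powershell
# Placeholders - substitute your own OA address, account and blade slot
$oa   = "chassis01-oa"
$user = "Administrator"
$slot = 3

# Verify that the blade is powered off before doing anything drastic
& plink.exe -ssh -batch -l $user -pw $password $oa "show server status $slot"

# Once you've confirmed the slot is the right one, trip the e-fuse
& plink.exe -ssh -batch -l $user -pw $password $oa "reset server $slot"
```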

I recently had to do some work on a blade while troubleshooting a suspected hardware failure.  One of the troubleshooting steps was to clear the NVRAM on the blade, and HP helpfully sent me a procedure.  There's a little switch on the motherboard and, for this model/generation combo (BL460 Gen 8), little is an understatement.  If you need to do this, you'd be well served by bringing a long needle with you, so that you can toggle only the switch that you want to toggle.  In a pinch, I ended up using one tine of the star allen wrench that HP embeds in the blades.  It wasn't easy, but it wasn't too bad either.  Once that switch was toggled, I had to turn on the server and let it do its thing, then turn it back off again.

In order to do that process, I had to look at the console of the blade to watch the boot process.  In the datacenter, I didn't have easy access to my laptop for iLo, so I learned how to use the KVM on the back of the C7000 chassis.  It's not too difficult, just plug in a VGA monitor and a USB mouse/keyboard to the active OA module.  If you plug into the standby module, it'll tell you, so just switch it over.  Then, use the front diagnostic panel and select the "KVM Menu" option, which will black out that panel.  Go back to the monitor and you should see a list of all blades installed in the system.  From there, you can do power operations or select the name of the blade to open its console directly.  The part that wasn't well documented is that you can get back to that menu at any time by pressing Print Screen twice in rapid succession.  So, when you are ready to switch over to another blade, just press Print Screen, Print Screen, and after a couple of seconds it'll bring you back to the menu.

Monday, March 27, 2017

Checking Distributed Switch PNICs for Invalid VLAN Traffic

4/26/17 Update: I changed this script so that it no longer uses the min/max VLAN numbers and instead discovers a list of valid VLANs based on the Port Groups that are defined on the VDS.  It then alerts if it sees any VLANs that are not in that list.

One of my customers has several physical uplinks going into their ESXi hosts, each carrying different sets of VLANs.  They recently had an issue where an uplink with one set of VLANs was accidentally attached to a VDS that was configured for the other set of VLANs.  This wasn't a catastrophic issue, as the VDS didn't have port groups defined for those invalid VLANs and so any traffic was dropped into the bit bucket, but it did mean that 1 of the links going into that switch was useless.

After we corrected the issue, we decided that we should audit the environment to see if this problem had occurred anywhere else but not been detected.  We decided that the best way to perform an initial scan of the environment would be to leverage the NIC traffic hints that VMware generates per PNIC and see if any PNICs either registered no traffic or registered traffic from VLANs that were not appropriate.  This process required examining every PNIC attached to every VDS and ensuring that it conformed to standards.

As you can imagine, I didn't want to do this by hand... so I wrote a script to do it for me!  This script takes 3 parameters: VDSwitch, minVLANID, and maxVLANID.  VDSwitch is the name of the Distributed Switch that the script will examine.  minVLANID is the lowest numbered VLAN that is acceptable on the VDS and maxVLANID is the highest numbered VLAN that is acceptable on the VDS.

With that data provided, the script will loop through each PNIC on each ESXi Host that's attached to the specified VDS, examining the VLAN traffic hints.  If it finds any VLANs that are outside of that range, it will report them in red during execution.  If it finds any links with no observed traffic, it will report that in yellow.  After it's done, it spits out a full report that lists each ESXi host, the min/max VLAN numbers and the observed traffic on each PNIC.
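
The hint-gathering portion of that process can be sketched like this.  This is a rough sketch only: I believe the observed VLAN IDs surface under the Subnet property of each network hint returned by QueryNetworkHint, but verify the exact property path against your own environment before relying on it.

```powershell
# Rough sketch: report the observed VLANs for each PNIC on each host of a VDS.
# Assumes the VLAN IDs live under each hint's Subnet property - verify first.
$vds = Get-VDSwitch "myVDS"
foreach ($vmhost in ($vds | Get-VMHost)) {
    # Grab the host's NetworkSystem and ask for traffic hints on all PNICs
    $netSys = Get-View $vmhost.ExtensionData.ConfigManager.NetworkSystem
    foreach ($hint in $netSys.QueryNetworkHint($null)) {
        $vlans = $hint.Subnet.VlanId | Sort-Object -Unique
        "$($vmhost.Name) $($hint.Device): observed VLANs $($vlans -join ',')"
    }
}
```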

As always, this script is posted as is with no guarantees.  The fact that it worked for me in my situation does not guarantee that it'll work for you in yours.  Make sure that you fully understand and test any script that you find on the internet before running it in your own environment.

Tuesday, March 14, 2017

Getting VM EVC Mode Requirements via PowerCLI

One of my customers was preparing to do some major ESXi host reconfiguration and so needed to shift VM workload from one cluster to another.  They had a challenge in that their clusters were running with different EVC modes, and they wanted to move VMs from the newer cluster to the older cluster.  "Impossible!" the strawman says, "it can't be done!"

Well, yes and no.  That's absolutely correct that you can't vMotion a VM that powered up on an Ivy Bridge CPU back onto an ESXi host with a Sandy Bridge processor.  The reason for this is that the VM, during its power on operation, scans the CPU of its host for a list of CPU features that are available and begins potentially using those features, which means that it can't be moved to a processor that doesn't have those features.  The VM, in effect, inherits its host's EVC Mode for the lifespan of this power cycle.  Until the VM goes through a complete new power cycle (not a reboot from within the guest OS), it will maintain that EVC Mode.  This means that any VMs that were originally powered on on older CPUs will maintain that older EVC Mode, even though they're running on an Ivy Bridge capable cluster!

You can actually see this through the vSphere client.  If you select your cluster, then go to the Virtual Machines tab, you can add the "EVC Mode" column (it's about 3/4 of the way down the list), which will list the runtime EVC Mode for each VM.  All that we had to do was select any VMs that were still running with an older mode (we literally had hundreds) and move them into the older cluster.

Well, it wasn't quite that easy.  The customer wanted a report of which VMs were candidates to move, so that they could be selective regarding which VMs were migrated to the older cluster.  Easy enough, right?  You can export a list as a CSV from the vSphere client, and you're good to go... usually.

I'm not sure why, but the CSV export blew up.  It had a bunch of lines that were just text; looking back, it was probably just due to carriage returns in the "notes" field of some VMs.  Regardless, it inspired me to pull that data via PowerCLI rather than using the retired vSphere client.

So, I did some quick googling and found that there's lots of information out there about how to pull the EVC Mode of a cluster (Get-Cluster 'MyCluster' | select Name,EVCMode).  There's lots of information about how to pull the EVC Mode of an ESXi host (Get-VMHost 'MyHost' | select Name,MaxEVCMode).  Not so much information about how to get the current EVC Mode requirement for a particular VM or set of VMs.  I figured that it must be in there somewhere, but I had no idea where.  So, I cheated.

I used get-vm to grab a running VM, then piped that to a file via Export-CLIXML -depth 8.  I chose to use Export-CLIXML because it serializes all of the properties and subproperties of the object, up to the specified depth.  I chose 8 because I know that VMware loves to embed references to objects within objects within references to the original object and I didn't want my file to get too crazy.  As it was, I ended up with a 3 MB XML file for this single VM.
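
The exploration itself is just a couple of commands (the VM name and file path here are placeholders):

```powershell
# Serialize a VM object (and 8 levels of nested properties) to disk,
# then search the resulting XML for mentions of EVC
Get-VM "MyVM" | Export-Clixml -Depth 8 -Path .\myvm.xml
Select-String -Path .\myvm.xml -Pattern "EVC"
```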

Once I had that XML file, I searched it for EVC.  The first couple of instances weren't very promising (one was buried under the VMHost and the second didn't actually specify the mode), but the third one (around line 3400) looked good.  I started compacting the XML nodes to trace the path back towards the root, and eventually I found that the MinRequiredEVCModeKey parameter was under $_.ExtensionData.Runtime.

So, I put together the following command and generated a nice clean list of migration candidates:
get-cluster 'myCluster' | get-vm | ? {$_.powerstate -eq "poweredon"} | select name,@{Name="EVCMode";Expression={$_.ExtensionData.Runtime.MinRequiredEVCModeKey}} | ? {$_.EVCMode -ne "intel-ivybridge"}

For anyone who's curious, that command starts by getting the MyCluster cluster, then gets a list of all VMs within that cluster.  It then filters that list to only look at powered on VMs and grabs 2 attributes from each VM: the Name and its EVCMode.  Once that list is generated, it filters that list for all objects that aren't using the intel-ivybridge EVCMode.

Wednesday, February 8, 2017

2017 vExpert

I'm proud to announce that I've been selected as a 2017 vExpert!  Thanks for the recognition and congrats to all of the other vExperts, particularly my coworkers Jeff and Dennis!

Thursday, February 2, 2017

Invalid VDS PortID Preventing vMotion

One of my customers had an issue where a bunch of VMs were not able to vMotion, despite the hosts being configured correctly in all regards (other VMs using the same VDS Port Groups, for example, could vMotion onto and off of the host where these VMs were running).  When DRS (or an administrator) attempted a vMotion, a generic "A general system error occurred: vim.fault.NotFound" error message would be displayed.

When I took a look at these VMs, I noticed something interesting (besides the fact that they were all on the same host); their VDS Port numbers were universally high, like in the 5000s.  This was particularly interesting because when I looked at the VDS itself, the highest numbered port on it was 4378.  I supposed that these ephemeral ports had somehow been assigned invalid port numbers, which was causing vMotion to fail when the new destination was unable to reserve that invalid number on the VDS.  Interestingly, all of these VMs were communicating just fine on the network, despite this odd configuration.

I decided that I needed to figure out how widespread this issue was.  I found that if I went to PowerShell and did a $all = get-vdswitch switchName | get-vm I would get a list of all VMs in the environment that were using that switch, including these VMs with invalid port assignments.  My next step was to try and get a list of all VMs with port assignments (since these VMs were not showing as being assigned to specific ports), which I did with this ugly command: $valid = get-view ((get-vdswitch switchName).extensiondata.fetchdvports($null) | ? {$_.connectee.ConnectedEntity.Type -eq "VirtualMachine"}).connectee.connectedentity | sort name -unique

Once I had my list of all VMs on the VDS and my list of all VMs with valid port assignments, I compared them to find which VMs were associated with the VDS but were not assigned a port: $ProblemVMNames = (compare-object $all.name $valid.name | ? {$_.sideindicator -eq "<="}).inputObject and then used that list of VM names to actually get a list of specific VM objects (and to filter based on PowerState, which causes false positives): $problemVMs = $all | ? {$_.powerstate -eq "PoweredOn" -and $problemVMNames -contains $_.name} | select name,@{N="VMHost";E={$_.vmhost.name}},@{N="PortGroup";E={(get-view $_.extensiondata.network).name}}.  I then saved my $problemVMs as a CSV and moved on.

Now that I had a list of all of the VMs that were experiencing the issue, I could get to work cleaning it up (we still don't know what had caused it, but they had experienced some other problems at that same time and so suspect that this is fallout).  Since the customer was using ephemeral ports on their VDS, I supposed that moving the VMs onto a different port group and back would result in them being assigned new PortIDs.  If that other port group was created on the same VLAN that the VM was already using, that transfer should be non-interruptive.  

So, I wrote a script to do exactly that.  It takes a list of VMs, figures out which port groups they're using, creates temporary copies of those port groups, then bounces each NIC on each VM over to the appropriate temporary port group and back again.  I fired off that script and it worked exactly as intended.  Each of the VMs bounced between the appropriate port groups, experiencing 0-1 response losses on a continuous ping per transfer.  After the script completed, we found that all of the VMs were able to vMotion once again!
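
A stripped-down sketch of the core move, for a single VM, looks something like this (the VM and temporary port group names are made up; New-VDPortgroup's -ReferencePortgroup parameter clones the original port group's settings, including its VLAN, which is what should keep the move non-disruptive):

```powershell
# Sketch of the bounce for one VM with one NIC; names are placeholders
$vm  = Get-VM "ProblemVM"
$nic = Get-NetworkAdapter -VM $vm
$pg  = Get-VDPortgroup -Name $nic.NetworkName

# Clone the port group (same VLAN and settings) under a temporary name
$tmp = New-VDPortgroup -VDSwitch $pg.VDSwitch -Name "$($pg.Name)-temp" -ReferencePortgroup $pg

# Bounce the NIC over and back, forcing a fresh PortID assignment
Set-NetworkAdapter -NetworkAdapter $nic -Portgroup $tmp -Confirm:$false
Set-NetworkAdapter -NetworkAdapter $nic -Portgroup $pg -Confirm:$false

# Clean up the temporary port group
Remove-VDPortgroup -VDPortgroup $tmp -Confirm:$false
```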

Like with all scripts found on the internet, this is posted as is for educational purposes with no implied guarantee.  Just because it worked for me in my situation is no guarantee that it will work for you in yours, so test thoroughly and make sure that you understand what it's doing before you ever execute a script.

Wednesday, February 1, 2017

PSODs and the iovDisableIR Setting

One of my customers recently came across an issue where their ESXi hosts were randomly crashing with a PSOD.  They had recently applied the latest SPP from HP and the latest ESXi 6.0 patches, and were now occasionally seeing these crashes with messages like "LINT1/NMI (motherboard nonmaskable interrupt), undiagnosed.  This may be a hardware problem..."

As the PSOD implied, they had called HP support for help, but weren't making much progress.  I did some googling and found a really interesting blog post from Jason Whitelock about a recent ESXi update causing HP servers to PSOD.  He had come across the exact same issue and had tracked it down to the value of the iovDisableIR setting, which had changed in this latest ESXi update.  When he set it back to its original setting, the PSOD issue went away.

As VMware explains it, Interrupt Remapping (the technology that's affected by this setting) enables more efficient IRQ routing and thus improves performance.  Unfortunately, not all hardware supports it very well and can have issues, so VMware published a KB Article that describes how to identify those issues and how to disable it.  As Jason found, HP published a customer advisory stating that their hardware does not suffer from that issue, and in fact disabling that feature can lead to PSOD crashes.  That advisory includes details for how to turn that feature back on (by disabling the iovDisableIR setting via ESXCLI).  Since Interrupt Remapping defaults to being enabled (that is, the iovDisableIR setting itself defaults to FALSE), this has only ever been a minor issue and hasn't gathered much attention.

The problem is that recently, VMware changed that default setting.  Jason found that the November ESXi 6.0 Patch ESXi600-201611401-BG changed that default setting (and every ESXi host that was using the default... which is pretty much every ESXi host) to disable Interrupt Remapping.  This put my customer (with all of their HP hardware) into a bad position, as they were now in a configuration that was known to be unstable on HP hardware.

Fortunately, the fix is really easy; just use ESXCLI to change the iovDisableIR setting to FALSE on every affected ESXi host.  As you can probably guess, I wasn't about to SSH into every ESXi host in this environment, so this was a good opportunity to pull out PowerCLI and the get-esxcli cmdlet.

So, I wrote a script.  It's pretty simple (and very situational).  It pulls a list of all ESXi hosts that are in maintenance mode, then goes through that list setting the iovDisableIR setting to "false" and rebooting the host (so that the change can take effect).  I added a "reportOnly" switch to it, that will cause it to spit out a short report that shows the configured iovDisableIR setting and the current runtime setting per host, to validate that hosts are set correctly after the reboot.  Because there's that reboot built into the script, I hard coded the script to only target hosts that are in maintenance mode, making the reporting aspect less useful for people who are trying to figure out if their environment might be affected by this issue.  That said, you can pretty clearly see how that reporting bit is working, so put it in a more aggressive forEach loop if you want to check out the whole environment at once ;)
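
The core of it is just the Get-EsxCli call, shown here in -V2 syntax against a single host (the hostname is a placeholder, and you can always check the exact argument names with the cmdlet's .Help() output):

```powershell
# Set iovDisableIR back to FALSE on one host (a reboot is required
# before the runtime value changes)
$esxcli = Get-EsxCli -VMHost (Get-VMHost "myhost.domain.local") -V2
$esxcli.system.settings.kernel.set.Invoke(@{setting="iovDisableIR"; value="FALSE"})

# Report the configured and runtime values, to validate after the reboot
$esxcli.system.settings.kernel.list.Invoke(@{option="iovDisableIR"}) |
    Select-Object Name, Configured, Runtime
```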

As always, this script is provided as is with no guarantees.  While it worked for me in my particular situation, that's no guarantee that it'll work for you in yours.  Test thoroughly and always make sure that you understand what a script is doing before executing it.

2/15 Update:  Ariel Sánchez, in the comments below, pointed out that VMware has published a KB article about this issue.
