Friday, January 6, 2017

Using PowerCLI to get a Datastore from an NAA ID

This is just a quickie (mainly for my own future notes): if you ever need an easy way to figure out which datastore a given NAA ID refers to (for example, when you're troubleshooting datastore access issues and the logs only reference that ID type), you can use this command to track it down:

get-datastore | ? {$_.ExtensionData.Info.Vmfs.Extent.Diskname -match "NAA NUMBER"}
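If you just want the name in the output, or need to go the other direction (finding the NAA IDs behind a known datastore), the same ExtensionData property covers both.  The datastore name below is just a placeholder:

get-datastore | ? {$_.ExtensionData.Info.Vmfs.Extent.DiskName -match "NAA NUMBER"} | select Name
(get-datastore "MyDatastore").ExtensionData.Info.Vmfs.Extent | select DiskName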

Thursday, January 5, 2017

Creating VICredentialStore Items without Typing Your Password into the Command Line

I use PowerCLI a lot.  Like, when VMware said to stop using the C# client, I just started using PowerCLI instead of learning the Flash-based web client.  As such, I log into many vCenter servers many times each day, and creating a VICredentialStore item for each vCenter that I use is one trick that saves me a lot of typing and, therefore, time.

The New-VICredentialStoreItem cmdlet, which creates these credential store items, is super easy to use.  Once you have an item created, those credentials get used automatically when you connect to that vCenter server, making the logon faster and easier.  To use it, just follow this syntax:

New-VICredentialStoreItem -Host vCenterServer -User JColeman -Password SuperSecretPassword

And there you go: the next time you use connect-viserver vCenterServer, it will automatically pass JColeman as the username and SuperSecretPassword as the password.

Of course, no one ever wants to do this.  Who in their right mind would want to type their password, in plain text, into the PowerCLI console?  Anyone shoulder surfing would be able to see it and, even worse, any time you print your PowerShell history, it'll pop up again!

Fortunately, there's a way to protect yourself against this issue.  It's an ugly command line, but use this instead:

New-VICredentialStoreItem -Host vCenterServer -User JColeman -Password ((get-credential).GetNetworkCredential().password)

When you fire that off, it will prompt you for a username/password in a popup window, which will star out the password and won't record it in the command history.  The "username" field in that popup window doesn't matter, it just can't be blank; all that we're doing is grabbing the password that was typed into that window and passing that to the New-VICredentialStoreItem cmdlet.

Bear in mind, there are some security concerns with the VICredentialStore though.  It uses encryption to store your username/password so that only your user account can access them, but if you leave your desktop unlocked, someone could walk up and use (get-vicredentialstoreitem).password to get your password.  That'll only work if they can already open up a PowerCLI session with your credentials, so the risk is manageable, but it does exist.
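If you ever need to see what's in your credential store or clean up an entry (say, after a password change), the companion cmdlets handle that.  A quick sketch:

# List everything currently stored (note that the Password property is readable, hence the caveat above)
Get-VICredentialStoreItem

# Remove the stored entry for a given vCenter/user
Remove-VICredentialStoreItem -Host vCenterServer -User JColeman -Confirm:$false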

Wednesday, January 4, 2017

Extreme IO Latency Caused by a Poorly Seated Fiber Cable

One of my customers was experiencing some extreme IO latency in their environment: in the hundreds of ms.  Obviously, there was some pain associated with this issue, and so they asked for help.  This environment had 3 HP c7000 chassis and 2 different SANs; the issue was affecting every host accessing every LUN, so we decided that the issue must be in the fiber channel fabric somewhere.

After poking around, we quickly realized that the firmware on the Brocade switches was from January 2013, so it was quite old.  I pulled up top on each switch and saw no appreciable CPU or memory usage.  Next, I looked at porterrshow to see if there were any problems on the switches; each one had a single port with a ridiculous number of Enc Out errors.  I cleared the error counters using portstatsclear -i 0-33, issued another porterrshow, and found that we were racking up roughly 100,000 Enc Out errors on that port every second.  Coincidentally, each switch had exactly one port displaying this issue.

Then we looked at sfpshow # for those ports to see if there was any interesting physical-layer information.  We noticed that the "RX Power" was very low, around 14 uW (the normal range bottoms out at 31.6 uW and goes up to 794 uW).  On a healthy port on the switch, our "RX Power" was reading in the 500s.  Well, we figured, we've got a physical problem.  Probably a kinked cable or something like that.

Next, we used portshow # to show the details for each of the impacted ports, which includes a list of any WWNs that are logged into the port.  I pulled up PowerCLI and collected all of my vmhba WWNs so that I could figure out what those devices were, using the following commands:

$allHBAs = get-vmhost | get-vmhosthba | select vmhost,device,@{N="WWN";E={"{0:X}" -f $_.PortWorldWideName}}
$allHBAs | ? {$_.WWN -match "WWN from Brocade CLI without colons"}
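The WWN that portshow reports is colon-delimited, so strip the colons before matching (PowerShell's -match is case-insensitive, so the hex case doesn't matter).  A quick sketch with a made-up WWN:

$brocadeWWN = "10:00:00:90:fa:12:34:56"   # hypothetical WWN copied from the Brocade CLI
$allHBAs | ? {$_.WWN -match ($brocadeWWN -replace ":","")}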

Those commands revealed that those ports were plugged into a new chassis that had been recently installed but not yet put into production.  Since that chassis was not yet live and the issue was systemic, I decided that those ports were unlikely to be the culprit and wanted to move on to the firmware updates.  One of the network administrators there had seen weird issues from dirty SFPs in the past though, and insisted that we resolve the physical errors before moving on, since it would be a nice, fast fix.

So, we grabbed a fiber cable cleaning kit, a couple of spare SFPs, and a couple of spare fiber cables, then went to work.  When I got there, I immediately noticed that the cables in the two affected ports were not plugged in correctly.  The link light was lit, but the plastic "depressor" tab had gotten underneath the "locking pins" on the cable, preventing them from being pushed down at all... which meant the cable could never be fully seated in the SFP.

I pulled the cables (they were plugged into a not-in-use chassis, after all), fixed the plastic housing and cleaned them (since I had them out anyway), then plugged them back in.  Lo and behold, the Enc Out errors went away and our RX Power jumped up to the 600s.  OK, that was a nice and easy fix.

So, we went back to the initial issue, the storage latency, and were amazed by what we saw.  Across the board, our storage latency had dropped to around 1 ms.  We think that the poor connections were basically causing a ridiculous number of retransmits to occur, which was bogging down the ASIC on the fiber channel switch.  Since the ASIC manages those things at the hardware level, the load never showed up on the Top command that I had started with.

That overwhelmed ASIC was basically propagating the problem to every other device that was trying to use it, which in this case meant both of the SANs in the environment.  Much to everyone's surprise (except for that one smug network admin), the issue was very easily resolved by taking care of those two errant ports.  So, lesson learned: hardware issues on even a single port can propagate in unexpected ways across an entire switch.

Thursday, December 15, 2016

Validating LUN Path Consistency via PowerCLI

One of my customers needed some help with making some zoning changes on their fiber switches after standing up a batch of new ESXi servers.  I already had a script to create 1:1 fiber channel zones on Brocade switches, so that part was easy, but zoning changes to an existing environment are a little scary.  As in, if you really mess it up, the storage is going to disappear and every VM is going to crash, scary.  Fortunately, you've got to really mess it up to cause an issue, and so this customer was willing to allow changes during business hours as long as we promised not to cause an outage ;)

So, how can I enforce that promise?  Well, I've got my script to create accurate zones for the new hosts, but that's not really the dangerous part.  If that's messed up, it just means that the new hosts won't work... and since they're still being configured, they're obviously not in production yet.  The dangerous part is when you enable the new zones, in case you somehow manage to remove an existing zone from the active config.

As long as you're really careful, you're good, right?  Well, yes, but it sure is nice to be sure.  So, I always make my zoning changes to one switch first, then check to make sure that my storage availability is unaffected before moving on to the second switch.  Rather than just spot checking through the GUI, I decided to leverage an auditing script that I had written previously.  This script churns through an environment, examining each fiber channel VMHBA on each ESXi host, and records the total number of LUNs visible through each VMHBA.  If one VMHBA sees a different number of LUNs from another, it throws an error (and spits out all of its results at the end).

So, it's pretty easy to use: just invoke the script.  It will check each vmhost in each cluster in the environment, spitting out its results.  I like to capture those results in a variable, by doing something like $a = .\check-lunpaths.ps1, so that I can more easily sort through the results in case there's a delta that needs investigation.

So, back to the current use case: I just fire off this script before making any zoning changes (to establish a baseline), then launch it again after enabling the changes to make sure that I didn't break anything.  Of course, this will only give me information about the vSphere environment's zoning, but these days that seems to cover almost every SAN use case that I come across.

Anyway, the script is below.  As always, it's published as-is with no implied guarantee, etc.  While it worked for me in my situation, that's no guarantee that it'll work for you in yours; as with any script from the internet, test thoroughly before using it!
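If you'd rather roll your own, the core logic is simple enough to sketch from the description above.  Here's a rough approximation (not the actual script, so treat the exact cmdlet usage as my assumption) that counts the disk LUNs visible through each fiber channel HBA so you can eyeball any deltas:

$results = foreach ($vmhost in get-vmhost){
    foreach ($hba in (get-vmhosthba -vmhost $vmhost -type FibreChannel)){
        # Count the disk LUNs visible through this particular HBA
        $lunCount = (get-scsilun -hba $hba -luntype disk -erroraction SilentlyContinue | measure-object).count
        $row = "" | select vmhost,hba,LUNs
        $row.vmhost = $vmhost.name
        $row.hba = $hba.device
        $row.LUNs = $lunCount
        $row
    }
}
# Any host whose HBAs report different LUN counts deserves a closer look
$results | sort-object vmhost,hba | format-table -autosize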

Monday, December 12, 2016

Memory Leak on the April HP ESXi Image

One of my customers had a whole collection of ESXi 6 hosts that were all installed from the April 2016 HP ESXi ISO image... and they hadn't been patched since then.  Well, one day they called me because their Splunk server had started sending out alerts from an alarm that we'd set up to monitor the ESXi hosts for memory leaks.

So, I logged into one of the affected hosts to try and figure out what was going on.  After poking around in a bunch of logs and a fair amount of Google work, I came across this article about a memory leak over at CPU Ready.  It sure looked promising.  So, I followed their instructions to check the version of the Broadcom driver on one of the affected ESXi hosts and, sure enough, it was an older version.  Fortunately, HP has a fix available, so I just needed to get it installed on all of the ESXi hosts (since full-on ESXi patching wasn't really an option at the time, unfortunately).

I needed some way to figure out exactly which hosts needed this new driver version installed, and since this was only an updated driver rather than an updated ESXi version, it wasn't necessarily easy to tell which hosts had the problem driver and which didn't.  This problem was resolved in the October 2016 HP ESXi ISO image, which installs the same build of ESXi, just with this newer driver.

The instructions in the article that I followed described using ESXCLI to discover the driver version, but I sure didn't want to enable SSH on every ESXi host, connect to each one, and check its driver version by hand.  Fortunately, PowerCLI has the Get-EsxCli cmdlet, so I didn't have to.  I just used this command to output a list of all ESXi servers in the environment and the version of the driver each was using:

get-vmhost | foreach {echo "$($_.name): $(((get-esxcli -vmhost $_).network.nic.get('vmnic0')).driverinfo.version)"}

In our case, all of the hosts that were using version 2.712.70.v60.3 of the driver were using the version from the April image and so needed to be updated, whereas the hosts using version 2.713.10.v60.4 were from the October image and so were fine.

Edit:
In case you want this output in a more PowerShell friendly format, you can use these commands instead:

get-vmhost | foreach {
    # Build a simple object with the host and its vmnic0 driver version
    $outObj = "" | select vmhost,driverVersion
    $outObj.vmhost = $_
    $outObj.driverVersion = ((get-esxcli -vmhost $_).network.nic.get('vmnic0')).driverinfo.version
    $outObj
}
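From there, it's easy to narrow things down to just the hosts that still need the new driver.  Assuming you captured the output above in a variable (say, $driverReport), something like this does the trick:

# Hosts still running the April-image driver version
$driverReport | ? {$_.driverVersion -eq "2.712.70.v60.3"}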

Tuesday, November 22, 2016

Finding VMs with Duplicate MAC Addresses

At one of my customers' sites today, I saw an error message that I'd not seen before: VM MAC Conflict.  "Well, that's certainly not good," I thought, as I poked around at the error message.  To my chagrin, I could only find that error message on a single VM in the environment, and the error wouldn't tell me which other VM it was conflicting with.  So, I could only think of one way to figure out what was going on: look at the MAC address assigned to every NIC on every VM in the environment and figure out what was causing the conflict.  Easy!

No, really, it was easy.  Had I done it by hand, I would certainly have driven myself crazy, but PowerCLI made it nice and easy.  I just used this command:

(get-vm | get-networkadapter | ? {$_.MacAddress -eq "<offending MAC Address>"}).parent

Lo and behold, it returned 2 VMs.  One was the known VM that had flagged the error and the other was a powered-off VM.  Maybe that's why it wasn't also flagging the error; regardless, we easily identified the source of the problem and were able to resolve it.

P.S. if you ever need to find all duplicate MAC addresses in an environment, you can use these commands:

# Compare the full list of MACs against the de-duplicated list; whatever is left over is a duplicate
$allNICs = get-vm | get-networkAdapter
$dupeMACs = (compare-object $allNICs.macAddress ($allNICs.macAddress | select -unique)).inputObject

# Print each duplicate MAC along with the names of the VMs that are using it
foreach ($thisMAC in $dupeMACs){
   "="*17 + "`n$thisMAC`n" + "="*17
   ($allNICs | ? {$_.macAddress -eq $thisMAC}).parent.name
}
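As an aside, Group-Object gets you to roughly the same place and may read a little more naturally; here's a sketch of an equivalent one-liner:

get-vm | get-networkadapter | group-object macAddress | ? {$_.Count -gt 1} | select Name,@{N="VMs";E={$_.Group.Parent.Name}}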

Monday, November 21, 2016

Using Parallel Operations in PowerShell to Write a Port Scanner


Recently, I've written several scripts that need to perform relatively simple operations on a large set of objects (such as moving a bunch of VMs onto a given Port Group or reconfiguring NTP for a bunch of ESXi hosts).  In general, I approach these challenges by generating a list of all of the objects that I want to manipulate, and then I ForEach my way through that list until I've finished all of my work.

This approach obviously works just fine; it's the way that we've written scripts for ages.  Just as you might expect from something that's been done the same way for a long time (particularly something IT related...), that's not really the best way to do it any more.  With PowerShell version 3, Microsoft introduced the concept of Parallel operations.  Starting with PowerCLI 6, VMware changed PowerCLI to make it much easier to use with PowerShell Parallel operations.

So, what is a parallel operation?  Well, a simple (and very practical!) example is that ForEach loop.  If I need to manipulate a bunch of VMs from a list, I can ForEach my way through that list and perform my manipulation on each VM, sequentially.  Of course, there's no inherent need for a particular sequence (at that level), but that's just the way that a ForEach works.  It does everything inside the loop on the first object in the list, then does it again on the second object, until it's done it for everything.

Well, PowerShell v3 gives us a new switch on the ForEach loop: -parallel.  When you do a ForEach -Parallel, that instructs PowerShell to execute all iterations of that ForEach loop simultaneously (as limited by resources).  So, instead of waiting for VM1 to be reconfigured, then moving on to VM2, a ForEach -parallel will reconfigure VM1 and VM2 at the same time.

Obviously, this can save an incredible amount of time, as many of those operations do not depend on the completion of previous iterations of those same commands on different objects.  Since this is such a useful technique, I decided that I'd go ahead and just rewrite all of my scripts to use this methodology!  Easy, right?

Yeah, right.  In order to leverage this parallelism, your scripts need to be written in a very specific way.  Firstly, you can't just use a ForEach -parallel loop in a normal script; it has to be in a Workflow.  I'd never heard of a Workflow before learning about this, but it's basically a specialized Function that has the limitations required to enable parallelism (as well as some other cool features).

So, to use parallelism, you need to define a Workflow.  Microsoft has a really good article about Workflows and Parallelism that I highly recommend reading.  It has a great description of how it all works together as well as some easy-to-use examples.

As I learnt more about Workflows and parallelism, I realized that I needed a nice, simple script to mess around with.  So, to that end, I decided to write a PowerShell based Port Scanner.  I figured that this would be an awesome way to demonstrate parallel operations, as pinging 1000+ ports on each of 20 IP addresses sequentially is a terrible situation to imagine.  Performing that giant scan in parallel, on the other hand, is actually feasible (although still not as fast as I'd like...).

workflow port-scan{
    param
    (
        [int[]]$ports,
        [string]$subnet,
        [int[]]$hosts
    )
    # Workflows don't support calling .NET methods directly (so no .Trim()); an operator does the job instead
    $subnet = $subnet -replace '^\.+|\.+$',''
    # One parallel branch per host...
    foreach -parallel ($thisHost in $hosts){
        $remoteHost = "$subnet." + "$thisHost"
        # ...and, within each host, one parallel branch per port
        foreach -parallel ($thisPort in $ports){
            test-netconnection $remoteHost -port $thisPort
        }
    }
}

The port-scan workflow takes 3 parameters: -ports, -subnet, and -hosts.  -Ports must be an array of port numbers.  -Subnet must be the class C subnet that the host(s) are on.  -Hosts must be the final octets for the class C IP addresses of the hosts that you wish to ping.

So, if you want to test connectivity on ports 900-999 (maybe you can't remember exactly which port ESXi uses but need to test connectivity...) on a few ESXi hosts at 192.168.1.100-110, you could do it like this: $results = port-scan -ports (900..999) -subnet 192.168.1 -hosts (100..110)

At that point the Workflow fires off.  After a bit of string manipulation to ensure that the subnet is in the expected format, it launches into the parallel loops.  First, it generates a cloud of instances for all of the specified hosts (in this case, 192.168.1.100 - 110).  Within each of those instances, it generates a child cloud of instances, each running the test-netconnection cmdlet on a single port for that host.  When this is executing, you'll notice the yellow host output from the test-netconnection cmdlets coming back in a seemingly random order; that's the nature of parallelism.  That's also why it's handy to store the output in a variable ($results in this case), as you'll probably need to do some sorting to make the output easier for humans to consume.
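For example, once the run finishes, a quick sort and filter makes the results much easier to read; here's one way to list just the ports that answered (the property names are the standard test-netconnection output fields):

$results | sort-object RemoteAddress,RemotePort | ? {$_.TcpTestSucceeded} | select RemoteAddress,RemotePort,TcpTestSucceeded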

So, how do you do this with PowerCLI?  Well, that's the next step on my todo list!  But, to that end, LucD has an excellent article about exactly how to use parallelism with PowerCLI that I will certainly be reading as I learn this!