Posts

Showing posts from 2017

vRealize Network Insight Overview

"It's the network!" seems like the battle-cry for some server teams.  I come from a system administration background; I get it.  When all of your services are running and the event logs look clear, it must be something external to the system... which is either the network, or something is being talked to through the network.  I've seen so many server guys chant that mantra, toss the problem over the fence and then wash their hands of the whole situation that I want to scream.  That said, I've also been in plenty of situations where I've brought a problem to the network team, asking them for help when I can't find anything wrong with a system.

One reason I like to go to the network team when I have a seemingly intractable issue is perspective.  Within a single server, I can drill down deep and get a good idea about what the applications on that server are doing, but it's much more difficult to get a picture of a whole solution.  The network team, with …

VMware Logon Monitor

When rolling out a VDI solution (or really, anything that touches on the user experience), it's crucial to understand how the change might impact the users and to ensure that they are left with a good impression of the solution.  They say that first impressions are most lasting, and the first impression that your users are going to see (for most solutions) is logon time.  That means that it's crucial that your solution does not negatively impact logon times, as that will color the entire experience.  So, how do you accurately measure it?

Well, VMware released a Fling called Logon Monitor (and it's now baked into the Horizon 7.2 agent).  It's a tool whose sole purpose is to measure the logon process and to report on what's happening during a user logon.  After it's installed, it logs (in excruciating detail) everything that occurs during a logon, storing the file in a default location of C:\ProgramData\VMware\VMware Logon Monitor\Logs

It creates a file f…

vSAN RAID Levels and Fault Domains

One of my customers is considering implementing vSAN, so I've been researching it quite a bit lately.  The interactions of vSAN RAID levels (for all-flash configurations) and Fault Domains are fairly complex, so I figured that I should post some notes about what I've learned here.

First, the concept of RAID is a little different in vSAN than it is in a traditional array.  Traditionally, RAID specifies the algorithm used to spread data (or parity data) across a set of disks.  For example, RAID 5 specifies that data will be striped across all of the disks in a set, with a single disk's capacity used for parity.  This means that a 3 disk RAID 5 set will store data on about 67% (two thirds) of its disks' capacity.  A 5 disk RAID 5 set will store data on 80% of its disks' capacity.
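The capacity arithmetic above can be sketched in a few lines of PowerShell.  This is only an illustration of the traditional RAID 5 math (one disk's worth of capacity goes to parity), not anything vSAN-specific; the function name Get-Raid5UsableFraction is made up for this example.

```powershell
# Hypothetical helper (not from the original post): the fraction of raw
# capacity left for data in a traditional RAID 5 set, where one disk's
# worth of capacity is consumed by parity.
function Get-Raid5UsableFraction {
    param(
        [Parameter(Mandatory)]
        [int]$DiskCount
    )
    # Traditional RAID 5 needs at least 3 disks.
    if ($DiskCount -lt 3) {
        throw "RAID 5 requires at least 3 disks (got $DiskCount)."
    }
    ($DiskCount - 1) / $DiskCount
}

# Print the usable percentage for the two set sizes mentioned above.
foreach ($n in 3, 5) {
    $pct = [math]::Round((Get-Raid5UsableFraction -DiskCount $n) * 100, 1)
    "{0}-disk RAID 5: {1}% of raw capacity holds data" -f $n, $pct
}
```

The wider the set, the smaller the share lost to parity, which is why the 5 disk set comes out ahead of the 3 disk set.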

vSAN treats RAID differently.  There are 3 different RAID types that vSAN supports: RAID 1, 5 and 6.  Like in a traditional array, these RAID levels describe the data redundancy algorithm used, but the members a…

Group Policy Loopback Processing on Windows Server 2012

Every now and then (especially in a VDI situation), I need to enable Group Policy Loopback Processing.  This Group Policy setting can do a lot of things; I usually use it to allow me to create Group Policy Objects that contain User Configuration settings that only apply when the users log into a certain subset of computers (such as my VDI desktops).  When that setting is enabled, it basically instructs Windows to process its computer GPOs again at user logon, so as to catch any User Configurations that are specified.

This is a setting that I configure once for each VDI deployment that I do, and I always need to look up where it is (who bothers to memorize where specific settings are amongst the thousands of options!?).  No problem, that's literally what Google was made for.  So, a quick search for Group Policy Loopback Processing is in order, which brings me to a TechNet article about Windows Server 2003 that calls the setting simply Loopback processing.  Well, Loopback Processin…

Finding Stale Brocade Zone Configurations

I recently wrote about a situation where I was creating a zoning configuration and had to figure out which Fibre Channel devices were active.  After we finished that project, we decided that we should go through and actually remove the inactive aliases and zones.  We had a list of active devices, so we were all ready to move forward and say "delete everything that isn't on this list!".  That'll work great, right?

Of course not; we needed the opposite.  "Delete everything that is on this list" is a far better instruction that is way less likely to lead to painful mistakes.  Even better is "run these commands to delete all of these unnecessary objects" and I know one good way to generate such a list of commands: a script (I feel like I'm developing a battle-cry...).

I put together a script that does a few basic things.  First, it uses the nsshow command to get a list of all of the active WWNs on a given Brocade fiber switch.  Then, it compares a…

Decommissioning Specific SAN Datastores En Masse

One of my customers recently purchased a new SAN, with the goal of decommissioning the old one.  They used Storage vMotion to migrate all of their VMs over to the new SAN and adjusted all of the ESXi hosts to put their scratch space on a new LUN, and were ready to proceed.

Many people, at this point, would just turn off the old SAN... and they might be ok.  Maybe.  At that point, the ESXi hosts are going to seriously freak out, because they just encountered an unexpected SAN failure... and we've all seen that sometimes, ESXi doesn't respond well to losing datastores unexpectedly.

So, the more cautious people would right-click on each datastore and unmount it, then turn off the old SAN.  While not as bad as just turning off the SAN, this still leaves the ESXi hosts expecting those LUNs to be there (even if they're no longer mounted as datastores), and the hosts can still run into issues.

Miss Manners insists that people follow the procedure detailed in KB 2004605.  That article includes a lot of im…

Getting Active Brocade Fiber Switch Aliases via PowerShell

A while back, I posted a quick script to create commands for 1:1 Zoning on a Brocade Fiber Switch.  I was recently helping someone go through that exact process on a set of switches that had a lot of aliases already defined on them.  Their challenge was that they weren't sure which aliases were for their current SAN vs. a retired SAN.  Rather than just creating zones for both SANs, I decided to put together a quick script that would scrape their current Aliases and check which ones have active WWNs currently in the system.  They could then use this information to prune the Aliases that are no longer needed, in addition to only creating the required zones for our project.

In order to do this task, I had to do some quick string parsing into PowerShell objects.  Good thing I already know how to do that ;)  So, I put together this script which does two things:
1) It parses the Brocade Fiber Switch's configuration to look for any Aliases
2) It checks the WWNs for those Aliases agai…

Nested Progress Bars in PowerShell

I've been working on some scripts lately and just learned about nested progress bars, which are really cool!  In fact, progress bars are a tool that I'm going to use far more often in my scripts, for a few reasons.  First though, let's talk about script output.  In my opinion, there are three basic types of output that a script generates: information as the result of the script, run-time errors, and information about the progress of the script.  We're going to ignore the run-time errors and just talk about the output that is generated by a successful script execution: information about what the script is doing right now, and information that the script has retrieved as a result of its actions.

In general, it's super helpful to be able to store the results of a script in a variable for future manipulation/archival/whatever.  I do this by using syntax like $a = ./myScript.ps1.  That will take whatever output the script generates and store it in the variable $a, which…

Parsing GPOs for Drive Mappings

One thing that we always have to do (and people often overlook) when planning a VDI project is to understand the user environment and how to gracefully recreate their current desktop environment on the virtual desktop.  This is a big challenge, as you can tell from the fact that there are so many tools available to solve it.

In my experience, the best solution is usually a combination of purpose-built tools, Group Policy Objects, and the occasional login script.  Before you can even start figuring out which combination of tools and techniques might be most appropriate, you need to understand what currently exists in the environment... and you need a fairly accurate picture of that.  If the environment is already sophisticated with heavy use of GPOs for drive mappings, printer mappings, and critical registry settings, transitioning into VDI will be far easier than if new desktops are configured by an IT guy walking over and setting up all of those things by hand.  Of course, most organi…

Port Mirroring by SPAN or RSPAN on an HP c7000 Blade

This is just a heads-up to hopefully save someone else a bit of time and pain... but the HP Virtual Connect doesn't support SPAN or RSPAN to mirror traffic from a physical device into the chassis to, for example, a Virtual Machine.  Basically, Port Mirroring, such as through SPAN or RSPAN, uses unicast to duplicate network traffic from a source port or ports onto a destination port.  This technique is useful for troubleshooting, in case you can't get a packet capture running on either end of a network flow, or for monitoring (as was our intended use case).

IANANG (I Am Not a Networking Guy), but my understanding is that the problem is due to the nature of a SPAN port and how those packets look to the Virtual Connect.  When Port Mirroring is configured, all traffic is duplicated and sent out to the Virtual Connect.  These packets are not changed in this process, keeping their original source and destination MAC addresses; the SPAN port is forwarding these packets to the VC desp…

PowerShell String Manipulation of Formatted Text in Columns

Every now and then, I find myself needing to use a utility like plink in order to interface with a system, such as a switch or a chassis, during a script.  If I'm just sending configuration commands (and am taking it on faith that they worked...), then it's nice and easy, but if I actually want to extract information from the device, then I've got a bit of a challenge, because those devices (via plink) are not going to give me back an object that PowerShell understands.

For example, if I use Get-VM in PowerShell, I will get back a VM object that has a bunch of properties, which I can easily access using dot notation.  If I use plink to pull a Brocade switch configuration, all I'm going to get back (from PowerShell's perspective) is a great big long string with lots of New Line characters, tabs and spaces.  So, how do I extract data from a formatted text string, in order to more easily work with it in PowerShell?  Well, there are a lot of different tricks availabl…

HP c7000 Chassis Administration Tips and Tricks

Several of my customers use HP c7000 Blade Chassis for their ESXi hosts.  I've picked up a few tips and tricks for working with that chassis over the years, so I figured that I'd post them here.

The Virtual Connect (the blade chassis's networking component) has a feature that can prevent pause frames from flooding a network by disconnecting a blade that is sending an excessive number of them.  Unfortunately, every now and then, it detects an ESXi host's uplink as sending such a number of pause frames and so disconnects that network adapter.  Fortunately, it's really easy to allow traffic to flow through that port once again.  Just SSH into the Virtual Connect (you can get the address by looking at the "Virtual Connect Manager" link in the Onboard Administrator interface).  Once you're connected, use the show port-protect command to see if there are any ports that are in a blocking state.  If so, you can use the reset port-protect command to reset the p…

Checking Distributed Switch PNICs for Invalid VLAN Traffic

4/26/17 Update: I changed this script so that it no longer uses the min/max VLAN numbers and instead discovers a list of valid VLANs based on the Port Groups that are defined on the VDS.  It then alerts if it sees any VLANs that are not in that list.
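The comparison described in this update can be sketched roughly as follows.  This is not the script from the post; it only shows the "valid list vs. observed list" check in plain PowerShell.  The function name Get-InvalidVlan and the sample VLAN numbers are made up, and in a real run both lists would have to come from the VDS port groups and from whatever the physical NICs report.

```powershell
# Hypothetical helper (not the original script): return the observed VLAN IDs
# that are not defined by any port group on the VDS.
function Get-InvalidVlan {
    param(
        [int[]]$ValidVlan,      # VLAN IDs defined by port groups on the VDS
        [int[]]$ObservedVlan    # VLAN IDs actually seen on a physical NIC
    )
    # Keep only VLANs missing from the valid list, de-duplicated and sorted.
    $ObservedVlan |
        Where-Object { $_ -notin $ValidVlan } |
        Sort-Object -Unique
}

# Made-up sample data, for illustration only.
$valid    = 10, 20, 30
$observed = 10, 20, 99, 30, 99, 150

# Wrap in @() so a single result is still treated as an array.
$invalid = @(Get-InvalidVlan -ValidVlan $valid -ObservedVlan $observed)
if ($invalid.Count -gt 0) {
    Write-Warning ("VLANs seen but not defined on the VDS: " + ($invalid -join ', '))
}
```

Building the valid list from the port groups, rather than from a min/max range, means that a gap in the middle of the range still gets flagged.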

One of my customers has several physical uplinks going into their ESXi hosts, each carrying different sets of VLANs.  They recently had an issue where an uplink with one set of VLANs was accidentally attached to a VDS that was configured for the other set of VLANs.  This wasn't a catastrophic issue, as the VDS didn't have port groups defined for those invalid VLANs and so any traffic was dropped into the bit bucket, but it did mean that 1 of the links going into that switch was useless.

After we corrected the issue, we decided that we should audit the environment to see if this problem had occurred anywhere else but not been detected.  We decided that the best way to perform an initial scan of the environment would be to leverage …

Getting VM EVC Mode Requirements via PowerCLI

One of my customers was preparing to do some major ESXi host reconfiguration and so needed to shift VM workload from one cluster to another.  They had a challenge in that their clusters were running with different EVC modes, and they wanted to move VMs from the newer cluster to the older cluster.  "Impossible!" the strawman says, "it can't be done!"

Well, yes and no.  It's absolutely correct that you can't vMotion a VM that powered up on an Ivy Bridge CPU back onto an ESXi host with a Sandy Bridge processor.  The reason for this is that the VM, during its power on operation, scans the CPU of its host for a list of CPU features that are available and begins potentially using those features, which means that it can't be moved to a processor that doesn't have those features.  The VM, in effect, inherits its host's EVC Mode for the lifespan of this power cycle.  Until the VM goes through a complete new power cycle (not a reboot from within the…

2017 vExpert

I'm proud to announce that I've been selected as a 2017 vExpert!  Thanks for the recognition and congrats to all of the other vExperts, particularly my coworkers Jeff and Dennis!

Invalid VDS PortID Preventing vMotion

One of my customers had an issue where a bunch of VMs were not able to vMotion, despite the hosts being configured correctly in all regards (other VMs using the same VDS Port Groups, for example, could vMotion onto and off of the host where these VMs were running).  When DRS (or an administrator) attempted a vMotion, a generic "A general system error occurred: vim.fault.NotFound" error message would be displayed.

When I took a look at these VMs, I noticed something interesting (besides the fact that they were all on the same host); their VDS Port numbers were universally high, like in the 5000s.  This was particularly interesting because when I looked at the VDS itself, the highest numbered port on it was 4378.  I supposed that these ephemeral ports had somehow been assigned invalid port numbers, which was causing vMotion to fail when the new destination was unable to reserve that invalid number on the VDS.  Interestingly, all of these VMs were communicating just fine on the…

PSODs and the iovDisableIR Setting

One of my customers recently came across an issue where their ESXi hosts were randomly crashing with a PSOD.  They had recently applied the latest SPP from HP and the latest ESXi 6.0 patches, and were now occasionally seeing these crashes with messages like "LINT1/NMI (motherboard nonmaskable interrupt), undiagnosed.  This may be a hardware problem..."

As the PSOD implied, they had called HP support for help, but weren't making much progress.  I did some googling and found a really interesting blog post from Jason Whitelock about a recent ESXi update causing HP servers to PSOD.  He had come across the exact same issue and had tracked it down to the value of the iovDisableIR setting, which had changed in this latest ESXi update.  When he set it back to its original setting, the PSOD issue went away.

As VMware explains it, Interrupt Remapping (the technology that's affected by this setting) enables more efficient IRQ routing and thus improves performance.  Unfortunatel…

Using PowerCLI to get a Datastore from an NAA ID

This is just a quickie (mainly for my own notes in the future): if you ever need an easy way to figure out which datastore is being referenced by a given naa number (like if you're troubleshooting datastore access issues and the logs all reference that ID type), you can use this command to search it out:

Get-Datastore | Where-Object { $_.ExtensionData.Info.Vmfs.Extent.DiskName -match "NAA NUMBER" }

Creating VICredentialStore Items without Typing Your Password into the Command Line

I use PowerCLI a lot.  Like, when VMware said to stop using the C# client, I just started using PowerCLI instead of learning the Flash based web client.  As such, I log into many vCenter servers many times each day, and creating a VICredentialStore item for each vCenter that I use is one trick that saves me a lot of typing and therefore time.

The New-VICredentialStoreItem cmdlet, which creates these credential store items, is super easy to use.  Once you have an item created, those credentials get used automatically when you connect to a vCenter server, making the logon faster and easier.  To use it, just follow this syntax:

New-VICredentialStoreItem -Host vCenterServer -User JColeman -Password SuperSecretPassword

And there you go: the next time you use Connect-VIServer vCenterServer, it will automatically pass JColeman as the username and SuperSecretPassword as the password.

Of course, no one ever wants to do this.  Who in their right mind would want to type their password, in plain text…

Extreme IO Latency Caused by a Poorly Seated Fiber Cable

One of my customers was experiencing some extreme IO latency in their environment: in the hundreds of ms.  Obviously, there was some pain associated with this issue, and so they asked for help.  This environment had 3 HP c7000 chassis and 2 different SANs; the issue was affecting every host accessing every LUN, so we decided that the issue must be in the Fibre Channel fabric somewhere.

After poking around, we quickly realized that the firmware on the Brocade switches was from January 2013, so was quite old.  I pulled up top on each switch and saw no appreciable CPU or Memory usage.  Next, I looked at porterrshow to see if there were any problems on the switches; each one had a single port that had a ridiculous number of Enc Out errors.  I cleared the error stat counters by using portstatsclear -i 0-33 and then issued another porterrshow, and found that we had roughly 100,000 Enc Out errors on that port each second.  Coincidentally, each switch had 1 port that was displaying this issue…