Technique for Parsing String Output into PowerShell Objects

I just read a cool article about parsing netstat output (aka unformatted string output) into PowerShell objects.  The solution was great (and their approach taught me that select-object has a -skip # parameter that can be used to chop off the first # of objects from an array, which is super useful), but since I would approach it in a totally different manner, I figured that I'd write out my technique!  I want to be clear: I'm not saying that one approach is better than the other.  There are often many perfectly valid solutions to a scripting problem.  In some situations, one might be preferable to another though, and so here's another option to add to the old toolbelt.

(netstat -an | ? {$_ -match '^  '}).trim() -replace '  +',';' | ConvertFrom-Csv -Delimiter ';'

That's a pretty dense line (and it's worth noting that there are two adjacent space characters before the plus in the replace pattern and after the ^ in the match pattern), so let's break it down.

First, I'm getting my netstat output with "netstat -an" and then I'm starting to let PowerShell do its magic.  The first challenge that I noticed with the raw string output of the netstat command is that it has a bunch of formatting lines at the top that don't actually contain data that I care about.  In fact, all of the lines that I care about (including the column header line) are "indented" with a double space... so I need to filter those results down to only the lines that begin with two space characters, and then strip that indentation off.  To do that, I pipe the output of my netstat command into ? (the alias for Where-Object) and use the regex match "^  ".  That regex literally looks for two space characters at the start of a line.  Once I've filtered down to only the data lines, I do a quick trim() to remove those leading (and any trailing) spaces.  Here's that command, for anyone who wants to follow along:

(netstat -an | ? {$_ -match '^  '}).trim()
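If you want to try the filter without depending on what your own machine is currently connected to, here is a minimal sketch that runs the same two steps against a hard-coded sample.  The $sample lines are made-up values that only imitate the shape of netstat output, and $sample and $dataLines are names I picked for illustration.  Calling .Trim() on the whole array relies on member enumeration, so this assumes PowerShell 3.0 or later.

```powershell
# Made-up text standing in for real `netstat -an` output.
$sample = @(
    ''
    'Active Connections'
    ''
    '  Proto  Local Address          Foreign Address        State'
    '  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING'
    '  TCP    [::]:445               [::]:0                 LISTENING'
)

# Keep only the lines that start with two spaces, then trim each survivor.
$dataLines = ($sample | Where-Object { $_ -match '^  ' }).Trim()

# Three strings remain: the header line and the two TCP lines.
$dataLines
```

The blank lines and the "Active Connections" banner are dropped because they don't start with two spaces.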

That's already a great improvement in parsability over the raw netstat output, but the fact that netstat uses space characters both to separate the columns and in the column headers is a bit of a pain.  If we were going through this line by line in a ForEach loop, we could split each line based on the '  +' regex (which means two or more adjacent space characters), but we can solve this without going into a loop.  I did that by replacing every sequence of two or more adjacent space characters with a single semicolon character.  Here it is with that command:

(netstat -an | ? {$_ -match '^  '}).trim() -replace '  +',';'
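To see what the replace is doing in isolation, here is a small sketch on two already-trimmed sample lines (again made-up values; $trimmed, $delimited and $fields are just illustrative names).  It also shows the -split alternative mentioned above for comparison.

```powershell
# Two already-trimmed lines in the same shape as netstat output.
$trimmed = @(
    'Proto  Local Address          Foreign Address        State'
    'TCP    0.0.0.0:135            0.0.0.0:0              LISTENING'
)

# -replace works on every element of an array, so no loop is needed.
$delimited = $trimmed -replace '  +', ';'
$delimited
# Proto;Local Address;Foreign Address;State
# TCP;0.0.0.0:135;0.0.0.0:0;LISTENING

# The loop-style alternative: split a single line on the same pattern.
$fields = $trimmed[1] -split '  +'
$fields.Count   # 4
```

Notice that the single space inside "Local Address" survives, because the pattern only matches runs of two or more spaces.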

While that makes the netstat output less readable to us as humans, it makes it way easier for PowerShell to parse.  Now, we've just got a CSV (although it uses a semicolon delimiter instead of a comma).  In all honesty, I could have probably gotten away with using a comma in my "replace" command in this situation, but people like to use commas in their data strings sometimes and I've been bit by that in the past.  Also, it's trivial to tell PowerShell to just use a different delimiter when interpreting a CSV: ConvertFrom-Csv -Delimiter ';'
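Here is a quick sketch of that last step by itself, fed with hand-written semicolon-delimited lines rather than live netstat output ($delimited and $rows are illustrative names, and the values are made up).  I've included a UDP-style row with no State value, since that's a shape you can expect to see in real output.

```powershell
# Semicolon-delimited text: first line is the header, the rest are rows.
$delimited = @(
    'Proto;Local Address;Foreign Address;State'
    'TCP;0.0.0.0:135;0.0.0.0:0;LISTENING'
    'UDP;0.0.0.0:123;*:*'
)

$rows = $delimited | ConvertFrom-Csv -Delimiter ';'

$rows[0].'Local Address'   # 0.0.0.0:135
$rows[0].State             # LISTENING
$rows[1].Proto             # UDP
$rows[1].State             # empty, because that row has only three fields
```

Property names with a space in them (like 'Local Address') need to be quoted when you access them.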

And that brings us to the whole command:

(netstat -an | ? {$_ -match '^  '}).trim() -replace '  +',';' | ConvertFrom-Csv -Delimiter ';'

Where might you want to use this approach vs. Sean's approach?  Well, we do generate different outputs.  My command here generates a 1:1 array of PowerShell objects out of the Netstat command, with one property per column.  It doesn't care how many columns there are or what their headers are: as long as the columns are separated by two or more space characters, it will generate an object with one property per column.  Sean's solution is tailor-made to that specific Netstat command (although you could obviously customize it for whatever string output you want), but that customization means that it can go deeper.  He extracts the port numbers from the two Address fields, for example, which might be a critical piece of information for a particular use case!
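To illustrate the "it doesn't care about the columns" point, here is a rough sketch that wraps the one-liner in a function and points it at text that has nothing to do with netstat.  ConvertFrom-SpacedText is a hypothetical name I made up (it is not a built-in cmdlet), and the sample text is invented.  It assumes the same conventions as above: data lines are indented by two spaces and columns are separated by two or more spaces.

```powershell
# Hypothetical helper: same filter / trim / replace / ConvertFrom-Csv pipeline.
function ConvertFrom-SpacedText {
    param([string[]]$Line)

    $Line |
        Where-Object { $_ -match '^  ' } |
        ForEach-Object { $_.Trim() -replace '  +', ';' } |
        ConvertFrom-Csv -Delimiter ';'
}

# Invented sample with different headers and a different column count.
$text = @(
    'Some banner line'
    ''
    '  Name        Size   Owner'
    '  report.txt  12 KB  alice'
    '  notes.md    3 KB   bob'
)

$objs = ConvertFrom-SpacedText -Line $text
$objs[0].Size    # 12 KB
$objs[1].Owner   # bob
```

The obvious limitation is the same one the one-liner has: a value that itself contains two adjacent spaces would be split into two columns.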

So yeah, neither approach is inherently better than the other, and in fact elements of each can be combined to do even more cool stuff if you want!

Update: Ok, because I'm a glutton for punishment (well really, it just seemed like an interesting problem), I decided to go ahead and put together an example that uses this technique to split out the IP and Port values into their own properties:

$lines = (netstat -an | ? {$_ -match '^  '}).trim() -replace '  +',';' | ConvertFrom-Csv -Delimiter ';'

foreach ($line in $lines){
    foreach ($loc in @("foreign", "local")){
        $ip,$port = if ($line."$loc address" -match "(.*):(.*)"){$matches[1],$matches[2]}
        $line | add-member -type NoteProperty -name "$loc-IP" -value $ip
        $line | add-member -type NoteProperty -name "$loc-Port" -value $port
    }
}

$lines


There are two parts of this addition that I want to touch on.  The first is that I decided to be needlessly clever when parsing out the data for the "local address" and "foreign address" properties by making an array of the words "local" and "foreign" and then running the same commands on each of them so that I wouldn't need to duplicate code.
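In case the "$loc address" bit looks odd, here is a tiny sketch of that trick on its own: a double-quoted string after the dot is expanded first and then used as the property name.  The $row object and its values are made up for the example, and the [pscustomobject] syntax assumes PowerShell 3.0 or later.

```powershell
# A stand-in for one parsed netstat row.
$row = [pscustomobject]@{
    'Local Address'   = '0.0.0.0:135'
    'Foreign Address' = '0.0.0.0:0'
}

# "$loc address" expands to "local address" / "foreign address";
# property lookup is case-insensitive, so both resolve.
$values = foreach ($loc in 'local', 'foreign') {
    $row."$loc address"
}

$values
# 0.0.0.0:135
# 0.0.0.0:0
```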

The actually interesting thing here is how we split out the IP from the Port, given that the : character is both used to separate them and within the IP for an IPv6 line.  I got the idea to use RegEx capture groups from jdgregson on StackOverflow (although I changed the suggested regex to better fit my purposes here and be easier to read).  A Regular Expression lets you use parentheses to specify a capture group, which can be referred to later.  In PowerShell, a successful -match throws the capture groups into the automatic $matches variable, which is a hashtable keyed by group number (key 0 holds the whole match).
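If you haven't poked at $matches before, this short sketch shows what ends up in it after a successful -match against a sample address (the address is a made-up value, and $whole, $first, $second and $kind are illustrative names):

```powershell
if ('0.0.0.0:135' -match '(.*):(.*)') {
    $whole  = $Matches[0]   # 0.0.0.0:135  (the entire match)
    $first  = $Matches[1]   # 0.0.0.0      (first capture group)
    $second = $Matches[2]   # 135          (second capture group)
    $kind   = $Matches.GetType().Name   # Hashtable
}

$whole, $first, $second, $kind
```

Note that $matches is only updated when the match succeeds, which is why the lookup sits inside the if block.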

The (.*):(.*) Regular Expression is pretty simple: it creates two capture groups, one from everything until the last : in the line, and one from everything after that.  It effectively splits on the last instance of a : in the line.  But, why does it work that way?  RegEx is greedy, but it needs to match if it can!  When you specify (.*) you are saying as many characters as possible, regardless of the character... but then we've got that : in there.  So, there has to be a : after that "as many characters as possible", so we've effectively identified the final : in the line.  The second (.*) lets us then capture all of the remaining characters.  So, our RegEx very easily grabs the IP (be it IPv4 or IPv6) and separates it from the port, into $matches[1] and $matches[2].  I assign those to $ip and $port, then add those properties to my lines.  Voila!
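Here is a small sketch you can run to watch the greedy behavior for yourself.  The addresses are made-up samples in the bracketed style that IPv6 entries use, and $addresses, $results and $lazyFirst are illustrative names.  The last few lines swap in a lazy (.*?) purely for contrast, to show what would go wrong if the first group stopped at the first : instead of the last one.

```powershell
$addresses = '192.168.1.10:49152', '[fe80::1%12]:445', '[::]:135'

# Greedy first group: splits on the LAST colon in each string.
$results = foreach ($a in $addresses) {
    if ($a -match '(.*):(.*)') {
        [pscustomobject]@{ IP = $Matches[1]; Port = $Matches[2] }
    }
}

$results
# IP = 192.168.1.10   Port = 49152
# IP = [fe80::1%12]   Port = 445
# IP = [::]           Port = 135

# Lazy first group, for contrast: stops at the FIRST colon instead.
if ('[::]:135' -match '(.*?):(.*)') {
    $lazyFirst = $Matches[1]   # just the opening bracket
}
```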
