Speeding Up Scripts: Sorting and Selecting Unique

I often find myself working with large collections of objects, and one challenge that frequently comes up is distilling that collection down to a set of unique items.  For example, I'm working on a project that involves analyzing a lot of network flow data, where each flow record has parameters for Source, Destination, Protocol, and Port.  For a part of this project, I need to create a bunch of computer objects, with parameters for InboundTCPPorts, InboundUDPPorts, InboundOtherPorts, OutboundTCPPorts, OutboundUDPPorts, and OutboundOtherPorts.
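
To make the end goal concrete, each of those computer objects will look something like this (a hypothetical sketch; the name and port values are just placeholders):

$computer = [PSCustomObject]@{
    Name               = 'HostA'        # placeholder computer name
    InboundTCPPorts    = @(443, 3389)   # placeholder port lists
    InboundUDPPorts    = @(161)
    InboundOtherPorts  = @()
    OutboundTCPPorts   = @(80, 443)
    OutboundUDPPorts   = @(53)
    OutboundOtherPorts = @()
}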

To make these computer objects, I need to start by getting all of the unique computers from my input data's Source and Destination fields.  That's pretty simple.  I start by combining the source and destination fields into a single array:

$computers = $data.source + $data.destination
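
(For context, this assumes that $data is an array of flow records, each with Source, Destination, Protocol, and Port properties.  A minimal hypothetical sample would look like the below; PowerShell's member enumeration then flattens $data.source into an array of every Source value.)

$data = @(
    [PSCustomObject]@{Source = 'HostA'; Destination = 'HostB'; Protocol = 'TCP'; Port = 443}
    [PSCustomObject]@{Source = 'HostB'; Destination = 'HostA'; Protocol = 'TCP'; Port = 22}
)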

That gives me a single list of all computers that are involved in these network flows, but that list is going to have a ton of duplication in it (since each computer is going to have many network flows where it is the source and many where it is the destination).  Fortunately, it's really easy to remove those duplicates:

$computers = $computers | select -unique

That eliminates all of the duplicate computer names from my list.  Since this list is eventually going to be used to generate a report, I want it to be alphabetically sorted (also, if I'm being honest, I just prefer working with sorted data over unsorted data because I'm a bit neurotic).  That's also really easy:

$computers = $computers | sort

That gives me a sorted list of all of the unique computers in that network flow data.  In fact, those steps could all be combined into a single line like this:

$computers = ($data.source + $data.destination) | select -unique | sort

And that's what got me thinking about script performance.  Frankly, I have no idea what algorithm PowerShell uses for "select -unique".  I also don't know what the "sort" algorithm looks like.  Is it faster to give sort a smaller list of objects to work with by pruning the list via "select -unique" first, or would that unique selection process be faster on a sorted list?  It was time for some testing!

So, I started by collecting a batch of data to play around with:

$events = Get-EventLog -LogName system -Newest 1000
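
(Quick aside: Get-EventLog only exists in Windows PowerShell; it was removed from newer versions of PowerShell.  If that's where you are, Get-WinEvent can pull a similar sample, though the property names differ - for example, LevelDisplayName rather than EntryType:)

$events = Get-WinEvent -LogName System -MaxEvents 1000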

I figured that there were two extreme cases that I should test: one in which the data was highly redundant, and one in which it was almost entirely unique.  Event log data is a great source for both of those!  I started by testing the highly redundant data:

1..10 | % {(measure-command {$events.entrytype | select -unique | sort}).totalMilliseconds}
1..10 | % {(measure-command {$events.entrytype | sort | select -unique}).totalMilliseconds}

I ran each of those a bunch of times because there's some variability in the results and I wanted to get a sense of that variance.  When the data was highly redundant (my test set of 1000 entries only had 3 unique event types: warning, error, and information), it was significantly faster to get the unique entries before trying to sort them.
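
If you want to check how redundant a given property is before picking an order, a quick count does the trick:

($events.entrytype | select -unique).Count    # returns 3 for my sample set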

I then did the same tests, but using data that's wholly unique (each entry's Index value is distinct):

1..10 | % {(measure-command {$events.index | select -unique | sort}).totalMilliseconds}
1..10 | % {(measure-command {$events.index | sort | select -unique}).totalMilliseconds}

In this case, the two were effectively identical.  OK, in hindsight, that makes perfect sense: either way, both "sort" and "select -unique" have to churn through the entire list.  Let's use some slightly less unique data (in my sample set of 1000 entries, I have 754 unique "timewritten" values):

1..10 | % {(measure-command {$events.timewritten | select -unique | sort}).totalMilliseconds}
1..10 | % {(measure-command {$events.timewritten | sort | select -unique}).totalMilliseconds}

In this case, there was a slight performance benefit to sorting before selecting the unique values.

So, it looks like there is no absolute rule about what order you should use for "sort" vs. "select -unique", as the redundancy of the data you're working with can swing it one way or the other.  That said, we can draw a rule of thumb: removing redundancy before sorting often provides more benefit than sorting before removing redundancy.  Since we have to do one before the other, I'm going to try to remember to get my unique data before sorting from now on.
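
If you want to bake that rule of thumb into something reusable, here's a minimal sketch (the function name is my own invention):

function Get-UniqueSorted {
    param([object[]]$InputObject)
    # Prune duplicates first (cheaper when the data is redundant), then sort the smaller list
    $InputObject | select -unique | sort
}

$computers = Get-UniqueSorted ($data.source + $data.destination)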
