Troubleshooting with vRealize Network Insight

I've had the opportunity to use vRealize Network Insight (vRNI) lately during a network migration project and it has proven invaluable.  We've used it to collect data about the subnets before they're migrated and we use it to help troubleshoot issues after the migration is completed.  It's given us great visibility into the traffic on the network and into where that traffic is being blocked.  So, how do we use it?

Before the migration, we use it to scrape a ton of data from the source subnet, as we need to know what's going on with the servers that are running there.  At the start of the project, we attempted to learn those details by asking the application owners about their applications' requirements; however, we found that the vendor documentation was universally poor, especially when measured against the needs of micro-segmentation.

To get that information, I execute a very simple query in vRNI: flows where subnet = <subnet>.  This returns a list of all network flows for that subnet.  This is not a packet capture dump; it's a summary of all of the observed flows that involved that subnet.  For example, if there are 50,000 TCP 445 sessions from System A to System B, it's simply going to report that System A talked to System B on TCP 445.  You don't need to sift through millions of lines of network traffic.

That said, it still generates a lot of data.  Our subnets ranged from hundreds to tens of thousands of unique network flows.  This data can be exported as a CSV and, from there, it's basically up to you and your own ingenuity to figure out how to make sense of it.
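As one possible starting point for making sense of the export, here is a minimal Python sketch that groups flows by the service being reached, so that each destination/protocol/port combination lists the sources that talked to it.  The column names (source, destination, protocol, port) and the sample rows are assumptions made up for illustration; the real export's headers will differ, so adjust the parameters to match.

```python
import csv
import io
from collections import defaultdict


def group_flows_by_service(csv_text, src_col="source", dst_col="destination",
                           proto_col="protocol", port_col="port"):
    """Map each (destination, protocol, port) to the sorted list of sources seen.

    The column names are hypothetical defaults, not the real vRNI export headers.
    """
    services = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row[dst_col], row[proto_col], row[port_col])
        services[key].add(row[src_col])
    return {key: sorted(sources) for key, sources in services.items()}


# Made-up sample data standing in for an exported CSV.
sample = """source,destination,protocol,port
SystemA,SystemB,TCP,445
SystemC,SystemB,TCP,445
SystemA,SystemD,TCP,1433
"""

for service, sources in group_flows_by_service(sample).items():
    print(service, "<-", sources)
```

Grouping by destination and port is just one way to slice the data; it happens to line up with how firewall rules are usually written, which makes it a reasonable first pass when planning micro-segmentation.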

After the migration is where I found the tool especially useful.  We occasionally got complaints from the application owners along the lines of "ApplicationServerX isn't able to connect to DatabaseServerY!  Now the building is on fire and we're all going to die!"  Well, not quite that extreme, but no one is happy when a production system is not behaving as expected.  vRNI absolutely shines in this scenario.

I would simply execute a query like this: VM ApplicationServerX to VM DatabaseServerY and check out the results.  vRNI will draw a diagram showing exactly how traffic flows from ApplicationServerX to DatabaseServerY, including all of the ESXi hosts/interfaces involved, the routing hops and the firewalls (assuming that you've hooked those devices into vRNI, which you absolutely should do).  Once you've got that diagram in front of you, you can click on each firewall's icon and it will display the rules that affect this flow.  In my experience, this has been extremely helpful for figuring out which (if any) firewall rule is blocking the desired traffic, and has even exposed a few routing problems!

When looking at that interface, there's a small "circling arrows" icon that you can press to reverse the direction of the flow.  I initially thought that this would just flip the source and the destination, but that's not quite what it does.  When you press that button, it shows you flow information about the replies for the traffic in the query.  So, that VM ApplicationServerX to VM DatabaseServerY query, when flipped, would show the path (and applicable firewalls) for DatabaseServerY's replies to ApplicationServerX's traffic.  This detail can be important because firewall rules are usually directional but stateful, meaning that the rule only applies to a single direction but then allows the response to that allowed traffic.  If you actually want to see the reversed traffic flow (rather than the response to the traffic flow), you need to run the query with the servers reversed: VM DatabaseServerY to VM ApplicationServerX.
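To make the "directional but stateful" distinction concrete, here is a toy Python model.  It is a deliberately simplified sketch, not how vRNI or any real firewall is implemented; the Rule and StatefulFirewall names and the TCP 1433 port are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A one-way allow rule: src may open connections to dst on port."""
    src: str
    dst: str
    port: int


class StatefulFirewall:
    """Toy model: rules are directional, replies ride on tracked connections."""

    def __init__(self, rules):
        self.rules = set(rules)
        self.established = set()  # (client, server, port) tuples already allowed

    def new_connection(self, src, dst, port):
        """A new connection is allowed only if a rule matches this direction."""
        allowed = Rule(src, dst, port) in self.rules
        if allowed:
            self.established.add((src, dst, port))
        return allowed

    def reply(self, src, dst, port):
        """A reply from server src back to client dst needs a tracked connection."""
        return (dst, src, port) in self.established


fw = StatefulFirewall([Rule("ApplicationServerX", "DatabaseServerY", 1433)])

# Forward traffic matches the rule.
print(fw.new_connection("ApplicationServerX", "DatabaseServerY", 1433))
# The reply is allowed because the connection is tracked (the "flipped" view).
print(fw.reply("DatabaseServerY", "ApplicationServerX", 1433))
# A brand-new connection in the opposite direction has no rule (the reversed query).
print(fw.new_connection("DatabaseServerY", "ApplicationServerX", 1433))
```

The second and third calls involve the same pair of servers in the same direction, yet only the reply is permitted.  That is the difference between flipping the flow with the icon and running the query with the servers reversed.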

P.S. When exporting vRNI data to CSV, there is one really annoying "quirk": the system always exports an "entity name" column (at least, I can't find a way to turn it off).  That column gives you a summary of the network flow in which it looks up IPs and returns either a hostname or at least a geographic location.  The problem is that the geographic location is in the form of Sacramento, US.  Do you see the problem?  There's potentially a comma in that field of the CSV, and the CSV has no quotes or anything to protect that column.  So, unless you go in and edit the CSV (I just replace all ", " with " "), some seemingly random number of rows will be completely unusable due to the columns being shifted.
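If you'd rather not hand-edit the file, the repair can be scripted.  The sketch below takes a different approach from the find-and-replace above: it counts the fields in each row and, when there are more than the header has, merges the surplus back into the entity name column and re-writes the file with proper quoting.  It assumes the entity name column is the only one containing stray commas and that its header is literally "entity name"; both are assumptions, and the sample rows are made up.

```python
import csv
import io


def repair_rows(csv_text, entity_col="entity name"):
    """Merge surplus fields back into entity_col and return properly quoted CSV.

    Assumes entity_col is the only column that can contain unquoted commas.
    """
    lines = csv_text.splitlines()
    header = lines[0].split(",")
    idx = header.index(entity_col)
    expected = len(header)

    repaired = [header]
    for line in lines[1:]:
        if not line:
            continue  # skip blank lines
        fields = line.split(",")
        extra = len(fields) - expected
        if extra > 0:
            # Rejoin the pieces that were split out of the entity name column.
            merged = ",".join(fields[idx:idx + extra + 1])
            fields = fields[:idx] + [merged] + fields[idx + extra + 1:]
        repaired.append(fields)

    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(repaired)
    return out.getvalue()


# Made-up sample: the first data row has an unprotected comma in the first column.
sample = """entity name,source,destination,port
Sacramento, US,10.0.0.5,203.0.113.9,443
SystemB,10.0.0.5,10.0.0.6,445
"""

print(repair_rows(sample))
```

The simple ", " replacement is quicker for a one-off; this version only pays off if you're exporting repeatedly or want to keep the original location text intact.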

