Extreme IO Latency Caused by a Poorly Seated Fiber Cable
One of my customers was experiencing extreme IO latency in their environment: in the hundreds of milliseconds. Obviously, there was some pain associated with this issue, so they asked for help. The environment had three HP c7000 chassis and two different SANs; since the issue was affecting every host accessing every LUN, we decided that the problem had to be somewhere in the Fibre Channel fabric.
After poking around, we quickly realized that the firmware on the Brocade switches was from January 2013, so it was quite old. I pulled up top on each switch and saw no appreciable CPU or memory usage. Next, I looked at porterrshow to see if there were any problems on the switches; each one had a single port with a ridiculous number of Enc Out errors. I cleared the error counters with portstatsclear -i 0-33, ran porterrshow again, and found that the port was racking up roughly 100,000 Enc Out errors per second. Interestingly, each switch had exactly one port displaying this behavior.
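In case it's useful, the check on each switch boiled down to this short sequence: find the noisy port, zero the counters, then run the report again to see how quickly the errors climb back (the 0-33 index range just happens to cover all the ports on these particular switches):
porterrshow
portstatsclear -i 0-33
porterrshow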
Then we looked at sfpshow # for those ports to see if there was any interesting physical information about them. We noticed that the "RX Power" was very low, around 14 uW (the normal range bottoms out at 31.6 uW and goes up to 794 uW); on a healthy port on the same switch, "RX Power" was reading in the 500s. Well, we figured, we've got a physical problem, probably a kinked cable or something like that.
Next, we used portshow # to display the details for each of the impacted ports, which includes a list of the WWNs logged in on that port. I pulled up PowerCLI and collected all of my vmhba WWNs so that I could figure out what those devices were, using the following commands:
# Grab every HBA in the environment and format its WWN as hex, to match what the Brocade CLI displays
$allHBAs = Get-VMHost | Get-VMHostHba | Select-Object VMHost, Device, @{N="WWN"; E={"{0:X}" -f $_.PortWorldWideName}}
# Find which host/vmhba owns a given WWN (paste the WWN from the Brocade CLI with the colons removed)
$allHBAs | Where-Object {$_.WWN -match "WWN from Brocade CLI without colons"}
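As a concrete (entirely made-up) example: if portshow had listed a WWN of 10:00:00:05:1e:0a:2b:3c, the lookup would look like this:
# 10:00:00:05:1e:0a:2b:3c is a hypothetical WWN, pasted in with the colons stripped out
$allHBAs | Where-Object {$_.WWN -match "100000051e0a2b3c"}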
Those lookups revealed that the affected ports were connected to a new chassis that had recently been installed but not yet put into production. Since that chassis was not yet live and the issue was systemic, I decided that those ports were unlikely to be the culprit and wanted to move on to the firmware updates. One of the network administrators had seen weird issues caused by dirty SFPs in the past, though, and insisted that we resolve the physical errors before moving on, since it would be quick to do.
So, we grabbed a fiber cleaning kit, a couple of spare SFPs, and a couple of spare fiber cables, then went to work. When I got there, I immediately noticed that the cables in the two affected ports were not plugged in correctly. The link light was lit, but the plastic "depressor" tab had gotten underneath the "locking pins" on each connector, which prevented the pins from being pushed down at all... which meant the cable could never be fully seated in the SFP.
I pulled the cables (they were plugged into a not-yet-in-use chassis, after all), fixed the plastic housings, cleaned them since I had them out anyway, and plugged them back in. Lo and behold, the Enc Out errors went away and our RX Power jumped up into the 600s. Okay, that was a nice, easy fix.
So, we went back to the initial issue, the storage latency, and were amazed by what we saw. Across the board, our storage latency had dropped to around 1 ms. We think the poor connections were causing a ridiculous number of retransmits, which was bogging down the ASIC on the Fibre Channel switch. Since the ASIC handles that work at the hardware level, the load never showed up in the top output that I had started with.
That overwhelmed ASIC was effectively propagating the problem to every other device that was trying to use it, which meant both of the SANs in the environment. Much to everyone's surprise (except for that one smug network admin), the issue was resolved simply by taking care of those two errant ports. So, lesson learned: a hardware issue on even a single port can propagate in unexpected ways across an entire switch.
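As a side note, if you want to spot-check that kind of host-side latency from PowerCLI instead of clicking through the vSphere performance charts, something like the sketch below works. This isn't what I ran at the time; it's just a quick example using the disk.maxTotalLatency.latest realtime counter, which reports the worst device latency (in milliseconds) that each host is currently seeing.
# Most recent worst-case disk latency (ms) reported by each host - a quick sanity check, not the author's original method
Get-VMHost | ForEach-Object {
    $sample = Get-Stat -Entity $_ -Stat "disk.maxTotalLatency.latest" -Realtime -MaxSamples 1
    [pscustomobject]@{ VMHost = $_.Name; LatencyMS = ($sample | Select-Object -First 1).Value }
}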