As you may know, Pure Storage has been shipping iSCSI arrays for a few months. Recently we got a bug report from a customer: when running heavy IO from a VMware guest, iSCSI IO from the VMware system to a Pure array would stall for minutes at a time. Fortunately, we were able to get a similar set up in our lab and reproduce the issue, so we could dig in and see what was going on.
The IO was going from the guest to a guest disk image, which is really just a file on the VMware filesystem, so the actual iSCSI initiator was the VMware hypervisor. We started grabbing packet dumps of the traffic while we reproduced the problem, and noticed a strange pattern of slow retransmissions from the Pure side when the stall occurred.
The test setup had one 10G Ethernet link from the VMware system going through a switch to two 10G ports on the Pure array, so packet loss due to congestion was immediately something we thought of, but we couldn’t understand why we couldn’t recover. It seemed that when the stall occurred, the Pure side of the TCP connection got stuck with a full send buffer.
The first clue was when we noticed that selective ACK (SACK) was not enabled for our iSCSI TCP connections — and indeed, when we captured the connection establishment, the VMware iSCSI initiator was not advertising SACK in its TCP options. This was kind of a mystery, because when we did other things such as ssh-ing into the the VMware command line, it was perfectly happy to set up TCP connections with SACK enabled.
So not having SACK enabled partially explained why we were not able to recover from packet loss very well: we were running with a large TCP window (hundreds of KB) on a high-speed link, and if some packets got dropped, we might send a few hundred more afterwards; without SACK, the VMware system had no way to tell us which packets to retransmit.
However, TCP was behaving even worse than “lack of SACK” could explain. What we saw in the packet trace was that after a lost packet, our retransmission timer would expire and we would send the packet after the last one that the initiator ACKed (which would be a few hundred packets before the last one that we sent). The initiator would ACK that packet, which would advance our send window, and so we would send one new packet beyond the last one we sent — right at the end but definitely within our window.
And then the initiator would just ignore that packet! The way that TCP is supposed to work is that the receiver should either ACK all the way up to that new packet (if our retransmitted packet was the only lost packet, and it now had all the data up to and including our new packet), or it should send a duplicate ACK (“dup ACK”) that re-ACKed the data we already knew it had. Just ignoring those packets that are within its receive window is completely inexplicable.
The dup ACK is important because enough of them will trigger “fast retransmission” — without that, we’re stuck waiting for a roughly 1/4 second timer to expire between retransmissions, which means it will take way too long to resend a full send window of several hundred packets. In fact, so long that the inititor just gives up and establishes a new TCP connection after a timeout of a minute or two.
Finally, we realized why the VMware initiator’s TCP behavior was so crazy and primitive. Fortunately, we had duplicated the customer config and we were using a Broadcom 10G Ethernet adapter with the Broadcom iSCSI offload driver (roughly equivalent to the bnx2i driver in Linux). This crazy TCP stack wasn’t in the VMware hypervisor — it was running on the network adapter.
(In fact, looking at the Linux kernel sources for bnx2i and cnic, one can see that the Broadcom TCP offload engine apparently has an option “L4_KWQ_CONNECT_REQ1_SACK” for connections, but because the iSCSI initiator driver doesn’t set the “SK_TCP_SACK” flag, it doesn’t get enabled. One can guess that the Broadcom driver for VMware is probably from a similar codebase, and that kind of explains why we didn’t see SACK enabled)
Once we realized where the problem was coming from, the fix was simple: switch from the Broadcom offload driver to the normal VMware software iSCSI initiator. Once we did that, performance became pretty stable, just about saturating the 10G Ethernet link, with occasional hiccups of a few seconds when a congestion drop occurred. (As a side note, it’s kind of nuts that these days we take it for granted that a storage array can do enough IOPS to get above 1 GB/sec with small random IOs).
In the past I’ve defended TOEs, but in this case the Broadcom NIC and driver aren’t even fully implementing the most primitive form of TCP, so I have to agree that it’s completely unusable. But it’s worth noting that we tried Chelsio and Emulex adapters with their iSCSI offload drivers, and they worked fine. I still think TCP and iSCSI offload make sense because they have a fundamental 3x advantage in memory bandwidth (the NIC puts the data where it’s supposed to go, rather than putting it in some random receive buffer and then having the CPU read it and write it to copy it to where it’s supposed to go)
So I don’t think there’s any broad conclusion that can be drawn beyond the fact that one should really never use Broadcom’s iSCSI offload.