He who divides and shares is left with the best share

I’ve been talking to a lot of people about the “iWARP port sharing problem” lately, so I thought it might be a good idea to write a quick summary to point at and bring new people up to speed without constantly repeating myself.

To start with, iWARP is an RDMA (remote direct memory access) protocol that runs over TCP (or conceivably SCTP or any other stream protocol). It was defined by the IETF rddp working group, and the standard is in RFC 5040 and later RFCs. So what’s so great about RDMA?

The rationale for RDMA is laid out in great detail in RFC 4297, but the basic idea is that allowing network messages to carry information about where they should be received and allowing the NIC to place the data directly in that buffer allows fundamentally better performance.

To take a concrete example, think of iSCSI: an initiator sends a bunch of SCSI commands to a target (probably queuing up multiple commands), and the target processes the commands, possibly out of order, and returns the responses to the initiator. Without RDMA (or at least, without “direct data placement,” which is pretty equivalent to RDMA), for each read that the initiator does, it has to receive the data from the target, look at which command the data corresponds to, and copy it into the buffer that the SCSI midlayer wants it in. With RDMA and the “iSCSI Extensions for RDMA” (iSER, which is RFC 5046), the target can send the data in response to a read command and have it placed directly in the receive buffer on the initiator. That saves the copy and uses about a third of the memory bandwidth, since the data crosses the memory bus once (the NIC’s DMA write) instead of three times (the DMA write, plus the read and the write of the copy), which is huge if the data is running at 10Gb/sec. In the SCSI world, this is nothing particularly exciting: pretty much every Fibre Channel HBA in the world already does the equivalent thing. What’s cool about iWARP is that it allows similar optimizations for NFS (the IETF nfsv4 working group is defining a standard for NFS/RDMA, and kernel 2.6.24-rc1 already has the client side of this draft protocol merged) as well as other applications that we haven’t thought of yet.

The way that iWARP is implemented is that RDMA NICs handle the full iWARP protocol, including TCP, in hardware: yes, the dreaded “TCP offload engine.” This is crucial to the performance: if the network data isn’t processed to the point of knowing where to put it on the NIC’s side of the PCI bus, then the memory bandwidth savings of copy avoidance are lost. So while one can imagine an iWARP implementation with stateless NIC hardware using some super-fancy header splitting and chipset DMA engine tricks, it’s not clear that it will perform as well as current iWARP NICs do.

Now, in addition to handling TCP connections, iWARP NICs also have to act like normal NICs so that they can handle normal network traffic such as ARPs, pings or ssh logins. What this means is that some packets are received normally and passed up the standard network stack, while other packets that belong to iWARP connections are consumed by the NIC.

This is what leads to the “port sharing problem.” One application might do a normal bind() to accept TCP connections on port X. It might even let the kernel choose a port number for it. Then another application (possibly even the same application) does an iWARP bind and tells the iWARP NIC to accept TCP connections on the same port X. This might happen because two different applications do the bind and have no way of coordinating with each other, or it might happen because one application just passes 0 in the sin_port field of its bind requests, and the kernel chooses the same port for both the normal and iWARP bind(). Whatever the reason, the end result is not good: the NIC and the network stack are left fighting for the same packets, and someone has to lose.
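The “let the kernel choose” half of this can be seen from user space with ordinary sockets. The following is a minimal sketch using only Python’s standard socket module; the helper names bind_ephemeral and second_bind_errno are made up for illustration, and the iWARP side is not shown at all. It shows that a port of 0 makes the kernel pick a port, and that a second normal bind to that port is refused, which is exactly the protection that an iWARP bind to the same port does not get.

```python
import errno
import socket

def bind_ephemeral(host="127.0.0.1"):
    """Bind a TCP socket with port 0 so the kernel picks the port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((host, 0))                  # port 0: "choose one for me"
    return s, s.getsockname()[1]       # ask which port was chosen

def second_bind_errno(host, port):
    """Try a second normal bind to the same port; return errno or None."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        return None                    # bind unexpectedly succeeded
    except OSError as e:
        return e.errno
    finally:
        s.close()

first, port = bind_ephemeral()
result = second_bind_errno("127.0.0.1", port)
first.close()

print("kernel chose port", port)
print("second bind refused:", result == errno.EADDRINUSE)
```

The kernel can refuse the second bind only because both requests went through the same allocator. A bind handled by a separate allocator never gets that check.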

This is an issue because the kernel’s network stack and the iWARP stack have completely separate port allocators, so there is no way for applications to prevent port collisions from happening. The obvious solution is to have normal TCP and iWARP port numbers allocated from the same space.
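To make the difference between separate and shared port spaces concrete, here is a toy model in Python. It is only a sketch: the PortAllocator class, its first-fit policy, and the tiny port range are assumptions chosen to make the collision deterministic, and they are not how the kernel or any iWARP driver actually picks ports.

```python
class PortAllocator:
    """Toy port space (hypothetical): tracks which ports are in use."""

    def __init__(self, low=32768, high=32771):
        self.low, self.high = low, high
        self.in_use = set()

    def bind(self, port=0):
        """Bind to `port`, or to the first free port if `port` is 0."""
        if port == 0:
            free = [p for p in range(self.low, self.high + 1)
                    if p not in self.in_use]
            if not free:
                raise OSError("no free ports")
            port = free[0]
        elif port in self.in_use:
            raise OSError(f"port {port} already in use")
        self.in_use.add(port)
        return port

# Separate allocators: neither knows what the other handed out.
tcp, iwarp = PortAllocator(), PortAllocator()
tcp_port = tcp.bind()        # normal bind with port 0
iwarp_port = iwarp.bind()    # iWARP bind with port 0
print("separate:", tcp_port, iwarp_port, "collision:", tcp_port == iwarp_port)

# Shared allocator: the second request sees the first one.
shared = PortAllocator()
shared_ports = (shared.bind(), shared.bind())
print("shared:", shared_ports)

# An explicit bind to a taken port is refused in the shared space.
try:
    shared.bind(shared_ports[0])
    explicit_refused = False
except OSError:
    explicit_refused = True
print("explicit duplicate refused:", explicit_refused)
```

With separate allocators both binds get the same port and nothing complains; with one shared allocator the same two requests get different ports, and an explicit duplicate is rejected at bind time.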

Unfortunately, the Linux networking developers are not too interested in cooperating on this. It seems that some people have just decided that anyone who wants to use iWARP is wrong to want that (no matter how much better than the alternatives it is for that user’s app) and will just reflexively reject anything iWARP-related without trying to engage in constructive discussion. (Given that attitude, it’s rather ironic when the same people preach about open-mindedness and “thinking outside the box,” but let’s not get sidetracked…)

Given the current deadlock, the advice I’ve been giving to the various iWARP NIC companies is just to sell a lot of iWARP NICs and make the problem so big that we’re forced to find a solution. I don’t see any other way to force people to work together.

3 Responses to “He who divides and shares is left with the best share”

  1. Oh, think for a minute. says:

    I suppose no other protocols involve a control-plane / data-plane split. And there’s never been direct-access hardware before. Nope. No graphics adapters. No “old” high-speed networks. Nope nope.

    You’ve decided on *one* solution. It’s not widely acceptable. Come up with another. Geez. Maybe the hardware that’s out there isn’t as perfect as you would like to believe.

    One absolutely radical idea would be call-backs to the kernel from within the NIC’s state machine. Yeah, this exposes some other problems with scheduling. You aren’t the only one seeing those, so maybe you could work *with* other people rather than whining.

    But no, this is high-performance. All rules must go out the window for a high-performance adapter. After all, it’ll be high-performance for at least another 6 months before something else comes along.

  2. roland says:

    I’m not really sure who the “Oh, think for a minute” reply is ranting at. Who has decided on one solution? Who is refusing to work with people rather than whining? It sure sounds targeted at the people who stick to the “TOE is evil” mantra and refuse to look for common ground, but the last paragraph seems to be aimed at the people trying to work with the whole Linux networking community and make iWARP work a little better.

    More specifically, how do callbacks into the kernel make it possible to avoid having both a normal TCP socket and an iWARP socket bound to the same 4-tuple? There’s no call-back needed, since the kernel is fully aware of all the state at the time an application does a bind(), and I don’t see how a call-back helps anyway.

  3. !foo says:

    I think the institutional and traditional outlook evinced by the native ip stack devs (in the absence of non-conflicting suggestions from the (RDMA/iWARP) devs) is not out of line. The comment linked to, while negative, had the merit of not suggesting a further division of the standard method of BSD socket port allocation.
    Granted, the benefit to traditional network filesystems, etc., seems potentially considerable.
    The lack of hardware/kernel communication seems to be an issue that should be dealt with by an iWARP NIC driver with hooks deep enough to ensure no port allocation conflict occurs. What does that take?