RDMA on Converged Ethernet

I recently read Andy Grover’s post about converged fabrics, and since I participated in the OpenFabrics panel in Sonoma that he alluded to, I thought it might be worth sharing my (somewhat different) thoughts.

The question that Andy is dealing with is how to run RDMA on “Converged Ethernet.”  I’ve already explained what RDMA is, so I won’t go into that here, but it’s probably worth talking about Ethernet, since I think the latest developments are not that familiar to many people.  The IEEE has been developing a few standards collectively referred to as “Data Center Bridging” (DCB), also sometimes called “Converged Enhanced Ethernet” (CEE).  DCB means high-speed Ethernet (currently 10 Gb/sec, with a clear path to 40 Gb/sec and 100 Gb/sec) plus new features.  The main new features are:

  • Priority-Based Flow Control (802.1Qbb), sometimes called “per-priority pause”
  • Enhanced Transmission Selection (802.1Qaz)
  • Congestion Notification (802.1Qau)

The first two features let an Ethernet link be split into multiple “virtual links” that operate largely independently: bandwidth can be reserved for a given virtual link so that it can’t be starved, and per-virtual-link flow control makes sure that certain traffic classes never overrun their buffers and so never have to drop packets.  Congestion notification then lets the network tell senders to slow down, so that this flow control doesn’t simply spread congestion upstream.
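
All of these features key off the 3-bit 802.1p priority field in the VLAN tag, which selects the traffic class (and hence the “virtual link”) a frame belongs to.  Just to make that concrete, here is a minimal sketch of my own (not something from Andy’s post), assuming Linux: an application can ask for its traffic to be queued with a given priority via SO_PRIORITY, and the VLAN egress mapping or the NIC’s DCB configuration (set up separately from the application) decides which traffic class that priority lands in.  The priority value 3 is just a placeholder.

    /*
     * Minimal sketch: request that this socket's traffic be queued with
     * priority 3, a hypothetical "no drop" class.  Whether priority 3
     * actually maps to a lossless traffic class depends on the VLAN
     * egress mapping and the NIC/switch DCB configuration, which are
     * set up outside the application.
     */
    #include <stdio.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int prio = 3;   /* placeholder 802.1p priority */

        if (fd < 0) {
            perror("socket");
            return 1;
        }

        if (setsockopt(fd, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio)) < 0) {
            perror("setsockopt(SO_PRIORITY)");
            return 1;
        }

        printf("socket traffic will be queued with priority %d\n", prio);
        return 0;
    }

The point is only that software names a class of service; the per-priority pause and the bandwidth reservation happen below it, in the NIC and the switches.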

The main use case that DCB was developed for was Fibre Channel over Ethernet (FCoE).  FC requires a very reliable network — it simply doesn’t work if packets are dropped because of congestion — and so DCB provides the ability to segregate FCoE traffic onto a “no drop” virtual link.  However, I think Andy misjudges the real motivation for FCoE; the TCP/IP overhead of iSCSI was not really an issue (and indeed there are many people running iSCSI with very high performance on 10 Gb/sec Ethernet).

The real motivation for FCoE is to give users a way to keep using all the FC storage they already have, without requiring every server that wants to talk to that storage to have both a NIC and an FC HBA.  With a gateway that’s easy to build and scale, legacy FC storage can be connected to an FCoE fabric, and servers with a “converged network adapter” that functions as both an Ethernet NIC and an FCoE HBA can then talk to network and storage over one (Ethernet) wire.

Now, of course, servers that want to do RDMA will want a triple-threat converged adapter that does Ethernet NIC, FCoE HBA, and RDMA.  The way people run RDMA over Ethernet today is iWARP, which layers an RDMA protocol on top of TCP.  The idea that Andy and several other people in Sonoma are pushing is to do something analogous to FCoE instead, that is, take the InfiniBand transport layer and stick it into Ethernet somehow.
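
It’s worth noting up front that this choice is largely invisible to applications: whether the adapter speaks iWARP or IB, software drives it through the same verbs API.  As a quick sketch of my own (not anything from Andy’s post), this tiny libibverbs program lists whatever RDMA devices are present and prints the transport each one uses:

    /*
     * Hedged sketch: list the RDMA devices libibverbs can see and print
     * each one's transport.  Code written to the verbs API looks like
     * this whether the hardware underneath is iWARP or IB.
     * Build with something like: cc -o rdma_devs rdma_devs.c -libverbs
     */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int i, num;
        struct ibv_device **list = ibv_get_device_list(&num);

        if (!list) {
            perror("ibv_get_device_list");
            return 1;
        }

        for (i = 0; i < num; ++i)
            printf("%s: %s transport\n", ibv_get_device_name(list[i]),
                   list[i]->transport_type == IBV_TRANSPORT_IWARP ?
                   "iWARP" : "InfiniBand");

        ibv_free_device_list(list);
        return 0;
    }

I see a number of problems with the IBoE idea.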

First, one of the big reasons given for wanting to use InfiniBand over Ethernet instead of iWARP is that it’s the fastest path forward.  The argument is, “we just scribble down a spec, and everyone can ship it easily.”  That ignores the fact that iWARP adapters are already shipping from multiple vendors (although, to be fair, none with support for the proposed IEEE DCB standards yet; but DCB support should soon be ubiquitous in 10 gigE NICs, iWARP and non-iWARP alike).  And the idea that an IBoE spec is going to be quick or easy to write flies in the face of the experience with FCoE: FCoE sounded dead simple in theory (just stick an Ethernet header on FC frames, what more could there be?), but it turns out that the standards work has taken at least three years, and a final spec is still not done.  I believe that IBoE would be more complicated to specify, and fewer resources are available for the job, so a realistic view is that a true standard is very far away.

Andy points at a TOE page to say why running TCP on an iWARP NIC sucks.  But when I look at that page, pretty much all the issues still apply to running the IB transport on a NIC.  Just to take the first few on that page (without quibbling about the fact that many of the issues are just wrong even about TCP offload):

  • Security updates: yup, still there for IB
  • Point-in-time solution: yup, same for IB
  • Different network behavior: a hundred times worse if you’re running IB instead of TCP
  • Performance: yup
  • Hardware-specific limits: yup

And so on…

Certainly, given infinite resources, one could design an RDMA protocol that was cleaner than iWARP and took advantage of all the spiffy DCB features.  But worse is better, and iWARP mostly works well right now; fixing the worst warts of iWARP has a much better chance of success than trying to shoehorn IB onto Ethernet and ending up with a whole bunch of unforeseen problems to solve.

5 Responses to “RDMA on Converged Ethernet”

  1. Pete says:

    Doing some investigation and I have to agree with Andy.  A) I do not look at it as IBoE; I think I could make a proprietary RDMA protocol between two intelligent NICs pretty quickly.  And B) I would like to avoid the effort and overhead of running TCP/IP on the NIC.  That is the research that brought me here: looking for a non-proprietary, standard, non-TCP method.  These writeups may have motivated me to think more about going proprietary, though I am still researching.

  2. roland says:

    @Pete: A) Not sure why you wouldn’t look at it as IBoE — taking the IB transport protocol and changing the L2 encapsulation to ethernet sure looks like exactly what IBoE means.  You might be able to make a proprietary protocol quickly, but are you going to be able to do enough research/simulation to handle congestion stably, etc.?  And you have to figure out how to tie into ethernet multicast management (most 10 gig ethernet switches are going to do IGMP snooping), do address discovery to know how to talk to other endpoints, and so on.

    B) Yes, TCP introduces some overhead on the NIC, but you need some reliability protocol (and if anything, the IB RC protocol is more complex in some ways than TCP, with multiple kinds of ACKs/NAKs, etc.).  And TCP offload is a solved problem at this point anyway — cf. Chelsio, NetEffect/Intel, etc.
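
    Just to make the address discovery point concrete, here is a minimal sketch of mine (not something anyone in this thread wrote) of what librdmacm already provides today: resolving an ordinary IP address to an RDMA device and route, without the application caring whether the transport underneath is iWARP or IB.  The peer address 192.0.2.1 and the 2000 ms timeout are placeholders.

        /*
         * Minimal sketch: resolve a placeholder IP address to an RDMA
         * device and route via librdmacm.  The same code runs over
         * iWARP or IB hardware.
         * Build with something like: cc -o resolve resolve.c -lrdmacm
         */
        #include <stdio.h>
        #include <netdb.h>
        #include <rdma/rdma_cma.h>

        int main(void)
        {
            struct rdma_event_channel *ch = rdma_create_event_channel();
            struct rdma_cm_id *id;
            struct rdma_cm_event *event;
            struct addrinfo *ai;

            if (!ch || rdma_create_id(ch, &id, NULL, RDMA_PS_TCP)) {
                perror("rdma_create_id");
                return 1;
            }

            if (getaddrinfo("192.0.2.1", NULL, NULL, &ai)) {
                fprintf(stderr, "getaddrinfo failed\n");
                return 1;
            }

            /* kick off address resolution; the 2000 ms timeout is arbitrary */
            if (rdma_resolve_addr(id, NULL, ai->ai_addr, 2000)) {
                perror("rdma_resolve_addr");
                return 1;
            }

            /* wait for ADDR_RESOLVED (or ADDR_ERROR) on the event channel */
            if (!rdma_get_cm_event(ch, &event)) {
                printf("got CM event: %s\n", rdma_event_str(event->event));
                rdma_ack_cm_event(event);
            }

            freeaddrinfo(ai);
            rdma_destroy_id(id);
            rdma_destroy_event_channel(ch);
            return 0;
        }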

  3. [...] announced the “RDMA over Converged Ethernet (RoCE)” specification today.  I’ve already discussed my thoughts on the underlying technology (although I have a bit more to say), so for now I just [...]

  4. James says:

    @roland, I loved your article and pretty much agree with it entirely.  I do believe that RoCE will only pick up if you can use fairly standard ethernet NICs; Roland, do you know if that’s the case?  In some datacenters and in some circumstances TCP isn’t the ideal protocol, and although iWARP does make a huge difference, if you want to avoid TCP, RoCE does seem like a fairly decent alternative… the fact that InfiniBand uses a credit/token based protocol sometimes does yield its benefits with regard to flow control, although it’s true that its implementation of reliability may be more complex than TCP altogether.  Love your blog Roland!! :-)

  5. James says:

    Apparently you can use SoftRoCE from systemfabricworks on standard ethernet NICs to try out RoCE without any special hardware.  Obviously the performance will not be nearly as good as it would be with some of the Mellanox cards or whatever, but it could be a good starting point!