I want to mention two things about IBoE. (I’m using the term InfiniBand-over-Ethernet, or IBoE for short, for what the IBTA calls RoCE for reasons already discussed)
First, we merged IBoE support on mlx4 devices into the upstream kernel in 2.6.37-rc1, so IBoE will be in upstream kernel for the 2.6.37 release — one fewer reason to use OFED. (And by the way, we used the term IBoE in the kernel) The requisite libibverbs and libmlx4 patches are not merged yet, but I hope to get to that soon and release new versions of the userspace libraries with IBoE support.
Second, a while ago I promised to detail some of my specific critiques of the IBoE spec (more formally, “Annex A16: RDMA over Converged Ethernet (RoCE)” to the “InfiniBand Architecture Specification Volume 1 Release 1.2.1”; if you want to follow along at home, you can download a copy from the IBTA). So here are two places where I think it’s really obvious that the spec is a half-assed rush job, to the detriment of trying to create interoperable implementations. (Fortunately everyone will just copy what the Linux stack does if they don’t actually just reuse the code, but still it would have been nice if the people writing the standards had thought things through instead of letting us just make something up and hope it there are no corner cases that will bite us later)
- The annex has this to say about address resolution in A16.5.1, “ADDRESS ASSIGNMENT AND RESOLUTION”:
The means for resolving a GID to a local port address (i.e. SMAC or DMAC) are outside the scope of this annex. It is assumed that standard Ethernet mechanisms, such as ARP or Neighbor Discovery are used to maintain an appropriate address cache for RoCE ports.
It’s easy to say that something is “outside the scope” but, uh, who else is going to specify how to turn an IB GID into an Ethernet address, if not the spec about how to run IB over Ethernet packets? And how could ARP conceivably be used, given that GIDs are 128-bit IPv6 addresses? If we’re supposed to use neighbor discovery, a little more guidance about how to coordinate the IPv6 stack and the IB stack might be helpful. In the current Linux code, we finesse all this by assuming that (unicast) GIDs are always local-scope IPv6 addresses with the Ethernet address encoded in them, so converting a GID to a MAC is trivial (cf
- This leads to the second glaring omission from the spec: nowhere are we told how to send multicast packets. The spec explicitly says that multicast should work in IBoE, but nowhere does it say how to map a multicast GID to the Ethernet address to use when sending to that MGID. In Linux we just used the standard mapping from multicast IPv6 addresses to multicast Ethernet addresses, but this is a completely arbitrary choice not supported by the spec at all.
You may hear people defending these omissions from the IBoE spec by saying that these things should be specified elsewhere or are out of scope for the IBTA. This is nonsense: who else is going to specify these things? In my opinion, what happened is simply that (for non-technical reasons) some members of the IBTA wanted to get a spec out very quickly, and this led to a process that was too short to produce a complete spec.