Archive for the ‘rdma’ Category

Rocky roads

Monday, April 19th, 2010

I saw that the InfiniBand Trade Association announced the “RDMA over Converged Ethernet (RoCE)” specification today.  I’ve already discussed my thoughts on the underlying technology (although I have a bit more to say), so for now I just want to say that I really, truly hate the name they chose.  There are at least two things that suck about the name:

  1. Calling the technology “RDMA over” instead of “InfiniBand over” is overly vague and intentionally deceptive.  We already have “RDMA over Ethernet” — except we’ve been calling it iWARP.  Choosing “RoCE” is somewhat like talking about “Storage over Ethernet” instead of “Fibre Channel over Ethernet.”  Sure, FCoE is storage over ethernet, but so is iSCSI.  As for the intentionally deceptive part: I’ve been told that “InfiniBand” was left out of the name because the InfiniBand Trade Association felt that InfiniBand is viewed negatively in some of the markets they’re going after.  What does that say about your marketing when you are running away from your own main trademark?
  2. The term “Converged Ethernet” is also pretty meaningless.  The actual technology has nothing to do with “converged” ethernet (whatever that is, exactly); the annex that was just release simply describes how to stick InfiniBand packets inside a MAC header and Ethernet FCS, so simply “Ethernet” would be more accurate.  At least the “CE” part is an improvement over the previous try, “Converged Enhanced Ethernet” or “CEE”; not only does the technology have nothing to do with CEE either, “CEE” was an IBM-specific marketing term for what eventually became Data Center Bridging or “DCB.”  (At Cisco we used to use the term “Data Center Ethernet” or “DCE”)

So both the “R” and the “CE” of “RoCE” aren’t very good choices.  It would be a lot clearer and more intellectually honest if we could just call InfiniBand over Ethernet by its proper name: IBoE.  And explaining the technology would be a bit simpler too, since the analogy with FCoE becomes a lot more explicit.

First they laugh at you…

Friday, November 20th, 2009

I found this article in “Network Computing” pretty interesting, although not exactly for the content.   Just the framing of the whole article, with Microsoft is touting the fact that they’ve managed to achieve performance parity with Linux on some HPC benchmarks as an achievement (and putting up a graph that shows they are still at least a few percent behind), shows how dominant Linux is in HPC.  Also, the article says:

The beta also reportedly includes optimizations for new processors and can deploy and manage up to 1,000 nodes.

So in other words Microsoft is stuck at the low end of the HPC market, only usable on small clusters.

RDMA on Converged Ethernet

Wednesday, March 25th, 2009

I recently read Andy Grover’s post about converged fabrics, and since I particupated in the OpenFabrics panel in Sonoma that he alluded to, I thought it might be worth sharing my (somewhat different) thoughts.

The question that Andy is dealing with is how to run RDMA on “Converged Ethernet.” I’ve already explained what RDMA is, so I won’t go into that here, but it’s probably worth talking about Ethernet, since I think the latest developments are not that familiar to many people.  The IEEE has been developing a few standards they collectively refer to as “Data Center Bridging” (DCB) and that are also sometimes referred to as “Converged Enhanced Ethernet” (CEE).  This refers to high speed Ethernet (currently 10 Gb/sec, with a clear path to 40 Gb/sec and 100 Gb/sec), plus new features.  The main new features are:

  • Priority-Based Flow Control (802.1Qbb), sometimes called “per-priority pause”
  • Enhanced Transmission Selection (802.1Qaz)
  • Congestion Notification (802.1Qau)

The first two features let an Ethernet link be split into multiple “virtual links” that operate pretty independently — bandwidth can be reserved for a given virtual link so that it can’t be starved, and by having per-virtual-link flow control, we can make sure certain traffic classes don’t overrun their buffers and avoid dropping packets.  Then congestion notification means that we can tell senders to slow down to avoid congestion spreading caused by that flow control.

The main use case that DCB was developed for was Fibre Channel over Ethernet (FCoE).  FC requires a very reliable network — it simply doesn’t work if packets are dropped because of congestion — and so DCB provides the ability to segregate FCoE traffic onto a “no drop” virtual link.  However, I think Andy misjudges the real motivation for FCoE; the TCP/IP overhead of iSCSI was not really an issue (and indeed there are many people running iSCSI with very high performance on 10 Gb/sec Ethernet).

The real motivation for FCoE is to give a way for users to continue using all the FC storage they already have, while not requiring every server that wants to talk to the storage to have both a NIC and an FC HBA.  With a gateway that’s easy to build an scale, legacy FC storage can be connected to an FCoE fabric, and now servers with a “converged network adapter” that functions as both an Ethernet NIC and an FCoE HBA can talk to network and storage over one (Ethernet) wire.

Now, of course for servers that want to do RDMA, it makes sense that they want a triple-threat converged adapter that does Ethernet NIC, FCoE HBA, and RDMA.  The way that people are running RDMA over Ethernet today is via iWARP, which runs an RDMA protocol layered on top of TCP.  The idea that Andy and several other people in Sonoma are pushing is to do something analogous to FCoE instead, that is, take the InfiniBand transport layer and stick it into Ethernet somehow.  I see a number of problems with this idea.

First, one of the big reasons given for wanting to use InfiniBand on Ethernet instead of iWARP is that it’s the fastest path forward.  The argument is, “we just scribble down a spec, and everyone can ship it easily.”  That ignores the fact that iWARP adapters are already shipping from multiple vendors (although, to be fair, none with support for the proposed IEEE DCB standards yet; but DCB support should soon be ubiquitous in all 10 gigE NICs, iWARP and non-iWARP alike).  And the idea that an IBoE spec is going to be quick or easy to write flies in the face of the experience with FCoE; FCoE sounded dead simple in theory (just stick an Ethernet header on FC frames, what more could there be?) it turns out that the standards work has taken at least 3 years, and a final spec is still not done.  I believe that IBoE would be more complicated to specify, and fewer resources are available for the job, so a realistic view is that a true standard is very far away.

Andy points at a TOE page to say why running TCP on an iWARP NIC sucks.  But when I look at that page, pretty much all the issues are still there with running the IB transport on a NIC.  Just to take the first few on that page (without quibbling about the fact that many of the issues are just wrong even about TCP offload):

  • Security updates: yup, still there for IB
  • Point-in-time solution: yup, same for IB
  • Different network behavior: a hundred times worse if you’re running IB instead of TCP
  • Performance: yup
  • Hardware-specific limits: yup

And so on…

Certainly, given infinite resources, one could design an RDMA protocol that was cleaner than iWARP and took advantage of all the spiffy DCB features.  But worse is better and iWARP mostly works well right now; fixing the worst warts of iWARP has a much better chance of success than trying to shoehorn IB onto Ethernet and ending up with a whole bunch of unforseen problems to solve.

On over-engineering

Wednesday, November 19th, 2008

I’ve been trying to get a udev rule added to Ubuntu so that /dev/infiniband/rdma_cm is owned by group “rdma” instead of by root, so that unprivileged user applications can be given permission to use it by adding the user to the group rdma.  This matches the practice in the Debian udev rules and is a simple way to allow unprivileged use of RDMA while still giving the administrator some control over who exactly uses it.

I created a patch to the Ubuntu librdmacm package containing the appropriate rule and opened a Launchpad bug report requesting that it be applied.  After two months of waiting, I got a response that basically said, “no, we don’t want to do that.”  After another month of asking, I finally found out what solution Ubuntu would rather have:

Access to system devices is provided through the HAL or DeviceKit interface. Permission to access is managed through the PolicyKit layer, where the D-Bus system bus service providing the device access negotiates privilege with the application requesting it.

Because of course, rather than having an application simply open a special device node, mediated by standard Unix permissions, we’d rather have to run a daemon (bonus points for using DBus activation, I guess) and have applications ask that daemon to open the node for them.  More work to implement, harder to administer, less reliable for users — everyone wins!

Sigh….

He who divides and shares is left with the best share

Tuesday, November 6th, 2007

I’ve been talking to a lot of people about the “iWARP port sharing problem” lately, so I thought it might be a good idea to write a quick summary to point at and bring new people up to speed without constantly repeating myself.

To start with, iWARP is an RDMA (remote direct memory access) protocol that runs over TCP (or conceivably SCTP or any other stream protocol). It was defined by the IETF rddp working group, and the standard is in RFC 5040 and later RFCs. So what’s so great about RDMA?

The rationale for RDMA is laid out in great detail in RFC 4297, but the basic idea is that allowing network messages to carry information about where they should be received and allowing the NIC to place the data directly in that buffer allows fundamentally better performance.

To take a concrete example, think of iSCSI: an initiator sends a bunch of SCSI commands to a target (probably queuing up multiple commands), and the target processes the commands, possibly out of order, and returns the responses to the initiator. Without RDMA (or at least, without “direct data placement,” which is pretty equivalent to RDMA), for each read that the initiator does, it has to receive the data from the target, look at which command the data corresponds to, and copy it into the buffer that the SCSI midlayer wants it in. With RDMA and the “iSCSI Extensions for RDMA” (iSER, which is RFC 5046), the target can send the data in response to a read command and have it placed directly in the receive buffer on the initiator, which saves the copy and uses 3x less memory bandwidth (which is huge if the data is running at 10Gb/sec). In the SCSI world, this is nothing particularly exciting: pretty much every Fibre Channel HBA in the world already does the equivalent thing. What’s cool about iWARP is that it allows similar optimizations for NFS (the IETF nfsv4 working group is defining a standard for NFS/RDMA, and kernel 2.6.24-rc1 already has the client side of this draft protocol merged) as well as other applications that we haven’t thought of yet.

The way that iWARP is implemented is that RDMA NICs handle the full iWARP protocol including TCP in hardware — yes, the dreaded “TCP offload engine.” This is crucial to the performance: if the network data isn’t processed to the point of knowing where to put it on the NIC’s side of the PCI bus, then the memory bandwidth savings of copy avoidance is lost. So while one can imagine an iWARP implementation with stateless NIC hardware using some super-fancy header splitting and chipset DMA engine tricks, it’s not clear that it will perform as well as current iWARP NICs do.

Now, in addition to handling TCP connections, iWARP NICs also have to act like normal NICs so that they can handle normal network traffic such as ARPs, pings or ssh logins. What this means is that some packets are received normally and passed up the standard network stack, while other packets that belong to iWARP connections are consumed by the NIC.

This is what leads to the “port sharing problem.” One application might do a normal bind() to accept TCP connections on port X. It might even let the kernel choose a port number for it. Then another application (possibly even the same application) does an iWARP bind and tells the iWARP NIC to accept TCP connections on the same port X. This might happen because two different applications do the bind and have no way of coordinating with each other, or it might happen because one application just passes 0 in the sin_port field of its bind requests, and the kernel chooses the same port for both the normal and iWARP bind(). Whatever the reason, the end result is not good: the NIC and the network stack are left fighting for the same packets, and someone has to lose.

The reason this is an issue is because the kernel’s network stack and iWARP stack have completely separate port allocators, so there is no way for applications to prevent port collisions from happening. The obvious solution is to have normal TCP and iWARP port numbers allocated from the same space.

Unfortunately, the Linux networking developers are not too interested in cooperating on this. It seems that some people have just decided that anyone who wants to use iWARP is wrong to want that (no matter how much better than the alternatives it is for that user’s app) and will just reflexively reject anything iWARP-related without trying to engage in constructive discussion. (Given that attitude, it’s rather ironic when the same people preach about open-mindedness and “thinking outside the box,” but let’s not get sidetracked…)

Given the current deadlock, the advice I’ve been giving to the various iWARP NIC companies is just to sell a lot of iWARP NICs and make the problem so big that we’re forced to find a solution. I don’t see any other way to force people to work together.

Materials from LinuxConf.eu RDMA tutorial (at last)

Wednesday, October 10th, 2007

At long last, after several requests, I’ve posted the slides, notes, and client and server examples from the tutorial I gave at LinuxConf.eu 2007 in Cambridge back in September.  Hyper-observant readers will notice that the client program I posted does not match the listing in the notes I handed out; this is because I fixed a race condition in how completions are collected.

I’m not sure how useful all this is without me talking about it, but I guess every little bit helps.  And of course, if you have questions about RDMA or InfiniBand programming, come on over to the mailing list and fire away.

InfiniBand/RDMA in 2.6.23 and 2.6.24

Wednesday, October 10th, 2007

With yesterday’s release of  kernel 2.6.23, I thought it might be a good time to look back at what significant changes are in 2.6.23, and what we have queued up for 2.6.24..

So first I looked at the kernel git log from the v2.6.22 tag to the v2.6.23 tag, and I was surprised to find that nothing really stood out.  We merged something like 158 patches that touched 123 files, but I couldn’t really find any headline-worthy new features in there.  There were just tons of fixes and cleanups all over, although mostly in the low-level hardware drivers.  For some reason, 2.6.23 was a pretty calm development cycle for InfiniBand and RDMA, which means that at least that part of 2.6.23 should be rock solid.

2.6.24 promises to be a somewhat more exciting release for us.  In my for-2.6.24 branch, in addition to the usual pile of fixes and cleanups, I have a couple of interesting changes queued up to merge as soon as Linus starts pulling things in:

  • Sean Hefty’s quality-of-service support.  These changes allow administrators to configure the network to give different QoS parameters to different types of traffic (eg IPoIB, SRP, and so on).
  • A patch from me (based on Sean Hefty’s work) to handle multiple P_Keys for userspace management applications.  This is one of the last pieces to make the InfiniBand stack support IB partitions fully.

Also, bonding support for IP-over-InfiniBand looks set to go in through Jeff Garzik’s tree.  This is something that I’ve been wanting to see for years now; the patches allow the standard bonding module to enslave IPoIB interfaces, which means that multiple IB ports can finally be used for IPoIB high-availability failover.  Moni Shoua and others did a lot of work and stuck with this for a long time, and the final set of patches turned out to be very clean and nice, so I’m really pleased to see this get merged.

Cambridge… England, that is

Monday, August 6th, 2007

My tutorial Writing RDMA applications on Linux has been accepted at LinuxConf Europe 2007. I’ll try to give practical introduction to writing native RDMA applications on Linux — “native” meaning directly to RDMA verbs as opposed to using an additional library layer such as MPI or uDAPL.  I’m aiming to make it accessible to people who know nothing about RDMA, so if you read my blog you’re certainly qualified.  Start planning your trip now!

My presentation is on the morning of Monday, September 3, and I’m flying to England across 7 time zones on Sunday, September 2, so I hope I’m able to remain upright and somewhat coherent for the whole three hours I’m supposed to be speaking….

2.6.19 merge plans for InfiniBand/RDMA

Thursday, August 17th, 2006

Here’s a short summary of what I plan to merge for 2.6.19. I sent this out via email to all the relevant lists, but I figured it can’t hurt to blog it too. Some of this is already in infiniband.git (git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git), while some still needs to be merged up. Highlights:

  • iWARP core support. This updates drivers/infiniband to work with devices that do RDMA over IP/ethernet in addition to InfiniBand devices. As a first user of this support, I also plan to merge the amso1100 driver for Ammasso RNICs.I will post this for review one more time after I pull it into my git tree for last minute cleanups. But if you feel this iWARP support should not be merged, please let me know why now.
  • IBM eHCA driver, which supports IBM pSeries-specific InfiniBand hardware. This is in the ehca branch of infiniband.git, and I will post it for review one more time. My feeling is that more cleanups are certainly possible, but this driver is “good enough to merge” now and has languished out of tree for long enough. I’m certainly happy to merge cleanup patches, though.
  • mmap()ed userspace work queues for ipath. This is a performance enhancement for QLogic/PathScale HCAs but it does touch core stuff in minor ways. Should not be controversial.
  • I also have the following minor changes queued in the for-2.6.19 branch of infiniband.git:
       Ishai Rabinovitz:
             IB/srp: Add port/device attributes

       James Lentini:
             IB/mthca: Include the header we really want

       Michael S. Tsirkin:
             IB/mthca: Don't use privileged UAR for kernel access
             IB/ipoib: Fix flush/start xmit race (from code review)

       Roland Dreier:
             IB/uverbs: Use idr_read_cq() where appropriate
             IB/uverbs: Fix lockdep warning when QP is created with 2 CQs

What is this thing called RDMA?

Thursday, July 13th, 2006

A good way to kick this blog off is probably to explain what this RDMA stuff that I work on really is.
RDMA stands for Remote Direct Memory Access, but the term “RDMA” is usually used to refer to networking technologies that have a software interface with three features:

  • Remote direct memory access (Remote DMA)
  • Asynchronous work queues
  • Kernel bypass

InfiniBand host channel adapters (HCAs) are an example of network adapters that offer such an interface, but RDMA over IP (iWARP) adapters are starting to appear as well.

Anyway, let’s take a look at what these three features really mean.

Remote DMA

Remote DMA is pretty much what it sounds like: DMA on a remote system. The adapter on system 1 can send a message to the adapter on system 2 that causes the adapter on system 2 to DMA data to or from system 2′s memory. The messages come in two main types:

  • RDMA Write: includes an address and data to put at that address, and causes the adapter that receives it to put the supplied data at the specified address
  • RDMA Read: includes an address and a length, and causes the adapter that receives it to generate a reply that sends back the data at the address requested.

These messages are “one-sided” in the sense that they will be processed by the adapter that receive them without involving the CPU on the system that receives the messages.

Letting a remote system DMA into your memory sounds pretty scary, but RDMA adapters give fine-grained control over what remote systems are allowed to do. Going into the details now will make this entry way too long, so for now just trust me that things like protection domains and memory keys allow you to control connection-by-connection and byte-by-byte with separate read and write permissions.

To see why RDMA is useful, you can think of RDMA operations as “direct placement” operations: data comes along with information about where it’s supposed to go. For example, there is a spec for NFS/RDMA, and it’s pretty easy to see why RDMA is nice for NFS. The NFS/RDMA server can service requests in whatever order it wants and return responses via RDMA as they become available; by using direct placement, the responses can go right into the buffers where the client wants them, without requiring the NFS client to do any copying of data.

(There are actually some more complicated RDMA operations that are supported on InfiniBand, namely atomic fetch & add, and atomic compare & swap, but those aren’t quite as common so you can ignore them for now)

Asynchronous work queues

Software talks to RDMA adapters via an aynchronous interface. This doesn’t really have all that much to do with remote DMA, but when we talk about RDMA adapters, we expect this type of interface (which is called a “verbs” interface for some obscure historical reason).

Basically, to use an RDMA adapter, you create objects called queue pairs (or QPs), which as the name suggests are a pair of work queues: a send queue and a receive queue, and completion queues (or CQs). When you want to do something, you tell the adapter to post an operation to one of your work queues. The operation executes asynchronously, and when it’s done, the adapter adds work completion information onto the end of your CQ. When you’re ready, you can go retrieve completion information from the CQ to see which requests have completed.

Operating asynchronously like this makes it easier to overlap computation and communication.

(Incidentally, as the existence of receive queues might make you think, RDMA adapters support plain old “two-sided’ send/receive operations, in addition to one-sided RDMA operations. You can post a receive request to your local receive queue, and then next send message that comes in will be received into the buffer you provided. RDMA operations and send operations can be mixed on the same send queue, too)

Kernel bypass

The last feature which is common to RDMA adapters also has nothing to do with remote DMA per se. But RDMA adapters allow userspace processes to do fast-path operations (posting work requests and retreiving work completions) directly with the hardware without involving the kernel at all. This is nice because these adapters are typically used for high-performance, latency-sensitive applications, and saving the system call overhead is a big win when you’re counting nanoseconds.

OK, that’s it for today’s edition of “RDMA 101.” Now we have some background to talk about some of the more interesting features of RDMA networking.