Archive for the ‘infiniband’ Category

Cambridge… England, that is

Monday, August 6th, 2007

My tutorial Writing RDMA applications on Linux has been accepted at LinuxConf Europe 2007. I’ll try to give a practical introduction to writing native RDMA applications on Linux — “native” meaning directly to RDMA verbs, as opposed to going through an additional library layer such as MPI or uDAPL.  I’m aiming to make it accessible to people who know nothing about RDMA, so if you read my blog you’re certainly qualified.  Start planning your trip now!

My presentation is on the morning of Monday, September 3, and I’m flying to England across 7 time zones on Sunday, September 2, so I hope I’m able to remain upright and somewhat coherent for the whole three hours I’m supposed to be speaking….

Who knew…

Monday, July 2nd, 2007

…computing icon Andy Bechtolsheim is a top-poster?

Do you feel lucky, punk?

Thursday, June 28th, 2007

Sun just introduced their Constellation supercomputer at ISC Dresden. They’ve managed to get a lot of hype out of this, including mentions in places like the New York Times. But the most interesting part to me is the 3,456-port “Magnum” InfiniBand switch. I haven’t seen many details about it, and I couldn’t find anything on Sun’s Constellation web site.

However, I’ve managed to piece together some info about the switch from the news stories as well as the pictures in this blog entry. Physically, this thing is huge: it looks like it’s about half a rack high and two racks wide. The number 3,456 gives a big clue as to the internal architecture: 3456 = 288 * 12. Current InfiniBand switch chips have 24 ports, and the biggest non-blocking switch one can build with two levels (spine and leaf) is 24 * 12 = 288 ports: 24 leaf switches, each of which has 12 ports to the outside and 12 ports to the spines (one port to each of the 12 spine switches).

Then, using 12 288-port switches as spines, one can take 288 24-port leaf switches that each have 12 ports to the outside and end up with 288 * 12 = 3456 ports, just like Sun’s Magnum switch. From the pictures of the chassis, it looks like Magnum has the spine switches on cards on one side of the midplane and the leaf switches on the other side, using the cute trick of having one set of cards be vertical and one set horizontal to get all-to-all connections between spines and leaves without having too-long midplane traces.

All of this sounds quite reasonable until you start to consider putting all of it in one box. Each 288-port switch (which is on a single card in this design!) has 36 switch chips on it. At about 30 watts per switch chip, each of these cards dissipates over a kilowatt, and there are 12 of them in a system. In fact, with 720 switch chips in the box, the total system is well over 20 kW!

It also seems that the switch uses proprietary high-density connectors that bring three IB ports out of each connector, which reduces the number of external connectors on the switch to a mere 1,152.

One other thing I noticed is that the Sun press stuff is billing the Constellation as running Solaris, while the actual TACC page about the Ranger system says the cluster will be running Linux. I’m inclined to believe TACC, since running Solaris for an InfiniBand cluster seems a little silly, given how far behind Solaris’s InfiniBand support is when compared to Linux, whose InfiniBand stack is lovingly maintained by yours truly.

2.6.19 merge plans for InfiniBand/RDMA

Thursday, August 17th, 2006

Here’s a short summary of what I plan to merge for 2.6.19. I sent this out via email to all the relevant lists, but I figured it can’t hurt to blog it too. Some of this is already in infiniband.git, while some still needs to be merged up. Highlights:

  • iWARP core support. This updates drivers/infiniband to work with devices that do RDMA over IP/Ethernet in addition to InfiniBand devices. As a first user of this support, I also plan to merge the amso1100 driver for Ammasso RNICs. I will post this for review one more time after I pull it into my git tree for last-minute cleanups. But if you feel this iWARP support should not be merged, please let me know why now.
  • IBM eHCA driver, which supports IBM pSeries-specific InfiniBand hardware. This is in the ehca branch of infiniband.git, and I will post it for review one more time. My feeling is that more cleanups are certainly possible, but this driver is “good enough to merge” now and has languished out of tree for long enough. I’m certainly happy to merge cleanup patches, though.
  • mmap()ed userspace work queues for ipath. This is a performance enhancement for QLogic/PathScale HCAs but it does touch core stuff in minor ways. Should not be controversial.
  • I also have the following minor changes queued in the for-2.6.19 branch of infiniband.git:
       Ishai Rabinovitz:
             IB/srp: Add port/device attributes

       James Lentini:
             IB/mthca: Include the header we really want

       Michael S. Tsirkin:
             IB/mthca: Don't use privileged UAR for kernel access
             IB/ipoib: Fix flush/start xmit race (from code review)

       Roland Dreier:
             IB/uverbs: Use idr_read_cq() where appropriate
             IB/uverbs: Fix lockdep warning when QP is created with 2 CQs

What is this thing called RDMA?

Thursday, July 13th, 2006

A good way to kick this blog off is probably to explain what this RDMA stuff that I work on really is.
RDMA stands for Remote Direct Memory Access, but the term “RDMA” is usually used to refer to networking technologies that have a software interface with three features:

  • Remote direct memory access (Remote DMA)
  • Asynchronous work queues
  • Kernel bypass

InfiniBand host channel adapters (HCAs) are an example of network adapters that offer such an interface, but RDMA over IP (iWARP) adapters are starting to appear as well.

Anyway, let’s take a look at what these three features really mean.

Remote DMA

Remote DMA is pretty much what it sounds like: DMA on a remote system. The adapter on system 1 can send a message to the adapter on system 2 that causes the adapter on system 2 to DMA data to or from system 2’s memory. The messages come in two main types:

  • RDMA Write: includes an address and the data to put at that address; the adapter that receives it puts the supplied data at the specified address.
  • RDMA Read: includes an address and a length; the adapter that receives it generates a reply that sends back the data at the requested address.

These messages are “one-sided” in the sense that they are processed by the adapter that receives them without involving the CPU on the system that receives the messages.
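With libibverbs, the Linux userspace verbs library, posting a one-sided RDMA write looks roughly like the sketch below. This is not a complete program: it assumes the queue pair is already connected, that both buffers have been registered with ibv_reg_mr(), and that the remote address and rkey were exchanged out of band. The function name and wr_id value are made up for illustration:

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Push 'len' bytes from a locally registered buffer to 'remote_addr'
 * on the peer, without any involvement from the peer's CPU. */
static int post_rdma_write(struct ibv_qp *qp, void *local_buf, uint32_t lkey,
                           uint32_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) local_buf,
        .length = len,
        .lkey   = lkey,        /* local key from ibv_reg_mr() */
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,                 /* returned in the work completion */
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE, /* IBV_WR_RDMA_READ for a read */
        .send_flags = IBV_SEND_SIGNALED, /* ask for a completion entry */
    };
    struct ibv_send_wr *bad_wr;

    wr.wr.rdma.remote_addr = remote_addr; /* where the data lands */
    wr.wr.rdma.rkey        = rkey;        /* remote key granting access */

    return ibv_post_send(qp, &wr, &bad_wr);
}
```

The rkey in the request is exactly the memory-key mechanism mentioned below: the write only succeeds if the remote side handed out a key covering that address range with write permission.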

Letting a remote system DMA into your memory sounds pretty scary, but RDMA adapters give fine-grained control over what remote systems are allowed to do. Going into the details now would make this entry way too long, so for now just trust me that things like protection domains and memory keys let you control access connection-by-connection and byte-by-byte, with separate read and write permissions.

To see why RDMA is useful, you can think of RDMA operations as “direct placement” operations: data comes along with information about where it’s supposed to go. For example, there is a spec for NFS/RDMA, and it’s pretty easy to see why RDMA is nice for NFS. The NFS/RDMA server can service requests in whatever order it wants and return responses via RDMA as they become available; by using direct placement, the responses can go right into the buffers where the client wants them, without requiring the NFS client to do any copying of data.

(There are actually some more complicated RDMA operations supported on InfiniBand, namely atomic fetch & add and atomic compare & swap, but those aren’t quite as common, so you can ignore them for now.)

Asynchronous work queues

Software talks to RDMA adapters via an asynchronous interface. This doesn’t really have all that much to do with remote DMA, but when we talk about RDMA adapters, we expect this type of interface (which is called a “verbs” interface for some obscure historical reason).

Basically, to use an RDMA adapter, you create objects called queue pairs (or QPs), which as the name suggests are a pair of work queues (a send queue and a receive queue), along with completion queues (or CQs). When you want to do something, you tell the adapter to post an operation to one of your work queues. The operation executes asynchronously, and when it’s done, the adapter adds work completion information onto the end of your CQ. When you’re ready, you can retrieve completion information from the CQ to see which requests have completed.

Operating asynchronously like this makes it easier to overlap computation and communication.
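In libibverbs terms, retrieving completions is a call to ibv_poll_cq(). A sketch of draining a CQ might look like this; the function name, batch size, and error handling are made up for illustration:

```c
#include <stdio.h>
#include <infiniband/verbs.h>

/* Pull whatever completions are ready off a CQ.  A real application
 * would dispatch on wc[i].wr_id to match completions back to the
 * requests it posted, instead of only checking for errors. */
static void drain_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc[16];
    int n, i;

    /* ibv_poll_cq() is non-blocking and returns up to 16 completions
     * per call here -- this is the fast path, with no system call. */
    while ((n = ibv_poll_cq(cq, 16, wc)) > 0) {
        for (i = 0; i < n; ++i) {
            if (wc[i].status != IBV_WC_SUCCESS)
                fprintf(stderr, "request %llu failed: %s\n",
                        (unsigned long long) wc[i].wr_id,
                        ibv_wc_status_str(wc[i].status));
        }
    }
}
```

An application can poll like this in a tight loop when latency matters, or use the completion-event channel to sleep until something arrives.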

(Incidentally, as the existence of receive queues might make you think, RDMA adapters support plain old “two-sided” send/receive operations in addition to one-sided RDMA operations. You can post a receive request to your local receive queue, and the next send message that comes in will be received into the buffer you provided. RDMA operations and send operations can be mixed on the same send queue, too.)
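Posting a receive buffer is the mirror image of posting a send. A sketch, again assuming the buffer was already registered with ibv_reg_mr() (the function name is made up):

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Hand the adapter a buffer; the next incoming send on this QP
 * will be received into it.  Note there is no opcode: a receive
 * queue entry is just a landing spot for whatever arrives. */
static int post_recv(struct ibv_qp *qp, void *buf, uint32_t lkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) buf,
        .length = len,
        .lkey   = lkey,
    };
    struct ibv_recv_wr wr = {
        .wr_id   = (uintptr_t) buf, /* identify the buffer in the completion */
        .sg_list = &sge,
        .num_sge = 1,
    };
    struct ibv_recv_wr *bad_wr;

    return ibv_post_recv(qp, &wr, &bad_wr);
}
```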

Kernel bypass

The last feature common to RDMA adapters also has nothing to do with remote DMA per se: RDMA adapters allow userspace processes to do fast-path operations (posting work requests and retrieving work completions) directly with the hardware, without involving the kernel at all. This is nice because these adapters are typically used for high-performance, latency-sensitive applications, and saving the system call overhead is a big win when you’re counting nanoseconds.

OK, that’s it for today’s edition of “RDMA 101.” Now we have some background to talk about some of the more interesting features of RDMA networking.