As I mentioned on Twitter (by the way, are you following @rolanddreier?), I’ll be speaking at the Linux Foundation Collaboration Summit in San Francisco on April 7. My general mandate is to give an introduction to RDMA and InfiniBand on Linux, and to talk about recent developments and what might be coming next in the area. However, I’d like to make my talk a little less boring than my usual talks, so I’d be curious to hear about specific topics you’d like me to cover. And if you’re at the summit, stop by and say hello.
Archive for the ‘infiniband’ Category
Since I changed jobs, I left behind a lot of my test systems, but I now have a couple of test systems set up. Here is the rather crazy set of non-chipset devices I now have in one box:
$ lspci -nn|grep -v 8086:
03:00.0 InfiniBand [0c06]: Mellanox Technologies MT25208 [InfiniHost III Ex] [15b3:6282] (rev 20)
04:00.0 Ethernet controller : Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] [15b3:6750] (rev b0)
05:00.0 InfiniBand [0c06]: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] [15b3:673c] (rev b0)
84:00.0 Ethernet controller : NetEffect NE020 10Gb Accelerated Ethernet Adapter (iWARP RNIC) [1678:0100] (rev 05)
85:00.0 Ethernet controller : Chelsio Communications Inc T310 10GbE Single Port Adapter [1425:0030]
86:00.0 InfiniBand [0c06]: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] [15b3:6274] (rev 20)
(I do have a couple of open slots if you have some RDMA cards that I’m missing to complete my collection.)
Today is my last day at Cisco.
A little more than 10 years ago, in January 2001, I joined a small startup called Topspin Communications. We weren’t saying much publicly about what we were doing, but the idea when I joined was to build a super-high-performance box for dynamic web serving, with web app blades, TCP offload blades, storage blades and SSL blades. I was in charge of the SSL blade. However, early 2001 was when it became clear that the bubble was well and truly bursting, and it started to become clear that we weren’t going to have enough customers if we actually built our box, so we abandoned that product. Shortly after this decision, I got a call from the salesman from the company whose encryption chip we had selected for the SSL blade, telling me that they had decided not to build the encryption chip after all. I remember thinking how upset I would have been if he had called a week earlier, when we were still planning a product around the chip.
After a few months of flailing around searching for a product direction (not the most fun time in Topspin’s history), we decided to focus on InfiniBand networking gear. Initially, we focused on connections from servers on an InfiniBand fabric to existing Ethernet and Fibre Channel networks, and thus was born the IGR — InfiniBand Gateway Router — aka Buzz (“To InfiniBand and Beyond”):
This first chassis was pretty far from being a real product: it had only 1X IB ports (2 Gbps!) and was built using Mellanox MT21108 “Gamla” chips. Heroic hardware reworks and software hacks were done just to get the system booting; for example, somehow I added enough IB support to PPCBoot for the line cards to load a kernel from the controller over InfiniBand directed route MADs.
Still, it was enough to get companies like Dell and Microsoft to take us seriously (which helped us raise another $30 million in the summer of 2002). Keep in mind that this was during the time that everyone thought InfiniBand was going to be huge, and Microsoft was planning on having IB drivers in Windows Server 2003. In fact we lugged some prototypes and emulators built on PCs up to Washington State to do interoperability testing and debugging with the Windows driver developers, and even watch Windows kernel developers at work.
When we were designing the next version of this box, one big decision was what 4X IB adapter chip to use inside. The choices were to play it safe with IBM Microelectronics, or to gamble on a startup, Mellanox, who was making bold performance promises. Luckily, we chose Mellanox, since the “safe” choice, IBM, canceled their IB products after struggling to make them work at all. Mellanox’s first spin of their chip worked, and it was an amazing experience to have a real 4X adapter that “just worked” in our lab after all the screwing around with half-baked 1X products that we had gone through (although we did spend plenty of time debugging the driver and firmware for that Tavor adapter).
We worked hard on getting to a real product, and in November 2002, we were able to introduce the Topspin 360, which had 24 4X IB ports, 12 standard IB module slots (each could hold either a 4-port 1G Ethernet gateway or a 2-port Fibre Channel gateway) as well as one very cool bezel design:
In engineering, we followed the 360 with the “90 in 90” challenge and built the Topspin 90 in only 90 days. I was able to get IPoIB working on the 90’s controller, in spite of having only a primitive IB switch and no host adapter available. The Topspin 90 was introduced in January 2003:
The engineering team spent the rest of 2003 building the Topspin 120 24-port switch (another switch chip to get IPoIB working on), and a new 6-port Ethernet gateway. The Ethernet gateway was pretty cool — for the first 4-port Ethernet gateway, we used a PowerPC 440GP along with a Mellanox HCA and some Intel NICs and did all the forwarding between Ethernet and IPoIB in software. Between PCI-X and CPU bottlenecks, we were a bit performance limited. The 6-port gateway used a Xilinx Virtex 2 FPGA with our own InfiniBand logic, and did all the forwarding in hardware, so we were able to handle full line rate of minimum-sized packets in both directions on all 6 Ethernet ports–and in 2003, 12 Gbps of traffic was an awful lot!
Somewhere along the way, it became clear that operating systems (aside from borderline-irrelevant proprietary Unixes like Solaris and HP-UX) would not include InfiniBand drivers out of the box; Microsoft dropped their plans for IB drivers, and the open source Linux project stalled. It became clear that if we wanted anyone to buy InfiniBand networking gear, we would have to take care of the server side of things too, and so we started working on a host driver stack. Luckily, at the very beginning of our InfiniBand development in 2001, we made the decision to use Linux on PowerPC rather than VxWorks as our embedded OS. That meant we had a lot of Linux InfiniBand driver code from our switch systems that we could adapt into host drivers.
At first, we distributed our drivers as proprietary binary blobs, which meant a lot of pain for us building our drivers for every different kernel flavor on every distribution our customers used, and which also meant a lot of pain for our customers who wanted to mix and match IB gear from different vendors. Clearly, for IB to work everyone had to agree on an open source stack, and after a lot of arguing and political wrangling that I’ll skip over here, the OpenIB Alliance was formed, and we started working on InfiniBand drivers for upstream inclusion in Linux.
The starting point of all the different vendor stacks that got released as open source was not particularly good, and although a lot of the community was in denial about it, it was clear to me that we would have to start from scratch to get something clean enough to go upstream. Around February 2004, I was trying to optimize IPoIB performance, and I got so frustrated trying to wade through all the abstraction layers of the Mellanox HCA driver that I decided I would try to write my own drastically simpler driver, and I started working on something I called “mthca”.
By May 2004, I had mthca working enough to run IPoIB and I decided to announce it publicly. This led to another series of flamewars but also enough encouragement from people I considered sane that I continued working on a stack built around mthca, and by December 2004 we had something good enough to go upstream. That was really the start of a lot of great things, and I’m really proud of my role helping to maintain the Linux stack; today we have iWARP support, eight different hardware drivers, IPoIB, storage protocols, network file protocols, RDS; InfiniBand is used in more than half of the Top 500 supercomputers, etc. And I don’t think any of that happens without IB support being upstream.
On the hardware side of things, we continued building things like the Topspin 270 96-port switch (1.5 Tbps of switch capacity!), switches for IBM BladeCenter, and so on. In April 2005, Cisco bought Topspin, and when the deal closed in May 2005, I officially became a Cisco employee. The Topspin IB products became the Cisco SFS product line, and for a brief glorious time, Cisco sold IB gear.
Unfortunately (for the SFS product line, at least), the IB market didn’t grow fast enough to become the billion-dollar market that Cisco looks for, and so Cisco decided to stop selling IB gear. We went from announcing new products to announcing that we wouldn’t sell those products (and I don’t think an SFS 3504 ever actually shipped to a customer). In fact, I personally gummed up the works a bit by putting in an internal order for an SFS 3504 as soon as it was orderable; a year later, the guy responsible for winding down the SFS product line had to track me down and have me cancel the order, which was the last one still on the books.
After we stopped working on InfiniBand stuff, we were bounced around between a few Cisco business units until we ended up working on x86 servers for the Cisco UCS product line. For the past few years, I’ve been helping Cisco build rack servers while continuing to be the InfiniBand/RDMA maintainer for Linux. I’ve helped build cool products such as the Cisco C460 server (some amusing things about the C460 project were debugging UEFI/BIOS that made memtest86+ insta-reboot at a certain memory location, and figuring out why Linux wouldn’t boot on an x86 system with 1TB of RAM). Cisco is a fun, rewarding place to work, and it’s amazing to still work every day with so many people from the old-school Topspin team, who have taught me so much over the years and become good friends along the way.
But since the Cisco acquisition, I’ve always missed the rush of working at a startup (hence my cri de coeur defending startups), and starting on Monday I’ll finally get back to that. My new company is using InfiniBand, and continuing to maintain the upstream stack is part of my official job description, so nothing should be changing about my free software activities. If my next job is half as good as Topspin, it should be an awesome ride.
My new company is still trying to keep things on the down-low, so I’m not going to put a link on my blog. I can say that we still want to hire more great Linux developers, so if you’re interested, please get in touch with me! We’re looking for people to work in-person in downtown Mountain View, CA (really downtown–not off in the Shoreline wilderness near the Googleplex, but actually in the same building as the Mozilla Foundation, near the train station, restaurants, etc). As I said, working remotely isn’t an option, but if you aren’t currently in the area and want to move to Silicon Valley, we can help with relocation and visas (if you’re good enough, of course).
I want to mention two things about IBoE. (I’m using the term InfiniBand-over-Ethernet, or IBoE for short, for what the IBTA calls RoCE for reasons already discussed)
First, we merged IBoE support on mlx4 devices into the upstream kernel in 2.6.37-rc1, so IBoE will be in upstream kernel for the 2.6.37 release — one fewer reason to use OFED. (And by the way, we used the term IBoE in the kernel) The requisite libibverbs and libmlx4 patches are not merged yet, but I hope to get to that soon and release new versions of the userspace libraries with IBoE support.
Second, a while ago I promised to detail some of my specific critiques of the IBoE spec (more formally, “Annex A16: RDMA over Converged Ethernet (RoCE)” to the “InfiniBand Architecture Specification Volume 1 Release 1.2.1”; if you want to follow along at home, you can download a copy from the IBTA). So here are two places where I think it’s really obvious that the spec is a half-assed rush job, to the detriment of trying to create interoperable implementations. (Fortunately everyone will just copy what the Linux stack does if they don’t actually just reuse the code, but still it would have been nice if the people writing the standards had thought things through instead of letting us just make something up and hope there are no corner cases that will bite us later.)
- The annex has this to say about address resolution in A16.5.1, “ADDRESS ASSIGNMENT AND RESOLUTION”:
The means for resolving a GID to a local port address (i.e. SMAC or DMAC) are outside the scope of this annex. It is assumed that standard Ethernet mechanisms, such as ARP or Neighbor Discovery are used to maintain an appropriate address cache for RoCE ports.
It’s easy to say that something is “outside the scope” but, uh, who else is going to specify how to turn an IB GID into an Ethernet address, if not the spec about how to run IB over Ethernet packets? And how could ARP conceivably be used, given that GIDs are 128-bit IPv6 addresses? If we’re supposed to use neighbor discovery, a little more guidance about how to coordinate the IPv6 stack and the IB stack might be helpful. In the current Linux code, we finesse all this by assuming that (unicast) GIDs are always local-scope IPv6 addresses with the Ethernet address encoded in them, so converting a GID to a MAC is trivial (cf
- This leads to the second glaring omission from the spec: nowhere are we told how to send multicast packets. The spec explicitly says that multicast should work in IBoE, but nowhere does it say how to map a multicast GID to the Ethernet address to use when sending to that MGID. In Linux we just used the standard mapping from multicast IPv6 addresses to multicast Ethernet addresses, but this is a completely arbitrary choice not supported by the spec at all.
You may hear people defending these omissions from the IBoE spec by saying that these things should be specified elsewhere or are out of scope for the IBTA. This is nonsense: who else is going to specify these things? In my opinion, what happened is simply that (for non-technical reasons) some members of the IBTA wanted to get a spec out very quickly, and this led to a process that was too short to produce a complete spec.
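To make the two Linux conventions described above concrete, here is a minimal Python sketch. It assumes the usual modified EUI-64 encoding of a MAC into a link-local IPv6 address (flip the universal/local bit and insert ff:fe in the middle) and the standard IPv6 multicast mapping (33:33 followed by the low 32 bits of the address). The function names are hypothetical, and this is an illustration of the idea rather than the kernel's actual code.

```python
import ipaddress

LINK_LOCAL_PREFIX = b"\xfe\x80" + bytes(6)  # fe80::/64


def mac_to_link_local_gid(mac: bytes) -> ipaddress.IPv6Address:
    """Build a link-local GID from a 6-byte MAC (modified EUI-64)."""
    if len(mac) != 6:
        raise ValueError("MAC must be 6 bytes")
    # Flip the universal/local bit and insert ff:fe between the two halves.
    iid = bytes([mac[0] ^ 0x02]) + mac[1:3] + b"\xff\xfe" + mac[3:6]
    return ipaddress.IPv6Address(LINK_LOCAL_PREFIX + iid)


def unicast_gid_to_mac(gid) -> bytes:
    """Recover the MAC from a link-local GID, assuming it encodes one."""
    raw = ipaddress.IPv6Address(gid).packed
    if raw[:8] != LINK_LOCAL_PREFIX or raw[11:13] != b"\xff\xfe":
        raise ValueError("GID does not look like a MAC-derived link-local address")
    return bytes([raw[8] ^ 0x02]) + raw[9:11] + raw[13:16]


def multicast_gid_to_mac(mgid) -> bytes:
    """Map a multicast GID to 33:33 plus the low 32 bits of the address."""
    raw = ipaddress.IPv6Address(mgid).packed
    if raw[0] != 0xFF:
        raise ValueError("not a multicast GID")
    return b"\x33\x33" + raw[12:16]
```

For example, the MAC 00:02:c9:12:34:56 becomes the GID fe80::202:c9ff:fe12:3456, and `unicast_gid_to_mac` turns that GID back into the same MAC with no lookup at all. This only works because of the assumption about how GIDs are formed, which is exactly the part the annex leaves unspecified.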
I saw that the InfiniBand Trade Association announced the “RDMA over Converged Ethernet (RoCE)” specification today. I’ve already discussed my thoughts on the underlying technology (although I have a bit more to say), so for now I just want to say that I really, truly hate the name they chose. There are at least two things that suck about the name:
- Calling the technology “RDMA over” instead of “InfiniBand over” is overly vague and intentionally deceptive. We already have “RDMA over Ethernet” — except we’ve been calling it iWARP. Choosing “RoCE” is somewhat like talking about “Storage over Ethernet” instead of “Fibre Channel over Ethernet.” Sure, FCoE is storage over ethernet, but so is iSCSI. As for the intentionally deceptive part: I’ve been told that “InfiniBand” was left out of the name because the InfiniBand Trade Association felt that InfiniBand is viewed negatively in some of the markets they’re going after. What does that say about your marketing when you are running away from your own main trademark?
- The term “Converged Ethernet” is also pretty meaningless. The actual technology has nothing to do with “converged” ethernet (whatever that is, exactly); the annex that was just released simply describes how to stick InfiniBand packets inside a MAC header and Ethernet FCS, so simply “Ethernet” would be more accurate. At least the “CE” part is an improvement over the previous try, “Converged Enhanced Ethernet” or “CEE”; not only does the technology have nothing to do with CEE either, “CEE” was an IBM-specific marketing term for what eventually became Data Center Bridging or “DCB.” (At Cisco we used to use the term “Data Center Ethernet” or “DCE”)
So both the “R” and the “CE” of “RoCE” aren’t very good choices. It would be a lot clearer and more intellectually honest if we could just call InfiniBand over Ethernet by its proper name: IBoE. And explaining the technology would be a bit simpler too, since the analogy with FCoE becomes a lot more explicit.
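As a rough illustration of how little “converged” machinery is involved in sticking an InfiniBand packet inside a MAC header and Ethernet FCS, here is a minimal Python sketch. The function name is hypothetical, the InfiniBand packet is treated as opaque bytes, and the EtherType value 0x8915 is my recollection of the value assigned for RoCE rather than something stated in this post.

```python
import struct
import zlib

# EtherType used for RoCE frames (an assumption from memory, not from the post).
ROCE_ETHERTYPE = 0x8915


def encapsulate_ib_packet(dst_mac: bytes, src_mac: bytes, ib_packet: bytes) -> bytes:
    """Wrap an opaque InfiniBand packet in a MAC header and trailing FCS."""
    if len(dst_mac) != 6 or len(src_mac) != 6:
        raise ValueError("MAC addresses must be 6 bytes")
    # MAC header: destination, source, EtherType (big-endian).
    header = dst_mac + src_mac + struct.pack("!H", ROCE_ETHERTYPE)
    body = header + ib_packet
    # Ethernet FCS is a CRC-32 over the frame, appended least significant byte first.
    fcs = struct.pack("<I", zlib.crc32(body))
    return body + fcs
```

Nothing in this framing depends on priority flow control or any other DCB feature, which is the point of the complaint about the name. In real hardware the NIC computes the FCS; it is done in software here only to show the layout.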
I found this article in “Network Computing” pretty interesting, although not exactly for the content. Just the framing of the whole article, with Microsoft touting the fact that they’ve managed to achieve performance parity with Linux on some HPC benchmarks as an achievement (and putting up a graph that shows they are still at least a few percent behind), shows how dominant Linux is in HPC. Also, the article says:
The beta also reportedly includes optimizations for new processors and can deploy and manage up to 1,000 nodes.
So in other words Microsoft is stuck at the low end of the HPC market, only usable on small clusters.
I recently read Andy Grover’s post about converged fabrics, and since I participated in the OpenFabrics panel in Sonoma that he alluded to, I thought it might be worth sharing my (somewhat different) thoughts.
The question that Andy is dealing with is how to run RDMA on “Converged Ethernet.” I’ve already explained what RDMA is, so I won’t go into that here, but it’s probably worth talking about Ethernet, since I think the latest developments are not that familiar to many people. The IEEE has been developing a few standards they collectively refer to as “Data Center Bridging” (DCB) and that are also sometimes referred to as “Converged Enhanced Ethernet” (CEE). This refers to high speed Ethernet (currently 10 Gb/sec, with a clear path to 40 Gb/sec and 100 Gb/sec), plus new features. The main new features are:
- Priority-Based Flow Control (802.1Qbb), sometimes called “per-priority pause”
- Enhanced Transmission Selection (802.1Qaz)
- Congestion Notification (802.1Qau)
The first two features let an Ethernet link be split into multiple “virtual links” that operate pretty independently — bandwidth can be reserved for a given virtual link so that it can’t be starved, and by having per-virtual-link flow control, we can make sure certain traffic classes don’t overrun their buffers and avoid dropping packets. Then congestion notification means that we can tell senders to slow down to avoid congestion spreading caused by that flow control.
The main use case that DCB was developed for was Fibre Channel over Ethernet (FCoE). FC requires a very reliable network — it simply doesn’t work if packets are dropped because of congestion — and so DCB provides the ability to segregate FCoE traffic onto a “no drop” virtual link. However, I think Andy misjudges the real motivation for FCoE; the TCP/IP overhead of iSCSI was not really an issue (and indeed there are many people running iSCSI with very high performance on 10 Gb/sec Ethernet).
The real motivation for FCoE is to give a way for users to continue using all the FC storage they already have, while not requiring every server that wants to talk to the storage to have both a NIC and an FC HBA. With a gateway that’s easy to build and scale, legacy FC storage can be connected to an FCoE fabric, and now servers with a “converged network adapter” that functions as both an Ethernet NIC and an FCoE HBA can talk to network and storage over one (Ethernet) wire.
Now, of course for servers that want to do RDMA, it makes sense that they want a triple-threat converged adapter that does Ethernet NIC, FCoE HBA, and RDMA. The way that people are running RDMA over Ethernet today is via iWARP, which runs an RDMA protocol layered on top of TCP. The idea that Andy and several other people in Sonoma are pushing is to do something analogous to FCoE instead, that is, take the InfiniBand transport layer and stick it into Ethernet somehow. I see a number of problems with this idea.
First, one of the big reasons given for wanting to use InfiniBand on Ethernet instead of iWARP is that it’s the fastest path forward. The argument is, “we just scribble down a spec, and everyone can ship it easily.” That ignores the fact that iWARP adapters are already shipping from multiple vendors (although, to be fair, none with support for the proposed IEEE DCB standards yet; but DCB support should soon be ubiquitous in all 10 gigE NICs, iWARP and non-iWARP alike). And the idea that an IBoE spec is going to be quick or easy to write flies in the face of the experience with FCoE; FCoE sounded dead simple in theory (just stick an Ethernet header on FC frames, what more could there be?), but it turns out that the standards work has taken at least 3 years, and a final spec is still not done. I believe that IBoE would be more complicated to specify, and fewer resources are available for the job, so a realistic view is that a true standard is very far away.
Andy points at a TOE page to say why running TCP on an iWARP NIC sucks. But when I look at that page, pretty much all the issues are still there with running the IB transport on a NIC. Just to take the first few on that page (without quibbling about the fact that many of the issues are just wrong even about TCP offload):
- Security updates: yup, still there for IB
- Point-in-time solution: yup, same for IB
- Different network behavior: a hundred times worse if you’re running IB instead of TCP
- Performance: yup
- Hardware-specific limits: yup
And so on…
Certainly, given infinite resources, one could design an RDMA protocol that was cleaner than iWARP and took advantage of all the spiffy DCB features. But worse is better and iWARP mostly works well right now; fixing the worst warts of iWARP has a much better chance of success than trying to shoehorn IB onto Ethernet and ending up with a whole bunch of unforeseen problems to solve.
I’ve been trying to get a udev rule added to Ubuntu so that /dev/infiniband/rdma_cm is owned by group “rdma” instead of by root, so that unprivileged user applications can be given permission to use it by adding the user to the group rdma. This matches the practice in the Debian udev rules and is a simple way to allow unprivileged use of RDMA while still giving the administrator some control over who exactly uses it.
I created a patch to the Ubuntu librdmacm package containing the appropriate rule and opened a Launchpad bug report requesting that it be applied. After two months of waiting, I got a response that basically said, “no, we don’t want to do that.” After another month of asking, I finally found out what solution Ubuntu would rather have:
Access to system devices is provided through the HAL or DeviceKit interface. Permission to access is managed through the PolicyKit layer, where the D-Bus system bus service providing the device access negotiates privilege with the application requesting it.
Because of course, rather than having an application simply open a special device node, mediated by standard Unix permissions, we’d rather have to run a daemon (bonus points for using DBus activation, I guess) and have applications ask that daemon to open the node for them. More work to implement, harder to administer, less reliable for users — everyone wins!
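For comparison, the check that standard Unix permissions perform when an application opens a device node is small enough to sketch in a few lines of Python. This is a simplified model (it ignores ACLs, capabilities and security modules), the function name is hypothetical, and the numeric IDs in the usage note are made up for illustration.

```python
import stat


def may_open_read_write(mode: int, owner_uid: int, owner_gid: int,
                        uid: int, gids: set) -> bool:
    """Simplified model of the kernel's owner/group/other check for O_RDWR."""
    if uid == 0:
        return True  # root bypasses the permission bits
    if uid == owner_uid:
        wanted = stat.S_IRUSR | stat.S_IWUSR
    elif owner_gid in gids:
        wanted = stat.S_IRGRP | stat.S_IWGRP
    else:
        wanted = stat.S_IROTH | stat.S_IWOTH
    return (mode & wanted) == wanted
```

With a node that is mode 0660 and owned by root with group “rdma” (say gid 104), a user whose group set includes 104 gets access and everyone else does not; if the group is root instead, adding users to “rdma” changes nothing. That one ownership change is all the udev rule is asking for.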
At long last, after several requests, I’ve posted the slides, notes, and client and server examples from the tutorial I gave at LinuxConf.eu 2007 in Cambridge back in September. Hyper-observant readers will notice that the client program I posted does not match the listing in the notes I handed out; this is because I fixed a race condition in how completions are collected.
I’m not sure how useful all this is without me talking about it, but I guess every little bit helps. And of course, if you have questions about RDMA or InfiniBand programming, come on over to the mailing list and fire away.
With yesterday’s release of kernel 2.6.23, I thought it might be a good time to look back at what significant changes are in 2.6.23, and what we have queued up for 2.6.24.
So first I looked at the kernel git log from the v2.6.22 tag to the v2.6.23 tag, and I was surprised to find that nothing really stood out. We merged something like 158 patches that touched 123 files, but I couldn’t really find any headline-worthy new features in there. There were just tons of fixes and cleanups all over, although mostly in the low-level hardware drivers. For some reason, 2.6.23 was a pretty calm development cycle for InfiniBand and RDMA, which means that at least that part of 2.6.23 should be rock solid.
2.6.24 promises to be a somewhat more exciting release for us. In my for-2.6.24 branch, in addition to the usual pile of fixes and cleanups, I have a couple of interesting changes queued up to merge as soon as Linus starts pulling things in:
- Sean Hefty’s quality-of-service support. These changes allow administrators to configure the network to give different QoS parameters to different types of traffic (eg IPoIB, SRP, and so on).
- A patch from me (based on Sean Hefty’s work) to handle multiple P_Keys for userspace management applications. This is one of the last pieces to make the InfiniBand stack support IB partitions fully.
Also, bonding support for IP-over-InfiniBand looks set to go in through Jeff Garzik’s tree. This is something that I’ve been wanting to see for years now; the patches allow the standard bonding module to enslave IPoIB interfaces, which means that multiple IB ports can finally be used for IPoIB high-availability failover. Moni Shoua and others did a lot of work and stuck with this for a long time, and the final set of patches turned out to be very clean and nice, so I’m really pleased to see this get merged.