Rocky roads

April 19th, 2010

I saw that the InfiniBand Trade Association announced the “RDMA over Converged Ethernet (RoCE)” specification today.  I’ve already discussed my thoughts on the underlying technology (although I have a bit more to say), so for now I just want to say that I really, truly hate the name they chose.  There are at least two things that suck about the name:

  1. Calling the technology “RDMA over” instead of “InfiniBand over” is overly vague and intentionally deceptive.  We already have “RDMA over Ethernet” — except we’ve been calling it iWARP.  Choosing “RoCE” is somewhat like talking about “Storage over Ethernet” instead of “Fibre Channel over Ethernet.”  Sure, FCoE is storage over ethernet, but so is iSCSI.  As for the intentionally deceptive part: I’ve been told that “InfiniBand” was left out of the name because the InfiniBand Trade Association felt that InfiniBand is viewed negatively in some of the markets they’re going after.  What does that say about your marketing when you are running away from your own main trademark?
  2. The term “Converged Ethernet” is also pretty meaningless.  The actual technology has nothing to do with “converged” ethernet (whatever that is, exactly); the annex that was just released simply describes how to stick InfiniBand packets inside a MAC header and Ethernet FCS, so simply “Ethernet” would be more accurate.  At least the “CE” part is an improvement over the previous try, “Converged Enhanced Ethernet” or “CEE”; not only does the technology have nothing to do with CEE either, “CEE” was an IBM-specific marketing term for what eventually became Data Center Bridging or “DCB.”  (At Cisco we used to use the term “Data Center Ethernet” or “DCE.”)

So both the “R” and the “CE” of “RoCE” aren’t very good choices.  It would be a lot clearer and more intellectually honest if we could just call InfiniBand over Ethernet by its proper name: IBoE.  And explaining the technology would be a bit simpler too, since the analogy with FCoE becomes a lot more explicit.

First they laugh at you…

November 20th, 2009

I found this article in “Network Computing” pretty interesting, although not exactly for the content.  Just the framing of the whole article, with Microsoft touting as an achievement the fact that they’ve managed to reach performance parity with Linux on some HPC benchmarks (while putting up a graph that shows they are still at least a few percent behind), shows how dominant Linux is in HPC.  Also, the article says:

The beta also reportedly includes optimizations for new processors and can deploy and manage up to 1,000 nodes.

So in other words, Microsoft is stuck at the low end of the HPC market, only useful for relatively small clusters.

Poor choice of words

August 11th, 2009

I just got a marketing email with the subject “Diaper blowout: save on Pampers, Huggies, Seventh Generation and more.”  If you’re not a parent and you don’t know why that’s unintentionally hilarious, just do a web search for “diaper blowout.”  Suffice it to say that when placed next to the word “diaper,” the word “blowout” does not usually connote slashed prices.

Lazyweb: best Verizon data card?

June 9th, 2009

I currently have Verizon mobile data service with a Kyocera PC card, and it works well with recent distros using NetworkManager.  However, my venerable laptop is being replaced with a Lenovo X200, which has no PC card slot, so I’ll have to replace my Verizon data card as well.  According to the Verizon Wireless web site, my choices seem to be the NovaTel V740 for ExpressCard, or for USB the UTStarcom UM175 or the Novatel USB760.

My question for the lazyweb is: which data card/EV-DO modem should I get (assume that I’ll be running Linux 99.9% of the time when I use it)?  The ExpressCard is substantially more expensive and less flexible (since I may want to use this card on a system without an ExpressCard slot someday), so I’d probably go with one of the USB cards if it’s left up to me.  The USB760 doubles as a micro SD reader, which is not useful to me, and its mass storage interface will probably just cause confusion, so my first choice would probably be the UM175.  However, if someone with first-hand knowledge knows why that’s a bad decision, I’d love to hear about it in the comments.

(And I put a very high value on not having to boot into Windows periodically to update cell tower locations or anything like that, for what it’s worth.)

RDMA on Converged Ethernet

March 25th, 2009

I recently read Andy Grover’s post about converged fabrics, and since I participated in the OpenFabrics panel in Sonoma that he alluded to, I thought it might be worth sharing my (somewhat different) thoughts.

The question that Andy is dealing with is how to run RDMA on “Converged Ethernet.” I’ve already explained what RDMA is, so I won’t go into that here, but it’s probably worth talking about Ethernet, since I think the latest developments are not that familiar to many people.  The IEEE has been developing a few standards they collectively refer to as “Data Center Bridging” (DCB) and that are also sometimes referred to as “Converged Enhanced Ethernet” (CEE).  This refers to high speed Ethernet (currently 10 Gb/sec, with a clear path to 40 Gb/sec and 100 Gb/sec), plus new features.  The main new features are:

  • Priority-Based Flow Control (802.1Qbb), sometimes called “per-priority pause”
  • Enhanced Transmission Selection (802.1Qaz)
  • Congestion Notification (802.1Qau)

The first two features let an Ethernet link be split into multiple “virtual links” that operate pretty independently — bandwidth can be reserved for a given virtual link so that it can’t be starved, and by having per-virtual-link flow control, we can make sure certain traffic classes don’t overrun their buffers and avoid dropping packets.  Congestion notification then lets us tell senders to slow down, to avoid the congestion spreading that this flow control can otherwise cause.

The main use case that DCB was developed for was Fibre Channel over Ethernet (FCoE).  FC requires a very reliable network — it simply doesn’t work if packets are dropped because of congestion — and so DCB provides the ability to segregate FCoE traffic onto a “no drop” virtual link.  However, I think Andy misjudges the real motivation for FCoE; the TCP/IP overhead of iSCSI was not really an issue (and indeed there are many people running iSCSI with very high performance on 10 Gb/sec Ethernet).

The real motivation for FCoE is to give a way for users to continue using all the FC storage they already have, while not requiring every server that wants to talk to the storage to have both a NIC and an FC HBA.  With a gateway that’s easy to build and scale, legacy FC storage can be connected to an FCoE fabric, and now servers with a “converged network adapter” that functions as both an Ethernet NIC and an FCoE HBA can talk to network and storage over one (Ethernet) wire.

Now, of course for servers that want to do RDMA, it makes sense that they want a triple-threat converged adapter that does Ethernet NIC, FCoE HBA, and RDMA.  The way that people are running RDMA over Ethernet today is via iWARP, which runs an RDMA protocol layered on top of TCP.  The idea that Andy and several other people in Sonoma are pushing is to do something analogous to FCoE instead, that is, take the InfiniBand transport layer and stick it into Ethernet somehow.  I see a number of problems with this idea.

First, one of the big reasons given for wanting to use InfiniBand on Ethernet instead of iWARP is that it’s the fastest path forward.  The argument is, “we just scribble down a spec, and everyone can ship it easily.”  That ignores the fact that iWARP adapters are already shipping from multiple vendors (although, to be fair, none with support for the proposed IEEE DCB standards yet; but DCB support should soon be ubiquitous in all 10 gigE NICs, iWARP and non-iWARP alike).  And the idea that an IBoE spec is going to be quick or easy to write flies in the face of the experience with FCoE; FCoE sounded dead simple in theory (just stick an Ethernet header on FC frames, what more could there be?), yet it turns out that the standards work has taken at least 3 years, and a final spec is still not done.  I believe that IBoE would be more complicated to specify, and fewer resources are available for the job, so a realistic view is that a true standard is very far away.

Andy points at a TOE page to say why running TCP on an iWARP NIC sucks.  But when I look at that page, pretty much all the issues are still there with running the IB transport on a NIC.  Just to take the first few on that page (without quibbling about the fact that many of the issues are just wrong even about TCP offload):

  • Security updates: yup, still there for IB
  • Point-in-time solution: yup, same for IB
  • Different network behavior: a hundred times worse if you’re running IB instead of TCP
  • Performance: yup
  • Hardware-specific limits: yup

And so on…

Certainly, given infinite resources, one could design an RDMA protocol that was cleaner than iWARP and took advantage of all the spiffy DCB features.  But worse is better, and iWARP mostly works well right now; fixing the worst warts of iWARP has a much better chance of success than trying to shoehorn IB onto Ethernet and ending up with a whole bunch of unforeseen problems to solve.

Know anyone at Coverity?

February 19th, 2009

A recent mention of Coverity reminded me that their scan results for the kernel (what they call “linux-2.6”) have become pretty useless lately.  The number of “results” that their checker produces jumped by a factor of 10 a month or so ago, with all of the new results apparently warning about nonsensical things.  For example, CID 8429 is a warning about a resource leak, where the code is:

      req = kzalloc(sizeof *req, GFP_KERNEL);
      if (!req)
              return -ENOMEM;

and the checker thinks that req can be leaked here if we hit the return statement.

The reason for this seems to be that the checker is run with all config options enabled (which is sensible to get maximum code coverage), and in particular that the config variable CONFIG_PROFILE_ALL_BRANCHES is enabled, which leads to a complex C macro redefinition of if() that fatally confuses the scanner.

I’ve sent email to scan-admin about this but have not gotten any reply (or had any effect on the scan).  So I’m appealing to the lazyweb to find someone at Coverity who can fix this and make the scanner useful for the kernel again; having nine-tenths or more of the results be false positives makes it really hard to use the current scans.  The fix is simple: make sure CONFIG_PROFILE_ALL_BRANCHES is not set.  In fact, it may be a good idea to set CONFIG_TRACE_BRANCH_PROFILING to n as well, since enabling that option causes all if statements annotated with likely() or unlikely() to be obfuscated by a complex macro, which will probably lead to a similar level of false positives.

Update: Dave Jones got me in touch with David Maxwell at Coverity, and he updated the kernel config so that we don’t get all the spurious results any more.  Thanks guys!

Signs you may not be dealing with a straight shooter

November 30th, 2008

A play in two scenes.


  • X – a senior manager
  • X’s admin – keeper of X’s schedule
  • Y – a hard-working developer


setting: X’s admin’s desk

X’s admin: Hi, Y.

Y: Hi.  I’d like to get some time with X when there’s an opening.

X’s admin: How about next Tuesday at 1?

Y: Perfect.  Please put me on the schedule then.


setting: X’s office, next Tuesday at 1

X: Hi, Y.

Y: Hi, good to see you.

X: Thanks for coming by.  We haven’t talked for a while and I just wanted to touch base.

Y: …

On over-engineering

November 19th, 2008

I’ve been trying to get a udev rule added to Ubuntu so that /dev/infiniband/rdma_cm is owned by group “rdma” instead of by root, so that unprivileged user applications can be given permission to use it by adding the user to the group rdma.  This matches the practice in the Debian udev rules and is a simple way to allow unprivileged use of RDMA while still giving the administrator some control over who exactly uses it.
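For concreteness, the rule in question is essentially a one-liner along these lines (a sketch only; the rule file name and exact match keys here are illustrative, and the actual Debian rule may differ in detail):

```
# /etc/udev/rules.d/40-ib.rules (illustrative path)
# Create /dev/infiniband/rdma_cm owned by group "rdma" so that
# unprivileged users in that group can open it.
KERNEL=="rdma_cm", NAME="infiniband/%k", GROUP="rdma", MODE="0660"
```

With a rule like that in place, granting a (hypothetical) user access is just ordinary group management, e.g. adduser alice rdma.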

I created a patch to the Ubuntu librdmacm package containing the appropriate rule and opened a Launchpad bug report requesting that it be applied.  After two months of waiting, I got a response that basically said, “no, we don’t want to do that.”  After another month of asking, I finally found out what solution Ubuntu would rather have:

Access to system devices is provided through the HAL or DeviceKit interface. Permission to access is managed through the PolicyKit layer, where the D-Bus system bus service providing the device access negotiates privilege with the application requesting it.

Because of course, rather than having an application simply open a special device node, mediated by standard Unix permissions, we’d rather have to run a daemon (bonus points for using DBus activation, I guess) and have applications ask that daemon to open the node for them.  More work to implement, harder to administer, less reliable for users — everyone wins!


Free Software Syndrome

November 5th, 2008

While thinking about the discussion sparked by my recent series of posts about the career benefits of contributing to the open source community, I started wondering whether this could actually have some effect on how software gets designed.  We’ve all heard of “Not Invented Here Syndrome,” where developers decide to reimplement something, even when the technically better way to go would be to reuse some existing code.

I think there is also a small but growing tendency towards a “Free Software Syndrome,” where developers push management to release something as open source, not necessarily to get the benefits of community testing and review, more developers, or anything like that, but simply so that the developers can be open source developers.

The enemy within

November 5th, 2008

I saw several interesting responses to my last post.  I liked Val’s succinct way of putting the career impact of free software work:

If you are even moderately successful in open source, you will probably end up “employed for life.”

I’m not sure I would go that far, since someone early in their career now might be working for 30, 40 or 50 more years, and it’s hard to say how much the artilects will care about open source after they’ve converted the entire mass of the solar system into computronium.  But certainly as long as the software industry looks the way it does now, developers who are able, both technically and socially, to work in the open source community are going to be in high demand.  (By the way, Val: two ‘e’s in “Dreier” 😛 — the native English-speaking brain seems hard-wired to misspell my last name)

There were also a few comments to my blog post from people who would like to do more in the open source community, but who are prevented by their current employer.  I understand that no job is ideal, and that people often can’t or won’t change jobs for many reasons — having a great commute, a gazillion dollars in stock still to vest, gambling debts to pay or any of a million other things might be reasons to stay in a particular position.

However, I think it’s important for people in those less than ideal positions to view this for what it is — these employers are taking money out of your pockets and limiting your future earnings just as surely as if they forbade you from other professional development, such as learning a new programming language.  I’ve also found that many (although surely not all) software developers are too passive about accepting issues like this.

You need to keep in mind that you are responsible for getting the best for yourself — your interests are not your employer’s interests, and your employer is surely not going to put your interests first, so you had better put your own interests first!  Of course this means that when negotiating things (and you are negotiating from your first job interview to the end of your exit interview), you need to explain why what you want is really in your employer’s interest.

There are many ways you can explain why contribution to open source is in an employer’s interest too, from the direct benefits: “This code has no real relevance to our business, but if we release it and get it upstream, other people will help us maintain it, which saves us money and time in the long term,” to slightly less concrete: “Contributing to free software is good marketing, and it’s going to help us sell as well as recruit and retain good developers,” to the personal: “I know we were told bonuses will be down this year, but something that you can do for me that doesn’t cost anything in your budget is let me contribute this code, which we already agreed the company has no interest in.”