Archive for the ‘hacking’ Category

You can never be too rich or too thin

Friday, March 16th, 2012

Thin Mints by Jesse Michael Nix

One of the cool things about having a storage box that virtualizes everything at sector granularity is that there’s pretty much zero overhead to creating as big a volume as you want.  So I can do

    pureuser@pure-virt> purevol create --size 100p hugevol
    pureuser@pure-virt> purevol connect --host myinit hugevol

and immediately see

    scsi 0:0:0:2: Direct-Access     PURE     FlashArray       100  PQ: 0 ANSI: 6
    sd 0:0:0:2: [sdb] 219902325555200 512-byte logical blocks: (112 PB/100 PiB)
    sd 0:0:0:2: [sdb] Write Protect is off
    sd 0:0:0:2: [sdb] Mode Sense: 2f 00 00 00
    sd 0:0:0:2: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
    sdb: unknown partition table
    sd 0:0:0:2: [sdb] Attached SCSI disk

on the initiator side.  Being able to create gigantic LUNs makes using and managing storage a lot simpler — I don’t have to plan ahead for how much space I’m going to need or anything like that.  But going up to the utterly ridiculous size of 100 petabytes is fun on the initiator side…

First, I tried

    # mkfs.ext4 -V
    mke2fs 1.42.1 (17-Feb-2012)
            Using EXT2FS Library version 1.42.1
    # mkfs.ext4 /dev/sdb

but that seems to get stuck in an infinite loop in ext2fs_initialize() trying to figure out how many inodes it should have per block group. Since block groups are 32768 blocks (128 MB), there are a lot of block groups (something like 800 million) on a 100 PB block device, but ext4 is (I believe) limited to 32-bit inode numbers, so the calculated number of inodes per block group ends up being about 6, which the code then rounds up to a multiple of 8 (that is, up to 8). It double checks that 8 * number of block groups doesn’t overflow 32 bits, but unfortunately it does, so it reduces the inodes/group count it tries and goes around the loop again, which doesn’t work out any better.  (Yes, I’ll report this upstream in a better forum too.)
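The arithmetic behind that loop can be checked with a rough model. This is only a sketch of the reasoning above, not the libext2fs code: the 4 KiB block size, the constant names and the helper functions are assumptions made for illustration, and the figure the library itself arrives at may differ slightly from this back-of-the-envelope value.

```python
# Back-of-the-envelope model of the inode budget described above.
# This is NOT the libext2fs code; constants and helper names are illustrative.

BLOCK_SIZE = 4096            # bytes per filesystem block (assumed 4 KiB)
BLOCKS_PER_GROUP = 32768     # blocks per block group, i.e. 128 MiB per group
INODE_LIMIT = 2**32          # inode numbers are assumed to be 32 bits wide
PIB = 2**50


def round_up(value, multiple):
    """Round value up to the next multiple of `multiple`."""
    return -(-value // multiple) * multiple


def inode_budget(device_bytes):
    """Return (groups, inodes_per_group, rounded, total_inodes) for a device."""
    blocks = device_bytes // BLOCK_SIZE
    groups = -(-blocks // BLOCKS_PER_GROUP)      # ceiling division
    per_group = INODE_LIMIT // groups            # what 32-bit inode numbers allow
    rounded = round_up(max(per_group, 1), 8)     # a multiple of 8, so at least 8
    return groups, per_group, rounded, rounded * groups


groups, per_group, rounded, total = inode_budget(100 * PIB)
print(groups, per_group, rounded, total, total > INODE_LIMIT)
```

Under these assumptions, even the smallest allowed count of 8 inodes per group multiplied by the number of groups exceeds 2^32, so no amount of retrying with a smaller count can succeed. With these sizes the same thing happens for any device beyond roughly 64 PiB.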

Then I tried

    # mkfs.btrfs -V
    mkfs.btrfs, part of Btrfs Btrfs v0.19
    # mkfs.btrfs /dev/sdb

but that gets stuck doing a BLKDISCARD ioctl to clear out the whole device. It turns out my array reports that it can do SCSI UNMAP operations 2048 sectors (1 MB) at a time, so we need to do 100 billion UNMAPs to discard the 100 PB volume. My poor kernel is sitting in the unkillable loop in blkdev_issue_discard() issuing 1 MB UNMAPs as fast as it can, but since the array does about 75,000 UNMAPs per second, it’s going to be a few weeks until that ioctl returns.  (Yes, I’ll send a patch to btrfs-progs to optionally disable the discard)
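The "few weeks" estimate follows directly from the numbers above. Here is a small sketch of the calculation, assuming 512-byte sectors and a steady 75,000 UNMAPs per second; the function and parameter names are made up for illustration.

```python
# Rough estimate of how long discarding the whole volume takes,
# using the figures quoted in the text (names here are illustrative).

SECTOR_BYTES = 512
PIB = 2**50


def unmaps_needed(device_bytes, sectors_per_unmap=2048):
    """Number of UNMAP commands needed to cover the whole device."""
    unmap_bytes = sectors_per_unmap * SECTOR_BYTES   # 1 MiB per UNMAP
    return -(-device_bytes // unmap_bytes)           # ceiling division


def discard_days(device_bytes, unmaps_per_second=75_000, sectors_per_unmap=2048):
    """Days needed to discard the device at a constant UNMAP rate."""
    seconds = unmaps_needed(device_bytes, sectors_per_unmap) / unmaps_per_second
    return seconds / 86_400


count = unmaps_needed(100 * PIB)
days = discard_days(100 * PIB)
print(f"{count:,} UNMAPs, about {days:.1f} days")
```

That works out to a bit over 100 billion UNMAPs and between two and three weeks of wall-clock time at that rate.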

[Aside: I'm actually running the storage inside a VM (with the FC target adapter PCI device passed in directly) that's quite a bit wimpier than real Pure hardware, so that 75K IOPS doing UNMAPs shouldn't be taken as a benchmark of what the real box would do.]

Finally I tried

    # mkfs.xfs -V
    mkfs.xfs version 3.1.7
    # mkfs.xfs -K /dev/sdb

(where the “-K” stops it from issuing the fatal discard) and that actually finished in less than 10 minutes. So I’m able to see

    # mkfs.xfs -K /dev/sdb
    meta-data=/dev/sda               isize=256    agcount=102401, agsize=268435455 blks
             =                       sectsz=512   attr=2, projid32bit=0
    data     =                       bsize=4096   blocks=27487790694400, imaxpct=1
             =                       sunit=0      swidth=0 blks
    naming   =version 2              bsize=4096   ascii-ci=0
    log      =internal log           bsize=4096   blocks=521728, version=2
             =                       sectsz=512   sunit=0 blks, lazy-count=1
    realtime =none                   extsz=4096   blocks=0, rtextents=0
    # mount /dev/sdb /mnt
    # df -h /mnt/
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sdb        100P  3.2G  100P   1% /mnt

New testbed installed

Friday, February 11th, 2011

Since I changed jobs, I left behind a lot of my test systems, but I now have a couple of test systems set up. Here is the rather crazy set of non-chipset devices I now have in one box:

    $ lspci -nn|grep -v 8086:
    03:00.0 InfiniBand [0c06]: Mellanox Technologies MT25208 [InfiniHost III Ex] [15b3:6282] (rev 20)
    04:00.0 Ethernet controller [0200]: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] [15b3:6750] (rev b0)
    05:00.0 InfiniBand [0c06]: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] [15b3:673c] (rev b0)
    84:00.0 Ethernet controller [0200]: NetEffect NE020 10Gb Accelerated Ethernet Adapter (iWARP RNIC) [1678:0100] (rev 05)
    85:00.0 Ethernet controller [0200]: Chelsio Communications Inc T310 10GbE Single Port Adapter [1425:0030]
    86:00.0 InfiniBand [0c06]: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] [15b3:6274] (rev 20)

(I do have a couple of open slots if you have some RDMA cards that I’m missing to complete my collection :) )

Missing the point on startups

Thursday, December 23rd, 2010

I’ve been thinking about Ted Ts’o's recent posts about whether it’s possible to do engineering or work on technology at startups. I’m not going to argue that you can’t work on technology at Google or another big company (although articles like these do point out the difficulties). It would be easy to pick on Google’s failures and point out how many of their successes were actually acquired by buying a startup, but what I really wanted to talk about is how (IMHO) Ted is misunderstanding startups.

Ted’s central point seems to be:

But if your primary interest is to doing great engineering work, then you want go to company that has a proven business model.

Phrased so broadly, that’s bad advice. The reasoning that leads Ted to that bad advice starts with two contradictory misunderstandings of startups:

These days, the founder or founders will have a core idea, which they will hopefully patent, to prevent competitors from replicating their work, just as before. [...] most of the technology developed in a typical startup will tend to be focused on supporting the core idea that was developed by the founder.

and

Because if you talk to any venture capitalist, a startup has one and only one reason to exist: to prove that it has a scalable, viable business model.

In my experience, startups typically start with the founders deciding they’ve found a problem they can solve better, cheaper or faster — but it’s rare for founders to have an idea that’s developed enough to patent the whole thing. Ted I think implies that at a startup, the founders have figured everything out and everyone else is just filling in the details of the idea. To me, that seems completely backwards: if you go to a big company with an established business model, then almost certainly you’ll be working within the outline of that model (Innovator’s Dilemma and all that); at a startup, you’ll have to help the founders figure out just what the hell your company is supposed to be doing. And that gets to the second quote: a startup is an exercise in adapting the technology you’re building until you find the right business model. In other words, nearly every startup will get it wrong to start with and have to change plans repeatedly; the hope is that the technology you build along the way is valuable enough that you can survive until you find the right way to make money.

To give one example from personal experience, when I was at Topspin working on InfiniBand products, early in the InfiniBand hype cycle (around 2001 or so), we thought that every OS would soon ship with InfiniBand drivers, so we focused on building switches and other networking gear, without worrying about the hosts that would be connected to the network. It turned out that the first open source project for a Linux InfiniBand stack fizzled, and Windows also gave up on InfiniBand, so we ended up having to build an InfiniBand host stack — fortunately the embedded software from our switches already had most of the ingredients, and so we were able to pull it off by reusing our embedded work. (That Topspin host stack ended up getting released as free software, and it became one of the ingredients that went into the current Linux InfiniBand stack — and I ended up as the InfiniBand maintainer for the Linux kernel, while working for a startup)

So as I said before, I think it’s bad advice to suggest to someone that “real” engineering can only be done at a large company. Certainly there are huge differences between working at a big company and a small company, and I do believe that there are “big company people” and “small company people.” If your goal is to spend nearly all your time making incremental improvements in ext4, sure, it’s probably easier to do that at a company that is a big enough ext4 user for that work to pay off; on the other hand if you’d rather work on something that you’re making up as you go along and where your decisions shape the whole future of the company, then a startup is probably a better place for you. Similarly, Ted’s assertion

For most startups, though, open source software is something that they will use, but not necessarily develop except in fairly small ways.

misses the real distinction. There are plenty of startups where open source is the main focus (Cloudera, Riptano and Strobe are just a few that spring to mind; and I don’t mean to dis all of the others that I’m not namechecking here), and there are gazillions of big technology companies that are actively hostile to open source. So really, if you want to get paid to work on open source, make sure you go to an open source company; the size of the company is a completely orthogonal issue.

To summarize my advice: if you think you might be a small company person, don’t let Ted scare you away from startups. Oh, and happy holidays!

Was it something I said?

Thursday, June 3rd, 2010

I saw that OpenBSD 4.7 was released a couple of weeks ago.  I tried to help, I really did.

I used to have a fanless 600MHz VIA system with a cheapie Airlink 101 Wi-Fi card that I used as a home wireless router.  I ran OpenBSD on it for a few reasons — at the time I started, the OpenBSD wireless stack was ahead of Linux; their security obsession appealed to me; and not using Linux everywhere seemed like a fun thing to do.  It all worked pretty well, except that the wireless interface sometimes got stuck while forwarding heavy traffic.  For quite a while, I survived with hacks similar to this nutty crontab entry.

Eventually, though, I said to myself, “Self, you’re a kernel hacker.  You should be able to fix this driver.”  And indeed, after a couple of evenings of hacking, I figured out what was wrong and came up with a patch that improved things immensely for me.  The problem was that the driver was not written with a system as slow as mine in mind, and it got confused if more than one interrupt happened before it got a chance to service the first interrupt — you can read the patch description for full details.  Of course, being a good free software citizen, I sent my patch to the OpenBSD mailing lists so that it could be applied upstream.

Here’s where things went wrong.  I never heard from the author of this driver: I got no reply when I reported the original bug, and no replies to any mail I sent about my patch.  I did get several reports from other users who had the same problem and found that my patch fixed things for them as well, and finally another OpenBSD committer wrote, “Then if no one objects I’ll commit it tomorrow.”  Unfortunately, at this point the original driver author did seem to get interested: he sent private email to this committer (not copying the mailing list or me) objecting, and so we ended up with, “Objections were made. Apparently this patch only works for AP and does funky stuff to the hardware. So back to the drawing board on this one.”  As I said, all of my attempts to work directly with the driver author to find out what those objections were or how to improve the patch were ignored.

At this point I gave up on getting my patch upstream (and when I upgraded my wireless network to 802.11n, I chose a MIPS box running OpenWrt).

Know anyone at Coverity?

Thursday, February 19th, 2009

The recent mention of scan.coverity.com at lwn.net reminded me that the Coverity results for the kernel (what they call “linux-2.6”) have become pretty useless lately.  The number of “results” that their checker produces jumped by a factor of 10 a month or so ago, with all of the new results apparently warning about nonsensical things.  For example, CID 8429 is a warning about a resource leak, where the code is:

      req = kzalloc(sizeof *req, GFP_KERNEL);
      if (!req)
              return -ENOMEM;

and the checker thinks that req can be leaked here if we hit the return statement.

The reason for this seems to be that the checker is run with all config options enabled (which is sensible to get maximum code coverage), and in particular it seems to be because the config variable CONFIG_PROFILE_ALL_BRANCHES is enabled, which leads to a complex C macro redefinition of if() that fatally confuses the scanner.

I’ve sent email to scan-admin about this but not gotten any reply (or had any effect on the scan). So I’m appealing to the lazyweb to find someone at Coverity who can fix this and make the scanner useful for the kernel again; having nine-tenths or more of the results be false positives makes it really hard to use the current scans. What needs to be done to fix this is simply to make sure CONFIG_PROFILE_ALL_BRANCHES is not set; in fact it may be a good idea to set CONFIG_TRACE_BRANCH_PROFILING to n as well, since enabling that option causes all if statements annotated with likely() or unlikely() to be obfuscated by a complex macro, which will probably lead to a similar level of false positives.

Update: Dave Jones got me in touch with David Maxwell at Coverity, and he updated the kernel config so that we don’t get all the spurious results any more.  Thanks guys!

On over-engineering

Wednesday, November 19th, 2008

I’ve been trying to get a udev rule added to Ubuntu so that /dev/infiniband/rdma_cm is owned by group “rdma” instead of by root, so that unprivileged user applications can be given permission to use it by adding the user to the group rdma.  This matches the practice in the Debian udev rules and is a simple way to allow unprivileged use of RDMA while still giving the administrator some control over who exactly uses it.

I created a patch to the Ubuntu librdmacm package containing the appropriate rule and opened a Launchpad bug report requesting that it be applied.  After two months of waiting, I got a response that basically said, “no, we don’t want to do that.”  After another month of asking, I finally found out what solution Ubuntu would rather have:

Access to system devices is provided through the HAL or DeviceKit interface. Permission to access is managed through the PolicyKit layer, where the D-Bus system bus service providing the device access negotiates privilege with the application requesting it.

Because of course, rather than having an application simply open a special device node, mediated by standard Unix permissions, we’d rather have to run a daemon (bonus points for using DBus activation, I guess) and have applications ask that daemon to open the node for them.  More work to implement, harder to administer, less reliable for users — everyone wins!

Sigh….

Free Software Syndrome

Wednesday, November 5th, 2008

While thinking about the discussion sparked by my recent series of posts about the career benefits of contributing to the open source community, I started wondering about whether this could actually have some effect on how software gets designed.  We’ve all heard of “Not Invented Here Syndrome,” where developers decide to reimplement something, even when the technically better way to go would be to reuse some existing code.

I think there is also a small but growing tendency towards a “Free Software Syndrome,” where developers push management to release something as open source, not necessarily to get the benefits of community testing and review, more developers, or anything like that, but simply so that the developers can be open source developers.

Quit Today

Wednesday, October 29th, 2008

I’m a slow blogger.  I’ve been meaning to post some thoughts about Greg’s [in]famous LPC keynote for a while now, but it’s taken me nearly two months to get to it.  I’ll start off by saying the same thing that I told Greg in person: I don’t think it was an appropriate setting for Greg to single out Canonical for criticism.  It doesn’t matter who started it, it doesn’t matter what the merits of a particular argument are, and it doesn’t make sense for Greg to say he was not speaking as a Novell employee, since he is a Novell employee.  But I don’t really want to get drawn into that debate.

The slide that really stuck with me from Greg’s talk is the one from the conclusion that says, “Developers who are not allowed to contribute to Linux should change jobs.”  In the text for the talk, Greg writes, “The solution, quit and go work for one of the companies that allow you to do this!”  And I have to agree with this advice, because I think contributing to free software is in the rational self-interest of nearly any software developer.

When I started in this game about a decade ago, I was a typical Silicon Valley “senior software engineer” type: bright guy, knows how to code, decent resume.  I had a circle of people who I had worked with who knew me, but if I went for a job interview, the people interviewing me were usually meeting me for the first time.  But I was fortunate enough to end up in a job where I was working on open source InfiniBand drivers for Linux, and ended up becoming the Linux kernel InfiniBand/RDMA maintainer.  I should mention that it wasn’t a question of being “allowed” to contribute to Linux — I knew that InfiniBand needed open-source Linux support, and I didn’t listen to anyone who said “no.”

It has been great fun and very rewarding to build the Linux InfiniBand stack, but I just want to focus on the venal career side of this.  And the point is that tons of people know me now.  They know what I can do in terms of kernel coding, and maybe more importantly, they can see how I do it, how I respond to bug reports and how I handle the techno-diplomacy of collaborating on mailing lists.  And this has had a definite effect on my career.  I’m not just YASSE (yet another senior software engineer); I get calls from people I’ve never met offering me really interesting and very senior jobs.  And when hiring, I am certainly much more comfortable with candidates who have a visible history of contribution to open source, simply because I can see both their technical and social approach to development.

I do think that the contributions that have this value to individuals can be beyond Greg’s kernel/gcc/binutils/glibc/X view of the “Linux ecosystem.”  If people have done substantial work on, say, bzr or the Ubuntu installer, that’s still something I can go look at when I’m thinking about hiring them.  Of course, making contributions to a highly visible project carries more weight than contributing to a less visible project, but on the other hand, maybe there’s more room to shine in a project with a smaller community of developers.

Finally, Greg’s advice to “quit” made me think of a book from a few years ago, Die Broke, which has as one of its main pieces of advice to “quit today.”  However, this advice is just a provocative way of saying that workers should be conscious of the fact that their employer probably has little to no real loyalty to employees, and so individual workers should focus on their own best interests, rather than what might be best for their employer.  And I think that metaphorical view applies just as well to Greg’s advice to quit: if you feel that you can’t contribute to open source in your present job, what’s stopping you?  Do you really need to change employers to start contributing, or can you just tweak your current job?  What will happen if you tell your manager, “Open source is good for our business for reasons X, Y and Z, and also it’s important to me for my career development, so can we come up with a way for me to start contributing?”

Lies, d… oh, forget it

Friday, April 11th, 2008

I noticed the recent blog post “Cisco Set to Dominate Linux Market?” (which lwn.net also linked), but the part that caught my eye was:

According to a recent Linux Foundation study, Cisco is already contributing to Linux and currently represents 0.5 percent of changes (which is a good number). I would expect that with the AXP in the market, Cisco’s contribution rate will go up.

Now, I don’t work on AXP or anything related to ISRs, so I have no idea what those groups plan to do with respect to Linux, but it was somewhat amusing to see the Linux Foundation report cited to show how much work Cisco does on the kernel.

This isn’t the first time I’ve seen this study cited to show what a kernel development powerhouse Cisco is.  In the report, Cisco is credited with 442 commits to the kernel; however, more than 400 of those commits are mine, and about 30 are Don Fry maintaining the pcnet32 net driver.  So if you take away my work on InfiniBand/RDMA, Cisco’s contributions to the Linux kernel are pretty minimal.

I’m not sure if I have much of a point except that I wish we really did have more than one or two isolated developers at Cisco really engaged with the upstream kernel.

Enterprise distro kernels

Sunday, June 24th, 2007

Greg K-H wrote recently about kernels for “Enterprise Linux” distributions. I’m not sure I get the premise of the article; after all, the whole point of having more than one distro company is that they can compete on the basis of the differences in what they do. So it makes no sense to me to present this issue as something that Red Hat and Novell have to agree on (and it also leaves out Ubuntu’s “LTS” distribution, although I’m not sure if that has taken any of the “enterprise distro” market). Obviously Novell sees a reason for both openSUSE and SLES; why should SLES and RHEL have to be identical?

In fact (although Greg didn’t seem to realize it when he wrote his article), there are already significant differences between the SLES and RHEL kernel updates. SLES has relatively infrequent “SP” releases, where the kernel ABI is allowed to break, while RHEL has update releases roughly every quarter but aims to keep the kernel ABI stable through the life of a full major release.

Greg seems to favor the third proposal in his article, namely rebasing to the latest upstream kernel on every update. However, I don’t think that can work for enterprise distros, for a reason that DaveJ alluded to in his response:

W[ith] each upstream point revision, we fix x regressions, and introduce y new ones. This isn’t going to make enterprise customers paying lots of $ each year very happy.

For a lot of customers, the whole point of staying on an enterprise distro is to stick with something that works for them. No kernel is bug-free and every enterprise distro kernel surely has some awful bugs; what enterprise customers want to avoid are regressions. If SLES10 works for my app on my hardware, then SLES10SP1 better not keel over on the same app and the same hardware because of a broken SATA driver or something like that.

Of course customers often want crazy-sounding stuff, for example, “Give me the 2.6.9 kernel from RHEL4, except I want the InfiniBand drivers from 2.6.21.” (And yes, since I work on InfiniBand a lot, that is definitely a real example, and in fact a lot of effort goes into the “OpenFabrics Enterprise Distribution” (OFED) to make those customers happy.) A kernel hacker’s first reaction to that request is most likely, “Then you should just run 2.6.21.” But if you think some more about what the customers are asking for, it starts to make sense. What they are really saying is that they need the latest and greatest IB features (maybe support for new hardware or a protocol that wasn’t implemented until long after the enterprise kernel was frozen), but they don’t want to risk some new glitch in a part of the kernel where RHEL4’s 2.6.9 is perfectly fine for them.

This is just a special case of Greg’s “support the latest toy” request, and if there were some technical solution for pulling just a subset of new features into an enterprise kernel then that would be great. But as I said before, without a major change in the upstream development process, rebasing enterprise kernels during the lifetime of a major release doesn’t seem to be what customers of enterprise distros want. And I agree with Linus when he says that you can’t slow down development without having people losing interest or going off onto a branch that’s too unstable for real users to test. So I don’t think we want to change our development process to be closer to an enterprise distro.

And given how new features often have dependencies on core kernel changes, I don’t see much hope of a technical solution for the “latest toy” problem. In fact the OFED solution of having the community that works on a particular class of new toys do the backporting seems to be about the best we can do for now.