Archive for the ‘hacking’ Category

If I’m a crusader, where’s my cape?

Tuesday, June 5th, 2007

Pete Zaitcev replied to my earlier post about misuse of atomic variables. I never really thought of myself as a crusader or as particularly quixotic, but I’ll respond to the technical content of Pete’s post. I think the disagreement was with my disparagement of the anti-pattern:

int x;

int foo(void)
{
        int y;

        spin_lock(&lock);
        y = x;
        spin_unlock(&lock);

        return y;
}

Pete is entirely correct that spin_lock() and spin_unlock() have full memory barriers; otherwise it would be impossible for anyone to use spinlocks correctly (and of course there’s still mmiowb() to trip you up when someone runs your driver on a big SGI machine).

However, I still think that using a spinlock around an assignment that’s atomic anyway is at best pretty silly. If you just need a memory barrier, then put an explicit mb() (or wmb() or rmb()) there, along with a fat comment about why you need to mess with memory barriers anyway.

Also, Pete is not entirely correct when he says that atomic operations lack memory barriers. All atomic operations that return a value (eg atomic_dec_and_test()) do have a full memory barrier, and if you want a barrier to go with an atomic operation that doesn’t return a value, such as atomic_inc(), the kernel does supply a full range of primitives such as smp_mb__after_atomic_inc(). The file Documentation/memory-barriers.txt in the kernel source tree explains all of this in excruciating detail.

In the end the cost of an atomic op is roughly the same as the cost of a spin_lock()/spin_unlock() pair (they both have to do one locked operation, and everything else is pretty much in the noise for all but the most performance criticial code). Spinlocks are usually easier to think about, so I recommend only using atomic_t when it fits perfectly, such as a reference count (and even then using struct kref is probably better if you can). I’ve found from doing code review that code using atomic_t almost always has bugs, and we don’t have any magic debugging tools to find them (the way we have CONFIG_PROVE_LOCKING, CONFIG_DEBUG_SPINLOCK_SLEEP and so on for locks).

By the way, what’s up with mark mazurek? His blog seems to be an exact copy taken from Pete’s blog feed, with the added bonus of adding a comment to the post in my blog that Pete linked to. There are no ads or really anything beyond an exact duplicate of Pete’s blog, so I can’t figure out what the angle is.

Atomic cargo cults

Sunday, May 13th, 2007

Cargo cult programming refers to “ritual inclusion of code or program structures that serve no real purpose.” One annoying example of this that I see a lot in kernel code that I review is the inappropriate use of atomic_t, in the belief that it magically wards off races.

This type of bogosity is usually marked by variables or structure members of type atomic_t, which are only ever accessed through atomic_read() and atomic_set() without ever using a real atomic operation such as atomic_inc() or atomic_dec_and_test(). Such programming reaches its apotheosis in code like:

        atomic_set(&head, (atomic_read(&head) + 1) % size);

(and yes, this is essentially real code, although I’ve paraphrased it to protect the guilty from embarrassment).

The only point of atomic_t is that arithmetic operations (like atomic_inc() et al) are atomic in the sense that there is no window where two racing operations can step on each other. This helps with situations where an int variable might have ++i and --i race with each other; since these operations are probably implemented as read-modify-write. So if i starts out as 0, the value of i after the two operations might be 0, -1 or 1 depending on when each operation reads the value of i and what order they write back the final value.

If you never use any of these atomic operations, then there’s no point in using atomic_set() to assign a variable and atomic_read() to read it. Your code is no more or less safe than if would be just using normal int variables and assignments.

One way to think about atomic operations is that they might be an optimization instead of using a lock to protect access to a variable. Unfortunately I’m not sure if this will help anyone get things right, since I also see plenty of code like:

int x;

int foo(void)
{

        int y;

        spin_lock(&lock);
        y = x;
        spin_unlock(&lock);

        return y;
}

which is the analogous cargo-cult anti-pattern for spinlocks.

Maybe the best rule would be: if the only functions using atomic_t in your code are atomic_set() and atomic_read(), then you need to write a comment explaining why you’re using atomic at all. Since the vast majority of the time, such a comment will be impossible to write, maybe this will cut down on cargo cult programming a bit. Or more likely it will just make code review more fun by generating nonsensical comments for me to chuckle at.

Oh what a relief it is

Thursday, November 16th, 2006

I just converted the main repositories for two libraries that I maintain, libibverbs and libmthca, from subversion to git. And even though I work with git a lot when working on the kernel, it’s still a shock to see how much easier git makes everything.

A few amusing examples:

$ du -sh libibverbs.*
980K    libibverbs.git
1.3M    libibverbs.svn

Yes, the git checkout with full development history takes up less disk space than a svn working copy with just the tip of the trunk (accessing history goes over the network for svn). And svn needing the network to do stuff has implications beyond just “work on my laptop on a plane”:

$ cd libibverbs.svn
$ time svn log > /dev/null

real    0m0.820s
user    0m0.032s
sys     0m0.004s

$ cd ../libibverbs.git
$ time git log > /dev/null

real    0m0.005s
user    0m0.004s
sys     0m0.000s

Yes, git is more than 100 times faster for showing the log!

And these performance differences make a real productivity difference. With git, I’m much more likely to look through the history, examine past changes, and so on, which means that I waste less time figuring out things that I used to know.

2.6.19 merge plans for InfiniBand/RDMA

Thursday, August 17th, 2006

Here’s a short summary of what I plan to merge for 2.6.19. I sent this out via email to all the relevant lists, but I figured it can’t hurt to blog it too. Some of this is already in infiniband.git (git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git), while some still needs to be merged up. Highlights:

  • iWARP core support. This updates drivers/infiniband to work with devices that do RDMA over IP/ethernet in addition to InfiniBand devices. As a first user of this support, I also plan to merge the amso1100 driver for Ammasso RNICs.I will post this for review one more time after I pull it into my git tree for last minute cleanups. But if you feel this iWARP support should not be merged, please let me know why now.
  • IBM eHCA driver, which supports IBM pSeries-specific InfiniBand hardware. This is in the ehca branch of infiniband.git, and I will post it for review one more time. My feeling is that more cleanups are certainly possible, but this driver is “good enough to merge” now and has languished out of tree for long enough. I’m certainly happy to merge cleanup patches, though.
  • mmap()ed userspace work queues for ipath. This is a performance enhancement for QLogic/PathScale HCAs but it does touch core stuff in minor ways. Should not be controversial.
  • I also have the following minor changes queued in the for-2.6.19 branch of infiniband.git:
       Ishai Rabinovitz:
             IB/srp: Add port/device attributes

       James Lentini:
             IB/mthca: Include the header we really want

       Michael S. Tsirkin:
             IB/mthca: Don't use privileged UAR for kernel access
             IB/ipoib: Fix flush/start xmit race (from code review)

       Roland Dreier:
             IB/uverbs: Use idr_read_cq() where appropriate
             IB/uverbs: Fix lockdep warning when QP is created with 2 CQs

Why you shouldn’t use __attribute__((packed))

Monday, July 31st, 2006

gcc supports an extension that allows structures or structure members to be marked with __attribute__((packed)), which tells gcc to leave out all padding between members. Sometimes you need this, when you want to make sure of the layout of a structure. For example, you might have something like

    struct my_struct {
            uint8_t  field1;
            uint16_t field2;
    } __attribute__((packed));

Without the packed attribute, the struct will have padding between field1 and field2, and that’s no good if this struct is something that has to match hardware or be sent on a wire.

However, it’s actively harmful to add the attribute to a structure that’s already going to be laid out with no padding. Sometimes, when a structure needs to be laid out without padding (because of hardware or wire protocol), people are tempted to add the attribute to a struct like the following “just to let the compiler know”

    struct my_struct {
            uint32_t  field1;
            uint32_t field2;
    };

But adding __attribute__((packed)) goes further than just telling gcc that it shouldn’t add padding – it also tells gcc that it can’t make any assumptions about the alignment of accesses to structure members. And this leads to disastrously bad code on some architectures.

To see this, consider the simple code

    struct foo { int a; };
    struct bar { int b; } __attribute__((packed));

    int c(struct foo *x) { return x->a; }
    int d(struct bar *x) { return x->b; }

On architectures like x86, x86-64 and powerpc, both functions generate the same code. But take a look at what happens on ia64:

   0000000000000000 <c>:
      0:       13 40 00 40 10 10       [MBB]       ld4 r8=[r32]
      6:       00 00 00 00 10 80                   nop.b 0x0
      c:       08 00 84 00                         br.ret.sptk.many b0;;

   0000000000000010 <d>:
     10:       09 70 00 40 00 21       [MMI]       mov r14=r32
     16:       f0 10 80 00 42 00                   adds r15=2,r32
     1c:       34 00 01 84                         adds r32=3,r32;;
     20:       19 80 04 1c 00 14       [MMB]       ld1 r16=[r14],1
     26:       f0 00 3c 00 20 00                   ld1 r15=[r15]
     2c:       00 00 00 20                         nop.b 0x0;;
     30:       09 70 00 1c 00 10       [MMI]       ld1 r14=[r14]
     36:       80 00 80 00 20 e0                   ld1 r8=[r32]
     3c:       f1 78 bd 53                         shl r15=r15,16;;
     40:       01 00 00 00 01 00       [MII]       nop.m 0x0
     46:       e0 70 dc ee 29 00                   shl r14=r14,8
     4c:       81 38 9d 53                         shl r8=r8,24;;
     50:       0b 70 40 1c 0e 20       [MMI]       or r14=r16,r14;;
     56:       f0 70 3c 1c 40 00                   or r15=r14,r15
     5c:       00 00 04 00                         nop.i 0x0;;
     60:       11 00 00 00 01 00       [MIB]       nop.m 0x0
     66:       80 78 20 1c 40 80                   or r8=r15,r8
     6c:       08 00 84 00                         br.ret.sptk.many b0;;

gcc gets scared about unaligned accesses and generates six times as much code (96 bytes vs. 16 bytes)! sparc64 goes similarly crazy, bloating from 12 bytes to 52 bytes:

   0000000000000000 <c>:
      0:       81 c3 e0 08     retl
      4:       d0 42 00 00     ldsw  [ %o0 ], %o0
      8:       30 68 00 06     b,a   %xcc, 20 <d>

   0000000000000020 <d>:
     20:       c6 0a 00 00     ldub  [ %o0 ], %g3
     24:       c2 0a 20 01     ldub  [ %o0 + 1 ], %g1
     28:       c4 0a 20 02     ldub  [ %o0 + 2 ], %g2
     2c:       87 28 f0 18     sllx  %g3, 0x18, %g3
     30:       d0 0a 20 03     ldub  [ %o0 + 3 ], %o0
     34:       83 28 70 10     sllx  %g1, 0x10, %g1
     38:       82 10 40 03     or  %g1, %g3, %g1
     3c:       85 28 b0 08     sllx  %g2, 8, %g2
     40:       84 10 80 01     or  %g2, %g1, %g2
     44:       90 12 00 02     or  %o0, %g2, %o0
     48:       81 c3 e0 08     retl
     4c:       91 3a 20 00     sra  %o0, 0, %o0
     50:       30 68 00 04     b,a   %xcc, 60 <d+0x40>

So the executive summary is: don’t add __attribute__((packed)) to your code unless you know you need it.