Are we having, like, a conversation?

Pete replied to my earlier reply and argued that mb() should never be used:

NOOOOOOOOOOO!

In theory, it’s possible to program in Linux kernel by using nothing but mb(). In practice, expirience teaches us that every use of mb() in drivers is a bug. I’m not kidding here. For some reason, even competent and experienced hackers cannot get it right.

I agree that memory barriers should be avoided, which is why I said that “a fat comment about why you need to mess with memory barriers anyway” should always go along with a barrier. However

  1. When a memory barrier is required, I don’t think that using a spinlock as an obfuscated stand-in is an improvement. If a spinlock is just serving as a memory barrier, then you probably don’t have a clear idea of what data the spinlock is protecting. And that’s going to lead to problems when someone does something like expanding the locked region by moving where the spinlock is taken.
  2. Spinlocks actually don’t save you from having your driver blow up on big machines. The number of people that understand mmiowb() is far smaller than the number of people that understand mb(), but no matter how many spinlocks you lock, you may still need mmiowb() for your driver to work on SGI boxes. (Actually I think this is a problem with the interface presented to drivers: someday when I really do feel like tilting at a windmill, I’ll try to kill off mmiowb())

It’s funny: when I read Pete’s entry, I did a quick grep of drivers/usb looking for mb() to show that even non-exotic device drivers need memory barriers. And the first example I pulled up at random, in drivers/usb/host/uhci-hcd.c looks like a clear bug:

        /* End Global Resume and wait for EOP to be sent */
        outw(USBCMD_CF, uhci->io_addr + USBCMD);
        mb();
        udelay(4);
        if (inw(uhci->io_addr + USBCMD) & USBCMD_FGR)
                dev_warn(uhci_dev(uhci), "FGR not stopped yet!n");

The mb() after the outw() does not make sure that the IO reaches the device, since it is probably a posted PCI write which might lurk in a chipset buffer somewhere long after the CPU has retired it. The only way to make sure that the write has actually reached the device is to do a read from the same device to flush the posted write. So the udelay() might expire before the previous write has even reached the device. I guess I’ll be a good citizen and report this on the appropriate mailing list.

An example of what I consider a proper use of a memory barrier in a device driver is in my masterpiece drivers/infiniband/hw/mthca/mthca_cq.c:

        cqe = next_cqe_sw(cq);
        if (!cqe)
                return -EAGAIN;

        /*
         * Make sure we read CQ entry contents after we've checked the
         * ownership bit.
         */
         rmb();

The device DMAs a status block into our memory and then sets a bit in that block to mark it as valid. If we check the valid bit, then we need an rmb() to make sure that we don’t operate on bogus data because the CPU has speculatively executed the reads of the rest of the status block before it was written and then checked the valid bit after it was set (and yes, this was actually seen happening on the PowerPC 970). I can’t think of any reasonable way to write this with a spinlock. Would I just create a dummy spinlock to take and drop for no reason other than memory ordering?

Of course this is driving exactly the sort of high-speed exotic hardware that Pete talked about in his blog, but I think the principle applies to any hardware that DMAs structures to or from host memory. For example, tg3_rx() in drivers/net/tg3.c seems to have a barrier for similar reasons. If you’re writing a driver for DMA-capable hardware, you better have some understanding of memory ordering.

2 Responses to “Are we having, like, a conversation?”

  1. kyle mcmartin says:

    ioports (using in/out) are defined to have non-postable semantics, so the mb() is really just a compiler barrier in your usb case, afaict.

  2. roland says:

    oh, my bad. IO ports are so old-school I just read outw/inw as writew/readw without thinking about it — I never looked at what the UHCI interface looked like before, but sure enough all the PCI device has is 32 bytes of IO ports.

    But surely outw() and inw() are also defined to be ordered with respect to each other? An mb() between outw() and inw() must be totally redundant, or else code that poked at ports in a specific order would be totally insane. So I still think the code is at least somewhat ugly.