Some TOEs stink

February 9th, 2013

“Stinky Toes” by ijonas

As you may know, Pure Storage has been shipping iSCSI arrays for a few months.  Recently we got a bug report from a customer: when running heavy IO from a VMware guest, iSCSI IO from the VMware system to a Pure array would stall for minutes at a time.  Fortunately, we were able to get a similar setup in our lab and reproduce the issue, so we could dig in and see what was going on.

The IO was going from the guest to a guest disk image, which is really just a file on the VMware filesystem, so the actual iSCSI initiator was the VMware hypervisor.  We started grabbing packet dumps of the traffic while we reproduced the problem, and noticed a strange pattern of slow retransmissions from the Pure side when the stall occurred.

The test setup had one 10G Ethernet link from the VMware system going through a switch to two 10G ports on the Pure array, so packet loss due to congestion was immediately something we thought of, but we couldn’t understand why we couldn’t recover.  It seemed that when the stall occurred, the Pure side of the TCP connection got stuck with a full send buffer.

The first clue was when we noticed that selective ACK (SACK) was not enabled for our iSCSI TCP connections; indeed, when we captured the connection establishment, the VMware iSCSI initiator was not advertising SACK in its TCP options.  This was kind of a mystery, because when we did other things such as ssh-ing into the VMware command line, it was perfectly happy to set up TCP connections with SACK enabled.
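For reference, whether a peer offers SACK is visible in the options of its SYN segment: SACK-permitted is TCP option kind 4 with length 2. The following is a minimal sketch of checking for it in raw option bytes, not the tooling we actually used; the function names and the sample option bytes are made up for illustration.

```python
SACK_PERMITTED = 4  # TCP option kind for "SACK permitted" (length 2)

def tcp_option_kinds(options):
    """Return the option kinds found in a raw TCP options field."""
    kinds = []
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:            # End of option list
            break
        if kind == 1:            # NOP padding is a single byte with no length
            i += 1
            continue
        if i + 1 >= len(options):
            break                # truncated option, stop parsing
        length = options[i + 1]
        if length < 2:
            break                # malformed length, stop parsing
        kinds.append(kind)
        i += length
    return kinds

def offers_sack(options):
    return SACK_PERMITTED in tcp_option_kinds(options)

# Hypothetical SYN options: MSS, SACK permitted, timestamps, NOP, window scale
syn_with_sack = bytes.fromhex("020405b4" "0402" "080a0000000100000000" "01" "030307")
# Hypothetical SYN options from a stack that does not offer SACK
syn_without_sack = bytes.fromhex("020405b4" "01" "030307")

print(offers_sack(syn_with_sack), offers_sack(syn_without_sack))
```

In a packet capture, the second pattern is what the offloaded iSCSI connections looked like, while ssh connections to the same host looked like the first.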

So not having SACK enabled partially explained why we were not able to recover from packet loss very well: we were running with a large TCP window (hundreds of KB) on a high-speed link, and if some packets got dropped, we might send a few hundred more afterwards; without SACK, the VMware system had no way to tell us which packets to retransmit.

However, TCP was behaving even worse than “lack of SACK” could explain.  What we saw in the packet trace was that after a lost packet, our retransmission timer would expire and we would send the packet after the last one that the initiator ACKed (which would be a few hundred packets before the last one that we sent).  The initiator would ACK that packet, which would advance our send window, and so we would send one new packet beyond the last one we sent — right at the end but definitely within our window.

And then the initiator would just ignore that packet!  The way that TCP is supposed to work is that the receiver should either ACK all the way up to that new packet (if our retransmitted packet was the only lost packet, and it now had all the data up to and including our new packet), or it should send a duplicate ACK (“dup ACK”) that re-ACKed the data we already knew it had.  Just ignoring those packets that are within its receive window is completely inexplicable.

The dup ACK is important because enough of them will trigger “fast retransmission.”  Without that, we’re stuck waiting for a roughly 1/4 second timer to expire between retransmissions, which means it will take way too long to resend a full send window of several hundred packets.  In fact, it takes so long that the initiator just gives up and establishes a new TCP connection after a timeout of a minute or two.
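A back-of-the-envelope calculation shows why timer-driven recovery is hopeless here. The 1/4 second timer is what we saw in the trace; the window of 300 packets is an assumed stand-in for "several hundred", and the sketch ignores any backoff of the retransmission timer, which would only make things worse.

```python
RTO_SECONDS = 0.25        # roughly the gap we saw between retransmissions
PACKETS_IN_WINDOW = 300   # assumption: "several hundred" packets outstanding

def timer_driven_recovery_seconds(packets, rto=RTO_SECONDS):
    """Time to resend `packets` packets if each one waits for its own timeout."""
    return packets * rto

recovery = timer_driven_recovery_seconds(PACKETS_IN_WINDOW)
print(f"{recovery:.0f} seconds to resend the window")
```

Under these assumptions the recovery takes over a minute, which is the same order of magnitude as the timeout after which the initiator gave up on the connection.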

Finally, we realized why the VMware initiator’s TCP behavior was so crazy and primitive.  Fortunately, we had duplicated the customer config and we were using a Broadcom 10G Ethernet adapter with the Broadcom iSCSI offload driver (roughly equivalent to the bnx2i driver in Linux).  This crazy TCP stack wasn’t in the VMware hypervisor — it was running on the network adapter.

(In fact, looking at the Linux kernel sources for bnx2i and cnic, one can see that the Broadcom TCP offload engine apparently has an option “L4_KWQ_CONNECT_REQ1_SACK” for connections, but because the iSCSI initiator driver doesn’t set the “SK_TCP_SACK” flag, it doesn’t get enabled.  One can guess that the Broadcom driver for VMware is probably from a similar codebase, and that kind of explains why we didn’t see SACK enabled.)

Once we realized where the problem was coming from, the fix was simple: switch from the Broadcom offload driver to the normal VMware software iSCSI initiator.  Once we did that, performance became pretty stable, just about saturating the 10G Ethernet link, with occasional hiccups of a few seconds when a congestion drop occurred.  (As a side note, it’s kind of nuts that these days we take it for granted that a storage array can do enough IOPS to get above 1 GB/sec with small random IOs).

In the past I’ve defended TOEs, but in this case the Broadcom NIC and driver aren’t even fully implementing the most primitive form of TCP, so I have to agree that it’s completely unusable.  But it’s worth noting that we tried Chelsio and Emulex adapters with their iSCSI offload drivers, and they worked fine.  I still think TCP and iSCSI offload make sense because they have a fundamental 3x advantage in memory bandwidth (the NIC puts the data where it’s supposed to go, rather than putting it in some random receive buffer and then having the CPU read it and write it to copy it to where it’s supposed to go).

So I don’t think there’s any broad conclusion that can be drawn beyond the fact that one should really never use Broadcom’s iSCSI offload.

You can never be too rich or too thin

March 16th, 2012

Thin Mints by Jesse Michael Nix

One of the cool things about having a storage box that virtualizes everything at sector granularity is that there’s pretty much zero overhead to creating as big a volume as you want.  So I can do

    pureuser@pure-virt> purevol create --size 100p hugevol
    pureuser@pure-virt> purevol connect --host myinit hugevol

and immediately see

    scsi 0:0:0:2: Direct-Access     PURE     FlashArray       100  PQ: 0 ANSI: 6
    sd 0:0:0:2: [sdb] 219902325555200 512-byte logical blocks: (112 PB/100 PiB)
    sd 0:0:0:2: [sdb] Write Protect is off
    sd 0:0:0:2: [sdb] Mode Sense: 2f 00 00 00
    sd 0:0:0:2: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
    sdb: unknown partition table
    sd 0:0:0:2: [sdb] Attached SCSI disk

on the initiator side.  Being able to create gigantic LUNs makes using and managing storage a lot simpler — I don’t have to plan ahead for how much space I’m going to need or anything like that.  But going up to the utterly ridiculous size of 100 petabytes is fun on the initiator side…

First, I tried

    # mkfs.ext4 -V
    mke2fs 1.42.1 (17-Feb-2012)
            Using EXT2FS Library version 1.42.1
    # mkfs.ext4 /dev/sdb

but that seems to get stuck in an infinite loop in ext2fs_initialize() trying to figure out how many inodes it should have per block group. Since block groups are 32768 blocks (128 MB), there are a lot of block groups (something like 800 million) on a 100 PB block device, but ext4 is (I believe) limited to 32-bit inode numbers, so the number of inodes per block group calculated ends up being about 6, which the code then rounds up to a multiple of 8, that is, up to 8. It double checks that 8 * number of block groups doesn’t overflow 32 bits, but unfortunately it does, so it reduces the inodes/group count it tries, and goes around the loop again, which doesn’t work out any better.  (Yes, I’ll report this upstream in a better forum too.)
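The arithmetic behind that loop can be checked directly. This is only a sketch of the numbers, not the actual ext2fs_initialize() code; it assumes 4 KB blocks (which is what 32768 blocks = 128 MB implies) and a hard limit of 2^32 inodes.

```python
BLOCK_SIZE = 4096            # assumption: 4 KB filesystem blocks
BLOCKS_PER_GROUP = 32768     # 128 MB block groups
DEVICE_BYTES = 100 * 2**50   # 100 PiB volume
MAX_INODES = 2**32           # assumption: 32-bit inode numbers

groups = DEVICE_BYTES // (BLOCK_SIZE * BLOCKS_PER_GROUP)

# Largest per-group inode count that still fits in 32 bits overall
max_inodes_per_group = MAX_INODES // groups

# The smallest non-zero multiple of 8 is 8 itself; does even that overflow?
smallest_rounded_count = 8
overflows = smallest_rounded_count * groups > MAX_INODES

print(groups, max_inodes_per_group, overflows)
```

Since fewer than 8 inodes per group fit within 32 bits, no non-zero multiple of 8 can ever pass the overflow check, so retrying with a smaller count and rounding up again cannot converge.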

Then I tried

    # mkfs.btrfs -V
    mkfs.btrfs, part of Btrfs Btrfs v0.19
    # mkfs.btrfs /dev/sdb

but that gets stuck doing a BLKDISCARD ioctl to clear out the whole device. It turns out my array reports that it can do SCSI UNMAP operations 2048 sectors (1 MB) at a time, so we need to do 100 billion UNMAPs to discard the 100 PB volume. My poor kernel is sitting in the unkillable loop in blkdev_issue_discard() issuing 1 MB UNMAPs as fast as it can, but since the array does about 75,000 UNMAPs per second, it’s going to be a few weeks until that ioctl returns.  (Yes, I’ll send a patch to btrfs-progs to optionally disable the discard)
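Spelling out the estimate, using only the figures above (1 MB per UNMAP, roughly 75,000 UNMAPs per second, and treating the 100 PB volume as 100 PiB):

```python
DEVICE_BYTES = 100 * 2**50     # 100 PiB volume
UNMAP_BYTES = 2048 * 512       # 2048 sectors of 512 bytes = 1 MiB per UNMAP
UNMAPS_PER_SECOND = 75_000     # rate observed on this (virtualized) array

unmaps = DEVICE_BYTES // UNMAP_BYTES
seconds = unmaps / UNMAPS_PER_SECOND
days = seconds / 86_400

print(f"{unmaps:,} UNMAPs, about {days:.1f} days")
```

That works out to a bit over 100 billion UNMAPs and between two and three weeks of wall-clock time for the one ioctl.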

[Aside: I’m actually running the storage inside a VM (with the FC target adapter PCI device passed in directly) that’s quite a bit wimpier than real Pure hardware, so that 75K IOPS doing UNMAPs shouldn’t be taken as a benchmark of what the real box would do.]

Finally I tried

    # mkfs.xfs -V
    mkfs.xfs version 3.1.7
    # mkfs.xfs -K /dev/sdb

(where the “-K” stops it from issuing the fatal discard) and that actually finished in less than 10 minutes. So I’m able to see

    # mkfs.xfs -K /dev/sdb
    meta-data=/dev/sda               isize=256    agcount=102401, agsize=268435455 blks
             =                       sectsz=512   attr=2, projid32bit=0
    data     =                       bsize=4096   blocks=27487790694400, imaxpct=1
             =                       sunit=0      swidth=0 blks
    naming   =version 2              bsize=4096   ascii-ci=0
    log      =internal log           bsize=4096   blocks=521728, version=2
             =                       sectsz=512   sunit=0 blks, lazy-count=1
    realtime =none                   extsz=4096   blocks=0, rtextents=0
    # mount /dev/sdb /mnt
    # df -h /mnt/
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sdb        100P  3.2G  100P   1% /mnt

Do you know everything about RDMA?

March 7th, 2011

As I mentioned on Twitter (by the way, are you following @rolanddreier?), I’ll be speaking at the Linux Foundation Collaboration Summit in San Francisco on April 7.  My general mandate is to give an introduction to RDMA and InfiniBand on Linux, and to talk about recent developments and what might be coming next in the area.  However, I’d like to make my talk a little less boring than my usual talks, so I’d be curious to hear about specific topics you’d like me to cover.  And if you’re at the summit, stop by and say hello.

Want to work with me?

February 16th, 2011


More seriously, in the past few weeks, my new employer (Pure Storage) has said a little more, and I can now link to a real jobs page.  As you can see from the listings, we’re looking for both kick-ass developers as well as people with more QA/tools/scripting skills.  And we definitely are willing to help people fresh out of school learn, as long as they have some experience with Linux.

If you’re interested, you can let me know or just apply directly from the jobs page.  Good luck!

New testbed installed

February 11th, 2011

Since I changed jobs, I left behind a lot of my test systems, but I now have a couple of test systems set up. Here is the rather crazy set of non-chipset devices I now have in one box:

    $ lspci -nn|grep -v 8086:
    03:00.0 InfiniBand [0c06]: Mellanox Technologies MT25208 [InfiniHost III Ex] [15b3:6282] (rev 20)
    04:00.0 Ethernet controller [0200]: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] [15b3:6750] (rev b0)
    05:00.0 InfiniBand [0c06]: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] [15b3:673c] (rev b0)
    84:00.0 Ethernet controller [0200]: NetEffect NE020 10Gb Accelerated Ethernet Adapter (iWARP RNIC) [1678:0100] (rev 05)
    85:00.0 Ethernet controller [0200]: Chelsio Communications Inc T310 10GbE Single Port Adapter [1425:0030]
    86:00.0 InfiniBand [0c06]: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] [15b3:6274] (rev 20)

(I do have a couple of open slots if you have some RDMA cards that I’m missing to complete my collection :))

Missing the point on startups

December 23rd, 2010

I’ve been thinking about Ted Ts’o’s recent posts about whether it’s possible to do engineering or work on technology at startups. I’m not going to argue that you can’t work on technology at Google or another big company (although articles like these do point out the difficulties). It would be easy to pick on Google’s failures and point out how many of their successes were actually acquired by buying a startup, but what I really wanted to talk about is how (IMHO) Ted is misunderstanding startups.

Ted’s central point seems to be:

But if your primary interest is to doing great engineering work, then you want go to company that has a proven business model.

Phrased so broadly, that’s bad advice. The reasoning that leads Ted to that bad advice starts with two contradictory misunderstandings of startups:

These days, the founder or founders will have a core idea, which they will hopefully patent, to prevent competitors from replicating their work, just as before. […] most of the technology developed in a typical startup will tend to be focused on supporting the core idea that was developed by the founder.


Because if you talk to any venture capitalist, a startup has one and only one reason to exist: to prove that it has a scalable, viable business model.

In my experience, startups typically start with the founders deciding they’ve found a problem they can solve better, cheaper or faster — but it’s rare for founders to have an idea that’s developed enough to patent the whole thing. Ted I think implies that at a startup, the founders have figured everything out and everyone else is just filling in the details of the idea. To me, that seems completely backwards: if you go to a big company with an established business model, then almost certainly you’ll be working within the outline of that model (Innovator’s Dilemma and all that); at a startup, you’ll have to help the founders figure out just what the hell your company is supposed to be doing. And that gets to the second quote: a startup is an exercise in adapting the technology you’re building until you find the right business model. In other words, nearly every startup will get it wrong to start with and have to change plans repeatedly; the hope is that the technology you build along the way is valuable enough that you can survive until you find the right way to make money.

To give one example from personal experience, when I was at Topspin working on InfiniBand products, early in the InfiniBand hype cycle (around 2001 or so), we thought that every OS would soon ship with InfiniBand drivers, so we focused on building switches and other networking gear, without worrying about the hosts that would be connected to the network. It turned out that the first open source project for a Linux InfiniBand stack fizzled, and Windows also gave up on InfiniBand, so we ended up having to build an InfiniBand host stack — fortunately the embedded software from our switches already had most of the ingredients, and so we were able to pull it off by reusing our embedded work. (That Topspin host stack ended up getting released as free software, and it became one of the ingredients that went into the current Linux InfiniBand stack — and I ended up as the InfiniBand maintainer for the Linux kernel, while working for a startup)

So as I said before, I think it’s bad advice to suggest to someone that “real” engineering can only be done at a large company. Certainly there are huge differences between working at a big company and a small company, and I do believe that there are “big company people” and “small company people.” If your goal is to spend nearly all your time making incremental improvements in ext4, sure, it’s probably easier to do that at a company that is a big enough ext4 user for that work to pay off; on the other hand if you’d rather work on something that you’re making up as you go along and where your decisions shape the whole future of the company, then a startup is probably a better place for you. Similarly, Ted’s assertion

For most startups, though, open source software is something that they will use, but not necessarily develop except in fairly small ways.

misses the real distinction. There are plenty of startups where open source is the main focus (Cloudera, Riptano and Strobe are just a few that spring to mind; and I don’t mean to dis all of the others that I’m not namechecking here), and there are gazillions of big technology companies that are actively hostile to open source. So really, if you want to get paid to work on open source, make sure you go to an open source company; the size of the company is a completely orthogonal issue.

To summarize my advice: if you think you might be a small company person, don’t let Ted scare you away from startups. Oh, and happy holidays!

Transition to Linode complete

December 7th, 2010

Hello world!

Two notes on IBoE

December 6th, 2010

I want to mention two things about IBoE.  (I’m using the term InfiniBand-over-Ethernet, or IBoE for short, for what the IBTA calls RoCE for reasons already discussed)

First, we merged IBoE support on mlx4 devices into the upstream kernel in 2.6.37-rc1, so IBoE will be in the upstream kernel for the 2.6.37 release, which means one fewer reason to use OFED.  (And by the way, we used the term IBoE in the kernel.)  The requisite libibverbs and libmlx4 patches are not merged yet, but I hope to get to that soon and release new versions of the userspace libraries with IBoE support.

Second, a while ago I promised to detail some of my specific critiques of the IBoE spec (more formally, “Annex A16: RDMA over Converged Ethernet (RoCE)” to the “InfiniBand Architecture Specification Volume 1 Release 1.2.1”; if you want to follow along at home, you can download a copy from the IBTA).  So here are two places where I think it’s really obvious that the spec is a half-assed rush job, to the detriment of trying to create interoperable implementations.  (Fortunately everyone will just copy what the Linux stack does if they don’t actually just reuse the code, but still it would have been nice if the people writing the standards had thought things through instead of letting us just make something up and hope there are no corner cases that will bite us later.)

  • The annex has this to say about address resolution in A16.5.1, “ADDRESS ASSIGNMENT AND RESOLUTION”:

    The means for resolving a GID to a local port address (i.e. SMAC or DMAC) are outside the scope of this annex. It is assumed that standard Ethernet mechanisms, such as ARP or Neighbor Discovery are used to maintain an appropriate address cache for RoCE ports.

    It’s easy to say that something is “outside the scope” but, uh, who else is going to specify how to turn an IB GID into an Ethernet address, if not the spec about how to run IB over Ethernet packets?  And how could ARP conceivably be used, given that GIDs are 128-bit IPv6 addresses?  If we’re supposed to use neighbor discovery, a little more guidance about how to coordinate the IPv6 stack and the IB stack might be helpful.  In the current Linux code, we finesse all this by assuming that (unicast) GIDs are always local-scope IPv6 addresses with the Ethernet address encoded in them, so converting a GID to a MAC is trivial (cf rdma_get_ll_mac()).

  • This leads to the second glaring omission from the spec: nowhere are we told how to send multicast packets.  The spec explicitly says that multicast should work in IBoE, but nowhere does it say how to map a multicast GID to the Ethernet address to use when sending to that MGID.  In Linux we just used the standard mapping from multicast IPv6 addresses to multicast Ethernet addresses, but this is a completely arbitrary choice not supported by the spec at all.
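To make the two mappings concrete, here is a rough Python sketch of what they amount to; it is an illustration, not the kernel code. It assumes the unicast GID is a link-local IPv6 address whose interface ID is the usual modified EUI-64 form of the MAC, and it uses the standard IPv6-over-Ethernet multicast mapping (33:33 followed by the low 32 bits of the address). The function names and sample addresses are made up.

```python
import ipaddress

def format_mac(mac_bytes):
    return ":".join(f"{b:02x}" for b in mac_bytes)

def link_local_gid_to_mac(gid):
    """Recover the MAC embedded in a link-local GID (modified EUI-64 form)."""
    raw = ipaddress.IPv6Address(gid).packed
    # Interface ID is bytes 8..15: three MAC bytes, ff:fe filler, three MAC bytes
    mac = bytearray(raw[8:11] + raw[13:16])
    mac[0] ^= 0x02  # undo the universal/local bit flip done by EUI-64 encoding
    return format_mac(mac)

def multicast_gid_to_mac(mgid):
    """Standard IPv6 multicast mapping: 33:33 plus the low 32 bits of the MGID."""
    raw = ipaddress.IPv6Address(mgid).packed
    return format_mac(b"\x33\x33" + raw[12:16])

# Hypothetical example addresses
print(link_local_gid_to_mac("fe80::202:c9ff:fe01:2345"))
print(multicast_gid_to_mac("ff12::1:ff01:2345"))
```

Nothing in the annex mandates either of these choices, which is exactly the problem: two implementations could each pick a different, equally defensible mapping and fail to interoperate.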

You may hear people defending these omissions from the IBoE spec by saying that these things should be specified elsewhere or are out of scope for the IBTA.  This is nonsense: who else is going to specify these things?  In my opinion, what happened is simply that (for non-technical reasons) some members of the IBTA wanted to get a spec out very quickly, and this led to a process that was too short to produce a complete spec.

Was it something I said?

June 3rd, 2010