Archive for the ‘linux’ Category

You can never be too rich or too thin

Friday, March 16th, 2012

Thin Mints by Jesse Michael Nix

One of the cool things about having a storage box that virtualizes everything at sector granularity is that there’s pretty much zero overhead to creating as big a volume as you want.  So I can do

    pureuser@pure-virt> purevol create --size 100p hugevol
    pureuser@pure-virt> purevol connect --host myinit hugevol

and immediately see

    scsi 0:0:0:2: Direct-Access     PURE     FlashArray       100  PQ: 0 ANSI: 6
    sd 0:0:0:2: [sdb] 219902325555200 512-byte logical blocks: (112 PB/100 PiB)
    sd 0:0:0:2: [sdb] Write Protect is off
    sd 0:0:0:2: [sdb] Mode Sense: 2f 00 00 00
    sd 0:0:0:2: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
    sdb: unknown partition table
    sd 0:0:0:2: [sdb] Attached SCSI disk

on the initiator side.  Being able to create gigantic LUNs makes using and managing storage a lot simpler, since I don’t have to plan ahead for how much space I’m going to need or anything like that.  But going up to the utterly ridiculous size of 100 petabytes leads to some fun on the initiator side…

First, I tried

    # mkfs.ext4 -V
    mke2fs 1.42.1 (17-Feb-2012)
            Using EXT2FS Library version 1.42.1
    # mkfs.ext4 /dev/sdb

but that seems to get stuck in an infinite loop in ext2fs_initialize() trying to figure out how many inodes it should have per block group. Since block groups are 32768 blocks (128 MB), there are a lot of block groups on a 100 PB block device (something like 800 million), but ext4 is (I believe) limited to 32-bit inode numbers, so the calculated number of inodes per block group ends up being about 6, which the code then rounds up to a multiple of 8 (that is, up to 8). It double checks that 8 * number of block groups doesn’t overflow 32 bits, but unfortunately it does, so it reduces the inodes/group count it tries and goes around the loop again, which doesn’t work out any better.  (Yes, I’ll report this upstream in a better forum too.)
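To make the arithmetic concrete, here’s a rough standalone sketch of the calculation as I understand it. This is not the actual ext2fs_initialize() code; it just works through the same numbers assumed above (4 KB blocks, 32768-block groups, 32-bit inode numbers):

    /*
     * Rough sketch of the inodes-per-group arithmetic described above.
     * NOT the actual ext2fs_initialize() code; it just works through the
     * same numbers: 4 KB blocks, 32768-block groups, 32-bit inode numbers.
     */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t dev_bytes = 100ULL << 50;         /* 100 PiB volume        */
        uint64_t blocks    = dev_bytes / 4096;     /* ~27.5 trillion blocks */
        uint64_t groups    = blocks / 32768;       /* ~800 million groups   */

        /* With inode numbers limited to 32 bits, only a handful of inodes
         * per group are possible... */
        uint64_t per_group = UINT32_MAX / groups;
        /* ...but mkfs rounds the count up to a multiple of 8. */
        per_group = (per_group + 7) & ~7ULL;

        /* 8 inodes/group times ~800 million groups no longer fits in 32
         * bits, so mkfs backs off and tries again. */
        printf("groups=%llu inodes/group=%llu total=%llu limit=%u\n",
               (unsigned long long)groups,
               (unsigned long long)per_group,
               (unsigned long long)(per_group * groups),
               UINT32_MAX);
        return 0;
    }

Once the rounded-up count multiplied by the number of groups overflows 32 bits, there’s no inodes-per-group value that satisfies both constraints, so the retry loop has nowhere to converge.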

Then I tried

    # mkfs.btrfs -V
    mkfs.btrfs, part of Btrfs Btrfs v0.19
    # mkfs.btrfs /dev/sdb

but that gets stuck doing a BLKDISCARD ioctl to clear out the whole device. It turns out my array reports that it can do SCSI UNMAP operations 2048 sectors (1 MB) at a time, so we need to do 100 billion UNMAPs to discard the 100 PB volume. My poor kernel is sitting in the unkillable loop in blkdev_issue_discard() issuing 1 MB UNMAPs as fast as it can, but since the array does about 75,000 UNMAPs per second, it’s going to be a few weeks until that ioctl returns.  (Yes, I’ll send a patch to btrfs-progs to optionally disable the discard.)
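For reference, here’s a minimal sketch of what that discard step boils down to, as far as I can tell: a single BLKDISCARD ioctl spanning the whole device, which the kernel then chops into array-sized UNMAPs inside blkdev_issue_discard(). This isn’t the actual mkfs.btrfs code, and it will cheerfully destroy data on whatever device you point it at:

    /*
     * Minimal sketch of the discard step: one BLKDISCARD ioctl over the
     * whole device. NOT the actual mkfs.btrfs code, and it will happily
     * wipe whatever device you hand it.
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    int main(int argc, char **argv)
    {
        uint64_t size, range[2];
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <block device>\n", argv[0]);
            return 1;
        }

        fd = open(argv[1], O_WRONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        if (ioctl(fd, BLKGETSIZE64, &size) < 0) {
            perror("BLKGETSIZE64");
            return 1;
        }

        /*
         * The kernel turns this one call into back-to-back UNMAPs at the
         * array's advertised granularity (1 MB here), so for a 100 PB
         * device at ~75K UNMAPs/sec it won't return for weeks.
         */
        range[0] = 0;
        range[1] = size;
        if (ioctl(fd, BLKDISCARD, range) < 0)
            perror("BLKDISCARD");

        close(fd);
        return 0;
    }

Since all of those UNMAPs happen inside the kernel’s handling of a single ioctl, there’s nothing for a signal to interrupt, which is why it ends up as the unkillable loop mentioned above.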

[Aside: I’m actually running the storage inside a VM (with the FC target adapter PCI device passed in directly) that’s quite a bit wimpier than real Pure hardware, so that 75K IOPS doing UNMAPs shouldn’t be taken as a benchmark of what the real box would do.]

Finally I tried

    # mkfs.xfs -V
    mkfs.xfs version 3.1.7
    # mkfs.xfs -K /dev/sdb

(where the “-K” stops it from issuing the fatal discard), and that actually finished in less than 10 minutes. So I’m able to see

    # mkfs.xfs -K /dev/sdb
    meta-data=/dev/sdb               isize=256    agcount=102401, agsize=268435455 blks
             =                       sectsz=512   attr=2, projid32bit=0
    data     =                       bsize=4096   blocks=27487790694400, imaxpct=1
             =                       sunit=0      swidth=0 blks
    naming   =version 2              bsize=4096   ascii-ci=0
    log      =internal log           bsize=4096   blocks=521728, version=2
             =                       sectsz=512   sunit=0 blks, lazy-count=1
    realtime =none                   extsz=4096   blocks=0, rtextents=0
    # mount /dev/sdb /mnt
    # df -h /mnt/
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sdb        100P  3.2G  100P   1% /mnt

Want to work with me?

Wednesday, February 16th, 2011

Why?

More seriously, in the past few weeks, my new employer (Pure Storage) has said a little more, and I can now link to a real jobs page.