You can never be too rich or too thin
Friday, March 16th, 2012

One of the cool things about having a storage box that virtualizes everything at sector granularity is that there’s pretty much zero overhead to creating as big a volume as you want. So I can do
pureuser@pure-virt> purevol create --size 100p hugevol
pureuser@pure-virt> purevol connect --host myinit hugevol
and immediately see
scsi 0:0:0:2: Direct-Access     PURE     FlashArray       100  PQ: 0 ANSI: 6
sd 0:0:0:2: [sdb] 219902325555200 512-byte logical blocks: (112 PB/100 PiB)
sd 0:0:0:2: [sdb] Write Protect is off
sd 0:0:0:2: [sdb] Mode Sense: 2f 00 00 00
sd 0:0:0:2: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
 sdb: unknown partition table
sd 0:0:0:2: [sdb] Attached SCSI disk
on the initiator side. Being able to create gigantic LUNs makes using and managing storage a lot simpler; I don’t have to plan ahead for how much space I’m going to need or anything like that. But going up to the utterly ridiculous size of 100 petabytes does lead to some fun on the initiator side…
First, I tried
# mkfs.ext4 -V
mke2fs 1.42.1 (17-Feb-2012)
	Using EXT2FS Library version 1.42.1
# mkfs.ext4 /dev/sdb
but that seems to get stuck in an infinite loop in ext2fs_initialize() trying to figure out how many inodes it should have per block group. Since block groups are 32768 blocks (128 MB), there are a lot (something like 800 million) of block groups on a 100 PB block device, but ext4 is (I believe) limited to 32-bit inode numbers, so the calculated number of inodes per block group ends up being about 6, which the code then rounds up to a multiple of 8 (that is, up to 8). It double-checks that 8 * number of block groups doesn’t overflow 32 bits, but unfortunately it does, so it reduces the inodes/group count it tries and goes around the loop again, which doesn’t work out any better. (Yes, I’ll report this upstream in a better forum too…)
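To make the failure mode concrete, here’s a rough C sketch of the shape of that loop. To be clear, this is not the actual mke2fs/libext2fs code; the numbers come from the arithmetic above and the retry step is simplified, but it shows why lowering the inode count never escapes the round-up-to-8 plus 32-bit-overflow check on a device this big:

/* Hypothetical sketch of the inode-count retry loop described above;
 * not the real libext2fs code, just the shape of the problem. */
#include <stdio.h>

int main(void)
{
	unsigned long long groups = 838860800ULL;	/* ~100 PiB / 128 MiB per block group */
	unsigned int ipg = 6;				/* computed inodes per block group */
	int iter;

	for (iter = 0; iter < 5; iter++) {		/* capped here so the demo terminates */
		ipg = (ipg + 7) & ~7U;			/* round up to a multiple of 8 */
		unsigned long long total = (unsigned long long) ipg * groups;
		if (total <= 0xffffffffULL) {		/* must fit in 32-bit inode numbers */
			printf("converged: %u inodes/group\n", ipg);
			return 0;
		}
		printf("iteration %d: %u * %llu = %llu overflows 32 bits, retrying\n",
		       iter, ipg, groups, total);
		if (ipg > 1)
			ipg--;				/* lower the target... but the round-up undoes it */
	}
	printf("never converges: any count rounds back up to 8, which still overflows\n");
	return 1;
}

The real code is more involved, but the essential conflict is the same: roughly 800 million block groups times a minimum of 8 inodes each just doesn’t fit in 32 bits.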
Then I tried
# mkfs.btrfs -V
mkfs.btrfs, part of Btrfs Btrfs v0.19
# mkfs.btrfs /dev/sdb
but that gets stuck doing a BLKDISCARD ioctl to clear out the whole device. It turns out my array reports that it can do SCSI UNMAP operations 2048 sectors (1 MB) at a time, so we need to do 100 billion UNMAPs to discard the 100 PB volume. My poor kernel is sitting in the unkillable loop in blkdev_issue_discard() issuing 1 MB UNMAPs as fast as it can, but since the array does about 75,000 UNMAPs per second, it’s going to be a few weeks until that ioctl returns. (Yes, I’ll send a patch to btrfs-progs to optionally disable the discard.)
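For the curious, here’s the back-of-the-envelope arithmetic behind “a few weeks,” using the numbers above (the 2048-sector UNMAP granularity and the ~75K UNMAPs/sec rate from my setup); it’s just the estimate, not anything the kernel actually computes:

/* Rough estimate of how long the BLKDISCARD ioctl above will take. */
#include <stdio.h>

int main(void)
{
	unsigned long long sectors   = 219902325555200ULL;	/* 512-byte sectors, from the kernel log */
	unsigned long long per_unmap = 2048;			/* 1 MB UNMAP granularity reported by the array */
	double unmaps_per_sec        = 75000.0;			/* rough rate on my virtualized setup */

	unsigned long long unmaps = sectors / per_unmap;
	double days = unmaps / unmaps_per_sec / 86400.0;

	printf("%llu UNMAPs, roughly %.0f days\n", unmaps, days);
	return 0;
}

That works out to a bit over two weeks of the ioctl grinding away, hence “a few weeks.”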
[Aside: I’m actually running the storage inside a VM (with the FC target adapter PCI device passed in directly) that’s quite a bit wimpier than real Pure hardware, so that 75K IOPS doing UNMAPs shouldn’t be taken as a benchmark of what the real box would do.]
Finally I tried
# mkfs.xfs -V
mkfs.xfs version 3.1.7
# mkfs.xfs -K /dev/sdb
(where the “-K” stops it from issuing the fatal discard), and that actually finished in less than 10 minutes. So I’m able to see
# mkfs.xfs -K /dev/sdb
meta-data=/dev/sdb               isize=256    agcount=102401, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=27487790694400, imaxpct=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
# mount /dev/sdb /mnt
# df -h /mnt/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        100P  3.2G  100P   1% /mnt