Causing ZFS corruption for fun, profit, and quality assurance

Datto backs up data, a lot of it. At the time of writing, Datto has over 500 PB of data stored on ZFS. This count includes both the backup appliances that are sent to customer sites and the cloud storage servers used for secondary and tertiary backup of those appliances. At this scale drive swaps are a daily occurrence, and data corruption is inevitable. How we handle this corruption when it happens determines whether we truly lose data or successfully restore from secondary backup. In this post we'll show you how we intentionally cause corruption in our testing environments at Datto, to ensure we're building software that can properly handle these scenarios.

Disclaimer: You should absolutely not attempt these instructions on any system containing any data you would like to keep. I provide no guarantees that the commands within this post will not completely destroy your zpool and all its contained data. But we'll try to only destroy it a little bit.

What is ZFS?

ZFS is a filesystem and volume manager with a variety of properties that make it ideal for storing large amounts of data. Amongst these are:

  • Automatic detection/repair of silent corruption ("bit rot")
  • Constant time filesystem snapshotting
  • Constant time restore of those snapshots (i.e. no delta merges)
  • Transactional disk writes
  • The ability to send and receive snapshots for off-site backup (sketched below)
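
As a taste of that last point, off-site replication is essentially a snapshot piped between machines. A minimal sketch, assuming a hypothetical dataset tank/data and a remote host called backup-host (neither comes from our actual setup):

$ zfs snapshot tank/data@nightly
$ zfs send tank/data@nightly | ssh backup-host zfs receive backuppool/data

# Later runs only need to ship the delta between two snapshots:
$ zfs send -i tank/data@nightly tank/data@nightly2 | ssh backup-host zfs receive backuppool/data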

ZFS forms the foundation for both the data backup and disaster recovery mechanisms at Datto.

What is "Corruption"?

ZFS has mechanisms (referred to as "scrubbing") to detect and repair silent data errors. ZFS also gracefully handles drive failure and drive swaps.
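
For the routine cases, the built-in tooling covers you. A quick sketch of those commands, with illustrative device names (sdd standing in for a replacement drive):

$ zpool scrub tank              # walk every block; repair from the mirror copy if a checksum fails
$ zpool status tank             # shows scrub progress and per-device read/write/checksum counters
$ zpool replace tank sdb sdd    # after a drive failure, resilver onto the replacement disk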

In this case, by "corruption" we mean permanent data loss, where all of ZFS's internal backup replicas of the data are lost or destroyed. In a real-life scenario this would be a trigger to recover the destroyed data from off-site replication (i.e. the Datto cloud), or from secondary cloud backup.

In practice these scenarios are rare, since ZFS is designed around preventing permanent corruption. But therein lies the problem. Since these corruption events are so rare, it's hard to write code to handle these scenarios. Unless, of course, you can cause corruption yourself!

Setup

For this walkthrough we'll use a realistic setup: a zpool with a single mirror vdev backed by two physical disks:

$ zpool status -L
  pool: tank
  state: ONLINE
  scan: scrub repaired 0B in 1h0m with 0 errors
...
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
...
errors: No known data errors
...

As you can see, this pool was recently scrubbed and has no known errors.
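
If you'd like to follow along without risking real disks, a similarly shaped pool can be built on sparse files. The pool name and file paths below are just examples; note that with file-backed vdevs, the dd steps later in this post would target the backing files rather than /dev/sdX1:

$ truncate -s 1G /var/tmp/zfs-lab-0.img /var/tmp/zfs-lab-1.img
$ zpool create labtank mirror /var/tmp/zfs-lab-0.img /var/tmp/zfs-lab-1.img
$ zpool status labtank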

Causing Corruption

Since this is a mirror setup, a naive way to cause corruption would be to randomly dd over the same sectors of both /dev/sdb and /dev/sdc. This works, but is equally likely to overwrite random unused space or take down the zpool entirely. What we really want is to corrupt a specific snapshot, or even a specific file in that snapshot, to simulate a more realistic minor corruption event. Luckily we have a tool called zdb that lets us view low-level information about datasets.
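
For the record, the naive version would look something like the following. The offset here is arbitrary, which is exactly the problem:

# Stomp the same arbitrary region on both mirror members; whatever happens to
# live there (file data, metadata, or nothing at all) gets overwritten.
$ dd if=/dev/urandom of=/dev/sdb bs=1M seek=4321 count=1
$ dd if=/dev/urandom of=/dev/sdc bs=1M seek=4321 count=1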

First let's create a dataset:

$ zfs create tank/corrupt_me

Add some dummy data:

$ echo "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed" \
   "do eiusmod tempor incididunt ut labore et dolore magna aliqua." \
   > /tank/corrupt_me/test.txt

And create a snapshot:

$ zfs snap tank/corrupt_me@snap

We'll use this as our corruption target.

Let's see what zdb can tell us about this dataset:

$ zdb -ddd tank/corrupt_me@snap
Dataset tank/corrupt_me@snap [ZPL], ID 6100, cr_txg 7407961, 100K, 7 objects
...

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         0    6   128K    16K  56.0K     512    32K   10.94  DMU dnode
        -1    1   128K    512      0     512    512  100.00  ZFS user/group used
        -2    1   128K    512      0     512    512  100.00  ZFS user/group used
...
         2    1   128K    512     4K     512    512  100.00  ZFS plain file
...

Note: For those following along, if any of these commands fail, try running zdb -eddd instead. This bypasses the zpool cache.

As you can see, this gives us a list of the ZFS objects associated with the specified dataset/snapshot. Since we only created one file in this dataset, the "ZFS plain file" must be what we're looking for. We can dive even deeper into the object with extra verbosity (more d's!):

$ zdb -ddddd tank/corrupt_me@snap
...
    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         2    1   128K    512     4K     512    512  100.00  ZFS plain file
                                               168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 0
        path    /test.txt
...
Indirect blocks:
               0 L0 0:4f110bf000:1000 200L/200P F=1 B=7407956/7407956

                segment [0000000000000000, 0000000000000200) size   512

Notice path /test.txt. That's the file we created earlier, so this confirms we're on the right track. This output also gives us the indirect block address (0:4f110bf000:1000). Without getting too deep into ZFS internals: the indirect block, in this case, stores the contents of our test.txt file. But you don't have to take my word for it; we can prove it with another zdb command:

$ zdb -R tank 0:4f110bf000:1000 | head
Found vdev type: mirror

0:4f110bf000:1000
          0 1 2 3 4 5 6 7   8 9 a b c d e f  0123456789abcdef
000000:  4c6f72656d206970  73756d20646f6c6f  Lorem ipsum dolo
000010:  722073697420616d  65742c20636f6e73  r sit amet, cons
000020:  6563746574757220  6164697069736369  ectetur adipisci
000030:  6e6720656c69742c  2073656420646f20  ng elit, sed do
000040:  656975736d6f6420  74656d706f722069  eiusmod tempor i
000050:  6e6369646964756e  74207574206c6162  ncididunt ut lab
000060:  6f72652065742064  6f6c6f7265206d61  ore et dolore ma

As you can see, the -R command takes a ZFS block address and displays its content in several formats. Now we're really getting somewhere: we can see the lorem ipsum text we added to this file earlier. The indirect block address is in the format [vdev]:[byte offset in hex]:[size].
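
To make that format concrete, here's the address split into its three fields with plain bash (nothing ZFS-specific going on):

$ IFS=: read vdev offset size <<< "0:4f110bf000:1000"
$ echo "vdev=$vdev offset=0x$offset size=0x$size ($((0x$size)) bytes)"
vdev=0 offset=0x4f110bf000 size=0x1000 (4096 bytes)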

From the zpool status output earlier we know that our mirror vdev has two physical backing disks, sdb and sdc. Let's pick one of them (sdb) to see if we can use the indirect block offset to read our data directly off the disk (and eventually corrupt it).

Failed attempt:

$ dd if=/dev/sdb1 bs=512 skip=$((0x4f110bf000 / 512)) count=1 \
     | hexdump -C

1+0 records in
1+0 records out
512 bytes copied, 0.0123137 s, 41.6 kB/s
00000000  1f 80 26 b1 14 00 00 00  01 80 48 b1 14 00 00 00  |..&.......H.....|
00000010  08 80 4c b1 14 00 00 00  01 80 56 b1 14 00 00 00  |..L.......V.....|
00000020  00 80 5b b1 14 00 00 00  01 80 5e b1 14 00 00 00  |..[.......^.....|
00000030  00 80 61 b1 14 00 00 00  1a 80 63 b1 14 00 00 00  |..a.......c.....|
00000040  05 80 81 b1 14 00 00 00  8b 80 88 b1 14 00 00 00  |................|
00000050  4f 80 15 b2 14 00 00 00  c2 82 67 b2 14 00 00 00  |O.........g.....|
00000060  1a 80 2c b5 14 00 00 00  00 80 4b b5 14 00 00 00  |..,.......K.....|

It's just junk. Let's break down what we're attempting:

We read from /dev/sdb1 (our physical mirror member) with a block size of 512 bytes, skip ahead to our offset 0x4f110bf000 (divided by the block size, because skip takes a count in blocks), and read a single block. So why didn't it work? To find the answer we need to dive into the ZFS on-disk specification. The relevant section is:

The value stored in offset is the offset in terms of sectors
(512 byte blocks). To find the physical block byte offset from
the beginning of a slice, the value inside offset must be
shifted over (<<) by 9 (2^9 = 512) and this value must be added
to 0x400000 (size of two vdev_labels and boot block).

physical block address = (offset << 9) + 0x400000 (4MB)

To further add to the confusion, zdb automatically converts the offset to bytes (rather than sectors), so we actually don't have to shift. But this gives us the information we need: we just need to skip the first 4 MB of the physical disk.
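
As a sanity check on that arithmetic, here's the offset-to-sector conversion spelled out in shell, using the values from the zdb output above. It matches the skip= value passed to dd below:

$ OFFSET=0x4f110bf000   # byte offset from the DVA (zdb already converted sectors to bytes)
$ LABELS=0x400000       # two vdev labels plus the boot block: the 4 MB we must skip
$ echo $(( (OFFSET + LABELS) / 512 ))
663266808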

$ dd if=/dev/sdb1 bs=512 \
   skip=$(((0x4f110bf000 / 512) + (0x400000 / 512))) count=1 \
   | hexdump -C

1+0 records in
1+0 records out
512 bytes copied, 0.0182629 s, 28.0 kB/s
00000000  4c 6f 72 65 6d 20 69 70  73 75 6d 20 64 6f 6c 6f  |Lorem ipsum dolo|
00000010  72 20 73 69 74 20 61 6d  65 74 2c 20 63 6f 6e 73  |r sit amet, cons|
00000020  65 63 74 65 74 75 72 20  61 64 69 70 69 73 63 69  |ectetur adipisci|
00000030  6e 67 20 65 6c 69 74 2c  20 73 65 64 20 64 6f 20  |ng elit, sed do |
00000040  65 69 75 73 6d 6f 64 20  74 65 6d 70 6f 72 20 69  |eiusmod tempor i|
00000050  6e 63 69 64 69 64 75 6e  74 20 75 74 20 6c 61 62  |ncididunt ut lab|
00000060  6f 72 65 20 65 74 20 64  6f 6c 6f 72 65 20 6d 61  |ore et dolore ma|
00000070  67 6e 61 20 61 6c 69 71  75 61 2e 0a 00 00 00 00  |gna aliqua......|
00000080  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

There's our lorem ipsum! Now, instead of looking at the data, we need to overwrite it. Don't forget to overwrite both physical disks in the mirror:

$ dd if=/dev/urandom of=/dev/sdb1 bs=512 \
   seek=$(((0x4f110bf000 / 512) + (0x400000 / 512))) count=1

1+0 records in
1+0 records out
512 bytes copied, 0.0199633 s, 25.6 kB/s
$ dd if=/dev/urandom of=/dev/sdc1 bs=512 \
   seek=$(((0x4f110bf000 / 512) + (0x400000 / 512))) count=1
1+0 records in
1+0 records out
512 bytes copied, 0.000700771 s, 731 kB/s

At this point we just need to trigger a read of the data in our snapshot. You could do this by cloning the snapshot and reading the bad block, but you can also just trigger a scrub:

$ zpool scrub tank
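
The clone-and-read route mentioned above works just as well. Something like this (the clone name is only an example) should fail with an I/O error, since both copies of the block are now garbage:

$ zfs clone tank/corrupt_me@snap tank/corrupt_me_clone
$ cat /tank/corrupt_me_clone/test.txt    # expect: Input/output error
$ zfs destroy tank/corrupt_me_clone      # clean up the clone afterwards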

After the scrub is complete:

$ zpool status -Lv
...
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     1
          mirror-0  ONLINE       0     0     2
            sdb     ONLINE       0     0     2
            sdc     ONLINE       0     0     2

...

errors: Permanent errors have been detected in the following files:

        tank/corrupt_me@snap:/test.txt

Congratulations, you've successfully destroyed your data!
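
In production this is the point where we'd reach for the off-site copy. On a throwaway test pool, recovery might look something like the sketch below; the backup host and dataset names are hypothetical:

$ zfs destroy -r tank/corrupt_me                 # drop the damaged dataset and its snapshots
$ ssh backup-host zfs send backuppool/corrupt_me@snap | zfs receive tank/corrupt_me
$ zpool clear tank                               # reset the per-device error counters
$ zpool scrub tank                               # confirm the pool comes back clean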

Conclusion

At the 500 PB scale, it's not a matter of if data corruption will happen but when. Intentionally causing corruption is one of the strategies we use to ensure we're building software that can handle these rare (but inevitable) events.

To others out there using ZFS: I'm curious to hear how you've solved this problem. We did quite a bit of experimentation with zinject before going with this more brute-force method, so I'd be especially interested to hear if you've had luck simulating corruption with zinject alone.

