
raidz expansion feature by don-brady · Pull Request #15022 · openzfs/zfs · GitHub

source link: https://github.com/openzfs/zfs/pull/15022


Motivation and Context

This feature allows disks to be added one at a time to a RAID-Z group, expanding its capacity incrementally. This feature is especially useful for small pools (typically with only one RAID-Z group), where there isn't sufficient hardware to add capacity by adding a whole new RAID-Z group (typically doubling the number of disks).

For additional context as well as a design overview, see Matt Ahrens' talk at the 2021 FreeBSD Developer Summit (video) (slides), and a news article from Ars Technica.

Description

Initiating expansion

A new device (disk) can be attached to an existing RAIDZ vdev by running zpool attach POOL raidzP-N NEW_DEVICE, e.g. zpool attach tank raidz2-0 sda. The new device will become part of the RAIDZ group. A raidz expansion will be initiated, and the new device will contribute additional space to the RAIDZ group once the expansion completes.

The feature@raidz_expansion on-disk feature flag must be enabled to initiate an expansion, and it remains active for the life of the pool. In other words, pools with expanded RAIDZ vdevs cannot be imported by older releases of the ZFS software.
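
For illustration, a minimal sketch of starting an expansion might look like the following (pool and device names are placeholders; the feature flag only needs to be set explicitly if the pool was created or upgraded with it disabled):

    # Enable the on-disk feature flag if it is not already enabled.
    zpool set feature@raidz_expansion=enabled tank

    # Attach a new disk to the existing raidz2 vdev; this starts the expansion.
    # Add -w to block until the operation completes (see the synopsis below).
    zpool attach -w tank raidz2-0 sdf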

During expansion

The expansion entails reading all allocated space from existing disks in the RAIDZ group, and rewriting it to the new disks in the RAIDZ group (including the newly added device).

The expansion progress can be monitored with zpool status.
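
A monitoring sketch (pool name is a placeholder; the exact status output and the zpool wait activity name for expansions may vary between releases, so treat the second command as an assumption and check zpool-wait(8) on your system):

    # Show expansion progress in the vdev status output.
    zpool status tank

    # Optionally block until the expansion has finished (activity name assumed).
    zpool wait -t raidz_expand tank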

Data redundancy is maintained during (and after) the expansion. If a disk fails while the expansion is in progress, the expansion pauses until the health of the RAIDZ vdev is restored (e.g. by replacing the failed disk and waiting for reconstruction to complete).

The pool remains accessible during expansion. Following a reboot or export/import, the expansion resumes where it left off.
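
For example, an export/import cycle (or a reboot) does not restart the work already done; the pool name below is a placeholder:

    # Export and re-import the pool; the expansion resumes automatically.
    zpool export tank
    zpool import tank
    zpool status tank    # progress continues from where it left off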

After expansion

When the expansion completes, the additional space is available for use and is reflected in the available zfs property (as seen in zfs list, df, etc.).

Expansion does not change the number of failures that can be tolerated without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after expansion).

A RAIDZ vdev can be expanded multiple times.

After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. a 5-wide RAIDZ2 has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 that has been expanded once to 6-wide has 4 data to 2 parity). However, the RAIDZ vdev's "assumed parity ratio" does not change, so slightly less space than expected may be reported for newly written blocks, according to zfs list, df, ls -s, and similar tools.
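
A back-of-the-envelope illustration of that ratio change (the 5-wide/6-wide figures are just the example above, and a 10 TB disk size is assumed for easy arithmetic; real pools also lose some space to padding and metadata):

    # Fraction of raw space that holds data, old layout vs. new layout.
    echo "old blocks (5-wide, 3 data : 2 parity): $((3 * 50 / 5)) TB data per 50 TB raw"
    echo "new blocks (6-wide, 4 data : 2 parity): $((4 * 60 / 6)) TB data per 60 TB raw"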

Manpage changes

zpool-attach.8:

NAME
     zpool-attach — attach new device to existing ZFS vdev

SYNOPSIS
     zpool attach [-fsw] [-o property=value] pool device new_device

DESCRIPTION
     Attaches new_device to the existing device.  The behavior differs depend‐
     ing on if the existing device is a RAIDZ device, or a mirror/plain
     device.

     If the existing device is a mirror or plain device ...

     If the existing device is a RAIDZ device (e.g. specified as "raidz2-0"),
     the new device will become part of that RAIDZ group.  A "raidz expansion"
     will be initiated, and the new device will contribute additional space to
     the RAIDZ group once the expansion completes.  The expansion entails
     reading all allocated space from existing disks in the RAIDZ group, and
     rewriting it to the new disks in the RAIDZ group (including the newly
     added device).  Its progress can be monitored with zpool status.

     Data redundancy is maintained during and after the expansion.  If a disk
     fails while the expansion is in progress, the expansion pauses until the
     health of the RAIDZ vdev is restored (e.g. by replacing the failed disk
     and waiting for reconstruction to complete).  Expansion does not change
     the number of failures that can be tolerated without data loss (e.g. a
     RAIDZ2 is still a RAIDZ2 even after expansion).  A RAIDZ vdev can be
     expanded multiple times.

     After the expansion completes, old blocks remain with their old data-to-
     parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distrib‐
     uted among the larger set of disks.  New blocks will be written with the
     new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded
     once to 6-wide, has 4 data to 2 parity).  However, the RAIDZ vdev's
     "assumed parity ratio" does not change, so slightly less space than is
     expected may be reported for newly-written blocks, according to zfs list,
     df, ls -s, and similar tools.

Status

Matt Ahrens' original pull request (#12225) has been rebased here onto the current master branch and updated to incorporate recent code cleanups in the OpenZFS codebase. This feature is believed to be complete. However, like all PRs, it is subject to change as part of the code review process. Since this PR includes on-disk changes, it shouldn't be used on production systems before it is integrated into the OpenZFS codebase. Tasks that still need to be done before integration:

  • Additional code cleanup in ztest code
  • zloop changes to drive coverage of this feature
  • Address test failures in ztest runs
  • Document the high-level design in a "big theory statement" comment
  • Remove verbose logging
  • Detection of MBR partitions using reserved boot area (FreeBSD BTX boot loader)
  • Address any performance concerns

Acknowledgments

Thank you to the FreeBSD Foundation for commissioning this work in 2017 and continuing to sponsor it well past the original time estimates!
Thank you to iXsystems for sponsoring the final push to land this feature into OpenZFS.

Thanks also to contributors @FedorUporovVstack, @stuartmaybee, @thorsteneb, and @Fmstrat for portions of the implementation.

Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Contributions-by: Stuart Maybee [email protected]
Contributions-by: Fedor Uporov [email protected]
Contributions-by: Thorsten Behrens [email protected]
Contributions-by: Fmstrat [email protected]
Contributions-by: Don Brady [email protected]

How Has This Been Tested?

Tests added to the ZFS Test Suite (functional/raidz) and ztest, in addition to manual testing.
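
For anyone wanting to reproduce the testing locally, a rough sketch from a built source tree (the -T tag name is an assumption; see tests/runfiles for the exact test grouping):

    # Run the raidz functional tests from the ZFS Test Suite.
    ./scripts/zfs-tests.sh -T raidz

    # Exercise ztest repeatedly via zloop for randomized coverage.
    ./scripts/zloop.sh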

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

Pull Request Comments

Please limit comments here to code review/feedback and testing questions/results.

For generic discussions about RAID-Z, or discussions on future enhancements to RAIDZ expansion, please use the OpenZFS discussions.

