ZFS

Use of the ZFS file system. ZFS has several interesting characteristics, such as: RAID, self-healing, snapshots, clones, rollback, caching, deduplication, and copy-on-write.

Pool

Creation

The disk pool will be built according to the capacity and redundancy required, and of course the number of available disks. The different groups of disks that can be added to the pool are detailed below:

Group     Data disks    Redundancy disks    RAID alternative
disk      1 → n         0                   RAID0 / JBOD
mirror    1             1                   RAID1 / Mirror
raidz1    1 → n         1                   RAID5
raidz2    1 → n         2                   RAID6
raidz3    1 → n         3                   -

Redundancy of the raidz type reduces performance because of the parity calculations required to distribute data across the disks. On the other hand, it protects a group against the loss of 1 disk (raidz1), 2 disks (raidz2) or 3 disks (raidz3) without dedicating 50% of the initial storage capacity to redundancy, as is the case with a mirror (mirror).

The advantage of having at least 2 disks of redundancy (raidz2 or raidz3 configurations) is to guard against cascading failures:
  • if the disks come from the same batch, they may share the same defect and thus fail at nearly the same time;
  • the time required to rebuild the group leaves it vulnerable to the loss of additional disks, especially since this period tends to grow with current disk capacities (and can stretch to a few days).

Examples of creating a pool (named tank) consisting of a single group:

zpool create tank        da0                      # RAID0 / JBOD     (1+0)
zpool create tank mirror da0 da1                  # RAID1 / Mirror   (1+1)
zpool create tank raidz  da0 da1 da2              # RAID5            (2+1)
zpool create tank raidz2 da0 da1 da2 da3 da4      # RAID6            (3+2)
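
A pool is not restricted to a single group: several groups can be combined, in which case ZFS stripes data across them. A minimal sketch, with illustrative device names:

zpool create tank raidz2 da0 da1 da2 da3 da4 \
                  raidz2 da5 da6 da7 da8 da9      # Two raidz2 groups (2 x (3+2))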
The disk space available on a replacement device must be at least equal to that of the device it replaces. It may therefore be desirable not to dedicate the entire disk to ZFS, but to protect ourselves by creating a partition slightly smaller than the disk. The advertised capacity may indeed vary somewhat between manufacturers and references, and even between disks that should be identical: same manufacturer, same reference, same purchase date.
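
As an illustrative sketch only (assuming FreeBSD, GPT partitioning with gpart, and a nominal 1 TB disk; sizes and devices are made up), a partition slightly smaller than the disk can be created and handed to ZFS instead of the whole device:

gpart create -s gpt da0                       # GPT partition table on the disk
gpart add -t freebsd-zfs -a 1m -s 930g da0    # Partition a bit smaller than the nominal capacity
zpool create tank mirror da0p1 da1p1          # Build the pool on the partitions, not the raw disks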

As it is hardly possible to stand behind every array in case of problems, certain disks can be marked as spares: a spare immediately replaces a failed disk and starts the repair process.

zpool add tank spare da6

To improve performance, the ZIL (ZFS Intent Log) can be placed on a dedicated device with better throughput and access time (typically an SSD). The ZIL is responsible for satisfying the POSIX requirement that synchronous writes be committed to stable storage before returning.

zpool add tank log        da7          # ZIL is on a single disk
zpool add tank log mirror da7 da8      # ZIL is protected by a mirror

Also for performance reasons, a read cache (L2ARC) consisting of one or more disks can be added; here too, throughput and low access time are paramount.

zpool add tank cache da8

Verification

Scrubbing audits the pool (thanks to the checksums present on the blocks) and repairs it, when possible, from the available redundancy (mirror, raidz, or multiple copies).

zpool scrub tank
This command is also an indirect way to restart an aborted resilvering phase.

Displays the status of the pools and, in case of problems, lists the files impacted by the errors.

zpool status -x

Performance

Provides information on the performance of the pool.

zpool iostat
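
For a finer view, the statistics can be broken down per device and refreshed at a regular interval, for example every 5 seconds:

zpool iostat -v tank 5    # Per-device statistics, refreshed every 5 seconds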

Disk replacement

A disk is replaced by specifying the device to replace, /dev/old_device, and the replacement device, /dev/new_device:

zpool replace pool /dev/old_device /dev/new_device

When the disk to replace is part of a pool with redundancy, it is possible to replace it directly with another disk in the same slot. The command is then simplified to:

zpool replace pool /dev/device

File systems

Logical volume

The ZFS pool can also be used to create not a file system but a logical volume (i.e. a raw block device), which can later be used to build another ZFS pool, an iSCSI disk, etc.

zfs create -V 50G tank/iscsi/web    # Available at: /dev/zvol/tank/iscsi/web
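
For example, and only as a sketch (the pool name sandbox is made up; the iSCSI export itself is not shown), such a volume can in turn host another pool:

zpool create sandbox /dev/zvol/tank/iscsi/web    # A pool built on top of the logical volume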

Deduplication

Deduplication saves disk space by keeping only one or a few copies (see dedupditto) of identical blocks.

zfs set dedup=sha256,verify tank/home
When no space is left on the file system, clones, snapshots, and especially deduplication make it difficult to free disk space: deleting a file no longer automatically releases the disk blocks associated with it, since they may also be referenced by others.

For speed, blocks are compared by their hash values (fletcher4, sha256); a full comparison (verify) can still be required in the event of a match, which is especially advisable when an algorithm with frequent collisions such as fletcher4 is chosen.

To summarize among the possible combinations, the two interesting ones are:

Comparison          Features
fletcher4,verify    quick comparison, but requires a costly verification in case of equality
sha256              reasonable performance, with acceptance of an error probability of about 10^-77
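
To check whether deduplication is worthwhile, the pool-wide ratio is shown by zpool list (DEDUP column), and zdb can simulate the deduplication table a pool would produce without enabling it:

zpool list tank    # The DEDUP column reports the current deduplication ratio
zdb -S tank        # Simulates deduplication on the pool and prints the resulting statistics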

To send a deduplicated stream, the -D option must be passed to the zfs send command:

zfs send -D "${fs}@${tag}"

Backup and restoration

The zfs send and zfs recv commands respectively perform a backup and a restoration of a file system or of a set of file systems. They can be seen as the equivalents of the dump and restore commands for traditional file systems such as UFS.

Backup
zfs send -D -R -I @last-month tank/web@today
Option     Description
-D         creates a deduplicated stream
-R         descendants are included
-I @tag    incrementally sends all the intermediate snapshots since tag
Restore
zfs recv -u -F -d tank/backup
Option    Description
-u        the file system is not mounted
-d        the snapshot names used are those included in the stream
-F        a rollback of the file system is performed if changes were made to the destination

The two commands can be combined via ssh to perform a backup on a remote server:

zfs send -D -R -I @last-month tank/web@today | ssh backup.example.com zfs recv -u -F -d tank/backup

Clone

For example, cloning a VM running the squeeze release of Debian to create a new project, projectX:

zfs clone tank/vm/debian@squeeze tank/vm/projectX

Snapshot and rollback

The snapshot mechanism can be used, for example, to implement archiving; coupled with the rollback mechanism, it also guards against an unsuccessful update.

Snapshot
Creates, on the mentioned file system, a snapshot with the tag just-in-case, or a recursive snapshot including all its descendants and tagged with the current date (date +%Y-%m-%d):
zfs snapshot    tank/system/pkg@just-in-case    # To protect from an update
zfs snapshot -r tank@`date +%Y-%m-%d`           # For archiving via a cron job
Option    Description
-r        atomically creates a snapshot on the file system and its descendants
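As a sketch of the cron job mentioned in the comment above (the schedule and crontab path are arbitrary), keep in mind that % is special in crontab entries and must be escaped:

# /etc/crontab: daily recursive snapshot at 03:00, with the % signs escaped
0 3 * * * root /sbin/zfs snapshot -r tank@`date +\%Y-\%m-\%d`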
Rollback
Performs a rollback to a previous snapshot:
zfs rollback tank/system/pkg@just-in-case
Option    Description
-r        also destroys the snapshots newer than the specified one
-R        also destroys the snapshots newer than the one specified and their clones
-f        forces an unmount of any clone file systems that are to be destroyed
There is currently no option to apply a rollback recursively to the descendants (i.e. no equivalent of zfs snapshot -r); in this case a rollback must be performed manually on each file system, as sketched below.
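
A possible workaround, sketched below under the assumption that all descendants of a hypothetical tank/vm share the just-in-case tag, is to iterate over them with zfs list:

# Roll back every file system under tank/vm to the same snapshot
for fs in `zfs list -H -r -o name tank/vm`; do
    zfs rollback "${fs}@just-in-case"    # add -r if newer snapshots must be destroyed first
done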

Debugging

Sometimes nasty things happen and you need to dig into the innards of ZFS. For this there is the zdb command; some of its options are presented below, and examples of its use are detailed in the practical cases that follow.

Configuration
Lists the pool configuration with the attached devices (vdev):
zdb -C tank
Dataset
Lists information about the various datasets (clones, snapshots, ...); this information includes: name, id number, last transaction number, used space, and object count:
zdb -d tank
Labels
Labels correspond to the configuration description of the pool, but they are stored on each disk and include disk-specific information. There are 4 labels per disk, two at the beginning and two at the end; they should normally be identical.
zdb -l /dev/ad10

Corrupted mirror

A system crash prevented ZFS from completing its write operations on a mirror, which led to a difference in the metadata stored on the two disks. As a consequence, the pool is no longer available because it is corrupted, even though each disk is online.

The goal now is to identify which disk's metadata was only partly written, in order to recover a healthy pool by removing the corrupted disk. This approach, somewhat abrupt, only works well in the mirror case. To do so, the ZFS labels (metadata) are read in order to compare the transaction numbers (txg fields) and determine the order in which the records were written.

zdb -l /dev/ad12
zdb -l /dev/ad10

Actually, when the labels are displayed, the ad10 hard drive is immediately identified as the corrupted one, since only two of its four labels are legible, which is explained by the fact that two labels are written at the beginning of the disk and two others at the end. The chosen repair process is to break the mirror and recreate it:

zpool detach tank ad10
zpool attach tank ad12 ad10

This is where a mistake was made. At the time (ZFS v15), the pool's policy was to automatically increase its capacity if the drive capacity allowed it. Unfortunately the two disks, although of the same reference, did not have the same capacity: the transition to a single disk increased the capacity of the pool, which later prevented the mirror from being rebuilt because the re-added disk was then too small. A backup was therefore necessary to rebuild the pool entirely. Since then, the autoexpand attribute has been added and is set to off by default.
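
The property can be inspected, and the automatic growth explicitly disabled, through the usual pool property commands (shown here on tank):

zpool get autoexpand tank        # Display the current value of the property
zpool set autoexpand=off tank    # Ensure the pool will not grow automatically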

Unable to destroy snapshot

# zfs destroy tank/data@foobar
cannot destroy 'tank/data@foobar': dataset already exists

There was a bug (CR-6860996) leading to this type of problem: when receiving an incremental stream (zfs recv), a temporary clone, with the % character in its name, is created but not deleted automatically.

So we will look for this clone and destroy it explicitly:

zdb -d tank | grep %                    # Looking for the clone
zfs destroy clone-with-%-in-the-name    # Destroying the clone