BigAdmin System Administration Portal
Feature Article

ZFS, Sun's Cutting-Edge File System (Part 2: Ease of Administration and Future Enhancements)

Amy Rich, September 2006

Introduction

ZFS, Sun's Cutting-Edge File System (Part 1: Storage Integrity, Security, and Scalability), the first article in this two-part series, presented a basic overview of Sun's ground-breaking new file system and talked about its data integrity and security model as well as its massive scalability. In this article, I'll focus on the ease of administration, including demonstrations of real usage, and some exciting new features planned for upcoming releases.

With ZFS, integrity and reliability are great, performance is great, and security is great, but what about making it easy to administer, too? As system administrators, we've been in the position where we need to fiddle with the file system to squeeze the last bit of performance out of it. Or maybe you've found out the hard way that you set up VxVM without making the second mirror bootable. Wouldn't it be nice to just have a file system/volume manager do the right thing instead of having to know the correct arcane incantation? You're in luck. The ZFS engineers were thinking of you, too.


Object-Based Storage

ZFS is all about object-based storage. The basic building block is the storage pool, or zpool, which takes over the role of the volume manager. All file system components sit on top of a zpool. When you want to add storage to your pool, you just hand ZFS a file, slice, or entire disk. Gone are the days of needing to run format and newfs to get a usable file system. In fact, ZFS can better optimize I/O when you hand it an entire disk. Let's take a look at the options from the zpool(1M) man page:

zpool - configures ZFS storage pools
zpool create [-fn] [-R root] [-m mountpoint] pool vdev...
zpool add [-fn] pool vdev...
zpool list [-H] [-o field[,field]...] [pool] ...
zpool status [-xv] [pool] ...

zpool destroy [-f] pool
zpool iostat [-v] [pool] ... [interval [count]]

zpool attach [-f] pool device new_device
zpool detach pool device

zpool replace [-f] pool device [new_device]

zpool scrub [-s] pool ...

zpool offline pool device ...
zpool online pool device ...

zpool export [-f] pool
zpool import [-d dir]
zpool import [-d dir] [-f] [-o options] [-R root] pool |  id [newpool]
zpool import [-d dir] [-f] [-a]

Now let's get our hands dirty with some practical examples! First, we'll create a zpool called testpool by mirroring two slices:

# zpool create testpool mirror c0t0d0s5 c0t1d0s5

# zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
testpool               7.94G    164K   7.94G     0%  ONLINE     -

# zpool status
   pool: testpool
  state: ONLINE
  scrub: none requested
 config:

        NAME          STATE     READ WRITE CKSUM
        testpool      ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s5  ONLINE       0     0     0
            c0t1d0s5  ONLINE       0     0     0

We didn't need to use something like metainit to set up each half of the mirror, and it was simple enough that we didn't really need the abstraction of a GUI like the ones that come with Solaris Volume Manager or VxVM. One command and you're done! If you love the GUI, though, ZFS administration can also be done via the web by running /usr/sbin/smcwebserver or with svcadm enable svc:/system/webconsole:console in Solaris Express.

Now that we've allocated storage, we can start making file systems. In ZFS, file systems are hierarchical, which makes them useful administrative control points. You can set the following on a per-file-system basis: size, reservations (the minimum amount of space guaranteed to a dataset and its descendants), quotas, choice of compression, backups, and snapshots. For example, on a home directory server, every user gets their own file system, and you can limit the total space usage to one terabyte and turn on compression. You can then make sure that certain VIP users (or applications) always have at least one gigabyte available and are snapshotted hourly (via cron). The less-important users might be restricted by a quota and be snapshotted only once a day.

Here are the options from the zfs(1M) man page:

zfs create filesystem
zfs create [-s] [-b blocksize] -V size volume
zfs destroy [-rRf] filesystem|volume|snapshot
zfs clone snapshot filesystem|volume
zfs rename  filesystem|volume|snapshot filesystem|volume|snapshot
zfs snapshot  filesystem@name|volume@name
zfs rollback [-rRf] snapshot
zfs list [-rH] [-o property[,property]...] [ -t type[,type]...] 
	[filesystem|volume|snapshot] ...
zfs set property=value filesystem|volume...
zfs get [-rHp] [-o field[,field]...] [-s source[,source]...] all | 
	property[,property]... filesystem|volume|snapshot...
zfs inherit [-r] property filesystem|volume...
zfs mount
zfs mount [-o options] [-O] -a
zfs mount [-o options] [-O] filesystem
zfs unmount [-f] -a
zfs unmount [-f] [filesystem|mountpoint]
zfs share -a
zfs share filesystem
zfs unshare [-f] -a
zfs unshare [-f] [filesystem|mountpoint]
zfs backup [-i snapshot] snapshot
zfs restore [-vn ] filesystem|volume|snapshot
zfs restore [-vn ] -d filesystem

Let's create a couple of file systems called testfs and testfs2, inside our pool testpool:

# zfs create testpool/testfs

# zfs list
NAME                   USED  AVAIL  REFER  MOUNTPOINT
testpool               268K  7.88G    99K  /testpool
testpool/testfs       98.5K  7.88G  98.5K  /testpool/testfs

# zfs create testpool/testfs2

# zfs list
NAME                   USED  AVAIL  REFER  MOUNTPOINT
testpool               376K  7.88G  99.5K  /testpool
testpool/testfs       98.5K  7.88G  98.5K  /testpool/testfs
testpool/testfs2      98.5K  7.88G  98.5K  /testpool/testfs2

# df -k
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/md/dsk/d0         8262869    751774   7428467  10% /
swap                  11800800       784  11800016   1% /etc/svc/volatile
/dev/md/dsk/d30        8262869     23616   8156625   1% /var
swap                  11800016         0  11800016   0% /tmp
swap                  11800040        24  11800016   1% /var/run
testpool               8257771       100   8257672   1% /testpool
testpool/testfs        8257770        99   8257672   1% /testpool/testfs
testpool/testfs2       8257770        99   8257672   1% /testpool/testfs2

Once again, it just takes one simple command. We didn't have to run newfs or mount or edit the vfstab. ZFS does it all for us.

You'll note that each ZFS file system can access all resources of the pool. If we were to create a file system under testpool/testfs, it would inherit all of the properties of testpool/testfs unless we specifically overrode them. If we want to mount our file system somewhere other than its default location under /testpool, and create some child file systems, we can accomplish those goals as follows:

# zfs set mountpoint=/devel/testfs testpool/testfs
# zfs create testpool/testfs/dir1
# zfs create testpool/testfs/dir2
# zfs create testpool/testfs/dir3

# df -k
Filesystem           1k-blocks    Used Available Use% Mounted on
/dev/md/dsk/d0         8262869  751774   7428467  10% /
swap                  11800800     784  11800016   1% /etc/svc/volatile
/dev/md/dsk/d30        8262869   23616   8156625   1% /var
swap                  11800016       0  11800016   0% /tmp
swap                  11800040      24  11800016   1% /var/run
testpool               8257771     100   8257672   1% /testpool
testpool/testfs2       8257770      99   8257672   1% /testpool/testfs2
testpool/testfs       16515535      99  16515437   1% /devel/testfs
testpool/testfs/dir1  16515214      99  16515115   1% /devel/testfs/dir1
testpool/testfs/dir2  16515214      99  16515115   1% /devel/testfs/dir2
testpool/testfs/dir3  16515214      99  16515115   1% /devel/testfs/dir3

ZFS automatically relocates testpool/testfs/dir1 to the correct mount point, because it has inherited that from its parent, testpool/testfs. Again, creating new file systems is just as simple as creating a new directory or file.

Here's an example of setting some attributes on a file system:

# zfs set sharenfs=rw testpool/testfs
# zfs set compression=on testpool/testfs
# zfs set quota=10g testpool/testfs/dir1
   (logically limits space)
# zfs set reservation=20g testpool/testfs/dir2
   (logically preallocates space)
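
The home directory scenario described earlier could be scripted along the following lines. This is only a sketch: the testpool/home parent dataset, the user names, and the sizes are assumptions rather than part of the setup shown in this article, and the functions merely print the zfs commands so that they can be reviewed before anything is run.

```shell
#!/bin/sh
# Hypothetical parent dataset for home directories (an assumption;
# it would have to exist already, e.g. via "zfs create testpool/home").
HOME_PARENT="testpool/home"

# Print the zfs commands that would give one user a file system
# with compression turned on and a quota. Nothing is executed here.
provision_home() {
    user=$1
    quota=$2
    echo "zfs create $HOME_PARENT/$user"
    echo "zfs set compression=on $HOME_PARENT/$user"
    echo "zfs set quota=$quota $HOME_PARENT/$user"
}

# Print the command that guarantees a VIP user a minimum amount of space.
reserve_space() {
    user=$1
    size=$2
    echo "zfs set reservation=$size $HOME_PARENT/$user"
}

provision_home alice 10g
reserve_space alice 1g
provision_home bob 5g
```

Because child file systems inherit properties from their parent, setting compression once on the parent would work just as well as setting it per user.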

Now, what if you're running out of space in your pool and you want to add more capacity? Just add some more devices with zpool add:

# zpool add testpool c0t0d0s6 c0t1d0s6
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool uses 2-way mirror and 
new vdev uses 1-way disk

ZFS is smart: It knew that my previous devices were mirrors, and it caught the fact that I was trying to add unmirrored storage. If that's actually what I had meant to do, I could force the operation with -f as stated above. It was really just a mistake on my part, though, so I'll go back and provide the correct information:

  # zpool add testpool mirror c0t0d0s6 c0t1d0s6

  # zfs list
  NAME                   USED  AVAIL  REFER  MOUNTPOINT
  testpool               396K  15.8G  99.5K  /testpool
  testpool/testfs       98.5K  15.8G  98.5K  /testpool/testfs
  testpool/testfs2      98.5K  15.8G  98.5K  /testpool/testfs2

Again, thanks to ZFS's dynamic metadata, there's no need to wait for the pool to adjust its size or for the mirror to resilver before using the new space. If I run a zpool status at this point, I'll see that the pool is now built of two mirrors striped together, each mirror containing two devices:

  # zpool status
    pool: testpool
   state: ONLINE
   scrub: none requested
  config:

        NAME          STATE     READ WRITE CKSUM
        testpool      ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s5  ONLINE       0     0     0
            c0t1d0s5  ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s6  ONLINE       0     0     0
            c0t1d0s6  ONLINE       0     0     0

RAID-Z

Up until this point, we've only looked at mirrored storage. ZFS implements another type of RAID, called RAID-Z, which is similar to RAID-5 but uses a variable stripe width to eliminate a problem known as the RAID-5 write hole. Under RAID-5, writes are performed to two or more independent devices, and the parity block is written as a component of each stripe. Since these writes are non-atomic, a power failure between the data and parity writes can leave the two out of sync, which opens the door to data corruption.
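
To make the write hole concrete, here is a toy illustration using shell arithmetic. It is not how any real RAID implementation is coded; the one-byte "blocks" and their values are made up for the example. It shows that XOR parity can rebuild a lost block only while data and parity agree.

```shell
#!/bin/sh
# Toy RAID-5 stripe: two data blocks (one byte each) plus XOR parity.
d1=$((0xA5))
d2=$((0x3C))
parity=$((d1 ^ d2))

# Consistent stripe: if the disk holding d2 fails, XOR the survivors
# (d1 and parity) to rebuild it.
rebuilt=$((d1 ^ parity))
echo "consistent stripe: rebuilt d2 = $rebuilt (expected $d2)"

# Write hole: d1 is overwritten, but power fails before the matching
# parity update reaches disk, so parity is now stale.
d1=$((0xFF))
stale_rebuilt=$((d1 ^ parity))
echo "after torn write:  rebuilt d2 = $stale_rebuilt (expected $d2)"
```

In the second case the rebuilt value no longer matches the original d2, and nothing in the stripe itself reveals that the result is wrong.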

Some vendors have attempted to address this with parity region logging (like dirty region logging, only for the parity disk) or battery-backed NVRAM. With the NVRAM solution, data is written into NVRAM, then the data and parity writes are made to disk, and finally the NVRAM is released. NVRAM is expensive and can sometimes become a bottleneck. Also, battery-backed NVRAM holds its contents only as long as the battery lasts, typically about three days without power. In an extended disaster recovery situation, this loss of data might be unacceptable.

Unlike RAID-5, all RAID-Z writes are full-stripe writes: because the stripe width varies with the size of the block being written, each block goes to disk together with its own parity. There's no read-modify-write overhead, no RAID-5 write hole, and no need for NVRAM in hardware. If you're trying to decide when to use straight mirroring and when to use RAID-Z, take a look at Roch Bourbonnais' blog entry "When to (and not to) use RAID-Z".
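
Creating a RAID-Z pool looks much like creating a mirror, with the raidz keyword in place of mirror. The sketch below only prints the command it would run, and it includes the -n flag from the zpool create usage shown earlier, which displays the resulting configuration without creating the pool. The pool name rzpool and the device names are hypothetical.

```shell
#!/bin/sh
# Print a "zpool create" command for a single RAID-Z vdev.
# $1 = pool name, remaining arguments = devices (hypothetical names below).
raidz_create_cmd() {
    pool=$1
    shift
    # -n asks zpool to show the configuration without creating anything.
    echo "zpool create -n $pool raidz $*"
}

raidz_create_cmd rzpool c1t0d0 c1t1d0 c1t2d0
```

Once the previewed layout looks right, the same command without -n would create the pool.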


File System Snapshots and Clones

Now that we have some file systems full of data, what happens when a user (or a sys admin) accidentally modifies or deletes something and wants to restore the data? Going to backup tapes or disks is slow, and it is not very self-service friendly for the end user. Fortunately, ZFS includes capabilities for snapshots (like those offered by other storage vendors), and the even better news is that there's virtually no overhead at all due to the copy-on-write architecture. In fact, sometimes it is faster to take a snapshot rather than free the blocks containing the old data!

Under ZFS, as in Network Appliance Inc.'s implementation, only modified data is tracked. The number of snapshots you can keep is only limited by your storage capacity. Let's take a brief tour of snapshotting. There is a .zfs directory in every ZFS file system with a snapshot subdirectory:

# cd /testpool/testfs2
# ls -al .zfs/snapshot

total 0
dr-xr-xr-x    2 root     root            2 Jun 29 17:18 .
dr-xr-xr-x    3 root     root            3 Jun 29 17:18 ..

# cp /bin/sh /bin/tcsh /bin/ksh .
# zfs snapshot testpool/testfs2@snap1
# ls -al .zfs/snapshot/snap1/

total 786
drwxr-xr-x    2 root     sys             5 Jun 29 17:17 .
dr-xr-xr-x    3 root     root            3 Jun 29 17:19 ..
-r-xr-xr-x    1 root     other      238332 Jun 29 17:17 ksh
-r-xr-xr-x    1 root     other      113160 Jun 29 17:17 sh
-r-xr-xr-x    1 root     other      364176 Jun 29 17:17 tcsh

# zfs list testpool/testfs2@snap1

NAME                     USED  AVAIL  REFER  MOUNTPOINT
testpool/testfs2@snap1      0      -   882K  -

If I delete ksh and decide that I want it back, I can copy it out of the .zfs/snapshot/snap1 directory:

# rm ksh
# ls -al

total 514
drwxr-xr-x    2 root     sys             4 Jun 29 17:24 .
drwxr-xr-x    4 root     sys             4 May 22 10:48 ..
dr-xr-xr-x    3 root     root            3 Jun 29 17:25 .zfs
-r-xr-xr-x    1 root     other      113160 Jun 29 17:17 sh
-r-xr-xr-x    1 root     other      364176 Jun 29 17:17 tcsh

# cp .zfs/snapshot/snap1/ksh .
# ls -al
total 787
drwxr-xr-x    2 root     sys             5 Jun 29 17:26 .
drwxr-xr-x    4 root     sys             4 May 22 10:48 ..
dr-xr-xr-x    3 root     root            3 Jun 29 17:27 .zfs
-r-xr-xr-x    1 root     other      238332 Jun 29 17:26 ksh
-r-xr-xr-x    1 root     other      113160 Jun 29 17:17 sh
-r-xr-xr-x    1 root     other      364176 Jun 29 17:17 tcsh
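
When a file system has many snapshots, finding the right copy by hand gets tedious. The helper below is a minimal sketch, not a ZFS tool: the function name is made up, and it simply walks the .zfs/snapshot directories shown above and prints the copy with the most recent modification time.

```shell
#!/bin/sh
# Print the path of the most recently modified snapshot copy of a file.
# $1 = file system mount point, $2 = file path relative to that mount point.
# Returns non-zero if no snapshot contains the file.
# Note: the -nt test is not in POSIX, but sh, ksh, and bash commonly support it.
latest_snapshot_copy() {
    fsroot=$1
    relpath=$2
    newest=""
    for candidate in "$fsroot"/.zfs/snapshot/*/"$relpath"; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$candidate" ] || continue
        if [ -z "$newest" ] || [ "$candidate" -nt "$newest" ]; then
            newest=$candidate
        fi
    done
    [ -n "$newest" ] && echo "$newest"
}
```

With the function defined, restoring the deleted shell would be a matter of running cp "$(latest_snapshot_copy /testpool/testfs2 ksh)" /testpool/testfs2/.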

My other option is to perform a rollback of the entire file system so that it matches a specific snapshot:

# rm ksh
# cp /bin/bash .
# ls -al
total 515
drwxr-xr-x    2 root     sys             5 Jun 29 17:28 .
drwxr-xr-x    4 root     sys             4 May 22 10:48 ..
dr-xr-xr-x    3 root     root            3 Jun 29 17:28 .zfs
-r-xr-xr-x    1 root     other      737148 Jun 29 17:28 bash
-r-xr-xr-x    1 root     other      113160 Jun 29 17:17 sh
-r-xr-xr-x    1 root     other      364176 Jun 29 17:17 tcsh

# ls -al .zfs/snapshot/snap1/
total 786
drwxr-xr-x    2 root     sys             5 Jun 29 17:19 .
dr-xr-xr-x    3 root     root            3 Jun 29 17:28 ..
-r-xr-xr-x    1 root     other      238332 Jun 29 17:19 ksh
-r-xr-xr-x    1 root     other      113160 Jun 29 17:17 sh
-r-xr-xr-x    1 root     other      364176 Jun 29 17:17 tcsh

# cd /
# zfs rollback testpool/testfs2@snap1
# ls -al /testpool/testfs2/
total 787
drwxr-xr-x    2 root     sys             5 Jun 29 17:19 .
drwxr-xr-x    4 root     sys             4 Jun 29 17:29 ..
dr-xr-xr-x    3 root     root            3 Jun 29 17:29 .zfs
-r-xr-xr-x    1 root     other      238332 Jun 29 17:19 ksh
-r-xr-xr-x    1 root     other      113160 Jun 29 17:17 sh
-r-xr-xr-x    1 root     other      364176 Jun 29 17:17 tcsh

When a snapshot isn't needed anymore, you can recover the space with zfs destroy:

# zfs destroy testpool/testfs2@snap1
# ls -al testpool/testfs2/.zfs/snapshot/
total 0
dr-xr-xr-x    2 root     root            2 Jun 29 17:31 .
dr-xr-xr-x    3 root     root            3 Jun 29 17:31 ..

As a tip, it's often useful to name snapshots by date for ease of tracking.
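
One way to follow that tip is to build the snapshot name from the current date and time. The following is a sketch under the assumption that a year-month-day-time suffix suits your site; the function name is made up, and the script prints the zfs command rather than running it.

```shell
#!/bin/sh
# Build a snapshot name such as testpool/testfs2@2006-06-29-1718
# from a file system name and the current date and time.
dated_snapshot_name() {
    echo "$1@$(date +%Y-%m-%d-%H%M)"
}

# Print (rather than run) the command; remove the echo to take the snapshot.
echo zfs snapshot "$(dated_snapshot_name testpool/testfs2)"
```

Run from cron, a script like this would give the hourly or daily snapshots mentioned earlier, with names that sort chronologically.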

Another useful feature similar to snapshotting is cloning. Unlike snapshots, clones are writable, so the files in them can be modified. This is often very useful when you have several data sets that are almost, but not quite, identical. One example might be a server that boots diskless clients. In this next example, I'll set up the file system for the first diskless client, take a snapshot, clone the snapshot, and then modify one of the files in the clone:

# zfs create testpool/diskless
# zfs create testpool/diskless/nevada
   (create a bootable filesystem on testpool/diskless/nevada)
# zfs create testpool/diskless/sb2500
# zfs snapshot testpool/diskless/nevada@snap1
# zfs clone testpool/diskless/nevada@snap1 \
  testpool/diskless/sb2500/nevada
# vi /testpool/diskless/sb2500/nevada/etc/hosts
   (change hostname to sb2500 and save)
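
With more than one diskless client, the same steps can be repeated in a loop. The sketch below reuses the snapshot from the example above and prints the commands for each client instead of running them; the client names other than sb2500 are hypothetical.

```shell
#!/bin/sh
# Snapshot of the first client's file system, taken in the example above.
GOLDEN="testpool/diskless/nevada@snap1"

# Print the commands that would give each named client its own
# writable clone of the snapshot.
clone_cmds() {
    for client in "$@"; do
        echo "zfs create testpool/diskless/$client"
        echo "zfs clone $GOLDEN testpool/diskless/$client/nevada"
    done
}

clone_cmds sb2500 ultra20 v240
```

Each clone initially shares all of its blocks with the snapshot, so only the per-client changes, such as the edited hosts file, consume additional space.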

Automatic Endian Adaptiveness

Like UFS, ZFS is supported on both SPARC and x64 hardware. Unlike UFS, you can remove the data disks from one architecture and add them to the other without needing to worry about format or endian issues. Removing a data disk from a system running on SPARC processors and attaching it to an x64 box is now completely transparent. Sun refers to this as automatic endian adaptiveness.

When a block is read from disk, ZFS checks the endianness. If its endianness matches the current architecture, no changes are made. If it does not match, ZFS byte swaps the block before presenting the data. Whenever a block is subsequently written, it is written in the native endianness of the current machine. So while there's some overhead in accessing non-native blocks, this happens less often as these blocks are rewritten in the new machine's native endianness. Note that endianness conversion applies only to ZFS metadata blocks, not user data.
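
For readers who haven't dealt with byte order before, the following toy function shows what a byte swap does to a single 32-bit value. It is an illustration in shell arithmetic only and says nothing about how ZFS itself is implemented; the function name is made up.

```shell
#!/bin/sh
# Reverse the byte order of a 32-bit value, which is what has to happen
# when a word written on a big-endian machine is read on a little-endian
# one, or the other way around.
bswap32() {
    v=$(( $1 & 0xffffffff ))
    printf '0x%08x\n' $(( ((v & 0xff) << 24) | ((v & 0xff00) << 8) | ((v >> 8) & 0xff00) | ((v >> 24) & 0xff) ))
}

bswap32 0x12345678
```

Swapping twice returns the original value, which is why a block can move back and forth between architectures without harm.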

If you wanted to migrate the data disk from one machine to another, you'd first export all of the pools on the physical media, detach the disks, attach them to the new machine, then import the pools:

# zpool export testpool
   (physically move the disks)
# zpool import
   (lists the pools available for import)
# zpool import testpool

During the export, all file system metadata, such as size, quotas, reservations, and NFS exportability, is saved, and the disks and pool are logically removed from the system. On the new machine, running zpool import with no arguments scans for pools that are not currently part of the system and lists them. You can then import a specific pool by name, as above, or import everything that was found with zpool import -a.


Future ZFS Enhancements

As you can see, a lot of features to save time and money are already built into ZFS. The project isn't finished by any means, though. Still more cool features are on the horizon.

One of the hottest ZFS topics is bootable ZFS, a sub-project now underway in OpenSolaris. For the moment, adventurous souls can use a UFS shim to boot into a ZFS root file system on x64 using directions from Tabriz Leman's blog entry on ZFS Mountroot. Sun has talked about adding ZFS bootability to machines running on SPARC technology soon by teaching the SPARC OBP to utilize GRUB as a boot loader.

ZFS boot is also discussed in Jeff Bonwick's message on the zfs-discuss list: "[zfs-discuss] Re: Re: Bootable ZFS and Live upgrade".

Various people have been using ZFS snapshots to perform quick and dirty backups of their file system, but it's likely that we'll see tighter integration between snapshots and actual commercial backup software. For code-level discussion of how snapshots work, take a look at Matthew Ahrens's blog entry on snapshots.

According to Eric Schrock in his zfs-discuss post regarding "Bootable ZFS and Live upgrade", there are plans to support Live Upgrade using ZFS clones. In order for this to happen, the installation and upgrade tools must first be modified, and clone swap support must be added to ZFS. Clone swap support has already been added to Solaris Express.

ZFS is also addressing its security on the wire and on the physical media. The ZFS on disk encryption support project at OpenSolaris is an ongoing effort to provide on-disk encryption and decryption and key management support for ZFS. Darren Moffat wrote a draft document of the zfs-crypto-project for opensolaris.org, detailing various aspects of the planned implementation. The goals are to protect data provided from a SAN over an untrusted path, protect data from theft of physical storage, and provide a mechanism for secure deletion.

The destruction of ZFS encrypted file system keys acts as a form of secure deletion, since once the keys are gone, there's no way to retrieve the data. One of the other security features on the horizon is a DOD-compliant secure deletion. In this scenario, as soon as a block is freed, it's overwritten multiple times.

To keep on top of what else is in the works, be sure to visit the opensolaris.org web site and subscribe to the zfs-discuss or zfs-code mailing lists.



Unless otherwise licensed, code in all technical manuals herein (including articles, FAQs, samples) is provided under this License.

