ZFS, Sun's Cutting-Edge File System (Part 2: Ease of Administration and Future Enhancements)
Amy Rich, September 2006
Introduction

ZFS, Sun's Cutting-Edge File System (Part 1: Storage Integrity, Security, and Scalability), the first article in this two-part series, presented a basic overview of Sun's ground-breaking new file system and talked about its data integrity and security model as well as its massive scalability. In this article, I'll focus on the ease of administration, including demonstrations of real usage, and some exciting new features planned for upcoming releases.

With ZFS, integrity and reliability are great, performance is great, and security is great, but what about making it easy to administer, too? As system administrators, we've all been in the position where we need to fiddle with the file system to squeeze the last bit of performance out of it. Or maybe you've found out the hard way that you set up VxVM without making the second mirror bootable. Wouldn't it be nice to just have a file system/volume manager do the right thing instead of having to know the correct arcane incantation? You're in luck. The ZFS engineers were thinking of you, too.

Object-Based Storage

ZFS is all about object-based storage. The basic building block of the volume manager is the storage pool, which is managed with the zpool command:

zpool - configures ZFS storage pools

zpool create [-fn] [-R root] [-M mountpoint] pool vdev...
zpool add [-fn] pool vdev...
zpool list [-H] [-o field[,field]...] [pool] ...
zpool status [-xv] [pool] ...
zpool destroy [-f] pool
zpool iostat [-v] [pool] ... [interval [count]]
zpool attach [-f] pool device new_device
zpool detach pool device
zpool replace [-f] pool device [new_device]
zpool scrub [-s] pool ...
zpool offline pool device ...
zpool online pool device ...
zpool export [-f] pool
zpool import [-d dir]
zpool import [-d dir] [-f] [-o options] [-R root] pool | id [newpool]
zpool import [-d dir] [-f] [-a]

Now let's get our hands dirty with some practical examples! First, we'll create a mirrored pool named testpool from two disk slices, then check it with zpool list and zpool status:

# zpool create testpool mirror c0t0d0s5 c0t1d0s5
# zpool list
NAME        SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
testpool   7.94G   164K  7.94G   0%  ONLINE  -
# zpool status
  pool: testpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        testpool      ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s5  ONLINE       0     0     0
            c0t1d0s5  ONLINE       0     0     0

We didn't need to use something like

Now that we've allocated storage, we can start making file systems. In ZFS, file systems are hierarchical and are therefore useful administrative control points. You can set the following on a per-file system basis: size, reservations (the minimum amount of space guaranteed to a dataset and its descendants), quotas, choice of compression, backups, and snapshots.

For example, on a home directory server, everyone gets their own file system where you can limit the total space usage to one terabyte and turn on compression. You can then make sure that certain VIP users (or applications) always have at least one gigabyte available and are snapshotted hourly (via cron). The less-important users might be restricted by a quota and only be snapshotted once a day.

Here are the options for the zfs command:

zfs create filesystem
zfs create [-s] [-b blocksize] -V size volume
zfs destroy [-rRf] filesystem|volume|snapshot
zfs clone snapshot filesystem|volume
zfs rename filesystem|volume|snapshot filesystem|volume|snapshot
zfs snapshot filesystem@name|volume@name
zfs rollback [-rRf] snapshot
zfs list [-rH] [-o property[,property]...] [ -t type[,type]...] [filesystem|volume|snapshot] ...
zfs set property=value filesystem|volume...
zfs get [-rHp] [-o field[,field]...] [-s source[,source]...] all | property[,property]... filesystem|volume|snapshot...
zfs inherit [-r] property filesystem|volume...
zfs mount
zfs mount [-o options] [-O] -a
zfs mount [-o options] [-O] filesystem
zfs unmount [-f] -a
zfs unmount [-f] [filesystem|mountpoint]
zfs share -a
zfs share filesystem
zfs unshare [-f] -a
zfs unshare [-f] [filesystem|mountpoint]
zfs backup [-i snapshot] snapshot
zfs restore [-vn ] filesystem|volume|snapshot
zfs restore [-vn ] -d filesystem

Let's create a couple of file systems called testfs and testfs2:

# zfs create testpool/testfs
# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
testpool           268K  7.88G    99K  /testpool
testpool/testfs   98.5K  7.88G  98.5K  /testpool/testfs
# zfs create testpool/testfs2
# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
testpool           376K  7.88G  99.5K  /testpool
testpool/testfs   98.5K  7.88G  98.5K  /testpool/testfs
testpool/testfs2  98.5K  7.88G  98.5K  /testpool/testfs2
# df -k
Filesystem        1k-blocks    Used Available Use% Mounted on
/dev/md/dsk/d0      8262869  751774   7428467  10% /
swap               11800800     784  11800016   1% /etc/svc/volatile
/dev/md/dsk/d30     8262869   23616   8156625   1% /var
swap               11800016       0  11800016   0% /tmp
swap               11800040      24  11800016   1% /var/run
testpool            8257771     100   8257672   1% /testpool
testpool/testfs     8257770      99   8257672   1% /testpool/testfs
testpool/testfs2    8257770      99   8257672   1% /testpool/testfs2

Once again, it just takes one simple command. We didn't have to run

You'll note that each ZFS file system can access all resources of the pool. If we were to create a file system under

# zfs set mountpoint=/devel/testfs testpool/testfs
# zfs create testpool/testfs/dir1
# zfs create testpool/testfs/dir2
# zfs create testpool/testfs/dir3
# df -k
Filesystem            1k-blocks    Used Available Use% Mounted on
/dev/md/dsk/d0          8262869  751774   7428467  10% /
swap                   11800800     784  11800016   1% /etc/svc/volatile
/dev/md/dsk/d30         8262869   23616   8156625   1% /var
swap                   11800016       0  11800016   0% /tmp
swap                   11800040      24  11800016   1% /var/run
testpool                8257771     100   8257672   1% /testpool
testpool/testfs2        8257770      99   8257672   1% /testpool/testfs2
testpool/testfs        16515535      99  16515437   1% /devel/testfs
testpool/testfs/dir1   16515214      99  16515115   1% /devel/testfs/dir1
testpool/testfs/dir2   16515214      99  16515115   1% /devel/testfs/dir2
testpool/testfs/dir3   16515214      99  16515115   1% /devel/testfs/dir3

ZFS automatically relocates

Here's an example of setting some attributes on a file system:

# zfs set sharenfs=rw testpool/testfs
# zfs set compression=on testpool/testfs
# zfs set quota=10g testpool/testfs/dir1         (logically limits space)
# zfs set reservation=20g testpool/testfs/dir2   (logically preallocates space)

Now, what if you're running out of space in your pool and you want to add more capacity? Just add some more devices with zpool add:

# zpool add testpool c0t0d0s6 c0t1d0s6
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool uses 2-way mirror and new vdev uses 1-way disk

ZFS is smart: It knew that my previous devices were mirrors, and it caught the fact that I was trying to add unmirrored storage. If that's actually what I had meant to do, I could force the operation with -f. Adding the new slices as a mirror instead works without complaint:

# zpool add testpool mirror c0t0d0s6 c0t1d0s6
# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
testpool           396K  15.8G  99.5K  /testpool
testpool/testfs   98.5K  15.8G  98.5K  /testpool/testfs
testpool/testfs2  98.5K  15.8G  98.5K  /testpool/testfs2

Again, due to that dynamic metadata, there's no need to wait for the pool to adjust its size or for the mirror to resilver before using the new space. If I run zpool status, the new mirror shows up alongside the original one:

# zpool status
  pool: testpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        testpool      ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s5  ONLINE       0     0     0
            c0t1d0s5  ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s6  ONLINE       0     0     0
            c0t1d0s6  ONLINE       0     0     0

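The replication-level check can also be exercised before committing to a change. The zpool add usage shown earlier includes an -n flag, which displays the configuration that would result without actually adding the devices. The following is a minimal sketch of a preview-then-apply wrapper. The function names run and add_mirror and the DRY_RUN variable are hypothetical, invented for this example, and by default the script only prints the commands rather than executing them.

```shell
#!/bin/sh
# Hypothetical helper: preview a mirrored vdev addition with -n, then apply it.
# With DRY_RUN=1 (the default) the commands are only printed, never executed.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# usage: add_mirror pool device1 device2
add_mirror() {
    if [ "$#" -ne 3 ]; then
        echo "usage: add_mirror pool device1 device2" >&2
        return 1
    fi
    # Only perform the real add if the preview succeeds.
    run zpool add -n "$1" mirror "$2" "$3" &&
    run zpool add "$1" mirror "$2" "$3"
}

add_mirror testpool c0t0d0s6 c0t1d0s6
```

Running the script with DRY_RUN=0 on a real system would execute both commands, so it is worth reading the printed output first.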
RAID-Z

Up until this point, we've only looked at mirrored storage. ZFS implements another type of RAID, called RAID-Z, which is similar to RAID-5 but uses a variable stripe width to eliminate a problem known as the RAID-5 write hole. Under RAID-5, the data blocks of a stripe are written to two or more independent devices, and a parity block is written as a component of each stripe. Since these writes are non-atomic, a power failure between the data and parity writes can leave the two inconsistent and corrupt data.

Some vendors have attempted to address this with parity region logging (like dirty region logging, only for the parity disk) or battery-backed NVRAM. With the NVRAM solution, data is written into NVRAM, then the data and parity writes are made to disk, and finally the NVRAM is released. NVRAM is expensive and can sometimes become a bottleneck. Also, any data stored in NVRAM will be lost after three days of downtime. In an extended disaster recovery situation, this loss of data might be unacceptable.

Unlike RAID-5, all RAID-Z writes are full-stripe writes: every block is written as its own stripe, whose width varies with the size of the block. There's no read-modify-write overhead, no RAID-5 write hole, and no need for NVRAM in hardware. If you're trying to decide when to use straight mirroring and when to use RAID-Z, take a look at Roch Bourbonnais' blog entry, When to (and not to) use RAID-Z.

File System Snapshots and Clones

Now that we have some file systems full of data, what happens when a user (or a sys admin) accidentally modifies or deletes something and wants to restore the data? Going to backup tapes or disks is slow, and it is not very self-service friendly for the end user. Fortunately, ZFS includes capabilities for snapshots (like those offered by other storage vendors), and the even better news is that there's virtually no overhead at all due to the copy-on-write architecture. In fact, sometimes it is faster to take a snapshot rather than free the blocks containing the old data!

Under ZFS, as in Network Appliance Inc.'s implementation, only modified data is tracked. The number of snapshots you can keep is limited only by your storage capacity. Let's take a brief tour of snapshotting. There is a .zfs/snapshot directory at the top of each file system, which starts out empty:

# cd /testpool/testfs2
# ls -al .zfs/snapshot
total 0
dr-xr-xr-x   2 root     root           2 Jun 29 17:18 .
dr-xr-xr-x   3 root     root           3 Jun 29 17:18 ..
# cp /bin/sh /bin/tcsh /bin/ksh .
# zfs snapshot testpool/testfs2@snap1
# ls -al .zfs/snapshot/snap1/
total 786
drwxr-xr-x   2 root     sys            5 Jun 29 17:17 .
dr-xr-xr-x   3 root     root           3 Jun 29 17:19 ..
-r-xr-xr-x   1 root     other     238332 Jun 29 17:17 ksh
-r-xr-xr-x   1 root     other     113160 Jun 29 17:17 sh
-r-xr-xr-x   1 root     other     364176 Jun 29 17:17 tcsh
# zfs list testpool/testfs2@snap1
NAME                     USED  AVAIL  REFER  MOUNTPOINT
testpool/testfs2@snap1      0      -   882K  -

If I delete ksh, I can copy it back from the snapshot:

# rm ksh
# ls -al
total 514
drwxr-xr-x   2 root     sys            4 Jun 29 17:24 .
drwxr-xr-x   4 root     sys            4 May 22 10:48 ..
dr-xr-xr-x   3 root     root           3 Jun 29 17:25 .zfs
-r-xr-xr-x   1 root     other     113160 Jun 29 17:17 sh
-r-xr-xr-x   1 root     other     364176 Jun 29 17:17 tcsh
# cp .zfs/snapshot/snap1/ksh .
# ls -al
total 787
drwxr-xr-x   2 root     sys            5 Jun 29 17:26 .
drwxr-xr-x   4 root     sys            4 May 22 10:48 ..
dr-xr-xr-x   3 root     root           3 Jun 29 17:27 .zfs
-r-xr-xr-x   1 root     other     238332 Jun 29 17:26 ksh
-r-xr-xr-x   1 root     other     113160 Jun 29 17:17 sh
-r-xr-xr-x   1 root     other     364176 Jun 29 17:17 tcsh

My other option is to perform a rollback of the entire file system so that it matches a specific snapshot:

# rm ksh
# cp /bin/bash .
# ls -al
total 515
drwxr-xr-x   2 root     sys            5 Jun 29 17:28 .
drwxr-xr-x   4 root     sys            4 May 22 10:48 ..
dr-xr-xr-x   3 root     root           3 Jun 29 17:28 .zfs
-r-xr-xr-x   1 root     other     737148 Jun 29 17:28 bash
-r-xr-xr-x   1 root     other     113160 Jun 29 17:17 sh
-r-xr-xr-x   1 root     other     364176 Jun 29 17:17 tcsh
# ls -al .zfs/snapshot/snap1/
total 786
drwxr-xr-x   2 root     sys            5 Jun 29 17:19 .
dr-xr-xr-x   3 root     root           3 Jun 29 17:28 ..
-r-xr-xr-x   1 root     other     238332 Jun 29 17:19 ksh
-r-xr-xr-x   1 root     other     113160 Jun 29 17:17 sh
-r-xr-xr-x   1 root     other     364176 Jun 29 17:17 tcsh
# cd /
# zfs rollback testpool/testfs2@snap1
# ls -al /testpool/testfs2/
total 787
drwxr-xr-x   2 root     sys            5 Jun 29 17:19 .
drwxr-xr-x   4 root     sys            4 Jun 29 17:29 ..
dr-xr-xr-x   3 root     root           3 Jun 29 17:29 .zfs
-r-xr-xr-x   1 root     other     238332 Jun 29 17:19 ksh
-r-xr-xr-x   1 root     other     113160 Jun 29 17:17 sh
-r-xr-xr-x   1 root     other     364176 Jun 29 17:17 tcsh

When a snapshot isn't needed anymore, you can recover the space with zfs destroy:

# zfs destroy testpool/testfs2@snap1
# ls -al testpool/testfs2/.zfs/snapshot/
total 0
dr-xr-xr-x   2 root     root           2 Jun 29 17:31 .
dr-xr-xr-x   3 root     root           3 Jun 29 17:31 ..

As a tip, it's often useful to name snapshots by date for ease of tracking.

Another useful feature similar to snapshotting is cloning. Unlike snapshots, files in clones can be modified and written to. This is often very useful when you have several data sets that are almost but not quite identical. One example might be a server that boots diskless clients. In this next example, I'll set up the file system for the first diskless client, take a snapshot, clone the snapshot, and then modify one of the files in the clone:

# zfs create testpool/diskless
# zfs create testpool/diskless/nevada
(create a bootable filesystem on testpool/diskless/nevada)
# zfs create testpool/diskless/sb2500
# zfs snapshot testpool/diskless/nevada@snap1
# zfs clone testpool/diskless/nevada@snap1 \
    testpool/diskless/sb2500/nevada
# vi /testpool/diskless/sb2500/nevada/etc/hosts
(change hostname to sb2500 and save)

Automatic Endian Adaptiveness

Like UFS, ZFS is supported on both SPARC and x64 hardware. Unlike UFS, you can remove the data disks from one architecture and add them to the other without needing to worry about format or endian issues. Removing a data disk from a system running on SPARC processors and attaching it to an x64 box is now completely transparent. Sun refers to this as automatic endian adaptiveness.

When a block is read from disk, ZFS checks its endianness. If the endianness matches the current architecture, no changes are made. If it does not match, ZFS byte swaps the block before presenting the data. New and modified blocks are always written in the native endianness of the machine doing the writing, so while there's some overhead in accessing non-native blocks, this happens less often as those blocks are rewritten on the new machine. Note that endianness conversion only applies to ZFS metadata blocks, not user data.

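The read-side check can be illustrated with a toy model. This is only a sketch of the idea and not ZFS's actual on-disk logic: the marker value 0123abcd and the functions byteswap_hex and read_word are hypothetical, invented for this example. A word whose marker reads correctly is passed through untouched, while one whose marker reads byte-reversed is swapped before use.

```shell
#!/bin/sh
# Toy model of endian-adaptive reads. MARKER is a made-up constant standing in
# for a known value stored next to the data in the writer's byte order.
MARKER=0123abcd

# Reverse the byte order of an even-length hex string (two digits per byte).
byteswap_hex() {
    hex=$1
    if [ $(( ${#hex} % 2 )) -ne 0 ]; then
        echo "byteswap_hex: odd number of hex digits" >&2
        return 1
    fi
    out=""
    while [ -n "$hex" ]; do
        rest=${hex#??}          # everything after the first byte
        byte=${hex%"$rest"}     # the first byte itself
        out=$byte$out           # prepend, so the byte order ends up reversed
        hex=$rest
    done
    printf '%s\n' "$out"
}

# usage: read_word stored_marker stored_value
read_word() {
    if [ "$1" = "$MARKER" ]; then
        # Same byte order as this reader: use the value as stored.
        printf '%s\n' "$2"
    elif [ "$(byteswap_hex "$1")" = "$MARKER" ]; then
        # Written with the opposite byte order: swap before presenting.
        byteswap_hex "$2"
    else
        echo "read_word: unrecognized marker $1" >&2
        return 1
    fi
}

read_word 0123abcd 0000002a   # native order, passed through
read_word cdab2301 2a000000   # foreign order, swapped on read
```

Both calls print the same value, which is the point: the reader sees consistent data regardless of which architecture wrote it.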
If you wanted to migrate the data disk from one machine to another, you'd first export all of the pools on the physical media, detach the disks, attach them to the new machine, then import the pools:
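A minimal sketch of that sequence, assuming the pool is the testpool created earlier. The function names run, export_pool, and import_pool and the DRY_RUN variable are hypothetical; zpool export and zpool import are the subcommands from the usage listing. By default the script only prints the commands, since the two halves run on different machines with a physical disk move in between.

```shell
#!/bin/sh
# Hypothetical walk-through of moving a pool between machines.
# With DRY_RUN=1 (the default) the commands are only printed, never executed.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Step 1, on the old machine: export the pool before detaching its disks.
export_pool() {
    run zpool export "$1"
}

# Step 2, on the new machine, after the disks are attached:
# list the pools available for import, then import the named one.
import_pool() {
    run zpool import &&
    run zpool import "$1"
}

export_pool testpool
# (detach the disks and attach them to the new machine)
import_pool testpool
```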
During the export, all file system metadata, such as size, quotas, reservations, NFS exportability, and so on, is saved, and the disks and pool are logically removed from the system. When you import the data on the new machine, you can name a specific pool or you can just run the

Future ZFS Enhancements

As you can see, a lot of features to save time and money are already built into ZFS. The project isn't finished by any means, though. Still more cool features are on the horizon.

One of the hottest ZFS topics is bootable ZFS, and this sub-project is currently underway in OpenSolaris. Currently, adventurous souls can use a UFS shim to boot into a ZFS root file system on x64 using directions from Tabriz Leman's blog entry on ZFS Mountroot. Sun has talked about adding ZFS bootability to machines running on SPARC technology soon by teaching the SPARC OBP to utilize GRUB as a boot loader. ZFS boot is also discussed in Jeff Bonwick's message on the zfs-discuss list: "[zfs-discuss] Re: Re: Bootable ZFS and Live upgrade".

Various people have been using ZFS snapshots to perform quick and dirty backups of their file systems, but it's likely that we'll see tighter integration between snapshots and actual commercial backup software. For code-level discussion of how snapshots work, take a look at Matthew Ahrens's blog entry on snapshots.

According to Eric Schrock in his zfs-discuss post regarding Bootable ZFS and Live upgrade, there are plans to support LiveUpgrade using ZFS clones. In order for this to happen, the installation and upgrade tools must first be modified.

ZFS is also addressing its security on the wire and on the physical media. The ZFS on disk encryption support project at OpenSolaris is an ongoing effort to provide on-disk encryption and decryption and key management support for ZFS. Darren Moffat wrote a draft document of the zfs-crypto-project for opensolaris.org, detailing various aspects of the planned implementation. The goals are to protect data provided from a SAN over an untrusted path, protect data from theft of physical storage, and provide a mechanism for secure deletion. The destruction of ZFS encrypted file system keys acts as a form of secure deletion, since once the keys are gone, there's no way to retrieve the data.

One of the other security features on the horizon is DoD-compliant secure deletion. In this scenario, as soon as a block is freed, it's overwritten multiple times.

To keep on top of what else is in the works, be sure to visit the opensolaris.org web site and subscribe to the zfs-discuss or zfs-code mailing lists.