BigAdmin System Administration Portal
Feature Article

ZFS, Sun's Cutting-Edge File System (Part 1: Storage Integrity, Security, and Scalability)

Amy Rich, August 2006
Part 2: Ease of Administration and Future Enhancements

ZFS Overview

Sun's recent addition to the Solaris 10 06/06 Operating System, ZFS, is an innovative ground-up redesign of the traditional UNIX file system. The engineers at Sun and members of the open source community have drawn from some of the best practices currently on the market (such as Network Appliance's snapshots, VERITAS object-based storage management, transactions, and checksumming) and contributed their own ideas and expertise to develop a new streamlined, cohesive approach to file system design. Even though it's still early in its life cycle, ZFS has made such an impact that other UNIX vendors and open source enthusiasts have already intimated their plans to port it to their own operating systems (see Porting ZFS to other platforms on the OpenSolaris site).

With ZFS, Sun addresses the important issues of integrity and security, scalability, and difficulty of administration that often plague other UNIX file systems. In this two-part series, we'll examine the behind-the-scenes workings of ZFS and how this can translate into a savings of time and money for your organization. In this first part, I'll discuss the data integrity and security model and ZFS scalability. The second part will cover ease of administration and future ZFS enhancements.


ZFS Data Integrity and Security

Among the most important components of any file system are data integrity and security. It's vital that information on the disk not suffer from bit rot, silent corruption, or even malicious or accidental tampering. In the past, file systems have had various problems overcoming these challenges and providing reliable, accurate data.

File systems built on older technology, such as earlier versions of UFS, overwrite blocks when modifying in-use data. If a power failure occurs during a write, the data is corrupted and the file system may lose block pointers to important data. To work around this issue, the fsck command does its best to find dirty blocks and reconnect information where it can. Unfortunately, fsck needs to look at the entire file system, which can take anywhere from seconds to hours depending on the size of the file system. On business-critical systems, every minute down is money lost. To speed up recovery from a power failure, many file system implementations (including later versions of UFS) added journaling or logging. If the journal itself is corrupt, however, fsck is still invoked to repair the file system. And even this enhanced version of UFS, along with most other journaled file systems, doesn't log user data because of the overhead required.

For reliability reasons, people have also moved to disk or file system mirroring by using some sort of volume management software. If power is lost and the two halves of the mirror are inconsistent, one half of the mirror resyncs to the other, even if only a handful of blocks are questionable. Not only is I/O performance degraded during the resync, but the machine can't always accurately predict which copy of the data is uncorrupted. Sometimes it picks the wrong mirror to trust, and the bad data overwrites the good. To address the performance issue, some volume managers introduced what's known as dirty region logging (DRL). Now only the areas undergoing writes at the time of the power loss need to resync. This works well to mitigate the performance issue, but it doesn't address the problem with detecting which side of the mirror has the valid data.

ZFS addresses these issues by making transaction-based copy-on-write modifications and constantly checksumming every in-use block in the file system.

Transaction-Based Copy-on-Write Operations

ZFS is a combination of file system and volume manager; the file system-level commands require no concept of the underlying physical disks because of storage pool virtualization. All of the high-level interactions occur through the data management unit (DMU), a concept similar to a memory management unit (MMU), only for disks instead of RAM. All of the transactions committed through the DMU are atomic, so data is never left in an inconsistent state.

In addition to being a transaction-based file system, ZFS only performs copy-on-write operations. This means that the blocks containing the in-use data on disk are never modified. The changed information is written to alternate blocks, and the block pointer to the in-use data is only moved once the write transactions are complete. This happens all the way up the file system block structure to the top block, called the uberblock.

As shown in Figure 1, transactions select unused blocks to write modified data and only then change the location to which the preceding block points.

Figure 1: Copy-On-Write Transactions
Image source: Jeff Bonwick

If the machine suffers a power outage in the middle of a data write, no corruption occurs because the pointer to the "good" data is not moved until the entire write is complete. (Note: The pointer to the data is the only thing that is moved.) This eliminates the need for a journaling or logging file system, as well as any need for an fsck or mirror resync when a machine reboots unexpectedly.
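The pointer-move behavior described above can be sketched in a few lines of Python. This is only an illustrative toy, not ZFS code: the CowStore class, its fields, and the crash_before_commit flag are all hypothetical names invented for the example, and a single root pointer stands in for the whole chain of block pointers up to the uberblock.

```python
class CowStore:
    """Toy copy-on-write store: in-use blocks are never modified in place."""

    def __init__(self, initial: bytes):
        self.blocks = {0: initial}   # block address -> contents
        self.next_free = 1           # next unused block address
        self.root = 0                # stands in for the uberblock's pointer

    def read(self) -> bytes:
        return self.blocks[self.root]

    def write(self, data: bytes, crash_before_commit: bool = False) -> None:
        # Step 1: write the changed data to an unused block.
        new_addr = self.next_free
        self.next_free += 1
        self.blocks[new_addr] = data
        if crash_before_commit:
            # Simulated power loss: the pointer was never moved,
            # so readers still see the old, consistent data.
            return
        # Step 2: commit by moving the pointer in a single step.
        self.root = new_addr


store = CowStore(b"version 1")
store.write(b"version 2", crash_before_commit=True)
after_crash = store.read()     # old data, untouched
store.write(b"version 3")
after_commit = store.read()    # new data, visible only after the pointer moved
```

In this sketch an interrupted write leaves an orphaned block behind but never a half-updated one, which is why no repair pass is needed on restart.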

End-to-End Checksumming

To avoid accidental data corruption, ZFS provides memory-based end-to-end checksumming. Most checksumming file systems only protect against bit rot because they use self-consistent blocks where the checksum is stored with the block itself. In this case, no external checking is done to verify validity. This style of checksumming won't catch things like:

  • Phantom writes where the write is dropped on the floor
  • Misdirected reads or writes where the disk accesses the wrong block
  • DMA parity errors between the array and server memory or from the driver, since the checksum validates the data inside the array
  • Driver errors where the data winds up in the wrong buffer inside the kernel
  • Accidental overwrites such as swapping to a live file system

With ZFS, the checksum is not stored in the block but next to the pointer to the block, all the way up to the uberblock. Only the uberblock has a self-validating SHA-256 checksum. All block checksums are done in server memory, so any error up the tree is caught, including the aforementioned misdirected reads and writes, parity errors, phantom writes, and so on. In the past, the burden on the CPU would have bogged down the machine, but these days CPU technology and speed are advanced enough to check disk transactions on the fly. Not only does ZFS catch these problems, but in a mirrored or RAID-Z configuration, the data is self-healing. (The second article in this series will include more information on RAID-Z.)
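To see why keeping the checksum next to the pointer matters, consider this minimal Python sketch. It is an assumption-laden illustration rather than the ZFS on-disk format: BlockPointer, write_block, read_block, and is_detected are hypothetical names, a dictionary plays the role of the disk, and SHA-256 is used for every block purely for simplicity.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class BlockPointer:
    address: int      # where the child block lives on the toy "disk"
    checksum: bytes   # checksum of the child, kept with the pointer


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def write_block(disk: dict, address: int, data: bytes) -> BlockPointer:
    disk[address] = data
    return BlockPointer(address, sha256(data))


def read_block(disk: dict, ptr: BlockPointer) -> bytes:
    data = disk[ptr.address]
    # Verify in memory against the checksum held by the parent pointer.
    if sha256(data) != ptr.checksum:
        raise IOError(f"checksum mismatch at block {ptr.address}")
    return data


def is_detected(disk: dict, ptr: BlockPointer) -> bool:
    try:
        read_block(disk, ptr)
        return False
    except IOError:
        return True


disk = {}
ptr_a = write_block(disk, 10, b"payroll records")
ptr_b = write_block(disk, 11, b"web server logs")

# Misdirected write: data meant for block 11 also lands on block 10.
disk[10] = b"web server logs"

misdirected_caught = is_detected(disk, ptr_a)
intact_block_ok = not is_detected(disk, ptr_b)
```

The contents now sitting in block 10 are perfectly valid data, so a checksum stored inside the block would still match. Only the checksum held by the parent reveals that it is the wrong data for that location.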

One of the favorite Sun demonstrations showcasing data self-healing is the following use of dd where c0t1d0s5 is one half of a mirror or a RAID-Z file system:

dd if=/dev/urandom of=/dev/dsk/c0t1d0s5 bs=1024 count=100000

This writes garbage on half of the mirror, but when those blocks are accessed, ZFS performs a checksum and recognizes that the data is bad. ZFS then checksums the other copy of the data, finds it to be valid, and resilvers the bad block on the corrupted side of the mirror instead of panicking because of data corruption. In a RAID-Z configuration, ZFS sequentially checks for the block on each disk and compares the parity checksum until it finds a match. When a match is found, ZFS knows it has found a block of valid data and repairs the bad data on the other disks. The resilvering process is completely transparent to the user, who never even realizes that a problem occurred.
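The mirrored case just described can be approximated with the following Python sketch. It is a simplified model under stated assumptions, not the ZFS resilvering code: the Mirror class and its fields are hypothetical, each side of the mirror is a dictionary, and the expected checksums are held separately from the data in the way a parent block pointer would hold them.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class Mirror:
    """Toy two-way mirror that repairs a bad copy from a good one on read."""

    def __init__(self):
        self.sides = [{}, {}]    # two copies: block number -> contents
        self.expected = {}       # block number -> checksum kept apart from the data
        self.repaired = 0        # number of copies rewritten from a good one

    def write(self, block: int, data: bytes) -> None:
        for side in self.sides:
            side[block] = data
        self.expected[block] = checksum(data)

    def read(self, block: int) -> bytes:
        want = self.expected[block]
        # Find the first copy whose contents match the expected checksum.
        good = next((s[block] for s in self.sides
                     if checksum(s[block]) == want), None)
        if good is None:
            raise IOError(f"block {block}: no valid copy on either side")
        # Rewrite any copy that fails verification.
        for side in self.sides:
            if checksum(side[block]) != want:
                side[block] = good
                self.repaired += 1
        return good


mirror = Mirror()
mirror.write(0, b"good data")
mirror.sides[1][0] = b"garbage!!"   # simulate the dd overwrite of one half
data = mirror.read(0)               # caller sees good data; bad copy is fixed
```

The caller of read() gets valid data and is never told that one side was bad, which mirrors the transparency described above.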

ZFS constantly checks for corrupt data in the background via a process called scrubbing. The file system code that's used for scrubbing the disks is the same code that's used for resilvering, attaching mirrors, and replacing disks, making the entire process tightly integrated. The administrator can also force a check of an entire storage pool by running the command zpool scrub. (The second article in this series will include more information on storage pools.)

  # zpool scrub testpool
  # zpool status

    pool: testpool
   state: ONLINE
   scrub: scrub completed with 0 errors on Thu Jun 29 12:47:15 2006
  config:

        NAME          STATE     READ WRITE CKSUM
        testpool      ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s5  ONLINE       0     0     0
            c0t1d0s5  ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s6  ONLINE       0     0     0
            c0t1d0s6  ONLINE       0     0     0

After running the aforementioned dd command to corrupt part of the mirror, the output would look like the following:

  # zpool scrub testpool
  # zpool status

    pool: testpool
   state: ONLINE
  status: One or more devices has experienced an unrecoverable error.  An
          attempt was made to correct the error.  Applications are unaffected.
  action: Determine if the device needs to be replaced, and clear the errors
          using 'zpool online' or replace the device with 'zpool replace'.
     see: http://www.sun.com/msg/ZFS-8000-9P
   scrub: scrub completed with 0 errors on Thu Jun 29 12:51:29 2006
  config:

        NAME          STATE     READ WRITE CKSUM
        testpool      ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s5  ONLINE       0     0     0
            c0t1d0s5  ONLINE       0     0     5  2.50K repaired
          mirror      ONLINE       0     0     0
            c0t0d0s6  ONLINE       0     0     0
            c0t1d0s6  ONLINE       0     0     0

The output now reports:

  • A device may be damaged.
  • Applications were unaffected by the error, and ZFS corrected it behind the scenes.
  • The device may need to be replaced.

The output also provides the following information:

  • How to replace the defective device
  • How to clear the errors
  • A URL containing more information about the type of error and how to perform further troubleshooting and correct it
  • Which disk slice had errors and how much was successfully repaired

Running zpool online testpool clears the errors in the CKSUM column, but continues to show that c0t1d0s5 was repaired.

For an in-depth discussion, read Jeff Bonwick's blog entry on ZFS mirror resilvering.

One other ZFS security benefit: It uses NFSv4/NT-style ACLs that include full allow/deny semantics and inheritance. The access controls, based on 17 different attributes, are very fine grained.
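As a rough illustration of what ordered allow/deny semantics mean, the following Python sketch evaluates an access request against a list of entries in the NFSv4 manner: entries are examined in order, a deny entry that covers a still-undecided permission refuses the request, and anything never explicitly allowed is refused. The Ace class and access_granted function are hypothetical names for this example, inheritance is not modeled, and only two of the permissions are shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ace:
    kind: str           # "allow" or "deny"
    who: str            # a user name, or "everyone@"
    perms: frozenset    # permission names this entry covers


def access_granted(acl, user, requested):
    """Walk the entries in order until every requested permission is decided."""
    remaining = set(requested)
    for ace in acl:
        if ace.who not in (user, "everyone@"):
            continue                      # entry does not apply to this user
        if ace.kind == "deny" and remaining & ace.perms:
            return False                  # an undecided permission is denied
        if ace.kind == "allow":
            remaining -= ace.perms        # these permissions are now granted
        if not remaining:
            return True
    return False                          # never explicitly allowed


acl = [
    Ace("deny", "bob", frozenset({"write_data"})),
    Ace("allow", "everyone@", frozenset({"read_data", "write_data"})),
]

bob_can_read = access_granted(acl, "bob", {"read_data"})
bob_can_write = access_granted(acl, "bob", {"write_data"})
alice_can_write = access_granted(acl, "alice", {"write_data"})
```

Because order matters, placing the deny entry for bob ahead of the general allow entry is what keeps him from writing while everyone else can.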


ZFS Scalability

While data security and integrity are paramount, a file system must also perform well and stand the test of time; otherwise, it won't see much use. The designers of ZFS have removed or greatly increased the limits imposed by modern file systems by using a 128-bit architecture and making all metadata dynamic. ZFS also implements data pipelining, dynamic block sizing, intelligent prefetch, dynamic striping, and built-in compression to improve performance.

The 128-Bit Architecture

Current trends in the industry show that disk drive capacity roughly doubles every nine months to a year. If this trend continues, file systems will require 64-bit addressability in about 10 to 15 years. Instead of planning on 64-bit requirements, the designers of ZFS have taken the long view and implemented a 128-bit file system. This means that ZFS delivers more than 16 billion billion times more capacity than current 64-bit systems. According to Jeff Bonwick, the ZFS chief architect, in ZFS: the last word in file systems, "Populating 128-bit file systems would exceed the quantum limits of earth-based storage. You couldn't fill a 128-bit storage pool without boiling the oceans." Jeff also discussed the mathematics behind this statement in his blog entry on 128-bit storage. Since we don't yet have the technology to produce that kind of energy for the mass market, we might be safe for a while.
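The arithmetic behind these figures is easy to check. The short Python sketch below computes how many 64-bit address spaces fit into a 128-bit one, and estimates how long the doubling trend takes to reach the 64-bit limit. The starting point of roughly 2^50 bytes (about a petabyte) for today's largest data sets is an assumption made for this example, not a figure from the article.

```python
# How many times larger is a 128-bit address space than a 64-bit one?
ratio = 2**128 // 2**64              # equals 2**64, about 1.8 x 10^19

# Assumed starting point: the largest data sets today hold about 2**50 bytes.
doublings_left = 64 - 50             # doublings needed to reach 2**64 bytes

years_if_nine_months = doublings_left * 0.75   # capacity doubles every nine months
years_if_one_year = doublings_left * 1.0       # capacity doubles every year
```

Under that assumption the estimate comes out at 10.5 to 14 years, which is consistent with the 10 to 15 year range quoted above.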

Dynamic Metadata

In addition to being 128-bit, ZFS metadata is 100 percent dynamic. Because of this, creation of new storage pools and file systems is extremely fast. Only 1 to 2 percent of writes to disk are metadata, which results in a big initial overhead savings. There are, for example, no static inodes, so the only restriction is the number of inodes that will fit on the disks in the storage pool.

The 128-bit architecture also means that there are no practical limits on the number of files, directories, and so on. Here are some theoretical limits that might, if you can conceive of the scope, knock your socks off:

  • 2^48 snapshots in any file system
  • 2^48 files in any individual file system
  • 16 exabyte file systems
  • 16 exabyte files
  • 16 exabyte attributes
  • 3x10^23 petabyte storage pools
  • 2^48 attributes for a file
  • 2^48 files in a directory
  • 2^64 devices in a storage pool
  • 2^64 storage pools per system
  • 2^64 file systems per storage pool
File System Performance

The fundamental design of ZFS provides a number of performance enhancements over traditional file systems. For starters, ZFS uses a pipelined I/O engine, similar in concept to CPU pipelines. The pipeline operates on I/O interdependencies and can sort based on relative priority and deadline. This pipeline provides scoreboarding, priority, deadline scheduling, out-of-order issue, and I/O aggregation. ZFS also implements an intelligent prefetch algorithm that recognizes linear or algorithmic access patterns and guesses the next block to prefetch.
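Two of the pipeline ideas named above, deadline and priority ordering and I/O aggregation, can be sketched generically in Python. This is not the ZFS I/O pipeline: the function names are hypothetical, the rule of ordering by deadline first and priority second is an assumption chosen for the example, and requests are reduced to simple tuples.

```python
import heapq


def issue_order(queue):
    """Order requests by earliest deadline, then by priority (lower is sooner)."""
    heap = [(deadline, priority, name) for name, priority, deadline in queue]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


def aggregate(requests):
    """Merge (offset, length) requests that touch adjacent or overlapping ranges."""
    merged = []
    for offset, length in sorted(requests):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            last_offset, last_length = merged[-1]
            end = max(last_offset + last_length, offset + length)
            merged[-1] = (last_offset, end - last_offset)
        else:
            merged.append((offset, length))
    return merged


# (name, priority, deadline) tuples, submitted in arbitrary order.
order = issue_order([("prefetch", 5, 30), ("sync write", 0, 10), ("read", 1, 10)])

# Three adjacent 4K requests collapse into one 12K request.
combined = aggregate([(8192, 4096), (0, 4096), (4096, 4096), (65536, 4096)])
```

Issuing requests out of submission order and combining neighbors into one larger transfer are what let a pipeline of this kind cut down on the number of separate disk operations.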

ZFS uses concurrency to improve speed whenever it can. The file system supports parallel reads and writes to the same file, as well as parallel constant-time directory operations, and the locking strategy is scalable and fast. In addition, work within any given transaction can be done in any order, so the DMU batches reads and writes to optimize disk work. Since transactions are copy-on-write, sequential blocks can be chosen for new data instead of accessing the disk randomly. This allows the disks to run at or near platter speed. ZFS also automatically matches the block size (from 512 bytes to 128K) to the workload to achieve the best performance.
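The article does not say how ZFS chooses a block size, so the following Python sketch is only one plausible policy, shown to make the 512-byte to 128K range concrete: pick the smallest power-of-two size in that range that holds the data being written. The function name and the policy itself are assumptions for illustration.

```python
def pick_block_size(nbytes, minimum=512, maximum=128 * 1024):
    """Smallest power-of-two block size in [minimum, maximum] that fits nbytes."""
    size = minimum
    while size < nbytes and size < maximum:
        size *= 2
    return size


small_write = pick_block_size(300)          # fits in the 512-byte minimum
medium_write = pick_block_size(513)         # needs the next size up, 1024 bytes
large_write = pick_block_size(1_000_000)    # capped at the 128K maximum
```

Under a policy like this, small files avoid wasting most of a large block, while big sequential writes use the largest block available.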

ZFS dynamically stripes data across all available devices. When you add another disk or slice to a stripe, ZFS will automatically incorporate the new space and re-balance the write striping to take advantage of the new space. If a device should drop to degraded mode because of errors, ZFS also does its best not to write to that device, instead spreading the load out amongst the other devices.
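A simple way to picture this behavior is an allocator that sends each new block to the healthy device with the most free space. The Python sketch below does exactly that. It is a deliberately naive model and not the ZFS allocator: the function names, the dictionary layout, the device names, and the most-free-space rule are all assumptions made for the example.

```python
def pick_device(devices):
    """Prefer ONLINE devices; among those, choose the one with the most free blocks."""
    healthy = [d for d in devices if d["state"] == "ONLINE"]
    candidates = healthy or devices     # fall back only if nothing is healthy
    return max(candidates, key=lambda d: d["free"])


def stripe(devices, nblocks):
    """Place nblocks one at a time and return the device chosen for each."""
    placement = []
    for _ in range(nblocks):
        device = pick_device(devices)
        device["free"] -= 1
        placement.append(device["name"])
    return placement


devices = [
    {"name": "c0t0d0", "free": 2, "state": "ONLINE"},
    {"name": "c0t1d0", "free": 2, "state": "ONLINE"},
    {"name": "c0t2d0", "free": 6, "state": "ONLINE"},     # newly added disk
    {"name": "c0t3d0", "free": 9, "state": "DEGRADED"},   # avoided despite free space
]

placement = stripe(devices, 6)
```

In this run the newly added disk absorbs writes until it is as full as its neighbors, after which the load spreads evenly, and the degraded device receives nothing.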

Lastly, ZFS offers built-in data compression on a per-file system basis. In addition to reducing the on-disk space usage, compression can also decrease the necessary amount of I/O by two to three times. For this reason, enabling compression actually makes some workloads go faster if they are I/O bound instead of CPU bound.
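The effect on I/O volume is easy to demonstrate outside of ZFS. The Python sketch below compresses a block of repetitive log-style data and compares the sizes. Python's zlib module is used here only as a stand-in, since the article does not name the algorithm ZFS uses, and the sample data is invented for the example. Real-world ratios depend heavily on the data.

```python
import zlib

# Invented sample: highly repetitive text, similar in spirit to a log file.
block = b"2006-06-29 12:47:15 GET /index.html 200\n" * 1000

compressed = zlib.compress(block)
restored = zlib.decompress(compressed)

bytes_without_compression = len(block)        # what would be written uncompressed
bytes_with_compression = len(compressed)      # what is actually written
ratio = bytes_without_compression / bytes_with_compression
```

Every byte saved here is a byte the disk never has to transfer, which is why an I/O-bound workload can finish sooner even though the CPU does extra work.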

For some information on benchmarking ZFS, take a look at a few other pages:

In the next article, I'll cover more hands-on use of ZFS: creating storage pools and file systems, setting file system parameters, and data snapshots and cloning. I'll also highlight some of the more interesting features on the ZFS development horizon.

