ZFS, Sun's Cutting-Edge File System (Part 1: Storage Integrity, Security, and Scalability)
Amy Rich, August 2006
ZFS Overview

Sun's recent addition to the Solaris 10 06/06 Operating System, ZFS, is an innovative ground-up redesign of the traditional UNIX file system. The engineers at Sun and members of the open source community have drawn from some of the best practices currently on the market (such as Network Appliance's snapshots, VERITAS object-based storage management, transactions, and checksumming), and contributed their own ideas and expertise to develop a new streamlined, cohesive approach to file system design. Even though it's still early in its life cycle, ZFS has made such an impact that other UNIX vendors and open source enthusiasts have already intimated their plans to port it to their own operating systems (see Porting ZFS to other platforms on the OpenSolaris site).

With ZFS, Sun addresses the important issues of integrity and security, scalability, and difficulty of administration that often plague other UNIX file systems. In this two-part series, we'll examine the behind-the-scenes workings of ZFS and how this can translate into a savings of time and money for your organization. In this first part, I'll discuss the data integrity and security model and ZFS scalability. The second part will cover ease of administration and future ZFS enhancements.

ZFS Data Integrity and Security

Among the most important components of any file system are data integrity and security. It's vital that information on the disk not suffer from bit rot, silent corruption, or even malicious or accidental tampering. In the past, file systems have had various problems overcoming these challenges and providing reliable, accurate data.

File systems built on older technology, such as earlier versions of UFS, overwrite blocks when modifying in-use data. If a power failure occurs during a write, the data is corrupted and the file system may lose block pointers to important data. To work around this issue, the

For reliability reasons, people have also moved to disk or file system mirroring by using some sort of volume management software. If power is lost and the two halves of the mirror are inconsistent, one half of the mirror resyncs to the other, even if only a handful of blocks are questionable. Not only is I/O performance degraded during the resync, but the machine can't always accurately predict which copy of the data is uncorrupted. Sometimes it picks the wrong mirror to trust, and the bad data overwrites the good.

To address the performance issue, some volume managers introduced what's known as dirty region logging (DRL). With DRL, only the areas undergoing writes at the time of the power loss need to resync. This mitigates the performance issue, but it doesn't solve the problem of detecting which side of the mirror has the valid data.

ZFS addresses these issues by making transaction-based copy-on-write modifications and constantly checksumming every in-use block in the file system.

Transaction-Based Copy-on-Write Operations

ZFS is a combination of file system and volume manager; the file system-level commands require no concept of the underlying physical disks because of storage pool virtualization. All of the high-level interactions occur through the data management unit (DMU), a concept similar to a memory management unit (MMU), only for disks instead of RAM. All of the transactions committed through the DMU are atomic, so data is never left in an inconsistent state.

In addition to being a transaction-based file system, ZFS only performs copy-on-write operations. This means that the blocks containing the in-use data on disk are never modified. The changed information is written to alternate blocks, and the block pointer to the in-use data is only moved once the write transactions are complete. This happens all the way up the file system block structure to the top block, called the uberblock.
As shown in Figure 1, transactions select unused blocks to write modified data and only then change the location to which the preceding block points.
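The ordering described above (write to unused blocks first, move the pointer last) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not ZFS code: the class and method names (CowStore, write, read) and the crash_before_commit flag are hypothetical, and the "disk" is an in-memory list.

```python
class CowStore:
    """Toy copy-on-write store; all names here are illustrative assumptions."""

    def __init__(self):
        self.disk = []                     # simulated disk: blocks are only ever appended
        self.uberblock = self._alloc({})   # address of the current root block

    def _alloc(self, block):
        """Place a block in a previously unused location and return its address."""
        self.disk.append(block)
        return len(self.disk) - 1

    def write(self, name, data, crash_before_commit=False):
        leaf_addr = self._alloc(data)             # 1. new data goes to an unused block
        root = dict(self.disk[self.uberblock])    # 2. copy the parent block ...
        root[name] = leaf_addr                    #    ... and point the copy at the new leaf
        root_addr = self._alloc(root)             #    the copy also lands in an unused block
        if crash_before_commit:
            return                                # simulated power loss: pointer never moved
        self.uberblock = root_addr                # 3. the top-level pointer moves last

    def read(self, name):
        root = self.disk[self.uberblock]
        return self.disk[root[name]]


store = CowStore()
store.write("a", b"old")
store.write("a", b"new", crash_before_commit=True)
print(store.read("a"))   # b'old'
store.write("a", b"newer")
print(store.read("a"))   # b'newer'
```

Because the interrupted write never touched the blocks that the current root points to, the reader still sees the last committed version after the simulated crash.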
Figure 1: Copy-On-Write Transactions

If the machine suffers a power outage in the middle of a data write, no corruption occurs because the pointer to the "good" data is not moved until the entire write is complete. (Note: The pointer to the data is the only thing that is moved.) This eliminates the need for a journaling or logging file system and any need for a

End-to-End Checksumming

To avoid accidental data corruption, ZFS provides memory-based end-to-end checksumming. Most checksumming file systems only protect against bit rot because they use self-consistent blocks, where the checksum is stored with the block itself. In this case, no external checking is done to verify validity. This style of checksumming won't catch things like:
With ZFS, the checksum is not stored in the block but next to the pointer to the block, all the way up to the uberblock. Only the uberblock has a self-validating SHA-256 checksum. All block checksums are done in server memory, so any error up the tree is caught, including the aforementioned misdirected reads and writes, parity errors, phantom writes, and so on. In the past, the burden on the CPU would have bogged down the machine, but these days CPU technology and speed are advanced enough to check disk transactions on the fly.

Not only does ZFS catch these problems, but in a mirrored or RAID-Z configuration, the data is self-healing. (The second article in this series will include more information on RAID-Z.) One of the favorite Sun demonstrations showcasing data self-healing is the following use of dd:

    dd if=/dev/urandom of=/dev/dsk/c0t1d0s5 bs=1024 count=100000

This writes garbage on half of the mirror, but when those blocks are accessed, ZFS performs a checksum and recognizes that the data is bad. ZFS then checksums the other copy of the data, finds it to be valid, and resilvers the bad block on the corrupted side of the mirror instead of panicking because of data corruption.

In a RAID-Z configuration, ZFS sequentially checks for the block on each disk and compares the parity checksum until it finds a match. When a match is found, ZFS knows it has found a block of valid data and fixes all other bad disks. The resilvering process is completely transparent to the user, who never even realizes that a problem occurred.

ZFS constantly checks for corrupt data in the background via a process called scrubbing. The file system code that's used for scrubbing the disks is the same code that's used for resilvering, attaching mirrors, and replacing disks, making the entire process tightly integrated. The administrator can also force a check of an entire storage pool by running the zpool scrub command and then viewing the result with zpool status:
# zpool scrub testpool
# zpool status
  pool: testpool
 state: ONLINE
 scrub: scrub completed with 0 errors on Thu Jun 29 12:47:15 2006
config:

        NAME          STATE     READ WRITE CKSUM
        testpool      ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s5  ONLINE       0     0     0
            c0t1d0s5  ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s6  ONLINE       0     0     0
            c0t1d0s6  ONLINE       0     0     0
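Conceptually, a scrub like the one above reads every referenced block and compares it against the checksum kept next to its block pointer. The following Python sketch illustrates that idea only. The data layout (a disk dictionary and a list of pointers), the helper names, and the use of SHA-256 for every block are assumptions made for the example, not a description of the on-disk format.

```python
import hashlib


def checksum(data: bytes) -> str:
    # SHA-256 is used here purely as a stand-in checksum function.
    return hashlib.sha256(data).hexdigest()


# Simulated disk: block address -> raw bytes.
disk = {1: b"file data A", 2: b"file data B"}

# Block pointers live in the parent and carry the child's expected checksum,
# so the checksum is stored apart from the data it protects.
pointers = [
    {"addr": 1, "cksum": checksum(disk[1])},
    {"addr": 2, "cksum": checksum(disk[2])},
]


def scrub(pointers, disk):
    """Read every referenced block; return the addresses that fail verification."""
    return [p["addr"] for p in pointers if checksum(disk[p["addr"]]) != p["cksum"]]


print(scrub(pointers, disk))   # []

# Overwrite a block behind the file system's back, as the dd demonstration does.
disk[2] = b"garbage"
print(scrub(pointers, disk))   # [2]
```

Because the expected checksum is held by the parent rather than by the block itself, the overwritten block cannot vouch for its own contents.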
After running the aforementioned dd command, the same scrub reports the damage:
# zpool scrub testpool
# zpool status
  pool: testpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool online' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed with 0 errors on Thu Jun 29 12:51:29 2006
config:

        NAME          STATE     READ WRITE CKSUM
        testpool      ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s5  ONLINE       0     0     0
            c0t1d0s5  ONLINE       0     0     5  2.50K repaired
          mirror      ONLINE       0     0     0
            c0t0d0s6  ONLINE       0     0     0
            c0t1d0s6  ONLINE       0     0     0
The output now reports:
The output also provides the following information:
Running

For an in-depth discussion, read Jeff Bonwick's blog entry on ZFS mirror resilvering.

One other ZFS security benefit: It uses NFSv4/NT-style ACLs that include full allow/deny semantics and inheritance. The access controls, based on 17 different attributes, are very fine grained.

ZFS Scalability

While data security and integrity are paramount, a file system must also perform well and stand the test of time; otherwise it won't see much use. The designers of ZFS have removed or greatly increased the limits imposed by modern file systems by using a 128-bit architecture and making all metadata dynamic. ZFS also implements data pipelining, dynamic block sizing, intelligent prefetch, dynamic striping, and built-in compression to improve performance.

The 128-Bit Architecture

Current trends in the industry show that disk drive capacity roughly doubles every nine months to a year. If this trend continues, file systems will require 64-bit addressability in about 10 to 15 years. Instead of planning on 64-bit requirements, the designers of ZFS have taken the long view and implemented a 128-bit file system. This means that ZFS delivers more than 16 billion billion times more capacity than current 64-bit systems (2^128 divided by 2^64 is 2^64, roughly 1.8 x 10^19). According to Jeff Bonwick, the ZFS chief architect, in ZFS: the last word in file systems, "Populating 128-bit file systems would exceed the quantum limits of earth-based storage. You couldn't fill a 128-bit storage pool without boiling the oceans." Jeff also discussed the mathematics behind this statement in his blog entry on 128-bit storage. Since we don't yet have the technology to produce that kind of energy for the mass market, we might be safe for a while.

Dynamic Metadata

In addition to being 128-bit, ZFS metadata is 100 percent dynamic. Because of this, creation of new storage pools and file systems is extremely fast. Only 1 to 2 percent of writes to disk are metadata, which results in a big initial overhead savings.
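As a quick check of the 128-bit versus 64-bit comparison made earlier in this section, the arithmetic can be worked through directly. The sketch below assumes only the doubling rate quoted in the text (one capacity doubling every nine months to a year); the variable names are illustrative, and the resulting year figures are a back-of-the-envelope estimate rather than a forecast.

```python
# Ratio between a 128-bit and a 64-bit address space.
ratio = 2 ** 128 // 2 ** 64
print(ratio)                  # 18446744073709551616
print(ratio == 2 ** 64)       # True
print(ratio > 16 * 10 ** 18)  # True: more than 16 billion billion

# Each doubling of capacity uses up one more address bit, so going from
# 64 to 128 bits buys 64 further doublings.
extra_bits = 128 - 64
years_low = extra_bits * 0.75   # one doubling every nine months
years_high = extra_bits * 1.0   # one doubling every year
print(years_low, years_high)    # 48.0 64.0
```

Under that assumed growth rate, the extra 64 bits would cover roughly another half century of capacity doublings beyond the point at which 64 bits run out.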
There are, for example, no static inodes, so the only restriction is the number of inodes that will fit on the disks in the storage pool. The 128-bit architecture also means that there are no practical limits on the number of files, directories, and so on. Here are some theoretical limits that might, if you can conceive of the scope, knock your socks off:
File System Performance

The fundamental design of ZFS provides a number of performance enhancements over traditional file systems. For starters, ZFS uses a pipelined I/O engine, similar in concept to CPU pipelines. The pipeline operates on I/O interdependencies and can sort based on relative priority and deadline. This pipeline provides scoreboarding, priority, deadline scheduling, out-of-order issue, and I/O aggregation. ZFS also implements an intelligent prefetch algorithm that recognizes linear or algorithmic access patterns and guesses the next block to prefetch.

ZFS uses concurrency to improve speed whenever it can. The file system supports parallel reads and writes to the same file, as well as parallel constant-time directory operations, and the locking strategy is scalable and fast. In addition, work within any given transaction can be done in any order, so the DMU batches reads and writes to optimize disk work. Since transactions are copy-on-write, sequential blocks can be chosen for new data instead of accessing the disk randomly. This allows the disks to run at or near platter speed. ZFS also automatically matches the block size (from 512 bytes to 128K) to the workload to achieve the best performance.

ZFS dynamically stripes data across all available devices. When you add another disk or slice to a stripe, ZFS automatically incorporates the new space and rebalances the write striping to take advantage of it. If a device drops to degraded mode because of errors, ZFS also does its best not to write to that device, instead spreading the load among the other devices.

Lastly, ZFS offers built-in data compression on a per-file system basis. In addition to reducing on-disk space usage, compression can also reduce the amount of I/O required by a factor of two to three. For this reason, enabling compression actually makes some workloads go faster if they are I/O bound instead of CPU bound.
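The compression trade-off described above (spend some CPU to move fewer bytes to and from disk) can be illustrated with a small measurement. This sketch uses zlib from the Python standard library purely as a stand-in compressor; the article does not say which algorithm ZFS uses, and the sample data is invented for the example, so the ratio it prints says nothing about real workloads.

```python
import zlib

# Hypothetical, highly repetitive sample data (log-like text compresses well).
block = b"2006-06-29 12:47:15 INFO scrub completed with 0 errors\n" * 2000

compressed = zlib.compress(block)

# Compression must be lossless: the original bytes come back unchanged.
assert zlib.decompress(compressed) == block

# Fewer bytes on disk means fewer bytes of I/O for the same logical data.
ratio = len(block) / len(compressed)
print(f"{len(block)} bytes logical, {len(compressed)} bytes written, "
      f"{ratio:.1f}x less I/O")
```

Whether this helps in practice depends on how compressible the data is and on whether the workload is limited by the disks or by the CPU, as the paragraph above notes.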
For some information on benchmarking ZFS, take a look at a few other pages:
In the next article, I'll cover more hands-on use of ZFS: creating storage pools and file systems, setting file system parameters, and data snapshots and cloning. I'll also highlight some of the more interesting features on the ZFS development horizon.