I've done a bunch of reading up on ZFS, the Zettabyte File System that is part of modern Solaris. One of the more interesting parts to me is how ZFS handles snapshots, and its use of deadlists to handle object expiration.

Various projects I've had rolling around in the back of my head need some sort of object store with a practical snapshot capability. Some of those projects, like large scale backups, could involve a gigantic number of objects, and trivial object expiration mechanisms just don't work that well for that. Using bitmaps means gigantic bitmaps, as well as a pre-defined number of snapshots you can have, neither of which are all that attractive. The mechanism TSM uses for this, keeping track of everything in a database and doing queries to do expiration, is also unattractive because it just doesn't scale well: expiration of objects is proportional to the entire number of objects present. If you have a billion objects but only one actually needs deletion, you still have to look at all billion objects.

ZFS's mechanism, however, seems fairly attractive, but the way it handles object expiration is subtle and didn't really click in my head until recently. This post is my attempt to convince myself I really understand how it works finally; I figure if I can explain it to someone else I actually understand it.

First, some background on ZFS. ZFS deals with filesystems (or block image files), but both of those are essentially presentation layers on top of an object storage system, so in this article I'll be talking a lot about objects and object sets.

• Everything is an object.
• Changes are grouped together into transactions.
• Every transaction has an ID associated with it, a txid
• Every object has a birthdate, which is simply the txid
• Objects are copy-on-write: any change to a particular object is not written out over the existing object. Rather, a new version of the object is created with the changes present, and any object going up the tree that references the object is made anew to reference the new object, and so on up until you reach the root of the object set tree.

A lot of this is done to handle ZFS's consistency model — once committed to persistent storage ZFS should always be consistent. Some of these qualities, however, are also useful when dealing with snapshots. Essentially, any particular version of a set of objects (a particular filesystem at a point in time, in the case of ZFS) can be pointed at by a particular instance of the root object. This makes it trivial to make snapshots: instead of deleting the old version of the root object, just keep it around and make a pointer to it, saying "this snapshot is defined by this root object". It also makes taking a snapshot an O(1) operation: no matter how many objects you have or how many snapshots you have, creating a new snapshot is always just keeping the old root object around and making a pointer to it.

Here's how we handle snapshots:

• Snapshots are an ordered list of pointers to root objects
• The are ordered by the the birthdate of the root object they point to.
• The head of the list is a special "the beginning of time" object, the tail of the list is the current version of the object tree (the "live copy").
• Everything in between represents a snapshot of the object tree at a certain point in time. There are various bits of metadata associated with the snapshot, like a human readable name, etc., but the important parts to us are the birthdate and the deadlist, which we'll get to in a little bit.

The question at hand is: when can we actually delete an object? There are two variations on that question:

• When we delete an object from the current live object set, can we really delete it, or is it referenced in a previous snapshot?
• When we delete a snapshot, how do we know which objects in it are no longer referenced by any other snapshot, or the live version of the object set?

Let's look at the first question: when we no longer need an object in the live version of the object set, can we delete it? To answer that question, we look at that object's birthdate and compare it to the birthdate noted in the very last snapshot on the list before the life set. If it was born after the last snapshot on the list, we must have created it, so we can delete it at will. If it was born at or before the last snapshot, some previous snapshot refers to it, and it can't be deleted.

What do we do with that object, then? This is where we introduce the deadlist, which is simply a list of objects that we want to get rid of but which are referenced by some previous snapshot. When a snapshot is created of the current live object set, that snapshot inherits the current deadlist, and the live set's deadlist is emptied. This helps us answer the second question, what objects can we get rid of when we delete a snapshot?

Let's say we have three snapshots:

	    |
[snapshot A: Born 100]--[dead list: 99, 43, 17]
|
|
[snapshot B: Born 150]--[dead list: 149, 128, 93]
|
|
[snapshot C: Born 328]--[dead list: 92, 140, 185]
|
|


and each of those snapshots has a corresponding deadlist with various items on it with the birthdates listed; e.g. snapshot C's deadlist has three items on it, with birthdates of 140, 160 and 185. It's useful to note that every item on a particular snapshot's deadlist will be younger than the snapshot time.

Let's say we want to get rid of snapshot B. What blocks do we need to get rid of? We need to get rid of anything that is all of the following: born after snapshot A but before snapshot C and also died after snapshot A but before snapshot C.

Snapshot C's dead list meets three of the requirements: everything on that list was born before snapshot C, died after snapshot A and died before snapshot C. To meet the fourth requirement, we look at everything on snapshot C's deadlist and compare it to snapshot A's birthdate. If the item was born *after* snapshot A's birthdate, we can actually delete it. The items born on 140 and 185 meet this requirement, so they can actually be deleted, The item born on 92, however, can't be deleted, because something at or before snapshot A (with a birthdate of 100) references it. And, we still have to do something with snapshot B's deadlist. So we add 92 to snapshot B's dead list, set that whole thing to snapshot C's dead list, and finally get rid of snapshot B:

            |
|
[snapshot A: Born 100]--[dead list: 99, 43, 17]
|
|
[snapshot C: Born 328]--[dead list: 148, 128, 93, 92]
|
|


And with that, we can easily expire objects, and do so in a fairly efficient manner. The time required to expire objects is proportional only to the number of objects that actually need to be deleted, and most importantly has nothing to do with the total number of objects present (like it would be with a bitmap or trolling through a database).