Duplicity uses a standard backup scheme: the first backup is a full backup, and subsequent backups are stored in separate files that record only changed files.
This scheme is similar to the one people have been using with tar to back up files for decades, and in fact duplicity does use the tar format. However, for various reasons it may be time to come up with a new archive format that is better suited to hard disks and duplicity.
- Why not tar?
- Current Proposal
- Benefits of this format
- Open questions
Tar is a very standard format, so it seems best to use it unless there are good reasons not to. However, tar (Tape ARchive) is showing its age and has some disadvantages:
Not indexed: In order to see which files are contained in a tar archive, it is necessary to scan through the entire archive. This can be impractical if the archive is very large.
Not seek()able inside container: Because tar does not support encryption or compression inside the archive, tarballs nowadays are usually post-processed with gzip or similar. However, once compressed, the tar archive becomes opaque, and it is impossible to seek around inside it. Thus to extract the last file in a huge .tar.gz or .tar.gpg archive, it is necessary to read through and discard all the prior data in the archive!
Other archive formats like WinZip get around this by compressing each file individually inside the container. Because the container knows the length of each file's compressed data, earlier files can be skipped without decompressing them.
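The zip-style approach can be sketched in a few lines of Python. The header layout here (a 2-byte name length plus a 4-byte compressed length) is purely hypothetical, but it shows how recording each file's compressed size lets a reader hop from entry to entry without decompressing anything it doesn't want:

```python
# Sketch: per-file compression enables seeking past unwanted entries.
# The 6-byte entry header (name length, compressed length) is invented
# for illustration; real formats like zip store more fields.
import io
import struct
import zlib

def write_entry(out, name, data):
    comp = zlib.compress(data)
    out.write(struct.pack(">HI", len(name.encode()), len(comp)))
    out.write(name.encode() + comp)

def find_entry(buf, wanted):
    buf.seek(0)
    while True:
        header = buf.read(6)
        if len(header) < 6:
            return None                    # reached end of archive
        name_len, comp_len = struct.unpack(">HI", header)
        name = buf.read(name_len).decode()
        if name == wanted:
            return zlib.decompress(buf.read(comp_len))
        buf.seek(comp_len, 1)              # skip compressed data unread

archive = io.BytesIO()
for n, d in [("a", b"x" * 1000), ("b", b"hello")]:
    write_entry(archive, n, d)
print(find_entry(archive, "b"))  # b'hello'
```

A gzipped tarball offers no analogue of the `buf.seek(comp_len, 1)` line: the compressed stream must be consumed from the start.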
Doesn't support many filesystem features: Tar also doesn't have support for many features of modern filesystems including access control lists (ACLs), extended attributes (EAs), resource forks or other metadata (as on MacOS for instance), or encoding types for filenames (e.g. UTF-8 vs ascii). The format does not lend itself to extension so adding these features will probably not be easy.
General cruft: There are some legacy features of tar that make it a pain to work with. For instance, long file names and long hard link names are not handled very elegantly, and quantities like the uid are stored as 8-byte octal numbers, which besides being odd is too small to accommodate uids actually seen by some users.
duplicity already doesn't use straight tar: When writing incremental backups, duplicity already goes beyond straight tar: the filename encodes whether an enclosed file represents a diff, the file itself, or a deletion. For instance, the files named snapshot/bin/ls, diff/bin/ls, and deleted/bin/ls in duplicity's tar archive all relate to bin/ls but hold different kinds of information about that file. Although the resulting archive can still be extracted by, say, GNU tar, this extra information will not be understood.
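A reader can split these prefixed member names back apart with a trivial helper. This sketch is illustrative only (the function name is made up, and duplicity's actual code differs):

```python
# Hypothetical helper: recover (entry type, original path) from
# duplicity's prefixed tar member names described above.
def parse_member(name):
    for prefix in ("snapshot", "diff", "deleted"):
        if name.startswith(prefix + "/"):
            return prefix, name[len(prefix) + 1:]
    raise ValueError("unrecognized member name: %s" % name)

print(parse_member("diff/bin/ls"))  # ('diff', 'bin/ls')
```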
This section describes a possible format in general terms. Please let me know if you have suggestions or corrections.
The archive will have two different levels of formatting, much like a .tar.gz file has two levels of formatting. At the outer level, the archive will look like a header, two collections of blocks, and a footer. The collections of blocks hold the important archive data, usually in compressed and/or encrypted form.
At the outer, or block, level, the file will look like this:
Here is more detail on the items in that diagram:
Variable length text sections: Some of the sections above (and in the inner file below) are variable length text fields, in a hierarchical format. Perhaps they will be in XML, or perhaps something simpler. Either way, it should be easy to both parse and extend these sections.
Block headers: Block headers are fixed length (probably 16 byte) sections that contain the length of the following block, as well as a predictable string, which may be useful in case of corruption.
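As an illustration, a 16-byte header could pack an 8-byte magic string next to an 8-byte block length. The magic value and the exact field split here are assumptions; the proposal above fixes only the total size and the two ingredients (length plus a predictable string):

```python
# Sketch of a possible 16-byte block header: predictable magic string
# (for detecting corruption/misalignment) plus the block length.
# MAGIC and field widths are invented for this example.
import struct

MAGIC = b"dupblock"                 # 8 bytes, hypothetical
HEADER = struct.Struct(">8sQ")      # 8-byte magic + 8-byte length = 16 bytes
assert HEADER.size == 16

def pack_header(block_len):
    return HEADER.pack(MAGIC, block_len)

def unpack_header(raw):
    magic, length = HEADER.unpack(raw)
    if magic != MAGIC:
        raise ValueError("corrupt or misaligned block header")
    return length

print(unpack_header(pack_header(4096)))  # 4096
```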
Archive header: The most important information contained here is how the blocks are encoded, for instance with gzip or GnuPG. It could also include other information like the archive title, a copyright notice, etc.
Archive footer: Must contain all information in the header, in case the header is lost. May also contain information useful to the destination after the archive is received. For instance, a cryptographic signature of the rest of the archive contents.
Block table offset: We write this offset to the end of the archive to make random access easier. When looking for a specific file in the archive, an application can seek to the end of the file, follow the pointer, read the block table, and then start on the index. No need to process the regular data.
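Assuming the offset is stored in the archive's final 8 bytes (the field width is a guess; the proposal only says it goes at the end), reading it is a single relative seek:

```python
# Sketch: locate the block table via the trailing offset, without
# reading any of the archive's regular data.
import io
import struct

def read_block_table_offset(f):
    f.seek(-8, 2)                    # 8 bytes before end-of-file
    (offset,) = struct.unpack(">Q", f.read(8))
    return offset

# Toy archive: 12 bytes of "blocks", then a pointer back to offset 12.
archive = io.BytesIO(b"...blocks..." + struct.pack(">Q", 12))
print(read_block_table_offset(archive))  # 12
```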
The block level formatting described above gives us a file within a file. However this inner file now has the property of being completely encrypted or compressed, while retaining some level of random access.
The inner file may be laid out as follows:
At the top level there are two data streams. The first, called the Inner file, has two parts, the Inner data file and the Inner index file. The inner data file is laid out similarly to tar, with a file header followed by its data. The Inner index file is in some sense redundant information that merely provides an easy way to access the inner data file.
The block table is a second stream all by itself. It should contain a list of the outer offsets of all the blocks (at the outer or block level) along with the inner offset of the beginning of that block's data. Given an inner offset (for instance as obtained from a file header), the block table allows an application to find the correct block, seek to it, and uncompress/decrypt it. Also, the block table should include the offset in the inner index file of the root (/) directory's header.
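Given such a table as a sorted list of (outer offset, inner offset) pairs, the lookup described above is a binary search. This sketch assumes that in-memory representation, which is not specified by the proposal:

```python
# Sketch: map an inner offset (e.g. from a file header) to the outer
# block holding it, plus the position within that block's data.
import bisect

def find_block(block_table, inner_offset):
    inner_starts = [inner for _, inner in block_table]
    i = bisect.bisect_right(inner_starts, inner_offset) - 1
    if i < 0:
        raise ValueError("offset before first block")
    outer, inner = block_table[i]
    return outer, inner_offset - inner   # block to seek to, offset within it

# Three blocks of 64 KiB of inner data each, hypothetically compressed
# down to 4 KiB apiece at the outer level.
table = [(0, 0), (4096, 65536), (8192, 131072)]
print(find_block(table, 70000))  # (4096, 4464)
```

The application then seeks to outer offset 4096, decompresses/decrypts that one block, and skips 4464 bytes into the result.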
Here are some more remarks and details on the inner level structure:
File x metadata/contents: This contains the familiar file metadata you might see in a tar header, like mtime and permissions. However, as it is a variable length field, the format here will be pretty loose, with the option of omitting or adding various information like ACLs, EAs, etc. For regular files, it is recommended that a strong hash such as SHA1 also be included.
Multiple contents: In the inner data file, files (like file 1 in the diagram) can optionally have more than one stream of bytes associated with their contents. This may be useful for recording extended attributes, resource forks, etc.
Headers in index vs data: The file headers in the inner index file are the same as the file headers dispersed through the inner data file. However, if their contents are stored separately, the headers in the index must contain the offset of the start of their contents in the inner data file. Additionally, directory entries in the index should contain a list of their files along with the inner file index offset of each.
Why include two copies of the metadata? It seemed to be the only way to get both stream encoding/decoding and random access. Also, metadata compresses very well. On my system rdiff-backup saves metadata for 15+GB of files in about 6MB (<0.04%). Testing on the most recent Linux kernel (which contains smaller files on average), metadata took about 0.3% of the space. Unless average file size becomes much smaller, the tradeoff seems worth it.
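The rdiff-backup figure quoted above is easy to check:

```python
# 6 MB of metadata for 15 GB (15 * 1024 MB) of file data.
overhead = 6 / (15 * 1024)
print("%.3f%%" % (overhead * 100))  # 0.039%
```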
Block headers: As in the outer file, block headers contain the length of the block along with a recognizable byte sequence.
Random access: Unlike a .tar.gz or .tar.gpg file, it is possible for an application to find a file in the middle of a huge archive without reading through the unnecessary parts. To do this, seek to the end, read the offset of the index file, and start scanning for the appropriate file.
Filesystem-ability: In fact, it seems quite possible to write a read-only filesystem driver, enabling archives like this to be mounted. The block table holds the information a superblock would contain, and a file's inode can be identified with the offset of its header in the inner index file.
Stream reading/writing: Like tar, we preserve stream unpacking and writing of archives. On the unpacking side, the application can simply discard everything past the inner data file. On the writing side, things are slightly more complicated because the file metadata must be saved until the end, which may require writing a temporary file. However, I don't think the extra space required will be unreasonable.
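The writing side can be sketched as follows, buffering the headers in memory for simplicity (a real implementation might spill them to a temporary file as noted above; the header encoding here is invented):

```python
# Sketch of stream writing with deferred metadata: each file's header
# is written inline with its data, and a copy of every header is kept
# back and emitted as the inner index file once all data is written.
import io

def write_archive(out, files):            # files: iterable of (name, bytes)
    index = io.BytesIO()                  # stand-in for a temp file
    for name, data in files:
        header = ("%s %d\n" % (name, len(data))).encode()
        out.write(header + data)          # inner data file: header, then data
        index.write(header)               # saved copy for the index
    out.write(index.getvalue())           # inner index file follows the data

out = io.BytesIO()
write_archive(out, [("bin/ls", b"abc"), ("etc/motd", b"hi")])
```

Only one pass over the input is needed, so the archive can still be produced as a stream.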
Flexibility: With metadata, headers, and footers in text format like XML, it should be relatively easy to encode extra information and to expand for the future.
Here are some issues I am unsure about.
Error Correction? Right now the archive format adds predictable byte sequences per block in order to recover from errors. I don't know much about error recovery myself; is this sufficient? What other types of safeguards are there?
XML? We need some kind of variable length hierarchical format to store the metadata in. However, XML may be too heavy. I have heard a number of suggestions; at this point the next step seems to be to start thinking about the details of implementation, which should provide more information.
At this point I am leaning toward XML, or at least a subset of it. However, I was once enthusiastic about XML for the format for rdiff-backup's metadata, but ended up rejecting it as too complicated, so maybe the same thing will happen here.
Multivolume support? Hopefully this won't be hard to add, but I haven't thought about it yet.
Sparse file support? How important is this? I haven't thought about this yet either; hopefully it won't be too hard to add.
Thanks to Will Dyson, John Goerzen, Randall Nortman, and Kevin Spicer for valuable discussion.