These are just some random thoughts of mine. Your mileage may vary.
bzip2-0.9.5 and 0.9.0 use exactly the same file format as the previous version, bzip2-0.1. This decision was made in the interests of stability. Creating yet another incompatible compressed file format would create further confusion and disruption for users.
Nevertheless, this is not a painless decision. Development work since the release of bzip2-0.1 in August 1997 has shown complexities in the file format which slow down decompression and, in retrospect, are unnecessary. One of these is the randomisation mechanism, which doesn't really need to be there: Kunihiko Sadakane's modified suffix array construction algorithm can sort any block, however repetitive, in O(N (log N)^2) worst-case time. I could have switched to that algorithm, but I find it to be slower than bzip2's existing algorithm for most inputs, and the randomisation mechanism protects adequately against bad cases. I didn't think it was a good tradeoff to make. Partly this is due to the fact that I was not flooded with email complaints about bzip2-0.1's performance on repetitive data, so perhaps it isn't a problem for real inputs.
Probably the best long-term solution, and the one I have incorporated into 0.9.5 and above, is to use the existing sorting algorithm initially, and fall back to an O(N (log N)^2) algorithm if the standard algorithm gets into difficulties.
The compressed file format was never designed to be handled by a library, and I have had to jump through some hoops to produce an efficient implementation of decompression. Try passing decompress.c through the C preprocessor and you'll see what I mean. Much of this complexity could have been avoided if the compressed size of each block of data had been recorded in the data stream.
It would be fair to say that the bzip2 format was frozen before I properly and fully understood the performance consequences of doing so.
Improvements which I was able to incorporate into 0.9.0, despite using the same file format, are:
bzip2-0.9.0 now reads and writes files with fread and fwrite; version 0.1 used putc and getc.