These are just some random thoughts of mine. Your mileage may vary.
bzip2-0.9.5 and 0.9.0 use exactly the same file format as the previous version, bzip2-0.1. This decision was made in the interests of stability. Creating yet another incompatible compressed file format would create further confusion and disruption for users.
Nevertheless, this is not a painless decision. Development work since the release of bzip2-0.1 in August 1997 has shown complexities in the file format which slow down decompression and, in retrospect, are unnecessary. One of these is the randomisation mechanism, which doesn't really need to be there: Kunihiko Sadakane's modified suffix array construction algorithm can sort any block, however repetitive, in O(N (log N)^2) worst-case time. I could have switched to that algorithm, but I find it to be slower than bzip2's existing algorithm for most inputs, and the randomisation mechanism protects adequately against bad cases. I didn't think it was a good tradeoff to make. Partly this is due to the fact that I was not flooded with email complaints about bzip2-0.1's performance on repetitive data, so perhaps it isn't a problem for real inputs.
Probably the best long-term solution, and the one I have incorporated into 0.9.5 and above, is to use the existing sorting algorithm initially, and fall back to an O(N (log N)^2) algorithm if the standard algorithm gets into difficulties.
The compressed file format was never designed to be handled by a library, and I have had to jump through some hoops to produce an efficient implementation of decompression. Try passing decompress.c through the C preprocessor and you'll see what I mean. Much of this complexity could have been avoided if the compressed size of each block of data had been recorded in the data stream.
It would be fair to say that the bzip2 format was frozen before I properly and fully understood the performance consequences of doing so.
Improvements which I was able to incorporate into 0.9.0, despite using the same file format, are:
bzip2-0.9.0 now reads and writes files with fread and fwrite; version 0.1 used putc and getc.