Monday, April 12, 2010

Various LZMA tools

Once upon a time there was gzip, and many people used it to create .gz files (also .tar.gz and .tgz). Then bzip2 came along, and it created .bz2 files (also .tar.bz2 and .tbz2). Nowadays, most Linux distributions have switched to a newer compression algorithm, LZMA (Lempel-Ziv-Markov chain algorithm), but it has several implementations, and finding the right one has been slightly confusing to me. LZMA has been implemented by XZ Utils, 7zip, and lzip, each of which uses a different container format.

XZ Utils is a stream compressor in the tradition of gzip and bzip2. The file extension is .xz (also .tar.xz and .txz). XZ Utils is the successor of LZMA Utils, whose container format is now considered legacy, although XZ Utils still handles it. The legacy format is identified by the suffix .lzma (also .tar.lzma).

7zip is more like the .zip format, in the sense that an archive can contain multiple compressed files. The program also handles multiple formats, .zip as well as .tar.gz and .tar.bz2. Its Unix implementation is called p7zip. Use 7zip if you see .7z (also .tar.7z).

A lesser-known implementation, lzip, is written in the style of gzip and zlib, and produces .lz files (also .tar.lz and .tlz). These are not to be confused with the .lzma files produced by LZMA Utils, as the two formats are not compatible.
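Since the suffix is the only thing that tells these containers apart at a glance, here is a minimal sketch that guesses the format from the leading magic bytes instead. The function name detect_format is my own invention, and it assumes head, od, and tr are available. Legacy .lzma files have no magic number, so they fall through to "unknown".

```shell
# Guess the container format of a file from its leading magic bytes.
# detect_format is a hypothetical helper, not part of any of these tools.
detect_format() {
    # Hex-encode the first 6 bytes, e.g. "fd377a585a00" for an .xz file.
    magic=$(head -c 6 "$1" | od -An -tx1 | tr -d ' \n')
    case "$magic" in
        fd377a585a00) printf '%s\n' xz ;;      # 0xFD "7zXZ" 0x00
        377abcaf271c) printf '%s\n' 7z ;;      # "7z" 0xBC 0xAF 0x27 0x1C
        4c5a4950*)    printf '%s\n' lzip ;;    # "LZIP"
        1f8b*)        printf '%s\n' gzip ;;
        425a68*)      printf '%s\n' bzip2 ;;   # "BZh"
        *)            printf '%s\n' unknown ;; # includes legacy .lzma (no magic)
    esac
}
```

For example, detect_format hugefile.lz should print lzip regardless of what the file happens to be named.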

According to a Google search today, there are 27,000 results for "tar.xz," 23,700 results for "tar.7z," 22,200 results for "tar.lzma," and only 2,200 results for "tar.lz." GNU tar recognizes the suffixes of all the formats above except .7z, and assigns the -J option to .xz files.
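To keep the options straight, here is a rough sketch that maps an archive name to the GNU tar compression option I would pass for it. The helper name tar_flag is hypothetical; the options themselves (-z, -j, -J, --lzma, --lzip) are GNU tar's, and there is deliberately no entry for .7z.

```shell
# Print the GNU tar compression option matching an archive's suffix.
# tar_flag is a hypothetical helper; it only inspects the file name.
tar_flag() {
    case "$1" in
        *.tar.gz|*.tgz)    printf '%s\n' -z ;;
        *.tar.bz2|*.tbz2)  printf '%s\n' -j ;;
        *.tar.xz|*.txz)    printf '%s\n' -J ;;
        *.tar.lzma)        printf '%s\n' --lzma ;;
        *.tar.lz)          printf '%s\n' --lzip ;;
        *)                 return 1 ;;  # e.g. .7z, which tar does not handle
    esac
}
```

Something like tar "$(tar_flag src.tar.xz)" -cf src.tar.xz src/ would then create the archive, provided the matching compressor is installed. GNU tar can also pick the compressor from the suffix by itself with -a (--auto-compress).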

The following is an informal benchmark run. I compiled these programs using the default Makefile's compiler optimization flags without taking note of what they are, so I might be comparing apples and oranges.
$ time zcat hugefile.gz | 7za a -si hugefile.7z
...

real    20m2.066s
user    31m17.788s
sys     0m14.417s
7zip seems to have implemented multi-threading: user time exceeds real time, which works out to roughly 150% CPU utilization.
$ time zcat hugefile.gz | xz > hugefile.xz

real    38m5.059s
user    38m6.863s
sys     0m10.712s
xz promises to implement multi-threading support in the future.
$ time zcat hugefile.gz | lzip > hugefile.lz

real    43m59.658s
user    43m50.838s
sys     0m7.064s
These are the resulting file sizes.
$ ls -l
total 699264
-rw------- 1 liulk grad2 147791612 Apr 12 23:56 hugefile.7z
-rw------- 1 liulk grad2 238094571 Apr 12 22:00 hugefile.gz
-rw------- 1 liulk grad2 111189035 Apr 13 00:09 hugefile.lz
-rw------- 1 liulk grad2 112989212 Apr 13 00:03 hugefile.xz
The difference is not that great: the .lz and .xz files are within 2% of each other, and all three are considerably smaller than the .gz file. The implementations seem to be of comparable quality, sacrificing more time in order to achieve smaller files. My personal preference would be xz at the moment, because multi-threaded support is promised, GNU tar has a dedicated -J option for the format, and xz comes with a suite of utilities (unxz, xzless, xzcat, etc.) analogous to their gzip counterparts (gunzip, zless, zcat, etc.).
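To put numbers on that, a small sketch that expresses each file size as a percentage of the original .gz file. The function name ratio is made up for this post; the byte counts are copied from the listing above.

```shell
# ratio SIZE BASE: print SIZE as a percentage of BASE, to one decimal place.
# ratio is a hypothetical helper built on awk.
ratio() {
    awk -v s="$1" -v b="$2" 'BEGIN { printf "%.1f%%\n", 100 * s / b }'
}

ratio 147791612 238094571   # hugefile.7z relative to hugefile.gz: 62.1%
ratio 111189035 238094571   # hugefile.lz relative to hugefile.gz: 46.7%
ratio 112989212 238094571   # hugefile.xz relative to hugefile.gz: 47.5%
```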

Update (May 19): I just noticed today that lzip has a parallel implementation, plzip, that can scale to multiple processors. The timing result is as follows:
$ time xzcat hugefile.xz | plzip > hugefile.lz

real    6m46.324s
user    40m7.743s
sys     0m19.721s
It scaled to all the available CPUs on the shared computing node I tested on.
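One way to read these timings is to divide the CPU time (user plus sys) by the wall-clock time (real). The following is a rough sketch of that arithmetic; the function name cpu_util is my own, and it assumes the "XmY.YYYs" format that bash's time prints.

```shell
# cpu_util REAL USER SYS: print (USER + SYS) / REAL as a percentage.
# Each argument uses the "XmY.YYYs" form printed by `time`.
# cpu_util is a hypothetical helper built on awk.
cpu_util() {
    printf '%s %s %s\n' "$1" "$2" "$3" | awk '
        # Convert a value such as 20m2.066s to seconds.
        function secs(t,   p) { sub(/s$/, "", t); split(t, p, "m"); return p[1] * 60 + p[2] }
        { printf "%.0f%%\n", 100 * (secs($2) + secs($3)) / secs($1) }'
}

cpu_util 20m2.066s 31m17.788s 0m14.417s   # 7za:   157%
cpu_util 38m5.059s 38m6.863s 0m10.712s    # xz:    101%
cpu_util 6m46.324s 40m7.743s 0m19.721s    # plzip: 597%
```

By this measure plzip kept about six cores busy, while xz stayed on a single one.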
