¶The ARC file format
Finding comprehensive specs for a file format is often hard.
There are a few file formats that have really well thought out documentation, such as PNG. Then there's the archive format ARC, which I was trying to write a decoder for. The great thing about search engines on the Internet is their ability to magnify the popularity of a single document called "The ARC archive file format" that doesn't actually tell you most of what you need to know about the format, such as the bitstream ordering for bit packing.
Now, there is source code available for ARC extraction, but using source code isn't always the best idea. First, there can be legal issues if the licensing or origin of the code turns out to be unfavorable, but another problem is that source code often isn't very good documentation and often still needs to be reverse engineered a bit to extract the essential bits... assuming it's even comprehensive. As such, I often avoid using source code as a reference if I can. Fortunately, the compression methods used here aren't crazy like arithmetic encoding and can be eyeballed in a hex editor for verification, so with some elbow grease the rest can be determined from existing archives and decoders.
Here's what I've determined:
Header fields
The ARC file format doesn't have a main file header, only individual headers for each file or block. The best doc I've found for the header is one off of Simtel:
http://cdn.simtel.net/pub/simtel/00/03/99/43/arc_file.inf
For the most part the header is documented pretty well, but this file has clarifications in several files that are documented incorrectly or omitted elsewhere:
- The file name field is 13 bytes, including the null terminator and extension period. Some docs show this as 12 bytes, which can be misleading unless you spot the missing byte.
- The file header stores the timestamp for a file in MS-DOS format, which has two second precision. Some documents show this to be a 32-bit DWORD with the date in the high word; this is backwards, as the first 16-bit word contains the date and the second 16-bit word has the time.
- The CRC polynomial is x^16 + x^12 + x^5 + 1. You can compute this using a similar algorithm to the CRC used for PNG by switching to a 16-bit shift register and replacing the polynomial constant 0xEDB88320 with 0xA003.
Compression mode 3 (packed)
A form of run-length encoding, where the byte 0x90 is used to denote a run. This means that like PCX encoding, it has the dubious honor of favoring some byte values over others. 0x90 0x00 is the escape for the 0x90 literal, while any other follow byte is the number of times to repeat the previously emitted byte, plus one. The wiki for The Unarchiver also documents this format as Rle90:
http://code.google.com/p/theunarchiver/wiki/Rle90Algorithm
The algorithm is also used in a pre-pass on many of the later compression algorithms.
Compression mode 4 (squeezed)
A combination of run-length encoding (RLE) and Huffman coding. The run-length encoding is the same as Packed (mode 3) and is decoded after the Huffman decoder. The compressed format is as follows:
- The first 16-bit word contains the number of tree nodes, or equivalently the number of codes in the tree (these are the same due to the tree encoding).
- After that is the tree, which is encoded as an array of nodes with two 16-bit indices per node. The first index is followed for a zero bit and the second for a one bit. Non-negative numbers indicate a branch to another internal node by absolute array index, while negative numbers are leaves for codes. A byte X is encoded as -(x+1). The literal byte can be extracted as the low byte of the one's complement.
- Afterward follows the bitstream. Codes are packed LSB first, with the codes themselves also LSB first. This is backwards from MPEG/JPEG and the same as Deflate. (I hate this encoding -- it's easier to optimize the MPEG/JPEG order.)
This form of tree encoding has the advantage that it's easy to read and can be used directly in decoding, but takes quite a bit of space, 1K for a full tree. Canonical Huffman tree encoding is a lot more efficient -- just using 4-bit lengths drops this to 128 bytes.
Compression mode 8 (crunched)
This is a combination of run-length encoding (RLE) and Lempel-Ziv-Welch (LZW) encoding. The run-length encoding is the same as Packed (mode 3) and is decoded after the LZW decoder. As usual, there are crucial details missing from most documents:
- The first byte is the maximum size of a code in bits. This is normally 12 bits.
- Like the squeeze encoding, codes are packed into bytes LSB first. The codes themselves, however, are stored in natural bit order. This means that if 0xA5 0xFF are the first two bytes, the bitstream will start 0xA5 0xFE.
- The code width gradually widens from 9 bits up to 12 bits as the dictionary grows. Once the dictionary is full, no new strings are added until a reset occurs.
- 0x0100 (256) is the reset code, which clears the dictionary and resets the code size to 9 bits. The tricky part is that there is a variable amount of padding after the reset code. I was able to track this behavior back to the old Un*x compress.c code, which reads N-bit codes in blocks of N bytes, and which discards the rest of a block whenever a reset code is encountered.
- The first code in the stream or after a reset denotes a string but does not add a new string to the dictionary. This code is 9 bits long even though it should always be a literal, since back-to-back resets aren't useful.