exepack

David Fifield <david@bamsoftware.com>

Last updated:

Source code tarball
exepack-1.4.0.tar.gz (sig)
Precompiled Windows executable
exepack.exe (sig)
Git repo
git clone https://www.bamsoftware.com/git/exepack.git

exepack is a program to compress and decompress 16-bit DOS executables with EXEPACK, a format for self-extracting executables.

Compression:
exepack unpacked.exe packed.exe
Decompression:
exepack -d packed.exe unpacked.exe
Use in a pipeline:
unzip -p comic.zip comic.exe | exepack -d /dev/stdin unpacked.exe

I wanted to reverse engineer some old DOS games like Mega Man and Captain Comic. These games' executables are packed with EXEPACK; you need to unpack them in order for the disassembly to make sense. I decided to write my own unpacker after encountering a file compressed using a variant of EXEPACK that other tools couldn't handle at the time.

Goals:

exepack is written in Rust. You need rustc and cargo to compile it.

Versions of the EXEPACK format

If you have a DOS EXE file that contains the string Packed file is corrupt, it is most likely packed with EXEPACK.

The most prominent documentation of EXEPACK online is at the DOS Game Modding Wiki: http://www.shikadi.net/moddingwiki/Microsoft_EXEPACK#File_Format. As of this writing, the format described there is just one of several slightly incompatible EXEPACK formats. The formats differ in the size of the EXEPACK metadata header, whether they support an optional padding block, the size of their executable decompression stub, the localization of an error message, and the presence of certain bugs.

The general structure of an EXEPACK-packed file is:

┌──────────────────────────────────────┐   runtime address
│ EXE header                           │
├──────────────────────────────────────┤ ← ds:0100 = es:0100
│ compressed data                      │
├──────────────────────────────────────┤ 
│ optional skip_len padding            │
│ (only if EXEPACK header is 18 bytes) │
├──────────────────────────────────────┤ ← cs:0000
│ EXEPACK header (16 or 18 bytes)      │
├──────────────────────────────────────┤ ← cs:ip
│ EXEPACK decompression stub           │
├──────────────────────────────────────┤
│ packed relocation table              │
├──────────────────────────────────────┤ ← cs:exepack_size
┆ possible trailing garbage            ┆

Pointers use segment:offset notation: cs:ip means, as a linear address, 16×cs+ip.
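As a quick worked example of the notation, the conversion can be sketched as follows. The name linear_address is made up for this illustration and is not part of exepack.

```python
def linear_address(segment: int, offset: int) -> int:
    """Convert a segment:offset pointer to a linear address: 16*segment + offset."""
    return 16 * segment + offset


# Different segment:offset pairs can name the same linear address.
print(hex(linear_address(0x1234, 0x0010)))  # 0x12350
print(hex(linear_address(0x1235, 0x0000)))  # 0x12350
```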

The cs and ip fields in the EXE header tell us where to find the EXEPACK header and how big it is. There are two possible EXEPACK headers, a 16-byte one and an 18-byte one. They differ in the presence of a skip_len field.

16-byte header           18-byte header
uint16_t real_ip         uint16_t real_ip
uint16_t real_cs         uint16_t real_cs
uint16_t mem_start       uint16_t mem_start
uint16_t exepack_size    uint16_t exepack_size
uint16_t real_sp         uint16_t real_sp
uint16_t real_ss         uint16_t real_ss
uint16_t dest_len        uint16_t dest_len
                         uint16_t skip_len
uint16_t signature "RB"  uint16_t signature "RB"

The header field names are from ModdingWiki. mem_start is not a meaningful header field; it is just temporary storage used by the decompression stub. exepack_size is the size in bytes of the entire EXEPACK block: header, stub, and packed relocation table. dest_len should perhaps instead be called uncompressed_len: it's the size (in 16-byte paragraphs) of the uncompressed data. Similarly, the cs field of the EXE header could also be called compressed_len, because the compressed data ends just before the EXEPACK header, which is located at cs:0000. The only exception is when skip_len is present; in that case, uncompressed_len and compressed_len are both reduced by 16×(skip_len − 1) bytes. With the 16-byte header, it is as if skip_len always had the value 1; i.e., there is no skip_len padding.
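The following is a minimal sketch of reading either header variant and computing the two lengths described above. It is not the code used by exepack. The names ExepackHeader, read_exepack_header, and data_lengths are hypothetical, and the sketch assumes the usual EXE header layout (header size in paragraphs at offset 0x08, ip at 0x14, cs at 0x16). It does no validation beyond the header size and the signature.

```python
import struct
from typing import NamedTuple


class ExepackHeader(NamedTuple):
    real_ip: int
    real_cs: int
    mem_start: int
    exepack_size: int
    real_sp: int
    real_ss: int
    dest_len: int
    skip_len: int    # implied value 1 for the 16-byte header
    header_len: int  # 16 or 18, taken from the EXE header's ip field


def read_exepack_header(exe: bytes):
    """Return (cs, ExepackHeader) for a whole EXE file held in memory."""
    (header_paragraphs,) = struct.unpack_from("<H", exe, 0x08)
    ip, cs = struct.unpack_from("<HH", exe, 0x14)
    # cs:0000 is relative to the end of the EXE header.
    start = 16 * (header_paragraphs + cs)
    if ip == 16:
        *body, signature = struct.unpack_from("<8H", exe, start)
        skip_len = 1  # no skip_len field: behaves as if it were 1
    elif ip == 18:
        *body, skip_len, signature = struct.unpack_from("<9H", exe, start)
    else:
        raise ValueError(f"unexpected EXEPACK header size {ip}")
    if signature != 0x4252:  # the bytes "RB" read as a little-endian uint16_t
        raise ValueError("missing RB signature")
    return cs, ExepackHeader(*body, skip_len, ip)


def data_lengths(cs: int, header: ExepackHeader):
    """Return (compressed_len, uncompressed_len) in bytes."""
    padding = 16 * (header.skip_len - 1)
    return 16 * cs - padding, 16 * header.dest_len - padding
```

A real implementation would also have to reject inconsistent values, such as a skip_len larger than dest_len.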

(Aside: apart from complicating the unpacking algorithm, skip_len doesn't seem to serve any purpose. w4kfu found many executables with skip_len > 1, but they would work just as well with skip_len = 1.)

The decompression stub immediately follows the EXEPACK header. As it is located at cs:ip, it is the code that DOS will jump to as soon as the compressed executable is loaded. The stub is responsible for copying itself out of the way, decompressing the compressed data, and jumping to the entry point of the original uncompressed program. There have been several different decompression stubs over the years. The following table shows the characteristics of the ones that are known to me. See doc/README.stubs and doc/*.asm in the exepack source code for commented disassembly. The one with size 283 is the format documented at ModdingWiki. This program uses its own custom stub, designed to fix the problems of the other stubs, while keeping a size of 283 for compatibility with other external unpackers.

size  skip_len?  restores ax?  A20 bug?  relocation 0xffff bug?  allows expansion?  error string            producer
258   no         no            yes       yes                     no                 Packed file is corrupt  EXEPACK 4.00; or LINK /EXEPACK 3.02, 3.05, or 3.06
258   no         no            yes       yes                     no                 Fichero corrompido      ?
279   no         no            yes       no                      no                 Packed file is corrupt  EXEPACK 4.03, LINK /EXEPACK 3.51, or IBM Linker/2 1.0
277   no         no            yes       no                      no                 Packed file is corrupt  LINK /EXEPACK 3.10, 3.60, 3.61, 3.64, 3.65, 5.01.20, 5.01.21
283   yes        no            yes       no                      no                 Packed file is corrupt  EXEPACK 4.05 or 4.06
290   no         yes           no        no                      no                 Packed file is corrupt  LINK /EXEPACK 3.69, 5.05, 5.10, 5.13, 5.15, 5.31.009, 5.60, 5.60.220, or 5.60.339
283   yes        yes           no        no                      yes                Packed file is corrupt  exepack (this program)

size
Size of the decompression stub code, not counting the EXEPACK header or packed relocation table.
skip_len?
"no" means a 16-byte EXEPACK header without skip_len; "yes" means an 18-byte EXEPACK header with skip_len.
restores ax?
The state of most CPU registers is unspecified at startup, but ax has a meaning (DOS uses it to report whether the drive specifiers in the program's default FCBs are valid). The decompression stub should restore the original value of ax before jumping to the decompressed code, but most versions do not.
A20 bug?
"yes" means the stub relies on 8086-style 20-bit address wraparound; i.e., it requires the address fff0:0123 to map to the linear address 0x23, not 0x100023. Stubs with this bug may falsely error out with "Packed file is corrupt" when run at a low address in memory.
relocation 0xffff bug?
The decompression stub has to apply relocations, by adding the program's starting segment to various 16-bit values in the program text. "yes" means the stub has a bug when patching a pointer at X:ffff: it will patch the bytes X:ffff and X:0000, instead of X:ffff and (X+0x1000):0000.
allows expansion?
The standard stubs can't cope when the compressed program would be bigger than the original uncompressed program. The custom stub in exepack handles this case, which means you can, for example, recursively compress an executable 10 times and it will still run correctly.
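The two address-related bugs in the table can be illustrated with a little arithmetic. This is only a sketch of the address calculations described above, not a model of any particular stub; the function names are invented for the example.

```python
def linear_8086(segment: int, offset: int) -> int:
    """Linear address with 8086-style wraparound: only 20 address bits."""
    return (16 * segment + offset) & 0xFFFFF


def linear_a20(segment: int, offset: int) -> int:
    """Linear address when the A20 line is enabled: no wraparound."""
    return 16 * segment + offset


# The A20 bug: a stub that expects the first result can see the second.
print(hex(linear_8086(0xFFF0, 0x0123)))  # 0x23
print(hex(linear_a20(0xFFF0, 0x0123)))   # 0x100023


def word_byte_addresses(segment: int, offset: int, buggy: bool) -> list:
    """Segment:offset addresses of the two bytes of a 16-bit value to patch."""
    first = (segment, offset)
    if offset != 0xFFFF:
        return [first, (segment, offset + 1)]
    if buggy:
        # Relocation 0xffff bug: the offset wraps around within the segment.
        return [first, (segment, 0x0000)]
    # Correct behavior: the second byte lies 64 KiB further on.
    return [first, (segment + 0x1000, 0x0000)]
```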

An external unpacker like this one doesn't care about the contents of the decompression stub, but it has to know its length in order to locate the packed relocation table. There is no field that indicates where the stub ends and the relocations begin; it's implicit in the offsets encoded into the instructions of the stub. The error string Packed file is corrupt is a fairly reliable indicator: it always appears right at the end of the stub. However, the message may be localized (for example, Fichero corrompido    ), so searching for it is not completely foolproof. I initially tried having a table of known stubs, but later I changed it to instead search for the byte pattern that precedes the error message, cd 21 b8 ff 4c cd 21, which encodes the instructions int 0x21; mov ax, 0x4cff; int 0x21, and then seek 22 bytes past the end of it. This works because the error message seems always to be 22 bytes long. You can always check your guess after reading the packed relocation table; it should end exepack_size bytes after the beginning of the EXEPACK header.
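The pattern search described above might look something like this. It is a sketch, not the code in exepack: find_relocation_table is a hypothetical name, the function takes the first match it finds, and it assumes exepack_block starts at the EXEPACK header (cs:0000).

```python
# Encodes the instructions: int 0x21; mov ax, 0x4cff; int 0x21
PATTERN = bytes.fromhex("cd21b8ff4ccd21")
ERROR_MESSAGE_LEN = 22  # length of the error string in all known stubs


def find_relocation_table(exepack_block: bytes, header_len: int) -> int:
    """Return the offset, within the EXEPACK block, of the packed relocation table.

    Searching starts after the EXEPACK header, at the start of the stub.
    """
    i = exepack_block.find(PATTERN, header_len)
    if i < 0:
        raise ValueError("cannot find the end of the decompression stub")
    return i + len(PATTERN) + ERROR_MESSAGE_LEN
```

The result is only a guess until the packed relocation table has been read and found to end at exepack_size.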

After the stub comes the packed relocation table. Notionally, the relocation table is an array of segment:offset pointers. EXEPACK compresses the array by normalizing all the pointers to have a segment that is a multiple of 0x1000, and then storing 16 separate arrays containing offsets only. The first uint16_t is the number of offsets in the array for segment 0000, followed by that many uint16_ts for the offsets themselves; then a uint16_t for the number of offsets for segment 1000, followed by that many offsets; and so on up to segment f000.
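To make the layout concrete, here is a small sketch that expands a packed relocation table back into segment:offset pairs. The function name is hypothetical, and truncated input is left to raise struct.error rather than being handled specially.

```python
import struct


def parse_packed_relocations(table: bytes):
    """Return (relocations, bytes_consumed).

    relocations is a list of (segment, offset) pairs, where each segment
    is one of 0x0000, 0x1000, ..., 0xf000.
    """
    relocations = []
    pos = 0
    for i in range(16):
        segment = i * 0x1000
        # Number of offsets stored for this segment.
        (count,) = struct.unpack_from("<H", table, pos)
        pos += 2
        # The offsets themselves, each a little-endian uint16_t.
        offsets = struct.unpack_from(f"<{count}H", table, pos)
        pos += 2 * count
        relocations.extend((segment, offset) for offset in offsets)
    return relocations, pos
```

The bytes_consumed value is what lets a caller check that the table ends exepack_size bytes after the beginning of the EXEPACK header.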

The EXEPACK and LINK version numbers come from the source code of UNP, and from experimenting with various versions of those programs sent to me by Dennis Luehring. The Detect-It-Easy software has signatures for versions of EXEPACK: EXEPACK.2.sg, WordPerfect EXEPack.2.sg. YaraRules has EXEPACKv405v406 and EXEPACKLINKv360v364v365or50121 rules. RGB Classic Games marks some versions as "2nd generation", but I don't know what their criteria for that are.

Unpacking algorithm

Taking the above observations into consideration, here is a rough algorithm for EXEPACK unpacking that is compatible with known formats. An implementation would have to deal with several possible error conditions, for example skip_len > dest_len.

min_extra_paragraphs in an EXE file is something like the BSS segment in a Unix executable. It specifies an amount of additional memory to allocate, beyond the main program text. min_extra_paragraphs is effectively part of the program image, but the EXE file stores only its size, not its contents. After decompressing, you must adjust min_extra_paragraphs so that the sum

program size + min_extra_paragraphs size

is the same before and after. Usually, this will mean decreasing min_extra_paragraphs, because the decompressed program is usually larger than the compressed program, but min_extra_paragraphs may also increase. See an investigation of how Microsoft EXEPACK.EXE and UNP handle this field.
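The adjustment is simple arithmetic, sketched below with made-up numbers. The function names are hypothetical, program sizes are given in bytes (excluding the EXE header) and rounded up to whole paragraphs, and clamping the result at zero is a choice made for this sketch rather than something taken from EXEPACK.EXE, UNP, or exepack.

```python
def paragraphs(num_bytes: int) -> int:
    """Number of 16-byte paragraphs needed to hold num_bytes, rounded up."""
    return (num_bytes + 15) // 16


def adjusted_min_extra(old_size: int, old_min_extra: int, new_size: int) -> int:
    """Return a min_extra_paragraphs value that keeps
    program size + min_extra_paragraphs constant."""
    total = paragraphs(old_size) + old_min_extra
    # Assumption of this sketch: never go below zero extra paragraphs.
    return max(0, total - paragraphs(new_size))


# A 20000-byte packed program with min_extra_paragraphs = 0x800
# that unpacks to 48000 bytes.
print(adjusted_min_extra(20000, 0x800, 48000))  # 298
```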

exepack computes the checksum field for files it writes, but ignores checksum in files it reads. You may as well leave the field set to zero. Microsoft KnowledgeBase article Q71971 tells how to compute the EXE checksum, but also states that checksums are ignored in practice:

Note that Microsoft LINK does not correctly calculate the checksum if the linker command line includes the /CODEVIEW or /EXEPACK option switches. However, because the MS-DOS, Microsoft Windows, and OS/2 versions 1.x do not verify the checksum, this behavior does not present a problem under normal circumstances. Microsoft LINK version 5.3 and later do not compute a 16-bit or 32-bit checksum. The reserved bytes in the .EXE header are set to zero.

Various implementations of DOS that I found do not even examine checksum, let alone verify it.

Strangely, Microsoft EXEPACK.EXE 4.00 always uses a value of 0x1399 in its checksum fields; you can see an instance of this in Revision 1 of COMIC.EXE. Microsoft EXEPACK.EXE 5.00 always writes a checksum of 0x0000.

Jason Summers has done corpus analysis of the EXE checksum.

Decompression algorithm

The decompression algorithm is just as described at ModdingWiki. It runs backwards, and decompresses the buffer into itself.

You need to do the 0xb2 copy operation in reverse as shown, because the destination region may overlap the source region. I.e., you can't just memcpy in a forward direction. You could probably use memmove.

Here is the decompression algorithm, with no error or bounds checking.

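// Decompress buf in place, working backwards from the end.
// The dst and src indices initially point one byte past the end
// of their respective regions.
decompress(buf, dst, src) {
	// Skip 0xff padding at the end of the compressed data.
	while (buf[src-1] == 0xff)
		src--;
	do {
		// Reading backwards, each block is a command byte followed
		// by the high and then the low byte of a 16-bit length.
		command = buf[--src];
		length = buf[--src];
		length = (length<<8) + buf[--src];
		switch (command & 0xfe) {
		case 0xb0: // fill: write length copies of a single byte
			fill = buf[--src];
			for (i = 0; i < length; i++)
				buf[--dst] = fill;
			break;
		case 0xb2: // copy: move length bytes, last byte first
			for (i = 0; i < length; i++)
				buf[--dst] = buf[--src];
			break;
		default:
			error(); // Packed file is corrupt
		}
	} while ((command & 0x01) != 0x01); // the low bit marks the final block
}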
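For experimentation, the same algorithm can be written as runnable Python with basic bounds checks added. This is a sketch, not the implementation in exepack. It checks only that reads and writes stay inside the buffer, not that writes stay clear of compressed data that has not been read yet.

```python
def decompress(buf: bytearray, dst: int, src: int) -> None:
    """Decompress buf in place.

    dst and src initially point one byte past the end of the uncompressed
    and compressed regions, respectively.
    """
    # Skip 0xff padding at the end of the compressed data.
    while src > 0 and buf[src - 1] == 0xFF:
        src -= 1
    while True:
        if src < 3:
            raise ValueError("Packed file is corrupt")
        command = buf[src - 1]
        length = (buf[src - 2] << 8) | buf[src - 3]
        src -= 3
        operation = command & 0xFE
        if operation == 0xB0:
            # Fill: write `length` copies of one byte.
            if src < 1 or dst < length:
                raise ValueError("Packed file is corrupt")
            src -= 1
            fill = buf[src]
            dst -= length
            buf[dst:dst + length] = bytes([fill]) * length
        elif operation == 0xB2:
            # Copy: one byte at a time, last byte first, because the
            # source and destination regions may overlap.
            if src < length or dst < length:
                raise ValueError("Packed file is corrupt")
            for _ in range(length):
                src -= 1
                dst -= 1
                buf[dst] = buf[src]
        else:
            raise ValueError("Packed file is corrupt")
        if command & 0x01:
            break  # the low bit marks the final block


# Two blocks, in file order: a final fill of 16 'A's, then a copy of "XYZ",
# followed by two bytes of 0xff padding.
packed = b"\x41\x10\x00\xb1" + b"XYZ\x03\x00\xb2" + b"\xff\xff"
buf = bytearray(packed) + bytearray(19 - len(packed))
decompress(buf, 19, len(packed))
print(bytes(buf))  # b'AAAAAAAAAAAAAAAAXYZ'
```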

Bugs

Send bug reports to david@bamsoftware.com. This is also where you should send files that seem to be EXEPACK-compressed but which this program cannot handle.

Open questions

The following are not bugs exactly, but rather questions that came up during development that I had to decide one way or another. I'm not sure that what I chose is the best way. If you have an opinion or insight, let me know.

Other EXEPACK decompressors

unEXEPACK

Single source file, written in C. Supports multiple formats. This is the alternative I recommend if exepack doesn't suit your needs.

UNEXEPACK

Single source file, written in C. Only supports one EXEPACK format. Has a bug when processing certain packed relocation tables.

UNP a.k.a. unp411

DOS-based unpacker for a ton of self-extracting executable formats. Written in assembly language. No longer maintained. The way it works is cute: it recognizes the input format, sets some breakpoints, runs the executable's own unpacking code, and copies the result out of memory. As a consequence, it really only works inside a real DOS environment. (And isn't safe to run on untrusted files.) I got some of the version numbers for different decompression stubs from UNP's labeled signatures in the source file exe/eexpk.asm.

Thanks