DBPF/Compression

From SimsWiki
Revision as of 09:12, 28 December 2006 by Delphy (Talk | contribs)

Overview

The idea behind the compression is to reuse previously decoded strings. For example, if the word "heureka" occurs twice in a file, the second occurrence is encoded by pointing back to the first.

The compression works by defining control characters that tell three things:

  1. How many characters of plain text follow and should be appended to the output.
  2. How many characters should be read from the already decoded text (and appended to the output).
  3. At which offset in the already decoded text to start reading those characters.

Thus, the algorithm to decompress these files goes like this:

Read file size at offset 0 
Seek to offset 9 
while end of file is not reached do 
{ 
	- Read the next control character. 
	- (Depending on the control character, read 0-3 more bytes that are part of it.) 
	- Work out from the control character how many characters to read and from where. 
	- Read 0-n characters from the source and append them to the output. 
	- Copy 0-n characters from somewhere in the output to the end of the output. 
} 
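
As a rough illustration of the "read 0-3 more bytes" step, the first byte alone decides how long the control character is. The sketch below assumes the ranges listed under Control Characters; the function name dbpf_cc_length is made up for this example and is not part of any existing class.

```php
<?php
// Hypothetical helper: total length in bytes of the control character
// that starts with $byte0, following the ranges listed on this page.
function dbpf_cc_length(int $byte0): int
{
    if ($byte0 < 0x80) { return 2; }  // 0x00 - 0x7F
    if ($byte0 < 0xC0) { return 3; }  // 0x80 - 0xBF
    if ($byte0 < 0xE0) { return 4; }  // 0xC0 - 0xDF
    return 1;                         // 0xE0 and above
}

echo dbpf_cc_length(0x85), "\n"; // 3
```

The extra bytes to read after the first one are therefore dbpf_cc_length($byte0) - 1.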

Control Characters

Control characters come in several forms, each with its own limits on how many characters can be read and how far back they can be read from. The following conventions are used to describe them:

CC length
Length of control character.
Num plain text
Number of chars immediately after the control character that should be read and appended to output.
Num to copy
Number of chars that should be copied from somewhere in the already decoded output and added to the end of the output.
Copy offset
Where to start reading characters when copying from somewhere in the already decoded output.
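This is given as an offset from the current end of the output buffer, i.e. an offset of 0 means that you should copy the last character in the output and append it to the output. An offset of 1 means that you should copy the second-to-last character. Note that the Copy offset formulas for the 0x00 - 0x7F, 0x80 - 0xBF and Sims 2 0xC0 - 0xDF forms below end in + 1, so they yield the distance back from the end of the output (1 is the last character); the Example Code omits the + 1 and works with the zero-based offset described here.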
byte0
first byte of control character.
Bits
Bits of the control character.
  • p - num plain text
  • c - num to copy
  • o - copy offset
  • i - identifier.

Note: It can be confusing when a control character states that you should copy, for example, 10 characters starting 5 steps from the end of the output. Clearly, you cannot read more than 5 characters before you reach the end of the buffer. The solution is to read and write one character at a time. Each time you read a character you copy it to the end, thereby increasing the size of the output. By doing this, even offset 0 is possible and results in duplicating the last character a number of times. The compression uses this to recreate repeating text, for example bars of repeating dashes.
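
The character-at-a-time copy from the note can be sketched as follows. This is only an illustration: dbpf_copy_from_output is a hypothetical helper name, the offset is zero-based as defined under Copy offset, and the sketch assumes the offset points inside the existing output.

```php
<?php
// Hypothetical helper: append $count characters to $out, starting
// $offset positions before its last character (offset 0 = last character).
function dbpf_copy_from_output(string $out, int $offset, int $count): string
{
    $from = strlen($out) - ($offset + 1);
    for ($i = 0; $i < $count; $i++) {
        // $out grows by one character per step, so $from + $i always
        // points at a character that already exists.
        $out .= $out[$from + $i];
    }
    return $out;
}

echo dbpf_copy_from_output("abcde", 4, 10), "\n"; // abcdeabcdeabcde
echo dbpf_copy_from_output("ab-", 0, 5), "\n";    // ab------
```

The first call copies 10 characters although only 5 exist when it starts; the second repeats the last dash five times.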

The simplest form of control character is the one-byte form starting at 0xE0, described in its own section below. The only thing it does is tell how many plain text characters follow. The formula for this is: (C - 0xDF) * 4. Thus a value of 0xE0 means that you should read 4 characters of plain text and append them to the output.

0x00 - 0x7F

CC length: 2 bytes
Num plain text: byte0 & 0x03
Num to copy: ((byte0 & 0x1C) >> 2) + 3
Copy offset: ((byte0 & 0x60) << 3) + byte1 + 1
Bits: 0oocccpp oooooooo
Num plain text limit: 0-3
Num to copy limit: 3-10
Maximum Offset: 1023
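
A minimal sketch of decoding this two-byte form, assuming the bit layout given above. The function name dbpf_decode_cc2 and the array keys are invented for the example. The table's Copy offset formula ends in + 1; the sketch leaves that out so that the result is the zero-based offset defined under Copy offset, as the Example Code on this page does.

```php
<?php
// Hypothetical helper for control characters in the range 0x00 - 0x7F.
// Bit layout: 0oocccpp oooooooo
function dbpf_decode_cc2(int $byte0, int $byte1): array
{
    return [
        'plain'  => $byte0 & 0x03,
        'copy'   => (($byte0 & 0x1C) >> 2) + 3,
        // Zero-based offset (0 = last character of the output);
        // the table's formula is this value plus 1.
        'offset' => (($byte0 & 0x60) << 3) + $byte1,
    ];
}

print_r(dbpf_decode_cc2(0x25, 0x10));
// plain = 1, copy = 4, offset = 272
```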


0x80 - 0xBF

CC length: 3 bytes
Num plain text: ((byte1 & 0xC0) >> 6) & 0x03
Num to copy: (byte0 & 0x3F) + 4
Copy offset: ((byte1 & 0x3F) << 8) + byte2 + 1
Bits: 10cccccc ppoooooo oooooooo
Num plain text limit: 0-3
Num to copy limit: 4-67
Maximum Offset: 16383
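
The three-byte form can be decoded the same way. Again this is only a sketch under the bit layout given above: dbpf_decode_cc3 and its array keys are hypothetical names, and the offset is returned zero-based (the table's formula without its final + 1), matching the Example Code.

```php
<?php
// Hypothetical helper for control characters in the range 0x80 - 0xBF.
// Bit layout: 10cccccc ppoooooo oooooooo
function dbpf_decode_cc3(int $byte0, int $byte1, int $byte2): array
{
    return [
        'plain'  => ($byte1 & 0xC0) >> 6,
        'copy'   => ($byte0 & 0x3F) + 4,
        // Zero-based offset; the table's formula is this value plus 1.
        'offset' => (($byte1 & 0x3F) << 8) + $byte2,
    ];
}

print_r(dbpf_decode_cc3(0x85, 0x41, 0x02));
// plain = 1, copy = 9, offset = 258
```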


0xC0 - 0xDF

This format differs depending on the game.

Sims 2

CC length: 4 bytes
Num plain text: byte0 & 0x03
Num to copy: ((byte0 & 0x0C) << 6) + byte3 + 5
Copy offset: ((byte0 & 0x10) << 12) + (byte1 << 8) + byte2 + 1
Bits: 110occpp oooooooo oooooooo cccccccc
Num plain text limit: 0-3
Num to copy limit: 5-1028
Maximum Offset: 131071
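
A sketch of decoding the Sims 2 four-byte form, assuming the bit layout given above. The name dbpf_decode_cc4_sims2 and the array keys are made up for illustration, and the offset is returned zero-based (the table's formula without its final + 1), as in the Example Code.

```php
<?php
// Hypothetical helper for Sims 2 control characters in the range 0xC0 - 0xDF.
// Bit layout: 110occpp oooooooo oooooooo cccccccc
function dbpf_decode_cc4_sims2(int $byte0, int $byte1, int $byte2, int $byte3): array
{
    return [
        'plain'  => $byte0 & 0x03,
        'copy'   => (($byte0 & 0x0C) << 6) + $byte3 + 5,
        // Zero-based offset; the table's formula is this value plus 1.
        'offset' => (($byte0 & 0x10) << 12) + ($byte1 << 8) + $byte2,
    ];
}

print_r(dbpf_decode_cc4_sims2(0xD5, 0x01, 0x02, 0x0A));
// plain = 1, copy = 271, offset = 65794
```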

SimCity 4

CC length: 4 bytes
Num plain text: byte0 & 0x03
Num to copy: ((byte0 & 0x1C) << 6) + byte3 + 5
Copy offset: (byte1 < < 8) + byte2
Bits: 110cccpp oooooooo oooooooo cccccccc
Num plain text limit: 0-3
Num to copy limit: 5-2052
Maximum Offset: 65535

0xE0 - 0xFB

CC length: 1 byte
Num plain text: ((byte0 & 0x1F) << 2) + 4
Num to copy: 0
Copy offset: -
Bits: 111ppppp
Num plain text limit: 4-112
Num to copy limit: 0
Maximum Offset: -

0xFC - 0xFF

CC length: 1 byte
Num plain text: byte0 & 0x03
Num to copy: 0
Copy offset: -
Bits: 111111pp
Num plain text limit: 0-3
Num to copy limit: 0
Maximum Offset: -
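
The two one-byte forms only carry a plain text count. The sketch below follows the branch order of the Example Code on this page, which treats 0xFC and above as the short form and computes the other form as (byte0 - 0xDF) << 2. The function name dbpf_decode_cc1 is hypothetical.

```php
<?php
// Hypothetical helper: number of plain text characters announced by a
// one-byte control character (0xE0 and above).
function dbpf_decode_cc1(int $byte0): int
{
    if ($byte0 >= 0xFC) {
        return $byte0 & 0x03;     // 0-3 plain characters
    }
    return ($byte0 - 0xDF) << 2;  // 4, 8, 12, ... plain characters
}

echo dbpf_decode_cc1(0xE0), "\n"; // 4
echo dbpf_decode_cc1(0xFD), "\n"; // 1
```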

Example Code

This is written in PHP, converted from Perl code by dmchess [1]

// Read a 4 byte unsigned number from the file
// read_UL4 is a PHP function in my DBPF class that grabs the next 4 bytes and uses unpack to convert them to an integer
$len = $this->read_UL4($handle);
// Read the next 5 bytes (they are useless afaik)
$garbagedata = fread($handle, 5);
// Decompress the chunk
// We do $len - 9 here because we are ignoring the first 9 bytes of the chunk (4 for the length value itself, 5 for other data)
// See later for a description of $this->decompress
$data = $this->decompress($handle, $len - 9);
// ** Internally used I/O functions
// Reads a 4 byte unsigned integer
/*
       Used internally by the class to read a C/C++
       "unsigned long" (a 4 byte unsigned integer)
       from an open file
       $fh - the file handle from which to read
       returns - returns the value read; has no error return
*/
function read_UL4($fh)
{
       $d = fread($fh, 4);
       $a = unpack("Vn", $d);
       return $a["n"];
}
// Reads a 2 byte unsigned integer
/*
        Used internally by the class to read a C/C++
       "unsigned short" (a 2 byte unsigned integer)
       from an open file
       $fh - the file handle from which to read
       returns - returns the value read; has no error return
*/
function read_UL2($fh)
{
       $d = fread($fh, 2);
       $a = unpack("vn", $d);
       return $a["n"];
}
// Reads a 1 byte unsigned integer
/*
       Used internally by the class to read a C/C++
       "unsigned char" (a 1 byte unsigned integer)
       from an open file
       $fh - the file handle from which to read
       returns - returns the value read; has no error return
*/
function read_UL1($fh)
{
       $d = fread($fh, 1);
       $a = unpack("Cn", $d);
       return $a["n"];
}
       // Decompresses string
       /*
               PHP DBPF decompression by Delphy
               Thanks to dmchess (http://hullabaloo.simshost.com/forum/viewtopic.php?t=6578&postdays=0&postorder=asc)
               for the Perl code that I used for this
               $handle - file handle for reading
               $len - length of compressed string
       */
       function decompress($handle, $len) {
               $buf = "";
               $answer = "";
               $answerlen = 0;
               $numplain = 0;
               $numcopy = 0;
               $offset = 0;

Main loop:

               while ($len > 0) {
                       $cc = $this->read_UL1($handle);
                       $len -= 1;
               //      printf("      Control char is %02x, len remaining is %08x. \n",$cc,$len);
                       if ($cc >= 252): // 0xFC
                               $numplain = $cc & 0x03;
                               if ($numplain > $len) { $numplain = $len; }
                               $numcopy = 0;
                               $offset = 0;
                       elseif ($cc >= 224): // 0xE0
                               $numplain = ($cc - 0xdf) << 2;
                               $numcopy = 0;
                               $offset = 0;
                       elseif ($cc >= 192): // 0xC0
                               $len -= 3;
                               $byte1 = $this->read_UL1($handle);
                               $byte2 = $this->read_UL1($handle);
                               $byte3 = $this->read_UL1($handle);
                               $numplain = $cc & 0x03;
                               $numcopy = (($cc & 0x0c) <<6) + 5 + $byte3;
                               $offset = (($cc & 0x10) << 12 ) + ($byte1 << 8) + $byte2;
                       elseif ($cc >= 128): // 0x80
                               $len -= 2;
                               $byte1 = $this->read_UL1($handle);
                               $byte2 = $this->read_UL1($handle);
                               $numplain = ($byte1 & 0xc0) >> 6;
                               $numcopy = ($cc & 0x3f) + 4;
                               $offset = (($byte1 & 0x3f) << 8) + $byte2;
                       else:
                               $len -= 1;
                               $byte1 = $this->read_UL1($handle);
                               $numplain = ($cc & 0x03);
                               $numcopy = (($cc & 0x1c) >> 2) + 3;
                               $offset = (($cc & 0x60) << 3) + $byte1;
                       endif;
                       $len -= $numplain;

This section appends the plain text to the buffer, then copies the referenced characters from earlier in the buffer to its end:

                       if ($numplain > 0) {
                               $buf = fread($handle, $numplain);
                               $answer = $answer.$buf;
                       }
                       $fromoffset = strlen($answer) - ($offset + 1);  # 0 == last char
                       for ($i=0;$i<$numcopy;$i++) {
                               $answer = $answer.substr($answer,$fromoffset+$i,1);
                       }
                       $answerlen += $numplain;
                       $answerlen += $numcopy;
               }

Return the decompressed string:

               return $answer;
       }
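
To try the routine on a small hand-built stream without a file handle, an in-memory variant can be handy. The following is a sketch only, not part of the original class: dbpf_decompress_string is a hypothetical name, it uses the Sims 2 layout for 0xC0 - 0xDF, it expects the data after the 9 header bytes, and it assumes well-formed input (no bounds checks).

```php
<?php
// Hypothetical in-memory variant of decompress(); assumes well-formed input.
// $data is the compressed body, i.e. everything after the 9 header bytes.
function dbpf_decompress_string(string $data): string
{
    $out = '';
    $pos = 0;
    $len = strlen($data);
    while ($pos < $len) {
        $cc = ord($data[$pos++]);
        $numcopy = 0;
        $offset = 0;
        if ($cc >= 0xFC) {            // 1 byte, 0-3 plain characters
            $numplain = $cc & 0x03;
        } elseif ($cc >= 0xE0) {      // 1 byte, plain characters only
            $numplain = ($cc - 0xDF) << 2;
        } elseif ($cc >= 0xC0) {      // 4 bytes, Sims 2 layout
            $byte1 = ord($data[$pos++]);
            $byte2 = ord($data[$pos++]);
            $byte3 = ord($data[$pos++]);
            $numplain = $cc & 0x03;
            $numcopy = (($cc & 0x0C) << 6) + 5 + $byte3;
            $offset = (($cc & 0x10) << 12) + ($byte1 << 8) + $byte2;
        } elseif ($cc >= 0x80) {      // 3 bytes
            $byte1 = ord($data[$pos++]);
            $byte2 = ord($data[$pos++]);
            $numplain = ($byte1 & 0xC0) >> 6;
            $numcopy = ($cc & 0x3F) + 4;
            $offset = (($byte1 & 0x3F) << 8) + $byte2;
        } else {                      // 2 bytes
            $byte1 = ord($data[$pos++]);
            $numplain = $cc & 0x03;
            $numcopy = (($cc & 0x1C) >> 2) + 3;
            $offset = (($cc & 0x60) << 3) + $byte1;
        }
        // Plain text follows the control character directly.
        $out .= substr($data, $pos, $numplain);
        $pos += $numplain;
        // Copy one character at a time so overlapping ranges work.
        $from = strlen($out) - ($offset + 1);
        for ($i = 0; $i < $numcopy; $i++) {
            $out .= $out[$from + $i];
        }
    }
    return $out;
}

// 0xE0: 4 plain ("abcd"); 0x04 0x03: copy 4 from offset 3;
// 0x0D 0x00: 1 plain ("-"), copy 6 from offset 0; 0xFD: 1 plain ("!").
echo dbpf_decompress_string("\xE0abcd\x04\x03\x0D\x00-\xFD!"), "\n";
// abcdabcd-------!
```

The third control character shows the repeated-dash case: one dash is read as plain text and then copied six times from offset 0.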
