SCZ - Format, and How-It-Works

https://sourceforge.net/projects/scz-compress/                       November 25, 2008
The first figure below shows how the files in the SCZ package are related.

Either library (or both) can be included in (or linked to) applications.

Files:

The next figure below shows the basic structure of a compressed block or segment. For small or medium-sized files, only one segment may be needed. The magic numbers help check that the file is the right type, and that indexing and contents are as expected.

SCZ works on an iterative basis. After the first two magic numbers (101, 98), the iteration count for the block is stored. This says how many times the block was compressed, and therefore how many times decompression must be re-applied. Next come 24 bits (3 bytes) holding the size of the block data, which avoids reserving a special end-of-block character. Next is the forcing character. It must be inserted ahead of any special symbol to indicate that the next character is to be taken literally, not interpreted (like an escape character). Next is the symbol table, starting with the table's size (entry count). Each entry consists of a marker character followed by the pair (or phrase) that the marker replaces. The table is followed by a magic barrier character (91) to assure that the compressed data segment begins at the right point. The compressed data follows, and after it come the segment's original checksum and a continuation or end marker.
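The header fields described above can be read back with a few lines of C. This is only a sketch under stated assumptions, not the code in scz_decompress_lib.c: the names demo_segment_header and demo_parse_header are hypothetical, the 24-bit size is assumed to be stored most-significant byte first, the table count is assumed to fit in one byte, and each table entry is assumed to be 3 bytes (marker plus the two-byte pair).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one segment header; not an SCZ library type. */
typedef struct {
    uint8_t        iterations;   /* times the block was compressed */
    uint32_t       data_size;    /* 24-bit size of the block data */
    uint8_t        force_char;   /* forcing (escape) character */
    uint8_t        table_count;  /* number of symbol-table entries */
    const uint8_t *table;        /* entries, ASSUMED 3 bytes each */
    const uint8_t *data;         /* first byte after the barrier */
} demo_segment_header;

/* Returns 0 on success, or a negative code naming the check that failed. */
int demo_parse_header(const uint8_t *buf, size_t len, demo_segment_header *out)
{
    size_t table_end;

    assert(buf != NULL && out != NULL);
    if (len < 8)
        return -1;                          /* too short for the fixed fields */
    if (buf[0] != 101 || buf[1] != 98)
        return -2;                          /* wrong magic numbers */

    out->iterations  = buf[2];
    /* ASSUMPTION: size stored most-significant byte first. */
    out->data_size   = ((uint32_t)buf[3] << 16) | ((uint32_t)buf[4] << 8)
                     | (uint32_t)buf[5];
    out->force_char  = buf[6];
    out->table_count = buf[7];
    out->table       = buf + 8;

    table_end = 8 + 3 * (size_t)out->table_count;
    if (len < table_end + 1)
        return -1;                          /* table or barrier truncated */
    if (buf[table_end] != 91)
        return -3;                          /* barrier missing: table misread */
    out->data = buf + table_end + 1;
    return 0;
}
```

The barrier check is what makes a wrong table count detectable: if the count is off, the byte where 91 is expected will usually be something else, and the parser stops before touching the data.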

The next figure shows how the basic segment structure can be chained. This permits either large files to be processed in sections, or streaming for dynamic processes.
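Processing a large buffer in sections amounts to walking it in pieces of bounded size and handing each piece to the compressor. The sketch below shows only that walk. The names demo_block_fn and demo_for_each_block are hypothetical and not part of the SCZ library, and the real routines would also write the segment header, checksum, and continuation or end marker for each piece.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: receives one block and whether it is the last one. */
typedef void (*demo_block_fn)(const unsigned char *block, size_t len,
                              int is_last, void *ctx);

/* Walk buf in pieces of at most block_len bytes; returns the block count. */
size_t demo_for_each_block(const unsigned char *buf, size_t total,
                           size_t block_len, demo_block_fn fn, void *ctx)
{
    size_t offset = 0, count = 0;

    assert(block_len > 0);              /* a zero block size would never advance */
    while (offset < total) {
        size_t n = total - offset;      /* bytes still unprocessed */
        if (n > block_len)
            n = block_len;              /* cap at one block */
        if (fn != NULL)
            fn(buf + offset, n, offset + n == total, ctx);
        offset += n;
        ++count;
    }
    return count;
}
```

With a 10-byte buffer and a block_len of 4, the callback sees blocks of 4, 4, and 2 bytes, with is_last set only on the third. Memory use depends on block_len alone, not on the total size.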

Notes:

  1. In the initial version of SCZ, before the blocking format and checksums were added, the second magic number was 99. This could serve as a way to switch decoding to the old format for any files so compressed. The format should be stable now, and probably few files were compressed with the early format, so no backward compatibility is built into the tools at present, though it could be added this way.

  2. Both scz_compress_lib.c and scz_decompress_lib.c can be included in other programs without conflicting with each other. All public (global) variables and routine names are prefixed with SCZ_, to minimize conflicts with user code. These library files should coexist peacefully with almost any other program.

  3. Blocking added. Originally, SCZ would compress a whole file or buffer in one chunk, no matter how large it was. This could be inefficient for extremely large files, and it limited scalability: the larger the file to be compressed, the more RAM was needed. Also, SCZ had to fit one lookup table to the whole file, even though the redundancy statistics often change throughout a large file. This limited compression quality. As well, for streaming applications, it meant no output could be processed until all the data had come in.

    Therefore, an integral 'blocking' capability was added to the SCZ format. It enables SCZ routines to process large files or buffers in smaller segments. This limits the dynamic memory required to a constant, regardless of the file size. It also provides more efficient compression, because replacement symbols can better match the redundancy statistics of smaller portions of large files. And it enables streaming operations - the ability to compress/decompress a little bit at a time, continuously, if need be. Also, if some bytes in a compressed file are corrupted, blocking may allow recovery of the undamaged blocks.

    The 'blocking-size' is arbitrary. It defaults to 4 MB, but can be set smaller or larger by changing the value of the "sczbuflen" variable. Reducing block size yields greater compression, but consumes more time.

  4. An internal checksum was added to the SCZ format. It is an 8-bit checksum: the checksum of each input segment is calculated and stored as one byte in the compressed data. Upon decompression, a new checksum is computed on the decompressed data and compared to the stored original. Any mismatch indicates data corruption. A match provides more than 99.6% confidence (1 - 1/256) that the data is exact and that nothing was lost in the compress/storage/decompress process. (The actual confidence is probably higher, because the positions, values, and buffer length must also agree with several other stored words, and the magic numbers and boundaries are checked, which would catch many potential perturbations that the checksum alone might miss. More checksum bytes would provide even greater confidence, but with diminishing returns: they add overhead while only reducing the remaining cases, which are already fewer than 0.4%. At least this scheme is better than nothing at all.)

  5. You are welcome to make variants of the user-access routines, such as Scz_Compress_File, Scz_Compress_Buffer2File, Scz_Compress_Buffer2Buffer, or their 'decompress' cousins, as well as the stand-alone comp/decomp applications. Consider these as examples of how to call SCZ's core compress/decompress routines.

  6. A tests-package (SCZ-Tests) was added, for testing SCZ compress/decompress routines. It contains a generic test-data generator for testing, benchmarking, or comparing compression methods. It can generate random binary data files with arbitrary sizes and with arbitrary amounts of compressibility.

    By testing SCZ routines with thousands of different files of various sizes, we can gain confidence in SCZ's correctness and efficiency. The regression tests can be quickly re-run whenever any improvements to SCZ are proposed, to verify that it continues to work properly. The tests also provide additional examples for how to call the compression routines.
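
The store-and-compare cycle of the 8-bit checksum in note 4 can be sketched as follows. The notes do not say how SCZ computes its checksum byte, so this sketch assumes a plain byte sum modulo 256, and the names demo_checksum8 and demo_verify are hypothetical. Whatever the formula, a single byte has 256 possible values, so randomly corrupted data has about a 1-in-256 (roughly 0.4%) chance of matching by accident, which is where the 99.6% figure comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: byte sum modulo 256. The real SCZ checksum may differ. */
uint8_t demo_checksum8(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;
    size_t i;

    assert(data != NULL || len == 0);
    for (i = 0; i < len; ++i)
        sum = (uint8_t)(sum + data[i]);   /* wraps around at 256 */
    return sum;
}

/* Returns 1 when the decompressed data matches the stored checksum byte. */
int demo_verify(const uint8_t *decompressed, size_t len, uint8_t stored)
{
    return demo_checksum8(decompressed, len) == stored;
}
```

The compressor would call demo_checksum8 on each input segment and store the result; the decompressor would call demo_verify on its output and treat a 0 result as data corruption.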


 
