SCZ - Format, and How-It-Works
https://sourceforge.net/projects/scz-compress/
November 25, 2008
The first figure below shows how the files in the SCZ package are related.
Either library (or both) can be included in (or linked to) applications.
Files:
- scz.h - Prototypes of user-callable SCZ routines. Include in your application when linking to pre-compiled SCZ libraries.
- scz_core.c - Common internal routines and data variables used by SCZ.
- scz_compress_lib.c - Base compression functions.
These routines are to be included and called from other programs.
This file contains the following user-callable convenience routines:
Scz_Compress_File( infilename, outfilename )
Scz_Compress_Buffer2File( buffer, N, outfilename )
Scz_Compress_Buffer2Buffer( inbuffer, N, outbuffer, M, lastbuf_flag )
(See comments above each routine for usage instructions and formal defs.)
- scz_decompress_lib.c - Base decompression functions. These functions reverse
what the routines in scz_compress_lib.c do. They are to be included and called
from other programs.
This file contains the following user-callable convenience routines:
Scz_Decompress_File( *infilename, *outfilename );
Scz_Decompress_File2Buffer( *infilename, **outbuffer, *M );
Scz_Decompress_Buffer2Buffer( *inbuffer, *N, **outbuffer, *M );
(See comments above each routine for usage instructions and formal defs.)
- scz_compress.c - Stand-alone file compression utility, uses scz_compress_lib.c.
- scz_decompress.c - Stand-alone file decompression utility, uses scz_decompress_lib.c.
The next figure below shows the basic structure of a compressed block or segment.
For small or medium sized files, there may be only one segment needed.
The magic numbers help check that the file is the right type, and that
indexing and contents are as expected/needed.

SCZ works on an iterative basis. After the first two magic numbers (101, 98),
the iteration count for the block is stored. This says how many times the block
was compressed, and therefore how many times decompression must be re-applied.
Next are 24 bits (3 bytes) holding the size of the block data. Storing the size
avoids reserving a special end-of-block character.
Next is the forcing character. It must be inserted ahead of any special symbol to
indicate that the next character is to be taken literally, not interpreted (like an
escape character). Next is the symbol table, starting with the table's size (count).
Each entry consists of a marker character that is used to replace the pair (or phrase)
that follows it. The table is followed by a magic barrier character (91) to assure
that the compressed data segment begins at the right point. The compressed data is
then followed by the segment's original checksum and a continuation or end marker.
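The header layout described above can be sketched as a small parser. This is only an illustration under stated assumptions, not the code in scz_decompress_lib.c: the description does not give the byte order of the 24-bit size (most significant byte first is assumed), the width of the table count (one byte is assumed), or the exact entry layout (one marker byte plus a two-byte pair is assumed). The names SegHeader and parse_segment_header are hypothetical, and the sketch reads only a single symbol table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of one segment header; field names are illustrative. */
typedef struct {
    unsigned iterations;       /* how many compression passes were applied */
    unsigned long data_size;   /* 24-bit size of the block data */
    unsigned char force_char;  /* forcing (escape) character */
    unsigned table_count;      /* number of symbol-table entries */
    size_t data_offset;        /* index of the first byte after the barrier */
} SegHeader;

/* Returns 0 on success, or a negative code naming the check that failed. */
static int parse_segment_header(const unsigned char *buf, size_t len, SegHeader *h)
{
    size_t pos, table_bytes;

    assert(buf != NULL && h != NULL);
    if (len < 8) return -1;                        /* too short for the fixed fields */
    if (buf[0] != 101 || buf[1] != 98) return -2;  /* the two magic numbers */

    h->iterations = buf[2];
    /* ASSUMPTION: the 3 size bytes are stored most significant byte first. */
    h->data_size = ((unsigned long)buf[3] << 16) |
                   ((unsigned long)buf[4] << 8) |
                    (unsigned long)buf[5];
    h->force_char = buf[6];
    h->table_count = buf[7];                       /* ASSUMPTION: one-byte count */
    pos = 8;

    /* ASSUMPTION: each entry is a marker byte followed by a two-byte pair. */
    table_bytes = (size_t)h->table_count * 3;
    if (len - pos < table_bytes + 1) return -3;    /* table or barrier truncated */
    pos += table_bytes;

    if (buf[pos] != 91) return -4;                 /* magic barrier character */
    h->data_offset = pos + 1;
    return 0;
}
```

Under these assumptions, a buffer beginning 101, 98, 1, 0, 0, 5, followed by a forcing character, a count of 1, one three-byte entry, and the barrier 91, parses with data_size 5 and data_offset 12. Returning a distinct code per failed check mirrors the role the magic numbers play in the text: each one confirms that indexing is still where it is expected to be.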
The next figure shows how the basic segment structure can be chained.
This permits either large files to be processed in sections, or streaming
for dynamic processes.
Notes:
- In the initial version of SCZ, before the blocking format and checksums were added,
the second magic number was 99. This could serve as a way to switch decoding
to the old format for any files so compressed. The format should be stable now,
and few files were probably compressed with the early format, so no backward
compatibility is built into the tools now, but it could be added this way.
- Both scz_compress_lib.c and scz_decompress_lib.c can be included in other
programs without conflicting with each other. All public (global) variables and
routine names are prefixed with SCZ_, to minimize conflicts with user code.
These library files should coexist peacefully with almost any other program.
- Blocking added. Originally, SCZ would compress a whole file or buffer
in one chunk, no matter how large it was. This could be inefficient for
extremely large files, and limited scalability: the larger the file to be
compressed, the more RAM was needed. Also, SCZ had to fit one
lookup table to the whole file, even though the redundancy statistics often
change throughout a large file. This limited compression quality.
As well, for streaming applications, it meant no output could be processed
until all the data had come in.
Therefore, an integral 'blocking' capability was added to the SCZ format.
It enables SCZ routines to process large files or buffers in smaller segments.
This limits the dynamic memory required to a constant, regardless of the
file size. It also provides more efficient compression, because replacement
symbols can better match the redundancy statistics of smaller portions of
large files. And it enables streaming operations - the ability to
compress/decompress a little bit at a time, continuously, if need be.
Also, if bytes of data in a compressed file are corrupted, blocking may allow
partial recovery of the good blocks.
The 'blocking-size' is arbitrary. It defaults to 4 MB, but can be set
smaller or larger by changing the value of the "sczbuflen" variable.
Reducing block size yields greater compression, but consumes more time.
- An internal 8-bit checksum was added to the SCZ format.
An original checksum is calculated on each input segment and stored as
one byte in the compressed data. Upon decompression, a new checksum
is computed on the decompressed data, and compared to the original
that was stored. Any mismatch indicates data corruption. A match provides
more than 99.6% confidence that the data is exact and that nothing was lost
in the compress/storage/decompress process, since a random error slips past
an 8-bit checksum only about 1 time in 256. (Actually the confidence
is probably higher than that, because the positions, values, and buffer
length must also be consistent with several other stored values, and there is
checking against magic numbers and boundaries, which would catch many potential
perturbations that might be missed by the checksum. More checksum bytes would
provide even greater confidence, but with diminishing returns: they add overhead
while only reducing the remaining cases, which are already fewer than 0.4%.
At least this scheme is better than nothing at all.)
- You are welcome to make variants of the user-access routines, such as
Scz_Compress_File, Scz_Compress_Buffer2File, Scz_Compress_Buffer2Buffer,
or their 'decompress' cousins, as well as the stand-alone comp/decomp applications.
Consider these as examples of how to call SCZ's core compress/decompress routines.
- A tests-package was added, for testing SCZ compress/decompress routines.
It contains a generic test-data generator for testing, benchmarking, or
comparing compression methods. It can generate random binary data files with
arbitrary sizes and with arbitrary amounts of compressibility.
By testing SCZ routines with thousands of different files of various
sizes, we can gain confidence in SCZ's correctness and efficiency.
The regression tests can be quickly re-run whenever any improvements
to SCZ are proposed, to verify that it continues to work properly.
The tests also provide additional examples of how to call the compression
routines.
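The blocking and checksum notes above can be sketched together: split a buffer into fixed-size blocks and compute one 8-bit checksum per block. This is a minimal sketch under assumptions, not SCZ's own code. The checksum here is a plain additive sum modulo 256, which may differ from what SCZ actually computes, and the names checksum8, segment_count, and checksum_blocks are hypothetical. The block_len parameter plays the role that the "sczbuflen" variable plays in the notes.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: additive checksum modulo 256; the real SCZ routine may differ. */
static unsigned char checksum8(const unsigned char *data, size_t n)
{
    unsigned sum = 0;
    size_t i;
    for (i = 0; i < n; i++)
        sum = (sum + data[i]) & 0xFFu;
    return (unsigned char)sum;
}

/* Number of segments needed to hold `total` bytes at `block_len` per segment. */
static size_t segment_count(size_t total, size_t block_len)
{
    assert(block_len > 0);
    return total / block_len + (total % block_len != 0 ? 1 : 0);
}

/* Walks the buffer one block at a time, so the working set stays bounded by
   block_len regardless of the total size. Stores one checksum per block in
   sums[] and returns the number of blocks processed (at most max_sums). */
static size_t checksum_blocks(const unsigned char *buf, size_t total,
                              size_t block_len,
                              unsigned char *sums, size_t max_sums)
{
    size_t off = 0, k = 0;
    assert(block_len > 0);
    while (off < total && k < max_sums) {
        size_t remaining = total - off;
        size_t n = remaining < block_len ? remaining : block_len;
        sums[k++] = checksum8(buf + off, n);   /* last block may be shorter */
        off += n;
    }
    return k;
}
```

With a one-byte checksum, a randomly corrupted block has about a 1 in 256 chance (roughly 0.39%) of producing the same value, which is where the 99.6% figure comes from. Because each block carries its own checksum, a mismatch localizes the damage to one block rather than invalidating the whole file.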