BBS: Inland Empire Archive
Date: 09-17-92 (10:59)             Number: 239
From: RICH GELDREICH               Refer#: NONE
  To: MARK BARBA                    Recvd: NO
Subj: Data Compression               Conf: (2) Quik_Bas
MB> Hi, I would like to know if anyone out there could give me an example of
MB>how to use Data Compression. I never figured out what goes on during the
MB>data compression and could never figure out how PkZip/ARJ and LZH could
MB>compress a text file from 100k to 20k.... if anyone would have the time to
MB>explain the ways of compressing data I would appreciate it.
Aaaa... My favorite subject (I've made many data
compression programs in QuickBASIC, and I've posted some of the
simpler ones). Really, I'll come straight out and say it: the theory
behind data compression is simple, but the algorithms (usually)
aren't.
There are two main types of data compression techniques.
The first, and oldest, type is statistical modeling. Put simply,
this type of compression works by assigning very small codes to the
characters which are used most in a file, and assigning large codes
to the characters which aren't used so often. In other words, the
characters that show up most often get the smallest codes, so
compression results. There are many algorithms available which can
do this: Shannon-Fano coding, Huffman coding, and arithmetic coding.
Shannon-Fano is the oldest, and simplest. Huffman coding came after
Shannon-Fano, and is very popular. Arithmetic coding has only caught
on in the last few years, because of its heavier hardware
requirements...
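To make that concrete, here's a quick toy sketch in QuickBASIC (off
the top of my head, not lifted from any real archiver; "TEST.TXT" is
just whatever file you feel like pointing it at). All it does is the
very first step of any statistical coder: count how often each byte
value shows up. A Shannon-Fano or Huffman coder would then build its
code table from those counts, handing the short codes to the big
counts.

DIM Freq&(0 TO 255)
DIM Ch AS STRING * 1
OPEN "TEST.TXT" FOR BINARY AS #1      ' any file you want to look at
FOR I& = 1 TO LOF(1)
    GET #1, , Ch                      ' read one byte
    Freq&(ASC(Ch)) = Freq&(ASC(Ch)) + 1
NEXT
CLOSE #1
FOR C% = 0 TO 255                     ' show the skew the coder exploits
    IF Freq&(C%) > 0 THEN PRINT C%, Freq&(C%)
NEXT

Run that on a plain English text file and you'll see the space, "e",
"t" and friends towering over everything else -- exactly the lopsided
counts that Huffman and Shannon-Fano turn into short codes.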
The other main type is sliding window compression. This type of
compression was born back in 1977, in a paper released by Lempel and
Ziv (forget the first names!). Basically, it replaces groups of
characters
with some type of pointer into the previously seen text. Repeated
phrases, which occur often in files, can be coded into only a few
bits, so compression results. If the phrase isn't in the last few
thousand characters or so seen, it's simply sent normally.
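Here's another rough sketch of the sliding window idea (again just a
toy I'm typing in here, nothing like the tuned-up search a real
archiver uses; the sample string and the MINMATCH/MAXMATCH limits are
just numbers I picked). For each position it looks back through the
text already seen for the longest run matching what comes next, and
reports either a pointer (how far back, how long) or a plain literal.

CONST MINMATCH = 3, MAXMATCH = 18
Text$ = "THE CAT SAT ON THE MAT, AND THE CAT SAT ON THE HAT"
Pos = 1
DO WHILE Pos <= LEN(Text$)
    BestLen = 0: BestDist = 0
    FOR I = 1 TO Pos - 1                      ' scan the "window"
        L = 0
        DO WHILE L < MAXMATCH AND Pos + L <= LEN(Text$)
            IF MID$(Text$, I + L, 1) <> MID$(Text$, Pos + L, 1) THEN EXIT DO
            L = L + 1
        LOOP
        IF L > BestLen THEN BestLen = L: BestDist = Pos - I
    NEXT
    IF BestLen >= MINMATCH THEN               ' worth sending a pointer
        PRINT "copy"; BestLen; "chars from"; BestDist; "back"
        Pos = Pos + BestLen
    ELSE                                      ' too short, send it as-is
        PRINT "literal "; MID$(Text$, Pos, 1)
        Pos = Pos + 1
    END IF
LOOP

The real thing keeps the window to the last few thousand characters
and uses hashing so it doesn't compare against every single position,
but the (distance, length) pointers it spits out are the same idea.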
Around a year later, in 1978, Lempel & Ziv released another
paper, describing a variant of their earlier algorithm. A few years
later, in 1984, Terry Welch took this algorithm and published his
variation, which he called LZW compression. This became a very
popular type of compression in both the Unix and MS-DOS worlds. The
Unix "compress" program, ARC, GIF,
TIFF, PKZip's shrinking, and a zillion other programs use some
variation of LZW compression...
But, someone from the peanut gallery saw that LZ77 could do
better than LZW, so LZW bit the dust when a variant of LZ77
compression, called LZSS, gained popularity.
All of the popular data compression programs today use a
combination of two distinct compression algorithms. The first is a
variant of LZ77, called LZSS. LZSS is basically an optimization of
LZ77, in speed and efficiency (there's more to it, of course, but
you get the idea). After the file is coded with LZSS, the LZSS output
is then coded with a statistical modeling scheme, such as
Shannon-Fano (like PKZip) or Huffman (like ARJ and LHA) coding, to
achieve "state of the art" compression.
This was all off the top of my head, but you should
hopefully get the general idea of it.
Rich
---
* SLMR 2.0 * Nothing is so smiple that it can't get screwed up.
--- MsgToss 2.0b
* Origin: Computer Co-Op - Voorhees, NJ | Ted Hare (1:266/29)
