BBS: Inland Empire Archive
Date: 05-09-92 (01:14)              Number: 189
From: TOM HAMMOND                   Refer#: NONE
  To: LYN BORCHERT                  Recvd: NO
Subj: HASHING 4/9                   Conf: (2) Quik_Bas
Continued from Hashing 3/9

So, how about those collisions? We have two items of data that transform, or hash, into the same index. Now we have a choice: we can handle the collision data in an overflow area - usually a linear array or file - or we can re-hash (re-transform) the key to find a new spot inside the main array in which to store the data. All the commercial implementations of hashing I have used use re-hashing. If handled correctly, it is much faster than a linear search of an external array or file. If handled poorly, a re-hash is about as bad as an external array or file, so there is still little reason to add the complication of a different data structure.

A collision is indicated when we transform Hank's name, get an index of 8, and discover element 8 is already occupied by Mike. Then we re-hash and put Hank in 12, which happens to be empty. If 12 were not empty, we'd re-hash again, and repeat the process until we found an empty space or we realized that the array or file was full. Some people place a limit on the number of retries they will allow. After 20, 40, or 100 retries they declare the data base full and encourage the user to resize it. Where that threshold should be depends on your computer, how much storage space you have, and how patient you are. With a 33 MHz 486 and a fast, cached hard disk you can afford more retries than someone still using a 4 MHz Z-80 and a floppy disk drive. But the person with the floppy disk may have to accept a higher number of retries to let all the data fit. The trade-off here is between time efficiency and space utilization.

The number of retries, if a good hash and re-hash are selected, is quite small, but it does have a statistical distribution. You can have a 20% full data base with 90% of all records recovered in 1 try, 9% recovered in fewer than 3 tries, and one record that takes 50 tries to recover. It doesn't make much sense to double the size of your data base, or change your hash or re-hash, for that one record!

In many data sets we encounter "clustering" of data, where elements 8, 10, 11, 12, and 15 are full but all the rest of a 500-element array is empty. The greater the degree of clustering, the higher the likelihood of collisions. Selecting a good transformation is very important here, as is selecting a good re-hash. Hashing in general works best if several conditions are met:

1. the data should be fairly random - sequential numbers like check numbers, invoice numbers, and the like are poor choices for keys,

2. the array or file should be sized so that the number of elements is a prime number, and one that isn't too close to a multiple of 10 - so neither 11 nor 101 is suggested, and

3. the hash and re-hash are selected to work well with your data set.

The hashing algorithm can be complex or simple, but it must return a number that points into the array or file we are using. The simpler the algorithm, the faster the transformation will run, but that may cost us later if we are searching the array repeatedly due to collisions. A friendly vendor once told me that the effects of array size and hashing algorithm are counter-intuitive, and the best way to deal with the situation is computer modeling. In other words, pick a hash and re-hash and run some data through them, change parameters and try again, and then zero in on the best hash and re-hash for your data. A simple approach with numbers is to square them and normalize the result to the size of the array.
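To make that concrete, here is a minimal QuickBASIC sketch of the square-and-normalize hash with a retry limit on collisions. The 83-slot table, the 4-digit numeric keys, the 100-try cap, and the step-ahead-by-one Rehash% stand-in are all assumptions picked for illustration; none of it is the author's or any vendor's actual code, and the powers-of-2 re-hash he prefers (described below) would need an extra parameter to track which power it is on.

' Minimal sketch: 4-digit numeric keys, 83-slot table (a prime).
DECLARE FUNCTION Hash% (Key AS LONG)
DECLARE FUNCTION Rehash% (Index AS INTEGER)
DECLARE FUNCTION Store% (Key AS LONG)

CONST TBLSIZE = 83             ' prime number of elements
CONST MAXTRIES = 100           ' declare the data base full after this many tries
CONST FREESLOT = 0             ' 0 marks an unused slot (keys assumed nonzero)

DIM SHARED Table(0 TO TBLSIZE - 1) AS LONG

slot% = Store%(3245&)          ' the sample key used later in the message
IF slot% = -1 THEN PRINT "Data base full" ELSE PRINT "Stored at"; slot%

FUNCTION Hash% (Key AS LONG)
    ' Square the key and normalize it to the size of the table.
    ' (Key * Key must fit in a LONG, hence the 4-digit assumption.)
    Hash% = (Key * Key) MOD TBLSIZE
END FUNCTION

FUNCTION Rehash% (Index AS INTEGER)
    ' Stand-in re-hash: step to the next slot and wrap around.
    ' The powers-of-2 re-hash described below spreads data better.
    Rehash% = (Index + 1) MOD TBLSIZE
END FUNCTION

FUNCTION Store% (Key AS LONG)
    ' Returns the slot used, or -1 after MAXTRIES tries.
    DIM i AS INTEGER, tries AS INTEGER
    i = Hash%(Key)
    FOR tries = 1 TO MAXTRIES
        IF Table(i) = FREESLOT THEN
            Table(i) = Key
            Store% = i
            EXIT FUNCTION
        END IF
        i = Rehash%(i)         ' collision: try another slot
    NEXT tries
    Store% = -1                ' too many tries: treat the table as full
END FUNCTION

Store% reporting -1 is the point where, as described above, you would tell the user to resize the data base.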
A somewhat more complex approach that many people like is to split the key into two numbers, square them, sum the squares, and then normalize to the array size. For example, "3245" could be broken into "32" and "45"; those could be squared, summed, and normalized. Some people like mixing digits, so "3245" would break into "34" and "25". This seems to help when you have to use sequential numbers as the key. Strings can be converted into numbers based on their ASCII values and then processed as described above, although this can be a fairly lengthy process.

The first re-hash I saw just added 1 to the previous index and then normalized for the size of the array. This increased clustering and seemed just awful until the vendor explained that his system was written for use on floppy disk drives and the re-hash was selected to minimize head movement. He had probably made a good choice for his environment, but when transferred to a hard disk environment his re-hash was a disaster.

A better approach is to spread the data through the array or file as quickly as possible. I maintain a "powers of 2" table and work through it. First I add 2^0 (or 1) to the index, then 2^1 (or 2), then 2^2 (4), 2^3 (8), and so on. This quickly moves data away from a cluster and, with a prime number of elements in the array, assures us that all elements will eventually be tested. I use powers of 2 up to the one just larger than the size of the array, and then I start over with 2^0. So, if our file has 27 elements and the hash value was 3, here's the re-hash pattern (a short sketch of this pattern follows at the end of this message):

   Old key    Add    New key    Normalized
      3        1        4            4
      4        2        6            6
      6        4       10           10
     10        8       18           18
     18       16       34       34 - 27 = 7
      7       32       39       39 - 27 = 12
     12        1       13           13
    etc.

Continues in next file (HASHING - 5 of 9)

--- WM v2.00/91-0231
 * Origin: The Modem Zone BBS (314) 893-5106 (1:289/2)
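To check the arithmetic in the worked example, here is a short standalone QuickBASIC sketch that reproduces the re-hash table and also shows the split-key and ASCII-string hashes mentioned in the message. The 27-element table, the starting index of 3, and the sample key 3245 come from the text; the function names (SplitHash%, StrHash%), the digit-split arithmetic, and the loop details are assumptions for illustration, not the author's code.

' Standalone sketch: powers-of-2 re-hash plus the two hashes above.
DECLARE FUNCTION SplitHash% (Key AS LONG)
DECLARE FUNCTION StrHash% (Key AS STRING)

CONST TBLSIZE = 27             ' table size from the worked example
                               ' (the text recommends a prime in practice)

DIM oldkey AS INTEGER, newkey AS INTEGER, addval AS INTEGER
DIM i AS INTEGER

' Reproduce the re-hash table: start at index 3, add successive
' powers of 2, normalize, and restart the powers after passing 32.
oldkey = 3
addval = 1                     ' 2^0
PRINT "Old key", "Add", "New key", "Normalized"
FOR i = 1 TO 7
    newkey = oldkey + addval
    PRINT oldkey, addval, newkey, newkey MOD TBLSIZE
    oldkey = newkey MOD TBLSIZE
    ' Once we pass the power of 2 just larger than the table size,
    ' start over at 2^0; otherwise double the step.
    IF addval > TBLSIZE THEN
        addval = 1
    ELSE
        addval = addval * 2
    END IF
NEXT i

PRINT "Split hash of 3245:"; SplitHash%(3245&)
PRINT "String hash of HANK:"; StrHash%("HANK")

FUNCTION SplitHash% (Key AS LONG)
    ' Split a 4-digit key into its first two and last two digits,
    ' square each half, sum the squares, and normalize.
    DIM hi AS LONG, lo AS LONG
    hi = Key \ 100             ' 3245 -> 32
    lo = Key MOD 100           ' 3245 -> 45
    SplitHash% = (hi * hi + lo * lo) MOD TBLSIZE
END FUNCTION

FUNCTION StrHash% (Key AS STRING)
    ' Convert a short string key to a number from its ASCII values,
    ' then square and normalize as with a numeric key.
    DIM n AS LONG, j AS INTEGER
    FOR j = 1 TO LEN(Key)
        n = n + ASC(MID$(Key, j, 1))
    NEXT j
    StrHash% = (n * n) MOD TBLSIZE
END FUNCTION

Run as-is, the loop prints the same seven rows as the table above.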