Echo BASIC - HASHING 3/9

 BBS: Inland Empire Archive
Date: 05-09-92 (01:14)             Number: 188
From: TOM HAMMOND                  Refer#: NONE
  To: LYN BORCHERT                  Recvd: NO  
Subj: HASHING     3/9                Conf: (2) Quik_Bas

Continued from Hashing 2/9

Balanced?  Perfect?  What's that all about?  What if the data is not
added in an order that leads to the tree above?  What if the data is
added alphabetically, except for Hank?  The resulting tree looks like
this...

Aaron
/  \
   Bill
    / \
     George
      /  \
       Leonard
        /   \
           Margaret
           /   \
               MIKE
               /  \
                  Sam
                  /  \
                    Willard
                     /    \

This is a worst case tree.  The search time is proportional to (Number
of entries/2) * processing overhead, like a linear search, except that
the processing overhead is a good bit higher.  When we add "Hank", he
would be entered into the "Less than Leonard" slot.

A tree is balanced when you look at it and it looks balanced.  Loosely
speaking, it is "perfect" when the tree extends down no more levels than
those needed to represent the tree.  In our test data base case, with 8
test entries, we see that we need 4 levels to represent the tree.  The
number that represents this is, I think, INT(LOG(BASE 2)N)+1.  The log,
base 2, of 8, should be 3. Adding 1 to this gives 4.

Please note that in both the trees shown here, and in all trees, there
is only a single spot where a new entry should be placed.

One sticky spot in implementing a tree is deleting elements.  It is
necessary to adjust the tree so that all the elements are correctly
placed after the deletion.  There are some simple algorithyms for this,
but the book I like for this is at work.  From memory though....  If the
item you want to delete is at the end of the tree (has no children),
there is no problem, just delete it.  Examples of this in the first tree
I drew were Aaron, George, Margaret, and Sam.  In the second tree, only
Willard qualifies.

If the item you want to delete has only 1 child, there is no problem,
just adjust the pointer of the parent to the item you wish to delete to
point to the only child of the item you wish to delete.  In the case of
the first tree, only Willard qualifies.  And to delete Willard, you
would need to make Mike's greater than pointer point to Sam.  In the
second tree I drew, ANY item can be deleted this way.

The last case is where the item to be delete has more than 1 child.
There is a simple algorithym that allows this to be done with something
like 2 changes.  But like I said, the book's at work.  All of the
deletion code can become more difficult if you are trying to maintain a
balanced tree.

The methods for searching the tree to produce an in-order, or sorted,
list are pretty straight forward.  In psudeo-code....

Look for the lowest entry
Repeat
    display the current entry
    search for the next higher entry
until all nodes have been visited

Hopefully someone who has used binary trees more than I have will be
kind enough to post some real source code....

And on to hashing....Hashing has as it's fundamental principle that the
DATA should be transformed into an index into the array.  So, I take a
name, transform it, come up with a number, and if all goes well, that
number is the index of the array element where the name and phone number
are.  The placement of data in the array seems as random as the input
data.

Most people immediately think that they need as many array elements as
there are possible keys to make this work.  With social security
numbers, that suggests that 999,999,999 (or so) entries are needed.
This probably wouldn't fit into your computer's RAM this year, so
hashing suddenly seems like a pipe dream.

However, it is understood that the transform need not produce a unique
address, and that data "collisions" will occur where several (or many)
keys will result in the same index.  This is taken care of by one of
several methods we'll discuss later.  You can size the array in a
fashion that makes sense to you.

In practice, with a well chosen transformation, the search time is
proportional to the "fullness" of the array.  That is, we see that we
usually find or data in 1 or 2 searches, regardless of how much data we
have to search, until the data base passes a critical threshold (usually
around 75 to 80% full), at which point things fall apart rapidly.  This
approach is very attractive when using external storage, as it minimizes
disk delays.  So you can start with a data base that contains the names
and phone numbers of your 20 closest friends and then expand to a data
base that contains the names and phone numbers of everyone on Earth, and
in theory the search time will be the same - based solely on how full
the DATA base is.  (Of course, you might have to resize your data base
from time to time, and you might need a larger hard disk to allow the
data to fit.)

Continues in next file (HASHING - 4 of 9)


--- WM v2.00/91-0231
 * Origin: The Modem Zone BBS (314) 893-5106 (1:289/2)