HASHING 2/9

 BBS: Inland Empire Archive
Date: 05-09-92 (01:12)             Number: 187
From: TOM HAMMOND                  Refer#: NONE
  To: LYN BORCHERT                  Recvd: NO  
Subj: HASHING     2/9                Conf: (2) Quik_Bas
Continued from HASHING 1/9

In short, we start searching at the beginning of the data base, checking
each entry in order until we either find the data we want or reach the
end of the data.  We are searching the data in order, or linearly.  So
the search time, when we have a match, is on average proportional to
(LastEntry/2) * processing overhead.  If the data is not on file, we
examine every entry, so the time is proportional to LastEntry *
processing overhead.  The processing overhead itself varies with the
kind of computer we have, the language we write the search in (QB OF
COURSE!!), and so on.  There are some fairly simple ways to reduce the
processing overhead, which will decrease the search time.
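
For reference, here is roughly what the plain bounds-checked search
looks like (a reconstruction from the description above, not the actual
listing from part 1; a$(), LastEntry%, and TestName$ are as used
elsewhere in this series):

TestPoint% = 0
DO WHILE TestPoint% <= LastEntry%
   IF a$(TestPoint%, 0) = TestName$ THEN EXIT DO  'found a match
   TestPoint% = TestPoint% + 1
LOOP                                   'two tests on every pass

IF TestPoint% <= LastEntry% THEN
   PRINT a$(TestPoint%, 0); "'s Phone Number is "; a$(TestPoint%, 1)
ELSE
   PRINT TestName$; " Not on file."
END IF

Note the loop makes two comparisons per element, one against the data
and one against the end of the array.  That second test is the overhead
the trick below gets rid of.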

In most cases, the changes below will cut search time by about 1/3.
But we are still limited by the fact that we are examining, on average,
half the data base when we find the data we want, and all of it when we
don't.  To reduce the overhead, we make the following changes.  Instead
of tracking the last entry, we track the next available entry.  And we
stuff the test value into the array as a sentinel in the next free
slot.  We will ALWAYS find the test value (except maybe with an old
version of MicroSoft BASIC for the Commodore PET or OSI Challenger),
so we need not check for the end of the array.  Here goes....

a$(NextEntry%, 0) = TestName$       'plant the test value as a sentinel
TestPoint% = 0

DO WHILE a$(TestPoint%, 0) <> a$(NextEntry%, 0)
   TestPoint% = TestPoint% + 1      'no end-of-array test needed
LOOP

IF TestPoint% <> NextEntry% THEN    'we stopped before the sentinel
   PRINT a$(TestPoint%, 0); "'s Phone Number is "; a$(TestPoint%, 1)
ELSE                                'only the sentinel matched
   PRINT TestName$; " Not on file."
   'etc
END IF

As you can see, since we will always find our test value (it's sitting
at the end of the array), we needn't concern ourselves with boundary
checking, and we needn't worry about reading the 501st element of the
500 element array.  (Since we are using element 0, I suppose that's
really the 502nd element of a 501 element array.)

Variations on the theme involve keeping the array sorted and only
searching until an element >= the desired value is found.  This does
cut search time, but bounds checking is still required, and unless the
data base is very stable (one where you make few additions), it's not
always clear that the machine cost of frequent sorting can ever be
recovered.
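
A minimal sketch of that variation, assuming the same a$() array is
kept sorted on the name column and NextEntry% tracks the next free
slot:

TestPoint% = 0
DO WHILE TestPoint% < NextEntry%
   IF a$(TestPoint%, 0) >= TestName$ THEN EXIT DO  'at or past the target
   TestPoint% = TestPoint% + 1
LOOP

Found% = 0
IF TestPoint% < NextEntry% THEN
   IF a$(TestPoint%, 0) = TestName$ THEN Found% = -1
END IF

IF Found% THEN
   PRINT a$(TestPoint%, 0); "'s Phone Number is "; a$(TestPoint%, 1)
ELSE
   PRINT TestName$; " Not on file."   'TestPoint% is the insert slot
END IF

On a miss this stops, on average, halfway through instead of scanning
the whole list, which is where the savings comes from.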

When we want a sorted list of names and phone numbers, it is very easy
to sort a linear list.  We know where the list starts and ends, and
there are lots of good sorts available to do the job.  For an array of
this size, a shell sort is probably going to be close to optimum.  If
you often want a sorted list, it might pay to keep the list sorted, but
that is more likely to be a convenience than a machine time savings.
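
For what it's worth, here is a minimal shell sort sketch for the same
two-column a$() array (the classic gap-halving version; fancier gap
sequences exist):

Gap% = NextEntry% \ 2
DO WHILE Gap% > 0
   FOR i% = Gap% TO NextEntry% - 1
      j% = i% - Gap%
      DO WHILE j% >= 0
         IF a$(j%, 0) <= a$(j% + Gap%, 0) THEN EXIT DO
         SWAP a$(j%, 0), a$(j% + Gap%, 0)   'keep name and number paired
         SWAP a$(j%, 1), a$(j% + Gap%, 1)
         j% = j% - Gap%
      LOOP
   NEXT i%
   Gap% = Gap% \ 2
LOOP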

Deleting data in a linear list is pretty simple.  If you are keeping
your data sorted, just copy each item below the item to be deleted up
one slot, adjust your last-item or next-item pointer, and there you
are.  If you are not keeping your data sorted, it's even easier - just
copy the last item in the array over the item to be deleted and adjust
your pointers.  Of course, in BASIC or QuickBASIC, the SWAP command is
much faster than an assignment (it exchanges string descriptors rather
than copying the text), so you'll want to SWAP the data rather than
copying it.  Both cases are sketched below.
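
Here are the two alternatives, with a hypothetical Doomed% holding the
index of the item to delete (these are separate fragments, not one
program):

'Unsorted list: SWAP the last item over the doomed one
NextEntry% = NextEntry% - 1
SWAP a$(Doomed%, 0), a$(NextEntry%, 0)
SWAP a$(Doomed%, 1), a$(NextEntry%, 1)

'Sorted list: close the gap, moving each later item up one slot
FOR i% = Doomed% TO NextEntry% - 2
   SWAP a$(i%, 0), a$(i% + 1, 0)
   SWAP a$(i%, 1), a$(i% + 1, 1)
NEXT i%
NextEntry% = NextEntry% - 1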

That's as far into linear searches as I care to go.  We can see the
fundamental limitations of linear searches pretty clearly.  It's time
to break the mold and try a new algorithm.

Since linear searches have some fundamental problems, what can we do to
help the situation?  The two most common alternatives are binary trees
and hashing.  Both have their advantages and disadvantages.

A binary tree arranges data with pointers, and you search data based on
whether the key is less than or greater than the value under test.  The
pointers are cumbersome to maintain, and a poorly arranged tree can
devolve into a linear list in short order.  The algorithms that keep
this from happening are pretty well understood, so this is not a real
problem.  The storage overhead is not typically as high as that needed
by a hashed access.  A major advantage of a binary tree is that the data
is always in order - no sorting required.  Just walk the data pointers,
and there you are.  Let me draw up a sample data base..... the data base
will consist of the names Aaron, Bill, George, Leonard, Margaret, Mike,
Sam, and Willard.  First, there is a pointer to the root of the tree,
which will be Leonard's entry.  Then there will be pointers to the
elements larger and smaller than Leonard, and those elements will have
pointers to elements larger and smaller than them..... so here's a
perfect tree (not the only perfect tree that can be derived from this
data)...

                    Leonard
                   /       \
               Bill         Mike
              /    \       /    \
          Aaron  George Margaret Willard
                                 /
                              Sam

A major problem with binary trees is representing them on paper <grin>.
If we want to find "Hank", we start searching at the root and see that
"Hank" is less than Leonard, so we look at "Bill", see that "Hank" is
greater than "Bill", and then look at "George".  "Hank" is greater than
"George", but there is no entry greater than "George", so we no know
that "Hank" is not on file, and that the "Greater than George" slot is
where "Hank" is to be added, if the user wants to add Hank.  The search
time is, in a balanced tree, proportional to (log(base 2) of Number of
entries) * processing overhead.  There is additional overhead when an
item is added to keep the tree balanced.
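
A minimal sketch of that walk, assuming a storage scheme of my own
choosing: parallel arrays Key$() for the names and Left%() and Right%()
for the child pointers (-1 meaning no child), with Root% pointing at
Leonard's slot:

Node% = Root%
Parent% = -1
DO WHILE Node% <> -1
   IF TestName$ = Key$(Node%) THEN EXIT DO   'found it
   Parent% = Node%
   IF TestName$ < Key$(Node%) THEN
      Node% = Left%(Node%)                   'take the smaller branch
   ELSE
      Node% = Right%(Node%)                  'take the larger branch
   END IF
LOOP

IF Node% <> -1 THEN
   PRINT Key$(Node%); " is on file."
ELSE
   PRINT TestName$; " not on file; it would hang off node"; Parent%
END IF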

Continues in next file (HASHING - 3 of 9)


--- WM v2.00/91-0231
 * Origin: The Modem Zone BBS (314) 893-5106 (1:289/2)