
 BBS: Inland Empire Archive
Date: 03-28-93 (21:04)             Number: 247
From: QUINN TYLER JACKSON          Refer#: NONE
  To: ALL                           Recvd: NO  
Subj: Quik_Bas Faq2.1       4/       Conf: (2) Quik_Bas
>>> Continued from previous message
Q3.1    That's pretty fast!  I was so used to doing a sequential search on
        an unsorted list.  Now that I have the QuickSort and the BiSearch
        routines, I can use them as a pair for faster list searches.  The
        thing is, as soon as I want to add something to the list, it
        puts everything out of order by only one entry, and that hardly
        seems worth sorting all over again, even with something as fast
        as Cornel Huth's iterative QuickSort algorithm.  Are there any
        alternatives to this way of doing things?  I've heard talk of
        something called 'hashing' but I don't have any idea of what that
        is all about.  How would I use hashing to avoid having to either
        re-sort the list or use a slow insertion algorithm?  Insertion is
        horrendously slow with disk files.

A3.1    Hashing is a very efficient method of record access, whether the
        records live in RAM or in a disk file.  Basically, hashed arrays or
        data files can be quickly searched for a given item by a key index.
        Whenever you have to add an item to the list, you can do so at
        lightning speed, and since hashing "sorts" the array on-the-fly, as
        it were, there is no need to push records around to add new items
        to a hashed table.

        The first concept you must understand with hashing is the key index.
        Every data structure you design with hashing in mind has to have
        one field that is unique.  This is a prerequisite that you just can't
        get around.  Of course, you could actually combine several fields
        to generate this unique key, which effectively serves the same
        purpose.  A good application of this is a Fidonet nodelist that uses
        the full node address as the hashing key: no two addresses are
        alike, in theory.
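
        For illustration, here is one way such a key might be built in
        QuickBASIC from the three parts of the address.  This is only a
        sketch: the FUNCTION name funNodeKey& is made up, and it assumes
        net and node numbers stay below 1000, which real Fidonet addresses
        do not always do.

        DEFINT A-Z

        FUNCTION funNodeKey& (zone, netnum, node)
            ' Pack zone:net/node into a single LONG key, so that the
            ' address 1:153/999 becomes the key 1153999 used below.
            ' (Assumes netnum < 1000 and node < 1000.)
            funNodeKey& = zone * 1000000& + netnum * 1000& + node
        END FUNCTION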

        But just how does this key work?  First of all, let's take a look
        at the Fidonet example.  Every full Fidonet address is unique to
        one node.  Assume that the full nodelist has about 15000 entries.
        Okay, if you want a hashing table to hold 15000 unique entries, then
        research has shown that the table should be at least 30% larger
        than the number of entries it holds.  That would make 19500 table
        entries.  This means that 4500 slots in the table will be left
        empty for best hashing results.

        Now, another problem comes up.  How does the key come into play?
        Well, let's look at a simple key: 1153999.  Since the table is only
        19500 entries long, we certainly can't just put this in record
        1153999.  Hashing involves dividing the key by the table size,
        taking the remainder, and using that as the record number:

                           59
                    ----------  R 3499
               19500) 1153999


        Okay, 3499 is the record number in which we would put the data.
        This is the basic idea behind hashing.  There is one trouble,
        however.  A collision occurs whenever another node address, divided
        by 19500, also has a remainder of 3499.  That 'bucket' is already
        full!  So, what to do?  Generate another bucket number, see if that
        bucket is full, and if it is, keep generating new bucket numbers
        until we find an empty bucket.
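
        In QuickBASIC terms, taking the remainder is just the MOD operator,
        and one simple way to "generate another bucket number" is to step
        to the next slot, wrapping around at the end of the table (linear
        probing).  Here is a rough sketch of an in-RAM insertion routine
        along those lines; the names table&(), TBLSIZE and HashInsert are
        only examples, and a bucket holding 0 is taken to mean "empty":

        DEFINT A-Z
        CONST TBLSIZE = 1019                 ' a prime table size (see below)
        DIM SHARED table&(0 TO TBLSIZE - 1)  ' 0 in a bucket means empty

        SUB HashInsert (key&)
            bucket = key& MOD TBLSIZE        ' home bucket = remainder
            DO WHILE table&(bucket) <> 0     ' collision: bucket is full
                IF table&(bucket) = key& THEN EXIT SUB  ' already stored
                bucket = (bucket + 1) MOD TBLSIZE   ' try the next bucket
            LOOP
            table&(bucket) = key&            ' empty bucket found: store it
        END SUB

        Note that the DO loop assumes the table never fills up completely,
        which is exactly why it should be kept at least 30% larger than the
        data it holds.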

        To find an item in a hashed table, we get its key, divide by the
        table size, and look at the bucket that is represented by the
        remainder.  If that isn't the one, we generate the next bucket
        address, until we arrive at an empty bucket.  If we encounter
        the correct key BEFORE we arrive at an empty bucket, then we've
        found our entry.  If we arrive at an empty bucket, the record is
        not in the table.  And there you have hashing.
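
        The search, then, is the same walk through the buckets, stopping
        either at the key we want or at an empty bucket.  Here is a
        companion sketch to the insertion routine above; it assumes the
        same table&() array and TBLSIZE constant, and the name HashFound
        is again made up:

        FUNCTION HashFound (key&)
            ' Returns -1 (true) if key& is in table&(), or 0 (false) if an
            ' empty bucket is reached first.
            bucket = key& MOD TBLSIZE
            DO WHILE table&(bucket) <> 0
                IF table&(bucket) = key& THEN
                    HashFound = -1           ' found before an empty bucket
                    EXIT FUNCTION
                END IF
                bucket = (bucket + 1) MOD TBLSIZE   ' try the next bucket
            LOOP
            HashFound = 0                ' empty bucket: not in the table
        END FUNCTION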

        A well-designed hashing table will yield, on average, about this
        many collisions per insertion or search:


T1.0    Hashing Collision Table

        TABLE FULLNESS          COLLISIONS
        ==================================
             50%                   2.0
             60%                   2.5
             70%                   3.3
             90%                  10.0


=======>8 TABLE 1.0 ENDS HERE 8<=========

        Those are better results than even a binary search gives on large
        lists!

        Research has shown that the most efficient hashing tables, that is,
        the ones with the fewest collisions, have a prime number of
        entries.  A table size of 1019 should produce fewer collisions
        than one of 1000.  Research has also shown that if the prime is
        of the form 4K+3, where K is any positive integer, then collisions
        are reduced even further.  1019 also meets this second requirement.
        But since a table twice the size of the maximum number of entries
        it will ever hold is inefficient, the 4K+3 criterion should be
        abandoned at a certain point in favor of any prime number.  Since
        most of us aren't idiot savants who can just come up with that
        number to suit our needs, here is a FUNCTION, written by Charles
        Graham, that accepts the maximum number of entries a table will
        have and returns the proper type of prime number, to be used as a
        hashing table size:

S4.0    FSTPRIME.BAS [F210S04.BAS]

DEFINT A-Z
' This FUNCTION returns a prime number that is at least 30% greater than
' threshold.  It will TRY to return a prime number that also fits the
' form 4K+3, where K is any integer, but once the candidate prime grows to
' more than twice the threshold, it ignores this criterion.
'
'       Written by Charles Graham, Tweaked by Quinn Tyler Jackson
'
FUNCTION funFirstPrime (threshold)
CONST TRUE = -1
CONST FALSE = NOT TRUE

tp30 = INT((threshold * 1.3) + .5)    ' threshold plus 30%, rounded
IF tp30 MOD 2 = 0 THEN                ' start the hunt on an odd number
    tp30 = tp30 + 1
END IF
c = tp30 - 2
IF c < 1 THEN
    c = 1
END IF
t2 = threshold * 2
DO
    c = c + 2                         ' next odd candidate
    ind = TRUE                        ' assume prime until a divisor turns up
    FOR z = 3 TO SQR(c)
        IF c MOD z = 0 THEN
            ind = FALSE               ' z divides c, so c is not prime
            EXIT FOR
        END IF
    NEXT z
    IF ind THEN
        IF (c - 3) MOD 4 = 0 OR c > t2 THEN   ' 4K+3 form, or too big to care
            funFirstPrime = c
            EXIT DO
        END IF
    END IF
LOOP
END FUNCTION

=======>8 SAMPLE 4.0 ENDS HERE 8<=========
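
        For example, to size a table for the 15000-entry nodelist discussed
        above, the module-level code of a program containing this FUNCTION
        might read (entries and tableSize are just example names):

        entries = 15000                     ' expected number of entries
        tableSize = funFirstPrime(entries)  ' a prime at least 30% larger
        PRINT "Hash table size for"; entries; "entries:"; tableSize

        The value returned is what you would use as the table size in place
        of the fixed TBLSIZE in the sketches above.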

Q3.2    How do I know when to use sequential searches, when to use
        binary searches, and when to use hashing?  Are there any sorts
        of guidelines?
>>> Continued to next message

 * OLX 2.1 TD * Post all questions in this box --> []