BBS: Inland Empire Archive Date: 03-28-93 (21:04) Number: 247 From: QUINN TYLER JACKSON Refer#: NONE To: ALL Recvd: NO Subj: Quik_Bas Faq2.1 4/ Conf: (2) Quik_Bas
>>> Continued from previous message
Q3.1 That's pretty fast! I was so used to doing a sequential search on
an unsorted list. Now that I have the QuickSort and the BiSearch
routines, I can use them as a pair for faster list searches. The
thing is, as soon as I want to add something to the list, it
puts everything out of order by only one entry, and that hardly
seems worth sorting all over again, even with something as fast
as Cornel Huth's iterative QuickSort algorithm. Are there any
alternatives to this way of doing things? I've heard talk of
something called 'hashing' but I don't have any idea of what that
is all about. How would I use hashing to avoid having to either
re-sort the list or use a slow insertion algorithm? Insertion is
horrendously slow with disk files.
A3.1 Hashing is a very efficient method of record access, whether in RAM
or in a disk file. Basically, hashed arrays or data files can be
quickly searched for a given item by a key index. Whenever you have
to add an item to the list, you can do so at lightning speed, and
since hashing "sorts" the array on-the-fly, as it were, there is no
need to push records around to add new items to a hashed table.
The first concept you must understand with hashing is the key index.
Every data structure you design with hashing in mind has to have
one field that is unique. This is a prerequisite that you just can't
get around. Of course, you could actually combine several fields
to generate this unique key, which effectively serves the same
purpose. A good application of this is a Fidonet nodelist that uses
the node address as the hashing key: in theory, no two are alike.
But just how does this key work? First of all, let's take a look
at the Fidonet example. Every full Fidonet address is unique to
one node. Assume that the full nodelist has about 15000 entries.
Okay, if you want a hashing table to hold 15000 unique entries, then
research has shown that the table should be at least 30% greater
than the number of entries in it. That would make 19500 table
entries. This means that 4500 entries in the list will be left
empty for best hashing results.
Now, another problem comes up. How does the key come into play?
Well, let's look at a simple key: 1153999. Since the list is 19500
long, we certainly can't just put this in record 1153999. Hashing
involves dividing the key by the table size and using the remainder
as the record number:

    1153999 \ 19500 = 59, remainder 3499   (1153999 MOD 19500 = 3499)
Okay, 3499 is the record number in which we would put the data.
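In QuickBASIC terms, that is just the MOD operator. Here is a minimal
sketch, keeping in mind that the key must be a LONG, since 1153999
overflows a 16-bit INTEGER (the names are only for illustration):

    DEFINT A-Z
    CONST TABLESIZE = 19500      ' the table size from the example above

    key& = 1153999               ' the sample Fidonet key
    bucket = key& MOD TABLESIZE
    PRINT bucket                 ' prints 3499, the record's 'home' bucket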
This is the basic idea behind hashing. There is one problem, however.
A collision occurs whenever another node address, divided by 19500,
also has a remainder of 3499. That 'bucket' is already full! So, what
to do? Generate another bucket number, see if that bucket is full,
and if it is, keep generating new bucket numbers until we find an
empty bucket to put the record in (see the sketch below).
To find an item in a hashed table, we take its key, divide it by the
table size, and look in the bucket given by the remainder. If that
isn't the one, we generate the next bucket address, and the next,
until we encounter either the correct key or an empty bucket. If we
encounter the correct key BEFORE we arrive at an empty bucket, then
we've found our entry. If we arrive at an empty bucket, the record
is not in the table. And there you have hashing.
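Here is a minimal QuickBASIC sketch of both operations, using the
simplest collision strategy of all, linear probing: just step to the
next bucket, wrapping around at the end of the table. The table
layout, the names, and the use of key 0 to mark an empty bucket are
all assumptions made for this illustration:

    DEFINT A-Z
    CONST TABLESIZE = 1019               ' a 4K+3 prime (see below)

    DIM SHARED hashKey&(TABLESIZE - 1)   ' key 0 marks an empty bucket
    DIM SHARED hashDat$(TABLESIZE - 1)   ' the data stored with each key

    SUB HashInsert (key&, dat$)
        ' Probe forward from the home bucket until an empty one turns up.
        bucket = key& MOD TABLESIZE
        DO WHILE hashKey&(bucket) <> 0           ' collision: bucket in use
            bucket = (bucket + 1) MOD TABLESIZE  ' generate the next bucket
        LOOP
        hashKey&(bucket) = key&
        hashDat$(bucket) = dat$
    END SUB

    FUNCTION HashFind (key&)
        ' Returns the bucket holding key&, or -1 if the key isn't there.
        ' (Safe because a proper table is never 100% full.)
        bucket = key& MOD TABLESIZE
        DO UNTIL hashKey&(bucket) = 0            ' empty bucket ends the hunt
            IF hashKey&(bucket) = key& THEN
                HashFind = bucket                ' found our entry
                EXIT FUNCTION
            END IF
            bucket = (bucket + 1) MOD TABLESIZE
        LOOP
        HashFind = -1                            ' record is not in the table
    END FUNCTION

One caution about this sketch: with linear probing you cannot simply
blank out a bucket to delete a record, or you break the probe chain
for every record that collided past it; deletions need a special
'deleted' marker instead.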
A well-designed hashing table will yield roughly this many collisions
per insertion or search:
T1.0 Hashing Collision Table
     TABLE FULLNESS        COLLISIONS
     ==================================
          50%                  2.0
          60%                  2.5
          70%                  3.3
          90%                 10.0
=======>8 TABLE 1.0 ENDS HERE 8<=========
With large lists, those are better results than even the binary
search! (Notice that the figures run at roughly 1 / (1 - fullness),
which is why performance falls off so sharply as the table fills up.)
Research has shown that the most efficient hashing tables, that is,
the ones with the least number of collisions, have a prime
number of entries. A table size of 1019 should produce fewer
collisions than one of 1000. Research has also shown that if the
prime is of the form 4K+3, where K is any positive integer, then
collisions are reduced even further. 1019 also meets this second
requirement. But, since a table size twice the size of the maximum
number of entries it will ever hold is inefficient, the 4K+3
criterion should be abandoned at a certain point in favor of any
prime number. Since most of us aren't idiot savants who can just
come up with that number to suit our needs, here is a FUNCTION,
written by Charles Graham, that accepts the maximum number of
entries a table will have, and returns the proper type of prime
number, to be used as a hashing table size:
S4.0 FSTPRIME.BAS [F210S04.BAS]
DEFINT A-Z
' This FUNCTION returns a prime number that is at least 30% greater than
' threshold. It will TRY to return a prime number that also fits the
' form 4K+3, where K is any integer, but once the candidate is twice the
' size of the threshold, it will ignore this criterion.
'
' Written by Charles Graham, Tweaked by Quinn Tyler Jackson
'
FUNCTION funFirstPrime (threshold)
    CONST TRUE = -1
    CONST FALSE = NOT TRUE
    tp30 = INT((threshold * 1.3) + .5)   ' 30% above threshold, rounded
    IF tp30 / 2 = tp30 \ 2 THEN          ' make sure we start on an odd number
        tp30 = tp30 + 1
    END IF
    c = tp30 - 2
    IF c < 1 THEN
        c = 1
    END IF
    t2 = threshold * 2
    DO
        c = c + 2                        ' next odd candidate
        ind = TRUE                       ' assume prime until a divisor shows up
        FOR z = 3 TO SQR(c) STEP 2       ' c is odd, so only odd divisors matter
            IF c / z = c \ z THEN        ' z divides c evenly: not prime
                ind = FALSE
                EXIT FOR
            END IF
        NEXT z
        IF ind THEN
            ' Accept a 4K+3 prime, or any prime once c exceeds 2 * threshold.
            IF (c - 3) / 4 = INT((c - 3) / 4) OR c > t2 THEN
                funFirstPrime = c
                EXIT DO
            END IF
        END IF
    LOOP
END FUNCTION
=======>8 SAMPLE 4.0 ENDS HERE 8<=========
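As a quick sanity check, sizing a table for the 15000-entry nodelist
example above should go like this:

    DECLARE FUNCTION funFirstPrime (threshold)
    PRINT funFirstPrime(15000)   ' prints 19507

19507 is prime, more than 30% over 15000, and fits the 4K+3 form,
since 19507 = 4 * 4876 + 3. (The first prime past the 30% mark,
19501, gets passed over because it is of the form 4K+1.)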
Q3.2 How do I know when to use sequential searches, when to use
binary searches, and when to use hashing? Are there any sort
of guidelines?
>>> Continued to next message
* OLX 2.1 TD * Post all questions in this box --> []
