BBS: Inland Empire Archive
Date: 03-28-93 (21:04)             Number: 247
From: QUINN TYLER JACKSON          Refer#: NONE
  To: ALL                          Recvd: NO
Subj: Quik_Bas Faq2.1 4/           Conf: (2) Quik_Bas
>>> Continued from previous message

Q3.1 That's pretty fast! I was so used to doing a sequential search on an
unsorted list. Now that I have the QuickSort and the BiSearch routines, I can
use them as a pair for faster list searches. The thing is, as soon as I want
to add something to the list, it puts everything out of order by only one
entry, and that hardly seems worth sorting all over again, even with something
as fast as Cornel Huth's iterative QuickSort algorithm. Are there any
alternatives to this way of doing things? I've heard talk of something called
'hashing' but I don't have any idea of what that is all about. How would I use
hashing to avoid having to either resort the list or use a slow insertion
algorithm? Insertion is horrendously slow with disk files.

A3.1 Hashing is a very efficient method of record access, be it in RAM or in
a disk file. Basically, hashed arrays or data files can be quickly searched
for a given item by a key index. Whenever you have to add an item to the
list, you can do so at lightning speed, and since hashing "sorts" the array
on-the-fly, as it were, there is no need to push records around to add new
items to a hashed table.

The first concept you must understand with hashing is the key index. Every
data structure you design with hashing in mind has to have one field that is
unique. This is a prerequisite that you just can't get around. Of course, you
could actually combine several fields to generate this unique key, which
effectively serves the same purpose. A good application of this is a Fidonet
nodelist that uses the node address as the hashing key. No two alike, in
theory.

But just how does this key work? First of all, let's take a look at the
Fidonet example. Every full Fidonet address is unique to one node. Assume
that the full nodelist has about 15000 entries. Okay, if you want a hashing
table to hold 15000 unique entries, then research has shown that the table
should be at least 30% greater than the number of entries in it. That would
make 19500 table entries. This means that 4500 entries in the list will be
left empty for best hashing results.

Now, another question comes up. How does the key come into play? Well, let's
look at a simple key: 1153999. Since the list is only 19500 long, we
certainly can't just put this in record 1153999. Hashing involves dividing
the key by the table size and using the remainder as the record number:

                 59
             ---------- R 3499
     19500 ) 1153999

Okay, 3499 is the record number in which we would put the data. This is the
basic idea behind hashing.

There is a problem, however. A collision occurs whenever another key, when
divided by 19500, also has a remainder of 3499. That 'bucket' is already
full! So, what to do? Generate another bucket number, see if that bucket is
full, and if it is, keep generating new buckets until we find an empty one.

To find an item in a hashed table, we take its key, divide by the table size,
and look at the bucket represented by the remainder. If that isn't the one,
we generate the next bucket address, and so on. If we encounter the correct
key BEFORE we arrive at an empty bucket, then we've found our entry. If we
arrive at an empty bucket, the record is not in the table. And there you have
hashing.
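To make that concrete, here is a bare-bones illustration in BASIC. It is only
a sketch: the tiny table size, the sample keys, and the probe-one-bucket-
forward rule on collision are arbitrary choices for the demonstration. Real
code would store whole records rather than bare keys, and would size the
table as described below.

' HASHDEMO.BAS -- illustration only, not part of the FAQ's numbered samples.
' Stores LONG keys in a small RAM table, using k MOD TBLSIZE as the bucket
' number and probing forward one bucket at a time on collision. Assumes
' keys are never 0 (0 marks an empty bucket) and the table never fills.

DEFINT A-Z
DECLARE SUB HashInsert (k AS LONG)
DECLARE FUNCTION HashFind (k AS LONG)

CONST TBLSIZE = 19                          ' demo-sized prime table
DIM SHARED table(0 TO TBLSIZE - 1) AS LONG  ' 0 = empty bucket

HashInsert 1153999                   ' 1153999 MOD 19 = 15
HashInsert 1154018                   ' also 15 -- collides, lands in 16
PRINT HashFind(1153999)              ' prints 15
PRINT HashFind(1154018)              ' prints 16
PRINT HashFind(42)                   ' prints -1: not in the table

SUB HashInsert (k AS LONG)
    bucket = k MOD TBLSIZE
    DO WHILE table(bucket) <> 0              ' bucket taken: collision
        bucket = (bucket + 1) MOD TBLSIZE    ' try the next bucket
    LOOP
    table(bucket) = k
END SUB

FUNCTION HashFind (k AS LONG)
    bucket = k MOD TBLSIZE
    DO WHILE table(bucket) <> 0
        IF table(bucket) = k THEN            ' correct key found BEFORE an
            HashFind = bucket                ' empty bucket: got our entry
            EXIT FUNCTION
        END IF
        bucket = (bucket + 1) MOD TBLSIZE
    LOOP
    HashFind = -1                    ' empty bucket: record not in the table
END FUNCTION

Note that both routines stop at the first empty bucket, which is exactly the
search rule described above.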
A well designed hashing table will yield this number of collisions per
insertion or search:

T1.0 Hashing Collision Table

     TABLE FULLNESS       COLLISIONS
     ==============================
          50%                 2.0
          60%                 2.5
          70%                 3.3
          90%                10.0
=======>8 TABLE 1.0 ENDS HERE 8<=========

That shows better results than even the binary search, with large lists!

Research has shown that the most efficient hashing tables, that is, the ones
with the fewest collisions, have a prime number of entries. A table size of
1019 should produce fewer collisions than one of 1000. Research has also
shown that if the prime is of the form 4K+3, where K is any positive integer,
then collisions are reduced even further. 1019 also meets this second
requirement. But, since a table twice the size of the maximum number of
entries it will ever hold is inefficient, the 4K+3 criterion should be
abandoned at a certain point in favor of any prime number.

Since most of us aren't idiot savants who can just come up with that number
to suit our needs, here is a FUNCTION, written by Charles Graham, that
accepts the maximum number of entries a table will have, and returns the
proper type of prime number, to be used as a hashing table size:

S4.0 FSTPRIME.BAS [F210S04.BAS]

DEFINT A-Z
' This FUNCTION returns a prime number that is at least 30% greater than
' threshold. It will TRY to return a prime number that also fits into the
' form 4K+3, where K is any integer, but if the prime number is twice the
' size of the threshold, it will ignore this criterion.
'
' Written by Charles Graham, Tweaked by Quinn Tyler Jackson
'
FUNCTION funFirstPrime (threshold)
    CONST TRUE = -1
    CONST FALSE = NOT TRUE
    ' Start just below threshold + 30%, forced odd.
    tp30 = INT((threshold * 1.3) + .5)
    IF tp30 / 2 = tp30 \ 2 THEN
        tp30 = tp30 + 1
    END IF
    c = tp30 - 2
    IF c < 1 THEN
        c = 1
    END IF
    t2 = threshold * 2
    DO
        c = c + 2                       ' next odd candidate
        ind = TRUE                      ' assume prime until a divisor found
        FOR z = 3 TO SQR(c)
            IF c / z = c \ z THEN       ' z divides c evenly: not prime
                ind = FALSE
                EXIT FOR
            END IF
        NEXT z
        IF ind THEN
            ' Accept a 4K+3 prime, or any prime once past twice threshold.
            IF (c - 3) / 4 = INT((c - 3) / 4) OR c > t2 THEN
                funFirstPrime = c
                EXIT DO
            END IF
        END IF
    LOOP
END FUNCTION
=======>8 SAMPLE 4.0 ENDS HERE 8<=========

For the 15000-entry nodelist above, funFirstPrime(15000) returns 19507, a
prime of the form 4K+3 that is just over 30% larger than the entry count.

Q3.2 How do I know when to use sequential searches, when to use binary
searches, and when to use hashing? Are there any sort of guidelines?

>>> Continued to next message

 * OLX 2.1 TD * Post all questions in this box --> []