BBS: Inland Empire Archive
Date: 05-09-92 (01:14)              Number: 189
From: TOM HAMMOND                   Refer#: NONE
  To: LYN BORCHERT                  Recvd: NO
Subj: HASHING 4/9                   Conf: (2) Quik_Bas
Continued from Hashing 3/9

So, how about those collisions? We have two items of data that transform, or hash, into the same index. Now we have a choice: we can handle the collision data in an overflow area - usually a linear array or file - or we can re-hash (re-transform) the key to find a new spot inside the main array in which to store the data. All the commercial implementations of hashing I have used use re-hashing. If handled correctly, it is much faster than a linear search of an external array or file. If handled poorly, a re-hash is about as bad as an external array or file, so there is still little reason to add the complication of a different data structure.

A collision is indicated when we transform Hank's name, get an index of 8, and discover element 8 is already occupied by Mike. Then we re-hash and put Hank in 12, which happens to be empty. If 12 were not empty, we'd re-hash again, and repeat the process until we found an empty space or we realized that the array or file was full. Some people place a limit on the number of retries they will allow. After 20, 40, or 100 retries they declare the data base full and encourage the user to resize it. Where that threshold should be depends on your computer, how much storage space you have, and how patient you are. With a 33 MHz 486 and a fast, cached hard disk you can afford more retries than someone still using a 4 MHz Z-80 and a floppy disk drive. But the person with the floppy disk may have to accept a higher number of retries to let all the data fit. The trade-off here is between time efficiency and space utilization.

The number of retries, if a good hash and re-hash are selected, is quite small, but it does have a statistical distribution. You can have a 20% full data base with 90% of all records recovered in 1 try, 9% recovered in fewer than 3 tries, and one record that takes 50 tries to recover. It doesn't make much sense to double the size of your data base, or change your hash or re-hash, for that one record!

In many data sets we encounter "clustering" of data, where elements 8, 10, 11, 12, and 15 are full but all the rest of a 500-element array is empty. The greater the degree of clustering, the higher the likelihood of collisions. Selecting a good transformation is very important here, as is selecting a good re-hash. Hashing in general works best if several conditions are met:

1. the data should be fairly random - sequential numbers like check numbers, invoice numbers, and the like are poor choices for keys,

2. the array or file should be sized so that the number of elements is a prime number, and one that isn't too close to a multiple of 10 - so neither 11 nor 101 is suggested, and

3. the hash and re-hash are selected to work well with your data set.

The hashing algorithm can be complex or simple, but it must return a number that points into the array or file we are using. The simpler the algorithm, the faster the transformation will run, but that may cost us later if we are searching the array repeatedly due to collisions. A friendly vendor once told me that the effects of array size and hashing algorithm are counter-intuitive, and the best way to deal with the situation is computer modeling. In other words, pick a hash and re-hash and run some data through them, change parameters and try again, and then zero in on the best hash and re-hash for your data. A simple approach with numbers is to square them and normalize the result to the size of the array.
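To make that concrete, here is a minimal QuickBASIC sketch of the square-and-normalize hash with a retry limit on collisions. The 83-slot table, the 4-digit numeric keys, the 100-try cap, and the step-ahead-by-one Rehash% stand-in are all assumptions picked for illustration; none of it is the author's or any vendor's actual code, and the powers-of-2 re-hash he prefers (described below) would need an extra parameter to track which power it is on.

' Minimal sketch: 4-digit numeric keys, 83-slot table (a prime).
DECLARE FUNCTION Hash% (Key AS LONG)
DECLARE FUNCTION Rehash% (Index AS INTEGER)
DECLARE FUNCTION Store% (Key AS LONG)

CONST TBLSIZE = 83             ' prime number of elements
CONST MAXTRIES = 100           ' declare the data base full after this many tries
CONST FREESLOT = 0             ' 0 marks an unused slot (keys assumed nonzero)

DIM SHARED Table(0 TO TBLSIZE - 1) AS LONG

slot% = Store%(3245&)          ' the sample key used later in the message
IF slot% = -1 THEN PRINT "Data base full" ELSE PRINT "Stored at"; slot%

FUNCTION Hash% (Key AS LONG)
    ' Square the key and normalize it to the size of the table.
    ' (Key * Key must fit in a LONG, hence the 4-digit assumption.)
    Hash% = (Key * Key) MOD TBLSIZE
END FUNCTION

FUNCTION Rehash% (Index AS INTEGER)
    ' Stand-in re-hash: step to the next slot and wrap around.
    ' The powers-of-2 re-hash described below spreads data better.
    Rehash% = (Index + 1) MOD TBLSIZE
END FUNCTION

FUNCTION Store% (Key AS LONG)
    ' Returns the slot used, or -1 after MAXTRIES tries.
    DIM i AS INTEGER, tries AS INTEGER
    i = Hash%(Key)
    FOR tries = 1 TO MAXTRIES
        IF Table(i) = FREESLOT THEN
            Table(i) = Key
            Store% = i
            EXIT FUNCTION
        END IF
        i = Rehash%(i)         ' collision: try another slot
    NEXT tries
    Store% = -1                ' too many tries: treat the table as full
END FUNCTION

Store% reporting -1 is the point where, as described above, you would tell the user to resize the data base.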
A somewhat more complex approach that many people like is to split the key into two numbers, square them, sum the squares, and then normalize to the array size. For example, "3245" could be broken into "32" and "45"; those could be squared, summed, and normalized. Some people like mixing digits, so "3245" would break into "34" and "25". This seems to help when you have to use sequential numbers as the key. Strings can be converted into numbers based on their ASCII values and then processed as described above, although this can be a fairly lengthy process.

The first re-hash I saw just added 1 to the previous index and then normalized for the size of the array. This increased clustering and seemed just awful until the vendor explained that his system was written for use on floppy disk drives and the re-hash was selected to minimize head movement. He had probably made a good choice for his environment, but when transferred to a hard disk environment his re-hash was a disaster.

A better approach is to spread the data through the array or file as quickly as possible. I maintain a "powers of 2" table and work through it. First I add 2^0 (or 1) to the index, then 2^1 (or 2), then 2^2 (4), 2^3 (8), and so on. This quickly moves data away from a cluster and, with a prime number of elements in the array, assures us that all elements will eventually be tested. I use powers of 2 up to the one just larger than the size of the array, and then I start over with 2^0. So, if our file has 27 elements and the hash value was 3, here's the re-hash pattern (a short sketch of this pattern follows at the end of this message):

   Old key    Add    New key    Normalized
      3        1        4            4
      4        2        6            6
      6        4       10           10
     10        8       18           18
     18       16       34       34 - 27 = 7
      7       32       39       39 - 27 = 12
     12        1       13           13
    etc.

Continues in next file (HASHING - 5 of 9)

--- WM v2.00/91-0231
 * Origin: The Modem Zone BBS (314) 893-5106 (1:289/2)
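To check the arithmetic in the worked example, here is a short standalone QuickBASIC sketch that reproduces the re-hash table and also shows the split-key and ASCII-string hashes mentioned in the message. The 27-element table, the starting index of 3, and the sample key 3245 come from the text; the function names (SplitHash%, StrHash%), the digit-split arithmetic, and the loop details are assumptions for illustration, not the author's code.

' Standalone sketch: powers-of-2 re-hash plus the two hashes above.
DECLARE FUNCTION SplitHash% (Key AS LONG)
DECLARE FUNCTION StrHash% (Key AS STRING)

CONST TBLSIZE = 27             ' table size from the worked example
                               ' (the text recommends a prime in practice)

DIM oldkey AS INTEGER, newkey AS INTEGER, addval AS INTEGER
DIM i AS INTEGER

' Reproduce the re-hash table: start at index 3, add successive
' powers of 2, normalize, and restart the powers after passing 32.
oldkey = 3
addval = 1                     ' 2^0
PRINT "Old key", "Add", "New key", "Normalized"
FOR i = 1 TO 7
    newkey = oldkey + addval
    PRINT oldkey, addval, newkey, newkey MOD TBLSIZE
    oldkey = newkey MOD TBLSIZE
    ' Once we pass the power of 2 just larger than the table size,
    ' start over at 2^0; otherwise double the step.
    IF addval > TBLSIZE THEN
        addval = 1
    ELSE
        addval = addval * 2
    END IF
NEXT i

PRINT "Split hash of 3245:"; SplitHash%(3245&)
PRINT "String hash of HANK:"; StrHash%("HANK")

FUNCTION SplitHash% (Key AS LONG)
    ' Split a 4-digit key into its first two and last two digits,
    ' square each half, sum the squares, and normalize.
    DIM hi AS LONG, lo AS LONG
    hi = Key \ 100             ' 3245 -> 32
    lo = Key MOD 100           ' 3245 -> 45
    SplitHash% = (hi * hi + lo * lo) MOD TBLSIZE
END FUNCTION

FUNCTION StrHash% (Key AS STRING)
    ' Convert a short string key to a number from its ASCII values,
    ' then square and normalize as with a numeric key.
    DIM n AS LONG, j AS INTEGER
    FOR j = 1 TO LEN(Key)
        n = n + ASC(MID$(Key, j, 1))
    NEXT j
    StrHash% = (n * n) MOD TBLSIZE
END FUNCTION

Run as-is, the loop prints the same seven rows as the table above.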