BBS: Inland Empire Archive Date: 05-09-92 (01:12) Number: 187 From: TOM HAMMOND Refer#: NONE To: LYN BORCHERT Recvd: NO Subj: HASHING 2/9 Conf: (2) Quik_Bas
Continued from HASHING 1/9 In short, we start searching at the beginning of the data base, checking each entry in order until we either find the data we want or reach the end of the data. We are searching the data in order, or linearly. So, the search time, when we have a match, is on average proportional to (LastEntry/2) * processing overhead. If the data is not on file, the search processing overhead will vary, depending on the kind of computer we have, the language we write the search in (QB OF COURSE!!), and so on. There are some fairly simple ways to reduce the processing overhead, which will decrease the search time. In most cases, the changes below will cut search time by 1/3. But we are still limited by the fact we are examining, on average, half the data base when we find the data we want, and all of the data base when we don't find the data we want. To reduce the overhead, we make the following changes. Instead of tracking the last entry, we track the next available entry. And we stuff the test value in the array as the next value. We will ALWAYS find the test value (except maybe with an old version of MicroSoft BASIC for the Commodore PET or OSI Challenger), so we need not check for the end of the array. Here goes.... a$(NextEntry%, 0) = TestName$ TestPoint% = 0 DO WHILE a$(TestPoint%, 0) <> a$(NextEntry%, 0) TestPoint% = TestPoint% + 1 LOOP IF TestPoint% <> NextEntry% THEN PRINT a$(TestPoint%, 0); "'s Phone Number is "; a$(TestPoint%, 1) ELSE PRINT TestName$; " Not on file." 'etc END IF As you can see, since we will always find our test value, as it's at the end of the array, we needn't concern ourselves with boundary checking and we needn't worry about checking the 501st element of the 500 element array. (Since we are using element 0, I suppose that's the 502cd element of a 501 element array.) Variations on the theme involve keeping the array sorted and only searching until an element => the desired value is found. This does cut search time, bounds checking is still required, and unless the data base is very stable (one where you make few additions) it's not always clear that the machine costs of frequent sorting can ever be recovered. When we want a sorted list of names and phone numbers, it is very easy to sort a linear list. We know where the list starts and ends, and there are lots of good sorts available to sort the list. For this size array, a shell sort is going to probably be close to optimum. If you often want a sorted list, it might pay to keep the list sorted, but that is more than likely going to be a convenience factor than a machine time savings. Deleting data in a linear list is pretty simple. If you are keeping your DATA sorted, just copy each item below the item to be deleted up 1 item, and adjust your last item or next item pointer and there you are. If you are not keeping your data sorted, it's even easier - just copy the last item in the array over the item to be deleted and adjust your pointers. Of course, in Basic or QuickBasic, the SWAP command is much faster than an assignment, so you'll want to SWAP the data rather than copying it. This goes as far into linear searches as I care to go. We see the fundamental limitations of linear searches pretty clearly. It's time to break the mold and try a new algorithym. Since linear searches have some fundamental problems, what can we do to help the situation? The two most common alternatives are binary trees and hashing. Both have their advantages and disadvantages. A binary tree arranges data with pointers, and you search data based on whether the key is less than or greater than the value under test. The pointers are cumbersome to maintain, and a poorly arranged tree can devolve into a linear list in short order. The algorithyms that keep this from happening are pretty well understood, so this is not a real problem. The storage overhead is not typically as high as that needed by a hashed access. A major advantage of a binary tree is that the data is always in order - no sorting required. Just walk the data pointers, and there you are. Let me draw up a sample data base..... the data base will consist of the names Aaron, Bill, George, Leonard, Margaret, Mike, Sam, and Willard. First, there is a pointer to the root of the tree, which will be Leonard's entry. Then there will be pointers to the elements larger and smaller than Leonard, and those elements will have pointers to elements larger and smaller than them..... so here's a perfect tree (not the only perfect tree that can be derived from this data)... Leonard / \ Bill MIKE / \ / \ Aaron George Margaret Willard / \ / \ / \ / \ Sam A major problem with binary trees is representing them on paper <grin>. If we want to find "Hank", we start searching at the root and see that "Hank" is less than Leonard, so we look at "Bill", see that "Hank" is greater than "Bill", and then look at "George". "Hank" is greater than "George", but there is no entry greater than "George", so we no know that "Hank" is not on file, and that the "Greater than George" slot is where "Hank" is to be added, if the user wants to add Hank. The search time is, in a balanced tree, proportional to (log(base 2) of Number of entries) * processing overhead. There is additional overhead when an item is added to keep the tree balanced. Continues in next file (HASHING - 3 of 9) --- WM v2.00/91-0231 * Origin: The Modem Zone BBS (314) 893-5106 (1:289/2)
Books at Amazon:
Back to BASIC: The History, Corruption, and Future of the Language
Hackers: Heroes of the Computer Revolution (including Tiny BASIC)
Go to: The Story of the Math Majors, Bridge Players, Engineers, Chess Wizards, Scientists and Iconoclasts who were the Hero Programmers of the Software Revolution
The Advent of the Algorithm: The Idea that Rules the World
Moths in the Machine: The Power and Perils of Programming
Mastering Visual Basic .NET