Once Upon an Algorithm Page 10
A stack data type can be used to reverse the elements in a list by first placing them on the stack and then retrieving them in the reverse order (last in, first out). One application is finding a way back, which is, of course, what Hansel and Gretel do, what Theseus did to find his way out of the labyrinth in the Greek myth of Theseus and the Minotaur, and what Adso did in The Name of the Rose to find his way back through the labyrinth of the library. Another application is understanding the movie Memento, which is mostly told in reverse chronological order. By pushing the scenes of the movie onto a mental stack and then taking them off in the reverse order they were presented, one obtains the events of the story in the proper order.
Using a tree to represent family relationships like those in The Hound of the Baskervilles might not seem necessary, but there are stories in which the families get so big that it becomes very hard to track the relationships without a corresponding representation. For example, George R. R. Martin’s series of novels A Song of Ice and Fire, also known through the TV series Game of Thrones, contains three large family trees, House Stark, House Lannister, and House Targaryen. The gods from Greek and Norse mythology provide further examples of large family trees.
Problem Solving and Its Limitations
Indiana Jones
Lost and Found
Where is that note you created a few months ago? You know you wrote it on a sheet of paper that was related to the project that just came up again. You searched all the places where that sheet of paper could possibly be—at least you think you did. Yet, you can’t find it. You observe yourself searching some of the places repeatedly—maybe you didn’t look carefully enough the first time. You also find notes you were looking for desperately and unsuccessfully some time ago. Darn it.
Does that scenario sound familiar? I certainly have experienced it, and not just once. Without doubt, finding something can be difficult, and an unsuccessful search can be quite frustrating. Even in the days of the seemingly all-powerful Google search engine, finding the right information on the internet can be difficult if you don’t know the right keywords or if the keywords are too general and match too many sources.
Fortunately, we are not doomed to be helpless victims of data deluge. The challenge of finding something efficiently can be met by organizing the space in which to search. After all, your mother was right when she told you to clean up your room.1 An effective organization of the search space involves one or both of the following principles: (1) partition the space into disjoint areas and place the searchable items into these areas, and (2) arrange the searchable items in some order (within their areas).
For example, the partitioning principle may lead to placing all books on a shelf, all papers into a file drawer, and all notes into a ring binder. If you then have to find a note, you know that you only have to search in the ring binder and not in the file drawer or on the bookshelf. Partitioning is important because it allows you to effectively limit the search space to a more manageable size. This principle can also be applied on multiple levels to make a search even more focused. For example, books can be further grouped according to their subject, or papers can be grouped by the year in which they were written.
Of course, partitioning works only under two important conditions. First, the categories used for subdividing the search space must be discernible for the item you are looking for. Suppose, for example, you are looking for Franz Kafka’s The Metamorphosis. Did you place it under “Fiction” or under “Philosophy”? Or maybe you placed it with your other entomology books, or with comics about transforming superheroes, such as The Hulk or some of the X-Men. Second, this strategy can work only if the search partition is maintained correctly at all times. For example, if you take a note out of the ring binder, and instead of putting it back immediately after using it, keep it on your desk or stuff it in the bookshelf, you won’t be able to find it in the binder the next time you’re looking for it.
Maintaining an order, the second principle to support searching, finds applications in many different situations, from using printed dictionaries to holding a hand of playing cards. The order among the searchable items makes it easier and faster to find items. The method we employ effortlessly in searching a sorted collection is called binary search. We pick an item somewhere in the middle of the collection and then continue searching to the left or right, depending on whether the item we’re looking for is smaller or larger, respectively. It is also possible to combine the partitioning and ordering principles, which often provides some helpful flexibility. For example, while we can keep the books on the shelf ordered by author name, we can order the papers by date or the notes by their main subject.
A closer look at the two principles reveals that they are related. Specifically, the idea of keeping things in order implies the idea of partitioning the search space, applied rigorously and recursively. Each element divides the collection space into two areas containing all smaller and larger elements, and each of those areas is again organized in the same way. Maintaining an ordered arrangement requires real effort. The interesting question from a computing perspective is whether the payoff of faster search justifies the effort of maintaining order. The answer depends on how large the collection is and how often one has to search it.
The need for searching arises not only in an office. Other environments, such as kitchens, garages, or hobby rooms, are subject to the same challenges, frustrations, and corresponding solutions. Searching is also called for in situations not usually perceived as such. In chapter 5, Indiana Jones’s quest for the Holy Grail in The Last Crusade provides several obvious and not-so-obvious examples of searching.
5
The Search for the Perfect Data Structure
Data types discussed in chapter 4 capture specific access patterns to collections of data. Since collections can grow very large, an important practical consideration is how to manage them efficiently. We have already seen that different data structures can support the efficiency of specific operations and that they have different space requirements. While building and transforming collections are important tasks, finding items in a collection is probably the most frequently needed operation.
We are searching for things all the time. Often this happens unconsciously, but sometimes we are made painfully aware of it, such as when a mundane activity like getting your car keys can turn into an agonizing search. The more places we have to search and the more items we have to go through, the more difficult it is to find what we’re looking for. And we tend to accumulate lots of artifacts over the years—after all, we are descendants of hunters and gatherers. In addition to collector’s items such as stamps, coins, or sports cards, we also hoard lots of books, pictures, or clothes over time. Sometimes this happens as a side effect of some hobby or passion—I know several home improvement aficionados who have amassed an impressive collection of tools.
If your bookshelf is organized alphabetically or by topic, or if your picture collection is electronic and has location and time tags, finding a particular book or picture might be easy, but if the number of items is large and lacks any form of organization, such a search can become arduous.
The situation gets much worse when it comes to electronically stored data. Since we have access to basically unlimited storage space, the size of stored data grows rapidly. For example, according to YouTube, 300 hours of video are uploaded to its site every minute.1
Searching is a prevalent problem in real life, and it is also an important topic in computer science. Algorithms and data structures can speed up the search process considerably. And what works with data can sometimes help with storing and retrieving physical artifacts. There is indeed a very simple method that relieves you from ever having to search for your car keys again—if you follow the method meticulously, that is.
The Key to Fast Search
In the story The Last Crusade, Indiana Jones embarks on two major searches. First, he tries to find his father, Henry Jones, Sr., and then both of them
together try to find the Holy Grail. As an archaeology professor, Indiana Jones knows a thing or two about searching. In one of his lectures he actually explains to his students,
Archaeology is the search for facts.
How does a search for an archaeological artifact—or a person for that matter—work? If the location of the sought object is known, no search is of course required. Simply go there and find it. Otherwise, the search process depends on two things: the search space and potential clues or keys that can narrow down the search space.
In The Last Crusade, Indiana Jones receives his father’s notebook containing information about the Holy Grail by mail from Venice, which causes him to start his search there. The notebook’s origin is a clue for Indiana Jones that narrows down the initial search space considerably—from all of earth to one city. In this example the search space is literally the two-dimensional geometric space, but in general the term search space means something more abstract. For example, any data structure for representing a collection of items can be viewed as a search space in which one can search for particular items. Sherlock Holmes’s list of suspects is such a search space in which one can look for a particular name.
How well are lists suited for searching? As I have discussed in chapter 4, in the worst case we have to inspect all elements in the list before we find the desired element or learn that the element is not contained in the list. This means that searching through a list makes no effective use of clues for narrowing down the search space.
To understand why a list is not a good data structure for searching, it is instructive to take a closer look at how clues actually work. In the two-dimensional space a clue provides a boundary that separates an “outside,” which is known to not contain the element, from an “inside,” which may contain it.2 Similarly, for a data structure to make use of clues it must provide a notion of boundary that can separate different parts of the data structure and thus restrict the search to one of these parts. Moreover, a clue or a key is a piece of information that is connected to the object that is sought. The key must be able to identify the boundary between elements that are relevant for the current search and those that aren’t.
In a list each element is a boundary that separates the elements preceding it from those that follow it. However, since we never directly access elements in the middle of a list and instead always traverse a list from one end to another, list elements are effectively unable to separate an outside from an inside. Consider the first element in a list. It does not rule out any element from the search except itself, because if it is not the element we are looking for, we have to continue the search through all the remaining elements of the list. But then, when we inspect the second list element, we’re faced with the same situation. If the second element is not what we are looking for, we again have to continue the search through all the still remaining elements. Of course, the second element defines as outside the first element of the list, which we do not have to check, but since we have looked at it already, this does not mean any savings in search effort. And the same is true for each list element: the outside it defines has to be inspected to reach the element.
After he has arrived in Venice, Indiana Jones’s search continues in a library. Searching for a book in a library provides a nice practical illustration for the notions of boundary and key. When book titles are placed on shelves grouped by the last name of the author, each shelf is often labeled with the names of the authors of the first and last book on the shelf. The two names define the range of authors whose books can be found on the shelf. These two names effectively define a boundary that separates the books from authors within the range from all other books. Suppose we are trying to find The Hound of the Baskervilles, by Arthur Conan Doyle, in a library. We can use the author’s last name as a key to first identify the shelf whose name range contains it. Once we have found that, we continue searching for the book on that shelf. This search happens in two phases, one for finding the shelf and another to find the book on the shelf. This strategy works well, since it divides the search space into a number of small, nonoverlapping areas (the shelves) using author name boundaries.
The search through the shelves can be carried out in different ways. One simple approach is to inspect the shelves one by one until the shelf with the range containing Doyle is found. This approach treats the shelves effectively as a list and consequently has the same shortcomings. In the case of Doyle, the search might not take too long, but if we are looking for a book by Yeats, it will take much longer. Few people would actually start their search for Yeats at the A shelf but instead begin closer to the Z shelf and thus find the correct shelf more quickly. This approach is based on the assumption that all shelves are ordered by the names of authors whose books they carry and that if there were 26 shelves (and assuming one shelf per letter), one would start looking for Yeats on the twenty-fifth shelf and for Doyle on the fourth. In other words, one would treat the shelves as an array that is indexed by letters and whose cells are shelves. Of course, it is unlikely that the library contains exactly 26 shelves, but the approach scales to any number of shelves; one would simply start the search for Yeats in the last one-thirteenth of the shelves. A problem with this approach is that the number of authors whose names start with the same letter varies greatly (there are more authors whose names start with an S than with an X). In other words, the distribution of books is not uniform across all shelves, and thus this strategy is not exact. Therefore, one generally has to go back and forth to locate the target shelf. Exploiting knowledge about the distribution of author names, one could increase the precision of this method substantially, but even the simple strategy works well in practice and is effective in excluding as “outside” a large number of shelves that need not be considered at all.
The search for the book within one shelf can proceed again in different ways. One can scan the books one by one or apply a similar strategy to the one for locating shelves by estimating the location of the book based on the clue’s position in the range of authors for that particular shelf. Overall, the two-phase method works pretty well and is much faster than the naive approach of looking at all books one by one. I performed an experiment and searched for The Hound of the Baskervilles in the Corvallis Public Library. I was able to locate the correct shelf in five steps and the book on the shelf in another seven steps. This is quite an improvement over the naive method, given that at the time the library carried in its adult fiction section 44,679 books on 36 shelves.
The decision by Indiana Jones to travel to Venice in search of his father is based on a similar strategy. In this case the world is viewed as an array whose cells correspond to geographic regions that are indexed by city names. The return address on the mail containing Henry Jones, Sr.’s notebook serves as a clue that picks out the cell indexed by “Venice” to continue the search. In Venice, the search leads Indiana Jones to a library, where he is looking not for a book but for the tomb of Sir Richard, one of the knights of the first crusade. He finds the tomb after crushing a floor tile marked with an X, which is not without irony, given his earlier proclamation to his students,
X never, ever marks the spot.
The key to a fast search is to have a structure that allows the search space to be effectively narrowed down quickly. The smaller the “inside” identified by a key, the better, because it allows the search to converge faster. In the case of the book search there are two search phases separated by one major narrowing-down step.
Survival by Boggle
Search often comes in disguise and in the most unexpected circumstances. Toward the end of The Last Crusade, Indiana Jones arrives in the Temple of the Sun, where he has to overcome three challenges before he can finally enter the room that contains the Holy Grail. The second challenge requires him to cross a tiled floor over an abyss. The problem is that only some of the tiles are safe while others crumble and lead to a gruesome death when stepped on. The floor consists of 50 or so tiles, arranged in a nonregular g
rid, each marked with a letter of the alphabet. Life as an archaeologist is tricky: one day you have to crush a floor tile to make progress; on other days you have to avoid it at all costs.
Finding those tiles that are safe to step on is not an easy task, since without any further restrictions the number of possibilities is huge: more than one thousand trillion—1 followed by 15 zeros. The clue to finding a viable sequence of tiles that leads safely to the other side of the floor is that the tiles should spell the name Iehova. Although this information basically solves the puzzle, the identification of the correct sequence of tiles still requires some work. Surprisingly, it involves a search that systematically narrows down the space of possibilities.
This task is similar to playing Boggle, where the goal is to find strings of connected letters on a grid that form words.3 Indiana Jones’s task seems to be much easier because he knows the word already. However, in Boggle, consecutive characters must be on adjacent tiles, which is not a constraint for the tile floor challenge and thus leaves more possibilities for Indiana Jones to consider and makes it harder again.
To illustrate the search problem to be solved, let us assume for simplicity that the tile floor consists of six rows, each containing eight different letters, which yields a grid of 48 tiles. If the correct path consists of one tile in each row, there are 8 × 8 × 8 × 8 × 8 × 8 = 262,144 possible paths through all possible combination of tiles in the six rows. Of these, only one is viable.
How does Indiana Jones now find the path? Guided by the clue word, he finds the tile that contains the letter I in the first row and steps on it. This identification is itself a search process that involves multiple steps. If the letters on the tiles do not appear in alphabetical order, Indiana Jones has to look at them one by one until he finds the tile with the letter I. He then steps on it, and continues searching for the tile with the letter e in the second row of tiles, and so forth.