Once Upon an Algorithm Page 8
The sequence of documents on your desk and the line in a coffee shop is called a list data structure. Here the physical arrangement of the people in line ensures the ordering required for a queue. In contrast, the method for assigning consecutive numbers doesn’t require you to physically maintain the order of people or documents; they can be placed anywhere, since the numbers capture the correct ordering. This arrangement of assigning elements to numbered slots is called an array data structure. In addition to the numbered slots (realized by people carrying the numbers they have pulled), one also has to maintain two counters, one for the next available slot and one for the next number to be served.
By representing collections as data structures they become accessible to computation. The choice of data structure is important for the efficiency of algorithms and is sometimes influenced by other considerations, such as the available space. When Sherlock Holmes maintains information about a case, such as a collection of suspects, he is essentially using data types and data structures. I therefore continue using the story of The Hound of the Baskervilles to motivate and explain these concepts.
4
Detective’s Notebook: Accessory after the Fact
Computation is particularly useful when we have to deal with large amounts of data that cannot be handled in a few individual steps. In such cases an appropriate algorithm can ensure that all data are processed systematically and, in many cases, also efficiently.
Signs, discussed in chapter 3, illustrate how representation works for individual pieces of information and how this representation becomes part of computation. For example, Hansel and Gretel’s movements from pebble to pebble mean that they are in danger until they make the move from the last pebble to their home. But even though collections of signs are signs themselves, it is not clear how to compute with such collections. In the case of Hansel and Gretel, an individual pebble is a signifier for one location, and the collection of all pebbles signifies a path from danger to safety, but how is such a collection built and used systematically? The maintenance of a collection of data raises two questions.
First, in which order will data be inserted into, looked up in, and removed from a collection? The answer depends, of course, on the computational task that the collection is involved in, but we can observe that specific patterns of accessing the elements in a collection reoccur. Such a data access pattern is called a data type. For example, the pebbles employed by Hansel and Gretel are visited in the opposite order in which they were placed; such an access pattern is called a stack.
Second, how can a collection be stored so that the access pattern, or data type, is supported most efficiently? Here the answer depends on a wide variety of factors. For example, how many elements have to be stored? Is this number known in advance? How much space does each element require for storage? Do all elements have the same size? Any particular way of storing a collection is called a data structure. A data structure makes a collection amenable to computation. One data type can be implemented by different data structures, which means that a particular access pattern can be implemented through different ways of storing data. The difference between data structures lies in how efficiently they support specific operations on the collection. Moreover, one data structure can implement different data types.
This chapter discusses several data types, the data structures to implement them, and how they are used as part of computations.
The Usual Suspects
When the perpetrator of a crime is known (maybe there are eye witnesses and a confession), we don’t require the skills of a Sherlock Holmes. But when there are several suspects, we need to keep track of their motives, alibis, and other relevant information to investigate the case in detail.
In The Hound of the Baskervilles the suspects include Dr. Mortimer, Jack Stapleton and his presumed sister Beryl (who is really his wife), the escaped convict Seiden, Mr. Frankland, and the Barrymore couple, the servants of the late Sir Charles Baskerville. Before Watson leaves to visit Baskerville Hall, Sherlock Holmes instructs Watson to report all relevant facts but to exclude Mr. James Desmond from the suspects. When Watson suggests to Holmes to also exclude the Barrymore couple, Sherlock Holmes responds:
No, no, we will preserve them upon our list of suspects.1
This short exchange demonstrates two things.
First, even though Sherlock Holmes doesn’t know anything about data structures, he is using one, since he seems to have kept a list of suspects. A list is a simple data structure for storing data items by linking them together. A list provides a distinctive form of accessing and manipulating these data items. Second, the list of suspects is not a static entity; it grows and shrinks as new suspects are added or when suspects are cleared. Adding, removing, or otherwise changing items in a data structure requires algorithms that generally take more than one step, and it is the runtime of these algorithms that determines how well a specific data structure is suited for a particular task.
Because of their simplicity and versatility, lists are probably the most widely used data structure in computer science and beyond. We all use lists on a regular basis in the form of to-do lists, shopping lists, reading lists, wish lists, and all kinds of rankings. The order of the elements in a list matters, and the elements are typically accessed one by one, starting at one end and proceeding to the other. Lists are often written down vertically, one element per line, with the first element at the top. Computer scientists, however, write lists horizontally, presenting the elements from left to right, connected by arrows to indicate the order of the elements.2 Using this notation, Sherlock Holmes can write down his list of suspects as follows:
Mortimer → Jack → Beryl → Seiden → …
The arrows are called pointers and make the connection between list elements explicit, which becomes important when one considers how lists are updated. Assume Sherlock Holmes’s suspect list is Mortimer → Beryl and that he wants to add Jack between the two.
If the elements are written down as a vertical list of items without any empty space between them, he has to resort to some extra notation to make the position of the new element clear. The alternative is to simply write down a fresh copy of the complete new list. That would be, however, a huge waste of time and space. It would require time and space that is in the worst case quadratic in the size of the final list.
Pointers give us the flexibility to write down new elements wherever we find space and still place them at the proper position in the list by connecting them to their neighboring list elements. For example, we can place Jack at the end of the list, redirect the outgoing pointer from Mortimer to Jack, and add a pointer from Jack to Beryl.
In the context of the story, the order of the suspects in this list is arbitrary and has no meaning, but we can see that the creation of a list forces us to pick some order for its elements. It is a defining characteristic of a list that its elements are kept in a specific order.
The inspection of the list’s elements happens in the particular order given by the list. Thus to find out whether Selden is a suspect, we have to start at the beginning of the list and check the elements one by one by following the pointers. While it seems we can spot the element Selden directly in the list, this works only for relatively small lists. Since our visual field is limited, we cannot immediately recognize a specific element in a long list and thus have to resort to the one-element-at-a-time traversal of the list.
A physical analogy for a list is a ring binder that contains a sheet of paper for each element. To find a particular element in a ring binder one has to look at individual pages one by one, and one can insert new pages anywhere between other pages.
A salient property of lists is that the time it takes to find an element depends on where the element is located in the list. In this example, Selden would be found in the fourth step. In general, finding an element may require the traversal of the whole list, since the element could be last. In the discussion about runtime complexity in chapter 2, suc
h an algorithm was called linear, since the time complexity is directly proportional to the number of elements in the list.
As mentioned, it is not clear that Sherlock Holmes’s list actually contains the suspects in the particular order shown, and the fact that Selden comes after Beryl does not mean anything, since the purpose of the list is only to remember who is among the suspects. All that matters is whether a person is on the list.3 Does that mean that a list is not an adequate representation to remember suspects? Not at all; it just means that a list may contain information (such as the ordering of elements) that is not needed for a particular task. This observation suggests that a list is just one possible data structure for representing data about suspects and that there might be other representations that could be used for the same purpose—as long as they support the same operations for adding, removing, and finding elements. These operations express requirements for what has to be done with the data.
Such requirements for data, expressed through a collection of operations, is called a data type in computer science. The requirements for the suspect data are the ability to add, remove, and find elements. This data type is called a set.
Sets are widely applicable, since they correspond to predicates related to a problem or algorithm. For example, the set of suspects corresponds to the predicate “is a suspect,” which applies to people and which can be used to affirm or reject statements such as “Selden is a suspect,” depending on whether the person to which the predicate is applied is a member of the set. The pebble-tracing algorithm used by Hansel and Gretel also uses a predicate when it instructs, “Find a shining pebble that was not visited before.” The predicate here is “not visited before”; it applies to pebbles and can be represented by a set that is initially empty and to which pebbles are added after they have been visited.
While a data type describes requirements of what to do with data, a data structure provides a concrete representation to support these requirements. You can think of a data type as a description of a data management task and of a data structure as a solution for that task. (The following mnemonic may help with remembering the meaning of the two terms: A data type describes a task; a data structure describes a solution.) A data type is a more abstract description of data management than a data structure and has the advantage that some details can be left unspecified and thus can lead to succinct and general descriptions. In The Hound of the Baskervilles the set data type reflects the task of maintaining a collection of suspects without having to say in detail how to implement it. In the Hansel and Gretel story a set data type to remember visited pebbles is enough to describe the algorithm. However, to actually execute the operations stipulated by a data type, a computer needs to employ a concrete data structure that defines how the operations operate on the representation offered by the data structure. Also, only when a concrete data structure is selected for an algorithm can we determine the algorithm’s runtime complexity.
Since one data type can be implemented by different data structures, the question is which data structure one should choose. One would want the data structure that implements the data type operations with the best possible runtime so that the algorithm using the data structure runs as fast as possible. However, this is not always an easy decision because a data structure may support some operations well but not others. Moreover, data structures also have different space requirements. The situation is similar to choosing a vehicle for a particular transportation task. A bike is environmentally friendly, and you can’t beat the mileage it gets per gallon. But it is comparatively slow, can only transport one or two people, and has a limited range. You may need a van or even a bus for traveling with many people over longer distances. You pick a truck to transport large items, a sedan for traveling comfortably, and maybe a sports car when you’re a male in your fifties.
Coming back to the question of how to implement the set data type, two popular alternatives to using lists are the array and the binary search tree data structures. Binary search trees are discussed in detail in chapter 5; here I focus on arrays.
If a list is like a ring binder, an array is like a notebook, which has a fixed number of pages that are each uniquely identified. An individual field in an array data structure is called a cell, and a cell’s identifier is also called its index. Identifying (or indexing) cells is usually done using numbers, but we can also use letters or names as identifiers as long as we can directly open a particular page using the identifier.4 The importance of an array data structure lies in the fast access it provides to each individual cell. No matter how many cells the array contains, it takes just one step to access a cell. Any such operation that requires only one or a few steps, irrespective of the size of the data structure, is said to run in constant time.
To represent a set with a notebook, we assume that each page has a tab that is labeled by one of the possible members of the set. Thus to represent the suspects in The Hound of the Baskervilles we label the pages with all potential suspects’ names, that is, Mortimer, Jack, and so on. The notebook also contain pages for Desmond and other people that could in principle be suspects. This is different from the ring binder, which contains only actual suspects. Then to add, say, Selden, as a suspect we open the page labeled “Selden” and place some mark on it (for example, we write + or “yes” on it). To remove a suspect, we also open that person’s page, but remove the mark (or write – or “no” on it). To find out whether somebody is a suspect, we go to their page and check the mark. An array works in the same way. We directly access its cells using an index and read or modify the information stored in it.
The important difference between arrays and lists is that one can locate individual cells immediately in an array, whereas one has to scan the elements of a list from the beginning to find the element (or reach the end of the list in case the element is not in the list).
Since we can directly open a particular page in the notebook (or access a cell in an array), all three operations—adding, removing, and finding suspects—can be performed in constant time, which is optimal: we cannot possibly do any of this faster. Since the list data structure requires linear time for checking and removing suspects, the array data structure seems to be the winner hands down. So why are we even talking about lists?
The problem with arrays is their fixed size, that is, a notebook has a specific number of pages and cannot grow over time. This has two important implications. First, the notebook has to be chosen big enough to contain all possible suspects right from the beginning, even if many of them never become suspects. Thus we may waste a lot of space, and we may be carrying around a huge book with pages for hundreds or thousands of potential suspects, whereas in reality the set of suspects may be very small; it may very well contain fewer than ten suspects at any given time. This also means that the preparation of the notebook’s thumb index at the start can take a long time because it requires writing the name of each potential suspect on a different page. Second—and this is an even more serious problem—it might not be clear at the beginning of the mystery who all the potential suspects are. In particular, new potential suspects may become known as the story unfolds, which is certainly the case in The Hound of the Baskervilles. This lack of information prohibits the use of a notebook, since its initialization is impossible.
Figure 4.1 A data type can be implemented by different data structures. Inserting an element into a list can be done by simply adding it to the front of the list, but removal requires traversing the list to find it. With an array, insertion and removal are done by directly accessing the array cell indexed by an element and changing the mark accordingly. Arrays allow faster implementation, but lists are more space efficient.
This weakness of the bulky array is the strength of the nimble list, which can grow and shrink over time as needed and which never stores more elements than necessary. In choosing a data structure for implementing a set data type we have to be mindful of the following trade-off. An array provides a very fast impleme
ntation of the set operations but potentially wastes space and might not work in all situations. A list is more space efficient than an array and works under all circumstances, but it implements some operations less efficiently. The situation is summarized in figure 4.1.
Information Gathering
Identifying suspects is only one step in solving a murder mystery. To narrow down the set of suspects Sherlock Holmes and Dr. Watson need to gather specific information about them, such as motive or potential alibi. In the case of Selden, for example, this information includes the fact that he is an escaped convict. All this additional information should be stored together with each corresponding suspect. When using a notebook, Sherlock Holmes would add information about a person on the page reserved for that person.
The operations of the set data type cannot do that, but a few small changes to the operations for adding and finding elements are sufficient to make it happen. First, the operation for inserting elements takes two pieces of information: a key for identifying the information plus the additional information associated with the key. The key for information about a suspect is his or her name. Second, the operations for finding and removing suspects will only take the key as input. In the case of removing, a person’s name and all the additional information stored with it will be removed. In the case of finding a person, the information stored for that name will be returned as a result.
This minor but important extension of the set data type is called a dictionary because like a real dictionary it allows us to look up information based on a keyword, as when Sherlock Holmes uses a medical directory to find out about Dr. Mortimer’s professional history at the beginning of The Hound of the Baskervilles. A dictionary can be viewed as a collection of signs, and each key is a signifier for the information stored with it. The dictionary data type differs in two ways from a traditional printed dictionary. First, the content of a printed dictionary is fixed, whereas a dictionary data type can change—new definitions can be added, obsolete ones can be deleted, and existing ones can be updated. Second, the entries in a printed dictionary are alphabetically sorted by keys, whereas that is not required for a dictionary data type. The ordering of entries in a printed dictionary is necessary because the large number of entries makes direct access to particular pages impossible. Since a thumb index with access to each page would contain too many entries and would have to be too tiny to be of practical use, the sorted keys enable a user of the dictionary to find the entry using a search algorithm (see chapter 5).