Once Upon an Algorithm

  First Things First

  For his day job, Indiana Jones works as a college professor, which means that he faces issues related to grading exams. Thus sorting is relevant for him. Moreover, as an archaeologist he has to keep collections of physical artifacts and organize notes and observations made over the course of his excursions. Like the documents in an office, these can benefit greatly from being sorted, because order makes the search for particular items more efficient. Indiana Jones’s adventures illustrate another application of sorting, which arises whenever he makes a plan for how to accomplish a complex task. A plan is the arrangement of a set of actions in the right order.

  Indiana Jones’s task in Raiders of the Lost Ark is to find the ark containing the Ten Commandments. The ark is rumored to be buried inside a secret chamber in the ancient city of Tanis. To find the ark, Indiana Jones has to find this secret chamber, called the Well of Souls. The chamber’s location can be discovered in a model of Tanis, itself located in a map room. By placing a special golden disc at a particular location in the map room, sunlight will be focused on the location of the Well of Souls in the model of Tanis, thus revealing the place where the Well of Souls can be found. The golden disc was originally in the possession of Indiana Jones’s former teacher and mentor, Professor Ravenwood, but was later given to his daughter, Marion Ravenwood. Thus, to find the ark, Indiana Jones has to perform several tasks, including the following:

  Locate the Well of Souls (Well).

  Find the map room (Map).

  Obtain the golden disc (Disc).

  Use the disc to focus the sunbeam (Sunbeam).

  Find the possessor of the golden disc (Marion).

  In addition to performing these tasks, Indiana Jones has to travel between different places: to Nepal to find Marion Ravenwood and to the city of Tanis in Egypt, where the ark is hidden in the Well of Souls.

  It is not difficult to find the correct ordering for this particular set of tasks. The interesting question, from a computing perspective, is what different algorithms exist for solving the problem and what the runtimes of these algorithms are. There are two approaches that most people will probably apply when asked to order the sequence. Both methods start with an unsorted list and repeatedly move elements around until the list is sorted. The difference between the two methods can be best illustrated by describing how to move elements from an unsorted list into a sorted list.

  Figure 6.1 Selection sort repeatedly finds the smallest element in an unsorted list (left of vertical bar) and appends it to the sorted list (right). Elements are not sorted by name but according to the dependency of tasks they represent.

  The first method, illustrated in figure 6.1, is to repeatedly find the smallest element in the unsorted list and append it to the sorted list. To compare actions, one action is considered to be smaller than another if it can come before it. Consequently, the smallest action is one that does not have to be preceded by any other action. At the beginning, all elements are still to be processed, and the sorted list is empty. Since the first task is to travel to Nepal, Nepal is the smallest element and is thus the first element to be selected and appended to the sorted list. The next step is to find the possessor of the disc, that is, to find Marion. Next Indiana Jones has to obtain the disc, travel to Tanis, and uncover the map room, which is reflected in the second-to-last line in figure 6.1. Finally, Indiana Jones has to focus the sunbeam using the disc to reveal the location of the Well of Souls, where he can then find the ark. The resulting sorted list represents the plan for Indiana Jones’s adventure.

  As the figure illustrates, the algorithm is finished when the list of unsorted elements becomes empty, since in that case all elements have been moved to the sorted list. This sorting algorithm is called selection sort because it is based on repeatedly selecting an element from an unsorted list. Note that the method works just as well the other way around, that is, by repeatedly finding the largest element and adding it to the beginning of the sorted list.

  Figure 6.2 Insertion sort repeatedly inserts the next element from the unsorted list (left of the vertical bar) into the sorted list (right).

  It may seem obvious, but how can we actually find the smallest element in a list? A simple method is to remember the value of the first element and compare it with the second, third, and so on, until we find an element that is smaller. In that case, we take the value of that element and continue comparing and remembering the currently smallest element until we reach the end of the list. This method takes linear time in the length of the list in the worst case, since the smallest element can be at the end of the list.

  Most of the effort in selection sort is expended on finding the smallest element in an unsorted list. Even though the unsorted list shrinks by one element in each step, the overall runtime of the sorting algorithm is still quadratic, since we have to traverse, on average, a list that contains half the elements. We saw the same pattern when analyzing Hansel’s algorithm in chapter 2 for fetching the next pebble from home: the sum of the first n numbers is proportional to the square of n.1
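The method can be sketched as a short program. This is my own illustrative sketch in Python (the book gives no code); it assumes the elements can be compared directly with the < operator, standing in for the "can come before" comparison of actions:

```python
# A sketch of selection sort: repeatedly find the smallest element in
# the unsorted list and append it to the sorted list.

def selection_sort(unsorted):
    unsorted = list(unsorted)      # work on a copy of the input
    result = []                    # the sorted list, initially empty
    while unsorted:                # done when the unsorted list is empty
        # Find the smallest element by a linear scan: remember the
        # currently smallest value and compare it with the rest.
        smallest = unsorted[0]
        for x in unsorted[1:]:
            if x < smallest:
                smallest = x
        unsorted.remove(smallest)  # remove it from the unsorted list ...
        result.append(smallest)    # ... and append it to the sorted list
    return result
```

The inner scan is what makes the overall runtime quadratic: it traverses, on average, half of the shrinking unsorted list on every step.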

  The other popular method for sorting takes an arbitrary element (usually the first) from the unsorted list and places it at the correct position in the sorted list. The element is inserted after the largest element in the sorted list that is smaller than the element to be inserted. In other words, the insertion step traverses the sorted list to find the last element that has to precede the element to be inserted.

  The effort in this method is spent on the insertion of elements, not their selection, which gives the method the name insertion sort (see figure 6.2). Insertion sort is the preferred sorting method of many people who play cards. Having a pile of dealt cards in front of them, they pick up one after the other and insert it into an already sorted hand.
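Insertion sort can be sketched in the same style (again my own Python sketch, assuming <-comparable elements):

```python
# A sketch of insertion sort: take each element from the unsorted list
# and traverse the sorted list to find its proper position.

def insertion_sort(unsorted):
    result = []                    # the sorted list, initially empty
    for x in unsorted:             # take elements one after the other
        # Traverse the sorted list past all elements smaller than x.
        pos = 0
        while pos < len(result) and result[pos] < x:
            pos += 1
        result.insert(pos, x)      # insert x after the last smaller element
    return result
```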

  The difference between insertion sort and selection sort can best be seen when the Sunbeam element is moved from the unsorted list into the sorted list. As figure 6.2 shows, the element is simply removed from the unsorted list, without the need for any search, and is inserted into the sorted list by traversing it until its proper position between Map and Well is found.

  This example also demonstrates a subtle difference in the runtime of the two algorithms. While selection sort has to traverse one of the two lists completely for every selected element, insertion sort has to do this only when the element to be inserted is larger than all elements currently in the sorted list. In the worst case, when the list to be sorted is already sorted to begin with, every element will be inserted at the end of the sorted list, and insertion sort has the same runtime as selection sort. In contrast, when the list is inversely sorted, that is, from largest to smallest element, every element will be inserted at the front of the sorted list, which leads to a linear runtime for insertion sort. One can show that on average insertion sort still has quadratic runtime. Insertion sort can thus be much faster than selection sort in some cases and never does worse.

  Why does insertion sort have a better runtime in some cases than selection sort, even though they both operate similarly? The key difference is that insertion sort makes use of the result of its own computation. Since new elements are inserted into an already sorted list, insertions do not always have to traverse the whole list. In contrast, selection sort always appends to the sorted list, and since the selection process cannot exploit the sorting, it always has to scan the whole unsorted list. This comparison illustrates the important computer science design principle of reuse.

  Apart from efficiency, do the two sorting methods differ in how well they are suited for the problem of arranging actions into a plan? Neither method is ideal, because a particular difficulty in this example problem is not the process of sorting but the decision about which actions precede other actions. If there is uncertainty about the exact ordering of elements, selection sort seems least attractive, since in the very first step one already has to decide how a tentative smallest element compares to all other elements. Insertion sort is better, since one can pick an arbitrary element in the first step without the need to perform any comparisons and thus already obtains a sorted one-element list. However, in the steps that follow, each selected element has to be compared to more and more elements in the growing sorted list to find its correct position. While in the beginning the number of difficult comparisons might be smaller for insertion sort than for selection sort, the algorithm might force decisions that one might not be able to make yet. Is there a method that gives a sorter more control over which elements to compare?

  Split as You Please

  Preferably, a sorting method would allow one to delay difficult decisions and start with those that are easy. In the case of Indiana Jones’s plan for locating the Lost Ark, for example, it is clear that the Well of Souls and the map room are in Tanis, so every action connected to those two locations must come after the step of traveling to Tanis, and everything else must come before.

  By dividing the list elements into those that come before and after a separating element, which is called a pivot, we have split one unsorted list into two unsorted lists. What have we gained from doing this? Even though nothing is sorted yet, we have accomplished two important goals. First, we have decomposed the problem of sorting one long list into that of sorting two shorter lists. Problem simplification is often a crucial step toward solving a problem. Second, once the two subproblems are solved, that is, after the two unsorted lists are sorted, we can simply append them to get one sorted list. In other words, the decomposition into subproblems facilitates constructing the solution for the overall problem from the two subproblems.

  This is a consequence of how the two unsorted lists are formed. Let’s label by S the list of elements that are smaller than Tanis, and by L the list of those that are larger than Tanis. We then know that all elements in S are smaller than all elements in L (but S and L are not sorted yet). Once the lists S and L have been sorted, the list that results from appending S, Tanis, and L will also be sorted. Therefore, the last task is to sort the lists S and L. Once this is done, we can simply append the results. How are these smaller lists to be sorted? We can choose any method we like. We can recursively apply the method of splitting and joining, or if the lists are small enough, we can employ a simple method, such as selection sort or insertion sort.

  In 1960 the British computer scientist Tony Hoare (formally, Sir Charles Antony Richard Hoare) invented this sorting method, which is called quicksort. Figure 6.3 illustrates how quicksort works in generating Indiana Jones’s plan for locating the Lost Ark. In the first step, the unsorted list is split into two lists, separated by the pivot element, Tanis. In the next step, the two unsorted lists have to be sorted. Since each of them contains only three elements, this can be easily done using any algorithm.
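The split-and-append scheme can be sketched as follows (my own Python sketch, not the book's; for simplicity it always picks the first element as pivot, a choice the discussion below revisits):

```python
# A sketch of quicksort: split the list around a pivot, sort the two
# sublists recursively, and append the results.

def quicksort(items):
    if len(items) <= 1:
        return items               # empty and one-element lists are sorted
    pivot = items[0]               # simple (and risky) pivot choice
    smaller = [x for x in items[1:] if x < pivot]    # elements before pivot
    larger = [x for x in items[1:] if x >= pivot]    # elements after pivot
    # Sorting the sublists and appending them yields a sorted list,
    # because everything in `smaller` precedes everything in `larger`.
    return quicksort(smaller) + [pivot] + quicksort(larger)
```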

  For illustration, let’s sort the sublist of elements smaller than Tanis using quicksort. If we pick Nepal as the separating element, we obtain the unsorted list Disc → Marion when gathering the elements that are larger than Nepal. Correspondingly, the list of elements smaller than Nepal is empty, which is trivially sorted already. To sort the two-element list, we simply have to compare the two elements and invert their position, which yields the sorted list Marion → Disc. Then we append the empty list, Nepal, and Marion → Disc and obtain the sorted sublist Nepal → Marion → Disc. The sorting works similarly if we pick any of the other elements as pivots. If we pick Disc, we again obtain an empty list and a two-element list that needs to be sorted, and if we pick Marion, we obtain two single-element lists that are both already sorted. Sorting the sublist of elements larger than Tanis is analogous.

  Figure 6.3 Quicksort splits a list into two sublists of elements that are smaller and larger, respectively, than a chosen pivot element. Then those two lists are sorted, and the results are appended together with the pivot element to form the resulting sorted list.

  Once the sublists are sorted, we can append them with Tanis in the middle and obtain the final result. As figure 6.3 shows, quicksort converges surprisingly quickly. But is this always the case? What is the runtime of quicksort in general? It seems we got lucky by picking Tanis as the pivot in the first step, since Tanis divides the list into two lists of equal size. If we can always pick a pivot with that property, the sublists will always be cut in half, which means that the number of division levels will be proportional to the logarithm of the original list’s length. What does this mean for the overall runtime of quicksort in this and the general case?

  The first iteration, in which we split one list into two, takes linear time, since we have to inspect all the elements in the list. In the next iteration we have to split two sublists, which again takes altogether linear time because the total number of elements in both lists is 1 less than the original list,2 no matter where we split and how long each of the two sublists is. And this scheme continues: each level has no more elements than the previous one and thus takes linear time to split. Altogether we accumulate linear time for each level, which means that the runtime of quicksort depends on the number of levels needed for splitting sublists. Once the original list has been completely decomposed into single elements, the splitting process ensures that all these elements are in the correct order and can simply be concatenated into a sorted result list. This takes, once more, linear effort, so the total runtime for quicksort is given by the sum of the linear times for all levels.

  Figure 6.4 Comparison of linearithmic, linear, and quadratic runtimes.

  In the best case, when we can split each sublist roughly in half, we obtain a logarithmic number of levels. For example, a list with 100 elements leads to seven levels, a list with 1,000 elements leads to ten levels, and a list with 1,000,000 elements can be completely decomposed in only 20 levels.3 For the total runtime this means that in the case of 1,000,000 elements, we have to spend the linear effort of scanning a list of 1,000,000 elements only 20 times. This leads to a runtime on the order of tens of millions of steps, which is much better than a quadratic runtime, which means hundreds of billions to trillions of steps. This runtime behavior, where the runtime grows proportionally to the size of the input times the logarithm of the size, is called linearithmic. It is not quite as good as linear runtime, but it is much better than quadratic runtime (see figure 6.4).
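The arithmetic behind these estimates can be checked directly (a small Python calculation of my own, not from the book):

```python
import math

# Comparing linearithmic and quadratic growth for 1,000,000 elements:
# the number of splitting levels is the base-2 logarithm, rounded up.
n = 1_000_000
levels = math.ceil(math.log2(n))   # about 20 levels
linearithmic_steps = n * levels    # tens of millions of steps
quadratic_steps = n * n            # a trillion steps
```

With these numbers, the linearithmic estimate is 20,000,000 steps against 1,000,000,000,000 for the quadratic one, a factor of 50,000.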

  Quicksort has linearithmic runtime in the best case. However, if we pick the pivots unwisely, the situation is quite different. For example, if we choose Nepal instead of Tanis as the first pivot, the sublist of smaller elements will be empty, and the sublist of larger elements will contain all elements except Nepal. If we next choose Marion as the pivot of that sublist, we run into the same situation where one list is empty and the other is only one element shorter than the sublist that is split. We can see that in this situation quicksort effectively behaves just like selection sort, which also repeatedly removes the smallest element from the unsorted list. And like selection sort, quicksort will in this case also have quadratic runtime. The efficiency of quicksort depends on the selection of the pivot: if we can always find an element in the middle, the splitting process returns evenly divided lists. But how can we find a good pivot? While it is not easy to guarantee good pivots, it turns out that picking a random element or the median of the first, middle, and last element works well on average.
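The median-of-three heuristic mentioned above can be sketched in a few lines (my own Python sketch; the names are illustrative):

```python
# A sketch of the median-of-three pivot heuristic: pick the median of
# the first, middle, and last element of the list as pivot.

def median_of_three(items):
    first = items[0]
    middle = items[len(items) // 2]
    last = items[-1]
    # Sorting the three candidates and taking the middle one yields
    # their median, a value unlikely to be an extreme of the list.
    return sorted([first, middle, last])[1]
```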

  The importance and impact of the pivot is closely related to the notion of boundary that was used in chapter 5 to explain the essence of effective search. In the case of searching, the purpose of the boundary is to divide the search space into an outside and an inside so that the inside is as small as possible to make the remainder of the search easier. In the case of sorting, the boundary should divide the sorting space into equal parts so that the sorting of each part is sufficiently simplified through the decomposition. Therefore, if the pivot is not chosen well, quicksort’s runtime deteriorates. But while quicksort has quadratic runtime in the worst case, it has linearithmic runtime on average, and it performs well in practice.

  The Best Is Yet to Come

  Are there even faster sorting methods than quicksort, for example, an algorithm with linearithmic or better runtime in the worst case? Yes. One such algorithm is mergesort, which was invented by the Hungarian-American mathematician John von Neumann in 1945.4 Mergesort splits the unsorted list into two and thus works by decomposing the problem into smaller parts, similarly to quicksort. However, mergesort does not compare elements in this step; it just divides the list into two equal parts. Once these two sublists are sorted, they can be merged into one sorted list by traversing them in parallel and comparing the elements from the two sublists one by one. This works by repeatedly comparing the first elements of the two sublists and taking the smaller one. Since both those sublists are sorted, this ensures that the merged list is sorted, too. But how can the two sublists that result from the splitting step be sorted? This is done by recursively applying mergesort to both lists. Mergesort is illustrated in figure 6.5.

  While quicksort and mergesort are great algorithms to program an electronic computer with, they are not so easy for the unaided human memory to use, because they require a fair amount of bookkeeping. In particular, for larger lists, both algorithms have to maintain a potentially large collection of small lists; in the case of quicksort, those lists also have to be kept in the right order. Thus, Indiana Jones, like the rest of us, would probably stick to some form of insertion sort, perhaps augmented by a smart selection of elements, unless the size of the list to be sorted demands a more efficient algorithm. For example, whenever I teach a large undergraduate class, I use a variant of bucket sort for sorting exams by name. I place exams into different piles (called buckets) according to the first letter of the student’s last name and keep each bucket sorted by using insertion sort. After all exams have been placed into their buckets, the buckets are appended in alphabetical order to produce an ordered list. Bucket sort is similar to counting sort (discussed later).
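The exam-sorting method described above can be sketched as follows (my own Python sketch; the use of the first letter of a name as the bucket key is taken from the description, everything else is illustrative):

```python
# A sketch of bucket sort for names: place each name into a bucket by
# its first letter, keep each bucket sorted with insertion sort, and
# finally append the buckets in alphabetical order.

def bucket_sort(names):
    buckets = {}                           # one bucket per first letter
    for name in names:
        bucket = buckets.setdefault(name[0], [])
        # Insert the name at its proper position within the bucket
        # (insertion sort, as with a hand of cards).
        pos = 0
        while pos < len(bucket) and bucket[pos] < name:
            pos += 1
        bucket.insert(pos, name)
    result = []
    for letter in sorted(buckets):         # append buckets alphabetically
        result += buckets[letter]
    return result
```

Splitting the work across buckets keeps each insertion-sorted list short, which is what makes the method practical for a large stack of exams.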