The original version of this story appeared in Quanta Magazine.
Computer scientists often deal with abstract problems that are hard to comprehend, but an exciting new algorithm matters to anyone who owns books and at least one shelf. The algorithm addresses something called the library sorting problem (more formally, the “list labeling” problem). The challenge is to devise a strategy for organizing books in some kind of sorted order—alphabetically, for instance—that minimizes how long it takes to place a new book on the shelf.
Imagine, for example, that you keep your books clumped together, leaving empty space on the far right of the shelf. Then, if you add a book by Isabel Allende to your collection, you might have to move every book on the shelf to make room for it. That would be a time-consuming operation. And if you then get a book by Douglas Adams, you’ll have to do it all over again. A better arrangement would leave unoccupied spaces distributed throughout the shelf—but how, exactly, should they be distributed?
This problem was introduced in a 1981 paper, and it goes beyond simply providing librarians with organizational guidance. That’s because the problem also applies to the arrangement of files on hard drives and in databases, where the items to be arranged could number in the billions. An inefficient system means significant wait times and major computational expense. Researchers have invented some efficient methods for storing items, but they’ve long wanted to determine the best possible way.
Last year, in a study that was presented at the Foundations of Computer Science conference in Chicago, a team of seven researchers described a way to organize items that comes tantalizingly close to the theoretical ideal. The new approach combines a little knowledge of the bookshelf’s past contents with the surprising power of randomness.
“It’s a very important problem,” said Seth Pettie, a computer scientist at the University of Michigan, because many of the data structures we rely upon today store information sequentially. He called the new work “extremely inspired [and] easily one of my top three favorite papers of the year.”
Narrowing Bounds
So how does one measure a well-sorted bookshelf? A common way is to see how long it takes to insert an individual item. Naturally, that depends on how many items there are in the first place, a value typically denoted by n. In the Isabel Allende example, when all the books have to move to accommodate a new one, the time it takes is proportional to n. The bigger the n, the longer it takes. That makes this an “upper bound” to the problem: It will never take longer than a time proportional to n to add one book to the shelf.
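To make that cost concrete, here is a minimal Python sketch (illustrative only, not code from the paper) of the clumped-shelf strategy: the shelf is a packed, sorted list, so inserting a book shifts every book to its right, and in the worst case that means all n of them. The book titles and the helper name are made up for the example.

```python
# Illustrative sketch: a packed shelf with no internal gaps (not from the paper).
import bisect

def insert_packed(shelf, book):
    """Insert `book` into the packed, sorted list `shelf`; return how many books moved."""
    pos = bisect.bisect_left(shelf, book)   # where the new book belongs alphabetically
    moved = len(shelf) - pos                # every book to the right shifts one slot over
    shelf.insert(pos, book)
    return moved

shelf = ["Borges", "Cervantes", "Dickens", "Eco", "Flaubert"]
print(insert_packed(shelf, "Allende"))  # 5: all five books had to move
print(insert_packed(shelf, "Adams"))    # 6: and now all six move again
```

Counting moved books this way is exactly what the n in that upper bound measures: with no gaps, a single unlucky insertion can touch the entire shelf.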
The authors of the 1981 paper that ushered in this problem wanted to know if it was possible to design an algorithm with an average insertion time much less than n. And indeed, they proved that one could do better. They created an algorithm that was guaranteed to achieve an average insertion time proportional to (log n)². This algorithm had two properties: It was “deterministic,” meaning that its decisions did not depend on any randomness, and it was also “smooth,” meaning that the books must be spread evenly within subsections of the shelf where insertions (or deletions) are made. The authors left open the question of whether the upper bound could be improved even further. For over four decades, no one managed to do so.
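The code below is a simplified illustration of that “smooth” even-spreading idea, not the 1981 algorithm itself, and it makes no claim to the (log n)² guarantee. Books sit in an array with empty slots; an insertion rewrites only the smallest window around the target position that still has a free slot, spreading that window’s books evenly across it. The function names, the linear search, and the sample shelf are simplifications for readability.

```python
# Simplified illustration of the "smooth" even-spreading idea; not the 1981 algorithm.

def find_position(slots, book):
    """Index of the first occupied slot holding a book >= `book` (len(slots) if none).
    A real implementation would binary-search over the occupied slots instead."""
    for i, b in enumerate(slots):
        if b is not None and b >= book:
            return i
    return len(slots)

def insert_smooth(slots, book):
    """Insert `book` in sorted order, rewriting only a small window of slots."""
    n = len(slots)
    pos = find_position(slots, book)
    lo = max(0, min(pos, n - 1))       # start with a one-slot window at the target
    hi = lo + 1
    # Grow the window until it contains at least one empty slot to absorb the new book.
    while sum(1 for x in slots[lo:hi] if x is not None) == hi - lo:
        if lo == 0 and hi == n:
            raise RuntimeError("the shelf is completely full")
        lo, hi = max(0, lo - 1), min(n, hi + 1)
    # Gather the window's books plus the new one, then spread them evenly
    # back across the window so future insertions find room nearby.
    books = sorted([x for x in slots[lo:hi] if x is not None] + [book])
    stride = (hi - lo) / len(books)
    slots[lo:hi] = [None] * (hi - lo)
    for k, b in enumerate(books):
        slots[lo + int(k * stride)] = b
    return lo, hi                      # the window that was rewritten

shelf = ["Adams", None, "Allende", None, "Borges", None, "Eco", None]
print(insert_smooth(shelf, "Dickens"))  # (5, 8): only three slots were rewritten
print(shelf)  # ['Adams', None, 'Allende', None, 'Borges', 'Dickens', 'Eco', None]
```

Because each insertion only disturbs a local window and leaves its books evenly spaced, most books on the shelf never move at all, which is the intuition behind the much better average insertion time.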
However, the intervening years did see improvements to the lower bound. While the upper bound specifies the maximum possible time needed to insert a book, the lower bound gives the fastest insertion time that any conceivable method could guarantee. To find a definitive solution to a problem, researchers strive to narrow the gap between the upper and lower bounds, ideally until they coincide. When that happens, the algorithm is deemed optimal—inexorably bounded from above and below, leaving no room for further refinement.