Monday, January 19, 2009

Last time, we took a brute force approach to solving Project Euler problem seven. Unfortunately, the resulting prime number generator turned out to be fairly ugly and not really efficient enough to handle the needs of future YAPES1 problems. This time, we’ll approach the problem using a tried-and-true algorithm that should give us the desired performance and produce a beautiful solution. I can already hear some of you rolling your eyes, because we’ll be looking at...

The Sieve of Eratosthenes

Amazingly, the Sieve of Eratosthenes is perhaps the oldest known algorithm for generating prime numbers, and yet it remains reasonably efficient by today’s standards. It’s true that there are faster methods available, but this should be fast enough. The basic algorithm is nothing earth-shattering—it’s first semester computer science stuff. However, we’ll add a twist to keep things interesting.

The Sieve of Eratosthenes works by marking composite numbers:

  1. Start with the natural numbers from 2 up to some maximum limit.
  2. Take the first unmarked number (2) and mark all multiples of that number. Note that 2 remains unmarked. Only its multiples (e.g. 4,6,8,…) are marked.
  3. Take the next unmarked number and mark all of its multiples.
  4. Repeat step 3 until there are no more numbers left to test.
  5. When finished, the unmarked numbers are the primes.

As described, this algorithm doesn’t quite work for our purposes. It is certainly easy to implement, but because of the starting requirement (a big list of numbers), it doesn’t lend itself well to generating a, potentially, infinite stream of primes. However, with a small modification, we can make it work.

The adaptation we’ll implement comes from from the excellent paper, The Genuine Sieve of Eratosthenes2, by Melissa E. O’Neill. The idea is to turn the algorithm inside out. Instead of keeping track of which numbers are prime, we’ll maintain a table of composite numbers along with their known factors. As we iterate the natural numbers, the table will be updated with new information about which numbers are known to be composite and what factors have been found.

Confused? Let’s walk through the algorithm, step by step.

Tracking Composites

We start with 2 as the first natural number and check to see if it exists in the composite table. Since the table is empty, 2 isn’t present, so we add a new list containing 2 to the table with the square of 2 (4) as the key. Finally, since 2 was not found in the composite table, we know it is prime.

Sieve Diagram 1 (after 2)

We take the next natural number, 3. Like before, 3 is not found in the composite table, so we add it to the table using its square (9) as the key and return 3 as prime.

Sieve Diagram 2 (after 3)

We take the next natural number, 4. This time is different—4 is found in the composite table, so we know it’s not prime. In this case, we take the list found in the table under 4 (the known factors), and for each factor, we add 4 and insert the result into the table. Thus, we insert 6 into the table along with its known factor, 2. (Interestingly, this algorithm doesn’t tell us that 3 is a factor of 6). Finally, we remove 4 from the table.

Sieve Diagram 3 (after 4)

Let's skip ahead to some more interesting cases.

After 9 has been processed, the table looks like so:

Sieve Diagram 4 (after 9)

The next natural number, 10, is found in the table, so we know that it’s composite and not prime. Again, we take the list of known factors, and for each factor, we add 10 and insert the result into the table. This time, the value (12) already exists in the table, so we add it to the list of known factors for that value. Lastly, we remove 10 from the table.

Sieve Diagram 5 (after 10)

The next natural number, 11, is not found in the table, so we insert its square and return it as prime.

Sieve Diagram 6 (after 11)

We find the next natural number, 12, in the table. However, because there is more than one known factor in the table, we add two new composites, 12+2=14 and 12+3=15, before finally removing 12.

Sieve Diagram 7 (after 12)

As you can see, this take on the classic algorithm allows us to determine which numbers are prime and composite by using a minimal amount of memory. Obviously, the table can grow large as we find larger primes, but the memory usage is quite reasonable relative to the traditional sieve where a large list of natural numbers would have to be generated before even the first prime is returned.

Show Me Some Code!

The F# code nearly writes itself.

let reinsert x table prime =
   let comp = x+prime
   match Map.tryfind comp table with
   | None        -> table |> Map.add comp [prime]
   | Some(facts) -> table |> Map.add comp (prime::facts)

let rec sieve x table =
  seq {
    match Map.tryfind x table with
    | None ->
        yield x
        yield! sieve (x+1L) (table |> Map.add (x*x) [x])
    | Some(factors) ->
        yield! sieve (x+1L) (factors |> List.fold_left (reinsert x) (table |> Map.remove x)) }

let primes =
  sieve 2L Map.empty

The meat of the algorithm is in the sieve function. Here, we try to locate x in the table. If it is not found, we yield x (because it’s prime) and recursively call sieve passing x+1 and a table with the square of x added to it. (Note that Map is an immutable data structure, so “mutating” functions like Map.add actually return a brand new Map.) If x is found in the table, we recursively call sieve passing x+1 and a table from which x has been removed and each known factor has been reinserted. The reinsert function performs the trivial work of reinserting a known factor into the table.

A couple of other interesting details about this code are worth pointing out:

  • sieve returns a sequence expression, denoted by the seq keyword and the curly braces. Sequence expressions are lazily evaluated, meaning that the next prime won't be calculated until requested. In addition, the sequence expression is recursive due to the use of the yield! keyword, which can be used to return another sequence expression as part of a sequence.
  • The call to List.fold_left might look a bit exotic because we are folding over the list of known factors to produce a new table with each factor reinserted.

Mutation Isn't All Bad

Now that we have a working prime number generator, let’s take a moment to optimize. Until now, we’ve been sticking with the immutable F# Map type, but we could employ a mutable data structure to improve performance. To replace the data structure in a way that preserves the beauty of the algorithm, we’ll define a set of APIs for Dictionary<TKey,TValue> that mimic the Map APIs:

module Dict =
  open System.Collections.Generic

  let empty() = new Dictionary<_,_>()

  let add k v (d : #IDictionary<'a,'b>) =
    d.[k] <- v; d

  let remove (k : 'a) (d : #IDictionary<'a,'b>) =
    d.Remove(k) |> ignore; d

  let tryfind k (d : #IDictionary<'a,'b>) =
    match d.TryGetValue(k) with
    | true, v  -> Some(v)
    | false, _ -> None

Now, it’s a simple matter of replacing references to Map with Dict, and our algorithm immediately benefits from the performance boost of using a mutable table.

let reinsert x table prime =
   let comp = x+prime
   match Dict.tryfind comp table with
   | None        -> table |> Dict.add comp [prime]
   | Some(facts) -> table |> Dict.add comp (prime::facts)

let rec sieve x table =
  seq {
    match Dict.tryfind x table with
    | None ->
        yield x
        yield! sieve (x+1L) (table |> Dict.add (x*x) [x])
    | Some(factors) ->
        yield! sieve (x+1L) (factors |> List.fold_left (reinsert x) (table |> Dict.remove x)) }

let primes =
  sieve 2L (Dict.empty())

With our prime number generator in hand, solving Project Euler problem seven is trivial:

primes |> Seq.nth 10000

At last, we’ve produced a prime number generator with a good balance of performance and beauty. If we’ve learned anything from this article or its predecessor, I hope it’s that finding an optimal solution often requires finding an appropriate algorithm.

Take that Chris!

1Yet Another Project Euler Series
2O'Neill's paper appears in The Journal of Functional Programming, Volume 19, Issue 01.

posted on Monday, January 19, 2009 11:44:09 PM (Pacific Standard Time, UTC-08:00)  #    Comments [6]

kick it on DotNetKicks.com