Tuesday, February 16, 2010

Problem 10: Sum of Primes

Problem 10 is fairly interesting.
The sum of the primes below 10 is 2 + 3 + 5 + 7 = 17.
Find the sum of all the primes below two million.

The solution is a lot easier if you are familiar with the Sieve of Eratosthenes. Wikipedia has a fairly good article on the Sieve, in which they sum up the steps as follows;
  1. Create a list of consecutive integers from two to n: (2, 3, 4, ..., n),
  2. Initially, let p equal 2, the first prime number,
  3. While enumerating all multiples of p starting from p2, strike them off from the original list,
  4. Find the first number remaining on the list after p (it's the next prime); let p equal this number,
  5. Repeat steps 3 and 4 until p2 is greater than n.
  6. All the remaining numbers in the list are prime.

Seems easy enough, here is an fairly direct translation in Python;
start = 2
numbers = range(start, n + 1)

for i in numbers:
    multiple = i
    while i * multiple <= n:
        try:
            numbers.remove(i * multiple)
        except ValueError:
            pass
        multiple += 1
        
return numbers

Python's ability to remove items from a List seems to be lacking here; at the very least the code is a little more terse then usual. The remove operation raises an error is the element is not present, so telling it to pass will simply ignore the resulting error. I am sure there is a terser way to represent the Sieve in Python, but the code above emphasizes readable and is comparable to pseudo code, except for the error handling.

The problem with the previous implementation is that it is slow. with a N of 10000 taking about a second, a N of 100000 taking ~101 seconds, and a N of 1000000 uncomfortably heating up my laptop, though it should clock in at about an hour.

Excuse my sloppy Big O notation, but according to Wikipedia, the complexity of the Sieve is O(n(logn)(loglogn)). 

The red flag here is the remove call. While I'm not sure how Python implements List.remove, according to the documentation it removes the first instance of the element it is searching for. So the method needs to iterate the list until it finds the element, and then readjust the remainder of the List, which is a costly operation, especially inside a for and while loop. I believe this would add another n to the complexity, O(n^2(logn)(loglogn)). (Edit: please disregard my sloppy Big O notation here, without knowing the implementation details behind the Python List remove operation I can only hazard a guess. We know that the remove operation stops at the first instance of the char, but then the List must be have an element removed, which can vary in cost depending on the data structure used. Analysis of this operation requires Python guru skills that I currently lack.)  I am really out of practice with Big O notation, so please correct me if I am wrong here.

It took me awhile to figure out how to get the run time of the sieve down, I eventually went online to try to find a hint of how to improve performance. Someone on the internet recommended replacing the composite numbers with zeros. It is painfully obvious once you see it; get rid of the remove operation altogether. Instead of removing the non-prime (composite) elements, replace them with a placeholder value.

Here is the improved implementation;
start = 2
numbers = range(start, n + 1)
placeholder = 0

#set composite numbers as a placeholder value
for i in range(len(numbers)):
    cur = numbers[i]
    if cur != placeholder:
        multiple = cur
        while cur*multiple <= n:
            listPos = (cur*multiple) - start
            numbers[listPos] = placeholder
            multiple += 1

#make a list of primes
primes = []
for i in numbers:
    if i != placeholder:
        primes.append(i)

return primes

This new implementation can handle the two million that we were looking for in problem in about 2 seconds. All that is left is to sum the primes. Here is the complete code;
def sieve(n):
    start = 2
    numbers = range(start, n + 1)
    placeholder = 0
    
    #set composite numbers as a placeholder value
    for i in range(len(numbers)):
        cur = numbers[i]
        if cur != placeholder:
            multiple = cur
            while cur*multiple <= n:
                listPos = (cur*multiple) - start
                numbers[listPos] = placeholder
                multiple += 1
    
    #make a list of primes
    primes = []
    for i in numbers:
        if i != placeholder:
            primes.append(i)
    
    return primes

if __name__ == '__main__':
    primes = sieve(2000000)
    sum = 0
    for i in primes:
        sum += i
    print sum

No comments: