the variance of an arbitrary Iterable

2013.Apr.23

I’ve asked the following interview question to a number of candidates. This question is based on an actual problem I have had to solve in a real report (find the variance of an Iterable, where the Iterable may not be a Sequence.)

The question starts simply:

Write a function to calculate the average of a list of numbers, xs.

1
2
3
4
5
6
7
8
9
10
11
from __future__ import division # allow for true division
                                # ref: http://www.python.org/dev/peps/pep-0238/

def average(xs):
	total, count = 0, 0
	for x in xs:
		total += x
		count += 1
	return total / count

assert average([1,2,3,4,5,6]) == 21/6 == 3.5

This is the only part of the question where I generally don’t hold the candidate’s hand.

Each part of this question can lead to follow-up discussions.

For example, a follow-up discussion on errors & error reporting:

What if xs is an empty list? Should average([]) == 0? How is that distinguished from average([0])? Should the function raise an exception? (It already will raise ZeroDivisionError: do we translate this into a ValueError?) Where might this function be used? What are the user's expectations when calling this function? What implications does that have on the way we might handle errors?

An easy follow-up discussion on complexity:

What is the asymptotic time behaviour of the above? Can we do any better? Why not? What is the asymptotic memory usage of the above? Can we do any better? Why?

The following two discussions can be repeated for each step of the question, though only the answer to the latter changes (when we write the constant memory usage formulation with a generator expression.)

Additionally, there is a follow-up discussion to be had on what the candidate thinks of the problem and the solution.

Do you like the solution? Is it something you’d do in real code? Are there any qualifications on when, where, and how you might solve these problems?

There exist the following functions in the global scope (__builtins__):

1
2
3
4
5
# sum: sum an Iterable of non-strings
assert sum([1,2,3]) == 6

# len: length of some Sized (usu. Sequence)
assert len([1,2,3,4]) == 4

Rewrite average using the __builtins__, sum and len.

1
2
3
4
5
6
from __future__ import division

def average(xs):
	return sum(xs) / len(xs)

assert average([1,2,3,4,5,6]) == 21/6 == 3.5

Simple enough!

Now, Python has list comprehensions:

1
2
3
4
5
6
7
8
9
# [ expression for loop-variable in Iterable if predicate ]
#   ----------     -------------    --------    ---------

# e.g., the squares of all the odd numbers between [0,10)
xs = [x**2 for x in xrange(10) if x % 2]
assert xs == [1**2, 3**2, 5**2, 7**2, 9**2] == [1,9,25,49,81]

# xs is a materialised list, taking O(N) memory
assert type(xs) is list

Can we write variance that finds the variance of the list and makes use of list comprehensions?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
from __future__ import division

# we'll use the following formulation of variance (provided):
#   E[(x - avg)**2]
def average(xs):
	return sum(xs) / len(xs)

def variance(xs):
	avg = average(xs)
	ys = [(x - avg)**2 for x in xs]
	return average(ys)

data = [1,2,3,4,5,6]
assert average(data) == 21/6 == 3.5
assert variance(data) == 18.5/6

Python also has generator expressions:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
# ( expression for loop-variable in Iterable if predicate )
#   ----------     -------------    --------    ---------

# e.g., the squares of all the odd numbers between [0,10)
xs = (x**2 for x in xrange(10) if x % 2)
assert list(xs) == [1**2, 3**2, 5**2, 7**2, 9**2] == [1,9,25,49,81]

# xs is NOT materialised; it yields values as it is
#   iterated over, and, by itself, requires O(1) memory
assert type(xs) is not list

# a generator expression can, as a result, represent a list
#   of any size, even a countably infinite one!

# e.g., the squares of all the odd numbers from [0, inf)
from itertools import count
xs = (x**2 for x in count() if x % 2)

# we can step through the generator with next()
assert next(xs) == 1**2 == 1
assert next(xs) == 3**2 == 9
assert next(xs) == 5**2 == 25

# note that the generator is a one-way stream:
#   once we step through and yield a value, we can never revisit that value!

# in a future post, I'll discuss the difference between generators and generator expressions

len works on any object that has __len__ defined.

(More precisely, in CPython it works on any object that has obj->ob_type->tp_as_sequence->sq_length or obj->obj_type->tp_as_mapping->mp_length.)

Can we write our own len that will work on any finite Iterable, even a finite generator?

1
2
3
4
5
6
7
def len(xs):
	count = 0
	for x in xs:
		count += 1
	return count

assert len([1,2,3,4]) == 4

Can we do this using generator expression syntax?

Writing this using generator syntax requires an intuitive leap: that the expression yielded does not have to be a function of the loop variable.

Additionally, candidates with C++ or Java background will try to solve this with an explicit mutation of the count (count++ for _ in xs), but integers in Python are non-mutable, and the purpose of this question is to flirt with a more functional approach.

1
2
def len(xs):
	return sum(1 for _ in xs)

Using this result, can we rewrite average?

We have to be able to calculate the average of any Iterable (including non-materialised generators)? This means that we can only iterate over the Iterable once.

This requires another intuitive leap: that we can capture two values in the generator expression.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
from __future__ import division

# we'll presume we have access to the following 
# with standard Python tuples, x + y means concatenation
# for the purposes of this example, we'll assume we have a tuple
#   wherein x + y means pairwise addition, e.g.,
#   (a,b) + (c,d) == (a+c, b+d)
class sumtup(tuple):
	def __add__(self, other):
		# note: (a,b) + (c,d,e) == (a+c, b+d)
		return type(self)(sum(p) for p in zip(self,other))
	def __eq__(self, other):
		return all(a == b for a,b in zip(self, other))

assert sum([sumtup([1,2]), sumtup([3,4])], sumtup([0,0])) == (1+3,2+4) == (4,6)

def average(xs):
	xs_tot, count = sum((sumtup([x, 1]) for x in xs), sumtup([0,0]))
	return xs_tot/count

data = [1,2,3,4,5,6]
assert average(data) == 21/6

A cute observation that avoids the use of sumtup by using a complex number.

Recall that for a complex number:

1
2
3
4
5
6
7
8
from __future__ import division

def average(xs):
	result = sum(x+1j for x in xs)
	return result.real/result.imag

data = [1,2,3,4,5,6]
assert average(data) == 21/6

Finally, can we use all of these results to rewrite variance?

For the sheer don’t use this code of it, I will avoid the use of sumtup.

sum is merely a fold (reduce in Python terms) on addition.

1
2
3
4
5
6
7
8
from __future__ import division

variance = lambda xs: (lambda (xs_sq_tot, xs_tot, count): (xs_sq_tot - xs_tot**2/count)/count)(
                       reduce(lambda acc,ys: type(ys)(sum(y) for y in zip(acc,ys)),
                              ((x**2, x, 1) for x in xs)))

data = [1,2,3,4,5,6]
assert round(variance(data),3) == round(17.5/6,3) # round, because we lose a little precision

The purpose of this question is to start a discussion.

I have recommended candidates who didn’t get any of the answers on the first try but were able to clearly demonstrate their skill, knowledge, and experience in the surrounding discussion.

I’m just about the easiest interviewer out there, and I don’t hold candidates to getting the exact answer on the first try. In fact, many parts of this question rely on making a single intuititve leap, and I don’t think it’s fair to ask riddles in interviews.

In fact, I like giving this question to people who don’t know Python or who don’t know Python very well. It allows me to show them something and see how they approach putting it into use.

What kind of questions to they ask? Are they engaged in the discussion? What experience or knowledge can they bring to bear in answering it?

In the end, I think that what matters most to me is whether the candidate can have a reasonable discussion about the process of creating software. Are they engaged in their work? Are they easy to work with? Can they hold a productive discussion on software development? Can they reconcile differences of opinion to negotiate a practical, reasonable consensus?