generator elision

2013.Apr.25

The following material is taken from talks I gave at PyData New York and PyData Silicon Valley

We can conceptualise a generator as a delayed computation (a thunk, if you will.)

Let’s start with a simple demonstration of an equivalence between functions and generators:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
from math import sqrt

# functions
def filter_odds_fun(xs):
	temp = []
	for x in xs:
		if x%2:
			temp.append(x)
	return temp

def filter_squares_fun(xs):
	temp = []
	for x in xs:
		if int(sqrt(x))**2 == x:
			temp.append(x)
	return temp

# generators
def filter_odds_gen(xs):
	for x in xs:
		if x%2:
			yield x

def filter_squares_gen(xs):
	for x in xs:
		if int(sqrt(x))**2 == x:
			yield x

input_data = xrange(100)
assert filter_squares_fun(filter_odds_fun(input_data)) == \
  list(filter_squares_gen(filter_odds_gen(input_data)))

We can easily see that the two constructions are equivalent. Modelling our calculation as the composition of two functions gives us the same answer as modelling it as two generators. The only obvious difference is that we need to force materialisation of the result in the case of composed generators.

However, in the case of composed functions, each function eagerly evaluates its result and the next function in the computation sees only this value. To put it another way, we can’t share much information between the parts of our computation, and the only way to share information is via the return value.

In the case of the generators, each generator lazily evaluates one step of the computation at a time, an each part of our computation can see that it is operating on another generator.

One simple consequence is that our generator composition can terminate early:

1
2
3
4
5
6
7
# shorter formulation
filter_odds = lambda xs: (x for x in xs if x%2)
filter_squares = lambda xs: (x for x in xs if int(sqrt(x))**2 == x)

# get the first three values, then terminate
from itertools import islice
print list(islice(filter_squares(filter_odds(xrange(100))),None,3))

We can see that the inner part of our computation, filter_odds, receives information from the outer part of our computation, islice: namely, when it can terminate.

This is a trivial transference of information.

Maybe we can do better with a little decoration, and a couple lines of (please, don’t use this) code.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
# a simple helper for the below
from itertools import islice
class partialdict(dict):
	def __missing__(self, key):
		return self[frozenset(islice(iter(key),1,None))]

# a wrapped generator type (will yield a generator instance when called)
#   that has 1-place lookahead
# when you call this type to instantiate a generator instance: e.g., f(g(...))
#   it will look ahead to see if the argument is a wrapped generator
#     instance and find a dispatch function specified by the types of f and g 
#   the dispatch function takes the values of f and g and 
#     returns a new, wrapped generator
class typed(object):
	def __init__(self, gen, gentype):
		self.gen, self.gentype = gen, gentype
	def __call__(self, arg):
		arg_gentype = arg.gentype if isinstance(arg, wrapped) else ''
		dispatch = gentype.dispatch[frozenset([self.gentype, arg_gentype])]
		rv, rv_type = dispatch(self.gen, arg)
		if not isinstance(rv, wrapped) or rv.gentype != rv_type:
			rv = wrapped(rv, rv_type)
		return rv 

# wrapped generator instance that keeps track of its own type
class wrapped(object):
	def __init__(self, gen, gentype):
		self.gen, self.gentype = gen, gentype
	def __iter__(self):
		return self
	def __next__(self):
		return next(self.gen)
	next = __next__
	def send(self, *args, **kwargs):
		return self.gen.send(*args, **kwargs)
	def throw(self, *args, **kwargs):
		return self.gen.throw(*args, **kwargs)
	def close(self, *args, **kwargs):
		return self.gen.close(*args, **kwargs)

# a way to construct a wrapped generator type with a dispatch table
class gentype(object):
	dispatch = partialdict({frozenset(): lambda gen,arg: (gen(arg),'')})
	def __init__(self, gentype):
		self.gentype = gentype
	def __call__(self, gen):
		return typed(gen, gentype=self.gentype)

# an example type and dispatch table

# passthrough is a generator type that signifies the generator does nothing
#   meaningful (just passes data through) and can be freely elided
#   from any computation
passthrough = gentype('passthrough')

# the dispatch table looks like this
#   for f(g(...)):

#    type(f)   |    type(g)   |   result of f(g(xs))
# -----------------------------------------
#  passthrough   passthrough          xs
#      -         passthrough          f(xs)
#  passthrough        -               g(xs)

# because this is all based on sets, the above relations are symmetric
gentype.dispatch[frozenset([passthrough.gentype, passthrough.gentype])] = \
  lambda gen,arg: (arg, passthrough.gentype)
gentype.dispatch[frozenset([passthrough.gentype])] = \
  lambda gen,arg: (arg, passthrough.gentype)

With the above mechanisms in place, we can model a problem as follows.

In the below, f and g are generators that are typed/tagged as being passthrough, meaning they do not change the output and can be freely elided from any computation.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
# dummy function, doesn't do anything; should be elided
@passthrough
def f(xs):
	print 'fff'
	for x in xs:
		yield x # input -> output

# another dummy; should be elided
@passthrough
def g(xs):
	print 'ggg'
	for x in xs:
		yield x # input -> output

def h(xs):
	print 'hhh'
	for x in xs:
		if x%2: # do some actual work, i.e., filter
			yield x

result = f(g(h(x for x in xrange(10))))

# materialise
print list(result)

In the above, we can see that only h is called! As expected, both f and g are completely elided from the computation!

Imagine that, intead of being elided, we could implement more complex transformations. f(g(...)) can just as easily replace itself with an optimised, combined version, written in C!

Next time: * more sophisticated examples of the same technique * how we can use generators to capture whole programme information for further optimisations * building a full, parallel type system in Python with isinstance and __instancecheck__ * building this additional information into the interpreter * lambda fusion in the interpreter