__builtins__.hash puzzle

2013.Apr.11

A puzzle:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
### CPython2.7

assert hash(5) == 5
assert all(hash(x) == x for x in xrange(10))

class Foo(object):
	def __init__(self, x):
		self.x = x
	def __hash__(self):
		return x

assert hash(Foo(5)) == 5
assert all(hash(Foo(x)) == x for x in xrange(10))

# but...
assert hash(-1) != -1
assert hash(Foo(-1)) != -1

# why?

### PyPy

# yet...

assert hash(-1) == -1
assert hash(Foo(-1)) == -1

The answer (in part):

This is an interesting CPython implementation detail. (PyPy behaves as you might expect.)

In slot_tp_hash (for user-defined types) and in all the hash methods for the builtin types, you’ll see something like:

1
2
3
 if (h == -1 && !PyErr_Occurred())
     h = -2;
 return h;

Hash values of -1 are automatically and consistently converted to -2.

As you might now suspect, CPython repurposes a -1 return value to signal an error (a common convention in C.)

Indeed, builtin_hash looks like:

1
2
3
4
5
6
7
8
9
10
static PyObject *
builtin_hash(PyObject *self, PyObject *v)
{
     long x;

     x = PyObject_Hash(v);
     if (x == -1)
         return NULL; // this signals an error has occured
     return PyInt_FromLong(x);
}

I think this does tell us something meaningful. (Don’t use hash in any externally-facing fashion.)

As I see it, Python provides us with a set of practical, common protocols implemented as double-underscore methods: e.g., if you want to serialise object, Pickle provides you with a protocol that can you can implement for any object.

I think, in general, it’s good practice to work within these protocols rather than to build your own. For example, in a C or C++ system, it wouldn’t be uncommon to devise your own protocol for rendering nicely formatted string representations of your objects. In Python, however, we can skip this work and just use __format__ and __builtins__.format.

Unfortunately, these protocols are often ad hoc, and it’s not immediately clear that, for example, __hash__ isn’t really intended for broader use. It’s understandable in this case, because hash is provided mainly as a hook into the internals of another implementation detail: dict lookup. (What parts of dictionary behaviour are standardised? Could you write a compatible Python implementation using C++ STL std::map, which preserves insertion order?)

If you wanted to write a library that needed to give a hash for some type (maybe you’re sticking them in some content addressable store like .git), you might have to build your own parallel protocol. It would be tempting to just use __builtins__.hash, since it is supposed to be a consistent way to get the hash of an object. However, __hash__ clearly doesn’t care about consistency of implementation, so what kind of corner cases might you run into? And how would you know this from the outset?