Python defaultdict as sparse array: use with caution

Today I learned…
2023-12-10

TLDR: When using defaultdicts to represent sparse data with large cardinalities, remember that reading values with [] is side-effecting and increases the size of the dictionary.

Use .get() for reads, and reserve [] for insertions, increments, and updates.

Python’s collections package has some nice convenient datastructures that you might be familiar with, such as namedtuple or deque.

There is one very interesting, but flawed datastructure in there, namely the defaultdict. The defaultdict lets you, the programmer, define a ‘default factory’ lambda, which will be called to create (and persist) a default value in the dictionary, when __getitem__ (i.e. []-access) is called.

Here’s an example:

>>> from collections import defaultdict
>>> d = defaultdict(lambda: list())
>>> d["animals"]
[]
>>> d["animals"].append("dog")
>>> d
defaultdict(<function <lambda> at 0x7f53579c8860>, {'animals': ['dog']})

`defaultdict` and sparse arrays

Syntactically, this is very nice when you’re creating or updating sparse arrays, because you can just “fire and forget”, setting and incrementing (or decrementing) values at any index you can think of:

>>> d = defaultdict(lambda: 0)
>>> d[0] = 1
>>> d[100] += 5
>>> d[100] += 5
>>> d[100]
10
>>> d[99]
0

However, this type of syntactic convenience leads to complacency. __getitem__ access on a defaultdict is not the equivalent of .get() with a nice default. It is, unfortunately a side-effecting operation, as the value produced by the default factory is stored in the dict before being returned.

Now, it does say so right in the documentation:

If default_factory is not None, it is called without arguments to provide a default value for the given key, this value is inserted in the dictionary for the key, and returned.

The large-scale implications of this behavior are often not evident until it turns out your code is somehow performing very slowly and consuming lots of memory.

`defaultdict` effectively un-sparsifies your sparse array if you’re querying for every value

Here is a demonstration of the side-effecting behavior in action.

# file: sparse.py
import sys
from collections import defaultdict

def test(use_get):
    n = 1000*1000*10
    sparse = defaultdict(lambda: 0)
    sparse[int(n/3)] = 1
    sparse[int(n/3)*2] = 2
    sparse[int(n/3)*3] = 3
    print("SIZE BEFORE:", sys.getsizeof(sparse))
    sum = 0
    if use_get:
        # We use .get() so as not to trigger the default_factory
        for idx in range(n):
            sum += sparse.get(idx, 0)
    else:
        # We use []-access, bloating the dict as a side-effect
        for idx in range(n):
            sum += sparse[idx]
    print("SIZE AFTER:", sys.getsizeof(sparse))
    print(f"RESULT (using get: {use_get})", sum)

if sys.argv[1] == "True":
    test(True)
else:
    test(False)

In the code above, we first create a dictionary representing a sparse vector with 10-million entries, the default value being 0. Then, we set up some actual values roughly at every 1/3rd of the vector length. Our goal is to sum up all the values in this sparse array, going through every index.

If use_get is set to True, we use the .get() method on our defaultdict. If use_get is set to False, we use []-access.

We measure the size of the dict before and after the summation. Here are the results and the runtimes:

# Using .get() method:
/bin/time python3 ./sparse.py True
SIZE BEFORE: 232
SIZE AFTER: 232
RESULT (using get: True) 6
1.20user 0.00system 0:01.21elapsed 99%CPU (0avgtext+0avgdata 9984maxresident)k

# Using []-syntax:
/bin/time python3 ./sparse.py False
SIZE BEFORE: 232
SIZE AFTER: 335544408
RESULT (using get: False) 6
3.33user 0.82system 0:04.17elapsed 99%CPU (0avgtext+0avgdata 677260maxresident)k

Much worse! The dict ballooned from 232 bytes to over 300 megabytes. Additionally, it took the python runtime almost three times longer to process all 20 million keys in our sparse vector.

When using []-notation with a defaultdict, one must always remember to only use it for side-effecting operations, where updating the dictionary is in fact called for. If the goal is just to read a value, .get() with a sensible default is the way to go.

Python defaultdict as sparse array: use with caution

defaultdict and sparse arrays

defaultdict effectively un-sparsifies your sparse array if you’re querying for every value

`defaultdict` and sparse arrays

`defaultdict` effectively un-sparsifies your sparse array if you’re querying for every value