NOTES ON OPTIMIZING DICTIONARIES
================================

Principal Use Cases for Dictionaries
------------------------------------
Passing keyword arguments
    Typically, one read and one write for 1 to 3 elements.
    Occurs frequently in normal Python code.

Class method lookup
    Dictionaries vary in size with 8 to 16 elements being common.
    Usually written once with many lookups.
    When base classes are used, there are many failed lookups
    followed by a lookup in a base class.

Instance attribute lookup and global variables
    Dictionaries vary in size.  4 to 10 elements are common.
    Both reads and writes are common.
Builtins
    Frequent reads.  Almost never written.
    Contains about 126 interned string keys (as of Py2.3b1).
    A few keys are accessed much more frequently than others.
Uniquification
    Dictionaries of any size.  Bulk of work is in creation.
    Repeated writes to a smaller set of keys.
    Single read of each key.
    Some use cases have two consecutive accesses to the same key.

    * Removing duplicates from a sequence.
        dict.fromkeys(seqn).keys()

    * Counting elements in a sequence.
        for e in seqn:
            d[e] = d.get(e, 0) + 1

    * Accumulating references in a dictionary of lists:
        for pagenumber, page in enumerate(pages):
            for word in page:
                d.setdefault(word, []).append(pagenumber)

    Note, the second example is a use case characterized by a get and set
    to the same key.  There are similar use cases with a __contains__
    followed by a get, set, or del to the same key.  Part of the
    justification for d.setdefault is combining the two lookups into one.
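The two-lookup versus one-lookup distinction can be seen in a small
sketch (variable names are illustrative):

```python
pages = [["the", "cat"], ["the", "hat"]]

# Two or three probes per word: a __contains__ test, then a get/set
# of the very same key.
index = {}
for pagenumber, page in enumerate(pages):
    for word in page:
        if word not in index:           # first lookup of the key
            index[word] = []            # possible second lookup (a set)
        index[word].append(pagenumber)  # another lookup of the same key

# One probe per word: setdefault() combines the lookup and the insert.
index2 = {}
for pagenumber, page in enumerate(pages):
    for word in page:
        index2.setdefault(word, []).append(pagenumber)
```

Both loops build the same mapping; the second simply touches the hash
table fewer times per key.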
Membership Testing
    Dictionaries of any size.  Created once and then rarely changed.
    Single write to each key.
    Many calls to __contains__() or has_key().
    Similar access patterns occur with replacement dictionaries
    such as with the % formatting operator.
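For example, the % formatting operator performs a read-only lookup for
each named field in the mapping, never writing to it:

```python
# A replacement dictionary used by %-formatting: each "%(name)s" field
# triggers one dictionary read.
d = {"name": "dict", "size": 8}
s = "%(name)s has %(size)d slots" % d
```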
Dynamic Mappings
    Characterized by deletions interspersed with adds and replacements.
    Performance benefits greatly from the re-use of dummy entries.
Data Layout (assuming a 32-bit box with 64 bytes per cache line)
----------------------------------------------------------------

Smalldicts (8 entries) are attached to the dictobject structure
and the whole group nearly fills two consecutive cache lines.

Larger dicts use the first half of the dictobject structure (one cache
line) and a separate, contiguous block of entries (at 12 bytes each
for a total of 5.333 entries per cache line).
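A quick check of the arithmetic behind those figures (the entry fields
are the 32-bit hash/key/value triple; the sizes are the assumptions
stated in the heading):

```python
CACHE_LINE_BYTES = 64   # assumed cache line size from the heading
ENTRY_BYTES = 12        # me_hash, me_key, me_value: three 4-byte fields

# 64 / 12 gives the 5.333 entries per cache line quoted above.
entries_per_line = CACHE_LINE_BYTES / float(ENTRY_BYTES)

# An 8-entry small table spans 96 bytes, i.e. 1.5 cache lines.
small_table_lines = 8 * ENTRY_BYTES / float(CACHE_LINE_BYTES)
```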
Tunable Dictionary Parameters
-----------------------------

* PyDict_MINSIZE.  Currently set to 8.
    Must be a power of two.  New dicts have to zero-out every cell.
    Each additional 8 consumes 1.5 cache lines.  Increasing improves
    the sparseness of small dictionaries but costs time to read in
    the additional cache lines if they are not already in cache.
    That case is common when keyword arguments are passed.
* Maximum dictionary load in PyDict_SetItem.  Currently set to 2/3.
    Increasing this ratio makes dictionaries more dense, resulting
    in more collisions.  Decreasing it improves sparseness at the
    expense of spreading entries over more cache lines and at the
    cost of total memory consumed.

    The load test occurs in highly time-sensitive code.  Efforts
    to make the test more complex (for example, varying the load
    for different sizes) have degraded performance.
* Growth rate upon hitting maximum load.  Currently set to *2.
    Raising this to *4 results in half the number of resizes,
    less effort to resize, better sparseness for some (but not
    all) dict sizes, and potentially doubles memory consumption
    depending on the size of the dictionary.  Setting to *4
    eliminates every other resize step.
* Maximum sparseness (minimum dictionary load).  What percentage
    of entries can be unused before the dictionary shrinks to
    free up memory and speed up iteration?  (The current CPython
    code does not represent this parameter directly.)

* Shrinkage rate upon exceeding maximum sparseness.  The current
    CPython code never even checks sparseness when deleting a
    key.  When a new key is added, it resizes based on the number
    of active keys, so that the addition may trigger shrinkage
    rather than growth.
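The interaction of the 2/3 load test and the growth factor can be
sketched in Python.  The load condition mirrors the CPython check in
integer arithmetic; the function names and the simplified resize model
are illustrative, not the real C code:

```python
def exceeds_load(fill, size):
    # True when fill/size reaches 2/3; integer math avoids floats,
    # matching the style of the C load test.
    return fill * 3 >= size * 2

def count_resizes(n, factor):
    """Count resizes while inserting n keys, starting from 8 slots."""
    size, resizes = 8, 0
    for fill in range(1, n + 1):
        if exceeds_load(fill, size):
            size *= factor
            resizes += 1
    return resizes

# Growing *4 instead of *2 skips every other resize step:
# e.g. 11 resizes at *2 versus 6 at *4 for 10000 keys.
resizes_x2 = count_resizes(10000, 2)
resizes_x4 = count_resizes(10000, 4)
```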
Tune-ups should be measured across a broad range of applications and
use cases.  A change to any parameter will help in some situations and
hurt in others.  The key is to find settings that help the most common
cases and do the least damage to the less common cases.  Results will
vary dramatically depending on the exact number of keys, whether the
keys are all strings, whether reads or writes dominate, and the exact
hash values of the keys (some sets of values have fewer collisions than
others).  Any one test or benchmark is likely to prove misleading.
While making a dictionary more sparse reduces collisions, it impairs
iteration and key listing.  Those methods loop over every potential
entry.  Doubling the size of the dictionary results in twice as many
non-overlapping memory accesses for keys(), items(), values(),
__iter__(), iterkeys(), iteritems(), itervalues(), and update().
Also, every dictionary iterates at least twice: once for the memset()
when it is created and once by dealloc().
Dictionary operations involving only a single key can be O(1) unless
resizing is possible.  By checking for a resize only when the
dictionary can grow (and may *require* resizing), other operations
remain O(1), and the odds of resize thrashing or memory fragmentation
are reduced.  In particular, an algorithm that empties a dictionary
by repeatedly invoking .pop will see no resizing, which might
not be necessary at all because the dictionary is eventually
discarded entirely.
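A sketch of that pop-until-empty pattern.  In CPython, deletions never
trigger a resize, so the table keeps its capacity throughout (the
getsizeof comparison assumes CPython's behavior):

```python
import sys

d = dict.fromkeys(range(1000))
capacity_before = sys.getsizeof(d)

while d:
    d.popitem()   # each pop is O(1); the delete path never checks for resize

# The emptied dictionary retains its capacity; shrinkage could only be
# triggered by a later insertion, which often never comes because the
# dictionary is simply discarded.
capacity_after = sys.getsizeof(d)
```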
Results of Cache Locality Experiments
-------------------------------------

When an entry is retrieved from memory, 4.333 adjacent entries are also
retrieved into a cache line.  Since accessing items in cache is *much*
cheaper than a cache miss, an enticing idea is to probe the adjacent
entries as a first step in collision resolution.  Unfortunately, the
introduction of any regularity into collision searches results in more
collisions than the current random chaining approach.
Exploiting cache locality at the expense of additional collisions fails
to pay off when the entries are already loaded in cache (the expense
is paid with no compensating benefit).  This occurs in small dictionaries
where the whole dictionary fits into a pair of cache lines.  It also
occurs frequently in large dictionaries which have a common access pattern
where some keys are accessed much more frequently than others.  The
more popular entries *and* their collision chains tend to remain in cache.
To exploit cache locality, change the collision resolution section
in lookdict() and lookdict_string().  Set i^=1 at the top of the
loop and move the i = (i << 2) + i + perturb + 1 to an unrolled
version of the loop.
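The modified probe order can be sketched in Python.  The i*5 + perturb + 1
recurrence and PERTURB_SHIFT follow the C code; the xor-with-1 step is
the proposed adjacent-slot probe, and the function itself is only a
model of the search order, not the real lookup routine:

```python
def probe_sequence(hash_value, mask, steps=8):
    """Yield the slots visited by the (modified) open-addressing search.

    The xor with 1 probes the neighboring slot, which usually shares a
    cache line, before taking the pseudo-random jump.
    """
    PERTURB_SHIFT = 5
    i = hash_value & mask
    perturb = hash_value
    for _ in range(steps):
        yield i & mask
        i ^= 1                           # proposed: try the adjacent entry
        yield i & mask
        i = (i << 2) + i + perturb + 1   # i = i*5 + perturb + 1
        perturb >>= PERTURB_SHIFT

slots = list(probe_sequence(12345, mask=7))
```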
This optimization strategy can be leveraged in several ways:

* If the dictionary is kept sparse (through the tunable parameters),
  then the occurrence of additional collisions is lessened.

* If lookdict() and lookdict_string() are specialized for small dicts
  and for large dicts, then the versions for large dicts can be given
  an alternate search strategy without increasing collisions in small
  dicts, which already have the maximum benefit of cache locality.

* If the use case for a dictionary is known to have a random key
  access pattern (as opposed to a more common pattern with a Zipf's law
  distribution), then there will be more benefit for large dictionaries
  because any given key is no more likely than another to already be
  in cache.

* In use cases with paired accesses to the same key, the second access
  is always in cache and gets no benefit from efforts to further improve
  cache locality.
Optimizing the Search of Small Dictionaries
-------------------------------------------

If lookdict() and lookdict_string() are specialized for smaller dictionaries,
then a custom search approach can be implemented that exploits the small
search space and cache locality.
* The simplest example is a linear search of contiguous entries.  This is
  simple to implement, guaranteed to terminate rapidly, never searches
  the same entry twice, and precludes the need to check for dummy entries.

* A more advanced example is a self-organizing search so that the most
  frequently accessed entries get probed first.  The organization
  adapts if the access pattern changes over time.  Treaps are ideally
  suited for self-organization, with the most common entries at the
  top of the heap and a rapid binary search pattern.  Most probes and
  results are located at the top of the tree, allowing them all to
  fit in one or two cache lines.

* Also, small dictionaries may be made more dense, perhaps filling all
  eight cells to take the maximum advantage of two cache lines.
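A toy Python model of the linear-search idea (illustrative only; the
real optimization would live in the specialized C lookup routines):

```python
class SmallDict(object):
    """Toy small mapping with entries stored contiguously.

    A lookup is a short linear scan: it never revisits a slot and
    needs no dummy-entry checks, since deleted slots are compacted
    away rather than marked with dummies.
    """
    def __init__(self):
        self._keys = []
        self._values = []

    def __setitem__(self, key, value):
        for i, k in enumerate(self._keys):
            if k == key:
                self._values[i] = value
                return
        self._keys.append(key)
        self._values.append(value)

    def __getitem__(self, key):
        for i, k in enumerate(self._keys):   # linear search, contiguous slots
            if k == key:
                return self._values[i]
        raise KeyError(key)

d = SmallDict()
d["a"] = 1
d["a"] = 2      # overwrite in place
d["b"] = 3
```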
Strategy Pattern
----------------

Consider allowing the user to set the tunable parameters or to select a
particular search method.  Since some dictionary use cases have known
sizes and access patterns, the user may be able to provide useful hints.

1) For example, if membership testing or lookups dominate runtime and memory
   is not at a premium, the user may benefit from setting the maximum load
   ratio at 5% or 10% instead of the usual 66.7%.  This will sharply
   curtail the number of collisions but will increase iteration time.
   The builtin namespace is a prime example of a dictionary that can
   benefit from being highly sparse.
2) Dictionary creation time can be shortened in cases where the ultimate
   size of the dictionary is known in advance.  The dictionary can be
   pre-sized so that no resize operations are required during creation.
   Not only does this save resizes, but key insertion will go
   more quickly because the first half of the keys will be inserted into
   a more sparse environment than before.  The preconditions for this
   strategy arise whenever a dictionary is created from a key or item
   sequence and the number of *unique* keys is known.

3) If the key space is large and the access pattern is known to be random,
   then search strategies exploiting cache locality can be fruitful.
   The preconditions for this strategy arise in simulations and
   numerical analysis.

4) If the keys are fixed and the access pattern strongly favors some of
   the keys, then the entries can be stored contiguously and accessed
   with a linear search or treap.  This exploits knowledge of the data,
   cache locality, and a simplified search routine.  It also eliminates
   the need to test for dummy entries on each probe.  The preconditions
   for this strategy arise in symbol tables and in the builtin dictionary.
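The benefit described in strategy 1 can be illustrated with a simple
uniform-hash simulation (not real dict code; the function name and the
linear-probe model are invented for the sketch) showing how a lower
load ratio cuts collisions:

```python
import random

def collisions(num_keys, table_size):
    """Count probe collisions when placing num_keys uniformly hashed
    keys into a table of table_size slots via linear probing."""
    slots = [False] * table_size
    hits = 0
    for _ in range(num_keys):
        i = random.randrange(table_size)
        while slots[i]:
            hits += 1                    # occupied slot: keep probing
            i = (i + 1) % table_size
        slots[i] = True
    return hits

random.seed(1)
dense = collisions(660, 1024)    # roughly the usual 2/3 load
sparse = collisions(660, 8192)   # roughly 8% load, as in strategy 1
```

The sparse table resolves far fewer collisions, at the cost of the
extra memory and slower iteration noted above.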
Readonly Dictionaries
---------------------

Some dictionary use cases pass through a build stage and then move to a
more heavily exercised lookup stage with no further changes to the
dictionary.

An idea that emerged on python-dev is to be able to convert a dictionary
to a read-only state.  This can help prevent programming errors and also
provide knowledge that can be exploited for lookup optimization.

The dictionary can be immediately rebuilt (eliminating dummy entries),
resized (to an appropriate level of sparseness), and the keys can be
jostled (to minimize collisions).  The lookdict() routine can then
eliminate the test for dummy entries (saving about 1/4 of the time
spent in the collision resolution loop).
An additional possibility is to insert links into the empty spaces
so that dictionary iteration can proceed in len(d) steps instead of
(mp->mask + 1) steps.  Alternatively, a separate tuple of keys can be
kept just for iteration.
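Python later grew a related, though weaker, facility:
types.MappingProxyType gives a read-only view after the build stage,
though it performs none of the rebuilding or lookup optimization
described above:

```python
from types import MappingProxyType

# Build stage: the underlying dict is populated normally.
config = {"host": "localhost", "port": 8080}

# Lookup stage: the proxy blocks further writes, catching programming
# errors, but reads pass straight through to the original dict.
frozen = MappingProxyType(config)
value = frozen["port"]

try:
    frozen["port"] = 9090
except TypeError:
    write_blocked = True
```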
Caching Lookups
---------------

The idea is to exploit key access patterns by anticipating future lookups
based on previous lookups.

The simplest incarnation is to save the most recently accessed entry.
This gives optimal performance for use cases where every get is followed
by a set or del to the same key.
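A toy sketch of that idea (class and attribute names are invented for
illustration; the real optimization would cache the entry's slot inside
the C lookup routine):

```python
class LastEntryCache(object):
    """Remembers the most recently accessed (key, value) pair.

    In the paired-access use cases above, a set immediately following
    a get of the same key hits the cache and could skip the second
    hash/probe; here we only model the bookkeeping and count hits.
    """
    def __init__(self, data):
        self._data = data
        self._last = None          # (key, value) of the most recent access
        self.hits = 0

    def get(self, key, default=None):
        if self._last is not None and self._last[0] == key:
            self.hits += 1
            return self._last[1]
        value = self._data.get(key, default)
        self._last = (key, value)
        return value

    def set(self, key, value):
        if self._last is not None and self._last[0] == key:
            self.hits += 1         # slot already known from the previous get
        self._data[key] = value
        self._last = (key, value)

# Counting idiom: every get is immediately followed by a set of the
# same key, so every set hits the cache.
c = LastEntryCache({})
for e in ["a", "b", "a"]:
    c.set(e, c.get(e, 0) + 1)
```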