NOTES ON OPTIMIZING DICTIONARIES
================================

Principal Use Cases for Dictionaries
------------------------------------

Passing keyword arguments
    Typically, one read and one write for 1 to 3 elements.
    Occurs frequently in normal Python code.

Class method lookup
    Dictionaries vary in size with 8 to 16 elements being common.
    Usually written once with many lookups.
    When base classes are used, there are many failed lookups
    followed by a lookup in a base class.

Instance attribute lookup and Global variables
    Dictionaries vary in size. 4 to 10 elements are common.
    Both reads and writes are common.

Builtins
    Frequent reads. Almost never written.
    Size 126 interned strings (as of Py2.3b1).
    A few keys are accessed much more frequently than others.

Uniquification
    Dictionaries of any size. Bulk of work is in creation.
    Repeated writes to a smaller set of keys.
    Single read of each key.
    Some use cases have two consecutive accesses to the same key.

    * Removing duplicates from a sequence.
        dict.fromkeys(seqn).keys()

    * Counting elements in a sequence.
        for e in seqn:
            d[e] = d.get(e, 0) + 1

    * Accumulating references in a dictionary of lists:
        for pagenumber, page in enumerate(pages):
            for word in page:
                d.setdefault(word, []).append(pagenumber)

    Note, the second example is a use case characterized by a get and set
    to the same key. There are similar use cases with a __contains__
    followed by a get, set, or del to the same key. Part of the
    justification for d.setdefault is combining the two lookups into one.

Membership Testing
    Dictionaries of any size. Created once and then rarely changed.
    Single write to each key.
    Many calls to __contains__() or has_key().
    Similar access patterns occur with replacement dictionaries
    such as with the % formatting operator.
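
    For reference, here is a small sketch of that replacement-dictionary
    pattern (the key names are invented for this example):

        replacements = {'user': 'guido', 'version': '2.3'}
        print('%(user)s upgraded to Python %(version)s' % replacements)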

Dynamic Mappings
    Characterized by deletions interspersed with adds and replacements.
    Performance benefits greatly from the re-use of dummy entries.

Data Layout (assuming a 32-bit box with 64 bytes per cache line)
----------------------------------------------------------------

Smalldicts (8 entries) are attached to the dictobject structure
and the whole group nearly fills two consecutive cache lines.

Larger dicts use the first half of the dictobject structure (one cache
line) and a separate, contiguous block of entries (at 12 bytes each
for a total of 5.333 entries per cache line).
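
As a quick check of that arithmetic (the constants restate the
assumptions of this section):

    ENTRY_SIZE = 12         # bytes per entry on a 32-bit build: hash, key, value
    CACHE_LINE = 64         # bytes per cache line, as assumed above

    print(CACHE_LINE / float(ENTRY_SIZE))       # 5.333... entries per cache line
    print(8 * ENTRY_SIZE / float(CACHE_LINE))   # 1.5 cache lines per additional 8 entries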

Tunable Dictionary Parameters
-----------------------------

* PyDict_MINSIZE. Currently set to 8.
    Must be a power of two. New dicts have to zero-out every cell.
    Each additional 8 consumes 1.5 cache lines. Increasing improves
    the sparseness of small dictionaries but costs time to read in
    the additional cache lines if they are not already in cache.
    That case is common when keyword arguments are passed.

* Maximum dictionary load in PyDict_SetItem. Currently set to 2/3.
    Increasing this ratio makes dictionaries more dense, resulting
    in more collisions. Decreasing it improves sparseness at the
    expense of spreading entries over more cache lines and at the
    cost of total memory consumed.

    The load test occurs in highly time-sensitive code. Efforts
    to make the test more complex (for example, varying the load
    for different sizes) have degraded performance.

* Growth rate upon hitting maximum load. Currently set to *2.
    Raising this to *4 results in half the number of resizes, less
    effort to resize, and better sparseness for some (but not all)
    dict sizes, but it can as much as double memory consumption,
    depending on the size of the dictionary. Setting it to *4
    eliminates every other resize step. (See the simulation sketch
    after this list.)

* Maximum sparseness (minimum dictionary load). What percentage
    of entries can be unused before the dictionary shrinks to
    free up memory and speed up iteration? (The current CPython
    code does not represent this parameter directly.)

* Shrinkage rate upon exceeding maximum sparseness. The current
    CPython code never even checks sparseness when deleting a
    key. When a new key is added, it resizes based on the number
    of active keys, so that the addition may trigger shrinkage
    rather than growth.
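
To see how the 2/3 load threshold and the growth factor interact, here
is a small simulation sketch. It merely counts resizes under the
parameters described above; the real CPython resize rules differ in
detail (the new size is computed from the number of used slots), so the
numbers are only indicative:

    def count_resizes(n_inserts, growth=2, minsize=8, max_load=2.0 / 3.0):
        # Simulate inserting n_inserts distinct keys and count table resizes.
        size, fill, resizes = minsize, 0, 0
        for _ in range(n_inserts):
            fill += 1
            if fill > size * max_load:      # load test on every insertion
                size *= growth              # grow by the chosen factor
                resizes += 1
        return resizes

    print(count_resizes(10000, growth=2))   # roughly twice as many resizes ...
    print(count_resizes(10000, growth=4))   # ... as when quadrupling the size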

Tune-ups should be measured across a broad range of applications and
use cases. A change to any parameter will help in some situations and
hurt in others. The key is to find settings that help the most common
cases and do the least damage to the less common cases. Results will
vary dramatically depending on the exact number of keys, whether the
keys are all strings, whether reads or writes dominate, and the exact
hash values of the keys (some sets of values have fewer collisions than
others). Any one test or benchmark is likely to prove misleading.

While making a dictionary more sparse reduces collisions, it impairs
iteration and key listing. Those methods loop over every potential
entry. Doubling the size of a dictionary results in twice as many
non-overlapping memory accesses for keys(), items(), values(),
__iter__(), iterkeys(), iteritems(), itervalues(), and update().
Also, every dictionary iterates at least twice, once for the memset()
when it is created and once by dealloc().

Dictionary operations involving only a single key can be O(1) unless
resizing is possible. By checking for a resize only when the
dictionary can grow (and may *require* resizing), other operations
remain O(1), and the odds of resize thrashing or memory fragmentation
are reduced. In particular, an algorithm that empties a dictionary
by repeatedly invoking .pop will see no resizing, which might
not be necessary at all because the dictionary is eventually
discarded entirely.
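
For example, a draining loop of this shape performs only single-key
operations (popitem() here; pop(key) follows the same pattern), so
under the policy above it never triggers a resize while emptying the
dictionary:

    d = {'a': 1, 'b': 2, 'c': 3}
    results = []
    while d:
        key, value = d.popitem()        # single-key removal; no shrink is attempted
        results.append((key, value))
    print(results)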

Results of Cache Locality Experiments
-------------------------------------

When an entry is retrieved from memory, 4.333 adjacent entries are also
retrieved into a cache line. Since accessing items in cache is *much*
cheaper than a cache miss, an enticing idea is to probe the adjacent
entries as a first step in collision resolution. Unfortunately, the
introduction of any regularity into collision searches results in more
collisions than the current random chaining approach.

Exploiting cache locality at the expense of additional collisions fails
to pay off when the entries are already loaded in cache (the expense
is paid with no compensating benefit). This occurs in small dictionaries
where the whole dictionary fits into a pair of cache lines. It also
occurs frequently in large dictionaries which have a common access pattern
where some keys are accessed much more frequently than others. The
more popular entries *and* their collision chains tend to remain in cache.

To exploit cache locality, change the collision resolution section
in lookdict() and lookdict_string(). Set i^=1 at the top of the
loop and move the i = (i << 2) + i + perturb + 1 to an unrolled
version of the loop.
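
For reference, a small Python sketch of the probe sequence that the
collision resolution loop follows (PERTURB_SHIFT matches CPython's
constant; this illustrates the recurrence only and is not the C code):

    PERTURB_SHIFT = 5

    def probe_sequence(hashvalue, mask, limit=8):
        # Yield the first few slot indices examined for a given hash,
        # following i = (i << 2) + i + perturb + 1 with a decaying perturb.
        i = hashvalue & mask
        perturb = hashvalue
        for _ in range(limit):
            yield i & mask
            i = (i << 2) + i + perturb + 1
            perturb >>= PERTURB_SHIFT

    print(list(probe_sequence(123456789, mask=7)))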

This optimization strategy can be leveraged in several ways:

* If the dictionary is kept sparse (through the tunable parameters),
  then the occurrence of additional collisions is lessened.

* If lookdict() and lookdict_string() are specialized for small dicts
  and for large dicts, then the versions for large dicts can be given
  an alternate search strategy without increasing collisions in small
  dicts which already have the maximum benefit of cache locality.

* If the use case for a dictionary is known to have a random key
  access pattern (as opposed to a more common pattern with a Zipf's law
  distribution), then there will be more benefit for large dictionaries
  because any given key is no more likely than another to already be
  in cache.

* In use cases with paired accesses to the same key, the second access
  is always in cache and gets no benefit from efforts to further improve
  cache locality.

Optimizing the Search of Small Dictionaries
-------------------------------------------

If lookdict() and lookdict_string() are specialized for smaller dictionaries,
then a custom search approach can be implemented that exploits the small
search space and cache locality.

* The simplest example is a linear search of contiguous entries. This is
  simple to implement, guaranteed to terminate rapidly, never searches
  the same entry twice, and precludes the need to check for dummy
  entries. (A sketch follows this list.)

* A more advanced example is a self-organizing search so that the most
  frequently accessed entries get probed first. The organization
  adapts if the access pattern changes over time. Treaps are ideally
  suited for self-organization with the most common entries at the
  top of the heap and a rapid binary search pattern. Most probes and
  results are all located at the top of the tree, allowing them all to
  be located in one or two cache lines.

* Also, small dictionaries may be made more dense, perhaps filling all
  eight cells to take the maximum advantage of two cache lines.
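
As an illustration of the linear-search idea, a minimal Python sketch
over contiguous (hash, key, value) entries; the tuple layout stands in
for the C array of entry slots:

    def linear_lookup(entries, key):
        # 'entries' is a short list of (hash, key, value) tuples with no gaps,
        # so there are no dummy slots to skip and the scan is bounded by its length.
        h = hash(key)
        for entry_hash, entry_key, entry_value in entries:
            if entry_hash == h and entry_key == key:
                return entry_value
        raise KeyError(key)

    small = [(hash(k), k, v) for k, v in [('x', 1), ('y', 2), ('z', 3)]]
    print(linear_lookup(small, 'y'))    # 2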

Strategy Pattern
----------------

Consider allowing the user to set the tunable parameters or to select a
particular search method. Since some dictionary use cases have known
sizes and access patterns, the user may be able to provide useful hints.

1) For example, if membership testing or lookups dominate runtime and memory
   is not at a premium, the user may benefit from setting the maximum load
   ratio at 5% or 10% instead of the usual 66.7%. This will sharply
   curtail the number of collisions but will increase iteration time.
   The builtin namespace is a prime example of a dictionary that can
   benefit from being highly sparse.

2) Dictionary creation time can be shortened in cases where the ultimate
   size of the dictionary is known in advance. The dictionary can be
   pre-sized so that no resize operations are required during creation.
   Not only does this save resizes, but the key insertions will go
   more quickly because the first half of the keys will be inserted into
   a more sparse environment than before. The preconditions for this
   strategy arise whenever a dictionary is created from a key or item
   sequence and the number of *unique* keys is known. (A sketch of a
   hypothetical pre-sizing hint follows this list.)

3) If the key space is large and the access pattern is known to be random,
   then search strategies exploiting cache locality can be fruitful.
   The preconditions for this strategy arise in simulations and
   numerical analysis.

4) If the keys are fixed and the access pattern strongly favors some of
   the keys, then the entries can be stored contiguously and accessed
   with a linear search or treap. This exploits knowledge of the data,
   cache locality, and a simplified search routine. It also eliminates
   the need to test for dummy entries on each probe. The preconditions
   for this strategy arise in symbol tables and in the builtin dictionary.
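
To make the pre-sizing idea in item 2 concrete, here is a hedged sketch
of what such a hint might look like. The size_hint parameter is
hypothetical; current CPython exposes no public pre-sizing hook, so the
helper below only documents the intended call shape:

    def build_dict(pairs, size_hint=None):
        # Hypothetical: a pre-sizing implementation would reserve size_hint
        # slots here so that no resizes occur while the keys are inserted.
        d = {}
        for k, v in pairs:
            d[k] = v
        return d

    pairs = [('red', 1), ('green', 2), ('blue', 3)]
    d = build_dict(pairs, size_hint=len(set(k for k, v in pairs)))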

Readonly Dictionaries
---------------------

Some dictionary use cases pass through a build stage and then move to a
more heavily exercised lookup stage with no further changes to the
dictionary.

An idea that emerged on python-dev is to be able to convert a dictionary
to a read-only state. This can help prevent programming errors and also
provide knowledge that can be exploited for lookup optimization.

The dictionary can be immediately rebuilt (eliminating dummy entries),
resized (to an appropriate level of sparseness), and the keys can be
jostled (to minimize collisions). The lookdict() routine can then
eliminate the test for dummy entries (saving about 1/4 of the time
spent in the collision resolution loop).

An additional possibility is to insert links into the empty spaces
so that dictionary iteration can proceed in len(d) steps instead of
(mp->mask + 1) steps. Alternatively, a separate tuple of keys can be
kept just for iteration.
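
For the error-prevention half of this idea, later Python versions offer
a read-only mapping view. Note that it only freezes access; it does not
rebuild, resize, or re-optimize the underlying table as described above:

    import types

    d = {'pi': 3.14159, 'e': 2.71828}
    frozen = types.MappingProxyType(d)      # read-only view over the build-stage dict

    print(frozen['pi'])                     # lookups work as usual
    try:
        frozen['tau'] = 6.28318             # writes through the view are rejected
    except TypeError as exc:
        print(exc)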

Caching Lookups
---------------

The idea is to exploit key access patterns by anticipating future lookups
based on previous lookups.

The simplest incarnation is to save the most recently accessed entry.
This gives optimal performance for use cases where every get is followed
by a set or del to the same key.
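
A minimal pure-Python sketch of that simplest incarnation (the wrapper
class is invented for illustration and only intercepts the three methods
shown; the actual proposal would live inside the C lookup routine):

    class LastEntryCachingDict(dict):
        # Remembers the most recently accessed entry so that an immediately
        # following access to the same key skips the full lookup.
        def __init__(self, *args, **kwargs):
            dict.__init__(self, *args, **kwargs)
            self._last = None                       # (key, value) of the last access

        def __getitem__(self, key):
            if self._last is not None and self._last[0] == key:
                return self._last[1]                # served from the one-entry cache
            value = dict.__getitem__(self, key)
            self._last = (key, value)
            return value

        def __setitem__(self, key, value):
            dict.__setitem__(self, key, value)
            self._last = (key, value)               # keep the cache coherent on writes

        def __delitem__(self, key):
            dict.__delitem__(self, key)
            if self._last is not None and self._last[0] == key:
                self._last = None                   # invalidate on deletion

    d = LastEntryCachingDict(a=1)
    d['b'] = d['a'] + 1     # the get of 'a' primes the cache; the set then updates it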