Save a few bits in dicts #96472

Closed
matthiasgoergens opened this issue Sep 1, 2022 · 9 comments
Labels: 3.13 (bugs and security fixes), interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage), type-feature (A feature request or enhancement)

Comments

matthiasgoergens (Contributor) commented Sep 1, 2022

(Ported from faster-cpython/ideas.)

At the moment, dicts vary the integer type used for their indices depending on their size.

As a minor complication, indices are stored as signed integers, and thus they shift to the next bigger int size at 1<<7, 1<<15 and 1<<31.

However, we can double all those boundaries, and thus save one, two or four bytes per index for sizes that fall between the old and revised boundaries.

Background

Indices in dicts are of type Py_ssize_t. Non-negative indices point to a position in DK_ENTRIES or DK_UNICODE_ENTRIES. Negative indices indicate one of four special conditions:

#define DKIX_EMPTY (-1)
#define DKIX_DUMMY (-2)  /* Used internally */
#define DKIX_ERROR (-3)
#define DKIX_KEY_CHANGED (-4) /* Used internally */
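
For context, the probing loop consumes these values roughly like this (a paraphrase of the lookup in Objects/dictobject.c; details vary across versions):

    Py_ssize_t ix = dictkeys_get_index(dk, i);
    if (ix >= 0) {
        /* Slot is in use: compare against the key stored at DK_ENTRIES(dk)[ix]. */
    }
    else if (ix == DKIX_EMPTY) {
        /* Never-used slot: the key is not in the dict, stop probing. */
    }
    else {
        /* DKIX_DUMMY marks a deleted slot, so probing continues;
           DKIX_ERROR and DKIX_KEY_CHANGED are mainly used as return codes
           from the lookup rather than values found in the table. */
    }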

dk_indices stores these indices in the smallest int variant that fits all the values.
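
For reference, the current width selection looks roughly like this (paraphrased from Objects/dictobject.c; the exact form may differ between versions, but the thresholds are exactly the signed boundaries mentioned above):

#define DK_IXSIZE(dk)                         \
    (DK_LOG_SIZE(dk) <= 7 ?                   \
        1 : DK_LOG_SIZE(dk) <= 15 ?           \
            2 : DK_LOG_SIZE(dk) <= 31 ?       \
                4 : sizeof(int64_t))

/* Under this proposal, the comparisons would become <= 8, <= 16 and <= 32,
 * since a whole sign bit no longer has to be reserved just to encode the
 * four special values. */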

Optimization

The size of dk_indices is always a power of 2. But because we want to avoid hash collisions, the size of DK_ENTRIES is always a fair bit smaller than that. In the current implementation, it's two-thirds of the size of dk_indices. But the details don't matter and might be subject to change. What matters is that we always have at least four unused bit patterns.
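
Concretely, the current ratio comes from this macro in Objects/dictobject.c, so a 256-slot index table holds at most 170 entries (indices 0 through 169):

#define USABLE_FRACTION(n) (((n) << 1)/3)   /* (256 << 1) / 3 == 170 */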

To illustrate with an example for a single byte:

In a signed int8_t, every bit pattern at or above 0b10000000 normally counts as a negative number; everything below is non-negative. That's the interpretation C applies when converting from int8_t to Py_ssize_t.

However, the lowest negative number we need is only -4, which corresponds to 0b11111100. And the biggest positive index we need is 169, which corresponds to 0b10101001. There's a big gap between the two bit patterns, and we can programmatically detect which case we are in. Instead of adding more prose, let's look at the C code:

#define DKIX_LOWEST_RESERVED (-4) /* Used internally */
static inline Py_ssize_t
dictkeys_get_index(const PyDictKeysObject *keys, Py_ssize_t i)
{
    int log2size = DK_LOG_SIZE(keys);
    Py_ssize_t ix;

    if (log2size <= 8) {
        const uint8_t *indices = (const uint8_t*)(keys->dk_indices);
        const uint8_t uix = indices[i];
        if (uix >= (uint8_t)DKIX_LOWEST_RESERVED) {
            /* 0xFC..0xFF are the four reserved bit patterns; reinterpret
               them as the original negative numbers. */
            ix = (int8_t) uix;
        } else {
            ix = uix;
        }
    }
    /* ... the 16-, 32- and 64-bit branches follow the same pattern ... */

For comparison, the status quo looks like this:

static inline Py_ssize_t
dictkeys_get_index(const PyDictKeysObject *keys, Py_ssize_t i)
{
    int log2size = DK_LOG_SIZE(keys);
    Py_ssize_t ix;

    if (log2size < 8) {
        const int8_t *indices = (const int8_t*)(keys->dk_indices);
        ix = indices[i];
    }
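
For completeness, the store path would need the mirror-image treatment. dictkeys_set_index is the matching function in CPython, but the body below is only an illustrative sketch of how its 1-byte branch could look under this scheme, not the actual prototype code:

static inline void
dictkeys_set_index(PyDictKeysObject *keys, Py_ssize_t i, Py_ssize_t ix)
{
    int log2size = DK_LOG_SIZE(keys);

    if (log2size <= 8) {
        uint8_t *indices = (uint8_t*)(keys->dk_indices);
        /* A truncating store keeps the low 8 bits: the reserved values
           -4..-1 land on 0xFC..0xFF, and real indices (at most 169 for a
           256-slot table) land on 0x00..0xA9, exactly the bit patterns
           the load path above expects. */
        indices[i] = (uint8_t)ix;
    }
    /* ... wider sizes as before ... */
}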

I have a prototype and also ran some benchmarks. Here are two examples:

[Plot: dict_bit_mem (memory usage)]

In this first plot, you can see that we sometimes save a bit of memory.

[Plot: dict_bit (runtime)]

This plot shows that the difference in runtime is pretty much a wash.

So overall, a modest memory saving under certain circumstances, but at no cost in speed.

I'm happy to run some more benchmarks, if anyone has ideas or suggestions.


P.S. Instead of mucking around with bits, we could also just add and subtract 4 (when storing / loading from the indices array). That would save some branching.

Alas, for some reason, I can't seem to make that work. I think there might be some other parts of the code that rely on the exact bit representation in dk_indices. My initial proposal has the benefit of not changing what any of these bits mean, especially not the bit patterns for the special entries.

If we can fix that one, I think that just adding and subtracting would probably be easier and perhaps a tiny bit faster.
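
For concreteness, here is a tiny stand-alone demonstration of the add/subtract variant from the P.S. (a toy model with made-up helper names, not CPython code or the actual diff):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Values are stored shifted up by 4, so the four reserved negatives occupy
 * the byte patterns 0..3 and real indices start at 4.  Everything from -4
 * up to 251 round-trips through a single byte. */
static uint8_t store_index(int64_t ix)  { return (uint8_t)(ix + 4); }
static int64_t load_index(uint8_t byte) { return (int64_t)byte - 4; }

int main(void)
{
    for (int64_t ix = -4; ix <= 251; ix++) {
        assert(load_index(store_index(ix)) == ix);
    }
    printf("all values from -4 to 251 round-trip in one byte\n");
    return 0;
}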

Linked PRs

matthiasgoergens added the type-feature label Sep 1, 2022
mdboom added the performance label Sep 1, 2022
rhettinger (Contributor) commented:

-1 This adds code to a hot inlined code path with very tight code. It makes all dicts pay for something that only slightly benefits a rare dict instance that happens to be sitting right at a size boundary. A general rule for optimization is "don't put non-essential code on a hot path".

matthiasgoergens (Contributor, Author) commented Sep 2, 2022

@rhettinger Another general rule for optimization that I heard is "Don't assume, measure." Modern compilers, processors, etc. can be quite surprising.

Do you have any kind of benchmark in mind that you expect would show a slowdown?

No need to write any code; I'm happy to do the legwork. Just a general idea is enough. I already ran a few benchmarks myself, but perhaps I missed something.

If we can find anything that's slowed down, I'll update the entry on faster-cpython/ideas, so that future efforts can stand on our shoulders. (Basically, so that the next person who has this idea doesn't have to start from scratch, but can see what happened before and under which scenarios, if any, we had a slowdown.)

P.S. I can also code up a variant that has no impact on the hot path. The cost of that is a bigger diff in the PR.

I haven't done so, because in all my benchmarks, I could not detect any impact from the simpler approach. So no need to mitigate something that didn't seem to exist.

rhettinger (Contributor) commented:

> Another general rule for optimization that I heard is "Don't assume, measure." Modern compilers, processors, etc. can be quite surprising.

Measurement is hard. Almost every easy-to-write timing won't capture the branch-prediction penalties that we will encounter in production. When I was working on micro-optimizations for deques and sets, it was common to encounter changes that sped up the Clang build but slowed the GCC build, or vice versa. Unless it sped up both, the optimization was questionable. The effects also differed between 32-bit and 64-bit builds. Python runs on Intel, Solaris, ARM, Apple Silicon, Crays, etc. What is cheap on one machine or compiler may be expensive on another, and we end up tuning to the local minimum of your particular setup. The difficulty of measuring is why we need tools like VTune and Valgrind to reveal what is really happening. It also helps to think algorithmically: how many logic steps are being performed, which branches are predictable in real code, and so on.

I have many years of detailed optimization work under my belt and wish you wouldn't dismiss out of hand everything I say. It feels like all the review is just wasted on someone who will continually persist and hand-wave away valid comments. It feels more like a "pull demand" than a "pull request".

Putting extra code on a hot path is almost always detrimental to some builds or workloads. To justify this rare and tiny saving of a few bits, we need to be pretty certain that the extra code always has zero cost and that it will continue to have zero cost as the machines, code, and compilers change over time (e.g. WebAssembly, 32-bit builds, ARM machines, homebrew builds, etc.).

rhettinger (Contributor) commented:

Tim, Inada, and Mark, you are the other experts on this part of the code. Do you think the extra code is worth "saving a few bits" at the boundary points? My judgment and experience say no, but that doesn't count for much anymore.

matthiasgoergens (Contributor, Author) commented Sep 5, 2022

@rhettinger I see your concerns. That's part of why I already changed the prototype to have no extra branching, just one add and one subtract instead.

If you want to, we can also produce a version that doesn't even have those.

So the main idea is that we use the same bits to represent an index into a table, but also to represent a few special conditions.

At the moment, we reserve more than half the possible representations for our special conditions, but we actually only need four special values.

You seem worried about one extra add and subtract on the hot path? Fair enough.

(As I suggested earlier, I can most likely produce a version that avoids these extras: the idea would be that when we use our value as an index, we already do so as an addition to some pre-calculated offset (DK_ENTRIES). We could probably do some micro-optimizations to fold the new offset of 4 into the already existing offset; a toy sketch of what I mean follows below.

I'm not sure whether outguessing the compiler like this is worth it.)
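
To make that parenthetical concrete, here is a toy model (hypothetical names, not CPython code) of how the constant bias can disappear into the addressing:

#include <stdint.h>
#include <stdio.h>

/* Indices are stored with a +4 bias (see the P.S. above).  Applying the
 * constant -4 at the point of use lets most compilers fold it into the
 * addressing-mode displacement, so the hot path pays no extra arithmetic. */
typedef struct { int key; int value; } Entry;

enum { BIAS = 4 };   /* the four reserved values occupy stored 0..3 */

static int lookup_value(const Entry *entries, uint8_t stored)
{
    /* entries[stored - BIAS] typically compiles to a single load with a
       constant displacement of -BIAS * sizeof(Entry). */
    return entries[stored - BIAS].value;
}

int main(void)
{
    Entry entries[8] = { {10, 100}, {20, 200} };
    printf("value = %d\n", lookup_value(entries, 1 + BIAS));   /* prints 200 */
    return 0;
}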

I'm mostly interested in the memory saving for small dicts with sizes between 128 and 256, as I assume those are probably more common than sizes between (1<<15) and (1<<16).

I do agree that the memory savings are modest.
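
For a sense of scale at that boundary, here is a back-of-the-envelope calculation (my own numbers, derived from the sizes above rather than from the plots):

#include <stdint.h>
#include <stdio.h>

/* A dict whose index table has 256 slots holds at most (256 << 1) / 3 == 170
 * entries.  Today those indices need two bytes each (the signed one-byte
 * boundary is at 128 slots); with the doubled boundary they fit in one byte. */
int main(void)
{
    size_t slots  = 256;
    size_t before = slots * sizeof(int16_t);   /* 512 bytes of dk_indices */
    size_t after  = slots * sizeof(uint8_t);   /* 256 bytes of dk_indices */
    printf("dk_indices saving at 256 slots: %zu bytes per dict\n",
           before - after);
    return 0;
}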

markshannon (Member) commented:

@rhettinger
I like the new version without branching.

The PR increases the information density without any real increase in complexity, which seems like a win to me.
@methane?

rhettinger (Contributor) commented:

> I see your concerns. That's part of why I already changed the prototype (#96473) to have no extra branching, just one add and one subtract instead.

That is much improved. Addition and subtraction are very cheap.

rhettinger (Contributor) commented:

To be clear on just how much or how little memory is saved in an application, please run this model before and after the PR.

import sys

total_size = 0
num_dicts = 1_000 # number of dicts with len(d) == 1
for len_dict in range(1, 2000):
    d = dict.fromkeys(range(len_dict))
    total_size += sys.getsizeof(d) * num_dicts
    num_dicts = round(num_dicts * 0.98)  # geometric distribution
print(f'{total_size:,}')

Either assume a geometric distribution or Zipf's law or use empirical data from a large app.

iritkatriel added the 3.12 label Sep 25, 2022
matthiasgoergens (Contributor, Author) commented:

Sorry, I was quite busy with other things. I'm still working on this. (But if someone else wants to pick it up, feel free.)

iritkatriel added the interpreter-core label Nov 27, 2023
erlend-aasland added the 3.13 label and removed the 3.12 label Jan 5, 2024
rhettinger closed this as not planned Apr 23, 2025