Save a few bits in dicts #96472

Closed
matthiasgoergens opened this issue Sep 1, 2022 · 9 comments
Labels: 3.13 (bugs and security fixes), interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage), type-feature (A feature request or enhancement)

Comments

matthiasgoergens (Contributor) commented Sep 1, 2022

(Ported from faster-cpython/ideas.)

At the moment, dicts vary the integer type used for their indices depending on their size.

As a minor complication, indices are stored as signed integers, and thus they shift to the next bigger int size at 1<<7, 1<<15 and 1<<31.

However, we can double all those boundaries, and thus save one, two or four bytes per index for sizes that fall between the old and revised boundaries.

Background

Indices in dicts are of type Py_ssize_t. Non-negative indices point to a position in DK_ENTRIES or DK_UNICODE_ENTRIES. Negative indices indicate one of four special conditions:

#define DKIX_EMPTY (-1)
#define DKIX_DUMMY (-2)  /* Used internally */
#define DKIX_ERROR (-3)
#define DKIX_KEY_CHANGED (-4) /* Used internally */
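
For context, the probing loop consumes these values roughly like this (a paraphrase of the lookup in Objects/dictobject.c; details vary across versions):

    Py_ssize_t ix = dictkeys_get_index(dk, i);
    if (ix >= 0) {
        /* Slot is in use: compare against the key stored at DK_ENTRIES(dk)[ix]. */
    }
    else if (ix == DKIX_EMPTY) {
        /* Never-used slot: the key is not in the dict, stop probing. */
    }
    else {
        /* DKIX_DUMMY marks a deleted slot, so probing continues;
           DKIX_ERROR and DKIX_KEY_CHANGED are mainly used as return codes
           from the lookup rather than values found in the table. */
    }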

dk_indices stores these indices in the smallest int variant that fits all the values.
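
For reference, the current width selection looks roughly like this (paraphrased from Objects/dictobject.c; the exact form may differ between versions, but the thresholds are exactly the signed boundaries mentioned above):

#define DK_IXSIZE(dk)                         \
    (DK_LOG_SIZE(dk) <= 7 ?                   \
        1 : DK_LOG_SIZE(dk) <= 15 ?           \
            2 : DK_LOG_SIZE(dk) <= 31 ?       \
                4 : sizeof(int64_t))

/* Under this proposal, the comparisons would become <= 8, <= 16 and <= 32,
 * since a whole sign bit no longer has to be reserved just to encode the
 * four special values. */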

Optimization

The size of dk_indices is always a power of 2. But because we want to avoid hash collisions, the size of DK_ENTRIES is always a fair bit smaller than that. In the current implementation, it's two-thirds of the size of dk_indices. But the details don't matter and might be subject to change. What matters is that we always have at least four unused bit patterns.
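
Concretely, the current ratio comes from this macro in Objects/dictobject.c, so a 256-slot index table holds at most 170 entries (indices 0 through 169):

#define USABLE_FRACTION(n) (((n) << 1)/3)   /* (256 << 1) / 3 == 170 */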

To illustrate with an example for a single byte:

In a signed int8_t, every bit pattern at or above 0b10000000 normally counts as a negative number; everything below is non-negative. That's the interpretation C applies when converting from int8_t to Py_ssize_t.

However, the lowest negative number we need is only -4, which corresponds to 0b11111100. And the biggest positive index we need is 169, which corresponds to 0b10101001. There's a big gap between the two bit patterns, and we can programmatically detect which case we are in. Instead of adding more prose, let's look at the C code:

#define DKIX_LOWEST_RESERVED (-4) /* Used internally */
static inline Py_ssize_t
dictkeys_get_index(const PyDictKeysObject *keys, Py_ssize_t i)
{
    int log2size = DK_LOG_SIZE(keys);
    Py_ssize_t ix;

    if (log2size <= 8) {
        const uint8_t *indices = (const uint8_t*)(keys->dk_indices);
        const uint8_t uix = indices[i];
        if (uix >= (uint8_t)DKIX_LOWEST_RESERVED) {
            /* 0xFC..0xFF are the four reserved bit patterns; reinterpret
               them as the original negative numbers. */
            ix = (int8_t) uix;
        } else {
            ix = uix;
        }
    }
    /* ... the 16-, 32- and 64-bit branches follow the same pattern ... */

For comparison, the status quo looks like this:

static inline Py_ssize_t
dictkeys_get_index(const PyDictKeysObject *keys, Py_ssize_t i)
{
    int log2size = DK_LOG_SIZE(keys);
    Py_ssize_t ix;

    if (log2size < 8) {
        const int8_t *indices = (const int8_t*)(keys->dk_indices);
        ix = indices[i];
    }
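
For completeness, the store path would need the mirror-image treatment. dictkeys_set_index is the matching function in CPython, but the body below is only an illustrative sketch of how its 1-byte branch could look under this scheme, not the actual prototype code:

static inline void
dictkeys_set_index(PyDictKeysObject *keys, Py_ssize_t i, Py_ssize_t ix)
{
    int log2size = DK_LOG_SIZE(keys);

    if (log2size <= 8) {
        uint8_t *indices = (uint8_t*)(keys->dk_indices);
        /* A truncating store keeps the low 8 bits: the reserved values
           -4..-1 land on 0xFC..0xFF, and real indices (at most 169 for a
           256-slot table) land on 0x00..0xA9, exactly the bit patterns
           the load path above expects. */
        indices[i] = (uint8_t)ix;
    }
    /* ... wider sizes as before ... */
}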

I have a prototype and also ran some benchmarks. Here are two examples:

[Plot: dict_bit_mem (memory usage)]

In this first plot, you can see that we sometimes save a bit of memory.

[Plot: dict_bit (runtime)]

This plot shows that the difference in runtime is pretty much a wash.

So overall, a modest memory saving under certain circumstances, but at no cost in speed.

I'm happy to run some more benchmarks, if anyone has ideas or suggestions.


P.S. Instead of mucking around with bits, we could also just add and subtract 4 (when storing / loading from the indices array). That would save some branching.

Alas, for some reason, I can't seem to make that work. I think there might be some other parts of the code that rely on the exact bit representation in dk_indices. My initial proposal has the benefit of not changing what any of these bits mean, especially not the bit patterns for the special entries.

If we can fix that one, I think that just adding and subtracting would probably be easier and perhaps a tiny bit faster.
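
For concreteness, here is a tiny stand-alone demonstration of the add/subtract variant from the P.S. (a toy model with made-up helper names, not CPython code or the actual diff):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Values are stored shifted up by 4, so the four reserved negatives occupy
 * the byte patterns 0..3 and real indices start at 4.  Everything from -4
 * up to 251 round-trips through a single byte. */
static uint8_t store_index(int64_t ix)  { return (uint8_t)(ix + 4); }
static int64_t load_index(uint8_t byte) { return (int64_t)byte - 4; }

int main(void)
{
    for (int64_t ix = -4; ix <= 251; ix++) {
        assert(load_index(store_index(ix)) == ix);
    }
    printf("all values from -4 to 251 round-trip in one byte\n");
    return 0;
}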

Linked PRs

matthiasgoergens added the type-feature label Sep 1, 2022
mdboom added the performance label Sep 1, 2022
rhettinger (Contributor) commented:

-1 This adds code to a hot inlined code path with very tight code. It makes all dicts pay for something that only slightly benefits a rare dict instance that happens to be sitting right at a size boundary. A general rule for optimization is "don't put non-essential code on a hot path".

matthiasgoergens (Contributor, Author) commented Sep 2, 2022

@rhettinger Another general rule for optimization that I heard is "Don't assume, measure." Modern compilers, processors, etc. can be quite surprising.

Do you have any kind of benchmark in mind that you expect would show a slowdown?

No need to write any code; I'm happy to do the legwork. Just a general idea is enough. I already ran a few benchmarks myself, but perhaps I missed something.

If we can find anything that's slowed down, I'll update the entry on faster-cpython/ideas, so that future efforts can stand on our shoulders. (Basically, so that the next person who has this idea doesn't have to start from scratch, but can see what happened before and under which scenarios, if any, we had a slowdown.)

P.S. I can also code up a variant that has no impact on the hot path. The cost of that is a bigger diff in the PR.

I haven't done so, because in all my benchmarks, I could not detect any impact from the simpler approach. So no need to mitigate something that didn't seem to exist.

rhettinger (Contributor) commented:

> Another general rule for optimization that I heard is "Don't assume, measure." Modern compilers, processors, etc. can be quite surprising.

Measurement is hard. Almost every easy-to-write timing won't capture the branch-prediction penalties that we will encounter in production. When I was working on micro-optimizations for deques and sets, it was common to encounter changes that sped up the Clang build but slowed the GCC build, or vice versa. Unless it sped up both, the optimization was questionable. The effects also differed between 32-bit and 64-bit builds. Python runs on Intel, Solaris, ARM, Apple Silicon, Crays, etc. What is cheap on one machine or compiler may be expensive on another, and we end up tuning to the local minimum of your particular setup. The difficulty of measuring is why we need tools like VTune and Valgrind to reveal what is really happening. It also helps to think algorithmically: how many logic steps are being performed, which branches are predictable in real code, and so on.

I have many years of detailed optimization work under my belt and wish you wouldn't dismiss out of hand everything I say. It feels like all the review is just wasted on someone who will continually persist and hand-wave away valid comments. It feels more like a "pull demand" than a "pull request".

Putting extra code on a hot path is almost always detrimental to some builds or workloads. To justify this rare and tiny saving of a few bits, we need to be pretty certain that the extra code always has zero cost and that it will continue to have zero cost as the machines, code, and compilers change over time (e.g. WebAssembly, 32-bit builds, ARM machines, homebrew builds, etc.).

rhettinger (Contributor) commented:

Tim, Inada, and Mark, you are the other experts on this part of the code. Do you think the extra code is worth "saving a few bits" at the boundary points? My judgment and experience say no, but that doesn't count for much anymore.

matthiasgoergens (Contributor, Author) commented Sep 5, 2022

@rhettinger I see your concerns. That's part of why I already changed the prototype to have no extra branching, just one add and one subtract instead.

If you want to, we can also produce a version that doesn't even have those.

So the main idea is that we use the same bits to represent an index into a table, but also to represent a few special conditions.

At the moment, we reserve more than half the possible representations for our special conditions, but we actually only need four special values.

You seem worried about one extra add and subtract on the hot path? Fair enough.

(As I suggested earlier, I can most likely produce a version that avoids these extras: the idea would be that when we use our value as an index, we already do so as an addition to some pre-calculated offset (DK_ENTRIES). We could probably do some micro-optimizations to fold the new offset of 4 into the already existing offset; a toy sketch of what I mean follows below.

I'm not sure whether outguessing the compiler like this is worth it.)
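
To make that parenthetical concrete, here is a toy model (hypothetical names, not CPython code) of how the constant bias can disappear into the addressing:

#include <stdint.h>
#include <stdio.h>

/* Indices are stored with a +4 bias (see the P.S. above).  Applying the
 * constant -4 at the point of use lets most compilers fold it into the
 * addressing-mode displacement, so the hot path pays no extra arithmetic. */
typedef struct { int key; int value; } Entry;

enum { BIAS = 4 };   /* the four reserved values occupy stored 0..3 */

static int lookup_value(const Entry *entries, uint8_t stored)
{
    /* entries[stored - BIAS] typically compiles to a single load with a
       constant displacement of -BIAS * sizeof(Entry). */
    return entries[stored - BIAS].value;
}

int main(void)
{
    Entry entries[8] = { {10, 100}, {20, 200} };
    printf("value = %d\n", lookup_value(entries, 1 + BIAS));   /* prints 200 */
    return 0;
}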

I'm mostly interested in the memory saving for small dicts with sizes between 128 and 256, as I assume those are probably more common than sizes between (1<<15) and (1<<16).

I do agree that the memory savings are modest.
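
For a sense of scale at that boundary, here is a back-of-the-envelope calculation (my own numbers, derived from the sizes above rather than from the plots):

#include <stdint.h>
#include <stdio.h>

/* A dict whose index table has 256 slots holds at most (256 << 1) / 3 == 170
 * entries.  Today those indices need two bytes each (the signed one-byte
 * boundary is at 128 slots); with the doubled boundary they fit in one byte. */
int main(void)
{
    size_t slots  = 256;
    size_t before = slots * sizeof(int16_t);   /* 512 bytes of dk_indices */
    size_t after  = slots * sizeof(uint8_t);   /* 256 bytes of dk_indices */
    printf("dk_indices saving at 256 slots: %zu bytes per dict\n",
           before - after);
    return 0;
}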

markshannon (Member) commented:

@rhettinger
I like the new version without branching.

The PR increases the information density without any real increase in complexity, which seems like a win to me.
@methane?

rhettinger (Contributor) commented:

> I see your concerns. That's part of why I already changed the prototype (#96473) to have no extra branching, just one add and one subtract instead.

That is much improved. Addition and subtraction are very cheap.

rhettinger (Contributor) commented:

To be clear on just how much or how little memory is saved in an application, please run this model before and after the PR.

import sys

total_size = 0
num_dicts = 1_000 # number of dicts with len(d) == 1
for len_dict in range(1, 2000):
    d = dict.fromkeys(range(len_dict))
    total_size += sys.getsizeof(d) * num_dicts
    num_dicts = round(num_dicts * 0.98)  # geometric distribution
print(f'{total_size:,}')

Either assume a geometric distribution or Zipf's law or use empirical data from a large app.

iritkatriel added the 3.12 label Sep 25, 2022
matthiasgoergens (Contributor, Author) commented:

Sorry, I was quite busy with other things. I'm still working on this. (But if someone else wants to pick it up, feel free.)

iritkatriel added the interpreter-core label Nov 27, 2023
erlend-aasland added the 3.13 label and removed the 3.12 label Jan 5, 2024
rhettinger closed this as not planned Apr 23, 2025