Add a new interface for external profilers and debuggers to get the call stack efficiently #115946
This certainly requires the attention of @markshannon and @Fidget-Spinner.
Looking at the code in the PR I see one pointer write, one counter increment, and several reads. We probably want a benchmark run.
Absolutely! Also, the PR doesn't work in its current state because I am missing a lot of the work in the new interpreter structure, so I may need some help to know where to put the tracking code. EDIT: I have also updated the description to reflect that it's not just one write.
@gvanrossum Do you think it's worth discussing this in one of the Faster CPython syncs?
Perhaps. Especially if you're not 100% satisfied with my answers. (Wait a day for @markshannon to pitch in as well.) I don't know if you still have the meeting in your calendar -- we haven't seen you in a while, but of course you're always welcome! We now meet every other week. The next one is Wed March 13. The Teams meeting is now owned by Michael Droettboom.
I am satisfied (and thanks a lot for taking the time to read my comments and to answer, I really appreciate it). It's more that it seems there are some moving pieces and some compromises to be made, and I think a realtime discussion can make this easier.

I do still have it! Should I reach out to Michael (after leaving time for @markshannon to answer first) to schedule the topic?
Agreed.
Wouldn't hurt to get this on the agenda early.
👍 I sent @mdboom an email
While this might make life easier for profilers and out-of-process debuggers (what's wrong with in-process debuggers?), it will make all Python programs slower.
Think of this not just as a way to make profilers' lives easier, but also as a way to keep them from breaking once we start inlining things or the JIT stops pushing frames. If we don't do this, what do you propose we do to avoid breaking these tools once the JIT is there, or once we start inlining things and frames start to go missing?
That you cannot use them if Python is frozen or deadlocked, and you need them if you want to attach to an already-running process.
I'd like to remind all participants in this conversation to remain open to the issues others are trying to deal with. We will eventually need some kind of compromise. Please try working towards one.
@markshannon What would be an acceptable approach/compromise for you to ensure that we can still offer reliable information for these tools in a way that doesn't impact the runtime too much?
@pablogsal I wonder if you could write a bit more about how such tools currently use the frame stack (and how this has evolved through the latest Python versions, e.g. how they coped with 3.10/3.11 call inlining, whether 3.12 posed any new issues, and how different 3.13 currently is)? And I'd like to understand how having just a stack of code objects (as this PR seems to be doing) will help. I think I recall you mentioned that reading the frame stack is inefficient -- a bit more detail would be helpful.

IIUC the current JIT still maintains the frame stack, just as the Tier 2 interpreter does -- this is done using the …

The current (modest) plans for call inlining in 3.13 (if we do it at all) only remove the innermost frame, and only if that frame doesn't make any calls into C code that could recursively enter the interpreter. (This is supported by escape analysis, which can be overridden by the "pure" qualifier in bytecodes.c.) When you catch it executing uops (either in the JIT or in the Tier 2 interpreter) the frame is that of the caller and the inlined code just uses some extra stack space. (I realize there's an open issue regarding how we'd calculate the line number when the VM is in this state.)
Absolutely! Let me try to describe the general idea here, and if any specific aspects are not clear, tell me and we can go into detail. This way you don't need to read a big wall of text with too much detail.

How the tools work

All the tools do variations of the same thing. The setup is that there are two processes: the Python process being traced and the debugger/profiler (which can be written in any language). The algorithm is as follows (go directly to step 5 if you want to read specifically how the frame stack is used):
If you want to get the idea by reading code, you can read the test that I added for this here: https://github.com/python/cpython/blob/main/Modules/_testexternalinspection.c . This test does a simplified version of what these tools do (it only fetches function names, not file names or line offsets).

How having just a stack of code objects will help

This is a compromise. The best thing would have been to get all the frame objects in a contiguous array. That way you don't need to pay one system call per frame plus a bunch of system calls per code object; you could copy ALL THE FRAMES in one system call. The reason I decided to use just code objects is that I assumed that with inlining, the JIT and other optimisations, having the frames themselves there will be very difficult, as frames will be changing or missing. Having code objects would be a much more stable interface for the tools (code objects don't change that much), and it would be easier for the optimisations to maintain the state (as frames tend to mutate a lot depending on several factors, such as when ownership is handed over to a PyFrameObject* and more). Unfortunately, with this method tools won't be able to get the current line being executed, so this is indeed suboptimal, but this is precisely one of the compromises I was trying to make.

Why reading the frame stack is inefficient

This is inefficient because every time you copy memory from the traced process you pay for a system call. Locating the current thread state only needs to be done once. Every time you take a snapshot you need to do:
for every frame. This means that you pay …

This is a problem especially for profilers, because the VM can change the stack as this is happening, yielding incorrect results at high stack levels. So reducing the number of system calls is a big deal.

How they coped with 3.12 inlining

To get the Python stack, tools just needed to add an extra step between 4 and 5 that fetches the …

The painful point was that these tools also want to merge the C stack with the Python stack. That is, instead of showing just the Python stack or the C stack (like in GDB), they want to show both merged. This is achieved by substituting every …

Please tell me if any of this is too confusing or you need more context. I tried my best to summarise it without writing a wall of text, but it is a bit challenging to add just the minimum context, so feel free to tell me if I need to explain something in more detail, show you example code, or help in any other way :)
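To make that cost concrete, here is a minimal, hypothetical sketch (not from this thread) of the per-frame copy loop such a tool runs on Linux. It uses process_vm_readv via ctypes; the helper names and the struct offsets (FRAME_CODE_OFFSET, FRAME_PREVIOUS_OFFSET) are placeholders and would have to match the exact CPython version's _PyInterpreterFrame layout in a real tool.

```python
# Hypothetical sketch: walking a remote frame stack costs one read per frame.
# Offsets below are placeholders; real tools derive them from the exact
# CPython version's _PyInterpreterFrame layout (64-bit Linux assumed).
import ctypes
import struct

class iovec(ctypes.Structure):
    _fields_ = [("iov_base", ctypes.c_void_p), ("iov_len", ctypes.c_size_t)]

libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.process_vm_readv.restype = ctypes.c_ssize_t
libc.process_vm_readv.argtypes = [
    ctypes.c_int, ctypes.POINTER(iovec), ctypes.c_ulong,
    ctypes.POINTER(iovec), ctypes.c_ulong, ctypes.c_ulong,
]

def read_remote(pid, addr, size):
    """Copy `size` bytes at `addr` out of the traced process: one system call."""
    buf = ctypes.create_string_buffer(size)
    local = iovec(ctypes.cast(buf, ctypes.c_void_p), size)
    remote = iovec(ctypes.c_void_p(addr), size)
    nread = libc.process_vm_readv(pid, ctypes.byref(local), 1,
                                  ctypes.byref(remote), 1, 0)
    if nread != size:
        raise OSError(ctypes.get_errno(), f"failed to read {size} bytes at {addr:#x}")
    return buf.raw

# Placeholder offsets (hypothetical, version-dependent).
FRAME_CODE_OFFSET = 0x00      # pointer to the frame's code object
FRAME_PREVIOUS_OFFSET = 0x08  # pointer to the previous (caller) frame

def remote_code_pointers(pid, current_frame_addr):
    """Walk the remote frame linked list: at least one read per frame."""
    code_pointers = []
    frame = current_frame_addr
    while frame:
        raw = read_remote(pid, frame, 16)  # one system call per frame
        code_ptr = struct.unpack_from("Q", raw, FRAME_CODE_OFFSET)[0]
        previous = struct.unpack_from("Q", raw, FRAME_PREVIOUS_OFFSET)[0]
        code_pointers.append(code_ptr)
        frame = previous
    # Each code pointer still needs a few more reads to resolve its name and
    # file name, which is where the extra 2-3 system calls per frame come from.
    return code_pointers
```

With the contiguous array of code object pointers this issue proposes, the loop above collapses to a single read_remote call for the whole array (plus the per-code-object lookups, which can be cached).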
This is super helpful! It looks like you would be a fan of @Fidget-Spinner's proposal for a different layout of the frame stack: …

That is not an accepted proposal, but if we could make that work without slowing things down compared to the current approach, it would presumably work for you too -- you'd get all the frames in a single array! (Or maybe it's chunked and there might be a few segments to it, but still, many frames per chunk.)

Coming back to @markshannon's objection, it is indeed difficult to accept even a very small slowdown to support tools like this, since most users won't ever be using those tools. An obvious compromise would be to have some sort of option so users can choose between more speed or trace tool support, without giving up all speed. Ideally there would be a command-line flag (some …

Regarding the JIT causing frames to go missing, that's a slightly separate topic into which I have less insight. (I believe currently it faithfully implements …)
I totally understand this position. I would ask then that we at least try not to make it impossible for these tools to work. So even if they need to do a bit of extra work to copy some extra fields in the case of the JIT or inlining, let's at least ensure that we don't make it so hard that it's not feasible. These tools have lots of users, so preventing them from working (even if they need to do a lot of work to keep working) would be quite a breaking change.
I think that is an acceptable compromise (although it doesn't come without challenges). For profilers it would be totally OK, because asking users to run Python in a specific way when profiling is fine. For debuggers it's more challenging. Users normally reach for debuggers when Python is frozen, deadlocked or has crashed (note that all of this analysis can be done on a core file as well, not only on a live process) and they want to know what the stack is. This use case will be noticeably affected, because you don't know whether you are going to need a debugger until you do, at which point it's too late. On the other hand, I think it's OK if we say "you can sacrifice a bit of speed if you want to be able to use one of these tools".
The problem with a compile-time flag is that most users won't be able to use these tools, because most users do not compile Python themselves. I think this would be too much, as it asks a lot of the end user.
I guess if you have one isolated hang in production you can't debug it well, alas. (Though presumably things aren't hopeless: there will still be a current frame in the thread state, and it will point to a recent ancestor of where you actually are.) But once you have multiple hangs you can advise a team to turn on a flag so their hangs will be more debuggable. Understood about the build-time flag, those are always a pain.
Yeah, I think missing one frame at the end is not bad at all. I was working on this PR in case we plan to start missing a lot of intermediate frames, or a lot of consecutive ones, in which case the traces would be very confusing.
Just to clear this up: when talking about Python frames, the effect of the JIT can be ignored. It just does whatever the tier two interpreter does.

When dealing with C frames, the difference is that there may be frames with no debugging info in the C stack. That could be a challenge for tools that try to interleave the two stacks, or are specifically inspecting C code. However, I think frame pointers are still present, so it should be possible to unwind through JIT frames without knowing much about them (@mdboom recently had some success using …).

In the future, we could explicitly emit debug info in much the way that the …
I think this is because his use case only uses the leaf frame. If you look at every individual stack trace, they are probably incomplete. I will run a test tomorrow to confirm. BTW: I have accumulated quite a lot of experience working with perf and the JIT interface, so I may be able to help with that particular issue.
First, let's worry about that when it actually happens. Second, mapping the actual machine state back to the canonical VM state is a more general and tricky problem, needing a carefully designed solution. We need a solution that allows the original state to be reconstructed, but at a low enough cost to not undermine those optimizations.
Looking at the list of system calls.
Once the first system call has been performed, the only way you can get incorrect information is if the code object is destroyed before you have completed calls 2-4. I don't see how your proposal would prevent that.
For profilers this is a potential problem because they don't stop the program but they are prepared to deal with this (detect corruption or incorrect object copies). They need to do this race-y operation anyway because they must get the stack without stopping the process. For debuggers it's not a problem at all because the process is either stopped or it has already crashed.
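Purely as an illustration of that "prepared to deal with this" point (the helper names here are hypothetical, not any real tool's API), a sampling profiler can bracket the stack walk with consistency checks and simply drop samples that raced with the running VM:

```python
# Illustrative sketch only: reject samples that may have raced with the VM.
MAX_REASONABLE_DEPTH = 4096  # arbitrary sanity bound on stack depth

def sample_with_validation(pid, thread_state_addr, read_current_frame, walk_stack):
    """Take one stack sample; return None if it looks torn or corrupted."""
    before = read_current_frame(pid, thread_state_addr)  # remote read
    try:
        stack = walk_stack(pid, before)                   # many remote reads
    except OSError:
        return None  # a frame was freed while we were walking: drop the sample
    if len(stack) > MAX_REASONABLE_DEPTH:
        return None  # a garbage "previous" pointer sent us off into the weeds
    if read_current_frame(pid, thread_state_addr) != before:
        return None  # the VM moved on mid-sample, so the copy may be torn
    return stack
```

Debuggers can skip all of this because, as noted above, the process is stopped or has already crashed.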
What do you mean by "2-4"? I'm not sure I understand the question correctly.
That's correct that for my use case I'm only collecting leaf frames, so I don't really know if it can seek "up through" a JIT frame or not. |
Items 2 to 4 on the list above, that is:
Why do these need to be done every sample? Can't they be done once per code object and cached?
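One way such caching could look in a tool (illustrative only; fetch_code_metadata is a hypothetical helper that performs the 2-3 remote reads needed to resolve one code object's function name and file name):

```python
# Illustrative only: cache the per-code-object reads across samples.
code_cache = {}  # remote PyCodeObject address -> (function name, file name)

def code_metadata(pid, code_ptr, fetch_code_metadata):
    if code_ptr not in code_cache:
        code_cache[code_ptr] = fetch_code_metadata(pid, code_ptr)  # remote reads
    return code_cache[code_ptr]

def resolve_stack(pid, code_pointers, fetch_code_metadata):
    # After the first few samples most lookups hit the cache, so the steady-state
    # cost of a sample is dominated by the frame walk itself.
    # Caveat: a real tool also has to cope with a code object being destroyed and
    # its address reused, which would make a stale cache entry silently wrong.
    return [code_metadata(pid, ptr, fetch_code_metadata) for ptr in code_pointers]
```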
It looks like
The "trick" to getting them to show is to decrease the threshold for which Note also these are "anonymous" JIT frames -- we aren't currently adding |
Hummm, that's interesting, but it's still not conclusive, because we are not seeing full, independent stacks that we can check are not missing anything. We should inspect the actual full per-event stacks. We can do that by running "perf script" over the file and checking that we are not missing frames. I will run a bunch of tests if someone can point me to some example code that is sure to get jitted and has a bunch of stack frames.
Additionally there is another check: perf uses libunwind (or libdw) as its unwinder. What happens if you try to unwind using libunwind under a jitted frame? @brandtbucher and I did that test back in the day and it failed. Also, what happens if gdb tries to unwind? That's an easier check to do 🤔
```python
import operator

def f(n):
    if n <= 0:
        return
    for _ in range(16):
        operator.call(f, n - 1)

f(6)
```

You should usually see …
One important thing has changed since then: the JIT no longer omits frame pointers when compiling templates. So this may just be fixed now (not sure). |
Ok, I tested with the latest JIT and I can confirm neither gdb nor libunwind can unwind through the JIT. I used this extension:
with @brandtbucher's example:
Attaching with gdb and unwinding shows:
and libunwind shows:
This is with frame pointers:
I will check with perf later in the day |
I checked and I can confirm perf cannot unwind through the JIT (with or without frame pointers). For this I used @brandtbucher's script:
I started it with
I let it run for a while and then ran
So as you can see, every time perf hits a jitted frame, it fails to unwind. @mdboom I think what you are getting in your script is these partial frames reordered, not the full stack. You can confirm by running
Note that if we confirm this, it means that any current attribution by perf to jitted frames is inaccurate.
@pablogsal: Can you be more specific about how you are running
If I'm understanding correctly, I see how this means that where the JIT'ted code sits in the stack is inaccurate, but I don't see how it follows that the time attributed to JIT'ted code is inaccurate (which I realize is the easier subset of this whole problem). The numbers certainly add up and make sense.
Assuming your result file is called "perf.data" (the default), you can just run
This is not a problem if you only account for leaf frames and we believe perf is correctly identifying jitted frames in all cases under these conditions. If you need to account for callers, then because perf cannot unwind the stack it cannot discover that above a given frame there are multiple jitted frames (like in @brandtbucher's example), and therefore anything you measure other than self-time (as opposed to self-time plus children time, as in flamegraphs) will be inaccurate.
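A toy illustration of that distinction (not perf's actual data model): self-time only needs the leaf frame of each sample, while cumulative "self + children" time needs every caller, so truncated stacks corrupt only the latter.

```python
# Toy example: attribution from sampled stacks, each ordered leaf-first.
from collections import Counter

def attribute(samples):
    self_time = Counter(stack[0] for stack in samples)  # leaf frame only
    cumulative = Counter(frame for stack in samples for frame in set(stack))
    return self_time, cumulative

full = [["jit:g", "jit:f", "main"]] * 4  # what a correct unwind would record
truncated = [["jit:g"]] * 4              # what you get if unwinding stops at the JIT frame

# Self-time for "jit:g" is 4 samples either way, but "jit:f" and "main" lose
# all of their cumulative time in the truncated stacks.
print(attribute(full))
print(attribute(truncated))
```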
Got it, that makes sense. I can confirm the output of |
Disclaimer: I work on the continuous profiler for Python at Datadog. Our profiler, and I believe py-spy too, follows the pattern described in @pablogsal's comment here. Our profiler works as an outside-in profiler, meaning it runs in a separate native thread in the same process as the profiled Python application. It doesn't rely on the GIL when copying CPython internal state. From CPython 3.11+, we've been using …

This also made me wonder whether there could be something equivalent to …

More broadly, I'm interested in CPython having a C API that could be used by profiler and debugger tools out there to get stack traces as efficiently as possible. I believe there's a Python API for doing so, but the fact that there are quite a few tools following the same pattern using …
Currently, external profilers and debuggers need to issue one system call per frame object when retrieving the call stack from a remote process, as the frames form a linked list. Furthermore, for every frame object they need at least two or three more system calls to retrieve the code object and the code object's name and file. This has several disadvantages:
For these reasons, I propose to add a new interface consisting of a contiguous array of pointers that always contains all the code objects necessary to retrieve the call stack. This has the following advantages:
process_vm_readv).

Linked PRs