Skip to content

Async Call-Stack Reconstruction #91048

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
mpage mannequin opened this issue Mar 1, 2022 · 29 comments
Open

Async Call-Stack Reconstruction #91048

mpage mannequin opened this issue Mar 1, 2022 · 29 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-asyncio type-feature A feature request or enhancement

Comments

@mpage
Copy link
Mannequin

mpage mannequin commented Mar 1, 2022

BPO 46892
Nosy @gvanrossum, @DinoV, @asvetlov, @1st1, @kumaraditya303, @itamaro, @mpage

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2022-03-01.22:13:38.844>
labels = ['interpreter-core', '3.11', 'type-feature', 'expert-asyncio']
title = 'Async Call-Stack Reconstruction'
updated_at = <Date 2022-03-12.09:30:23.838>
user = 'https://github.com/mpage'

bugs.python.org fields:

activity = <Date 2022-03-12.09:30:23.838>
actor = 'asvetlov'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Interpreter Core', 'asyncio']
creation = <Date 2022-03-01.22:13:38.844>
creator = 'mpage'
dependencies = []
files = []
hgrepos = []
issue_num = 46892
keywords = []
message_count = 5.0
messages = ['414301', '414884', '414930', '414948', '414985']
nosy_count = 7.0
nosy_names = ['gvanrossum', 'dino.viehland', 'asvetlov', 'yselivanov', 'kumaraditya', 'itamaro', 'mpage']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue46892'
versions = ['Python 3.11']

Linked PRs

@mpage
Copy link
Mannequin Author

mpage mannequin commented Mar 1, 2022

Profiling tools that use the call-stack (i.e. all of them) paint an incomplete picture of what’s really going on in async-heavy codebases. They can only show the stack of the currently executing task; they miss the chain of awaitables that are transitively waiting on the current task. To remedy this, we have added support in Cinder to expose the async call-stack. This consists of the call stack for the currently executing task, followed by the chain of awaitables that are transitively reachable from the currently executing task. See below for a clarifying example.

async def f1():
  return await f2()

async def f2():
  return await asyncio.ensure_future(f3())

async def f3():
  return await f4()

async def f4():
  await asyncio.sleep(10)
  return 42

When retrieved from f4, the two different stacks (top-of-stack last) are:
sync - [f3, f4]
async - [f1, f2, f3, f4]

We’d like to merge our implementation into CPython so that other heavy users of asyncio can benefit. This will consist of a few parts:

  1. A new limited C-API to set and retrieve the “awaiter” of an awaitable object.
  2. Additions to PyAsyncMethods to store function pointers for setting and retrieving the awaiter on instances.
  3. An API in managed code to retrieve the async call stack as a list of fully qualified names (i.e. <module>:<class>.<function>).

@mpage mpage mannequin added 3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-asyncio type-feature A feature request or enhancement labels Mar 1, 2022
@asvetlov
Copy link
Contributor

We’d like to merge our implementation into CPython

Could you provide a link first, please?

@mpage
Copy link
Mannequin Author

mpage mannequin commented Mar 11, 2022

Sorry for the confusion, I'm working on a PR. I filed the BPO to gauge interest in the feature.

@gvanrossum
Copy link
Member

I've recently dabbled a bit in some new primitives for asyncio, and based on that experience I think this would be very useful.

IIRC Trio does this (presumably at considerable cost) in userland.

@asvetlov
Copy link
Contributor

The idea looks interesting.
The devil in the details I guess.
I'm curious what is the memory and performance penalty.
Waiting for the PR as the discussion starting point.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
@ezio-melotti ezio-melotti moved this to Todo in asyncio Jul 17, 2022
@itamaro
Copy link
Contributor

itamaro commented Sep 5, 2022

somewhat related discussion (where this feature might have been helpful) - https://discuss.python.org/t/can-i-get-a-second-opinion-on-an-asyncio-issue/18471

this is the cinder 3.10 implementation of the cr_awaiter field

@gvanrossum
Copy link
Member

If someone wants to move this forward let them propose on a design here, and (if they're confident enough) submit a PR.

@gvanrossum
Copy link
Member

@mpage Are you still interested in working on this? I am still interested in having this as a feature in CPython!

@gvanrossum gvanrossum added 3.12 only security fixes and removed 3.11 only security fixes labels Sep 21, 2022
@mpage
Copy link
Contributor

mpage commented Sep 22, 2022

@gvanrossum Yes, still interested! Just haven't found the time yet to start working on it.

@gvanrossum
Copy link
Member

You might be interested in the existence of Task.get_stack(). Apparently it was part of the original asyncio code (I'd forgotten about it). I'm not sure if it addresses this problem in general (it's part of asyncio) or if it is fast enough or if it even works.

@mpage
Copy link
Contributor

mpage commented Sep 22, 2022

Task.get_stack() will only give you a call stack for that task. The interesting/useful part of this proposal is that it allows you to construct an "await stack" across Task boundaries. This produces a clearer/more complete picture of the control flow in many cases. Maybe there's a way to stitch the results of Task.get_stack() for multiple tasks that I'm not aware of.

For example, the code from the initial comment is codified in https://gist.github.com/mpage/584a02fc986d32b11b290c7032700369. Unfortunately you need Cinder in order to run it. When using Task.get_stack() we see the synchronous call stack for the task running the f3 and f4 coroutines. When using the new approach, we see the full "await stack," f1, f2, f3:

> ./python ~/local/scratch/trace_async_stacks.py
Stack retrieved from Task.get_stack():
  <module>
  run
  run_until_complete
  run_forever
  _run_once
  _run
  f3
Stack retrieved using awaiters:
  f1
  f2
  f3
>

@gvanrossum
Copy link
Member

Oh, you're right. Task.get_stack() is disappointing -- it knows a bit about how to get to the the stack frame of the coroutine running in the task, and then it just walks the call stack following f_back, which is uninteresting. I'm sorry I even mentioned it!

Looking through the Cinder code it seems that this requires a fair amount of C code (which is maybe why you haven't submitted your PR yet?). Is that fundamental or an optimization? How would pure Python code in 3.11 go about finding the awaiter of a suspended coroutine? Is there just no way? What if we know it's an asyncio task?

@gvanrossum
Copy link
Member

I just created a little proof-of-concept that gets the await chain for a task, unless that task is the current task. To get around that, you can create a dummy task and call it from there. Here's the basic code (sans dummy task hack):

def get_stack(task):
    coro = task.get_coro()
    frames = []
    while coro is not None and hasattr(coro, "cr_frame"):
        frame = coro.cr_frame
        frames.append(frame)
        coro = coro.cr_await
    return frames

It probably needs improvements (I see in Tasks.get_stack() that there are complications for legacy coroutines and async generators) but this seems to get the job done for suspended tasks.

Maybe that's enough if you need this in a debugging context? Or is this something where performance is important? Or am I missing something? (Maybe you need this when an uncaught exception is raised?)

@gvanrossum
Copy link
Member

Talking to myself here, my above code doesn't appear to be able to cross task boundaries -- I can get it to produce [f3, f4] for your example program (by calling it from a dummy helper task), but it's missing [f1, f2]. It looks like the iterator used by the Future object (either the Python or the C version) is impenetrable. Thoughts? (Gist here.)

@gvanrossum
Copy link
Member

I have a better version now. It relies on asyncio.Task, in particular the (public) asyncio.all_tasks() function and the (private) Task._fut_waiter attribute.
Output for a program similar to yours:

Task stack:
  Task: Task-1
    Coro: main
  Task: F1
    Coro: f1
    Coro: f2
  Task: F3
    Coro: f3

This leaves a lot to be desired if you're in a coroutine world other than asyncio, and it's pretty inefficient, since it must traverse the _fut_waiter links backwards, and the only way I can think of doing that (without changing asyncio itself) is to create a complete reverse map by walking all tasks. If you occasionally need this for serious debugging it's probably fine; if you routinely log it at various points "just in case" this sounds expensive. (If we decide to make this an official API in 3.12 we can do something about this of course.)

My question for you at this point is, do you really need to do this purely at the coroutine level, or is it acceptable for your use case(s) if this is essentially a Task API?

@gvanrossum
Copy link
Member

gvanrossum commented Sep 25, 2022

I just realized that f4 and f5 were missing from the output. A little tweak to the logic handles that -- the coro stack should be collected from the helper task as well. I've updated the gist, and the output is now:

Task: Task-1
  Coro: main stack.py:108
Task: F1
  Coro: f1 stack.py:82
  Coro: f2 stack.py:85
Task: F3
  Coro: f3 stack.py:88
Task: F4
  Coro: f4 stack.py:91
  Coro: f5 stack.py:94
  Coro: task_stack stack.py:69

@mpage
Copy link
Contributor

mpage commented Sep 28, 2022

Our primary use case for this feature is an always-on sampling profiler (roughly equivalent to GWP at Google). Periodically, for each machine in the fleet the profiler will wake up, grab the await stack for the currently running task, and finally send the await stack to a service that performs aggregation. The primary use of this data is to provide CPU-time profiling for application.

Performance is important for us. Since this is always on and pauses request processing we don't want the implementation to negatively impact performance of in-flight requests. We typically have a large number of active tasks (I'm somewhat embarrassed to admit that I don't have a number handy here), so I'm not sure the approach you've taken here would work for us. The logic for collecting the await stack is currently implemented as an eBPF probe and the awaiter implementation in Cinder simplifies that. The probe "only" has to walk the await stack backwards from the coroutine for the current task.

The approach you've taken is neat but unfortunately I don't think it can cross gather boundaries. I modified your example slightly to print the await stack of two gathered coroutines.

Your version:

> python3 ~/local/scratch/print_await_stack_upstream.py
Task: F4_0
  Coro: f4 ../../../../home/mpage/local/scratch/print_await_stack_upstream.py:95
  Coro: f5 ../../../../home/mpage/local/scratch/print_await_stack_upstream.py:98
  Coro: task_stack ../../../../home/mpage/local/scratch/print_await_stack_upstream.py:69
Task: F4_1
  Coro: f4 ../../../../home/mpage/local/scratch/print_await_stack_upstream.py:95
  Coro: f5 ../../../../home/mpage/local/scratch/print_await_stack_upstream.py:98
  Coro: task_stack ../../../../home/mpage/local/scratch/print_await_stack_upstream.py:69
>

Cinder version:

> ./python ~/local/scratch/print_await_stack_cinder.py
Stack retrieved using awaiters:
  f1
  f2
  f3
  f4
  f5
Stack retrieved using awaiters:
  f1
  f2
  f3
  f4
  f5
>

srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this issue Feb 7, 2025
cmaloney pushed a commit to cmaloney/cpython that referenced this issue Feb 8, 2025
pablogsal added a commit to pablogsal/cpython that referenced this issue Apr 23, 2025
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
pablogsal added a commit to pablogsal/cpython that referenced this issue Apr 23, 2025
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
pablogsal added a commit to pablogsal/cpython that referenced this issue Apr 23, 2025
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
pablogsal added a commit to pablogsal/cpython that referenced this issue Apr 23, 2025
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
pablogsal added a commit to pablogsal/cpython that referenced this issue Apr 23, 2025
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
pablogsal added a commit to pablogsal/cpython that referenced this issue Apr 23, 2025
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
pablogsal added a commit to pablogsal/cpython that referenced this issue Apr 23, 2025
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
pablogsal added a commit to pablogsal/cpython that referenced this issue Apr 23, 2025
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
pablogsal added a commit to pablogsal/cpython that referenced this issue Apr 25, 2025
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
vstinner added a commit to vstinner/cpython that referenced this issue Apr 25, 2025
pablogsal added a commit to pablogsal/cpython that referenced this issue Apr 25, 2025
…ucture on Windows

To avoid having the debug sections being optimised away by the compiler
we use  __attribute__((used)) on gcc and clang but in Windows this is
not supported by the Microsoft compiler and there is no equivalent flag.
Unfortunately Windows offers almost no alternative other than exporting
the symbol in the dynamic table or using it somehow.
pablogsal added a commit to pablogsal/cpython that referenced this issue Apr 25, 2025
…ucture on Windows

To avoid having the debug sections being optimised away by the compiler
we use  __attribute__((used)) on gcc and clang but in Windows this is
not supported by the Microsoft compiler and there is no equivalent flag.
Unfortunately Windows offers almost no alternative other than exporting
the symbol in the dynamic table or using it somehow.
pablogsal added a commit that referenced this issue Apr 25, 2025
… on Windows (#132963)

To avoid having the debug sections being optimised away by the compiler
we use  __attribute__((used)) on gcc and clang but in Windows this is
not supported by the Microsoft compiler and there is no equivalent flag.
Unfortunately Windows offers almost no alternative other than exporting
the symbol in the dynamic table or using it somehow.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-asyncio type-feature A feature request or enhancement
Projects
Status: Todo
Development

No branches or pull requests