Conversation

Contributor
@yashwantbezawada commented Dec 25, 2025

Fixes #43010

What does this PR do?

Adds the @torch.no_grad() decorator to the cache layer update() methods that use in-place operations, preventing torch.func.grad from failing with "in-place operation would mutate a captured Tensor" errors.
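
As a rough illustration of the pattern (a minimal sketch, not the actual transformers source; the class below is a simplified stand-in for StaticLayer with made-up buffer shapes):

```python
import torch


class StaticLayerSketch:
    """Simplified stand-in for a static cache layer with preallocated KV buffers."""

    def __init__(self, max_cache_len: int, head_dim: int):
        self.keys = torch.zeros(max_cache_len, head_dim)
        self.values = torch.zeros(max_cache_len, head_dim)

    @torch.no_grad()  # keep the in-place writes below out of autograd
    def update(self, key_states, value_states, cache_position):
        # index_copy_ mutates the preallocated buffers in place; without the
        # decorator this is the kind of mutation torch.func.grad rejects.
        self.keys.index_copy_(0, cache_position, key_states)
        self.values.index_copy_(0, cache_position, value_states)
        return self.keys, self.values
```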

Why only StaticLayer and StaticSlidingWindowLayer?

After investigation, I found that only these two classes need the decorator:

| Cache Layer | Operation | In-place? | Needs @torch.no_grad()? |
| --- | --- | --- | --- |
| DynamicLayer | torch.cat() | No | No (would break gradient flow) |
| DynamicSlidingWindowLayer | torch.cat() | No | No (would break gradient flow) |
| StaticLayer | index_copy_() | Yes | Yes |
| StaticSlidingWindowLayer | copy_(), index_copy_() | Yes | Yes |
| QuantizedLayer | torch.cat() | No | No (would break gradient flow) |

Adding the decorator to DynamicLayer (and subclasses) would break gradient flow because torch.cat() creates new tensors that participate in the computation graph. Models like T5 use DynamicCache and need gradients to flow through cached key/values.
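
To make the distinction concrete, here is a small plain-PyTorch sketch (not the transformers classes) contrasting the two update styles:

```python
import torch

k_old = torch.randn(2, 4, requires_grad=True)
k_new = torch.randn(1, 4, requires_grad=True)

# Dynamic-style update: torch.cat builds a new tensor that stays in the
# autograd graph, so gradients flow back to both old and new key states.
dyn_cache = torch.cat([k_old, k_new], dim=0)
dyn_cache.sum().backward()
print(k_old.grad is not None, k_new.grad is not None)  # True True

# Static-style update: a preallocated buffer is written in place. Doing the
# write under no_grad keeps the mutation out of autograd, which is what the
# decorator on StaticLayer.update() provides.
static_cache = torch.zeros(3, 4)
with torch.no_grad():
    static_cache.index_copy_(0, torch.tensor([2]), k_new)
```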

Changes

Added the @torch.no_grad() decorator to:

  • StaticLayer.update()
  • StaticSlidingWindowLayer.update()

Testing

  • All 9 cache unit tests pass
  • Verified torch.func.grad works with StaticCache
  • Verified gradient flow preserved for DynamicCache

Decorates all cache layer update() methods with @torch.no_grad() to prevent
PyTorch autograd from complaining about tensor version changes when computing
gradients with respect to model inputs.

This follows the same pattern used by optimizer.step() methods and is safe
because cache updates are only used during inference/generation, never during
training.

Methods decorated:
- DynamicLayer.update()
- DynamicSlidingWindowLayer.update()
- StaticLayer.update()
- StaticSlidingWindowLayer.update()
- QuantizedLayer.update()

Fixes huggingface#43010

Remove @torch.no_grad() from DynamicLayer, DynamicSlidingWindowLayer,
and QuantizedLayer since they use torch.cat() (not in-place) and need
gradient flow preserved.

Keep @torch.no_grad() only on StaticLayer and StaticSlidingWindowLayer
which use index_copy_() and copy_() (in-place operations) that cause
torch.func.grad to fail.
Linked issue: Cache's (and Layer's) update(...) method to be decorated with @torch.no_grad (#43010)