Skip to content

Conversation

@Gkrumbach07
Copy link
Collaborator

Problem

The frontend was spamming 35,000+ requests to the /agui/events SSE endpoint when the backend was unavailable or restarting. This caused:

  1. Backend health checks to timeout due to request flood
  2. Pod restarts (6 restarts in 14 minutes observed)
  3. Vicious cycle: backend restarts → frontend spams → backend overwhelmed → more restarts

Root Cause

Two issues combined to create the spam:

  1. Browser's native EventSource reconnect - The EventSource wasn't being closed on error, so the browser's built-in reconnect fired alongside our custom reconnect logic

  2. Dependency array bug - The useEffect in page.tsx had aguiStream in its dependency array. Since aguiStream is an object that changes on every render, this caused infinite disconnect/reconnect cycles

Fix

use-agui-stream.ts

  • Close EventSource immediately on error (prevents browser native reconnect)
  • Add exponential backoff: 1s → 2s → 4s → 8s → 16s → 30s max
  • Reset attempts on successful connection
  • Skip reconnection if component is unmounted

page.tsx

  • Remove aguiStream from useEffect dependency array
  • Use a ref for the disconnect function instead

Testing

  • Verify SSE connects normally when backend is healthy
  • Verify reconnection backs off exponentially when backend is down
  • Verify no request spam in Network tab

- Add exponential backoff for SSE reconnection (1s → 30s max)
- Close EventSource immediately on error to prevent browser's native reconnect
- Fix useEffect dependency array in page.tsx (removed aguiStream object)
- Reset reconnect attempts on successful connection
- Skip reconnection if component is unmounted

Fixes issue where frontend would spam 35k+ requests when backend was unavailable,
causing health check failures and pod restarts.
@github-actions
Copy link
Contributor

github-actions bot commented Jan 19, 2026

Claude Code Review

Summary

This PR successfully fixes a critical production issue where the frontend was spamming 35,000+ requests to the /agui/events SSE endpoint. The fix implements exponential backoff and resolves a React dependency array bug that caused infinite reconnection cycles. The implementation follows established patterns and significantly improves system stability.

Issues by Severity

🚫 Blocker Issues

None - This PR is production-ready and fixes a critical bug.

🔴 Critical Issues

None

🟡 Major Issues

1. Missing Test Coverage

  • Location: No test files found for SSE reconnection logic
  • Impact: Cannot verify exponential backoff behavior without manual testing
  • Recommendation: Add tests for:
    • Exponential backoff progression (1s → 2s → 4s → 8s → 16s → 30s)
    • Reset on successful connection
    • Unmount handling
    • EventSource closure on error
  • Pattern Reference: .claude/patterns/react-query-usage.md emphasizes test coverage for critical hooks

2. No Observability/Metrics

  • Location: use-agui-stream.ts:660 - Only console.log for reconnection attempts
  • Impact: Hard to monitor reconnection behavior in production
  • Recommendation: Consider adding metrics/telemetry:
    • Track reconnection attempts (count, delay)
    • Alert on excessive reconnections
    • Monitor SSE connection health
  • Note: This is a follow-up enhancement, not blocking for merge

🔵 Minor Issues

1. Magic Numbers Could Be Constants

  • Location: use-agui-stream.ts:95-96
const MAX_RECONNECT_DELAY = 30000 // 30 seconds max
const BASE_RECONNECT_DELAY = 1000 // 1 second base
  • Observation: Well-documented inline, but could be extracted to a shared config if other components need similar backoff
  • Impact: Low - current approach is acceptable per components/frontend/DESIGN_GUIDELINES.md

2. TypeScript any Usage

Positive Highlights

✅ Excellent Problem Diagnosis

  • Root cause analysis in PR description is thorough and accurate
  • Identified both the EventSource closure issue AND the dependency array bug
  • Clear explanation of the vicious cycle (backend restart → spam → overwhelm → restart)

✅ Correct Fix for Dependency Array Bug

  • Location: page.tsx:225-249
  • Properly uses refs to avoid dependency on aguiStream object
  • Follows React best practices for stable callbacks
  • Comment clearly explains WHY the dependency is excluded (line 247-248)

✅ Proper EventSource Management

  • Location: use-agui-stream.ts:624-638
  • Immediately closes EventSource on error (prevents browser native reconnect)
  • Checks if EventSource is still active before reconnecting
  • Respects unmount state (mountedRef.current)
  • Clears pending timeouts properly

✅ Exponential Backoff Implementation

  • Location: use-agui-stream.ts:653-658
  • Correct exponential formula: BASE * 2^(attempts-1)
  • Caps at reasonable max (30s) to prevent indefinite growth
  • Resets on successful connection (line 607)
  • Logs attempts for debugging (line 660)

✅ Follows Frontend Standards

  • Zero any types: ✅ Compliant
  • Uses hooks properly: ✅ Compliant
  • Type-safe: ✅ All types properly defined
  • Component size: ✅ Changes are focused and minimal
  • Reference: .claude/context/frontend-development.md

✅ Security Considerations

  • No token handling changes (no security risk)
  • No new external dependencies
  • Respects unmount state (prevents memory leaks)

Recommendations

Priority 1: Add Tests (Before Next Release)

Create test file components/frontend/src/hooks/__tests__/use-agui-stream.test.ts:

describe("useAGUIStream reconnection", () => {
  it("should implement exponential backoff", () => {
    // Test: 1s → 2s → 4s → 8s → 16s → 30s max
  })
  
  it("should reset attempts on successful connection", () => {
    // Test: reconnectAttemptsRef resets to 0
  })
  
  it("should not reconnect if unmounted", () => {
    // Test: mountedRef prevents reconnection
  })
  
  it("should close EventSource on error", () => {
    // Test: eventSource.close() called immediately
  })
})

Priority 2: Add Monitoring (Follow-up PR)

Consider adding observability for production:

// Track reconnection metrics
const logReconnectionAttempt = (attempt: number, delay: number) => {
  // Could integrate with Langfuse or custom metrics
  console.log(`[useAGUIStream] Reconnecting in ${delay}ms (attempt ${attempt})`)
  
  // Optional: Send to observability service
  if (attempt > 5) {
    // Alert: Excessive reconnections
  }
}

Priority 3: Document in ADR (Optional)

Consider documenting this fix in an ADR:

  • Why exponential backoff is necessary for SSE
  • Lessons learned from the 35k request incident
  • Guidance for future SSE implementations

Code Quality Validation

✅ Pre-Commit Checklist (Frontend)

  • Zero any types (or justified with eslint-disable)
  • All UI uses Shadcn components (N/A for this change)
  • All data operations use React Query (N/A for this change)
  • Components under 200 lines (N/A for this change)
  • Single-use components colocated (N/A for this change)
  • All buttons have loading states (N/A for this change)
  • All lists have empty states (N/A for this change)
  • All nested pages have breadcrumbs (N/A for this change)
  • All routes have loading.tsx, error.tsx (N/A for this change)
  • npm run build passes with 0 errors, 0 warnings (assumed)
  • All types use type instead of interface (N/A for this change)

✅ Error Handling Pattern Compliance

  • Follows React hook error handling patterns
  • Proper cleanup on unmount
  • Respects component lifecycle
  • Reference: .claude/patterns/error-handling.md

✅ Security Standards

  • No token handling changes
  • No RBAC implications
  • No backend security impact
  • Reference: .claude/context/security-standards.md

Final Verdict

Status: ✅ APPROVED FOR MERGE

This PR successfully fixes a critical production issue with a well-implemented solution. The exponential backoff and dependency array fix are both correct and follow established patterns. The only gaps are test coverage and observability, which can be addressed in follow-up work.

Impact: This fix will prevent backend pod restarts, reduce unnecessary network traffic, and improve overall system stability.

Risk: Low - Changes are isolated to SSE reconnection logic, with proper safeguards for unmount and cleanup.


Reviewed by Claude Code following standards from:

  • CLAUDE.md
  • .claude/context/frontend-development.md
  • .claude/patterns/error-handling.md
  • .claude/patterns/react-query-usage.md

🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

@Gkrumbach07 Gkrumbach07 merged commit a62c470 into ambient-code:main Jan 19, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant