How To Build Resilient Microservices Using Circuit Breakers and Retries: A Developer’s Guide To Surviving Flaky APIs
To prevent microservices from crashing due to flaky APIs, use circuit breakers, retries, and fallbacks. Build systems that fail smart, not hard.
What’s up, fellow geeks? Picture yourself at a busy pizza place. The pizza oven breaks down, new orders keep pouring in, and the entire kitchen grinds to a halt. Swap that oven for a flaky, unreliable third-party API and you have a microservices disaster. With retries and circuit breakers at your disposal, you can keep your system sizzling instead of crashing down.
In this guide, I’ll walk through these patterns as if we were pair programming at a whiteboard. We’ll look at some code (Hystrix and Resilience4j), swap war stories, laugh at my failures (hint: runaway retries), and have a good time. Let’s get down to it, shall we?
Why Resilience Cannot Be Optional in Microservices
Think of microservices as a relay race: the second a single runner drops the baton, the entire team loses the race. Without resilience, one failed API call can easily lead to:
- Cascading failures: One failed service drags down everything that depends on it until the whole system crashes (Service A fails -> Service B times out -> Service C crashes).
- Wasted resources: 1,000 retries per second against a dead endpoint will cost you a fortune.
- Frustrated users: They’ll be asking themselves, “Why is my cart empty?!”
Enter circuit breakers, retries, and fallbacks:
- “Stop calling that broken service!” - Circuit breakers.
- “Maybe it’ll work on the third try?” - Retries.
- “Fine, show cached data instead.” - Fallbacks.
Here’s the plain-English version of each; no business jargon, I promise.
The Circuit Breaker Pattern: Your System Emergency Brake
How It Works (The Pizza Oven Analogy)
- Closed state: Everything is working and orders are flowing.
- Open state: Stop taking orders and inspect the broken oven.
- Half-open state: Test the oven to see whether it’s working again.
In code, a circuit breaker tracks failures against a preset error threshold; once the threshold is exceeded, calls to the failing dependency are cut off until it recovers.
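To make those three states concrete, here is a minimal, hand-rolled sketch of that state machine. It is purely illustrative (the class and field names are mine); in a real project you would reach for a library like Hystrix or Resilience4j, shown next.

import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

class ToyCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;
    private final int failureThreshold;   // failures allowed before tripping
    private final Duration openDuration;  // how long to stay open before probing

    ToyCircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (Instant.now().isAfter(openedAt.plus(openDuration))) {
                state = State.HALF_OPEN;   // waited long enough: let one probe through
            } else {
                return fallback.get();     // still open: fail fast, don't touch the oven
            }
        }
        try {
            T result = action.get();
            state = State.CLOSED;          // success closes (or keeps closed) the circuit
            consecutiveFailures = 0;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;        // trip the breaker and remember when
                openedAt = Instant.now();
            }
            return fallback.get();
        }
    }
}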
Code Example: Hystrix (the OG Circuit Breaker)
Before it went into maintenance mode, Netflix's Hystrix was the go-to library. Here is how you would wire it into a Spring Boot app:
import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import org.springframework.stereotype.Service;

@Service
public class PaymentService {

    @HystrixCommand(fallbackMethod = "processPaymentFallback")
    public String processPayment(String orderId) {
        // Simulate calling a flaky payment gateway
        if (Math.random() > 0.5) {
            throw new RuntimeException("Payment failed!");
        }
        return "Payment processed for " + orderId;
    }

    public String processPaymentFallback(String orderId) {
        return "Fallback: Payment queued for " + orderId;
    }
}
application.yml:

hystrix:
  command:
    default:
      circuitBreaker:
        requestVolumeThreshold: 5        # Min requests before tripping
        errorThresholdPercentage: 50     # % of failures needed to trip
        sleepWindowInMilliseconds: 10000 # Time before half-open
What Happens:
- Hystrix trips the circuit once at least five requests have come in and 50%+ of them have failed.
- For the next ten seconds, every call, without exception, skips the payment gateway and goes straight to processPaymentFallback; after that window the circuit goes half-open and lets a test call through.
Code Example: Resilience4j, the Modern Alternative
Lightweight, modular, and built for Java 8+, Resilience4j is the successor of choice. Here is how the same logic looks:
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                        // % of failures needed to trip
    .minimumNumberOfCalls(5)                         // min calls before tripping
    .waitDurationInOpenState(Duration.ofSeconds(10)) // time before half-open
    .build();

CircuitBreaker circuitBreaker = CircuitBreaker.of("paymentService", config);

public String processPayment(String orderId) {
    return circuitBreaker.executeSupplier(() -> {
        // Simulate calling a flaky payment gateway
        if (Math.random() > 0.5) {
            throw new RuntimeException("Payment failed!");
        }
        return "Payment processed for " + orderId;
    });
}
Add a Fallback:
import io.github.resilience4j.decorators.Decorators;
import java.util.List;

public String processPayment(String orderId) {
    return Decorators.ofSupplier(() -> processPaymentCore(orderId))
        .withCircuitBreaker(circuitBreaker)
        .withFallback(List.of(RuntimeException.class),
            e -> "Fallback: Payment queued for " + orderId)
        .get();
}
Key differences from Hystrix:
- No mandatory thread pools (it runs on the caller’s thread by default).
- More modular: you pull in only what you need (circuit breakers, retries, bulkheads, and so on), as sketched below.
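To illustrate that modularity, here is a small sketch (variable names are mine) that uses only the circuit breaker module through its registry, reusing the config built above. Retries, bulkheads, and rate limiters each live in their own artifact and are only added if you actually need them.

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

// Only the circuit breaker module is required for this much.
CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);
CircuitBreaker paymentBreaker = registry.circuitBreaker("paymentService");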
The Retry Pattern: “Maybe It’ll Work on the Third Try?”
When to Retry (And When to Call It Quits)
Retries are best suited to transient errors, such as timeouts or dropped network connections, where a second attempt has a real chance of succeeding.
Core Principles:
- Never retry non-idempotent operations (such as payments), unless you enjoy charging customers twice.
- Use exponential backoff: make each retry wait longer than the last (a sketch follows this list).
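For the exponential backoff piece, Resilience4j’s IntervalFunction can stand in for the fixed 500 ms wait used in the retry snippet below. A minimal sketch (the names are mine); each failed attempt waits twice as long as the previous one, starting at 500 ms:

import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

RetryConfig backoffConfig = RetryConfig.custom()
    .maxAttempts(3)
    .intervalFunction(IntervalFunction.ofExponentialBackoff(500, 2.0)) // 500 ms, doubled each time
    .retryExceptions(RuntimeException.class)
    .build();

Retry backoffRetry = Retry.of("paymentRetryWithBackoff", backoffConfig);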
Code Snippet: Resilience4j Retries + Circuit Breaker
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import java.time.Duration;

RetryConfig retryConfig = RetryConfig.custom()
    .maxAttempts(3)
    .waitDuration(Duration.ofMillis(500))
    .retryExceptions(RuntimeException.class)
    .build();

Retry retry = Retry.of("paymentRetry", retryConfig);

public String processPayment(String orderId) {
    return Decorators.ofSupplier(() -> processPaymentCore(orderId))
        .withRetry(retry)
        .withCircuitBreaker(circuitBreaker)
        .withFallback(...) // same fallback as in the previous example
        .get();
}
Flow:
- The retry makes up to three attempts, spaced 500 ms apart; if all of them fail, the circuit breaker records the failure.
- Once enough failures accumulate (five calls with 50%+ failing), the circuit opens and the fallback kicks in immediately.
Dealing With Your Own Blunders: Best Practices
1. Set Timeouts: Don’t let calls hang indefinitely.
import io.github.resilience4j.timelimiter.TimeLimiterConfig;
import java.time.Duration;

TimeLimiterConfig timeoutConfig = TimeLimiterConfig.custom()
    .timeoutDuration(Duration.ofSeconds(2))
    .build();
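Here is a rough sketch of how that config could wrap an async call (the scheduler and method names are my own); the call is timed out if it takes longer than two seconds:

import io.github.resilience4j.timelimiter.TimeLimiter;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

TimeLimiter timeLimiter = TimeLimiter.of("paymentTimeout", timeoutConfig);
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

CompletionStage<String> result = timeLimiter.executeCompletionStage(
    scheduler,
    () -> CompletableFuture.supplyAsync(() -> processPaymentCore(orderId))
);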
2. Monitor Everything: Make failures visible to the whole team. Combine Prometheus and Grafana to track (a wiring sketch follows this list):
- Circuit breaker state changes (closed, open, half-open).
- Retry attempts.
- Error rates.
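One way to wire that up is the resilience4j-micrometer module, which publishes circuit breaker metrics to a registry that Prometheus scrapes and Grafana visualizes. A rough sketch, assuming micrometer-registry-prometheus is on the classpath (variable names are mine):

import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.github.resilience4j.micrometer.tagged.TaggedCircuitBreakerMetrics;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

CircuitBreakerRegistry breakerRegistry = CircuitBreakerRegistry.ofDefaults();
PrometheusMeterRegistry meterRegistry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

// Publishes gauges such as breaker state and failure rate for every registered breaker.
TaggedCircuitBreakerMetrics
    .ofCircuitBreakerRegistry(breakerRegistry)
    .bindTo(meterRegistry);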
3. Test Failure Scenarios: Chaos Monkeys aren’t only for Netflix (a minimal failure drill is sketched below).
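A lightweight place to start is a failure drill in a unit test: force the dependency to blow up, then assert that the fallback answers and the circuit actually opens. A hypothetical JUnit 5 sketch reusing the Resilience4j setup from earlier:

import static org.junit.jupiter.api.Assertions.assertEquals;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.decorators.Decorators;
import java.time.Duration;
import java.util.List;
import java.util.function.Supplier;
import org.junit.jupiter.api.Test;

class PaymentResilienceTest {

    @Test
    void fallbackAnswersWhenTheGatewayIsDown() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)
            .minimumNumberOfCalls(5)
            .waitDurationInOpenState(Duration.ofSeconds(10))
            .build();
        CircuitBreaker breaker = CircuitBreaker.of("paymentService", config);

        Supplier<String> alwaysFailing = () -> {
            throw new RuntimeException("Gateway down"); // simulated outage
        };

        Supplier<String> decorated = Decorators.ofSupplier(alwaysFailing)
            .withCircuitBreaker(breaker)
            .withFallback(List.of(RuntimeException.class),
                e -> "Fallback: Payment queued")
            .decorate();

        // Every call should degrade gracefully instead of blowing up.
        for (int i = 0; i < 10; i++) {
            assertEquals("Fallback: Payment queued", decorated.get());
        }

        // After enough recorded failures, the breaker should be open.
        assertEquals(CircuitBreaker.State.OPEN, breaker.getState());
    }
}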
The Partial Collapse of E-Commerce: A Case Study
During Black Friday, a client’s product recommendation service kept failing and dragged the whole product page down with it. Page load time ballooned to a massive 20 seconds.
Solution:
- First, wrapped calls to the recommendation service in a circuit breaker.
- Then added three retries with successive delays of 1, 3, and 5 seconds.
- Finally, fell back to a cached set of the most popular products.
The Outcome:
- Page load time plummeted to a mere 2 seconds, and 95 percent of users were quietly served the cached fallback.
- Most users never had a clue anything was broken.
Your Resilience Cheat Sheet
Q: Which one should I choose, Hystrix or Resilience4J?
A: Resilience4j for new projects: it is actively maintained and requires Java 8 and up.
Q: How many retries are too many?
A: Start with two or three. If it still has not worked after five attempts, the problem is bigger than a retry can fix.
Q: Can I use circuit breakers with asynchronous calls?
A: You can! Resilience4j integrates with CompletableFuture and Reactor, as sketched below.
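For example, here is a hedged sketch of the CompletableFuture route, reusing the circuit breaker and method names from earlier in this article:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

public CompletionStage<String> processPaymentAsync(String orderId) {
    // The breaker records success or failure when the future completes.
    return circuitBreaker.executeCompletionStage(
        () -> CompletableFuture.supplyAsync(() -> processPaymentCore(orderId))
    );
}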
Wrap up: Systems That Bend, Not Break
Resilience is not about never failing; it is about failing gracefully so users barely notice. Here’s an action plan:
- Add circuit breakers around calls that are likely to fail, like third-party APIs and databases.
- Use retries smartly, and never retry non-idempotent operations.
- Monitor everything: an open circuit is a warning that something bigger needs attention.
Don’t forget, even Netflix’s APIs fail. The magic lies in how they tackle those failures head-on.
Now go ahead and strengthen your microservices, and please share your own experiences with endlessly looping retries in the comments so we can learn together.