
Fault Tolerance: How to Build Systems That Survive Failures

Circuit breakers, retries, timeouts, fallbacks: the essential patterns for building resilient systems that keep working when everything goes wrong.

Thiago Saraiva
7 min read


Your system depends on an external API. It goes down. Your system freezes. Users lose access. Support gets flooded with tickets. All because a dependency that should have been secondary brought everything down.

In distributed systems, failures aren't the exception, they're the rule. The question isn't IF something will fail, but WHEN. And what your system does in that moment is what separates amateur systems from production-grade ones.

Mental model: Fault tolerance is how airplanes fly with multiple engines. Lose one, the others keep you airborne. A Circuit Breaker is the breaker panel in your house: when the toaster shorts out, the breaker trips and saves the wiring instead of burning the house down.

The Numbers That Matter: SLA and Uptime

Before talking patterns, talk numbers:

  • 99% uptime: 3.65 days of downtime per year
  • 99.9% (three nines): 8.76 hours per year
  • 99.99% (four nines): 52.6 minutes per year
  • 99.999% (five nines): 5.26 minutes per year

Each additional "nine" costs exponentially more. Most applications need 99.9%. If someone asks for 99.999%, ask if the budget matches.
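
Those figures are just arithmetic on the fraction of a year you're allowed to be down; a quick sketch to double-check them:

```typescript
// Allowed downtime per year for a given uptime percentage.
const HOURS_PER_YEAR = 365 * 24;

function downtimePerYear(uptimePercent: number): string {
  const downHours = HOURS_PER_YEAR * (1 - uptimePercent / 100);
  if (downHours >= 24) return `${(downHours / 24).toFixed(2)} days`;
  if (downHours >= 1) return `${downHours.toFixed(1)} hours`;
  return `${(downHours * 60).toFixed(1)} minutes`;
}

console.log(downtimePerYear(99));     // ~3.65 days
console.log(downtimePerYear(99.9));   // ~8.8 hours
console.log(downtimePerYear(99.99));  // ~52.6 minutes
console.log(downtimePerYear(99.999)); // ~5.3 minutes
```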

War Story: The $440M Bug

On August 1, 2012, Knight Capital deployed new trading code to 7 of 8 servers. The 8th kept old code that reused a dormant flag called "Power Peg." In about 45 minutes of market open, the system executed millions of unintended orders across NYSE. Pre-tax loss: roughly $440 million. The firm was acquired by Getco a few months later. No circuit breaker, no kill switch, no graceful degradation. A single deploy bug wiped out a publicly traded company before lunch. Fault tolerance isn't paranoia. It's payroll.

Circuit Breaker: Stop Hitting a Dead Server

The Circuit Breaker is the most important fault tolerance pattern. It monitors calls to a service and, when it detects too many failures, "opens the circuit" and stops making requests.

CLOSED (normal) --> too many failures --> OPEN (blocks)
   OPEN --> timeout --> HALF-OPEN (tests)
      HALF-OPEN --> success --> CLOSED
      HALF-OPEN --> failure --> OPEN

Implementation with Opossum in Node.js:
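
A minimal sketch (the profile service URL and fetchUserProfile function are placeholders, and the thresholds are illustrative rather than recommendations):

```typescript
import CircuitBreaker from "opossum";

// The protected call: any function that returns a promise.
async function fetchUserProfile(userId: string) {
  const res = await fetch(`https://profile-service.internal/users/${userId}`);
  if (!res.ok) throw new Error(`Profile service returned ${res.status}`);
  return res.json();
}

const breaker = new CircuitBreaker(fetchUserProfile, {
  timeout: 3000,                // a call slower than 3s counts as a failure
  errorThresholdPercentage: 50, // open the circuit at 50% failures
  resetTimeout: 30000,          // after 30s, go HALF-OPEN and let one request test the waters
});

// Plan B while the circuit is open (or when the call itself fails).
breaker.fallback(() => ({ id: null, name: "Guest", cached: true }));

breaker.on("open", () => console.warn("Circuit OPEN: failing fast"));
breaker.on("halfOpen", () => console.info("Circuit HALF-OPEN: testing the service"));
breaker.on("close", () => console.info("Circuit CLOSED: back to normal"));

const profile = await breaker.fire("user-123");
```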

The power of the Circuit Breaker: instead of waiting 3 seconds on each request (and blocking 100 threads), it fails immediately. This protects your server from cascading failures. Fun fact: Netflix open-sourced Hystrix in late 2012, after their own engineers spent years watching one slow dependency take down entire API clusters. The pattern went mainstream from there, and Netflix officially put Hystrix into maintenance mode in 2018, recommending resilience4j as the modern successor.

Retry with Exponential Backoff

Retrying failed operations is essential, but it has to be smart:
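
A sketch of what smart retrying can look like, assuming the operation is safe to repeat (attempt counts and delays are illustrative):

```typescript
// Retry an async operation with exponential backoff and full jitter.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  { maxAttempts = 5, baseDelayMs = 200, maxDelayMs = 10_000 } = {}
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts - 1) break;
      // Exponential backoff: 200ms, 400ms, 800ms... capped at maxDelayMs.
      const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
      // Full jitter: pick a random point between 0 and the backoff ceiling.
      const delay = Math.random() * ceiling;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}

// Only use this for idempotent operations (GETs, PUTs with the same body).
const orders = await retryWithBackoff(() => fetch("/api/orders").then((r) => r.json()));
```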

Two critical details:

  1. Jitter: adds randomness to avoid the thundering herd (everyone retrying in lockstep)
  2. Only retry idempotent operations: retrying a payment can charge twice

The Retry Storm Anti-Pattern

Picture this: a service gets slow, not dead, just slow. Every client retries. Those retries pile on top of existing traffic. The service now has 3x the load it was already struggling with. It dies. When it comes back up, all queued retries hit it at once and kill it again. Congratulations, you built a DDoS against yourself. This is why retries need circuit breakers upstream, jitter, and ideally a retry budget (cap retries at, say, 10% of normal traffic).
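
A retry budget doesn't need a library; a rolling counter is enough. A rough sketch (the 10% ratio and window are illustrative):

```typescript
// A crude retry budget: allow retries only while they stay under
// a fixed fraction of recent request volume.
class RetryBudget {
  private requests = 0;
  private retries = 0;

  constructor(private readonly ratio = 0.1, windowMs = 10_000) {
    // Reset the counters every window so the budget tracks recent traffic.
    setInterval(() => { this.requests = 0; this.retries = 0; }, windowMs);
  }

  recordRequest() { this.requests++; }

  canRetry(): boolean {
    if (this.retries < this.requests * this.ratio) {
      this.retries++;
      return true;
    }
    return false; // budget exhausted: fail fast instead of piling on
  }
}

const budget = new RetryBudget();
// Before each call: budget.recordRequest()
// Before each retry: if (!budget.canRetry()) rethrow instead of retrying
```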

Timeout Pattern: Don't Wait Forever

Each layer should have a shorter timeout than the one above it:

Client (10s) --> Gateway (8s) --> Backend (6s) --> Database (4s)

If the database hangs, the backend fails in 4s, the gateway in 6s, and the client in 8s. No cascade.
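
One way to enforce a per-layer deadline in Node is AbortSignal.timeout; the sketch below is the backend layer from the diagram, with a placeholder database-proxy URL:

```typescript
// Give each outbound call a deadline shorter than the caller's own budget.
async function fetchWithTimeout(url: string, timeoutMs: number) {
  const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
  if (!res.ok) throw new Error(`${url} responded with ${res.status}`);
  return res.json();
}

// Backend handler: the gateway gives us ~6s, so the database proxy gets only 4s.
async function getOrder(orderId: string) {
  return fetchWithTimeout(`https://db-proxy.internal/orders/${orderId}`, 4_000);
}
```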

Bulkhead: Isolate the Compartments

Like a ship with watertight compartments: if one floods, the others keep floating.

Give each consumer its own pool of resources: connections, threads, semaphore slots. If a heavy report exhausts its own connection pool, checkout keeps working. The pattern's name comes from shipbuilding, but the lesson is older than software: never let one tenant burn down the building.
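
A sketch with node-postgres, giving checkout and reporting separate pools (the sizes and the charge_order procedure are illustrative):

```typescript
import { Pool } from "pg";

// Two bulkheads over the same database: each workload gets its own,
// independently sized connection pool.
const checkoutPool = new Pool({ max: 20 });  // critical path: generous
const reportingPool = new Pool({ max: 3 });  // heavy reports: strictly capped

export function chargeOrder(orderId: string) {
  // Even if every reporting connection is busy, these 20 stay available.
  return checkoutPool.query("SELECT charge_order($1)", [orderId]);
}

export function monthlyRevenueReport() {
  // A runaway report can only ever hold 3 connections.
  return reportingPool.query("SELECT * FROM monthly_revenue_view");
}
```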

Fallback: Always Have a Plan B

The idea: try the live source first, then a cache, then a sensible default. Each fallback level returns progressively less complete data, but the system never stops.
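
A sketch of that layering, assuming a hypothetical recommendations service, a node-redis cache, and a static default list:

```typescript
import { createClient } from "redis";

const redis = createClient();
await redis.connect();

// Plan A: live service. Plan B: cached copy. Plan C: static defaults.
async function getRecommendations(userId: string): Promise<string[]> {
  try {
    const res = await fetch(`https://recs.internal/users/${userId}`, {
      signal: AbortSignal.timeout(2_000),
    });
    if (!res.ok) throw new Error(`recs service: ${res.status}`);
    return await res.json();                    // fresh, personalized
  } catch {
    const cached = await redis.get(`recs:${userId}`);
    if (cached) return JSON.parse(cached);      // stale, still personalized
    return ["best-sellers", "new-arrivals"];    // generic, but the page renders
  }
}
```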

Graceful Shutdown: Die with Dignity

On SIGTERM, flip the readiness probe to failing, stop accepting new connections, and let in-flight requests finish. The load balancer sees the readiness probe fail and stops sending new traffic. Existing connections drain. Zero downtime.
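
A sketch with Express (the /readyz route and the 10-second drain deadline are conventions, not requirements):

```typescript
import express from "express";

const app = express();
let shuttingDown = false;

// Readiness probe: the load balancer polls this to decide whether to route traffic here.
app.get("/readyz", (_req, res) => {
  res.status(shuttingDown ? 503 : 200).end();
});

const server = app.listen(3000);

process.on("SIGTERM", () => {
  shuttingDown = true;                  // 1. start failing the readiness probe
  server.close(() => process.exit(0));  // 2. stop accepting new connections, drain existing ones
  // 3. safety net: if connections won't drain, force-exit after a deadline
  setTimeout(() => process.exit(1), 10_000).unref();
});
```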

On the Frontend: Error Boundary + Optimistic UI

Fault tolerance isn't just a backend concern. In React:
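
A minimal error boundary, assuming ProductList and ShoppingCart are the sections that make up the page:

```tsx
import React from "react";

class ErrorBoundary extends React.Component<
  { fallback: React.ReactNode; children: React.ReactNode },
  { hasError: boolean }
> {
  state = { hasError: false };

  static getDerivedStateFromError() {
    return { hasError: true }; // switch to the fallback UI on the next render
  }

  componentDidCatch(error: Error) {
    console.error("UI section crashed:", error); // report to your error tracker here
  }

  render() {
    return this.state.hasError ? this.props.fallback : this.props.children;
  }
}

// Each section gets its own boundary, so one crash doesn't blank the whole page.
// ProductList and ShoppingCart stand in for whatever components make up the page.
export function StorePage() {
  return (
    <>
      <ErrorBoundary fallback={<p>Couldn't load products. Try again shortly.</p>}>
        <ProductList />
      </ErrorBoundary>
      <ErrorBoundary fallback={<p>Your cart is unavailable right now.</p>}>
        <ShoppingCart />
      </ErrorBoundary>
    </>
  );
}
```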

If ProductList crashes, ShoppingCart keeps working. Pair this with optimistic UI updates and a toast on rollback, and a flaky network feels like a minor hiccup instead of a broken product.

Chaos Engineering: Break It Before It Breaks You

Netflix took this to the logical extreme with Chaos Monkey: a tool that randomly kills production instances during business hours. The idea is simple but ruthless. If your system can't survive a random kill, you haven't built fault tolerance, you've built hope. Tools like Gremlin and AWS Fault Injection Simulator let you inject latency, drop packets, and simulate region outages on demand. You learn more from one scheduled chaos drill than from six months of reading runbooks.

FAQ

Does Circuit Breaker make sense for internal microservices? Yes, arguably more than for external APIs. Internal services fail, get deployed, get OOM-killed. A breaker protects the caller from a bad neighbor.

What's a retry budget? A cap on retries as a percentage of normal traffic (e.g., retries can't exceed 10% of requests). Prevents retry storms from turning a slow service into a dead one. Popularized by Google SRE.

Recommended cap for exponential backoff? Usually 30 to 60 seconds max. Beyond that, the user has already left. Always add jitter, otherwise everyone syncs up.

Is idempotency a prerequisite for retry? For non-read operations, yes. Retrying POST /payments without idempotency keys is how you charge a customer three times. Use Idempotency-Key headers or deterministic IDs.
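
For illustration, a client can generate one key per logical payment attempt and reuse it across retries (the endpoint is a placeholder):

```typescript
import { randomUUID } from "node:crypto";

// One key per logical payment attempt: retries reuse it, so the server
// can detect and deduplicate repeated requests.
const idempotencyKey = randomUUID();

async function createPayment(amountCents: number) {
  return fetch("https://api.example.com/payments", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Idempotency-Key": idempotencyKey,
    },
    body: JSON.stringify({ amountCents }),
  });
}
```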

Observability vs fault tolerance, which comes first? Observability. You can't fix what you can't see. Metrics, traces, and logs come before fancy patterns. A monitored monolith beats a "resilient" black box.

Conclusion: The Minimum Production Checklist

  • Circuit breakers on external calls
  • Retry with exponential backoff + jitter
  • Timeouts at every layer (cascading)
  • Health checks (liveness + readiness)
  • Graceful shutdown
  • Fallbacks for critical dependencies
  • Error boundaries on the frontend
  • Monitoring: latency, error rate, saturation

You don't need all of it at once. Start with circuit breakers and timeouts on the most critical calls. Add retries and fallbacks. Monitor everything.

Resilient systems aren't the ones that never fail. They're the ones that know how to fail gracefully.