The Production-Ready Playbook

Resilience: Staying Up When Dependencies Don't

Chapter 7
Part II

The moment you split work across a network, you inherit a hard truth: the parts fail independently, and they fail at the worst time. The payment gateway returns a 503 for forty seconds in the middle of the Friday dinner rush. The maps service that quotes delivery ETAs slows to a crawl and drags your threads down with it. A database connection drops mid-query while a customer is placing an order. None of these are exotic. They are Tuesday.

Here is the arithmetic that should change how you think about a call chain. Placing one order touches several services in turn: pricing, payment, the maps/ETA quote, courier matching, the restaurant's kitchen feed. Suppose each hop succeeds 95% of the time, which is generous for a real network. Chain ten of them and your end-to-end success rate is 0.95^10, about 60%. Stretch it to twenty hops and you are at 0.95^20, roughly 36%. This is not a measured failure rate from some study; it is multiplication. But it tells you the thing worth knowing: in a distributed system, individual reliability compounds downward, and the only way back up is to make each hop forgive the others.
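Written out, the multiplication looks like this. It is a sketch under two assumptions: every hop succeeds with the same probability p, and the hops fail independently of one another. The last line is an added illustration, not a figure from any study: if each hop gets one retry and the two attempts fail independently, a hop only fails when both attempts do.

```latex
\begin{align*}
P(\text{order succeeds}) &= p^{n} \\
0.95^{10} &\approx 0.599 \\
0.95^{20} &\approx 0.358 \\
\bigl(1 - 0.05^{2}\bigr)^{10} = 0.9975^{10} &\approx 0.975
\end{align*}
```

Real failures are often correlated, so treat the last number as the direction of travel rather than a forecast.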

Reliability isn't the absence of failure. It's the presence of recovery.

The resilience altitude is where you stop pretending dependencies are always there and start designing for the moments they aren't. The patterns here are drawn mostly from Nygard's Release It!, with the retry maths from AWS. We use Polly for the C#, but the concepts outlive any library. Throughout, the call we keep coming back to is the one that hurts most when it breaks: placing an order and charging the customer for it.

Every hop is a coin-flip

Retry (Exponential Backoff + Jitter)

The problem: a transient fault (a dropped connection, a brief timeout, a momentary 503 from the payment gateway) fails a charge that would have succeeded half a second later.

The naive fix is to try again immediately. That makes things worse. When the payment gateway wobbles, every order in flight retries at once, and the synchronised stampede keeps it down. The first real fix is exponential backoff: wait longer after each failure (200ms, 400ms, 800ms). The second is jitter: randomise each delay so a thousand orders don't all retry on the same tick. AWS's own guidance on this is the canonical reference, and it lands on full jitter as the default (Marc Brooker, "Exponential Backoff And Jitter," AWS Architecture Blog).

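using System;
using Polly;
using Polly.Retry;

// Up to 3 retries after the first attempt, backing off from 200ms with jitter.
var retry = new ResiliencePipelineBuilder()
    .AddRetry(new RetryStrategyOptions
    {
        MaxRetryAttempts = 3,
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true,
        Delay = TimeSpan.FromMilliseconds(200),
        // Retry only the gateway's own fault type, not every exception.
        ShouldHandle = new PredicateBuilder().Handle<PaymentGatewayException>()
    })
    .Build();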
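To see what jitter is doing, here is a minimal hand-rolled sketch of the full-jitter calculation from the AWS post: pick a random delay between zero and an exponentially growing ceiling. It is an illustration, not Polly's internals; Polly computes its own jittered delays when UseJitter is set. The class name, the method and the maxDelay parameter are hypothetical names chosen for this example.

```csharp
using System;

// Hypothetical helper, not part of Polly.
// Full jitter: delay = random value in [0, min(maxDelay, baseDelay * 2^attempt)).
public static class FullJitterBackoff
{
    public static TimeSpan NextDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay, Random rng)
    {
        if (attempt < 0) throw new ArgumentOutOfRangeException(nameof(attempt));

        // Exponential ceiling, capped so a long outage never produces huge waits.
        double ceilingMs = Math.Min(
            maxDelay.TotalMilliseconds,
            baseDelay.TotalMilliseconds * Math.Pow(2, attempt));

        // Spread callers anywhere between zero and the ceiling.
        return TimeSpan.FromMilliseconds(rng.NextDouble() * ceilingMs);
    }
}
```

With a 200ms base, attempts 0, 1 and 2 draw from ranges topping out at 200ms, 400ms and 800ms, the same ladder as above, but each order lands somewhere different inside its range.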

What it buys you in production: the most return for the least code of any pattern here. Most faults are transient, and a retry with backoff turns the majority of them into a non-event the customer never sees.

Skip-if: the operation isn't safe to repeat. A retry on a non-idempotent charge can bill the card twice or place two orders for the same basket. Don't add retry until you've read the idempotency section below; the two patterns are a pair, not a choice.

In the front-end. The customer app retries its own fetches too. A flaky mobile connection drops the "place order" request; the client retries with backoff, reusing the same idempotency key (so the server never double-places), and queues the order locally if the device is offline, replaying it when the signal returns.
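The part that matters most is where the key is created: once per order, before the first attempt, and never inside the retry loop. A minimal sketch of that shape follows, written in C# to match the rest of the chapter (the real customer app may well be in another language). OrderClient, PlaceWithRetryAsync and the send delegate are hypothetical names, the choice of HttpRequestException as the retryable fault is an assumption, and offline queueing is left out.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical client-side helper; `send` posts the order with the given key.
public static class OrderClient
{
    public static async Task<T> PlaceWithRetryAsync<T>(
        Func<string, Task<T>> send, int maxAttempts, TimeSpan baseDelay)
    {
        // One key per order, generated before the first attempt and reused on
        // every retry so the server can recognise a duplicate request.
        string idempotencyKey = Guid.NewGuid().ToString("N");

        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return await send(idempotencyKey);
            }
            catch (HttpRequestException) when (attempt < maxAttempts - 1)
            {
                // Backoff with full jitter before the next attempt.
                // Random.Shared requires .NET 6 or later.
                double ceilingMs = baseDelay.TotalMilliseconds * Math.Pow(2, attempt);
                await Task.Delay(TimeSpan.FromMilliseconds(Random.Shared.NextDouble() * ceilingMs));
            }
        }
    }
}
```

Once the attempts run out, the exception filter stops matching and the last failure propagates to the caller, which is the point at which an app could fall back to queueing the order locally.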

Circuit Breaker

The problem: the maps/ETA service is genuinely down, not hiccuping, and every order you place waits for its quote to time out, holds a thread, and pointlessly hammers a service that can't answer.

Retrying a corpse is cruel and slow. The Circuit Breaker (Nygard, Release It!; Martin Fowler's bliki, "CircuitBreaker") wraps the call in a state machine. While calls succeed it stays closed. After a run of failures it trips open and fails fast for a cooldown, no call, no wait. After the cooldown it goes half-open, lets a trial call through, and either recovers to closed or trips open again. While the breaker on the ETA service is open, order placement keeps running; it just shows a default ETA instead of a live one (the Timeout + Fallback section makes that concrete).

Circuit breaker
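using System;
using Polly;
using Polly.CircuitBreaker;

var breaker = new ResiliencePipelineBuilder()
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        FailureRatio = 0.5,                          // trip when half the sampled calls fail...
        MinimumThroughput = 10,                      // ...but only once at least 10 calls were sampled
        SamplingDuration = TimeSpan.FromSeconds(30), // window over which the ratio is measured
        BreakDuration = TimeSpan.FromSeconds(15)     // cooldown before the half-open trial call
    })
    .Build();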
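If the state machine feels abstract, it helps to see the three states in a few dozen lines. The sketch below is deliberately simplified and is not how Polly implements its breaker: it counts consecutive failures instead of a failure ratio over a sampling window, it treats every exception as a failure, and it is not thread-safe. SimpleCircuitBreaker, its members and the injected clock are hypothetical names for this example.

```csharp
using System;

public enum BreakerState { Closed, Open, HalfOpen }

// Illustrative, single-threaded breaker that trips on consecutive failures.
public sealed class SimpleCircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _breakDuration;
    private readonly Func<DateTimeOffset> _clock; // injected so tests can control time
    private int _consecutiveFailures;
    private DateTimeOffset _openedAt;

    public BreakerState State { get; private set; } = BreakerState.Closed;

    public SimpleCircuitBreaker(int failureThreshold, TimeSpan breakDuration, Func<DateTimeOffset> clock)
    {
        _failureThreshold = failureThreshold;
        _breakDuration = breakDuration;
        _clock = clock;
    }

    public T Execute<T>(Func<T> call, Func<T> fallback)
    {
        if (State == BreakerState.Open)
        {
            if (_clock() - _openedAt < _breakDuration)
                return fallback();          // fail fast: no call, no wait
            State = BreakerState.HalfOpen;  // cooldown over: let one trial call through
        }

        try
        {
            T result = call();
            _consecutiveFailures = 0;
            State = BreakerState.Closed;    // success closes (or keeps closed) the breaker
            return result;
        }
        catch (Exception)
        {
            _consecutiveFailures++;
            // A failed trial call, or too many failures in a row, opens the breaker.
            if (State == BreakerState.HalfOpen || _consecutiveFailures >= _failureThreshold)
            {
                State = BreakerState.Open;
                _openedAt = _clock();
            }
            return fallback();
        }
    }
}
```

For the ETA quote, call would be the request to the maps service and fallback would return the default ETA, so an open breaker degrades the quote without blocking the order.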

What it buys you in production: it stops a struggling dependency from taking you down with it, and it gives that dependency room to recover instead of a fresh wave of traffic. A dead ETA service no longer blocks every order; it just degrades the ETA. Fail fast beats fail slow every time.

Skip-if: you're calling a single in-process component or a local cache with no real failure mode. A breaker around your own menu Composite, resolved in memory, is moving parts you have to reason about for no payoff.
