Resilience: Staying Up When Dependencies Don't

Chapter 7 • Part II

The moment you split work across a network, you inherit a hard truth: the parts fail independently, and they fail at the worst time. The payment gateway returns a 503 for forty seconds in the middle of the Friday dinner rush. The maps service that quotes delivery ETAs slows to a crawl and drags your threads down with it. A database connection drops mid-query while a customer is placing an order. None of these are exotic. They are Tuesday.

Here is the arithmetic that should change how you think about a call chain. Placing one order touches several services in turn: pricing, payment, the maps/ETA quote, courier matching, the restaurant's kitchen feed. Suppose each hop succeeds 95% of the time, which is generous for a real network. Chain ten of them and your end-to-end success rate is 0.95^10, about 60%. Stretch it to twenty hops and you are at 0.95^20, roughly 36%. This is not a measured failure rate from some study; it is multiplication. But it tells you the thing worth knowing: in a distributed system, individual reliability compounds downward, and the only way back up is to make each hop forgive the others.
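The multiplication above is easy to check for yourself. The sketch below is illustrative only: the class and method names are hypothetical, and both formulas assume failures are independent, which a real outage will happily violate.

```csharp
using System;

public static class ChainReliability
{
    // End-to-end success for `hops` sequential calls that each succeed
    // with probability `perHop`, assuming failures are independent.
    public static double EndToEnd(double perHop, int hops) =>
        Math.Pow(perHop, hops);

    // Success probability of a single hop that may be attempted up to
    // `attempts` times, assuming each attempt fails independently.
    public static double WithAttempts(double perHop, int attempts) =>
        1 - Math.Pow(1 - perHop, attempts);
}
```

Under these assumptions, EndToEnd(0.95, 10) comes out near 0.60 and EndToEnd(0.95, 20) near 0.36, matching the figures above. Give every hop one extra attempt and EndToEnd(WithAttempts(0.95, 2), 10) is roughly 0.975, which is the sense in which each hop "forgives" the others.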

Reliability isn't the absence of failure. It's the presence of recovery.

The resilience altitude is where you stop pretending dependencies are always there and start designing for the moments they aren't. The patterns here are drawn mostly from Nygard's Release It!, with the retry maths from AWS. We use Polly for the C#, but the concepts outlive any library. Throughout, the call we keep coming back to is the one that hurts most when it breaks: placing an order and charging the customer for it.

Retry (Exponential Backoff + Jitter)

The problem: a transient fault (a dropped connection, a brief timeout, a momentary 503 from the payment gateway) fails a charge that would have succeeded half a second later.

The naive fix is to try again immediately, and that makes things worse. When the payment gateway wobbles, every order in flight retries at once, and the synchronised stampede keeps it down. The real fix has two parts. Exponential backoff waits longer after each failure (200ms, 400ms, 800ms). Jitter randomises each delay so a thousand orders don't all retry on the same tick. AWS's own guidance on this is the canonical reference, and it lands on full jitter as the default (Marc Brooker, "Exponential Backoff And Jitter," AWS Architecture Blog).
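To make the two ideas concrete, here is a minimal hand-rolled sketch of the full-jitter calculation as Brooker's post describes it: a uniformly random wait between zero and a capped exponential ceiling. The `Backoff` class, its method names, and the cap parameter are assumptions for illustration; this is not how any particular library implements its jitter.

```csharp
using System;

public static class Backoff
{
    // Upper bound for the wait before retry number `retry` (0-based):
    // baseDelay * 2^retry, clamped to `cap` so the wait cannot grow forever.
    public static TimeSpan Ceiling(int retry, TimeSpan baseDelay, TimeSpan cap)
    {
        double ms = baseDelay.TotalMilliseconds * Math.Pow(2, retry);
        return TimeSpan.FromMilliseconds(Math.Min(ms, cap.TotalMilliseconds));
    }

    // Full jitter: pick a uniformly random wait between zero and the ceiling,
    // so clients that failed together do not retry together.
    public static TimeSpan FullJitter(int retry, TimeSpan baseDelay, TimeSpan cap, Random rng)
    {
        double ceilingMs = Ceiling(retry, baseDelay, cap).TotalMilliseconds;
        return TimeSpan.FromMilliseconds(rng.NextDouble() * ceilingMs);
    }
}
```

With a 200ms base, the ceilings for retries 0, 1 and 2 are 200ms, 400ms and 800ms, the same progression as above; the jittered wait lands somewhere below each ceiling. The cap matters once the retry count grows, because an uncapped exponential quickly reaches waits nobody wants.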

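using System;
using Polly;
using Polly.Retry;

var retry = new ResiliencePipelineBuilder()
    .AddRetry(new RetryStrategyOptions
    {
        MaxRetryAttempts = 3,                        // up to 3 retries after the first attempt
        BackoffType = DelayBackoffType.Exponential,  // delay grows after each failure
        UseJitter = true,                            // randomise delays to avoid synchronised retries
        Delay = TimeSpan.FromMilliseconds(200),      // base delay for the first retry
        // Retry only the gateway's exception type; anything else propagates immediately
        ShouldHandle = new PredicateBuilder().Handle<PaymentGatewayException>()
    })
    .Build();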

What it buys you in production: the most return for the least code of any pattern here. Most faults are transient, and a retry with backoff turns the majority of them into a non-event the customer never sees.

Skip-if: the operation isn't safe to repeat. A retry on a non-idempotent charge can bill the card twice or place two orders for the same basket. Don't add retry until you've read the idempotency section below; the two patterns are a pair, not a choice.

In the front-end. The customer app retries its own fetches too. A flaky mobile connection drops the "place order" request; the client retries with backoff, reusing the same idempotency key (so the server never double-places), and queues the order locally if the device is offline, replaying it when the signal returns.
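The detail that makes a retried "place order" safe is that the key is generated once, before the first attempt, and reused on every retry. The sketch below shows only that detail, under loud assumptions: `OrderServer` is a hypothetical in-memory stand-in for the real service, `OrderClient` is a hypothetical client helper, and the backoff wait and offline queue are left out. It is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for the order service (not thread-safe).
public sealed class OrderServer
{
    private readonly Dictionary<string, Guid> _ordersByKey = new Dictionary<string, Guid>();

    public int OrdersCreated { get; private set; }

    // Returns the existing order when the key has been seen before,
    // so a replayed request never creates a second order.
    public Guid PlaceOrder(string idempotencyKey)
    {
        if (_ordersByKey.TryGetValue(idempotencyKey, out Guid existing))
            return existing;

        Guid id = Guid.NewGuid();
        _ordersByKey[idempotencyKey] = id;
        OrdersCreated++;
        return id;
    }
}

public static class OrderClient
{
    // Creates the idempotency key once and reuses it for every attempt.
    // `send` is whatever transmits the request; it receives the key.
    public static Guid PlaceWithRetry(Func<string, Guid> send, int maxAttempts)
    {
        string key = Guid.NewGuid().ToString("N");
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return send(key);
            }
            catch (TimeoutException) when (attempt < maxAttempts)
            {
                // A real client would wait with backoff and jitter here.
            }
        }
    }
}
```

The case this guards against is the nasty one: the server places the order, the response is lost, and the client retries. Because the second request carries the same key, the server hands back the original order instead of creating another.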

Circuit Breaker

The problem: the maps/ETA service is genuinely down, not hiccuping, and every order you place waits for its quote to time out, holds a thread, and pointlessly hammers a service that can't answer.

Retrying a corpse is cruel and slow. The Circuit Breaker (Nygard, Release It!; Martin Fowler's bliki, "CircuitBreaker") wraps the call in a state machine. While calls succeed it stays closed. After enough failures it trips open and fails fast for a cooldown period: no call, no wait. After the cooldown it goes half-open, lets a trial call through, and either recovers to closed or trips open again. While the breaker on the ETA service is open, order placement keeps running; it just shows a default ETA instead of a live one (the Timeout + Fallback section makes that concrete).
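The state machine is small enough to sketch by hand, which helps when reasoning about what a library is doing for you. This is a deliberately simplified illustration, not Polly's implementation: `SimpleBreaker` and `CircuitOpenException` are hypothetical names, it trips on consecutive failures rather than a failure ratio over a sampling window, it takes an injected clock so the cooldown can be tested, and it is not thread-safe.

```csharp
using System;

public enum BreakerState { Closed, Open, HalfOpen }

// Thrown instead of calling the dependency while the breaker is open.
public sealed class CircuitOpenException : Exception
{
    public CircuitOpenException() : base("Circuit open: failing fast.") { }
}

public sealed class SimpleBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _cooldown;
    private readonly Func<DateTimeOffset> _now;   // injected clock
    private int _consecutiveFailures;
    private DateTimeOffset _openedAt;

    public BreakerState State { get; private set; } = BreakerState.Closed;

    public SimpleBreaker(int failureThreshold, TimeSpan cooldown, Func<DateTimeOffset> now)
    {
        _failureThreshold = failureThreshold;
        _cooldown = cooldown;
        _now = now;
    }

    public T Execute<T>(Func<T> call)
    {
        if (State == BreakerState.Open)
        {
            // Still cooling down: fail fast without touching the dependency.
            if (_now() - _openedAt < _cooldown)
                throw new CircuitOpenException();

            // Cooldown over: let one trial call through.
            State = BreakerState.HalfOpen;
        }

        try
        {
            T result = call();
            _consecutiveFailures = 0;
            State = BreakerState.Closed;   // success closes (or keeps closed)
            return result;
        }
        catch
        {
            _consecutiveFailures++;
            // A failed trial call, or too many failures in a row, opens the breaker.
            if (State == BreakerState.HalfOpen || _consecutiveFailures >= _failureThreshold)
            {
                State = BreakerState.Open;
                _openedAt = _now();
            }
            throw;
        }
    }
}
```

A caller would construct it with something like new SimpleBreaker(5, TimeSpan.FromSeconds(15), () => DateTimeOffset.UtcNow) and catch CircuitOpenException to serve the default ETA. In production, prefer the library's breaker; the point of the sketch is only to show the three states and the transitions between them.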

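using System;
using Polly;
using Polly.CircuitBreaker;

var breaker = new ResiliencePipelineBuilder()
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        FailureRatio = 0.5,                          // trip when at least half the sampled calls fail
        MinimumThroughput = 10,                      // ...but only once 10 calls were seen in the window
        SamplingDuration = TimeSpan.FromSeconds(30), // window over which the ratio is measured
        BreakDuration = TimeSpan.FromSeconds(15)     // stay open this long before a half-open trial
    })
    .Build();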

What it buys you in production: it stops a struggling dependency from taking you down with it, and it gives that dependency room to recover instead of a fresh wave of traffic. A dead ETA service no longer blocks every order; it just degrades the ETA. Fail fast beats fail slow every time.

Skip-if: you're calling a single in-process component or a local cache with no real failure mode. A breaker around your own menu Composite, resolved in memory, is moving parts you have to reason about for no payoff.
