The payment Saga: charge, pay out, compensate

Chapter 10
Part III

Placing an order is not one database write. It is a distributed transaction across the payment provider, the restaurant's payout account, and the courier's. No single commit spans them, so coordination falls to a Saga (Garcia-Molina & Salem; Richardson). The Saga charges the customer, pays out the restaurant, pays out the courier, and on any failure runs the compensations in reverse order: claw back any payout already made, then refund the customer. A process manager tracks which step the order is on so a crash resumes rather than restarts.

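public async Task Run(Guid orderId, CancellationToken ct)
{
    var charge = await _payments.Charge(orderId, ct);          // step 1
    var restaurantPaid = false;
    try
    {
        await _payouts.PayRestaurant(orderId, ct);             // step 2
        restaurantPaid = true;
        await _payouts.PayCourier(orderId, ct);                // step 3
    }
    catch
    {
        // Compensate in reverse order. CancellationToken.None: a cancelled request must not cancel the refund.
        if (restaurantPaid)
            await _payouts.ClawBackRestaurant(orderId, CancellationToken.None);   // compensate step 2 (assumed method name)
        await _payments.Refund(charge.Id, CancellationToken.None);                // compensate step 1
        throw;
    }
}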
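
The process manager mentioned above can be sketched roughly as follows. This is a minimal illustration, not the service's actual implementation: SagaStep, InMemorySagaStore, and ResumableSaga are hypothetical names, and the in-memory dictionary stands in for what would be a durable row per order.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical marker: the last step that completed for an order.
public enum SagaStep { None, Charged, RestaurantPaid, CourierPaid }

// Hypothetical store. A real process manager would persist this durably.
public sealed class InMemorySagaStore
{
    private readonly Dictionary<Guid, SagaStep> _steps = new();

    public SagaStep Load(Guid orderId) =>
        _steps.TryGetValue(orderId, out var step) ? step : SagaStep.None;

    public void Save(Guid orderId, SagaStep step) => _steps[orderId] = step;
}

public sealed class ResumableSaga
{
    private readonly InMemorySagaStore _store;
    private readonly Func<Guid, Task> _charge, _payRestaurant, _payCourier;

    public ResumableSaga(InMemorySagaStore store,
        Func<Guid, Task> charge, Func<Guid, Task> payRestaurant, Func<Guid, Task> payCourier)
    {
        _store = store;
        _charge = charge;
        _payRestaurant = payRestaurant;
        _payCourier = payCourier;
    }

    // Runs only the steps that have not completed yet, recording progress after each one.
    public async Task Run(Guid orderId)
    {
        var done = _store.Load(orderId);

        if (done < SagaStep.Charged)
        {
            await _charge(orderId);
            _store.Save(orderId, SagaStep.Charged);
        }
        if (done < SagaStep.RestaurantPaid)
        {
            await _payRestaurant(orderId);
            _store.Save(orderId, SagaStep.RestaurantPaid);
        }
        if (done < SagaStep.CourierPaid)
        {
            await _payCourier(orderId);
            _store.Save(orderId, SagaStep.CourierPaid);
        }
    }
}
```

A crash between a step and its Save replays that one step on resume, which is one more reason each step has to be idempotent.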

The charge call is the riskiest hop in the whole service, so it runs through a Polly resilience pipeline: a retry with exponential backoff and jitter (Brooker, AWS) wrapping a circuit breaker (Nygard, Release It!) and, innermost, a timeout. Polly runs strategies in the order they are added, so the first one added is the outermost. The retry rides out a payment-gateway hiccup; the breaker stops hammering a provider that is genuinely down; the timeout bounds each attempt, so a slow provider cannot hold an order hostage.

_pipeline = new ResiliencePipelineBuilder()
    .AddRetry(new RetryStrategyOptions
    {
        MaxRetryAttempts = 3,
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true
    })
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        FailureRatio = 0.5,
        BreakDuration = TimeSpan.FromSeconds(30)
    })
    .AddTimeout(TimeSpan.FromSeconds(10))
    .Build();

Retries are only safe because the charge is idempotent, keyed on the order id. A retry that lands after the provider already took the money returns the original charge instead of taking it twice. Without that key, every backoff is a chance to double-charge a customer for one dinner.

public async Task<Charge> Charge(Guid orderId, CancellationToken ct) =>
    // async lambda: ExecuteAsync expects a ValueTask-returning callback
    await _pipeline.ExecuteAsync(async t =>
        await _provider.Charge(orderId, idempotencyKey: orderId.ToString(), t), ct);

A retry without an idempotency key is a coin-flip on whether the customer pays once or twice. Key the charge, or don't retry it.
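
To make the provider side of that contract concrete, here is a small in-memory sketch of what "returns the original charge" means. FakePaymentProvider and ChargeResult are hypothetical stand-ins for illustration; a real provider keeps the key-to-charge mapping in durable storage and usually expires keys after a retention window.

```csharp
using System;
using System.Collections.Generic;

public sealed record ChargeResult(Guid ChargeId, string IdempotencyKey, decimal Amount);

// Hypothetical stand-in for a payment provider that honours idempotency keys.
public sealed class FakePaymentProvider
{
    private readonly object _gate = new();
    private readonly Dictionary<string, ChargeResult> _byKey = new();

    // Counts how many times money actually moved.
    public int MoneyMovements { get; private set; }

    public ChargeResult Charge(string idempotencyKey, decimal amount)
    {
        lock (_gate)
        {
            // Replay of a known key: return the original charge, move no money.
            if (_byKey.TryGetValue(idempotencyKey, out var existing))
                return existing;

            var charge = new ChargeResult(Guid.NewGuid(), idempotencyKey, amount);
            _byKey[idempotencyKey] = charge;
            MoneyMovements++;
            return charge;
        }
    }
}
```

Calling Charge twice with the same key yields the same ChargeId and a single money movement, which is the property every retry in the pipeline depends on.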

The read path: a projection, not the log

The customer does not wait on any of this synchronously. The app polls GET /orders/{id} and reads status from a projection: a live read model the workers keep current as they advance the order's state. The place-order path writes; the read path reads its own shape, the one the tracking screen wants. That separation is CQRS (Greg Young; Fowler's bliki), and the order-history and restaurant live-board views are further Materialized Views (Fowler; Azure Cloud Design Patterns) over the same event stream.

public sealed class GetOrderStatusHandler(IOrderReadModel reads)
    : IRequestHandler<GetOrderStatus, OrderStatusDto?>
{
    public Task<OrderStatusDto?> Handle(GetOrderStatus q, CancellationToken ct) =>
        reads.StatusById(q.OrderId, ct);   // a thin projection query, tenant-scoped by RLS
}

The same Row-Level Security policy that protected the write protects this read. A customer reads only their own order; a restaurant's live-board reads only its own tenant's orders, and no developer had to remember to add WHERE tenant_id = @id. The isolation is structural.
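
The write side of that read model can be sketched as a projection that folds status events into one row per order. The event and view shapes below (OrderStatusChanged, OrderStatusView) are assumptions made for illustration, and the dictionary stands in for the read-model table.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical event emitted by the workers as an order advances.
public sealed record OrderStatusChanged(Guid OrderId, string Status, DateTimeOffset At);

// Hypothetical read-model row: exactly the shape the tracking screen wants.
public sealed record OrderStatusView(Guid OrderId, string Status, DateTimeOffset UpdatedAt);

public sealed class OrderStatusProjection
{
    private readonly Dictionary<Guid, OrderStatusView> _views = new();

    // Applies one event. Stale or redelivered events are ignored,
    // so at-least-once delivery cannot move the status backwards.
    public void Apply(OrderStatusChanged e)
    {
        if (_views.TryGetValue(e.OrderId, out var current) && current.UpdatedAt >= e.At)
            return;

        _views[e.OrderId] = new OrderStatusView(e.OrderId, e.Status, e.At);
    }

    public OrderStatusView? StatusById(Guid orderId) =>
        _views.TryGetValue(orderId, out var view) ? view : null;
}
```

Comparing timestamps is the simplest ordering guard for a sketch; a per-order sequence number would be more robust against clock skew between workers.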

Seeing it run: health, traces, metrics

None of the above is operable unless you can see it. The observability altitude threads through every component at once. A health endpoint (Health Endpoint Monitoring, Azure) tells the orchestrator whether to send traffic and whether to restart. Liveness says the process is alive; readiness says it can reach its database and broker.

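builder.Services.AddHealthChecks()
    // tag both checks "ready"; without the tag the readiness predicate below matches nothing
    .AddNpgSql(cfg.GetConnectionString("Orders")!, name: "db", tags: new[] { "ready" })
    .AddCheck<BrokerHealthCheck>("broker", tags: new[] { "ready" });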

app.MapHealthChecks("/health/live",  new() { Predicate = _ => false });
app.MapHealthChecks("/health/ready", new() { Predicate = c => c.Tags.Contains("ready") });

Distributed tracing (OpenTelemetry) follows a single order from the HTTP place-order request, through the queue, into the kitchen and courier workers, and through the payment Saga, by carrying a correlation id (the trace context) across the broker boundary. Without it, a charge that failed in the Saga cannot be connected to the order that triggered it. With it, one trace tells the whole story: placed, confirmed, charged, assigned, delivered. The metrics are the four golden signals (Google SRE): latency, traffic, errors, saturation. Queue depth is the saturation signal that matters most here, because a queue that only grows during a rush is a worker fleet that cannot keep up.

builder.Services.AddOpenTelemetry()
    .WithTracing(t => t.AddAspNetCoreInstrumentation().AddNpgsql().AddOtlpExporter())
    .WithMetrics(m => m.AddMeter("orders.worker").AddOtlpExporter());

// in the worker: the saturation signal that actually predicts a backed-up kitchen
_queueDepth = meter.CreateObservableGauge("orders.queue.depth", () => _broker.ApproximateDepth());
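
Carrying the correlation id across the broker boundary can be sketched with the .NET Activity API alone. This is an assumption-laden illustration: TraceHeaders is a hypothetical helper, the message headers are modelled as a plain string dictionary, and a real service would more likely rely on the broker client's instrumentation or an OpenTelemetry propagator than on hand-rolled code.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper: moves W3C trace context through message headers.
public static class TraceHeaders
{
    public const string TraceParent = "traceparent";

    // Producer side: copy the current activity's id into the outgoing headers.
    // Activity ids use the W3C traceparent format by default on .NET 5 and later.
    public static void Inject(IDictionary<string, string> headers)
    {
        var id = Activity.Current?.Id;
        if (id != null)
            headers[TraceParent] = id;
    }

    // Consumer side: start the worker's activity as a child of the producer's,
    // or as a new root when the message carries no usable trace context.
    public static Activity? StartConsumer(
        ActivitySource source, string name, IReadOnlyDictionary<string, string> headers)
    {
        headers.TryGetValue(TraceParent, out var parentId);

        return ActivityContext.TryParse(parentId, null, out var parent)
            ? source.StartActivity(name, ActivityKind.Consumer, parent)
            : source.StartActivity(name, ActivityKind.Consumer);
    }
}
```

StartActivity returns null when nothing is listening to the source, so callers should treat the returned activity as optional.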

Structured logs (Serilog) carry the same correlation id and the tenant_id on every line, keyed on order_id, so a support question about one customer's missing dinner is a filter, not an archaeology dig. The trail records references and decisions, not payloads, keeping it useful without turning the log into a copy of the orders table.

Where it lives: a stateless container that scales to zero

The whole thing ships as a container (the 12-Factor App). State lives outside the image, in the database and the broker, so any instance can serve any order and the orchestrator can kill and restart instances freely. Because the workers hold nothing local, the platform can scale them to zero between meal rushes and back up when the queue fills (Cloud Run and Azure Container Apps; AWS App Runner idles down to a provisioned minimum rather than to zero). The analytics worker, idle most of the afternoon, scales all the way down.

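# build stage (assumed layout: a single project in the build context)
FROM mcr.microsoft.com/dotnet/sdk:9.0 AS build
WORKDIR /src
COPY . .
RUN dotnet publish -c Release -o /publish

FROM mcr.microsoft.com/dotnet/aspnet:9.0 AS base
WORKDIR /app
COPY --from=build /publish .
ENV ASPNETCORE_URLS=http://+:8080
EXPOSE 8080
ENTRYPOINT ["dotnet", "OrderService.dll"]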

Scale-to-zero only works if the container shuts down cleanly. On SIGTERM the worker stops pulling new messages, lets the in-flight order finish its current step, and exits, so the platform never kills work mid-charge. The host's graceful-shutdown hook does exactly that, and the IHostedService honouring the cancellation token is what makes it safe.

public async Task StopAsync(CancellationToken ct)
{
    _accepting = false;                       // stop pulling new orders
    await _inFlight.WhenAllOrTimeout(ct);     // let the current step finish, then exit
}
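
WhenAllOrTimeout is not a framework method, so one plausible shape for it is sketched here. The assumption is that _inFlight is a collection of tasks and that the token passed to StopAsync is cancelled when the host's shutdown grace period runs out; the extension class name is hypothetical.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class InFlightExtensions
{
    // Completes when every in-flight task has finished, or when the token is
    // cancelled (the shutdown grace period expired), whichever comes first.
    // Returns true only if all in-flight work finished in time.
    public static async Task<bool> WhenAllOrTimeout(
        this IEnumerable<Task> inFlight, CancellationToken ct)
    {
        var all = Task.WhenAll(inFlight);
        var graceExpired = Task.Delay(Timeout.Infinite, ct);

        var first = await Task.WhenAny(all, graceExpired);
        if (first != all)
            return false;   // gave up waiting; the platform will kill what remains

        await all;          // surface any failure from the in-flight steps
        return true;
    }
}
```

Returning a flag lets the caller log when the grace period cut work short, since those are the orders the process manager will have to resume.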

The same image runs on any of the three clouds because nothing in it is tied to one. The broker, the database, and the secrets come from configuration, not from compile-time choices. That is the cloud-agnostic stance made concrete. You build one artifact and point it at whichever target you are deploying to, with no rewrite in between.

The ladder as one picture

Step back and the whole stack is visible in one service. Dependency Injection (Fowler's Inversion of Control article; the "D" of Robert C. Martin's SOLID) and the Mediator wire the inside, while the OrderGateway and RLS hold the data. The Outbox, Pub/Sub, and Competing Consumers move the work, the payment Saga coordinates the money, and Polly keeps the charge safe when the provider wobbles. Health checks, traces, and golden-signal metrics make all of it observable, and a stateless container that scales to zero hosts the lot. A Strategy decides which courier gets the order. A State machine runs that order's life from placed to delivered, against the menu a Composite priced in the first place.

A whole vocabulary of patterns, and one slice of a food-delivery app uses maybe fifteen of them well. That ratio is the book.

Notice what is still missing. Full Event Sourcing never appears, because a projection over explicit state transitions is enough for live tracking; the event log stays a convenience here, not the system of record. Sharding is absent too, since one database holds these orders comfortably until the marketplace spans cities. You will also find no second tenant-isolation tier, no service mesh, and no Claim-Check for the small receipts these orders carry. The Saga earned its place because there genuinely is a distributed transaction, money moving across three parties. The heavier patterns did not, because the order does not yet need them. Every one of those absences was a deliberate skip, and the service is more maintainable for it.

The temptation now is to use all of it. Resist; the next chapter is about the tax you pay when you don't.
