When it fails anyway: dead-letter and the leash

Chapter 10 • Part III


Retries, schema validation, idempotency keys, and resilient API clients all reduce how often an AI screening agent fails. None of it eliminates failure. So the question that defines a production system, not a demo, is: what happens to the CV the agent couldn't process? It cannot vanish. The one failure mode you are never allowed to have is the silent drop: the candidate who applied, was never screened, and nobody noticed.

The default failure path must be "a human sees it." Concretely: catch the unrecoverable failure, push the item to a dead-letter queue (Pub/Sub, or SQS + SNS on AWS, or Azure Service Bus), and set the candidate's status in the ATS to something a recruiter will actually see, such as review_required. The work doesn't disappear; it changes lanes.

// Illustrative excerpt. Failure → dead-letter → visible "needs human" status.
catch (Exception ex) when (ex is not OperationCanceledException)
{
    await _deadLetter.PublishAsync(new FailedItem(candidateId, jobId, ex.Message), ct);
    await _ats.SetCandidateStatusAsync(candidateId, "review_required");  // the leash
    _log.DecisionFailed(candidateId, jobId, ex.GetType().Name);          // typed, no PII
    return AgentOutcome.NeedsHuman(ex.Message);
}

Then watch the dead-letter queue. A DLQ that's growing is the single clearest signal that something systemic has broken (an expired credential, a drifted field, a model outage), and it's the alert that earns its keep. An empty DLQ is a healthy system. A silently full one is a future apology to a client.
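One way to turn "watch the queue" into an alert is to look at the trend across a few depth samples rather than at any single failure. The following is a minimal sketch, not a prescribed implementation: `IsGrowing`, the sample window, and the `minGrowth` threshold are hypothetical, and the depth samples are assumed to come from whatever backlog metric your queue exposes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: true when the backlog grew by at least `minGrowth`
// items between the oldest and newest sample and never shrank in between.
static bool IsGrowing(IReadOnlyList<int> depthSamples, int minGrowth)
{
    if (depthSamples.Count < 2) return false;   // not enough data to judge a trend

    bool neverShrank = depthSamples
        .Zip(depthSamples.Skip(1), (prev, next) => next >= prev)
        .All(ok => ok);

    int growth = depthSamples[depthSamples.Count - 1] - depthSamples[0];
    return neverShrank && growth >= minGrowth;
}

// Usage with made-up samples, one per minute.
Console.WriteLine(IsGrowing(new[] { 0, 0, 1, 0, 1 }, minGrowth: 5));    // False: odd failures, cleared by review
Console.WriteLine(IsGrowing(new[] { 2, 5, 9, 14, 20 }, minGrowth: 5));  // True: steady climb, likely systemic
```

A trend check like this stays quiet for the occasional bad CV and fires for the expired credential that fails every CV in a row.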

"Silently dropped" is not a failure mode. It's a breach of trust with a candidate who'll never know it happened.

Cost and runaway loops

A reliability chapter that ignores cost is lying by omission, because the two share a root cause: the loop that won't stop. A retry storm against a rate-limited endpoint, or an agent that keeps re-asking the model because validation keeps failing, can run up a serious token bill overnight and hammer the ATS into rate-limit jail at the same time. Unbounded retries are an availability bug and a budget bug.

The fix is cheap and comes in three parts. A per-run token budget that aborts the run before it gets expensive. A circuit breaker (Polly) that stops calling a dependency that's clearly down instead of pummelling it. A concurrency cap so a backlog of CVs doesn't all hit the model and the ATS at once.

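// Illustrative excerpt. Token budget + circuit breaker around the agent run.
// Requires Polly v8: using Polly; using Polly.CircuitBreaker;
if (_run.TokensUsed + estimate > _run.TokenBudget)            // hard ceiling per run
    throw new BudgetExceededException(_run.TokenBudget);

// Build once and reuse: the breaker's open/closed state lives in the pipeline instance.
var breaker = new ResiliencePipelineBuilder()
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        FailureRatio = 0.5,                          // trip if half of the sampled calls fail
        SamplingDuration = TimeSpan.FromSeconds(30), // the window that defines "recent"
        MinimumThroughput = 10,                      // ignore windows with fewer than 10 calls
        BreakDuration = TimeSpan.FromSeconds(30)     // stop hammering a downed dependency
    })
    .Build();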
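The excerpt above covers the budget and the breaker. The third guard, the concurrency cap, can be sketched with a `SemaphoreSlim` from the base class library. This is an assumption-laden illustration: `ProcessAllAsync`, `maxConcurrent`, and the fake work delegate are hypothetical stand-ins for the real agent run.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: runs `work` for every item, never more than `maxConcurrent` at once.
static async Task ProcessAllAsync<T>(IEnumerable<T> items, int maxConcurrent, Func<T, Task> work)
{
    using var gate = new SemaphoreSlim(maxConcurrent);
    var tasks = items.Select(async item =>
    {
        await gate.WaitAsync();          // wait for a free slot
        try { await work(item); }
        finally { gate.Release(); }      // free the slot even if the work throws
    }).ToList();                         // materialise so each task is started exactly once
    await Task.WhenAll(tasks);
}

// Usage: 20 fake CVs, at most 4 in flight; record the highest concurrency observed.
var sync = new object();
int inFlight = 0, peak = 0;

await ProcessAllAsync(Enumerable.Range(1, 20), maxConcurrent: 4, async _ =>
{
    lock (sync) { inFlight++; peak = Math.Max(peak, inFlight); }
    await Task.Delay(20);                // stand-in for the model and ATS calls
    lock (sync) { inFlight--; }
});

Console.WriteLine($"peak concurrency: {peak}");
```

The backlog still drains completely; it just never puts more than four requests on the model or the ATS at the same moment.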

The cost of these guards is a handful of lines. The cost of not having them is a bill you discover after it's spent, and the LLM API bill is bigger than people assume once it's a real agentic loop. On a GPT-5-class model (the realistic default for screening quality) an agent makes ~2–4 calls per CV at roughly 3k input + 700 output tokens each. That lands at about $20–$75 per 1,000 CVs, a few cents per CV; a budget model like GPT-4o-mini is ~10–15× cheaper (~$2–$4 per 1,000). That's cheap only while the loops stay bounded: an unbounded repair loop multiplies those calls per CV without limit. (We tally the full run-rate honestly in the build-vs-buy chapter; here the point is narrower: bound the loops, or the loops bound your budget.)
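The token volume behind those figures is plain arithmetic on the per-call sizes quoted above: 2–4 calls per CV works out to 6M–12M input tokens and 1.4M–2.8M output tokens per 1,000 CVs. The sketch below makes that explicit and shows what a runaway loop does to it. `TokensPer1000Cvs` is a hypothetical helper, the 40-call case is an invented example of an unbounded repair loop, and the code deliberately stops at token counts because per-token prices change.

```csharp
using System;

// Hypothetical helper: total input/output tokens for 1,000 CVs,
// using the per-call sizes quoted in the text (3k in, 700 out).
static (long Input, long Output) TokensPer1000Cvs(int callsPerCv)
{
    const long inputPerCall = 3_000, outputPerCall = 700, cvs = 1_000;
    return (inputPerCall * callsPerCv * cvs, outputPerCall * callsPerCv * cvs);
}

// 2 and 4 are the bounded case from the text; 40 stands in for a repair loop nobody capped.
foreach (int calls in new[] { 2, 4, 40 })
{
    var (input, output) = TokensPer1000Cvs(calls);
    Console.WriteLine($"{calls} calls/CV: {input / 1e6:0.0}M input, {output / 1e6:0.0}M output tokens");
}
```

Token volume scales linearly with calls per CV, so a loop that runs ten times longer than intended costs ten times as much, on every CV in the batch.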

Scale-to-zero realities

The deployment target for these agents is Cloud Run (or App Runner / ECS Fargate on AWS, or Azure Container Apps), configured to scale to zero: when nobody's screening CVs, the runtime runs no containers and bills no idle compute. As an architecture that's the right default: compute scales with the work. But don't mistake the idle-compute saving for the production run-rate. A real deployment isn't the near-zero hobby case. Keep a warm instance (min-instances ≥ 1) to kill cold starts, add a managed audit store (Cloud SQL), a queue (Pub/Sub) and log ingestion, and you're realistically looking at ~$120/month at the low end, ~$300–$350 for a mid-size agency, and $500–$750+ under heavier load or HA (figures we break down in the build-vs-buy chapter). Near-$0 only happens at negligible traffic. Scale-to-zero also hands you a specific set of reliability problems in exchange, and pretending otherwise is how DIY builds get burned.

Cold starts. The first CV after a quiet afternoon waits for a container to spin up and the model client to warm. For batch screening this is a non-issue: a second or two on a 52-second job. For anything interactive, you pay it on every cold request, and the honest fix is min-instances (keep one warm) only where the latency genuinely hurts, which costs you the idle compute you were trying to avoid. Name the trade-off; don't paper over it.

Losing in-flight work. This is the one that bites. On a scale-to-zero runtime, an instance can be reclaimed mid-run. If a half-finished agent run lives only in that container's memory, it dies with the container, and a candidate silently goes unscreened. The container is cattle, not a pet, so durable state cannot live inside it.

The architecture that survives this is boring on purpose: stateless containers, durable work in the queue. Work is driven from Pub/Sub. The consumer acks the message only after the decision is written back, so if the instance dies mid-run, the message is never acked, Pub/Sub redelivers it once the ack deadline expires, and a fresh container picks it up. Redelivery means the same CV can be processed twice, which is exactly what the idempotency keys from earlier in the chapter are for. Nothing is lost because nothing of value ever lived in memory.

// Illustrative excerpt. Ack ONLY after the decision is durably written.
await foreach (var msg in _pubsub.PullAsync(ct))   // work comes from the queue, not memory
{
    try
    {
        var outcome = await _agent.ProcessAsync(msg.CandidateId, msg.JobId, ct);
        await _ats.WriteDecisionAsync(outcome);     // the durable side-effect
        await msg.AckAsync();                        // ack LAST: interrupted work redelivers
    }
    catch (Exception ex) when (ex is not OperationCanceledException)
    {
        // Cancellation (e.g. shutdown) is deliberately NOT caught here: the message
        // stays un-acked and redelivers instead of being dead-lettered.
        await _deadLetter.PublishAsync(msg.ToFailedItem(ex), ct);
        await msg.AckAsync();                        // dead-lettered, so don't redeliver-loop
    }
}
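To see why ack-last survives a crash, here is a toy simulation that needs no cloud account. Everything in it is a hypothetical stand-in: an in-memory `Queue<string>` plays the part of the subscription, a dictionary keyed by candidate plays the ATS (so a repeated write is harmless), and a thrown exception plays the reclaimed container.

```csharp
using System;
using System.Collections.Generic;

// In-memory stand-ins (hypothetical): un-acked messages stay at the head of the queue.
var pending = new Queue<string>(new[] { "cand-1", "cand-2", "cand-3" });
var decisions = new Dictionary<string, string>();   // keyed write = idempotent
var deliveries = new Dictionary<string, int>();
bool crashOnce = true;

while (pending.Count > 0)
{
    string candidateId = pending.Peek();             // "pull": the message is not removed yet
    deliveries.TryGetValue(candidateId, out int seen);
    deliveries[candidateId] = seen + 1;

    try
    {
        if (candidateId == "cand-2" && crashOnce)
        {
            crashOnce = false;
            throw new InvalidOperationException("instance reclaimed mid-run");
        }
        decisions[candidateId] = "screened";         // the durable write
        pending.Dequeue();                           // ack LAST
    }
    catch (InvalidOperationException)
    {
        // Simulated container death: no ack happened, so the next iteration
        // sees the same message again, which is the redelivery.
    }
}

int cand2Deliveries = deliveries["cand-2"];
Console.WriteLine($"decisions written: {decisions.Count}");    // decisions written: 3
Console.WriteLine($"cand-2 deliveries: {cand2Deliveries}");    // cand-2 deliveries: 2
```

The interrupted candidate is delivered twice and decided once, which is the behaviour you want from the real queue.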

The last piece is graceful shutdown. When Cloud Run reclaims an instance it sends SIGTERM first and gives you a short grace period. Catch it, stop pulling new messages, and let the item in flight finish (or leave it un-acked so it redelivers). A drain handler turns "killed mid-write" into "finished cleanly or safely redelivered."

// Illustrative excerpt. SIGTERM → stop intake, drain the in-flight item.
AppDomain.CurrentDomain.ProcessExit += (_, _) =>
{
    _intake.StopAcceptingNew();        // pull no more messages
    // Keep this timeout inside the platform's grace period, or the drain gets cut short.
    _inFlight.WaitForDrain(TimeSpan.FromSeconds(10));  // finish or leave un-acked to redeliver
};
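An alternative sketch of the same drain idea, for .NET 6 or later, uses `PosixSignalRegistration` and a cancellation token instead of `ProcessExit`. It is an illustration under stated assumptions: `RunWorkerAsync` is a hypothetical worker loop, `Task.Delay` stands in for process-write-ack, and the `CancelAfter` call only simulates the platform's SIGTERM so the example terminates on its own.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;
using System.Threading.Tasks;

using var stopIntake = new CancellationTokenSource();

// On SIGTERM: stop intake. Cancel = true asks the runtime not to exit
// immediately, so the loop below can finish the item that is in flight.
using var sigterm = PosixSignalRegistration.Create(PosixSignal.SIGTERM, ctx =>
{
    ctx.Cancel = true;
    stopIntake.Cancel();
});

int completed = 0;

// Hypothetical worker loop: the token is checked only BETWEEN items,
// so an item that has started always runs to completion.
async Task RunWorkerAsync(CancellationToken stop)
{
    while (!stop.IsCancellationRequested)
    {
        await Task.Delay(30);          // stand-in for process + write + ack
        completed++;
    }
}

stopIntake.CancelAfter(100);           // demo only: simulates the platform sending SIGTERM
await RunWorkerAsync(stopIntake.Token);
Console.WriteLine($"drained cleanly after {completed} item(s)");
```

The design choice is that cancellation gates intake, not the work itself. Any single item still has to finish within the platform's grace period; if it can't, leaving it un-acked and letting it redeliver is the fallback.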

Put those together (stateless containers, ack-on-completion, SIGTERM drain) and scale-to-zero stops being a reliability risk and becomes what it should be: a system whose compute tracks the work and that loses nothing under pressure. Tracking compute isn't the same as running for nothing, though: a production deployment still keeps a warm instance and a managed audit store, at the run-rate given above. The Dockerfile that packages this service lives back in the toolkit chapter; the queue and runtime are the same ones we've used throughout.

Make the container disposable and the queue the memory. Then an instance can die mid-job and the only thing that notices is the redelivery counter.

What reliability actually costs

Step back and look at what this chapter added. Bounded loops. Polly retries with jitter. Schema validation and a repair pass. Defensive deserialisation that fails loud. Re-auth handling for two different token models. Idempotency keys. A dead-letter queue and a review_required status. Token budgets and circuit breakers. Stateless design, ack-on-completion, SIGTERM drain. Every one of those is a few lines of illustrative code, and every one of them is a thing that has to be correct, tested, and maintained against two ATSs that keep moving, on a runtime that keeps reclaiming your containers.

The screening agent was a weekend. This wasn't. This is the part the demo never shows you, and it's not optional. It's the difference between a tool that works on the day you build it and one that's still trustworthy a year later, when the field got renamed and the token model changed and you weren't watching.

Which raises the obvious question: how would you know if any of this started failing? You wouldn't, not without watching it constantly.

Next: Monitoring & Observability, because a safeguard you can't see working is a safeguard you can't trust.
