Retries, schema validation, idempotency keys, and resilient API clients all reduce how often an AI screening agent fails. None of it eliminates failure. So the question that defines a production system, not a demo, is: what happens to the CV the agent couldn't process? It cannot vanish. The one failure mode you are never allowed to have is the silent drop: the candidate who applied, was never screened, and nobody noticed.
The default failure path must be "a human sees it." Concretely: catch the unrecoverable failure, push the item to a dead-letter queue (Pub/Sub, SQS+SNS, or Azure Service Bus), and set the candidate's status in the ATS to something a recruiter will actually see, like review_required. The work doesn't disappear; it changes lanes.
// Illustrative excerpt. Failure → dead-letter → visible "needs human" status.
catch (Exception ex) when (ex is not OperationCanceledException)
{
    await _deadLetter.PublishAsync(new FailedItem(candidateId, jobId, ex.Message), ct);
    await _ats.SetCandidateStatusAsync(candidateId, "review_required"); // the leash
    _log.DecisionFailed(candidateId, jobId, ex.GetType().Name); // typed, no PII
    return AgentOutcome.NeedsHuman(ex.Message);
}
Then watch the dead-letter queue. A DLQ that's growing is the single clearest signal that something systemic has broken (an expired credential, a drifted field, a model outage), and it's the alert that earns its keep. An empty DLQ is a healthy system. A silently full one is a future apology to a client.
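What "a DLQ that's growing" means as an alert rule can be sketched roughly as follows. This is a minimal sketch, not the chapter's implementation: DlqGrowthMonitor, its window and its ceiling are hypothetical names and values, and how the depth is sampled (a queue metric, a count query) is left to the caller.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: decides when dead-letter depth warrants an alert.
public sealed class DlqGrowthMonitor
{
    private readonly Queue<long> _samples = new();
    private readonly int _window;     // how many recent samples to compare
    private readonly long _maxDepth;  // absolute ceiling: alert regardless of trend

    public DlqGrowthMonitor(int window, long maxDepth)
    {
        if (window < 2) throw new ArgumentOutOfRangeException(nameof(window));
        _window = window;
        _maxDepth = maxDepth;
    }

    // Record one depth sample; returns true when an alert should fire.
    public bool Record(long depth)
    {
        _samples.Enqueue(depth);
        if (_samples.Count > _window) _samples.Dequeue();

        if (depth > _maxDepth) return true;          // the "silently full" case
        if (_samples.Count < _window) return false;  // not enough history yet

        // "Growing" here means every sample in the window exceeds the one before it.
        long previous = long.MinValue;
        foreach (var sample in _samples)
        {
            if (sample <= previous) return false;
            previous = sample;
        }
        return true;
    }
}
```

Feeding it one sample per polling interval keeps a single flaky CV from paging anyone, while a steady climb (the expired credential, the drifted field) trips it within one window.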
"Silently dropped" is not a failure mode. It's a breach of trust with a candidate who'll never know it happened.
A reliability chapter that ignores cost is lying by omission, because the two share a root cause: the loop that won't stop. A retry storm against a rate-limited endpoint, or an agent that keeps re-asking the model because validation keeps failing, can run up a serious token bill overnight and hammer the ATS into rate-limit jail at the same time. Unbounded retries are an availability bug and a budget bug.
The fix is cheap and comes in three parts. A per-run token budget that aborts the run before it gets expensive. A circuit breaker (Polly) that stops calling a dependency that's clearly down instead of pummelling it. A concurrency cap so a backlog of CVs doesn't all hit the model and the ATS at once.
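Of the three guards, the concurrency cap is the easiest to show in isolation. The sketch below assumes nothing beyond the .NET base library; BoundedRunner, RunAsync and maxConcurrency are hypothetical names, and the right cap depends on the rate limits of your model provider and ATS.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedRunner
{
    // Runs `work` over every item, never more than `maxConcurrency` at a time.
    public static async Task RunAsync<T>(
        IEnumerable<T> items,
        int maxConcurrency,
        Func<T, CancellationToken, Task> work,
        CancellationToken ct = default)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);

        var tasks = items.Select(async item =>
        {
            await gate.WaitAsync(ct);        // wait for a free slot
            try { await work(item, ct); }
            finally { gate.Release(); }      // free the slot even if the work fails
        }).ToList();                         // materialise so every task is started

        await Task.WhenAll(tasks);
    }
}
```

A backlog of 500 CVs run through RunAsync with a cap of 5 still processes all 500, but the model and the ATS only ever see five requests in flight. On .NET 6 and later, Parallel.ForEachAsync with ParallelOptions.MaxDegreeOfParallelism gives a similar bound with less code.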
// Illustrative excerpt. Token budget + circuit breaker around the agent run.
if (_run.TokensUsed + estimate > _run.TokenBudget) // hard ceiling per run
    throw new BudgetExceededException(_run.TokenBudget);
var breaker = new ResiliencePipelineBuilder()
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        FailureRatio = 0.5, // trip if half of recent calls fail
        MinimumThroughput = 10, // ...but only once at least 10 calls were sampled
        BreakDuration = TimeSpan.FromSeconds(30) // stop hammering a downed dependency
    })
    .Build();
The cost of these guards is a handful of lines. The cost of not having them is a bill you discover after it's spent, and the LLM API bill is larger than people assume once it's a real agentic loop. On a GPT-5-class model (the realistic default for screening quality) an agent makes ~2–4 calls per CV at roughly 3k input + 700 output tokens each, which lands at about $20–$75 per 1,000 CVs, a few cents per CV; a budget model like GPT-4o-mini is ~10–15× cheaper (~$2–$4 per 1,000). That's cheap only while the loops stay bounded: an unbounded repair loop multiplies those calls per CV without limit. (We tally the full run-rate honestly in the build-vs-buy chapter; here the point is narrower: bound the loops, or the loops bound your budget.)
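The token arithmetic behind that estimate can be written out so you can plug in your own provider's prices. The symbols are assumptions introduced here for illustration: $n$ is calls per CV, $t_{\text{in}}$ and $t_{\text{out}}$ are tokens per call, and $p_{\text{in}}$ and $p_{\text{out}}$ are prices in dollars per million tokens. No specific price is assumed.

```latex
% Cost per 1,000 CVs, with prices in dollars per million tokens
C_{1000} = \frac{1000 \cdot n \,\bigl(t_{\text{in}}\, p_{\text{in}} + t_{\text{out}}\, p_{\text{out}}\bigr)}{10^{6}}

% With t_in = 3000, t_out = 700 and n between 2 and 4:
C_{1000}^{\,n=2} = 6\, p_{\text{in}} + 1.4\, p_{\text{out}}
\qquad
C_{1000}^{\,n=4} = 12\, p_{\text{in}} + 2.8\, p_{\text{out}}
```

The cost is linear in $n$, which is the whole argument for bounding loops: a repair loop that quietly runs 40 calls per CV instead of 4 costs ten times as much, with nothing else changed.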
The deployment target for these agents is Cloud Run (App Runner/ECS Fargate / Azure Container Apps), scale-to-zero: when nobody's screening CVs, the runtime can run no containers and bill no idle compute. As an architecture that's the right default: compute scales with the work. But don't mistake the idle-compute saving for the production run-rate. A real deployment isn't the near-zero hobby case: keep a warm instance (min-instances ≥ 1) to kill cold starts, add a managed audit store (Cloud SQL), a queue (Pub/Sub) and log ingestion, and you're realistically looking at ~$120/month at the low end, ~$300–$350 for a mid-size agency, and $500–$750+ under heavier load or HA (figures we break down in the build-vs-buy chapter). Near-$0 only happens at negligible traffic. Scale-to-zero also hands you a specific set of reliability problems in exchange, and pretending otherwise is how DIY builds get burned.
Cold starts. The first CV after a quiet afternoon waits for a container to spin up and the model client to warm. For batch screening this is a non-issue: a second or two on a 52-second job. For anything interactive, you pay it on every cold request, and the honest fix is min-instances (keep one warm) only where the latency genuinely hurts, which costs you the idle you were trying to avoid. Name the trade-off; don't paper over it.
Losing in-flight work. This is the one that bites. An instance on a scale-to-zero runtime can be reclaimed mid-run. If a half-finished agent run lives only in that container's memory, it dies with the container, and a candidate silently goes unscreened. The container is cattle, not a pet, so durable state cannot live inside it.
The architecture that survives this is boring on purpose: stateless containers, durable work in the queue. Work is driven from Pub/Sub. The consumer acks the message only after the decision is written back, so if the instance dies mid-run, the message is never acked, Pub/Sub redelivers it, and a fresh container picks it up. Redelivery does mean the same CV can be processed twice, which is exactly what the idempotency keys from earlier in the chapter are for. Nothing is lost because nothing of value ever lived in memory.
// Illustrative excerpt. Ack ONLY after the decision is durably written.
await foreach (var msg in _pubsub.PullAsync(ct)) // work comes from the queue, not memory
{
    try
    {
        var outcome = await _agent.ProcessAsync(msg.CandidateId, msg.JobId, ct);
        await _ats.WriteDecisionAsync(outcome); // the durable side-effect
        await msg.AckAsync(); // ack LAST: interrupted work redelivers
    }
    // Cancellation (shutdown) is not a failure: let it propagate so the message
    // stays un-acked and redelivers instead of being dead-lettered.
    catch (Exception ex) when (ex is not OperationCanceledException)
    {
        await _deadLetter.PublishAsync(msg.ToFailedItem(ex), ct);
        await msg.AckAsync(); // dead-lettered, so don't redeliver-loop
    }
}
The last piece is graceful shutdown. When Cloud Run reclaims an instance it sends SIGTERM first and gives you a short grace period. Catch it, stop pulling new messages, and let the item in flight finish (or leave it un-acked so it redelivers). A drain handler turns "killed mid-write" into "finished cleanly or safely redelivered."
// Illustrative excerpt. SIGTERM → stop intake, drain the in-flight item.
AppDomain.CurrentDomain.ProcessExit += (_, _) =>
{
    _intake.StopAcceptingNew(); // pull no more messages
    _inFlight.WaitForDrain(TimeSpan.FromSeconds(10)); // finish or leave un-acked to redeliver
};
Put those together (stateless containers, ack-on-completion, SIGTERM drain) and scale-to-zero stops being a reliability risk and becomes what it should be: a system whose compute tracks the work and that loses nothing under pressure. (Tracking compute isn't the same as running for nothing: a production deployment keeps a warm instance and a managed audit store, with the run-rate above.) (The Dockerfile that packages this service lives back in the toolkit chapter; the queue and runtime are the same ones we've used throughout.)
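For readers who want to see the consumer loop's side of the drain, here is one possible shape. It is a sketch under assumptions: DrainingConsumer, RequestStop and the pull/handle delegates are hypothetical, and the signal hook uses PosixSignalRegistration, which requires .NET 6 or later.

```csharp
#nullable enable
using System;
using System.Runtime.InteropServices;
using System.Threading;
using System.Threading.Tasks;

public sealed class DrainingConsumer
{
    private readonly CancellationTokenSource _stopIntake = new();

    // Stops intake only; the item currently being handled is not cancelled.
    public void RequestStop() => _stopIntake.Cancel();

    // `pull` returns the next item, or null when the queue is momentarily empty
    // (a hypothetical contract for this sketch).
    public async Task RunAsync(Func<Task<string?>> pull, Func<string, Task> handle)
    {
        while (!_stopIntake.IsCancellationRequested)
        {
            var item = await pull();
            if (item is null) { await Task.Delay(50); continue; }
            await handle(item); // runs to completion; the stop flag is checked afterwards
        }
    }

    // Keep the returned registration alive: disposing it removes the handler.
    public PosixSignalRegistration HookSigterm() =>
        PosixSignalRegistration.Create(PosixSignal.SIGTERM, context =>
        {
            context.Cancel = true; // suppress default termination so the loop can drain
            RequestStop();
        });
}
```

Once RunAsync returns, the process is expected to exit on its own. If the in-flight item cannot finish inside the platform's grace period, the instance is killed with the message still un-acked, which is the safe outcome: it redelivers.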
Make the container disposable and the queue the memory. Then an instance can die mid-job and the only thing that notices is the redelivery counter.
Step back and look at what this chapter added. Bounded loops. Polly retries with jitter. Schema validation and a repair pass. Defensive deserialisation that fails loud. Re-auth handling for two different token models. Idempotency keys. A dead-letter queue and a review_required status. Token budgets and circuit breakers. Stateless design, ack-on-completion, SIGTERM drain. Every one of those is a few lines of illustrative code, and every one of them is a thing that has to be correct, tested, and maintained against two ATSs that keep moving, on a runtime that keeps reclaiming your containers.
The screening agent was a weekend. This wasn't. This is the part the demo never shows you, and it's not optional. It's the difference between a tool that works on the day you build it and one that's still trustworthy a year later, when the field got renamed and the token model changed and you weren't watching.
Which raises the obvious question: how would you know if any of this started failing? You wouldn't, not without watching it constantly.
Next: Monitoring & Observability, because a safeguard you can't see working is a safeguard you can't trust.