Building an AI recruitment tool has a finish line. You build the screening cog and it works. You add guardrails and they hold. You wire up monitoring and the dashboards light up green. Each of those is a task you can complete, tick off, and walk away from. This page is about the part of the tool's life that has no finish line at all: maintenance, the ongoing work of keeping the thing alive once the build is done.
Maintenance is the one job that doesn't end. It's not a phase after the build; it's the rest of the tool's life. The model underneath it gets retired. The two ATSs it talks to drift out from under it. The libraries it's made of age. The prompts that encode your judgement go stale the moment a client asks for something new. And every one of those clocks is ticking on someone else's schedule, not yours.
This is the chapter where "we built it in a weekend" meets "and now we run it forever." None of what follows is hard to understand. That's rather the point. It's not hard. It's relentless. There's a difference, and the difference is the whole back half of this book.
A weekend gives you a tool. The years that follow give you a second business you didn't mean to start: running software.
You don't own the model. You rent it, and the landlord changes the locks on a timetable you don't control.
Providers retire model versions on their own schedule. The version you pinned in March is marked deprecated in September and switched off by Christmas, and the migration is your problem, not theirs. Worse than the hard cutoffs are the soft ones: a "minor" version bump that subtly changes how the model reasons. The prompt that reliably flagged a career gap last month now waves it through. Nothing errored. Nothing alerted. The behaviour just moved, and you find out when a client asks why a candidate they'd have screened out made the shortlist.
So treat the model like the volatile dependency it is. Pin the exact version in config, never "latest." Put an abstraction between your code and the SDK, so swapping models is a config change rather than a rewrite. Then gate every migration behind the eval harness from the monitoring layer. Before you move to a new version, you re-run it against your golden set and compare the scores. Numbers hold, you migrate. Numbers slip, you've caught the regression in CI instead of from an angry client.

    // Illustrative excerpt, not a copy-paste product.
    // The model and prompt versions are config, never hard-coded.
    public sealed record AgentVersion(
        string ModelId,        // e.g. "gpt-4o-2024-08-06": pinned, never "latest"
        string PromptVersion,  // e.g. "screen-v7": maps to a versioned prompt file
        decimal MinEvalScore   // the bar a migration must clear on the golden set
    );

    // Migration is gated, not hoped for.
    var candidate = config.Get<AgentVersion>("agent:next");
    var result = await _evalHarness.RunAsync(candidate, _goldenSet);
    if (result.Score < candidate.MinEvalScore)
        throw new MigrationBlocked(
            $"{candidate.ModelId} scored {result.Score:P1}, below the {candidate.MinEvalScore:P1} bar.");
    // Only a passing eval is allowed to promote a new version to production config.

The pinning is the easy half. The hard half is that someone has to be watching the provider's deprecation notices, schedule the migration before the cutoff, run the evals, read the diff, and decide. That someone is a standing responsibility, not a one-off. Hold that thought. It's where this chapter lands.
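The abstraction between your code and the SDK, recommended earlier, can be as small as one interface. The following is a minimal sketch under stated assumptions: `IScreeningModel`, `ModelRegistry` and `FakeModel` are hypothetical names invented for illustration, and the stand-in adapter deliberately wraps no real vendor SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Usage: the pinned model id comes from config; no caller names a vendor.
var registry = new ModelRegistry()
    .Register("gpt-4o-2024-08-06", id => new FakeModel(id)); // swap in a real SDK adapter here

IScreeningModel model = registry.Resolve("gpt-4o-2024-08-06");
Console.WriteLine(await model.CompleteAsync("Summarise this CV."));

// The only model surface the rest of the codebase is allowed to see (hypothetical).
public interface IScreeningModel
{
    string ModelId { get; }
    Task<string> CompleteAsync(string prompt);
}

// Maps a pinned model id to the adapter that can serve it.
public sealed class ModelRegistry
{
    private readonly Dictionary<string, Func<string, IScreeningModel>> _factories = new();

    public ModelRegistry Register(string modelId, Func<string, IScreeningModel> factory)
    {
        _factories[modelId] = factory;
        return this;
    }

    public IScreeningModel Resolve(string modelId)
    {
        // Refuse floating aliases: the pin must be an exact, registered version.
        if (modelId.EndsWith("latest", StringComparison.OrdinalIgnoreCase))
            throw new ArgumentException($"'{modelId}' is a floating alias, not a pinned version.");
        if (!_factories.TryGetValue(modelId, out var factory))
            throw new KeyNotFoundException($"No adapter registered for '{modelId}'.");
        return factory(modelId);
    }
}

// Stand-in adapter so the sketch runs without any vendor SDK.
public sealed class FakeModel : IScreeningModel
{
    public FakeModel(string modelId) => ModelId = modelId;
    public string ModelId { get; }
    public Task<string> CompleteAsync(string prompt) =>
        Task.FromResult($"[{ModelId}] echo: {prompt}");
}
```

With this shape, a migration means registering one more adapter and changing the pinned id in config; the screening code that calls `CompleteAsync` does not change.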
Now run the same problem twice, in parallel, against systems you control even less.
Bullhorn and JobAdder evolve independently of you and of each other. New API versions ship. Endpoints get deprecated. Field semantics change: employmentType starts returning a code where it used to return a label. Rate limits tighten. Auth flows get revised. None of it arrives with your name on the memo, and studies of real-world libraries suggest that a median of around 15% of API changes break backward compatibility (Brito et al., SANER 2017). Two ATSs means two of these treadmills running at once, and you're jogging on both.
The nasty failure mode here is the silent one. An endpoint doesn't vanish. It just starts returning an empty string for a field your scoring depends on. The agent doesn't crash. It scores every candidate against a requirement that's now always blank, and cheerfully shortlists nonsense. Loud failures you'll catch. Quiet ones erode your results for weeks before anyone connects the dots.
Catch it with a contract test: a small, automated check that asserts the fields your agent depends on still exist, in the shape it expects, in each ATS sandbox. It runs in CI and on a schedule. When Bullhorn or JobAdder changes the contract, the test goes red before the change reaches production. Silent drift becomes a loud, early alert, which is the only kind you can act on.

    // Illustrative excerpt. A contract test per ATS, run in CI and on a schedule.
    [Fact]
    public async Task Bullhorn_JobOrder_still_exposes_the_fields_we_score_on()
    {
        var job = await _bullhornSandbox.GetAsync<JsonElement>(
            "entity/JobOrder/SAMPLE?fields=id,title,employmentType,clientCorporation");
        AssertPresent(job, "title");             // free-text, scored against
        AssertPresent(job, "employmentType");    // semantics changed before: watch it
        AssertPresent(job, "clientCorporation");
    }

    [Fact]
    public async Task JobAdder_Job_still_exposes_the_fields_we_score_on()
    {
        // Base URL comes from the token response ("api"), not a hard-coded host.
        var job = await _jobAdderSandbox.GetAsync<JsonElement>("/jobs/SAMPLE");
        AssertPresent(job, "title");      // free-text, scored against
        AssertPresent(job, "skillTags");  // requirements live in skillTags.tags[]
    }

    // Fails on a missing, null or blank field, so the silent empty-string case counts as breakage.
    static void AssertPresent(JsonElement e, string field) =>
        Assert.True(
            e.TryGetProperty(field, out var v)
                && v.ValueKind is not JsonValueKind.Null
                && !(v.ValueKind is JsonValueKind.String && string.IsNullOrWhiteSpace(v.GetString())),
            $"Contract broken: '{field}' missing, null or blank. An ATS schema change leaked through.");

Sandbox-vs-production matters here too: you want these tests hitting each ATS's sandbox so a breaking change shows up before it hits the live tenant. And there's a human job buried in this one as well. Someone has to subscribe to two changelogs, read two sets of release notes, and keep the sandbox credentials alive. The test catches the breakage. A person still has to fix it.
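A presence check would not catch the semantic drift described earlier, where employmentType starts returning a code where it used to return a label. One way to cover that case is to assert on the value's shape as well. This is a rough sketch rather than either vendor's schema: `ContractChecks` is a hypothetical helper, and the label set is an invented placeholder for whatever values your scoring logic actually handles.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical label set: replace with the values your scoring logic handles.
var knownLabels = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    { "Permanent", "Contract", "Temporary" };

var healthy = JsonDocument.Parse("{\"employmentType\":\"Contract\"}").RootElement;
var drifted = JsonDocument.Parse("{\"employmentType\":\"2\"}").RootElement;

Console.WriteLine(ContractChecks.Describe(healthy, "employmentType", knownLabels) ?? "ok");
Console.WriteLine(ContractChecks.Describe(drifted, "employmentType", knownLabels) ?? "ok");

public static class ContractChecks
{
    // Returns null when the field holds a known label, otherwise the reason it does not.
    public static string? Describe(JsonElement entity, string field, ISet<string> allowed)
    {
        if (!entity.TryGetProperty(field, out var value))
            return $"'{field}' is missing";
        if (value.ValueKind != JsonValueKind.String)
            return $"'{field}' is {value.ValueKind}, expected a string label";

        var text = value.GetString();
        if (string.IsNullOrWhiteSpace(text))
            return $"'{field}' is blank";

        return allowed.Contains(text)
            ? null
            : $"'{field}' returned unexpected value '{text}'";
    }
}
```

Inside a contract test, a non-null result would fail the assertion with the reason as its message, so a label-to-code change goes red the same way a missing field does.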
Below the model and the ATSs sits the stack itself: Semantic Kernel, the .NET SDK, the model SDKs, and the long tail of NuGet packages they drag in. Every one of them updates on its own cadence, and that same ~15% breaking-change rate applies all the way down.
This is a genuine bind, not a lazy one. Skip the updates and you accrue security debt: unpatched libraries are exactly how the supply-chain breaches covered in the security and compliance layer happen. Take the updates and you risk breakage, a minor version bump that quietly changes a method's behaviour. You're forced to choose between two kinds of risk, repeatedly, forever.
The way through is a bot with a guard dog behind it. Let Dependabot or Renovate raise the update pull requests, so nothing silently goes stale. Then make every one of those PRs run the full test suite (the ATS contract tests, the eval harness, the lot) before it's allowed to merge. The bot proposes; CI disposes. An update that breaks a contract or drops an eval score never reaches production. It dies in a pull request, where nobody gets hurt.

    # Illustrative excerpt: the CI gate every dependency PR must pass.
    on: { pull_request: { paths: ["**/*.csproj", "Directory.Packages.props"] } }
    jobs:
      guard:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: dotnet test --filter "Category=Unit"
          - run: dotnet test --filter "Category=Contract"  # the ATS schema checks (12.2)
          - run: dotnet run --project tools/EvalHarness     # the golden-set gate (12.1)
    # Red on any failure → the update can't merge. The net is the point.

The tooling is close to free. The thing that costs is the triage: reading why a PR went red, deciding whether to fix forward or hold back, and keeping the green builds green. Bots don't make judgement calls. They just make sure a human has to.
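For the bot half of that pairing, a minimal Dependabot configuration for a NuGet-based repository might look like the fragment below. The weekly interval and the cap of five open PRs are assumptions chosen for illustration, not recommendations from the source; the file would live at `.github/dependabot.yml`.

```yaml
# Illustrative sketch: let Dependabot raise NuGet update PRs on a schedule.
version: 2
updates:
  - package-ecosystem: "nuget"
    directory: "/"              # where the solution or project files live
    schedule:
      interval: "weekly"        # assumption: pick a cadence you can actually triage
    open-pull-requests-limit: 5 # assumption: keeps the review queue bounded
```

Each PR it opens touches a project or package file, so it would trigger the CI gate described above before it can merge.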
Here's the part that surprises agency owners: most of the "intelligence" in these tools isn't in the code at all. It's in the prompts and the config, and that's where most of the changes land too.
Your branded CV template gets a redesign, so the formatting cog's instructions have to change. A client under NDA wants a new redaction rule. A new compliance requirement lands and the screening prompt needs an extra check. None of these are code changes in the usual sense, which is exactly why they're so often made badly. Someone edits a prompt live in production at 4pm on a Friday because "it's just text."
It is not just text. A prompt is logic, and logic that decides who gets shortlisted is logic that needs the same discipline as any other: versioned in source control, changed through review, and rolled out in stages rather than swapped under live traffic. The eval harness applies here too. A prompt change is a change you evaluate against the golden set before it ships, the same as a model migration. The whole point of treating prompts as versioned source is that when a shortlist looks wrong six weeks from now, you can answer "which prompt version produced this, and what changed?", and roll back if you have to.
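To be able to answer "which prompt version produced this, and what changed?", each decision needs to carry a record of what produced it. The following is a minimal sketch, assuming hypothetical `PromptStamp` and `ShortlistDecision` records; the prompt text and candidate id are invented for illustration.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Usage: stamp every decision with the prompt and model that produced it.
var promptText = "Flag any unexplained career gap longer than six months.";
var stamp = PromptStamp.Create("screen-v7", "gpt-4o-2024-08-06", promptText);
var decision = new ShortlistDecision("candidate-123", true, stamp);
Console.WriteLine(
    $"{decision.CandidateId}: {decision.Stamp.PromptVersion} / {decision.Stamp.PromptSha256[..12]}");

// Enough provenance to trace a shortlist back to its exact prompt and model.
public sealed record PromptStamp(string PromptVersion, string ModelId, string PromptSha256)
{
    public static PromptStamp Create(string promptVersion, string modelId, string promptText)
    {
        // Hash the exact text that was sent, so an unreviewed live edit
        // shows up as a hash that matches no version in source control.
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(promptText));
        return new PromptStamp(promptVersion, modelId, Convert.ToHexString(hash));
    }
}

public sealed record ShortlistDecision(string CandidateId, bool Shortlisted, PromptStamp Stamp);
```

Storing the hash alongside the version label is a design choice: the label says what was intended to run, and the hash says what actually ran.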
If a line of plain English decides who gets the interview, it's production code. Treat it like production code, or it'll bite you like production code.
Everything above shares one hidden assumption: that someone is doing it. Someone watches the deprecation notices. Someone reads two ATS changelogs. Someone triages the dependency PRs. Someone evaluates the prompt change before it ships. Pull that someone out, and every safeguard in this book degrades into a dashboard nobody reads.
That's the real failure mode, and it's depressingly ordinary. The contractor who built it moved on. The one developer who actually understood the prompt logic is on holiday, or handed in their notice in March. The tool keeps running, because software does, right up until the morning a model is deprecated or an ATS field quietly changes shape and there's nobody whose job it is to notice. It doesn't fail with a bang. It rots. Slowly, in the dark, because everyone assumed someone else had it.
This is the most plausible explanation for one of the most-quoted numbers in this book: in the MIT research, partner-built solutions succeeded around 67% of the time versus roughly 33% for internal DIY builds (MIT NANDA). The gap isn't talent. Agency developers are perfectly capable of building these cogs. Part II proved that. The gap is continuity. A partner has a team, a rota, and a contractual reason to still be there in eighteen months. A DIY build has whoever happened to write it, until they don't.
And here's the part that's easy to wave away until it happens: the moment this tool touches live placements, it is a 24/7 software product. Candidates apply at midnight. A model provider's deprecation doesn't wait for office hours. An ATS schema change can land on a Sunday. Sustainable on-call coverage (someone genuinely reachable when it breaks at 2am) needs more than one person; a single owner can't take a holiday, get ill, or sleep without leaving the tool unwatched. A solo DIY build has no sustainable on-call at all. That's not a tooling gap you can buy your way out of with a PagerDuty seat. It's a staffing reality.
Your business is recruitment. Placements, relationships, billing. It is not running a 24/7 software product across two ATSs and a model provider, forever. The two jobs look adjacent. They are not the same job.
Software doesn't rot because it's badly built. It rots because everyone assumed someone else was watching it.
So tally it. Not the build. The running. The weekend gave you the tool for nothing but a weekend. Keeping it alive has an annual run-rate, and most of it is invisible until you list it out.
Start with the parts people assume are cheap, because the idle case really is, and the production case quietly isn't. Cloud Run scales to zero, so a service with no traffic costs almost nothing; that's a genuine architecture win, not a cost claim you can bank. A production deployment isn't the idle case. Pin a warm instance to kill cold starts (min-instances ≥ 1), add a managed audit store (Cloud SQL), a queue (Pub/Sub) and log ingestion, and hosting runs roughly $120 a month at the low end, $300–$350 for a mid-size agency, and $500–$750+ under heavier load or HA (GCP pricing).
The LLM API is thicker than the back-of-envelope too, once it's a real agentic loop. On a GPT-5-class model (the realistic default for screening quality) an agent makes two-to-four calls per CV (reckon ~3k input and ~700 output tokens each), which works out to ~$20–$75 per 1,000 CVs, a few cents apiece; a budget model like GPT-4o-mini is ten-to-fifteen times cheaper, ~$2–$4 per 1,000 (OpenAI list prices: GPT-5 ≈ $1.25 in / $10 out per 1M tokens, GPT-4o-mini $0.15 / $0.60; batch mode roughly halves it). At agency volume, say 10,000 CVs a month, that's a few hundred dollars on a frontier model (~$250–$700) versus ~$25 on a mini model.
Observability on a free or low tier is $0–$50 a month (Grafana Cloud). Add it all up and the infrastructure is real money but still modest: a few hundred dollars, not a few thousand. And that's the trap. People price the tool by its infrastructure, see a number smaller than one engineer's day rate, and conclude it's nearly free.
It isn't, because none of the work in this chapter is infrastructure. It's people.
Cost that honestly and one line swallows the rest: the engineer who owns it. A fully-loaded mid-level engineer runs roughly $10,000–$15,000 a month in the US, where "fully loaded" means benefits at about 30% of total comp on top of salary (BLS ECEC; salary data from Glassdoor/Levels.fyi). You don't need one full-time just for this. You need their attention, reliably, and attention from a scarce engineer is the most expensive thing in the building. The on-call tooling is pocket change. Sustainable human coverage runs $500–$1,200 per engineer per month and needs several engineers to be real, which a solo build simply cannot provide.
There's a peer-reviewed name for all of this. Google's engineers called it Hidden Technical Debt in Machine Learning Systems: the finding that the model is a tiny box in a vast diagram of ongoing maintenance, and that ML systems carry "massive ongoing maintenance costs" (Sculley et al., NeurIPS 2015). The weekend build is that tiny box. This chapter is the rest of the diagram.
Set that run-rate against the alternative the reader can actually price: a market-rate managed solution sits around $2,500 a month, near the midpoint of the prevailing $500–$5,000 SMB managed-automation range (Digital Agency Network; Latenode; Arsum; SalemWise). That's not a quote and it's not anyone's price; it's a market assumption you can adopt. Hold it against one engineer's loaded cost and the maths starts to make its own argument: $10,000–$15,000 a month is four-to-six times a managed retainer, so even a quarter of that engineer's attention costs as much as the retainer or more, before you've counted a single hour of on-call you can't sustainably staff anyway.
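The arithmetic behind the LLM and staffing figures in this section is easy to check for yourself. The sketch below uses only numbers quoted in the text (list prices per 1M tokens, ~3k input and ~700 output tokens per call, two-to-four calls per CV, the $2,500 retainer); `ModelPrice` and `Cost` are hypothetical names, and your own token counts will differ.

```csharp
using System;

// Prices in USD per 1M tokens, as quoted in the text.
var gpt5 = new ModelPrice(1.25m, 10m);
var mini = new ModelPrice(0.15m, 0.60m);

foreach (var calls in new[] { 2, 4 })
{
    Console.WriteLine(
        $"{calls} calls/CV: GPT-5 ${Cost.Per1000Cvs(gpt5, calls):F2}, " +
        $"mini ${Cost.Per1000Cvs(mini, calls):F2} per 1,000 CVs");
}

// Loaded engineer cost against the assumed $2,500 managed retainer.
Console.WriteLine($"Engineer vs retainer: {10_000m / 2_500m}x to {15_000m / 2_500m}x");

public sealed record ModelPrice(decimal InputPerMillion, decimal OutputPerMillion);

public static class Cost
{
    // Defaults are the per-call token sizes given in the text.
    public static decimal Per1000Cvs(ModelPrice price, int callsPerCv,
        int inputTokens = 3_000, int outputTokens = 700)
    {
        var perCall = inputTokens * price.InputPerMillion / 1_000_000m
                    + outputTokens * price.OutputPerMillion / 1_000_000m;
        return perCall * callsPerCv * 1_000m;
    }
}
```

At exactly those nominal token counts, the GPT-5 figure comes to $21.50 (two calls) and $43.00 (four calls) per 1,000 CVs, and the mini model to $1.74 and $3.48. The top of the quoted $20–$75 range therefore implies heavier calls than the nominal sizes, such as longer CVs, longer prompts or retries; that reading is an assumption, not something the source states.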
The weekend was free. The years are not. Infrastructure is the cheap line on the invoice; the expensive line is the person whose job it is to care.
We haven't drawn the full build-vs-buy conclusion yet. That's coming. But you can already feel the shape of it. You can read every dashboard. The question the scorecard forces is whether you can staff them.
Next: The Scorecard. Naming what "working" looks like in numbers, across all six dashboards an agency can read but rarely staff.