Running an AI recruitment agent in production is mostly the work that doesn't show up in a demo: locking down secrets, catching the call that fails, watching the thing while it runs, keeping it alive as the world drifts underneath it. All of that effort earns one thing. The right to answer a single question with a number instead of a shrug. This page is the scorecard for that question: six dashboards, around five metrics each, that tell you whether an agent is actually working.
Is it working?
You'd think that's easy. It isn't. "Working" splits into six different questions, and a tool can be acing one while quietly failing another. It can be fast and wrong. Cheap and leaking. Accurate today and rotting toward next quarter's silent breakage. So this chapter puts the six questions on a wall and bolts a number to each one. No number, no answer.
Six dashboards. Around five metrics each. Every metric earns its place by answering three things: why it matters, what good looks like, and where the number actually comes from. That last column is the honest one. A KPI you can't source is a slogan.
A note before the tables. This is the executive view. Your live engineer's screen, the one with the queue depth and the p95 latency ticking over in real time, lives in the monitoring layer. This page is the version you'd put in front of a board: fewer needles, more meaning. Where they overlap, treat the monitoring stack as the instrument panel and this as the flight report.
Read every dashboard left-to-right: the signal tells you if you're winning; the source tells you whether you can trust the signal.
This is the dashboard that justifies the project's existence. Everything else is plumbing in service of these five rows. If this board is green and the rest are red, you have a problem worth fixing. If this board is red, you have a tool nobody needed.
The anchor is the baseline a recruiter starts from: roughly 12 hours a week lost to admin. That's the hole. This dashboard measures how much of it you've filled.
| KPI | Why it matters | Target / signal | Where the number comes from |
|---|---|---|---|
| Hours reclaimed per recruiter / week | The headline promise of the whole exercise. | Trending toward the ~12 hrs/week baseline. | Timesheet delta, or modelled as volume × per-task time saved. |
| Cycle-time reduction (time-to-shortlist / time-to-submit) | Speed is the placement edge: the first decent CV in front of the client tends to win. | Hours collapsing to minutes. | Elapsed time between the CV-in and shortlist-out timestamps. |
| Throughput (CVs per role / per week) | Capacity unlocked without new headcount. | The 45-CVs-a-role demo scale, handled routinely. | Run counts. |
| Adoption rate (% of eligible roles actually run through the tool) | Adoption isn't impact, but zero adoption is guaranteed zero impact. | >80% of eligible volume. | Runs ÷ eligible events in the ATS. |
| Net ROI (labour recovered − run cost) | The single number that ends the build-vs-buy argument. | Comfortable multiples of run cost. | This board's value minus the running-cost board's number. |
Adoption is the row people skip, and it's the one that quietly kills projects. A brilliant agent nobody points at a role saves exactly nothing. This is pilot purgatory by another name: a tool that technically works and practically doesn't, because the team routed around it. In the first ninety days, this is the number to watch above all the others.
The agent does the tireless 90%; the recruiter keeps the 1.5 selling days a week it hands back. That's the trade, in one row.
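The arithmetic behind three of these rows is simple enough to sketch. The following is a minimal illustration, not a reference implementation: the `OutcomeInputs` fields, the example figures, and the hourly labour cost are all hypothetical, and the hours figure uses the modelled "volume × per-task time saved" route rather than a timesheet delta.

```python
from dataclasses import dataclass


@dataclass
class OutcomeInputs:
    # All field names and values are illustrative assumptions.
    runs_per_week: int             # agent runs logged this week
    eligible_events_per_week: int  # eligible roles/events recorded in the ATS
    minutes_saved_per_run: float   # assumed per-task time saving
    recruiters: int
    hourly_labour_cost: float      # assumed fully loaded cost per recruiter hour
    weekly_run_cost: float         # taken from the running-cost board


def hours_reclaimed_per_recruiter(x: OutcomeInputs) -> float:
    """Modelled hours reclaimed: volume x per-task time saved, per recruiter."""
    return x.runs_per_week * x.minutes_saved_per_run / 60 / x.recruiters


def adoption_rate(x: OutcomeInputs) -> float:
    """Runs divided by eligible events in the ATS."""
    if x.eligible_events_per_week == 0:
        return 0.0
    return x.runs_per_week / x.eligible_events_per_week


def net_roi(x: OutcomeInputs) -> float:
    """Labour recovered minus run cost, per week."""
    labour = hours_reclaimed_per_recruiter(x) * x.recruiters * x.hourly_labour_cost
    return labour - x.weekly_run_cost


example = OutcomeInputs(
    runs_per_week=90,
    eligible_events_per_week=100,
    minutes_saved_per_run=40,
    recruiters=5,
    hourly_labour_cost=50.0,
    weekly_run_cost=150.0,
)

print(hours_reclaimed_per_recruiter(example))  # hours per recruiter per week
print(adoption_rate(example))                  # share of eligible volume
print(net_roi(example))                        # currency units per week
```

With these made-up inputs the model lands on 12 hours per recruiter, 90% adoption, and a weekly net of 2,850. The point of writing it down is that every input is something a timesheet, the ATS, or a billing export has to supply.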
Fast and wrong is worse than slow and right, because wrong is invisible until it costs you a placement. This dashboard is the leash made measurable. It tells you whether the human on the end of it is correcting the agent occasionally or constantly.
| KPI | Why it matters | Target / signal | Where the number comes from |
|---|---|---|---|
| Human-override rate | The best single quality signal: how often a human disagrees with the agent. | Low and stable. A rising trend means drift. | Human action vs agent recommendation, from the DecisionRecord audit trail. |
| False-negative rate (good candidates wrongly rejected) | The expensive, invisible error: the placement you never knew you missed. | Near-zero, sampled against human judgement. | Periodic blind human re-review of a rejected sample. |
| Shortlist precision (shortlisted candidates who progress) | Did the triage actually help, or just shuffle the pile? | A healthy shortlisted → interview ratio. | ATS pipeline outcomes joined back to runs. |
| Output-validation failure rate | Malformed or schema-invalid model output: the model returning garbage. | Low single digits and falling. | The schema gate that validates model output. |
| Bias / adverse-impact ratio across protected groups | A legal obligation, not a nicety. NYC LL144 and the EU AI Act both demand it. | Within the four-fifths rule, audited on a cadence. | Scheduled bias audit on aggregated, de-identified outcomes. |
Two of these deserve a word. The override rate is the most honest metric on the scorecard. It's the team voting, every day, on whether they trust the tool. And the bias ratio is the row that turns Amazon's 2018 cautionary tale, the recruiting tool that taught itself to penalise women-associated words, from a story into a control you actually run. Visible reasoning plus a human review plus this audit: that's how you stay on the right side of it.
The agent triages. A human makes the final call. This dashboard simply checks that the human is needed less and less, without ever becoming unnecessary.
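Two rows on this board reduce to short calculations. The sketch below assumes a hypothetical record shape (`agent_recommendation` and `human_action` keys, which may not match the real DecisionRecord) and takes the four-fifths rule in its usual form: each group's selection rate divided by the highest group's rate, flagged when the ratio falls below 0.8. A real audit would run on aggregated, de-identified outcomes and would need sample-size checks this example omits.

```python
from collections import defaultdict


def override_rate(decisions):
    """Share of decisions where the human action differed from the agent's recommendation."""
    if not decisions:
        return 0.0
    overridden = sum(
        1 for d in decisions if d["human_action"] != d["agent_recommendation"]
    )
    return overridden / len(decisions)


def adverse_impact_ratios(outcomes):
    """Each group's selection rate divided by the highest group's selection rate."""
    selected = defaultdict(int)
    total = defaultdict(int)
    for group, was_selected in outcomes:
        total[group] += 1
        selected[group] += int(was_selected)
    rates = {g: selected[g] / total[g] for g in total}
    best = max(rates.values())
    if best == 0:
        raise ValueError("no candidates were selected; ratios are undefined")
    return {g: rate / best for g, rate in rates.items()}


# Hypothetical data: group labels and counts are invented for illustration.
decisions = [
    {"agent_recommendation": "shortlist", "human_action": "shortlist"},
    {"agent_recommendation": "reject", "human_action": "shortlist"},
    {"agent_recommendation": "reject", "human_action": "reject"},
    {"agent_recommendation": "shortlist", "human_action": "shortlist"},
]
outcomes = (
    [("A", True)] * 50 + [("A", False)] * 50
    + [("B", True)] * 35 + [("B", False)] * 65
)

ratios = adverse_impact_ratios(outcomes)
flagged = sorted(g for g, r in ratios.items() if r < 0.8)
print(override_rate(decisions), ratios, flagged)
```

Here one in four decisions was overridden, and group B's ratio of roughly 0.7 falls below the 0.8 threshold, so it would be flagged for review.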
This is the closest cousin of the live monitoring view, distilled to what an owner cares about. Not "what's the queue doing this second," but "did it hold up this month, and how fast did we recover when it didn't."
| KPI | Why it matters | Target / signal | Where the number comes from |
|---|---|---|---|
| Run success / error rate | Baseline health: is it doing the job at all? | e.g. >99% success. | OpenTelemetry metrics. |
| Availability / uptime | Especially meaningful with scale-to-zero, where "off" is normal. | SLA-backed, e.g. 99.9%. | Health checks / Cloud Monitoring uptime (CloudWatch / Azure Monitor). |
| Latency p50 / p95 (including cold-start) | The scale-to-zero tax, made visible. | p95 under an agreed threshold. | Cloud Trace / OpenTelemetry (X-Ray / App Insights). |
| DLQ size & review_required count | Work that fell through must never silently pile up. | Drained within SLA; never quietly growing. | Queue-depth metric. |
| MTTR (mean time to recovery) | When it breaks (and it will), how fast are you back? | Inside the agreed SLA window. | Incident timestamps. |
The two rows people underweight are the DLQ size and MTTR. A growing dead-letter queue is the sound of CVs falling into a hole. Every one of them a candidate, possibly a placement, definitely a person who applied and heard nothing. And MTTR is where "monitoring" stops being a dashboard and becomes someone's job at 2am. Hold that thought; it leads straight to the question of who owns the running of all this.
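For the owner's monthly view, latency percentiles and MTTR can be recomputed from exported samples rather than read off a live panel. This is a rough sketch using the nearest-rank percentile method and invented sample data. In practice the tracing backend would normally compute percentiles for you, and may use a different interpolation.

```python
import math
from datetime import datetime, timedelta


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def mttr(incidents):
    """Mean time to recovery across (detected_at, recovered_at) pairs."""
    if not incidents:
        return timedelta(0)
    total = sum((end - start for start, end in incidents), timedelta(0))
    return total / len(incidents)


# Hypothetical latencies in milliseconds; the last one stands in for a cold start.
latencies_ms = [120, 130, 140, 150, 160, 170, 180, 190, 200, 4200]

# Hypothetical incident log: (detected, recovered).
incidents = [
    (datetime(2025, 1, 6, 10, 0), datetime(2025, 1, 6, 10, 30)),
    (datetime(2025, 1, 20, 14, 0), datetime(2025, 1, 20, 15, 30)),
]

print(percentile(latencies_ms, 50), percentile(latencies_ms, 95), mttr(incidents))
```

The example shows why the table asks for both p50 and p95: the median sits at 160 ms while a single cold start drags p95 to 4,200 ms.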
Every other dashboard measures performance. This one measures the absence of disaster, which is harder, because you're proving a negative. The trick, drawn from the security and compliance layer, is that you don't assume these numbers are good; you measure them, continuously, with canaries and audits that would catch you if they weren't.
| KPI | Why it matters | Target / signal | Where the number comes from |
|---|---|---|---|
| PII-leak incidents | The one number that must be zero: measured, not hoped. | 0. | Canary-token catch rate + DLP egress/log alerts. |
| Guardrail efficacy (injection & blocked-payload rate) | Proof the guardrails are doing real work, not decoration. | Attempts detected and blocked; ~no bypasses. | Gateway / output-guardrail counters. |
| Audit-log completeness | Every automated decision needs a defensible record. | 100%; a gap is an EU AI Act / SOC 2 finding. | DecisionRecords ÷ automated decisions made. |
| Secret-rotation & least-privilege compliance | An over-privileged token is the gap between "mis-tagged" and "deleted the pipeline." | 100% of secrets in-window; 0 over-privileged scopes. | Secret Manager age + IAM / ATS-scope review. |
| Time-to-patch (vuln / dependency) | The security-debt clock starts ticking the day a CVE lands. | Criticals patched within an agreed window. | Dependabot / Renovate + CVE feed, part of keeping dependencies current. |
| DSAR / right-to-erasure fulfilment time | A GDPR obligation with a statutory clock on it. | Within the statutory window. | Request → completion timestamps. |
Notice that PII-leak incidents has a target of zero and a measurement. That pairing is the whole point. "We're careful" is not a metric. "We pump a fake SSN through every day and it has never once reached the model or a log line" is. With GDPR fines reaching up to €20 million (about $22 million) or 4% of global annual turnover, whichever is higher, this is the dashboard where one red cell can cost more than the entire project ever saved.
A demo handles data. A product is trusted with it, and proves the trust on a schedule. This board is that proof.
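Two of the sources on this board, the canary check and audit-log completeness, can be sketched in a few lines. Everything named here is an assumption for illustration: the canary value, the log lines, and the id formats are invented, and a production check would read from the real log sink and DLP alerts rather than an in-memory list.

```python
# A planted fake identifier. Area number 000 is never issued, so it cannot
# collide with a real SSN. The value itself is a hypothetical choice.
CANARY_SSN = "000-12-3456"


def canary_leaks(canary_value, captured_lines):
    """Return every captured log or egress line that contains the planted canary."""
    return [line for line in captured_lines if canary_value in line]


def audit_completeness(decision_ids, record_ids):
    """Share of decisions with a matching audit record, plus the ids that lack one."""
    decisions = set(decision_ids)
    if not decisions:
        return 1.0, set()
    missing = decisions - set(record_ids)
    return 1 - len(missing) / len(decisions), missing


# Hypothetical captured output after pushing the canary through the pipeline.
clean_lines = ["run=41 status=ok", "run=42 payload=[REDACTED]"]
leaky_lines = clean_lines + [f"run=43 debug ssn={CANARY_SSN}"]

print(canary_leaks(CANARY_SSN, clean_lines))
print(canary_leaks(CANARY_SSN, leaky_lines))
print(audit_completeness(["d1", "d2", "d3", "d4"], ["d1", "d2", "d4"]))
```

The first scan returns nothing, which is the daily result you want to be able to show. The second surfaces the offending line, and the completeness check reports 75% with `d3` missing its record.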
The cheapest part of an AI agent is the AI. This dashboard makes that uncomfortable truth legible, and sets up the build-vs-buy maths by forcing one honest, all-in number.
| KPI | Why it matters | Target / signal | Where the number comes from |
|---|---|---|---|
| Cost per CV / per decision | The unit economic that scales with you. | Cents, trending down. | (Tokens + infra) ÷ runs. |
| Total monthly run-rate | The honest all-in: LLM + infra + monitoring + on-call. | The number you compare against a market-rate managed solution. | Billing + labour tally. |
| Cost-to-value ratio | Run cost against labour recovered: the sanity check on the outcomes board. | Much less than 1. | This board's cost ÷ the outcomes board's value. |
| Token efficiency (cost lost to retries / waste) | A silent budget leak that retries quietly inflate. | Low retry / wasted-token %. | Polly retry metrics + token accounting. |
| Idle cost (scale-to-zero efficiency) | Proof the architecture earns its keep when nobody's looking. | Near-$0 only at negligible traffic; a warm production instance has a floor. | Cloud Run billed-time at low traffic (App Runner / Container Apps). |
The row to dwell on is total monthly run-rate, because it's the one most teams get wrong by leaving out the only line that matters.
Start with the LLM, and be honest about the assumptions. On a GPT-5-class model running an agentic screening loop (reckon ~2–4 model calls per CV at roughly 3k input + 700 output tokens each) that's about $20–$75 per 1,000 CVs, a few cents per CV. Drop to a budget model (GPT-4o-mini) and it's ~10–15× cheaper, around $2–$4 per 1,000. At agency volume (~10,000 CVs/month) that's a few hundred dollars a month on a frontier model (~$250–$700) versus ~$25/month on a mini model, at OpenAI list prices, and batch mode roughly halves it. (For the maths: GPT-5 ≈ $1.25 in / $10 out per 1M tokens; GPT-4o-mini ≈ $0.15 in / $0.60 out per 1M tokens.)
Hosting is where the comforting "it scales to zero" line gets misread. A production deployment is not the near-zero hobby case: with a warm instance (min-instances ≥ 1 to kill cold starts), a managed audit store (Cloud SQL), queueing (Pub/Sub), and log ingestion, reckon ~$120/month at the low end, ~$300–$350 typical for a mid-size agency, and $500–$750+ under heavier load or HA. Scale-to-zero is a real architecture lever, but it only reaches near-$0 at negligible traffic, not in production.
So far, so cheap, and so misleading. The cost of a DIY build was never the infrastructure. It's the people. Park that; the full sum, done elsewhere, benchmarks against a market-rate managed cost of around $2,500/month, the midpoint of the prevailing $500–$5,000 SMB managed-automation range, not anybody's quote.
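The LLM line can be checked with a few lines of arithmetic. This sketch uses only the per-token prices and token counts quoted above, which are list prices at the time of writing and should be re-checked against current pricing. The function name and dictionary layout are illustrative.

```python
# USD per 1M tokens, as quoted in the text above (verify before relying on them).
PRICES = {
    "gpt-5": {"in": 1.25, "out": 10.00},
    "gpt-4o-mini": {"in": 0.15, "out": 0.60},
}


def llm_cost_per_1000_cvs(model, calls_per_cv, input_tokens=3000, output_tokens=700):
    """LLM spend in USD per 1,000 CVs for a given number of model calls per CV."""
    price = PRICES[model]
    per_call = (input_tokens * price["in"] + output_tokens * price["out"]) / 1_000_000
    return per_call * calls_per_cv * 1000


for model in PRICES:
    low = llm_cost_per_1000_cvs(model, calls_per_cv=2)
    high = llm_cost_per_1000_cvs(model, calls_per_cv=4)
    print(f"{model}: ${low:.2f}-${high:.2f} per 1,000 CVs")
```

Taken strictly at 2–4 calls of 3k input and 700 output tokens, this gives roughly $21.50–$43 per 1,000 CVs on the frontier model and about $1.74–$3.48 on the mini model, a ratio of around 12×. The upper end of the $20–$75 range quoted above would correspond to heavier loops, such as more calls or longer contexts per CV. That reading is an assumption, not something the figures state.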
Five dashboards tell you the tool works today. This one asks the question that every production deployment eventually has to answer: will it still work in eighteen months, when the model you built on is deprecated, both ATSs have shipped breaking changes, and the contractor who understood it has long since moved on?
| KPI | Why it matters | Target / signal | Where the number comes from |
|---|---|---|---|
| Maintenance hours / month | The hidden run-cost nobody budgets for. | A trend, not zero. It's never zero. | Engineering time tracking. |
| Breaking-changes handled (API drift / model deprecations) | The treadmill, made countable. Two ATSs plus model providers = three sources of drift. | Events per quarter, watched and absorbed. | Changelog watch + incident log. |
| Eval-suite pass rate over time | The regression guard that gates every change. | 100% before any deploy. | Scheduled eval harness. |
| Dependency freshness | Skip updates and you accrue security debt; take them and ~15% break callers. | Within N versions; 0 known-vuln deps. | Dependabot / Renovate. |
| Ownership / bus factor | The failure mode behind all the others. | Documented runbooks; more than one owner; on-call genuinely covered. | Qualitative review + on-call rota. |
The honest row here is maintenance hours, and its honesty is in the target: not zero. A tool that needs no maintenance is a tool nobody's checking. The dangerous one is bus factor, the dashboard you can't fully automate, because it measures whether a human function exists. One person who understands the system, on holiday, is a single point of failure wearing a lanyard.
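The "100% before any deploy" target on the eval-suite row amounts to a gate in the release pipeline. A minimal sketch of that gate follows, assuming the eval harness can emit a mapping of case name to pass/fail. The case names and the `eval_gate` function are hypothetical.

```python
def eval_gate(results, required_pass_rate=1.0):
    """Decide whether a deploy may proceed, given {case_name: passed} eval results.

    Returns (allowed, pass_rate, failing_case_names).
    An empty suite is treated as a failure: no evidence is not a pass.
    """
    if not results:
        return False, 0.0, []
    passed = sum(1 for ok in results.values() if ok)
    rate = passed / len(results)
    failures = sorted(name for name, ok in results.items() if not ok)
    return rate >= required_pass_rate, rate, failures


# Hypothetical eval cases for a CV-screening agent.
results = {
    "parses_two_column_cv": True,
    "rejects_prompt_injection": True,
    "scores_career_gap_fairly": False,
}

allowed, rate, failures = eval_gate(results)
print(allowed, round(rate, 2), failures)
```

In this example one failing case out of three blocks the deploy and names the regression. Logging the returned pass rate on every run is what produces the "over time" trend the dashboard asks for.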
Read back over the six dashboards. Notice what just happened.
Every metric came with a source, a place the number lives. And almost every one of those sources is a system someone has to build, instrument, watch, and keep watching: an eval harness on a schedule. A canary token pumped through egress daily. A bias audit run on a cadence. A changelog watch across two ATSs and a model provider. An on-call rota with more than one name on it. A DSAR clock that runs whether or not anyone's looking.
Here's the uncomfortable part, and it's the whole point of the chapter.
**Any agency can read all six of these dashboards. Very few can staff them: 24/7, across two ATSs, forever.**
Reading a scorecard is a board meeting. Staffing one is a department. The gap between those two sentences is an entire operational function: the secrets store and the monitoring stack and the patch clock and the eval suite and, above all, the human who owns the 2am page. You now know exactly what to measure. The next question isn't what, it's who. Who builds these dashboards, who reads them, and who gets paged when a cell goes red at the worst possible hour.
That's not a technology question. It's an ownership one. And it's where the build-vs-buy decision actually lives.
The YS position fits in a single line, and this chapter is the reason it's credible: we measure ourselves by your wins, not our invoices. Outcome-tied delivery means our scorecard is the one above (your reclaimed hours, your zero leaks, your drained queue) not a count of hours billed. That's not a slogan. It's just which dashboard you choose to be judged on.
Next: Build vs. Buy vs. Managed, and who should really own the numbers you just learned to read.