You don't need the biggest model to read a CV against a job. For screening, formatting, and ranking, a fast, inexpensive model such as gpt-4o-mini is a sensible default: it does the job well and keeps running costs low. We'll model the economics properly in the build-vs-buy chapter; the headline is that the model API is the smaller part of the bill. On a budget model like gpt-4o-mini it's roughly $2–$4 per thousand CVs, a fraction of a cent each. Step up to a GPT-5-class model for an agentic screening loop (two to four model calls per CV, around 3k tokens in and 700 out apiece) and it's closer to $20–$75 per thousand, still only a few cents a CV. At agency volume, say ten thousand CVs a month, that's about $25 a month on the mini model or a few hundred (~$250–$700) on a frontier one. Real money at scale, but the expensive part, as ever, is people.
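If you want to check figures like these against your own volumes, the arithmetic is small enough to keep in code. The sketch below is illustrative only: the class and method names are ours, the token counts are the ones quoted above, and the per-million-token prices you pass in are assumptions that you should replace with the provider's current price list.

```csharp
using System;

// Hypothetical helper for back-of-envelope costing. Prices are supplied by
// the caller because they change; nothing here is an official rate.
public static class ScreeningCost
{
    // Token counts per model call, taken from the figures in the text.
    public const int InputTokensPerCall = 3_000;
    public const int OutputTokensPerCall = 700;

    // Cost in USD of screening 1,000 CVs at the given per-million-token
    // prices and number of model calls per CV.
    public static decimal PerThousandCvs(
        decimal inputPricePerMillion,
        decimal outputPricePerMillion,
        int callsPerCv)
    {
        decimal perCall =
            InputTokensPerCall * inputPricePerMillion / 1_000_000m +
            OutputTokensPerCall * outputPricePerMillion / 1_000_000m;

        return perCall * callsPerCv * 1_000m;
    }
}
```

As a worked example, assume a mini-class model priced at $0.15 per million input tokens and $0.60 per million output tokens, and the same two to four calls per CV. `ScreeningCost.PerThousandCvs(0.15m, 0.60m, 2)` gives 1.74 and `ScreeningCost.PerThousandCvs(0.15m, 0.60m, 4)` gives 3.48, which sits in the "roughly $2–$4 per thousand" range quoted above.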
The important design choice isn't which model. It's that the model name is configuration, not code. Models get deprecated on the provider's timetable, not yours, and when that day comes you want to change one line in a secret store, not go hunting through source. Semantic Kernel makes the swap a one-liner, which is the whole point of routing through it rather than wiring a vendor SDK directly into your logic.
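Treating the model name as configuration works best when a missing setting stops the service at startup instead of surfacing later as a confusing failure. Here is a minimal sketch of that idea. `RequiredSetting` is a hypothetical helper of ours, not part of Semantic Kernel or DotNetEnv, and it reads the process environment directly, which is an assumption about where your settings end up.

```csharp
using System;

// Hypothetical helper: fail fast when a required setting is absent.
public static class RequiredSetting
{
    // Returns the named environment variable, or throws with a message
    // that names the missing key (never its value).
    public static string Get(string name)
    {
        string? value = Environment.GetEnvironmentVariable(name);
        if (string.IsNullOrWhiteSpace(value))
        {
            throw new InvalidOperationException(
                $"Required setting '{name}' is missing or empty.");
        }
        return value;
    }
}
```

A call such as `RequiredSetting.Get("OPENAI_MODEL")` then either returns the configured model name or stops the process with an error that says exactly which key to fix. Swapping models stays a one-line change in the secret store.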
Here's the heart of Program.cs: load configuration, build the kernel, register the model, and (critically) register our guarded gateway as the only sanctioned way to reach it.
// Program.cs (illustrative excerpt)
using DotNetEnv;
using Microsoft.Extensions.Configuration;       // ConfigurationBuilder
using Microsoft.Extensions.DependencyInjection; // AddSingleton
using Microsoft.SemanticKernel;

Env.Load(); // pull .env into the environment

var config = new ConfigurationBuilder()
    .AddEnvironmentVariables() // DotNetEnv → IConfiguration
    .Build();

var builder = Kernel.CreateBuilder();
builder.AddOpenAIChatCompletion(
    modelId: config["OPENAI_MODEL"]!, // swappable, never hard-coded
    apiKey: config["OPENAI_API_KEY"]!);

// The model is reachable ONLY through the guarded gateway.
builder.Services.AddSingleton<ILlmGateway, GuardedLlmGateway>();

Kernel kernel = builder.Build();

Three things to notice. The model id and key both come from configuration, so nothing sensitive is in the source. The kernel is built once and reused. And no agent in this book is ever handed the kernel to call the model directly; they're handed an ILlmGateway. Which brings us to the most important type in the codebase.
Here is a rule we will hold from this page to the last: no agent calls the language model directly. Every request, every CV, every job description, every prompt, passes through a single guarded component called ILlmGateway.
Why insist on one door? Because the moment you have several places that talk to the model, you have several places to leak a candidate's data, several places to forget a safety check, several things to fix when the rules change. One door means one place to enforce the rules, and one place to prove you enforced them.
// Gateway/ILlmGateway.cs (illustrative excerpt)
public interface ILlmGateway
{
    // The ONLY sanctioned path from our code to the model.
    // Inputs are structured + allowlisted, not free-form payloads.
    Task<LlmResult> InvokeAsync(LlmRequest request, CancellationToken ct);
}

// Sketch of what the guard does, in order, before any model call:
// 1. Allowlist — accept only the structured fields we expect
// 2. DLP inspect — scan for PII / secrets that must not leave
// 3. Fail-closed — if anything looks wrong, refuse the call
// 4. Call the model, via Semantic Kernel
// 5. Log through a typed safe sink that refuses raw payloads

In plain terms: the gateway only accepts the specific, structured information a task needs, never a free-form blob of whatever happened to be in memory. It inspects what's about to be sent for things that mustn't leave, such as a candidate's personal details. If anything looks wrong, it fails closed: it refuses rather than risking the leak. And it logs through a sink that won't write raw candidate data into your logs, because logs leak too.
This is the difference between hoping nothing sensitive escapes and enforcing it. The full build of this gateway (the allowlist, the DLP rules, the safe logging) is the security and compliance chapter's job. For now, just hold the shape: every snippet that follows reaches the model through this one guarded door, never the raw SDK.
One door to the model. Locked by default. That's not paranoia. It's the only version that's safe to run for years.
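To make the five steps a little more concrete, here is a deliberately small sketch of a fail-closed gateway. Treat every detail as an assumption: the shapes of `LlmRequest`, `LlmResult` and `SafeLogEntry`, the field names on the allowlist, and the single email regex standing in for DLP are all ours, invented for illustration. The model call is injected as a delegate so the sketch runs without any SDK; the real build later in the book goes through Semantic Kernel and will differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in types, invented for this sketch.
public sealed record LlmRequest(string TaskName, IReadOnlyDictionary<string, string> Fields);
public sealed record LlmResult(bool Allowed, string Text);
// The log entry has no slot for field values, so raw payloads cannot be logged.
public sealed record SafeLogEntry(string TaskName, int FieldCount, string Outcome);

public interface ILlmGateway
{
    Task<LlmResult> InvokeAsync(LlmRequest request, CancellationToken ct);
}

public sealed class SketchGuardedGateway : ILlmGateway
{
    // 1. Allowlist: the only field names this sketch will forward.
    private static readonly HashSet<string> AllowedFields =
        new(StringComparer.Ordinal) { "job_title", "skills", "experience_summary" };

    // 2. DLP stand-in: a crude email pattern. Real DLP needs far more.
    private static readonly Regex EmailPattern =
        new(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", RegexOptions.Compiled);

    private readonly Func<LlmRequest, CancellationToken, Task<string>> _callModel;
    private readonly Action<SafeLogEntry> _safeLog;

    public SketchGuardedGateway(
        Func<LlmRequest, CancellationToken, Task<string>> callModel,
        Action<SafeLogEntry> safeLog)
    {
        _callModel = callModel;
        _safeLog = safeLog;
    }

    public async Task<LlmResult> InvokeAsync(LlmRequest request, CancellationToken ct)
    {
        // Steps 1 and 3: an unexpected field means refuse, not silently drop.
        if (request.Fields.Keys.Any(k => !AllowedFields.Contains(k)))
            return Refuse(request, "refused:unexpected-field");

        // Steps 2 and 3: anything that looks like PII means refuse.
        if (request.Fields.Values.Any(v => EmailPattern.IsMatch(v)))
            return Refuse(request, "refused:possible-pii");

        // Step 4: only now does the request reach the model.
        string text = await _callModel(request, ct);

        // Step 5: log metadata only.
        _safeLog(new SafeLogEntry(request.TaskName, request.Fields.Count, "ok"));
        return new LlmResult(true, text);
    }

    private LlmResult Refuse(LlmRequest request, string outcome)
    {
        _safeLog(new SafeLogEntry(request.TaskName, request.Fields.Count, outcome));
        return new LlmResult(false, string.Empty);
    }
}
```

The design choice worth noticing is that both refusal paths return before the model delegate is ever invoked. A request carrying an unexpected field or an email address never leaves the process, and the only trace it leaves is a log entry with no candidate data in it.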
We package the whole thing as a Docker container. That keeps it cloud-agnostic, since the same image runs on your machine, on a colleague's, and in production unchanged, and it means you're never locked to one provider's runtime. A minimal Dockerfile for a .NET service is short:
# Build
FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
WORKDIR /src
COPY . .
RUN dotnet publish -c Release -o /app
# Run — slim runtime image, no SDK
FROM mcr.microsoft.com/dotnet/aspnet:8.0
WORKDIR /app
COPY --from=build /app .
ENTRYPOINT ["dotnet", "RecruitmentAgent.dll"]

Our default deployment target is GCP Cloud Run (or its counterparts on other clouds: AWS App Runner, ECS Fargate, Azure Container Apps), and the reason is one phrase: scale to zero. With no traffic to serve, the service can wind down to nothing, so the architecture doesn't make you pay for idle. When work arrives, it spins up to handle it. For an agency whose hiring ebbs and flows, that elasticity is the point: you're not paying for a server humming away at three in the morning doing nothing.
Be honest about the bill, though. The near-$0 case is the hobby case: negligible traffic, no warm instance, no supporting plumbing. A real production deployment isn't that. To kill cold starts you keep at least one instance warm; you add a managed audit store (Cloud SQL), a queue (Pub/Sub), and log ingestion. Reckon on roughly $120/month at the low end, $300–$350 for a typical mid-size agency, and $500–$750+ under heavier load or with high availability. We'll cost it properly in the build-vs-buy chapter. Scale-to-zero is an architecture you want; near-zero is not the production norm.
Scale-to-zero isn't free of consequences, and we'll be honest about them later: cold starts, and making sure no in-flight work is lost when an idle instance is reclaimed. The short version, which shapes the design from here on: the service stays stateless, and anything that matters lives in the queue or the ATS, never in the container's memory.
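One common way to honour that rule is to acknowledge a queue message only after its result has been written somewhere durable. The sketch below shows the ordering and nothing more. `IWorkQueue`, `WorkItem` and the in-memory queue are hypothetical types invented for this example; real queue clients (Pub/Sub included) have their own APIs and delivery semantics, which this does not attempt to reproduce.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical queue abstraction, not any real client library.
public sealed record WorkItem(string MessageId, string CvId);

public interface IWorkQueue
{
    Task<WorkItem?> ReceiveAsync(CancellationToken ct); // null when empty
    Task AcknowledgeAsync(string messageId, CancellationToken ct);
}

public static class StatelessWorker
{
    // Processes at most one item. Returns true if an item was completed.
    public static async Task<bool> ProcessOneAsync(
        IWorkQueue queue,
        Func<WorkItem, CancellationToken, Task> screenAndWriteBack,
        CancellationToken ct)
    {
        WorkItem? item = await queue.ReceiveAsync(ct);
        if (item is null) return false;

        // The result is written back (for example to the ATS) inside this
        // call, so nothing of value lives only in this process's memory.
        await screenAndWriteBack(item, ct);

        // Acknowledge only after the work is durable. If the instance is
        // reclaimed before this line, the queue still holds the message.
        await queue.AcknowledgeAsync(item.MessageId, ct);
        return true;
    }
}

// Toy queue for demonstration: a message stays pending until acknowledged.
public sealed class InMemoryWorkQueue : IWorkQueue
{
    private readonly Dictionary<string, WorkItem> _pending = new();

    public int PendingCount => _pending.Count;

    public void Enqueue(WorkItem item) => _pending[item.MessageId] = item;

    public Task<WorkItem?> ReceiveAsync(CancellationToken ct) =>
        Task.FromResult(_pending.Values.FirstOrDefault());

    public Task AcknowledgeAsync(string messageId, CancellationToken ct)
    {
        _pending.Remove(messageId);
        return Task.CompletedTask;
    }
}
```

If `screenAndWriteBack` throws, or the container disappears halfway through, the acknowledgement never happens and the item is still pending for the next attempt. The price of that safety is that the same CV can be processed twice, so the write-back needs to be idempotent, for instance by keying on the CV id.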
That's the toolkit. A boring runtime someone else patches. Secrets out of the code. A model you can swap in one line. One guarded door. A container that runs anywhere and bills you only when it's actually doing something. Notice what's missing: anything exotic. That's deliberate. The stack was never the interesting part. What we build on it is.
Next: talking to your ATS, how the agent actually reaches into Bullhorn and JobAdder to fetch a CV and write back a result.