The problem: configuration baked into the image means the same artifact can't move between environments, and a config change forces a rebuild and redeploy. The surge-pricing threshold that kicks in during a rush is exactly this kind of setting: you want to raise it on a busy Friday without shipping a new build, and a connection string compiled into a binary is a connection string in your source history.
Read config from the environment, not the image (12-Factor, "Config"). The container is identical in staging and production; what differs is what the platform injects at start: environment variables, mounted secrets, a config service. .NET's configuration system layers these for you, so the surge threshold can come from appsettings.json in development and an environment variable in production with no code change.
// Environment wins over the file; surge thresholds and secrets arrive as env vars, never in the image.
builder.Configuration
    .AddJsonFile("appsettings.json", optional: true)
    .AddEnvironmentVariables();
var ordersConnString = builder.Configuration.GetConnectionString("Orders");
var surgeThreshold = builder.Configuration.GetValue<double>("Pricing:SurgeThreshold");
What it buys you in production: one image you can promote unchanged from staging to production, a surge threshold you can dial during a rush without a redeploy, and secrets that live in a secret manager (GCP Secret Manager, AWS Secrets Manager, Azure Key Vault) rather than in a layer anyone who pulls the image can read. It also makes diagnostics honest: when pricing behaves differently between environments, the difference is in config you can inspect, not in a binary you have to decompile.
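To see the layering at work outside a web host, here is a minimal sketch. It assumes the Microsoft.Extensions.Configuration packages (with the Binder and EnvironmentVariables providers) are referenced, as they are in any ASP.NET Core project. The LayeredConfigDemo class and the in-memory default of 1.5 are hypothetical stand-ins for appsettings.json. The one detail worth remembering is that the environment-variable provider maps a double underscore to the ":" separator, so Pricing:SurgeThreshold is set with a variable named Pricing__SurgeThreshold.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Configuration;

// Hypothetical demo class; not part of the chapter's service.
public static class LayeredConfigDemo
{
    public static double ResolveSurgeThreshold()
    {
        var config = new ConfigurationBuilder()
            // Stand-in for appsettings.json: the development default (assumed value).
            .AddInMemoryCollection(new Dictionary<string, string?>
            {
                ["Pricing:SurgeThreshold"] = "1.5"
            })
            // Added last, so a Pricing__SurgeThreshold env var overrides the default.
            .AddEnvironmentVariables()
            .Build();

        return config.GetValue<double>("Pricing:SurgeThreshold");
    }
}
```

With no variable set, the method returns the file-style default. Start the same process with Pricing__SurgeThreshold=2.0 in its environment and it returns 2.0, with no rebuild and no code change.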
Skip-if: there's no skipping this one for anything you deploy more than once. The closest thing to a skip is a throwaway prototype that only ever runs on your laptop. Everything else reads its config from the outside. This habit gets reinforced as a first-class hosting concern in the next chapter.
The problem: alerts wired to raw thresholds page you for things that don't matter, and an on-call who's been woken six times by noise will sleep through the seventh page that was real. Alert fatigue is how teams miss the outage they were paged about.
Page on objectives, not noise (Google SRE). An SLO is a target you commit to: "99% of orders are confirmed within 10 seconds over a rolling 28-day window." That gives you an error budget, the small allowance of slow or failed confirmations you're permitted, and you alert when you're burning through it fast enough to miss the objective. A single order that takes twelve seconds is not a page. A burn rate that will exhaust the window's budget by Tuesday is.
The mechanics matter less than the discipline. Tie every page to a customer-facing objective, route everything else to a dashboard or a ticket, and make the rule simple: if a page doesn't demand a human act now, it isn't a page.
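For readers who do want the mechanics, the burn-rate arithmetic is short. The sketch below is illustrative only: the ErrorBudget class, its method names, and the idea of passing the paging threshold in as a parameter are assumptions, not something the chapter prescribes. Burn rate is the observed fraction of bad confirmations divided by the fraction the SLO allows; a rate of 1.0 spends the budget in exactly one SLO window.

```csharp
using System;

// Hypothetical helper for illustration; names are assumptions.
public static class ErrorBudget
{
    // Burn rate = observed bad fraction / allowed bad fraction.
    // sloTarget is a fraction, e.g. 0.99 for "99% confirmed within 10 seconds".
    public static double BurnRate(long badEvents, long totalEvents, double sloTarget)
    {
        if (totalEvents <= 0) return 0.0;               // no traffic, nothing burned
        double allowedBadFraction = 1.0 - sloTarget;    // about 0.01 for a 99% objective
        double observedBadFraction = (double)badEvents / totalEvents;
        return observedBadFraction / allowedBadFraction;
    }

    // Days a full budget would last if this burn rate continued unchanged.
    public static double DaysToExhaustion(double burnRate, double windowDays)
        => burnRate <= 0 ? double.PositiveInfinity : windowDays / burnRate;

    // Page only when the budget is burning at or above the chosen threshold.
    public static bool ShouldPage(double burnRate, double pageThreshold)
        => burnRate >= pageThreshold;
}
```

As a worked case: 50 slow confirmations out of 1,000 against the 99% objective is a burn rate of about 5, which would spend a full 28-day budget in roughly 5.6 days. One slow confirmation in 1,000 is a burn rate of about 0.1, comfortably inside the budget and not worth waking anyone for. Where to set the paging threshold is a team decision, not something this sketch settles.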
If an alert doesn't require someone to do something right now, it's a dashboard, not a page.
What it buys you in production: an on-call rotation a small team can actually sustain through a dinner rush. People trust the pager because it only fires when the confirm-within-10-seconds objective is genuinely at risk, and the error budget turns "are we fast enough" into a number you can point at instead of an argument you keep having.
Skip-if: you have no on-call and no confirmation commitment yet. A pre-launch pilot in one neighbourhood doesn't need an error budget. The moment someone is carrying a pager, define the objective first, then wire the alert to it, never the other way around.
The problem: a customer disputes a charge, a restaurant swears it never cancelled the order, and a chargeback review asks "who refunded this order, and when." Ordinary application logs rotate away before you can answer (OWASP Logging Cheat Sheet). An operational log and an audit trail have different lifetimes and different consumers.
Keep a separate, append-only record of consequential actions: who, what, when, against which entity. Make it immutable; you write to it, you never update or delete from it. And store references, not payloads. Echoing book #1's rule, store decisions, not documents: record that agent 88 refunded order 4471 for tenant 12 at a timestamp, not a copy of the order's contents and certainly not the customer's card number or address sitting in your audit table forever.
await audit.Record(new AuditEntry(
    Actor: currentUser.Id,
    Action: "order.refunded",
    Subject: order.Id, // a reference to the order, not its body
    Tenant: order.TenantId,
    At: DateTimeOffset.UtcNow));
What it buys you in production: an answer to the "who refunded this" or "who cancelled that" question that holds up in a chargeback dispute or an incident, without turning your audit store into a second copy of your orders and a fresh pile of customer PII to protect. The append-only shape means the trail itself is trustworthy: nobody can quietly edit history after a disputed cancellation.
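One possible shape for the AuditEntry and the audit object used in the call above is sketched here. The source does not show these types, so everything in this block is an assumption: the string-typed identifiers, the InMemoryAuditLog class, and its Snapshot method are hypothetical. The point it illustrates is the surface area: the type exposes a way to append and a way to read, and nothing that updates or deletes.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Assumed shape: identifiers only, never the order body or customer details.
public sealed record AuditEntry(
    string Actor,
    string Action,
    string Subject,
    string Tenant,
    DateTimeOffset At);

// Hypothetical in-memory stand-in for a real audit store.
public sealed class InMemoryAuditLog
{
    private readonly List<AuditEntry> _entries = new();
    private readonly object _gate = new();

    // Append is the only mutation offered; there is no Update or Delete.
    public Task Record(AuditEntry entry)
    {
        lock (_gate) { _entries.Add(entry); }
        return Task.CompletedTask;
    }

    // Readers get a copy, so they cannot alter the recorded history.
    public IReadOnlyList<AuditEntry> Snapshot()
    {
        lock (_gate) { return _entries.ToArray(); }
    }
}
```

An in-memory list is only a teaching device. In a real deployment the append-only guarantee has to be enforced by the store itself, for example with database credentials that may insert but not update or delete, because application code that merely declines to call an update is not the same as a trail nobody can edit.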
Skip-if: nothing you do is consequential or regulated. A read-only menu browser with no privileged actions has nothing to audit. The day your service lets someone refund a charge, cancel another party's order, or change access, you need the trail, and you need it from the first such action, not retrofitted after the first dispute.
Log Aggregation is the piece that makes structured logging pay off at scale: ship every container's logs to one searchable place (Cloud Logging, CloudWatch Logs, Azure Monitor) so you're not SSH-ing into instances that scale to zero. On most platforms it's a default you turn on, which is why it sits just outside the core.
Synthetic Monitoring runs a scripted order against production on a schedule and alerts when the real customer journey breaks, catching the failures that your internal metrics miss because no real customer has hit them yet. Worth adding once you have a critical path, place-an-order being the obvious one, that you can't afford to discover is broken from a support ticket.
RED and USE are two lenses on the golden signals worth knowing by name. RED (Rate, Errors, Duration, from Tom Wilkie) is the request-centric view for services; USE (Utilisation, Saturation, Errors, from Brendan Gregg) is the resource-centric view for machines and pools. They're complements to the four signals, not replacements, and naming them helps a team agree on what to watch.
All of this has to run somewhere, and the next altitude is how you host it without marrying one cloud.