The Four Golden Signals (latency, traffic, errors, saturation) — Google, Site Reliability Engineering (O'Reilly, 2016), Ch. 6, Monitoring Distributed Systems. https://sre.google/sre-book/monitoring-distributed-systems/ Watched on the order service at the dinner rush. Complements: the RED method (Tom Wilkie) and the USE method (Brendan Gregg, https://www.brendangregg.com/usemethod.html).
Health Endpoint Monitoring — Azure Cloud Design Patterns (https://learn.microsoft.com/azure/architecture/patterns/health-endpoint-monitoring); ASP.NET Core health checks. Liveness and readiness for the order service's orchestrator.
Distributed Tracing, Metrics, Logs (the three pillars) + Correlation IDs — OpenTelemetry (CNCF). https://opentelemetry.io/docs/ Trace one order across order → payment → courier. Vendor-neutral; export to Google Cloud Trace / Cloud Monitoring (or AWS X-Ray / CloudWatch · Azure Monitor).
Structured Logging — Serilog. https://serilog.net/ Logs keyed on order_id and tenant_id.
Externalised Configuration — The Twelve-Factor App, "III. Config." https://12factor.net/config Surge thresholds as config, not code. Overlaps Altitude 7; taught here as the diagnostics-friendly habit.
Alerting & SLOs — Google SRE, Service Level Objectives (https://sre.google/sre-book/service-level-objectives/) and Alerting on SLOs (https://sre.google/workbook/alerting-on-slos/). "99% of orders confirmed within 10s": page on objectives, not noise.
Audit Logging — OWASP Logging Cheat Sheet. https://cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html An immutable who-did-what trail for refunds and cancellations: references, not PII.
Honorable mentions — Log Aggregation; Synthetic Monitoring; the RED and USE methods.
Altitude 7 — Hosting
The Twelve-Factor App — Adam Wiggins / Heroku, 2011. https://12factor.net/ The backbone: stateless processes, externalised config, disposability, port binding. The order service runs stateless, with its state held in the DB and queue.
Container as the unit — OCI Image Specification (https://opencontainers.org/); Docker. One image for the order service.
Sidecar / Ambassador — Azure Cloud Design Patterns (https://learn.microsoft.com/azure/architecture/patterns/sidecar and …/ambassador); Kubernetes pod sidecars. Telemetry shipped by a sidecar.
Scale-to-Zero — Knative / Cloud Run (https://cloud.google.com/run/docs); AWS App Runner · Azure Container Apps. The analytics/report worker scales to zero off-peak. Note the cold-start tradeoff.
Graceful Shutdown (SIGTERM) — Kubernetes pod lifecycle and container lifecycle hooks. https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/ Finish in-flight orders before exit.
Orchestrator-Agnostic Deploy — the same OCI image to Cloud Run / ECS / Container Apps / Kubernetes.
Infrastructure as Code — Terraform / Pulumi / Bicep; Fowler bliki. https://martinfowler.com/bliki/InfrastructureAsCode.html The whole food-delivery stack as reproducible, reviewable, cloud-portable environments.
Canary Release — Danilo Sato, Fowler bliki. https://martinfowler.com/bliki/CanaryRelease.html Route 5% of orders through a new pricing engine first, with an instant rollback path.
Honorable mentions — Feature Flags / Feature Toggles (Fowler, https://martinfowler.com/articles/feature-toggles.html); Gateway / Backend-for-Frontend (Sam Newman; Fowler, https://martinfowler.com/articles/gateway-pattern.html) — a BFF each for the customer and courier apps; Secrets Management (GCP Secret Manager · HashiCorp Vault); Service Discovery.
Other foundational and framing texts
Evans, Eric. Domain-Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley, 2003. The source for Anti-Corruption Layer (over the legacy restaurant POS), Domain Events, and Specification.
Microsoft. Azure Cloud Design Patterns. A vendor-published but broadly applicable catalogue of cloud patterns. https://learn.microsoft.com/azure/architecture/patterns/ Cited as "(Azure Cloud Design Patterns)" throughout; individual pages are listed by altitude above.
Beyer, Betsy; Jones, Chris; Petoff, Jennifer; Murphy, Niall Richard (eds.). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly, 2016. Free online: https://sre.google/sre-book/table-of-contents/ The source for the four golden signals and SLO-based alerting.
OpenTelemetry (CNCF). https://opentelemetry.io/docs/ The vendor-neutral standard behind the observability altitude.
Framing & delivery metrics
Used to frame what good delivery looks like, not as quoted benchmarks.
DORA — the four key metrics (deployment frequency, lead time for changes, change failure rate, time to restore service). https://dora.dev/guides/dora-metrics-four-keys/ A framework, not a multiplier; don't quote elite-versus-low figures without the report.
Standish Group, CHAOS 2015 — small projects succeed far more often than grand ones, which underpins the case for small, scoped work. Cite as "(Standish CHAOS 2015)"; use sparingly.
A note on the numbers: this book is built on provenance. If a claim has no source above, it does not belong in the book.