Playbook

AI Infrastructure Readiness Audit

A repeatable process for auditing production AI infrastructure: what to measure, what to fix first, and how to write down the answer.

6 min read · For: Platform teams and infrastructure leads operating GPU workloads or planning to. Reviewed Apr 2026

TL;DR

  • Before you optimize GPU cost, reliability, or latency, establish the baseline. Most "we need to reduce GPU spend" conversations start with numbers that do not survive ten minutes of scrutiny.
  • The audit produces four artifacts: a cost model, a reliability risk register, an SLO proposal, and a 90-day roadmap with owners. Anything less is theater.
  • The hardest part is almost never technical. It is separating "what we believe is true about this system" from "what we have evidence for." Most audits are just that separation done carefully.
  • Run it as a fixed-scope engagement. Two to three weeks, written deliverables, clear handoff. Open-ended audits become consulting engagements, which is a different product.
  • The audit is not the fix. It is the honest inventory that makes the fix prioritizable.

When to use this playbook

You are running or responsible for AI infrastructure, and one or more of the following is true:

  • Cost is growing faster than usage, and nobody can say why with confidence.
  • An inference service has had two or three production incidents in a quarter, and the root causes feel related but nobody has written it down.
  • Leadership is asking about moving workloads on-prem, hybrid, or between clouds, and you do not have the numbers to decide.
  • A new model or workload is about to land and you do not know if the platform can absorb it.

If none of those are true, you do not need an audit. You need to keep running. Come back when one of them lights up.

Inputs

Before the audit starts, collect:

  • Access. Read-only is fine for most of it. You need kubectl or equivalent, Prometheus or equivalent, cloud billing console access, and read access to the deployment manifests or Terraform.
  • The list of inference services and GPU workloads in scope. Not a guess — the actual list, verified against what is running.
  • A primary stakeholder. One person who can answer "does this risk matter to us" without needing a committee.
  • A commitment that the audit output will be reviewed. A report nobody reads is worse than no report.

If any of these are missing, the audit will stall. Fix those first.

The audit

The audit has four tracks. They run in parallel, but each has an owner and a deliverable.

  1. Cost baseline — what are we actually spending, on what, and how is it trending?
  2. Reliability inventory — what breaks, how often, and what is the blast radius?
  3. SLO proposal — what should we promise, to whom, and what error budget can we defend?
  4. Roadmap — given everything above, what do we do next, in what order, and who owns it?

Each track has a gate. None of them is complete without a written artifact you could hand to a successor.

Track 1: Cost baseline

The goal is a cost model that answers three questions with numbers: what are we spending, on what unit of work, and how elastic is it?

Steps:

  • Pull the last 90 days of GPU spend from the cloud console or the on-prem equivalent (power, depreciation, networking).
  • Map spend to workloads. Not to services — to actual workloads, which may be multiple per service. Inference, training, embeddings, and batch are different workloads with different cost shapes.
  • Derive a unit-of-work cost. For inference, this is cost per 1K tokens at some reference request profile. For training, cost per epoch at a reference dataset size. For embeddings, cost per 1M vectors.
  • Separate fixed from variable. A GPU sitting idle still costs money; a GPU at 80% utilization costs the same per hour but less per unit of work. This is the lever.
  • Flag the lies. The cloud console will report utilization in a way that looks useful but is not (aggregate, time-windowed, missing the per-GPU breakdown). Record what is misleading.

The deliverable is a cost model with the three numbers, a written summary of assumptions, and a "things the console hides from you" section. That last one is often the most actionable.
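The unit-of-work and fixed/variable steps reduce to a small amount of arithmetic. A minimal sketch, with hypothetical numbers (four GPUs at $2.50/hour over 90 days, 55% utilized, 1.8B tokens served) standing in for your own billing export and request logs:

```python
# Sketch: derive unit-of-work cost from 90 days of billing data.
# All rates, counts, and the utilization figure are illustrative.

def cost_per_1k_tokens(gpu_hours: float, hourly_rate: float,
                       tokens_served: int) -> float:
    """Blended inference cost per 1K tokens over the window."""
    total_cost = gpu_hours * hourly_rate
    return total_cost / (tokens_served / 1_000)

def fixed_vs_variable(gpu_hours: float, utilized_hours: float,
                      hourly_rate: float) -> tuple[float, float]:
    """Split spend into idle waste (fixed) and utilized spend (variable)."""
    idle = (gpu_hours - utilized_hours) * hourly_rate
    used = utilized_hours * hourly_rate
    return idle, used

# Hypothetical example: 4 GPUs for 90 days at $2.50/hr, 55% utilized,
# 1.8B tokens served in the window.
gpu_hours = 4 * 90 * 24                                  # 8640 GPU-hours
idle, used = fixed_vs_variable(gpu_hours, gpu_hours * 0.55, 2.50)
unit = cost_per_1k_tokens(gpu_hours, 2.50, 1_800_000_000)
```

The point of the split is that `idle` is recoverable without touching the workload at all, which is usually the first roadmap item.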

Track 2: Reliability inventory

The goal is a risk register — what can go wrong, how it manifests, how often it has, and what blast radius it has.

Steps:

  • Pull the last 90 days of incidents, near-misses, and "weird" Slack threads. Classify them by mode: driver/kernel drift, VRAM exhaustion, thermal throttling, scheduling failure, dependency version mismatch, upstream API issue, data issue, model regression.
  • For each mode, write down: frequency, blast radius (one pod, one service, the cluster), detection time, and time-to-mitigation.
  • Map each mode to a guardrail: does one exist, is it tested, does it fire? A guardrail that has never fired in a real incident is a guardrail that has not been tested.
  • Identify the single points of failure. A GPU node type with no failover. A registry nobody is mirroring. A driver version nobody remembers why you pinned. Each SPOF gets a line.

The deliverable is the risk register — modes, frequencies, guardrails, SPOFs — plus a ranked list of "fix this first" items with rough effort estimates.
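The register itself can be plain data, which makes the "fix this first" ranking mechanical instead of political. A sketch under assumed weights (the blast-radius scores and the 2x penalty for untested guardrails are illustrative choices, and the failure modes shown are hypothetical entries):

```python
# Sketch of the risk register as data, with a crude priority ranking.
from dataclasses import dataclass

# Assumed weights: how much worse a wider blast radius is.
BLAST_WEIGHT = {"pod": 1, "service": 5, "cluster": 25}

@dataclass
class FailureMode:
    name: str
    incidents_90d: int      # how often it actually fired
    blast_radius: str       # "pod", "service", or "cluster"
    guardrail_tested: bool  # has the guardrail fired in a real incident?

    def priority(self) -> int:
        score = self.incidents_90d * BLAST_WEIGHT[self.blast_radius]
        # An untested guardrail doubles the effective risk.
        return score * (1 if self.guardrail_tested else 2)

register = [
    FailureMode("VRAM exhaustion", 6, "service", guardrail_tested=False),
    FailureMode("driver/kernel drift", 1, "cluster", guardrail_tested=False),
    FailureMode("scheduling failure", 4, "pod", guardrail_tested=True),
]
fix_first = sorted(register, key=lambda m: m.priority(), reverse=True)
```

Whatever weights you pick, write them down next to the register; the ranking is only defensible if the scoring is.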

Track 3: SLO proposal

Most inference systems do not have SLOs. They have folklore.

Define, for each in-scope service:

  • What is being promised (latency, availability, correctness, throughput).
  • To whom (internal team, paying customer, downstream automation).
  • At what threshold (p95 latency under 500ms, 99.5% availability monthly, etc.).
  • With what error budget and what happens when it is spent.

The hardest part is the threshold. Do not pick one from the internet. Derive it from: (a) what the users currently experience, (b) what the downstream workflow can tolerate, and (c) what the platform can realistically deliver. A 99.99% SLO on consumer-GPU inference is folklore. A 99.5% SLO with a written error-budget policy is an engineering artifact.

Deliverable: a proposed SLO per service, a one-paragraph rationale, and an error-budget policy draft.
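The error-budget half of the proposal is just arithmetic, and writing it as arithmetic keeps the policy conversation concrete. A sketch using the 99.5% monthly example from above:

```python
# Sketch: translate a monthly availability SLO into a concrete
# error budget. The 99.5% target mirrors the example in the text.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime for the window, in minutes."""
    total = window_days * 24 * 60
    return total * (1 - slo)

def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the budget left; below zero means the SLO is blown."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

# 99.5% over 30 days allows 216 minutes of downtime; one bad
# two-hour incident consumes more than half the month's budget.
```

The "what happens when it is spent" clause in the policy draft is then a statement about `budget_remaining` crossing a threshold, not a vibe.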

Track 4: Roadmap

Everything converges here. Given the cost baseline, the reliability inventory, and the SLO proposal, produce a 90-day roadmap.

Structure:

  • Weeks 1–4: stop-the-bleeding. Specific fixes for top reliability risks and the highest-leverage cost lever. Each item has an owner and a measurable outcome.
  • Weeks 5–8: foundation. Anything that unlocks the next quarter — GitOps migration, monitoring gap-fills, pinning and reproducibility work. Boring, durable.
  • Weeks 9–12: moving the number. Now and only now, optimization work aimed at the cost or reliability number. Anything earlier than week 9 is premature.

Each roadmap item has: the problem it addresses, the evidence from the audit, the owner, the rough effort, and the expected outcome on a specific metric.

Gates summary

  • Cost. Gate: per-unit-of-work cost with written assumptions. Evidence: the model, defended in review.
  • Reliability. Gate: risk register with frequency and blast radius. Evidence: classified incidents plus SPOF list.
  • SLO. Gate: one SLO per service, with rationale. Evidence: proposal doc.
  • Roadmap. Gate: 90-day plan with owners and metrics. Evidence: the roadmap, reviewed by stakeholders.

Anti-patterns

  • Open-ended scope. "Let us also look at data quality" turns the audit into a multi-month engagement. Scope it to four tracks, three weeks. If data quality is the real problem, the audit will surface it and you can run a separate data-quality engagement.
  • Skipping the SLO track because "we are not ready." You are not ready for a final SLO. You are ready for a proposal. The proposal forces the conversation that reveals what you do not know.
  • A cost model without assumptions. A cost number without context is indistinguishable from a guess. Always write the assumptions down.
  • A roadmap that does not rank. Unranked lists get cherry-picked. Ranked lists get executed or argued with. Both outcomes are better than cherry-picking.
  • Using the audit as a sales motion. If the deliverable is a pitch for a bigger engagement, trust collapses and no one acts on the findings. The audit is the product.

What this looks like in practice

This audit is also the template I apply to my own platforms. The shape is the same at every scale:

  • GPU cost numbers derived from measurement, not the console.
  • Incidents classified and mapped to guardrails.
  • SLOs proposed and defended, not inherited.
  • A roadmap that someone could execute without the author in the room.

The goal is always the same. Replace folklore with artifacts, so the next decision is informed by evidence instead of vibes.

Source material

The source posts and case studies for this playbook are listed at the bottom of the page. They go deeper on each track, with real numbers from the platforms I operate.