AI Infra Readiness Audit
A fixed-scope diagnostic to baseline cost, find reliability risks, and leave you with a plan you can execute.
Typical timeline: 2–3 weeks. Price anchor: $7.5k–$15k.
Prefer reading first? Start with the checklist post, "AI Infra Readiness Audit: what I check."
Who this is for
If any of these are true, the audit usually pays for itself.
- GPU spend is rising and you can’t explain it confidently.
- Production reliability is shaky (or you’re flying blind without usable telemetry).
- You’re considering hybrid/on-prem and need a realistic plan.
- Releases are risky: unclear rollout/rollback paths, unclear ownership.
What you get
Concrete outputs in writing.
- Scorecard across cost, reliability, security, operability
- GPU cost baseline + scaling model
- Risk register (top failure modes + mitigations)
- 90-day roadmap (sequenced, owned, executable)
Example artifacts
I optimize for written outputs you can hand to a team: what’s broken (and why), what to do next, and how to verify progress.
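To give a feel for the GPU cost baseline and scaling model, here is a minimal sketch, not the audit's actual model. The pool name, hourly rate, utilization figure, and growth factor are illustrative assumptions; in a real audit they would come from your billing export and telemetry.

```python
import math
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # average hours per month (8760 / 12)


@dataclass
class GpuPool:
    """Hypothetical description of one GPU pool; all fields are assumed inputs."""
    name: str
    gpus: int            # GPUs provisioned
    hourly_rate: float   # assumed $ per GPU-hour
    utilization: float   # fraction of provisioned GPU-hours doing useful work (0..1)


def monthly_cost(pool: GpuPool) -> float:
    """Baseline: provisioned GPUs are billed whether busy or idle."""
    return pool.gpus * pool.hourly_rate * HOURS_PER_MONTH


def idle_cost(pool: GpuPool) -> float:
    """Share of the monthly bill spent on idle capacity."""
    return monthly_cost(pool) * (1.0 - pool.utilization)


def gpus_needed(pool: GpuPool, demand_growth: float, target_utilization: float) -> int:
    """Scaling model: GPUs required if demand grows by `demand_growth`
    and the pool is run at `target_utilization`."""
    busy_gpus = pool.gpus * pool.utilization * demand_growth
    # Round before ceil so floating-point noise does not add a phantom GPU.
    return math.ceil(round(busy_gpus / target_utilization, 6))


if __name__ == "__main__":
    pool = GpuPool(name="inference", gpus=10, hourly_rate=2.0, utilization=0.5)
    print(f"baseline: ${monthly_cost(pool):,.0f}/month")
    print(f"idle:     ${idle_cost(pool):,.0f}/month")
    # Sensitivity: same 2x demand, different utilization targets.
    for target in (0.5, 0.8):
        print(f"2x demand at {target:.0%} target: {gpus_needed(pool, 2.0, target)} GPUs")
```

The point of the sensitivity loop is that the same growth assumption can imply very different GPU counts depending on the utilization you can actually sustain, which is why the baseline and the assumptions are written down together.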
What I need from you
This keeps the audit fast and grounded in reality.
- A 60-minute architecture walkthrough (current state + constraints)
- A rough GPU cost breakdown (or billing export) for the last 30–90 days
- Top incidents / reliability pain points (informal notes are fine)
- Access to CI/CD + GitOps flow (repos, environments, rollout process)
How it runs
A predictable workflow with written artifacts.
- Kickoff: constraints, success metrics, scope boundaries
- Current-state review: workloads, cluster/runtime, deployment flow
- Cost model: baseline + sensitivity assumptions
- Reliability review: candidate SLIs/SLOs, failure modes, runbooks
- Findings review: working session + prioritized decisions
- Final handoff: report + roadmap + next-step options
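As an illustration of what the reliability review step works with, the sketch below turns a candidate SLO into an error budget. It is a simplified example under stated assumptions: the function names, the 30-day window, and the request counts are hypothetical, and a real SLI definition depends on your workloads.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0  # no traffic in the window, so there is no budget to spend
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% availability target over 30 days allows about 43 minutes of downtime.
    print(f"{error_budget_minutes(0.999):.1f} minutes")
    # Hypothetical traffic: 1,000,000 requests, 2,500 failures, 99% success target.
    print(f"{budget_remaining(0.99, 1_000_000, 2_500):.0%} of budget remaining")
```

Framing reliability as a budget makes the roadmap trade-offs concrete: a release process that regularly burns most of the budget is a stronger candidate for early work than one that rarely touches it.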
FAQ
Short answers, no sales gloss.
A scorecard, a written architecture review, a GPU cost baseline/model, a risk register, and a 90-day roadmap with sequencing and owners.
Usually not. Most audits can be done with read-only visibility plus walkthroughs and exports (billing, telemetry, CI/CD, configs). If hands-on access is needed, we’ll scope it explicitly.
Yes. The deliverable is meant to be executable by your team. If you want help, we can scope phase 2 as a project or a retainer.
Most engagements start with the Readiness Audit. It gives you an execution plan and de-risks implementation.
Yes. I can build new MCP servers (Go or Python), harden existing ones, and deploy them locally or in Kubernetes with a clean handoff.
Yes. That’s a common blocker for enterprise use. We can scope a “context gateway” so agents access tools as the user (OIDC), with logs and rate limits.
No. I work across cloud, hybrid, and on-prem. The goal is predictable performance, cost, and operations.
Implementation is project-based with clear deliverables and milestones. If you need flexibility, a retainer can work better.
Next steps after the audit
If the roadmap is clear and you want help shipping it, phase 2 is typically a build/stabilization project. If you mainly need senior guidance, a retainer can be a better fit.