Standing Up a GPU-Ready Private AI Platform (Harvester + K3s + Flux + GitLab)
December 29, 2025 · 6 min read
A GPU-ready private AI platform (without the Rancher/Fleet stack)
If you want to run GPU inference 24/7, “just throw it on AWS” is a great default… until it isn’t. In my case, the constraints were more like:
- I want predictable cost for always-on services.
- I want data locality (and the option to keep sensitive data off the public internet).
- I want a workflow that’s boring: Git push → CI → Flux sync → running workloads.
This post is a case study in how I stood up a small GPU-ready private AI platform using Harvester for virtualization, K3s for the workload cluster, and GitLab + Flux for GitOps. My GPU fleet is still mostly AMD, so my defaults lean ROCm-first. I’m intentionally not using the “unified Rancher ecosystem” (and I’m not using Fleet): the core idea is that the control plane should be composable and easy to evolve.
TL;DR
- Harvester gives me a clean virtualization + storage foundation (KubeVirt + Longhorn) for cluster nodes.
- K3s keeps the workload cluster small and fast to operate.
- GitLab CI builds/publishes containers; Flux reconciles Kubernetes state from Git.
- GPU enablement is mostly about consistency: drivers + device plugin + node labeling/taints + runtime assumptions.
- The interesting part is the product layer: how services under `services/flexinfer` can ship inference features safely on top of this.
Today that product layer is no longer hypothetical. The same platform now backs FlexInfer itself, the docs/playground surfaces on this site, and the sanitized public FlexDeck APIs behind the live demos.
Context: what “done” means
I’m not trying to recreate a hyperscaler. “Done” for this platform looks like:
- A clean path to deploy and roll back GPU workloads (Helm/Kustomize + Flux).
- A repeatable way to stand up new services under `services/flexinfer` without snowflake cluster changes.
- Guardrails: secrets management, resource isolation, observability, and upgrades that don’t require heroics.
Layer 0: Harvester as the substrate
Harvester (KubeVirt on Kubernetes) is a practical middle ground between “pure bare metal k8s” and “traditional virtualization with a separate storage stack”:
- VMs for cluster nodes: workload cluster nodes are just VMs with predictable sizing.
- Storage: Longhorn as the default makes PVC behavior consistent (and debuggable).
- Networking: I keep it simple and avoid cleverness unless I need it (GPU inference doesn’t benefit from exotic networking).
The operational win is that node lifecycle becomes “treat nodes like cattle again,” even when the underlying hardware is not.
Layer 1: K3s for the workload cluster
I’m using K3s for the workload cluster because it’s lightweight, well-understood, and easy to repair. The tradeoff is that you have to be disciplined about what you add:
- Prefer “one controller per concern” (Flux, cert-manager, ingress, observability) instead of kitchen-sink bundles.
- Keep the cluster API surface small: fewer CRDs, fewer moving parts.
Layer 2: GitLab + Flux GitOps (no Fleet)
The contract I want is simple: the cluster is a projection of Git.
At a high level:
- `services/*` repos build and test artifacts in GitLab CI.
- Images publish to a registry with immutable tags.
- A separate GitOps repo declares desired state (HelmRelease/Kustomization).
- Flux reconciles that state into the cluster.
A minimal Flux Kustomization looks like this (pseudocode-ish, but close to what I run):
```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flexinfer-platform
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/prod
  prune: true
  sourceRef:
    kind: GitRepository
    name: platform-gitops
```
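That Kustomization references a `GitRepository` source named `platform-gitops`. A minimal sketch of that source object looks like the following; the URL and secret name are placeholders, not my actual setup:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform-gitops
  namespace: flux-system
spec:
  interval: 5m
  url: https://gitlab.example.com/platform/gitops.git  # placeholder URL
  ref:
    branch: main
  secretRef:
    name: gitops-read-token  # hypothetical read-only deploy-token secret
```

Flux polls the repo on `interval`, and the Kustomization reconciles whatever lands on `main`.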
The key operational move is to make “how things ship” legible:
- CI owns tests + builds + SBOM/signing (if/when needed).
- GitOps owns rollout policy and drift correction.
- The cluster is not a place for manual edits.
GPU enablement: make it boring
GPU support fails in predictable ways: driver mismatch, runtime mismatch, or scheduling mismatch. I aim for boring invariants:
- Drivers/runtime: pick a known-good AMD driver + ROCm combo and don’t drift casually.
- Device plugin / operator: deploy via GitOps, pin versions, and treat upgrades like real changes.
- Scheduling: label/taint GPU nodes and make workloads declare intent.
On AMD nodes, my “sanity check” is intentionally unglamorous: confirm `/dev/kfd` + `/dev/dri` exist, validate the host with `rocminfo`/`rocm-smi`, then validate the container runtime can see the device nodes (before I even look at model code).
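The container-level half of that check can be captured as a throwaway debug pod. This is a sketch, not my exact manifest: the node name and image tag are illustrative, and it deliberately uses `privileged` + hostPath instead of the device plugin, because the point is to test device-node visibility below the plugin:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rocm-sanity          # disposable debug pod
spec:
  restartPolicy: Never
  nodeName: gpu-node-1       # illustrative: pin to the node under test
  containers:
    - name: rocminfo
      image: rocm/dev-ubuntu-22.04:6.0  # any ROCm-enabled image works; tag is illustrative
      command: ["rocminfo"]
      securityContext:
        privileged: true     # debug only; real workloads request amd.com/gpu instead
      volumeMounts:
        - { name: kfd, mountPath: /dev/kfd }
        - { name: dri, mountPath: /dev/dri }
  volumes:
    - { name: kfd, hostPath: { path: /dev/kfd } }
    - { name: dri, hostPath: { path: /dev/dri } }
```

If `rocminfo` lists the agents here but a device-plugin-scheduled pod can’t see them, the problem is in the plugin or runtime config, not the driver.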
Example patterns I rely on:
- Taint GPU nodes and require a toleration.
- Use node selectors or affinity for “GPU-capable” pools.
- Request GPUs explicitly (for me: `amd.com/gpu: 1` via an AMD GPU device plugin) and enforce limits with policy.
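Put together, a workload that declares intent looks roughly like this; the taint key and pool label are my conventions in sketch form, and the image is a placeholder:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-server
spec:
  nodeSelector:
    gpu-pool: amd            # example label applied to GPU-capable nodes
  tolerations:
    - key: gpu               # example taint key on GPU nodes
      operator: Exists
      effect: NoSchedule
  containers:
    - name: server
      image: registry.example.com/flexinfer/server:1.2.3  # placeholder image
      resources:
        requests:
          amd.com/gpu: 1     # resource name exposed by the AMD device plugin
        limits:
          amd.com/gpu: 1     # extended resources require requests == limits
```

A pod missing the toleration or the `amd.com/gpu` request simply won’t schedule onto a GPU node, which is the fail-fast behavior described above.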
If the platform is healthy, a GPU workload should fail fast and obviously when it’s misconfigured.
One AMD footgun: KFD wants all GPUs bound
On my RX 7900 XTX node, the integrated GPU is a trap: if it isn’t bound to the amdgpu driver, /dev/kfd breaks and everything above it becomes “mysteriously flaky.”
The fix is unglamorous but effective:
- keep the iGPU bound,
- use the AMD device plugin in “mixed” resource naming mode so I can request specific GPU families,
- and select the discrete GPU in workloads via `HIP_VISIBLE_DEVICES`.
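In a container spec, that last step is a one-line env fragment; the index is illustrative, so verify the enumeration order with `rocm-smi` on the node first:

```yaml
# container fragment: select the discrete GPU when the iGPU is also bound
env:
  - name: HIP_VISIBLE_DEVICES
    value: "0"   # index of the discrete GPU; "0" is illustrative, check rocm-smi
```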
That’s the kind of thing I mean by “make it boring”: encode the footguns into the platform so model deploys don’t have to rediscover them.
Another footgun: gfx1100 needs pinned ROCm knobs
On consumer RDNA3 (gfx1100), I’ve repeatedly hit “it runs, then it hangs or segfaults under load” issues when I let serving frameworks auto-select attention implementations.
So I treat a few environment variables as part of the platform contract and set them centrally (not per app):
- `HSA_OVERRIDE_GFX_VERSION=11.0.0`
- `PYTORCH_ROCM_ARCH=gfx1100`
- `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`
- and, for vLLM specifically: force the V0 engine and disable AITER.
These aren’t “performance tweaks.” They’re the difference between “I can leave this running” and “I’m SSHing into a node at midnight.”
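One way to set these centrally rather than per app is a shared ConfigMap that GPU workloads pull in via `envFrom`. This is a sketch: the ConfigMap name and namespace are mine, and the exact vLLM variable names depend on the vLLM version you pin:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rocm-gfx1100-env     # hypothetical name, referenced by GPU workloads
  namespace: flexinfer       # hypothetical namespace
data:
  HSA_OVERRIDE_GFX_VERSION: "11.0.0"
  PYTORCH_ROCM_ARCH: "gfx1100"
  TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL: "1"
```

Workloads then reference it with `envFrom: [{ configMapRef: { name: rocm-gfx1100-env } }]`, so a knob change is one Git commit, not an edit across every Deployment.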
Versioning: What I Pin (and Why)
The biggest operational lesson from GPU homelab work is that “latest” is not a plan.
I treat these as part of the platform contract and pin them:
- Harvester (virtualization substrate)
- K3s (workload cluster distribution)
- Flux + Helm controller versions
- Container images that depend on GPU runtimes
- ROCm + kernel/driver combo on the AMD nodes
When I do upgrade, I do it like a production change: one variable at a time, with an easy rollback and a short “known-good” note in Git.
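In Flux terms, “pinned” means an exact chart version in the HelmRelease rather than a semver range. A sketch, with illustrative chart and repository names:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: amd-gpu-device-plugin
  namespace: kube-system
spec:
  interval: 30m
  chart:
    spec:
      chart: amd-gpu         # illustrative chart name
      version: "0.12.0"      # exact pin, not ">=0.12.0"
      sourceRef:
        kind: HelmRepository
        name: amd-gpu-helm   # hypothetical HelmRepository object
```

Bumping `version` in Git is then the upgrade, and `git revert` is the rollback.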
What this stack is backing now
Infrastructure is only interesting if it enables product velocity. The useful update here is that this platform is now backing real product surfaces, not just platform diagrams:
- FlexInfer: the control plane is effectively production-ready, with hardened routing, serverless activator behavior, multi-cluster support, and advanced model-management features documented in the current implementation status.
- flexinfer-site: this site is no longer brochureware. It now serves multi-project docs, playground hubs, case studies, and public demos from the same GitLab -> Flux delivery path. App-level metrics and the `/metrics` endpoint are live; the remaining monitoring rollout work is scrape + alert wiring.
- FlexDeck public API: the public cluster overview, CI pipeline, model gallery, and related demos are backed by sanitized read-only proxy routes. The remaining hardening work is the boring but necessary layer: rate limiting and narrower RBAC.
That is the part I care about most. A private GPU platform becomes interesting once it is good enough to support documentation, demos, operational tooling, and product rollout without each one inventing a separate deployment story.
What I’d do next
- Add policy (Kyverno/Gatekeeper) for “no GPU workloads without requests/limits”.
- Add a small SLO dashboard for inference: p95 latency, error rate, GPU utilization, and saturation.
- Formalize “break-glass” operational procedures (because you will need them).
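The first of those items can be sketched as a Kyverno ClusterPolicy. This is a sketch under assumptions: the `workload-class: gpu` label is my hypothetical convention for marking GPU workloads, and field names should be checked against the Kyverno version you pin:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-gpu-requests
spec:
  validationFailureAction: Enforce
  rules:
    - name: gpu-workloads-must-request-gpus
      match:
        any:
          - resources:
              kinds: ["Pod"]
              selector:
                matchLabels:
                  workload-class: gpu   # hypothetical label convention
      validate:
        message: "GPU workloads must set amd.com/gpu requests and limits."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    amd.com/gpu: "?*"   # any non-empty value
                  limits:
                    amd.com/gpu: "?*"
```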
If you’re building something similar, I’m happy to compare notes, especially around upgrade playbooks and the failure modes you only see after a few months of real usage.