Standing Up a GPU-Ready Private AI Platform (Harvester + K3s + Flux + GitLab)
December 29, 2025 · 6 min read
A GPU-ready private AI platform (without the Rancher/Fleet stack)
If you want to run GPU inference 24/7, “just throw it on AWS” is a great default… until it isn’t. In my case, the constraints were more like:
- I want predictable cost for always-on services.
- I want data locality (and the option to keep sensitive data off the public internet).
- I want a workflow that’s boring: Git push → CI → Flux sync → running workloads.
This post is a case study in how I stood up a small GPU-ready private AI platform using Harvester for virtualization, K3s for the workload cluster, and GitLab + Flux for GitOps. My GPU fleet is still mostly AMD, so my defaults lean ROCm-first. I’m intentionally not using the “unified Rancher ecosystem” (and I’m not using Fleet): the core idea is that the control plane should be composable and easy to evolve.
TL;DR
- Harvester gives me a clean virtualization + storage foundation (KubeVirt + Longhorn) for cluster nodes.
- K3s keeps the workload cluster small and fast to operate.
- GitLab CI builds/publishes containers; Flux reconciles Kubernetes state from Git.
- GPU enablement is mostly about consistency: drivers + device plugin + node labeling/taints + runtime assumptions.
- The interesting part is the product layer: how services under `services/flexinfer` can ship inference features safely on top of this.
Today that product layer is no longer hypothetical. The same platform now backs FlexInfer itself, the docs/playground surfaces on this site, and the sanitized public FlexDeck APIs behind the live demos.
Context: what “done” means
I’m not trying to recreate a hyperscaler. “Done” for this platform looks like:
- A clean path to deploy and roll back GPU workloads (Helm/Kustomize + Flux).
- A repeatable way to stand up new services under `services/flexinfer` without snowflake cluster changes.
- Guardrails: secrets management, resource isolation, observability, and upgrades that don’t require heroics.
Layer 0: Harvester as the substrate
Harvester (KubeVirt on Kubernetes) is a practical middle ground between “pure bare metal k8s” and “traditional virtualization with a separate storage stack”:
- VMs for cluster nodes: workload cluster nodes are just VMs with predictable sizing.
- Storage: Longhorn as the default makes PVC behavior consistent (and debuggable).
- Networking: I keep it simple and avoid cleverness unless I need it (GPU inference doesn’t benefit from exotic networking).
The operational win is that node lifecycle becomes “treat nodes like cattle again,” even when the underlying hardware is not.
Layer 1: K3s for the workload cluster
I’m using K3s for the workload cluster because it’s lightweight, well-understood, and easy to repair. The tradeoff is that you have to be disciplined about what you add:
- Prefer “one controller per concern” (Flux, cert-manager, ingress, observability) instead of kitchen-sink bundles.
- Keep the cluster API surface small: fewer CRDs, fewer moving parts.
Layer 2: GitLab + Flux GitOps (no Fleet)
The contract I want is simple: the cluster is a projection of Git.
At a high level:
- `services/*` repos build and test artifacts in GitLab CI.
- Images publish to a registry with immutable tags.
- A separate GitOps repo declares desired state (HelmRelease/Kustomization).
- Flux reconciles that state into the cluster.
A minimal Flux Kustomization looks like this (pseudocode-ish, but close to what I run):
```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flexinfer-platform
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/prod
  prune: true
  sourceRef:
    kind: GitRepository
    name: platform-gitops
```
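That Kustomization references a `GitRepository` source named `platform-gitops`. A minimal sketch of that source object looks like the following; the URL and secret name are placeholders, not my actual setup:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform-gitops
  namespace: flux-system
spec:
  interval: 5m
  url: https://gitlab.example.com/platform/gitops.git  # placeholder URL
  ref:
    branch: main
  secretRef:
    name: gitops-read-token  # hypothetical read-only deploy-token secret
```

Flux polls the repo on `interval`, and the Kustomization reconciles whatever lands on `main`.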
The key operational move is to make “how things ship” legible:
- CI owns tests + builds + SBOM/signing (if/when needed).
- GitOps owns rollout policy and drift correction.
- The cluster is not a place for manual edits.
GPU enablement: make it boring
GPU support fails in predictable ways: driver mismatch, runtime mismatch, or scheduling mismatch. I aim for boring invariants:
- Drivers/runtime: pick a known-good AMD driver + ROCm combo and don’t drift casually.
- Device plugin / operator: deploy via GitOps, pin versions, and treat upgrades like real changes.
- Scheduling: label/taint GPU nodes and make workloads declare intent.
On AMD nodes, my “sanity check” is intentionally unglamorous: confirm `/dev/kfd` + `/dev/dri` exist, validate the host with `rocminfo`/`rocm-smi`, then validate the container runtime can see the device nodes (before I even look at model code).
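The container-level half of that check can be captured as a throwaway debug pod. This is a sketch, not my exact manifest: the node name and image tag are illustrative, and it deliberately uses `privileged` + hostPath instead of the device plugin, because the point is to test device-node visibility below the plugin:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rocm-sanity          # disposable debug pod
spec:
  restartPolicy: Never
  nodeName: gpu-node-1       # illustrative: pin to the node under test
  containers:
    - name: rocminfo
      image: rocm/dev-ubuntu-22.04:6.0  # any ROCm-enabled image works; tag is illustrative
      command: ["rocminfo"]
      securityContext:
        privileged: true     # debug only; real workloads request amd.com/gpu instead
      volumeMounts:
        - { name: kfd, mountPath: /dev/kfd }
        - { name: dri, mountPath: /dev/dri }
  volumes:
    - { name: kfd, hostPath: { path: /dev/kfd } }
    - { name: dri, hostPath: { path: /dev/dri } }
```

If `rocminfo` lists the agents here but a device-plugin-scheduled pod can’t see them, the problem is in the plugin or runtime config, not the driver.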
Example patterns I rely on:
- Taint GPU nodes and require a toleration.
- Use node selectors or affinity for “GPU-capable” pools.
- Request GPUs explicitly (for me: `amd.com/gpu: 1` via an AMD GPU device plugin) and enforce limits with policy.
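Put together, a workload that declares intent looks roughly like this; the taint key and pool label are my conventions in sketch form, and the image is a placeholder:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-server
spec:
  nodeSelector:
    gpu-pool: amd            # example label applied to GPU-capable nodes
  tolerations:
    - key: gpu               # example taint key on GPU nodes
      operator: Exists
      effect: NoSchedule
  containers:
    - name: server
      image: registry.example.com/flexinfer/server:1.2.3  # placeholder image
      resources:
        requests:
          amd.com/gpu: 1     # resource name exposed by the AMD device plugin
        limits:
          amd.com/gpu: 1     # extended resources require requests == limits
```

A pod missing the toleration or the `amd.com/gpu` request simply won’t schedule onto a GPU node, which is the fail-fast behavior described above.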
If the platform is healthy, a GPU workload should fail fast and obviously when it’s misconfigured.
One AMD footgun: KFD wants all GPUs bound
On my RX 7900 XTX node, the integrated GPU is a trap: if it isn’t bound to the amdgpu driver, /dev/kfd breaks and everything above it becomes “mysteriously flaky.”
The fix is unglamorous but effective:
- keep the iGPU bound,
- use the AMD device plugin in “mixed” resource naming mode so I can request specific GPU families,
- and select the discrete GPU in workloads via `HIP_VISIBLE_DEVICES`.
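In a container spec, that last step is a one-line env fragment; the index is illustrative, so verify the enumeration order with `rocm-smi` on the node first:

```yaml
# container fragment: select the discrete GPU when the iGPU is also bound
env:
  - name: HIP_VISIBLE_DEVICES
    value: "0"   # index of the discrete GPU; "0" is illustrative, check rocm-smi
```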
That’s the kind of thing I mean by “make it boring”: encode the footguns into the platform so model deploys don’t have to rediscover them.
Another footgun: gfx1100 needs pinned ROCm knobs
On consumer RDNA3 (gfx1100), I’ve repeatedly hit “it runs, then it hangs or segfaults under load” issues when I let serving frameworks auto-select attention implementations.
So I treat a few environment variables as part of the platform contract and set them centrally (not per app):
- `HSA_OVERRIDE_GFX_VERSION=11.0.0`
- `PYTORCH_ROCM_ARCH=gfx1100`
- `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`
- and, for vLLM specifically: force the V0 engine and disable AITER.
These aren’t “performance tweaks.” They’re the difference between “I can leave this running” and “I’m SSHing into a node at midnight.”
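One way to set these centrally rather than per app is a shared ConfigMap that GPU workloads pull in via `envFrom`. This is a sketch: the ConfigMap name and namespace are mine, and the exact vLLM variable names depend on the vLLM version you pin:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rocm-gfx1100-env     # hypothetical name, referenced by GPU workloads
  namespace: flexinfer       # hypothetical namespace
data:
  HSA_OVERRIDE_GFX_VERSION: "11.0.0"
  PYTORCH_ROCM_ARCH: "gfx1100"
  TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL: "1"
```

Workloads then reference it with `envFrom: [{ configMapRef: { name: rocm-gfx1100-env } }]`, so a knob change is one Git commit, not an edit across every Deployment.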
Versioning: What I Pin (and Why)
The biggest operational lesson from GPU homelab work is that “latest” is not a plan.
I treat these as part of the platform contract and pin them:
- Harvester (virtualization substrate)
- K3s (workload cluster distribution)
- Flux + Helm controller versions
- Container images that depend on GPU runtimes
- ROCm + kernel/driver combo on the AMD nodes
When I do upgrade, I do it like a production change: one variable at a time, with an easy rollback and a short “known-good” note in Git.
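In Flux terms, “pinned” means an exact chart version in the HelmRelease rather than a semver range. A sketch, with illustrative chart and repository names:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: amd-gpu-device-plugin
  namespace: kube-system
spec:
  interval: 30m
  chart:
    spec:
      chart: amd-gpu         # illustrative chart name
      version: "0.12.0"      # exact pin, not ">=0.12.0"
      sourceRef:
        kind: HelmRepository
        name: amd-gpu-helm   # hypothetical HelmRepository object
```

Bumping `version` in Git is then the upgrade, and `git revert` is the rollback.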
What this stack is backing now
Infrastructure is only interesting if it enables product velocity. The useful update here is that this platform is now backing real product surfaces, not just platform diagrams:
- FlexInfer: the control plane is effectively production-ready, with hardened routing, serverless activator behavior, multi-cluster support, and advanced model-management features documented in the current implementation status.
- flexinfer-site: this site is no longer brochureware. It now serves multi-project docs, playground hubs, case studies, and public demos from the same GitLab -> Flux delivery path. App-level metrics and the `/metrics` endpoint are live; the remaining monitoring rollout work is scrape + alert wiring.
- FlexDeck public API: the public cluster overview, CI pipeline, model gallery, and related demos are backed by sanitized read-only proxy routes. The remaining hardening work is the boring but necessary layer: rate limiting and narrower RBAC.
That is the part I care about most. A private GPU platform becomes interesting once it is good enough to support documentation, demos, operational tooling, and product rollout without each one inventing a separate deployment story.
What I’d do next
- Add policy (Kyverno/Gatekeeper) for “no GPU workloads without requests/limits”.
- Add a small SLO dashboard for inference: p95 latency, error rate, GPU utilization, and saturation.
- Formalize “break-glass” operational procedures (because you will need them).
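The first of those items can be sketched as a Kyverno ClusterPolicy. This is a sketch under assumptions: the `workload-class: gpu` label is my hypothetical convention for marking GPU workloads, and field names should be checked against the Kyverno version you pin:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-gpu-requests
spec:
  validationFailureAction: Enforce
  rules:
    - name: gpu-workloads-must-request-gpus
      match:
        any:
          - resources:
              kinds: ["Pod"]
              selector:
                matchLabels:
                  workload-class: gpu   # hypothetical label convention
      validate:
        message: "GPU workloads must set amd.com/gpu requests and limits."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    amd.com/gpu: "?*"   # any non-empty value
                  limits:
                    amd.com/gpu: "?*"
```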
If you’re building something similar, I’m happy to compare notes, especially around upgrade playbooks and the failure modes you only see after a few months of real usage.