Gaming Mode: Turning an Inference GPU Node into a Moonlight Host — Declaratively

July 1, 2026 · 4 min read

Tags: lab, gpu, kubernetes, rocm, sunshine, moonlight, gaming, scheduling, homelab


My FlexInfer cluster had a "gaming mode" for months. It had never run a single game.

It was a scaffold: a SteamBackend that launched Xvfb (software OpenGL — no GPU acceleration), logged into Steam anonymously (which can't host Remote Play), and exposed a port that k8s networking would never route. Green unit tests, zero games. So I did the thing I should have done before writing any of it: I ran a kill-test.

This post is how the never-run stub became a real, hardware-accelerated, declaratively-orchestrated Sunshine/Moonlight host — and the four traps I hit turning one of my inference GPUs into a game console for an evening.

TL;DR

The old Xvfb/Steam-Remote-Play path could never have worked. Replaced it with Sunshine + Moonlight (the de-facto headless-Linux streaming stack) + gamescope/sway on Mesa RADV.
Kill-test first. In a privileged container with /dev/dri on the 7900 XTX, vulkaninfo showed RADV bound to the discrete card (not llvmpipe), and ffmpeg VA-API hardware-encoded H.264, HEVC, and AV1 at 5.7×, 7.9×, and 8.1× realtime respectively. The substrate was real; everything else was engineering.
Gaming is now a GamingSession CR: kubectl apply drains inference on the node and flips its runtime into gaming mode via the controller; kubectl delete reverts it (a finalizer calls SetMode(inference)).
To free the node, I live-migrated the gemma4-26B primary to another 7900 XTX with zero chat gap — and hit a stale forcePromotion flag that silently pinned the old model as leader for 30 minutes.
The node's runtime pod runs hostNetwork so a LAN Moonlight client pairs straight to it on Sunshine's fixed ports.

Kill-test: does the substrate even work?

The load-bearing question wasn't "can I wire up Sunshine" — it was "can a ROCm inference container even do GPU graphics + hardware video encode?" ROCm ships the compute stack, not Mesa's graphics/Vulkan userspace or a VA-API encoder. My runtime image had none of it.

So before touching the controller, I threw a throwaway privileged container at the 7900 XTX node with /dev/dri mounted and installed just the graphics userspace:

vulkaninfo --summary
  GPU1: AMD Radeon RX 7900 XTX (RADV NAVI31)   driverName = radv   Mesa 25.2.8

RADV bound to the discrete card. Then the real question — hardware encode, not just enumeration — by actually encoding frames:

ffmpeg -vaapi_device /dev/dri/renderD128 -f lavfi -i testsrc=1920x1080:rate=60 \
       -vf format=nv12,hwupload -c:v <codec> -f null -

Codec	Frames	Speed
h264_vaapi	180	5.7× realtime
hevc_vaapi	180	7.9× realtime
av1_vaapi	180	8.1× realtime

Multi-× realtime with real bitstream output — that's VCN 4.0 hardware, not CPU x264. AV1 encode on RDNA3, in a container, on a Kubernetes node. The bet held; the pod-based approach was viable. (I also did the disconfirming search: Sunshine needs CAP_SYS_ADMIN for KMS capture, VA-API device selection is buggy — both handled by running privileged and pinning the render node explicitly.)
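The three encode runs are easy to script. Here is a minimal sketch, assuming ffmpeg is on PATH inside the container and that the runs were capped with -frames:v 180 (the command above doesn't show how the 180-frame limit was set, so that flag is my assumption). The helper names build_cmd, parse_speed, and run_kill_test are hypothetical. It reads the speed= field that ffmpeg prints in its progress line on stderr.

```python
import re
import subprocess
from typing import Dict, List, Optional

RENDER_NODE = "/dev/dri/renderD128"  # pinned explicitly, as in the command above
CODECS = ["h264_vaapi", "hevc_vaapi", "av1_vaapi"]


def build_cmd(codec: str, frames: int = 180) -> List[str]:
    """Build the kill-test command line for one codec.

    -frames:v is an assumption: it stops the run after `frames` encoded frames.
    """
    return [
        "ffmpeg", "-hide_banner",
        "-vaapi_device", RENDER_NODE,
        "-f", "lavfi", "-i", "testsrc=1920x1080:rate=60",
        "-vf", "format=nv12,hwupload",
        "-c:v", codec,
        "-frames:v", str(frames),
        "-f", "null", "-",
    ]


def parse_speed(stderr: str) -> Optional[float]:
    """Return the last `speed=N.Nx` value in ffmpeg's progress output, if any."""
    matches = re.findall(r"speed=\s*([0-9.]+)x", stderr)
    return float(matches[-1]) if matches else None


def run_kill_test() -> Dict[str, Optional[float]]:
    """Encode with each codec and collect its realtime multiple (needs the GPU)."""
    results: Dict[str, Optional[float]] = {}
    for codec in CODECS:
        proc = subprocess.run(build_cmd(codec), capture_output=True, text=True)
        results[codec] = parse_speed(proc.stderr) if proc.returncode == 0 else None
    return results
```

A None result means that codec failed to initialise or never reported a speed. For a kill-test that is a useful answer too.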

The declarative design

The old imperative toggle (PUT /api/v1/mode) became a first-class CRD. Create one and the node games; delete it and the node serves models again:

apiVersion: ai.flexinfer/v1alpha2
kind: GamingSession
metadata: { name: gaming-7900xtx, namespace: flexinfer-system }
spec: { nodeName: cblevins-7900xtx, mode: gaming }

The controller reconciles it: find the node's runtime pod, drain any loaded models, and PUT /api/v1/mode {gaming}. The runtime then launches sunshine-headless.sh — a headless sway (wlroots) session on Mesa RADV, with Sunshine capturing its output and hardware-encoding it. A finalizer guarantees that deleting the CR reverts the node to inference, so the contract is simply: CR exists ⇒ gaming.
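The reconcile-plus-finalizer contract can be sketched in a few lines. This is an in-memory model, not the controller's code: FakeRuntime stands in for the runtime pod's HTTP API, the finalizer string is made up, and the `deleting` flag stands in for a set metadata.deletionTimestamp.

```python
from dataclasses import dataclass, field
from typing import List

FINALIZER = "ai.flexinfer/revert-to-inference"  # hypothetical finalizer name


@dataclass
class FakeRuntime:
    """Stand-in for the node's runtime pod (the PUT /api/v1/mode endpoint)."""
    mode: str = "inference"
    loaded_models: List[str] = field(default_factory=list)

    def drain(self) -> None:
        self.loaded_models.clear()

    def set_mode(self, mode: str) -> None:
        self.mode = mode


@dataclass
class GamingSession:
    node_name: str
    mode: str = "gaming"
    finalizers: List[str] = field(default_factory=list)
    deleting: bool = False  # models metadata.deletionTimestamp being set


def reconcile(session: GamingSession, runtime: FakeRuntime) -> str:
    """Run one reconcile pass and return a short label for what it did."""
    if session.deleting:
        if FINALIZER in session.finalizers:
            runtime.set_mode("inference")         # revert before releasing the CR
            session.finalizers.remove(FINALIZER)  # now the object can be removed
            return "reverted"
        return "gone"
    if FINALIZER not in session.finalizers:
        session.finalizers.append(FINALIZER)      # register before any side effect
        return "finalizer-added"
    if runtime.mode != session.mode:
        runtime.drain()
        runtime.set_mode(session.mode)
        return "switched"
    return "noop"
```

Adding the finalizer before touching the runtime is the important ordering here. If the mode flipped first and the CR were deleted before the finalizer landed, nothing would be left to revert the node.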

The gaming node's runtime image is a dedicated gfx1100-gaming profile (Sunshine + sway + Mesa RADV + VA-API, no quantizer/llama.cpp bloat), and its DaemonSet runs hostNetwork so Moonlight reaches Sunshine on the ports it hardcodes (47984/47989/48010 TCP, 47998–48010 UDP — outside NodePort's range, so hostNetwork is the standard answer).
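The NodePort claim is quick to check. The sketch below assumes the Kubernetes default NodePort range of 30000-32767 (a cluster can change it with the kube-apiserver --service-node-port-range flag) and uses the port lists quoted above. The function name is hypothetical.

```python
NODEPORT_RANGE = range(30000, 32768)  # Kubernetes default: 30000-32767 inclusive

SUNSHINE_TCP = [47984, 47989, 48010]
SUNSHINE_UDP = list(range(47998, 48011))  # 47998-48010 inclusive


def outside_nodeport(ports):
    """Return the ports a default-range NodePort Service cannot expose."""
    return [p for p in ports if p not in NODEPORT_RANGE]


all_ports = SUNSHINE_TCP + SUNSHINE_UDP
unreachable = outside_nodeport(all_ports)
print(f"{len(unreachable)} of {len(all_ports)} ports fall outside the default NodePort range")
```

Every port lands above 32767. A NodePort Service would therefore need remapped ports, and a client that expects Sunshine's fixed ports would not find them.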

Freeing the node (and a 30-minute trap)

To hand the 7900 XTX to gaming, its always-on chat primary — gemma4-26B — had to move to the other 7900 XTX node. I did that as a live migration: promote the sister instance to co-primary first (so gpt-4/quality-chat never lose a backend), then de-advertise the original. Zero chat gap.
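The order of those two steps is what makes the gap zero. A toy model shows why. It assumes an alias is just a set of backends, and the backend names are placeholders.

```python
from typing import List, Set, Tuple

Step = Tuple[str, str]  # ("add" | "remove", backend name)


def backend_history(initial: Set[str], steps: List[Step]) -> List[Set[str]]:
    """Apply routing changes in order, recording the backend set after each."""
    current = set(initial)
    history = []
    for action, backend in steps:
        if action == "add":
            current.add(backend)
        elif action == "remove":
            current.discard(backend)
        else:
            raise ValueError(f"unknown action: {action}")
        history.append(set(current))
    return history


def has_gap(history: List[Set[str]]) -> bool:
    """True if the alias had zero backends at any point."""
    return any(len(backends) == 0 for backends in history)


promote_first = [("add", "sister"), ("remove", "original")]
demote_first = [("remove", "original"), ("add", "sister")]
```

With promote_first the alias goes from one backend to two and back to one. With demote_first it passes through zero, and that is the chat gap.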

Except the sister wouldn't take the card. For half an hour it sat Queued behind a model I had already de-advertised, scaled to zero, and demoted. Every lever I pulled — litellm.enabled: false, minReplicas: 0, warmPolicy: ondemand, even bumping the sister's priority above it — did nothing.

The culprit: a leftover gpu.forcePromotion: true on the old model. The shared-GPU election treats a force-promoted member as the unconditional leader, bypassing priority and warmth entirely. One stale flag silently overrode everything else. Once I removed it, the election immediately picked the sister, which warmed on the freed card.

Lesson: when you de-advertise a shared-GPU leader, clear forcePromotion too — not just the routing.
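To see why no other lever mattered, here is a minimal sketch of an election with that shape. The rule order is my reading of the behaviour described above, not the real FlexInfer code: the Member fields and the priority-then-warmth tie-break are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Member:
    name: str
    priority: int
    warm: bool = False
    force_promotion: bool = False  # mirrors gpu.forcePromotion


def elect_leader(members: List[Member]) -> Optional[Member]:
    """Pick the shared-GPU leader.

    Assumed rule order: a force-promoted member wins outright; otherwise
    the highest priority wins, with warmth as the tie-breaker.
    """
    if not members:
        return None
    forced = [m for m in members if m.force_promotion]
    if forced:
        return forced[0]  # priority and warmth are never consulted
    return max(members, key=lambda m: (m.priority, m.warm))
```

In this model, a member with priority 0 and force_promotion=True beats a warm sister with priority 100. That matches the half hour of levers that did nothing.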

Three more traps worth naming

A new CRD needs RBAC in two places. I shipped the GamingSession controller but only regenerated config/rbac/role.yaml, not the Helm chart's ClusterRole. Flux auto-deployed the new controller, which crashlooped on "gamingsessions is forbidden" and froze all model reconciliation until I added the rule to the chart. Existing pods kept serving, but nothing new reconciled for ~20 minutes.
SetMode(gaming) drains runtime-managed models, not dedicated Deployments. Models with raw pvc:// sources run as their own Deployments, so the mode switch doesn't touch them — they have to be de-advertised separately or they keep squatting on the card.
steamcmd fails a non-interactive image build (Steam License Agreement was DECLINED) without a debconf preseed. Sunshine streams any app without Steam, so I dropped steamcmd from the image; Steam/Proton is a follow-up.
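The first trap is the kind a small pre-merge check could catch. Below is a sketch that compares the rules of the generated role with the chart's ClusterRole. It assumes both are already parsed into Python lists of rule dicts and does not expand "*" wildcards. The function names and the "models" resource in the usage example are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Rule = Dict[str, List[str]]       # one entry of a ClusterRole's `rules:` list
Grant = Tuple[str, str, str]      # (apiGroup, resource, verb)


def grants(rules: List[Rule]) -> Set[Grant]:
    """Flatten rules into (apiGroup, resource, verb) triples."""
    out: Set[Grant] = set()
    for rule in rules:
        for group in rule.get("apiGroups", []):
            for resource in rule.get("resources", []):
                for verb in rule.get("verbs", []):
                    out.add((group, resource, verb))
    return out


def missing_from_chart(generated: List[Rule], chart: List[Rule]) -> Set[Grant]:
    """Permissions the generated role grants that the Helm ClusterRole lacks."""
    return grants(generated) - grants(chart)


generated = [
    {"apiGroups": ["ai.flexinfer"], "resources": ["models"],
     "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["ai.flexinfer"], "resources": ["gamingsessions"],
     "verbs": ["get", "list", "watch"]},
]
chart = [generated[0]]  # the chart was not regenerated

for grant in sorted(missing_from_chart(generated, chart)):
    print("chart is missing:", grant)
```

Run in CI, a non-empty result would fail the build before Flux ever sees the new controller.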

The result

Pair Moonlight at the node's IP, open https://<node>:47990 to set a login, enter the PIN, and stream. GPU-rendered frames, hardware-encoded (H.264/HEVC/AV1), off a node that was serving a 26B model an hour earlier. When I'm done, kubectl delete gamingsession puts it back to work.

The whole thing shipped as five slices — kill-test, Sunshine backend + image, the CRD/controller, hostNetwork, and the ops layer (a flexinfer_runtime_node_mode metric, an opt-in idle auto-revert, and a runbook). The stub that never ran a game is now a GPU node that does both jobs, and switches between them with one kubectl command.

Not bad for hardware that's supposed to be "just" an inference box.
