Two-Lane Text GPU Allocation: Quality + Vision/Fast (Plus a Media Lane)
February 9, 2026 · 7 min read
Five models crammed onto one GPU, and only one could run at a time. That was the state of my FlexInfer cluster before this week. The other 7900 XTX node had 12GB of free VRAM sitting idle. I spent an afternoon fixing the layout, and the result is a setup that's faster, more available, and actually uses both text-generation GPUs.
This post covers the before/after, the design decisions, the benchmarks, and the operational details that made it work.
TL;DR
- Split 5 models from one 7900 XTX node into two dedicated text lanes: quality (14B always-on + 32B/R1 on-demand) and vision+fast (vision always-on + 8B when vision idles).
- FlexInfer's `gpu.shared` groups enforce mutual exclusion per GPU. `gpu.priority` determines which model wins: higher priority stays loaded. `serviceLabels` drive proxy routing, and OpenAI-compatible aliases (`gpt-4`, `copilot`, `gpt-4o`) resolve through labels (and can overlap for failover).
- Measured 74.8 TPS sustained on qwen3-14b-mlc (7900 XTX, MLC-LLM) and 77.7 TPS on qwen3-vl-vision (llama.cpp) for text workloads. All 12 aliases route correctly, with ~300ms latency for short, warm requests.
- Removed the 0.5B vLLM test model and a dead NFS reference. Net result: fewer models, better distribution, more throughput.
The problem: 5 models, 1 GPU
Before this change, cblevins-7900xtx (24GB RX 7900 XTX) was running five text models in a single shared group. Only one could be active at a time:
| Node | Models | VRAM Used |
|---|---|---|
| cblevins-7900xtx (24GB) | qwen3-14b-mlc, qwen3-8b-fast, qwen3-32b-quality, deepseek-r1, qwen25-05b-vllm | ~16GB (1 active) |
| cblevins-5930k (24GB) | qwen3-vl-vision | ~12GB |
| cblevins-gtx980ti (6GB) | sdxl-turbo-imagegen | ~5GB |
The 5930k node had 12GB of free VRAM doing nothing. Meanwhile, on the 7900xtx, requesting a non-primary model meant a cold swap: drain the active model, load the new one, serve the request. That's 30–120 seconds of latency depending on model size.
The design: two lanes
The fix was straightforward: split the text models across both 7900 XTX nodes by workload type.
Quality lane (cblevins-7900xtx): the 14B model stays always-on as the universal fallback. The 32B and R1 models activate on-demand for premium quality or reasoning, preempting 14B temporarily.
Vision + fast lane (cblevins-5930k): the vision model stays always-on. When vision idles out (10-minute timeout), the 8B fast model activates automatically, giving you a second text endpoint for copilot and chat traffic.
The final layout:
cblevins-7900xtx (24GB) cblevins-5930k (24GB) cblevins-gtx980ti (6GB)
"Quality Lane" "Vision + Fast Lane" "Media Lane"
──────────────────────── ──────────────────────── ─────────────────────
Shared: 7900xtx-quality Shared: 5930k-models sdxl-turbo [ON]
├─ qwen3-14b-mlc [P:100] ├─ qwen3-vl-vision [P:100]
├─ qwen3-32b-qual [P:80] └─ qwen3-8b-fast [P:90]
└─ deepseek-r1 [P:70]
How shared groups work
FlexInfer's gpu.shared field creates mutual exclusion groups. Models with the same shared value compete for the same GPU, and gpu.priority determines which model stays loaded when multiple want to run.
spec:
  gpu:
    shared: 7900xtx-quality
    priority: 100  # highest in group = stays loaded
In the quality lane, 14b-mlc at P:100 always wins over 32b-quality (P:80) and deepseek-r1 (P:70). When someone explicitly requests the 32B model, FlexInfer preempts 14b-mlc, loads 32B, serves the request, and then after the idle timeout, 14b-mlc reclaims the GPU.
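To make the pecking order concrete, here's a condensed sketch of the three quality-lane models side by side (only the `gpu` stanzas shown; everything else in each Model spec is omitted):

# qwen3-14b-mlc — always-on fallback; highest priority, reclaims the GPU after preemption
spec:
  gpu:
    shared: 7900xtx-quality
    priority: 100
---
# qwen3-32b-quality — loads on explicit request, yields back after the idle timeout
spec:
  gpu:
    shared: 7900xtx-quality
    priority: 80
---
# deepseek-r1 — on-demand reasoning, lowest priority in the group
spec:
  gpu:
    shared: 7900xtx-quality
    priority: 70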
The minReplicas field controls the "on" vs "available" distinction:
- `minReplicas: 1` = always-on (vision, 14b-mlc)
- `minReplicas: 0` = scale-to-zero when preempted (8b-fast activates only when vision idles); see the sketch below
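Here's a minimal sketch of how the two fields combine on the vision+fast lane (only the relevant fields shown; the rest of each Model spec is unchanged):

# qwen3-vl-vision — always-on, owns the GPU by default
spec:
  minReplicas: 1
  gpu:
    shared: 5930k-models
    priority: 100
---
# qwen3-8b-fast — scale-to-zero; activates only after vision's 10-minute idle timeout
spec:
  minReplicas: 0
  gpu:
    shared: 5930k-models
    priority: 90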
Alias routing: serviceLabels, not LiteLLM
One thing I learned during benchmarking: FlexInfer's proxy routes by serviceLabels, not by the litellm.aliases field. The litellm section is for LiteLLM gateway discovery. The proxy resolves model names by matching against the serviceLabels array on each Model CR.
This means if you want gpt-4 to resolve to your 14B model, you need:
spec:
  serviceLabels:
    - fast-chat
    - quality-chat
    - gpt-4          # OpenAI-compatible alias
    - gpt-3.5-turbo
    - copilot
There's a CRD-enforced limit of 10 service labels per model. I hit this when I tried to add all the aliases to 14b-mlc (which had 8 existing labels plus 6 new ones). The fix was dropping the redundant -text variants since -chat covers the same routing purpose.
Shared labels and priority routing
When two models share a service label (e.g., both 14b-mlc and 8b-fast have fast-chat), the proxy routes to whichever model is Ready. If both are Ready (during coding sessions with no vision traffic), the proxy load-balances between them. If one is Idle/preempted, traffic routes to the other automatically.
One gotcha: if the same label exists on a Ready model and an Idle model, the proxy may try to activate the Idle one and timeout. I hit this with reasoning: it was on both deepseek-r1 (Idle, P:70) and 14b-mlc (Ready, P:100). The proxy picked R1, tried to activate it, but R1 couldn't preempt the higher-priority 14b-mlc. The fix: remove overlapping labels from lower-priority models that can't self-activate. Use the direct model name (deepseek-r1-reasoning) when you specifically want R1.
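After the fix, the label placement looks roughly like this. The exact labels are illustrative; the point is that the on-demand model keeps only labels that don't overlap with the higher-priority, always-Ready model:

# qwen3-14b-mlc (P:100, always Ready) keeps the overlapping aliases
spec:
  serviceLabels:
    - reasoning
    - o1-preview
---
# deepseek-r1 (P:70, on-demand) carries no shared aliases;
# callers that specifically want R1 use its direct name, deepseek-r1-reasoning
spec:
  serviceLabels:
    - deepseek-r1-reasoning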
Benchmarks
All measurements taken from inside the cluster (pod-to-service, no ingress overhead). Three runs per test, averaged. Workload: single-request, non-streaming, measured with curl -w '%{time_starttransfer}' as a convenient wall time. (With non-streaming responses, it's close to time-to-completion and good enough for relative comparisons.)
Quality lane (qwen3-14b-mlc on 7900 XTX, MLC-LLM)
| Alias | Max tokens | Avg wall | Avg TPS |
|---|---|---|---|
| qwen3-14b-mlc (direct) | 20 | 0.372s | 55.1 |
| fast-chat | 100 | 1.337s | 74.8 |
| quality-chat | 80 | 1.103s | 72.6 |
| gpt-4 | 10 | 0.258s | 39.1 |
| gpt-3.5-turbo | 30 | 0.448s | 67.1 |
| copilot | 50 | 0.778s | 64.8 |
| textgen | 100 | 1.354s | 73.8 |
| reasoning | 60 | 0.854s | 70.5 |
| o1-preview | 80 | 1.080s | 74.1 |
Sustained throughput:
| Length | Avg wall | Avg TPS |
|---|---|---|
| 300 tokens | 4.0s | 74.9 |
| 500 tokens | 12.1s | 41.4 |
The TPS drop at 500 tokens is expected: the KV cache grows with sequence length, and per-token compute/memory traffic increases as context grows. For typical copilot bursts (10–100 tokens), you get the full ~75 TPS.
Vision lane (qwen3-vl-vision on 5930k, llama.cpp)
| Alias | Max tokens | Avg wall | Avg TPS |
|---|---|---|---|
| vision | 60 | 0.773s | 77.7 |
| gpt-4o | 2 | 0.071s | 28.3 |
| ocr | 20 | 0.290s | 69.2 |
The vision model (8B, Q4_K_M, llama.cpp) slightly outperforms the 14B text model on TPS, likely because it's smaller and easier to keep saturated.
Routing burst (all 12 aliases, 5 tokens each)
| Alias | Latency | Routed to |
|---|---|---|
| fast-chat | 0.196s | qwen3-14b-mlc |
| gpt-4 | 0.139s | qwen3-14b-mlc |
| copilot | 0.153s | qwen3-14b-mlc |
| quality-chat | 0.185s | qwen3-14b-mlc |
| gpt-3.5-turbo | 0.189s | qwen3-14b-mlc |
| textgen | 0.301s | qwen3-14b-mlc |
| reasoning | 0.165s | qwen3-14b-mlc |
| o1-preview | 0.181s | qwen3-14b-mlc |
| vision | 0.124s | qwen3-vl-vision |
| gpt-4o | 0.094s | qwen3-vl-vision |
| ocr | 0.104s | qwen3-vl-vision |
| dall-e-3 | 0.061s | sdxl-turbo-imagegen |
12/12 aliases routed correctly, all at or under ~0.3s for short requests.
Implementation details
Storage: NVMe hostPath PVs
The 8B model needed to be compiled on the 5930k node (same gfx1100 arch as the other 7900 XTX, but it needs its own local copy of the compiled artifacts). I added a 20Gi hostPath PV/PVC on the 5930k's NVMe:
spec:
  hostPath:
    path: /var/lib/flexinfer/qwen3-8b-abliterated-mlc-nvme
    type: DirectoryOrCreate
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ['cblevins-5930k']
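The matching claim is a plain static bind. A sketch, with the PVC/PV names and namespace assumed for illustration:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen3-8b-mlc-nvme          # name assumed
  namespace: flexinfer             # namespace assumed
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ""             # empty class = bind statically to the hostPath PV
  volumeName: qwen3-8b-mlc-nvme-pv # assumed PV name
  resources:
    requests:
      storage: 20Gi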
Compile job
MLC-LLM requires a GPU-local compilation step to produce the .so library for ROCm gfx1100. I created a one-time Kubernetes Job targeting the 5930k node. The compile job downloads from HuggingFace, quantizes to q4f32_1, generates config, and compiles the model library, all in one shot.
One operational detail: the compile job needs amd.com/gpu: 1, which means it competes with the vision model for the GPU. I had to scale the vision deployment to zero, let the compile job run, then scale vision back up. FlexInfer's shared group preemption operates at the application layer, not the K8s scheduler layer, so the scheduler can't preempt a running pod for a Job.
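For reference, the Job looked roughly like this. The image and the compile script are placeholders, the PVC is the one from the previous section, and the GPU request is the part that forces the scale-down dance:

apiVersion: batch/v1
kind: Job
metadata:
  name: qwen3-8b-mlc-compile            # name assumed
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        kubernetes.io/hostname: cblevins-5930k
      containers:
        - name: compile
          image: mlc-llm-rocm:latest    # placeholder image with the MLC-LLM + ROCm gfx1100 toolchain
          # placeholder script: download from HuggingFace, quantize to q4f32_1,
          # generate config, and compile the model library in one shot
          command: ["/bin/sh", "-c", "/scripts/compile-qwen3-8b.sh"]
          resources:
            limits:
              amd.com/gpu: 1            # competes with the vision model for the GPU
          volumeMounts:
            - name: model-cache
              mountPath: /var/lib/flexinfer/qwen3-8b-abliterated-mlc-nvme
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: qwen3-8b-mlc-nvme   # the PVC from the storage section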
Cache-check timing
After compilation, FlexInfer runs a cache-check job to verify the model files exist on the PVC before transitioning the model to Idle. If the cache-check runs before compilation finishes (which it did in my case), it fails and the model stays in Pending. Deleting the failed cache-check job triggers a re-check, and the model transitions to Idle once it finds the compiled artifacts.
GPUGroup CRDs
I updated the GPUGroup CRDs to reflect the new layout (two groups instead of one). These are v1alpha1 and technically optional; the v1alpha2 gpu.shared field works independently. But keeping the GPUGroup CRDs in sync documents the intended layout and enables anti-thrashing policies.
What I cleaned up
- Removed `qwen25-05b-vllm`: the 0.5B vLLM test model was superseded by the 8B fast model. Removed from kustomization.
- Deleted `qwen3-14b-abliterated.yaml`: a dead file referencing a down NFS server. Not in kustomization, but cluttering the directory.
What I'd do differently
The serviceLabels limit (10 per model) caught me off guard. If I were designing the label strategy from scratch, I'd use fewer, more semantic labels (text, vision, code, image) rather than mirroring every OpenAI model name. The OpenAI aliases are convenient for drop-in compatibility, but they eat into a limited budget.
The compile-before-deploy ordering is also something I'd automate. Right now it's manual: scale down vision, run compile job, scale up vision. A pre-deploy hook or an init container that checks for compiled artifacts would make this smoother.
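A sketch of what that init container could look like, assuming the deployment already mounts the model PVC and the compiled .so lands on it (the filename below is illustrative):

initContainers:
  - name: wait-for-compiled-model
    image: busybox:1.36
    command:
      - /bin/sh
      - -c
      - |
        # Block until the compiled MLC library exists on the shared PVC.
        # Match the filename to whatever the compile job actually produces.
        until [ -f /models/qwen3-8b-abliterated-q4f32_1-rocm.so ]; do
          echo "waiting for compiled model artifacts..."
          sleep 10
        done
    volumeMounts:
      - name: model-cache
        mountPath: /models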
Takeaways
- Distribute models by usage pattern, not just by size. Putting "always-on" and "on-demand" models on the same GPU is fine with shared groups, but splitting across nodes gives you parallelism you can't get from time-sharing.
- `serviceLabels` are the routing primitive. If the proxy can't find your model, check labels first. LiteLLM aliases are for a different routing layer.
- Shared labels on models with different priorities can cause timeouts. Keep aliases on the highest-priority (always-Ready) model, and use direct model names for on-demand models.
- MLC-LLM on gfx1100 sustains ~75 TPS for 14B models at moderate context lengths. That's fast enough for copilot, chat, and code generation workloads on a homelab.
- Both 7900 XTX nodes perform similarly despite one being on a 2014 Haswell-E platform (i7-5930K) and the other on a 2023 Zen 4 (Ryzen 9 7900X3D). For these workloads, inference is GPU-bound enough that CPU generation didn't move the needle.
Related posts:
- Deploying MLC-LLM on Dual RX 7900 XTX GPUs: the VRAM, KV cache, and scheduling debugging that preceded this work.
- Running LLMs on Radeon GPUs: bottom-up ROCm setup guide.
- Hybrid GPU GitOps: the broader GitOps patterns for GPU workloads.