Two-Lane Text GPU Allocation: Quality + Vision/Fast (Plus a Media Lane)
February 9, 2026 · 7 min read
Five models crammed onto one GPU, and only one could run at a time. That was the state of my FlexInfer cluster before this week. The other 7900 XTX node had 12GB of free VRAM sitting idle. I spent an afternoon fixing the layout, and the result is a setup that's faster, more available, and actually uses both text-generation GPUs.
This post covers the before/after, the design decisions, the benchmarks, and the operational details that made it work.
TL;DR
- Split 5 models from one 7900 XTX node into two dedicated text lanes: quality (14B always-on + 32B/R1 on-demand) and vision+fast (vision always-on + 8B when vision idles).
- FlexInfer's `gpu.shared` groups enforce mutual exclusion per GPU. `gpu.priority` determines which model wins: higher priority stays loaded. `serviceLabels` drive proxy routing, and OpenAI-compatible aliases (`gpt-4`, `copilot`, `gpt-4o`) resolve through labels (and can overlap for failover).
- Measured 74.8 TPS sustained on qwen3-14b-mlc (7900 XTX, MLC-LLM) and 77.7 TPS on qwen3-vl-vision (llama.cpp) for text workloads. All 12 aliases route correctly, with ~300ms latency for short, warm requests.
- Removed the 0.5B vLLM test model and a dead NFS reference. Net result: fewer models, better distribution, more throughput.
The problem: 5 models, 1 GPU
Before this change, cblevins-7900xtx (24GB RX 7900 XTX) was running five text models in a single shared group. Only one could be active at a time:
| Node | Models | VRAM Used |
|---|---|---|
| cblevins-7900xtx (24GB) | qwen3-14b-mlc, qwen3-8b-fast, qwen3-32b-quality, deepseek-r1, qwen25-05b-vllm | ~16GB (1 active) |
| cblevins-5930k (24GB) | qwen3-vl-vision | ~12GB |
| cblevins-gtx980ti (6GB) | sdxl-turbo-imagegen | ~5GB |
The 5930k node had 12GB of free VRAM doing nothing. Meanwhile, on the 7900xtx, requesting a non-primary model meant a cold swap: drain the active model, load the new one, serve the request. That's 30–120 seconds of latency depending on model size.
The design: two lanes
The fix was straightforward: split the text models across both 7900 XTX nodes by workload type.
Quality lane (cblevins-7900xtx): the 14B model stays always-on as the universal fallback. The 32B and R1 models activate on-demand for premium quality or reasoning, preempting 14B temporarily.
Vision + fast lane (cblevins-5930k): the vision model stays always-on. When vision idles out (10-minute timeout), the 8B fast model activates automatically, giving you a second text endpoint for copilot and chat traffic.
The final layout:
cblevins-7900xtx (24GB) cblevins-5930k (24GB) cblevins-gtx980ti (6GB)
"Quality Lane" "Vision + Fast Lane" "Media Lane"
──────────────────────── ──────────────────────── ─────────────────────
Shared: 7900xtx-quality Shared: 5930k-models sdxl-turbo [ON]
├─ qwen3-14b-mlc [P:100] ├─ qwen3-vl-vision [P:100]
├─ qwen3-32b-qual [P:80] └─ qwen3-8b-fast [P:90]
└─ deepseek-r1 [P:70]
How shared groups work
FlexInfer's gpu.shared field creates mutual exclusion groups. Models with the same shared value compete for the same GPU, and gpu.priority determines which model stays loaded when multiple want to run.
spec:
  gpu:
    shared: 7900xtx-quality
    priority: 100  # highest in group = stays loaded
In the quality lane, 14b-mlc at P:100 always wins over 32b-quality (P:80) and deepseek-r1 (P:70). When someone explicitly requests the 32B model, FlexInfer preempts 14b-mlc, loads 32B, serves the request, and then after the idle timeout, 14b-mlc reclaims the GPU.
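To make the pecking order concrete, here's a condensed sketch of the three quality-lane models side by side (only the `gpu` stanzas shown; everything else in each Model spec is omitted):

# qwen3-14b-mlc — always-on fallback; highest priority, reclaims the GPU after preemption
spec:
  gpu:
    shared: 7900xtx-quality
    priority: 100
---
# qwen3-32b-quality — loads on explicit request, yields back after the idle timeout
spec:
  gpu:
    shared: 7900xtx-quality
    priority: 80
---
# deepseek-r1 — on-demand reasoning, lowest priority in the group
spec:
  gpu:
    shared: 7900xtx-quality
    priority: 70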
The minReplicas field controls the "on" vs "available" distinction:
- `minReplicas: 1` = always-on (vision, 14b-mlc)
- `minReplicas: 0` = scale-to-zero when preempted (8b-fast activates only when vision idles); see the sketch below
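Here's a minimal sketch of how the two fields combine on the vision+fast lane (only the relevant fields shown; the rest of each Model spec is unchanged):

# qwen3-vl-vision — always-on, owns the GPU by default
spec:
  minReplicas: 1
  gpu:
    shared: 5930k-models
    priority: 100
---
# qwen3-8b-fast — scale-to-zero; activates only after vision's 10-minute idle timeout
spec:
  minReplicas: 0
  gpu:
    shared: 5930k-models
    priority: 90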
Alias routing: serviceLabels, not LiteLLM
One thing I learned during benchmarking: FlexInfer's proxy routes by serviceLabels, not by the litellm.aliases field. The litellm section is for LiteLLM gateway discovery. The proxy resolves model names by matching against the serviceLabels array on each Model CR.
This means if you want gpt-4 to resolve to your 14B model, you need:
spec:
  serviceLabels:
    - fast-chat
    - quality-chat
    - gpt-4          # OpenAI-compatible alias
    - gpt-3.5-turbo
    - copilot
There's a CRD-enforced limit of 10 service labels per model. I hit this when I tried to add all the aliases to 14b-mlc (which had 8 existing labels plus 6 new ones). The fix was dropping the redundant -text variants since -chat covers the same routing purpose.
Shared labels and priority routing
When two models share a service label (e.g., both 14b-mlc and 8b-fast have fast-chat), the proxy routes to whichever model is Ready. If both are Ready (during coding sessions with no vision traffic), the proxy load-balances between them. If one is Idle/preempted, traffic routes to the other automatically.
One gotcha: if the same label exists on a Ready model and an Idle model, the proxy may try to activate the Idle one and timeout. I hit this with reasoning: it was on both deepseek-r1 (Idle, P:70) and 14b-mlc (Ready, P:100). The proxy picked R1, tried to activate it, but R1 couldn't preempt the higher-priority 14b-mlc. The fix: remove overlapping labels from lower-priority models that can't self-activate. Use the direct model name (deepseek-r1-reasoning) when you specifically want R1.
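After the fix, the label placement looks roughly like this. The exact labels are illustrative; the point is that the on-demand model keeps only labels that don't overlap with the higher-priority, always-Ready model:

# qwen3-14b-mlc (P:100, always Ready) keeps the overlapping aliases
spec:
  serviceLabels:
    - reasoning
    - o1-preview
---
# deepseek-r1 (P:70, on-demand) carries no shared aliases;
# callers that specifically want R1 use its direct name, deepseek-r1-reasoning
spec:
  serviceLabels:
    - deepseek-r1-reasoning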
Benchmarks
All measurements taken from inside the cluster (pod-to-service, no ingress overhead). Three runs per test, averaged. Workload: single-request, non-streaming, measured with curl -w '%{time_starttransfer}' as a convenient wall time. (With non-streaming responses, it's close to time-to-completion and good enough for relative comparisons.)
Quality lane (qwen3-14b-mlc on 7900 XTX, MLC-LLM)
| Alias | Max tokens | Avg wall | Avg TPS |
|---|---|---|---|
| qwen3-14b-mlc (direct) | 20 | 0.372s | 55.1 |
| fast-chat | 100 | 1.337s | 74.8 |
| quality-chat | 80 | 1.103s | 72.6 |
| gpt-4 | 10 | 0.258s | 39.1 |
| gpt-3.5-turbo | 30 | 0.448s | 67.1 |
| copilot | 50 | 0.778s | 64.8 |
| textgen | 100 | 1.354s | 73.8 |
| reasoning | 60 | 0.854s | 70.5 |
| o1-preview | 80 | 1.080s | 74.1 |
Sustained throughput:
| Length | Avg wall | Avg TPS |
|---|---|---|
| 300 tokens | 4.0s | 74.9 |
| 500 tokens | 12.1s | 41.4 |
The TPS drop at 500 tokens is expected: the KV cache grows with sequence length, and per-token compute/memory traffic increases as context grows. For typical copilot bursts (10–100 tokens), you get the full ~75 TPS.
Vision lane (qwen3-vl-vision on 5930k, llama.cpp)
| Alias | Max tokens | Avg wall | Avg TPS |
|---|---|---|---|
| vision | 60 | 0.773s | 77.7 |
| gpt-4o | 2 | 0.071s | 28.3 |
| ocr | 20 | 0.290s | 69.2 |
The vision model (8B, Q4_K_M, llama.cpp) slightly outperforms the 14B text model on TPS, likely because it's smaller and easier to keep saturated.
Routing burst (all 12 aliases, 5 tokens each)
| Alias | Latency | Routed to |
|---|---|---|
| fast-chat | 0.196s | qwen3-14b-mlc |
| gpt-4 | 0.139s | qwen3-14b-mlc |
| copilot | 0.153s | qwen3-14b-mlc |
| quality-chat | 0.185s | qwen3-14b-mlc |
| gpt-3.5-turbo | 0.189s | qwen3-14b-mlc |
| textgen | 0.301s | qwen3-14b-mlc |
| reasoning | 0.165s | qwen3-14b-mlc |
| o1-preview | 0.181s | qwen3-14b-mlc |
| vision | 0.124s | qwen3-vl-vision |
| gpt-4o | 0.094s | qwen3-vl-vision |
| ocr | 0.104s | qwen3-vl-vision |
| dall-e-3 | 0.061s | sdxl-turbo-imagegen |
12/12 aliases routed correctly, all at or under ~0.3s for short requests.
Implementation details
Storage: NVMe hostPath PVs
The 8B model needed to be compiled on the 5930k node (same gfx1100 arch as the other 7900 XTX, but it needs its own local copy of the compiled artifacts). I added a 20Gi hostPath PV/PVC on the 5930k's NVMe:
spec:
  hostPath:
    path: /var/lib/flexinfer/qwen3-8b-abliterated-mlc-nvme
    type: DirectoryOrCreate
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ['cblevins-5930k']
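The matching claim is a plain static bind. A sketch, with the PVC/PV names and namespace assumed for illustration:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen3-8b-mlc-nvme          # name assumed
  namespace: flexinfer             # namespace assumed
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ""             # empty class = bind statically to the hostPath PV
  volumeName: qwen3-8b-mlc-nvme-pv # assumed PV name
  resources:
    requests:
      storage: 20Gi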
Compile job
MLC-LLM requires a GPU-local compilation step to produce the .so library for ROCm gfx1100. I created a one-time Kubernetes Job targeting the 5930k node. The compile job downloads from HuggingFace, quantizes to q4f32_1, generates config, and compiles the model library, all in one shot.
One operational detail: the compile job needs amd.com/gpu: 1, which means it competes with the vision model for the GPU. I had to scale the vision deployment to zero, let the compile job run, then scale vision back up. FlexInfer's shared group preemption operates at the application layer, not the K8s scheduler layer, so the scheduler can't preempt a running pod for a Job.
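For reference, the Job looked roughly like this. The image and the compile script are placeholders, the PVC is the one from the previous section, and the GPU request is the part that forces the scale-down dance:

apiVersion: batch/v1
kind: Job
metadata:
  name: qwen3-8b-mlc-compile            # name assumed
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        kubernetes.io/hostname: cblevins-5930k
      containers:
        - name: compile
          image: mlc-llm-rocm:latest    # placeholder image with the MLC-LLM + ROCm gfx1100 toolchain
          # placeholder script: download from HuggingFace, quantize to q4f32_1,
          # generate config, and compile the model library in one shot
          command: ["/bin/sh", "-c", "/scripts/compile-qwen3-8b.sh"]
          resources:
            limits:
              amd.com/gpu: 1            # competes with the vision model for the GPU
          volumeMounts:
            - name: model-cache
              mountPath: /var/lib/flexinfer/qwen3-8b-abliterated-mlc-nvme
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: qwen3-8b-mlc-nvme   # the PVC from the storage section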
Cache-check timing
After compilation, FlexInfer runs a cache-check job to verify the model files exist on the PVC before transitioning the model to Idle. If the cache-check runs before compilation finishes (which it did in my case), it fails and the model stays in Pending. Deleting the failed cache-check job triggers a re-check, and the model transitions to Idle once it finds the compiled artifacts.
GPUGroup CRDs
I updated the GPUGroup CRDs to reflect the new layout (two groups instead of one). These are v1alpha1 and technically optional; the v1alpha2 gpu.shared field works independently. But keeping the GPUGroup CRDs in sync documents the intended layout and enables anti-thrashing policies.
What I cleaned up
- Removed `qwen25-05b-vllm`: the 0.5B vLLM test model was superseded by the 8B fast model. Removed from kustomization.
- Deleted `qwen3-14b-abliterated.yaml`: a dead file referencing a down NFS server. Not in kustomization, but cluttering the directory.
What I'd do differently
The serviceLabels limit (10 per model) caught me off guard. If I were designing the label strategy from scratch, I'd use fewer, more semantic labels (text, vision, code, image) rather than mirroring every OpenAI model name. The OpenAI aliases are convenient for drop-in compatibility, but they eat into a limited budget.
The compile-before-deploy ordering is also something I'd automate. Right now it's manual: scale down vision, run compile job, scale up vision. A pre-deploy hook or an init container that checks for compiled artifacts would make this smoother.
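A sketch of what that init container could look like, assuming the deployment already mounts the model PVC and the compiled .so lands on it (the filename below is illustrative):

initContainers:
  - name: wait-for-compiled-model
    image: busybox:1.36
    command:
      - /bin/sh
      - -c
      - |
        # Block until the compiled MLC library exists on the shared PVC.
        # Match the filename to whatever the compile job actually produces.
        until [ -f /models/qwen3-8b-abliterated-q4f32_1-rocm.so ]; do
          echo "waiting for compiled model artifacts..."
          sleep 10
        done
    volumeMounts:
      - name: model-cache
        mountPath: /models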
Takeaways
- Distribute models by usage pattern, not just by size. Putting "always-on" and "on-demand" models on the same GPU is fine with shared groups, but splitting across nodes gives you parallelism you can't get from time-sharing.
- `serviceLabels` are the routing primitive. If the proxy can't find your model, check labels first. LiteLLM aliases are for a different routing layer.
- Shared labels on models with different priorities can cause timeouts. Keep aliases on the highest-priority (always-Ready) model, and use direct model names for on-demand models.
- MLC-LLM on gfx1100 sustains ~75 TPS for 14B models at moderate context lengths. That's fast enough for copilot, chat, and code generation workloads on a homelab.
- Both 7900 XTX nodes perform similarly despite one being on a 2014 Haswell-E platform (i7-5930K) and the other on a 2023 Zen 4 (Ryzen 9 7900X3D). For these workloads, inference is GPU-bound enough that CPU generation didn't move the needle.
Related posts:
- Deploying MLC-LLM on Dual RX 7900 XTX GPUs: the VRAM, KV cache, and scheduling debugging that preceded this work.
- Running LLMs on Radeon GPUs: bottom-up ROCm setup guide.
- Hybrid GPU GitOps: the broader GitOps patterns for GPU workloads.