
Running LLMs on Radeon GPUs with ROCm

November 20, 2025 · 5 min read

Tags: lab, rocm, amd, llm, inference

I run LLM inference in my homelab because I want:

  • predictable cost for always-on endpoints,
  • low latency on my LAN,
  • and the ability to test “production-ish” operational work (deploys, rollbacks, telemetry) without renting GPUs.

My current default GPU is AMD Radeon (RX 7900 XTX). ROCm has improved a lot, but the important part isn’t “does it run?” It’s “does it stay running after upgrades, restarts, and real traffic?”

TL;DR

  • Treat ROCm like part of the platform: pin versions and don’t casually drift.
  • Validate the stack bottom-up: device nodes -> host tooling -> container runtime -> model server.
  • In a homelab, llama.cpp (GGUF) is often the least painful path to stable, memory-efficient serving.
  • Split workloads by lane, not just by model size. Predictable contention matters more than theoretical peak density.

What I'm Running

The current layout is more structured than when I first started:

  • Quality lane (cblevins-7900xtx): always-on text inference, with larger quality/reasoning models preempting when explicitly requested.
  • Vision + fast lane (cblevins-5930k): vision stays warm, and a faster text model can use the lane when vision traffic is quiet.
  • Media lane (cblevins-gtx980ti): smaller image-generation workloads and legacy CUDA-shaped escape hatches.

The important part is not just the hardware split. It is the contract around it:

  • FlexInfer handles deployment, routing, and shared-GPU behavior.
  • serviceLabels give me a stable routing layer even when the backing model changes.
  • The public Model Gallery demo, FlexInfer playground, and FlexInfer docs now mirror the same concepts I am using operationally.

I’ve also run vLLM and MLC-LLM on this hardware. They can be excellent. For day-to-day “leave it running” service, llama.cpp still wins a lot of boring operational contests because GGUF quantization is forgiving and predictable.

Bottom-Up Checklist (Host -> Container -> Model)

When ROCm is “broken,” it’s usually one of these:

  1. The host doesn’t have the right device nodes or kernel modules.
  2. The container can’t see the device nodes (permissions / runtime mismatch).
  3. The model server runs, but VRAM behavior under load causes OOMs, fragmentation, or tail latency.

This is the checklist I run before I blame the model.

1) Host sanity: device nodes and basic tooling

On AMD ROCm nodes, I want to see:

  • /dev/kfd (ROCm kernel driver interface)
  • /dev/dri/* (DRM devices)

And I want at least one host-level tool to confirm the GPU is visible:

ls -la /dev/kfd /dev/dri
rocminfo | head -n 50
rocm-smi || true

If rocminfo fails on the host, nothing above it is going to be stable.

2) Container sanity: can a pod see the GPU?

Before running a model server, I run a tiny “does the container see the device?” check. The goal is to catch runtime issues (missing device mounts, wrong security context) early.

If you’re on Kubernetes, make sure the pod has access to /dev/kfd + /dev/dri and that your device plugin/runtime setup is consistent. A lot of “ROCm is flaky” reports are really “my containers don’t reliably get the device nodes.”
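The trivial check I mean is something like this throwaway pod (a sketch, not my production manifest — the image name and the `privileged` shortcut are assumptions; adapt to your cluster's security policy):

```yaml
# rocm-device-check.yaml -- throwaway pod that only verifies device access.
# The image is an assumption; any image that ships rocminfo works.
apiVersion: v1
kind: Pod
metadata:
  name: rocm-device-check
spec:
  restartPolicy: Never
  containers:
    - name: check
      image: rocm/dev-ubuntu-22.04
      command: ["/bin/sh", "-c"]
      args: ["ls -la /dev/kfd /dev/dri && rocminfo | head -n 20"]
      securityContext:
        # Blunt but unambiguous for a one-off check; real workloads should
        # use a device plugin / supplemental groups instead of privileged.
        privileged: true
      volumeMounts:
        - name: kfd
          mountPath: /dev/kfd
        - name: dri
          mountPath: /dev/dri
  volumes:
    - name: kfd
      hostPath:
        path: /dev/kfd
    - name: dri
      hostPath:
        path: /dev/dri
```

`kubectl apply -f rocm-device-check.yaml && kubectl logs -f rocm-device-check` — if the log shows the device nodes and the start of the rocminfo agent list, move up the stack; if not, fix the runtime before touching the model server.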

3) Runtime knobs: make RDNA3 less surprising

For RDNA3 (gfx1100), I’ve had the best luck being explicit about a few environment variables instead of letting every framework guess:

  • PYTORCH_ROCM_ARCH=gfx1100 (PyTorch/ROCm builds, when applicable)
  • HSA_OVERRIDE_GFX_VERSION=11.0.0 (workaround for tooling/runtime mismatches on consumer RDNA3)
  • HIP_VISIBLE_DEVICES=0 or 1 (pick a GPU deterministically)

For PyTorch-heavy workloads (ComfyUI, diffusion backends), I also use allocator tuning like:

  • PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.8,max_split_size_mb:256

These aren’t magic. They just reduce “why did this behave differently after a restart?” variance.
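Concretely, the baseline I pin in the container spec looks like this (values from the list above; `HIP_VISIBLE_DEVICES=0` assumes the first GPU is the one you want):

```shell
# RDNA3 (gfx1100) environment pins -- set these in the pod/container spec,
# not ad hoc in a shell, so every restart gets the same behavior.
export PYTORCH_ROCM_ARCH=gfx1100
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export HIP_VISIBLE_DEVICES=0   # pick the GPU deterministically
# Allocator tuning for PyTorch-heavy workloads (ComfyUI, diffusion):
export PYTORCH_HIP_ALLOC_CONF="garbage_collection_threshold:0.8,max_split_size_mb:256"
```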

4) vLLM on gfx1100: the defaults that stopped the hangs

If you want vLLM on consumer RDNA3, plan for some platform-specific guardrails. In my services/flexinfer work, I ended up encoding the “don’t let this auto-detect itself into a hang” defaults:

  • base ROCm vars (RDNA3 sanity):
    • HSA_OVERRIDE_GFX_VERSION=11.0.0
    • PYTORCH_ROCM_ARCH=gfx1100
    • TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
  • vLLM-specific stability gates:
    • force vLLM V0 engine: VLLM_USE_V1=0
    • disable Triton flash attention: VLLM_USE_TRITON_FLASH_ATTN=0
    • disable AITER: VLLM_ROCM_USE_AITER=0

The meta-point: for AMD consumer GPUs, backend choice is only half the story. The other half is deciding which runtime knobs are “part of the platform” and keeping them pinned.
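In manifest form, "part of the platform" means the pins live in the deployment's container spec, not in someone's shell history. A sketch of the env block (values are the ones listed above):

```yaml
# Container env for vLLM on gfx1100 -- pinned in the deployment so the
# knobs survive restarts and upgrades.
env:
  - name: HSA_OVERRIDE_GFX_VERSION
    value: "11.0.0"
  - name: PYTORCH_ROCM_ARCH
    value: "gfx1100"
  - name: TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL
    value: "1"
  - name: VLLM_USE_V1
    value: "0"
  - name: VLLM_USE_TRITON_FLASH_ATTN
    value: "0"
  - name: VLLM_ROCM_USE_AITER
    value: "0"
```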

Real-World Configuration: Text Inference

For the always-on text lane, I currently favor llamacpp-server where GGUF efficiency is the difference between “fits cleanly” and “turns into a late-night VRAM archaeology session.” Here is the shape of the configuration I use on the primary node:

# Qwen2.5-7B with speculative decoding (Excerpt from llamacpp-qwen2p5-7b-spec.yaml)
args:
  - |
    exec /opt/src/llama.cpp/build/bin/llama-server \
      --model /models/qwen2.5-7b-abliterated/Qwen2.5-7B-Instruct-abliterated-v2.Q4_K_M.gguf \
      --model-draft /models/qwen2.5-0.5b/qwen2.5-0.5b-instruct-q8_0.gguf \
      --ctx-size 16384 \
      --n-gpu-layers 9999 \
      --n-gpu-layers-draft 9999 \
      --draft-max 16 \
      --draft-min 4 \
      --flash-attn on \
      --cache-type-k q8_0 \
      --parallel 4

Speculative Decoding Workflow

Speculative decoding is the biggest “free speed” lever I’ve found for interactive chat. The idea: run a small draft model to propose tokens and let the target model verify them in batches.

  • Draft Model: Qwen2.5-0.5B-Instruct
  • Target Model: Qwen2.5-7B-Instruct (Abliterated)

Two operational notes that matter more than the theory:

  • This is sensitive to VRAM headroom. If you’re tight on memory, the extra model can push you into OOM territory under concurrency.
  • It changes the “shape” of latency. Measure p95/p99, not just average tokens/sec.
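The accept/reject mechanics are easy to hold in your head with a toy version (a greedy sketch of the idea, not llama.cpp's implementation — `draft`, `target`, and the integer "tokens" are all made up for illustration):

```python
def speculative_step(draft_propose, target_verify, prefix, k=4):
    """One greedy speculative-decoding step (toy sketch).

    draft_propose(prefix, k): k candidate tokens from the small model.
    target_verify(prefix, candidates): the target model's greedy pick at
    each candidate position, plus one extra token after the last candidate
    (k + 1 entries total) -- all computed in a single batched pass.
    """
    candidates = draft_propose(prefix, k)
    target_tokens = target_verify(prefix, candidates)

    accepted = []
    for cand, tgt in zip(candidates, target_tokens):
        if cand != tgt:
            # First disagreement: keep the target's token and stop early.
            accepted.append(tgt)
            return accepted
        accepted.append(cand)
    # Every candidate matched: the verify pass yields one bonus token.
    accepted.append(target_tokens[k])
    return accepted


# Toy "models" over integer tokens: the draft always guesses n + 1; the
# target agrees, except that after 3 it jumps to 10.
def draft(prefix, k):
    out, last = [], prefix[-1]
    for _ in range(k):
        last += 1
        out.append(last)
    return out

def target(prefix, candidates):
    seq, out = list(prefix), []
    for cand in candidates + [None]:
        nxt = 10 if seq[-1] == 3 else seq[-1] + 1
        out.append(nxt)
        seq.append(cand if cand is not None else nxt)
    return out

print(speculative_step(draft, target, [5], k=4))        # all 4 accepted + 1 bonus token
print(speculative_step(draft, target, [1, 2, 3], k=4))  # draft wrong at the first position
```

The payoff and the cost are both visible here: when the models agree you emit k+1 tokens for one target pass, and when they disagree you ran the draft model for nothing — which is exactly why acceptance rate and VRAM headroom decide whether this is a win.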

A Practical Starting Point (GGUF on 24GB)

If you’re running GGUF models on a 24GB card and you want “works reliably” before “chase the last 10%,” here are the knobs I start with:

  • context: --ctx-size 8192 (bigger is nice until it isn’t)
  • batching: --batch-size 512 (throughput vs latency tradeoff)
  • offload: --n-gpu-layers as high as you can without OOM

Then I increase concurrency and context slowly and watch tail latency and VRAM behavior.
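When budgeting, the back-of-envelope KV-cache arithmetic is worth doing before the OOM does it for you. A sketch (the model shape below is a hypothetical 7B-class GQA configuration, not a specific model card — substitute your model's real layer and head counts):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Rough KV-cache footprint: K and V tensors, per layer, per KV head,
    per context position. bytes_per_elem: 2 for f16, ~1 for a q8_0 cache."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# Hypothetical 7B-class GQA shape: 28 layers, 4 KV heads, head_dim 128.
b = kv_cache_bytes(n_layers=28, n_kv_heads=4, head_dim=128, n_ctx=16384)
print(f"{b / 2**30:.2f} GiB")  # f16 cache for 16k total context
```

Note that (as I understand current llama-server behavior) `--ctx-size` is the total context shared across `--parallel` slots, so budget against the total, not per-request — and remember this cache sits on top of the model weights and the draft model if you're running speculative decoding.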

Failure Modes I Actually Hit

These are the ones that cost me time:

  • Version drift: a kernel update or ROCm update that “sort of works” and then fails under load. Fix: pin versions, upgrade intentionally, write it down.
  • VRAM lies: a model fits until you add concurrency and a real context window. Fix: budget KV cache and set explicit limits (model len, parallelism).
  • Container visibility: the pod is “Running” but doesn’t have usable /dev/kfd access. Fix: validate the device nodes in a trivial pod before deploying the model server.

Next Steps

At this point the work is less “can I make ROCm run?” and more “can I keep known-good configs easy to reuse?”

The current follow-through items are:

  • better benchmark baselines for warm vs cold behavior,
  • clearer backend-specific docs for gfx1100,
  • and keeping the docs/playground/demo surfaces in sync with the configs that are actually working in the cluster.

If you want the deeper war story version (dual nodes, MLC-LLM compilation/JIT, and the real “what broke first” timeline), see: Deploying MLC-LLM on Dual RX 7900 XTX GPUs.
