
Running LLMs on Radeon GPUs with ROCm

November 20, 2025 · 5 min read

Tags: lab, rocm, amd, llm, inference

I run LLM inference in my homelab because I want:

  • predictable cost for always-on endpoints,
  • low latency on my LAN,
  • and the ability to test “production-ish” operational work (deploys, rollbacks, telemetry) without renting GPUs.

My current default GPU is AMD Radeon (RX 7900 XTX). ROCm has improved a lot, but the important part isn’t “does it run?” It’s “does it stay running after upgrades, restarts, and real traffic?”

TL;DR

  • Treat ROCm like part of the platform: pin versions and don’t casually drift.
  • Validate the stack bottom-up: device nodes -> host tooling -> container runtime -> model server.
  • In a homelab, llama.cpp (GGUF) is often the least painful path to stable, memory-efficient serving.
  • Split workloads by lane, not just by model size. Predictable contention matters more than theoretical peak density.

What I'm Running

The current layout is more structured than when I first started:

  • Quality lane (cblevins-7900xtx): always-on text inference, with larger quality/reasoning models preempting when explicitly requested.
  • Vision + fast lane (cblevins-5930k): vision stays warm, and a faster text model can use the lane when vision traffic is quiet.
  • Media lane (cblevins-gtx980ti): smaller image-generation workloads and legacy CUDA-shaped escape hatches.

The important part is not just the hardware split. It is the contract around it:

  • FlexInfer handles deployment, routing, and shared-GPU behavior.
  • serviceLabels give me a stable routing layer even when the backing model changes.
  • The public Model Gallery demo, FlexInfer playground, and FlexInfer docs now mirror the same concepts I am using operationally.

I’ve also run vLLM and MLC-LLM on this hardware. They can be excellent. For day-to-day “leave it running” service, llama.cpp still wins a lot of boring operational contests because GGUF quantization is forgiving and predictable.

Bottom-Up Checklist (Host -> Container -> Model)

When ROCm is “broken,” it’s usually one of these:

  1. The host doesn’t have the right device nodes or kernel modules.
  2. The container can’t see the device nodes (permissions / runtime mismatch).
  3. The model server runs, but VRAM behavior under load causes OOMs, fragmentation, or tail latency.

This is the checklist I run before I blame the model.

1) Host sanity: device nodes and basic tooling

On AMD ROCm nodes, I want to see:

  • /dev/kfd (ROCm kernel driver interface)
  • /dev/dri/* (DRM devices)

And I want at least one host-level tool to confirm the GPU is visible:

ls -la /dev/kfd /dev/dri
rocminfo | head -n 50
rocm-smi || true

If rocminfo fails on the host, nothing above it is going to be stable.

2) Container sanity: can a pod see the GPU?

Before running a model server, I run a tiny “does the container see the device?” check. The goal is to catch runtime issues (missing device mounts, wrong security context) early.

If you’re on Kubernetes, make sure the pod has access to /dev/kfd + /dev/dri and that your device plugin/runtime setup is consistent. A lot of “ROCm is flaky” reports are really “my containers don’t reliably get the device nodes.”
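The trivial check I mean is something like this throwaway pod (a sketch, not my production manifest — the image name and the `privileged` shortcut are assumptions; adapt to your cluster's security policy):

```yaml
# rocm-device-check.yaml -- throwaway pod that only verifies device access.
# The image is an assumption; any image that ships rocminfo works.
apiVersion: v1
kind: Pod
metadata:
  name: rocm-device-check
spec:
  restartPolicy: Never
  containers:
    - name: check
      image: rocm/dev-ubuntu-22.04
      command: ["/bin/sh", "-c"]
      args: ["ls -la /dev/kfd /dev/dri && rocminfo | head -n 20"]
      securityContext:
        # Blunt but unambiguous for a one-off check; real workloads should
        # use a device plugin / supplemental groups instead of privileged.
        privileged: true
      volumeMounts:
        - name: kfd
          mountPath: /dev/kfd
        - name: dri
          mountPath: /dev/dri
  volumes:
    - name: kfd
      hostPath:
        path: /dev/kfd
    - name: dri
      hostPath:
        path: /dev/dri
```

`kubectl apply -f rocm-device-check.yaml && kubectl logs -f rocm-device-check` — if the log shows the device nodes and the start of the rocminfo agent list, move up the stack; if not, fix the runtime before touching the model server.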

3) Runtime knobs: make RDNA3 less surprising

For RDNA3 (gfx1100), I’ve had the best luck being explicit about a few environment variables instead of letting every framework guess:

  • PYTORCH_ROCM_ARCH=gfx1100 (PyTorch/ROCm builds, when applicable)
  • HSA_OVERRIDE_GFX_VERSION=11.0.0 (workaround for tooling/runtime mismatches on consumer RDNA3)
  • HIP_VISIBLE_DEVICES=0 or 1 (pick a GPU deterministically)

For PyTorch-heavy workloads (ComfyUI, diffusion backends), I also use allocator tuning like:

  • PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.8,max_split_size_mb:256

These aren’t magic. They just reduce “why did this behave differently after a restart?” variance.
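Concretely, the baseline I pin in the container spec looks like this (values from the list above; `HIP_VISIBLE_DEVICES=0` assumes the first GPU is the one you want):

```shell
# RDNA3 (gfx1100) environment pins -- set these in the pod/container spec,
# not ad hoc in a shell, so every restart gets the same behavior.
export PYTORCH_ROCM_ARCH=gfx1100
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export HIP_VISIBLE_DEVICES=0   # pick the GPU deterministically
# Allocator tuning for PyTorch-heavy workloads (ComfyUI, diffusion):
export PYTORCH_HIP_ALLOC_CONF="garbage_collection_threshold:0.8,max_split_size_mb:256"
```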

4) vLLM on gfx1100: the defaults that stopped the hangs

If you want vLLM on consumer RDNA3, plan for some platform-specific guardrails. In my services/flexinfer work, I ended up encoding the “don’t let this auto-detect itself into a hang” defaults:

  • base ROCm vars (RDNA3 sanity):
    • HSA_OVERRIDE_GFX_VERSION=11.0.0
    • PYTORCH_ROCM_ARCH=gfx1100
    • TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
  • vLLM-specific stability gates:
    • force vLLM V0 engine: VLLM_USE_V1=0
    • disable Triton flash attention: VLLM_USE_TRITON_FLASH_ATTN=0
    • disable AITER: VLLM_ROCM_USE_AITER=0

The meta-point: for AMD consumer GPUs, backend choice is only half the story. The other half is deciding which runtime knobs are “part of the platform” and keeping them pinned.
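In manifest form, "part of the platform" means the pins live in the deployment's container spec, not in someone's shell history. A sketch of the env block (values are the ones listed above):

```yaml
# Container env for vLLM on gfx1100 -- pinned in the deployment so the
# knobs survive restarts and upgrades.
env:
  - name: HSA_OVERRIDE_GFX_VERSION
    value: "11.0.0"
  - name: PYTORCH_ROCM_ARCH
    value: "gfx1100"
  - name: TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL
    value: "1"
  - name: VLLM_USE_V1
    value: "0"
  - name: VLLM_USE_TRITON_FLASH_ATTN
    value: "0"
  - name: VLLM_ROCM_USE_AITER
    value: "0"
```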

Real-World Configuration: Text Inference

For the always-on text lane, I currently favor llamacpp-server where GGUF efficiency is the difference between “fits cleanly” and “turns into a late-night VRAM archaeology session.” Here is the shape of the configuration I use on the primary node:

# Qwen2.5-7B with speculative decoding (Excerpt from llamacpp-qwen2p5-7b-spec.yaml)
args:
  - |
    exec /opt/src/llama.cpp/build/bin/llama-server \
      --model /models/qwen2.5-7b-abliterated/Qwen2.5-7B-Instruct-abliterated-v2.Q4_K_M.gguf \
      --model-draft /models/qwen2.5-0.5b/qwen2.5-0.5b-instruct-q8_0.gguf \
      --ctx-size 16384 \
      --n-gpu-layers 9999 \
      --n-gpu-layers-draft 9999 \
      --draft-max 16 \
      --draft-min 4 \
      --flash-attn on \
      --cache-type-k q8_0 \
      --parallel 4

Speculative Decoding Workflow

Speculative decoding is the biggest “free speed” lever I’ve found for interactive chat. The idea: run a small draft model to propose tokens and let the target model verify them in batches.

  • Draft Model: Qwen2.5-0.5B-Instruct
  • Target Model: Qwen2.5-7B-Instruct (Abliterated)

Two operational notes that matter more than the theory:

  • This is sensitive to VRAM headroom. If you’re tight on memory, the extra model can push you into OOM territory under concurrency.
  • It changes the “shape” of latency. Measure p95/p99, not just average tokens/sec.
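The accept/reject mechanics are easy to hold in your head with a toy version (a greedy sketch of the idea, not llama.cpp's implementation — `draft`, `target`, and the integer "tokens" are all made up for illustration):

```python
def speculative_step(draft_propose, target_verify, prefix, k=4):
    """One greedy speculative-decoding step (toy sketch).

    draft_propose(prefix, k): k candidate tokens from the small model.
    target_verify(prefix, candidates): the target model's greedy pick at
    each candidate position, plus one extra token after the last candidate
    (k + 1 entries total) -- all computed in a single batched pass.
    """
    candidates = draft_propose(prefix, k)
    target_tokens = target_verify(prefix, candidates)

    accepted = []
    for cand, tgt in zip(candidates, target_tokens):
        if cand != tgt:
            # First disagreement: keep the target's token and stop early.
            accepted.append(tgt)
            return accepted
        accepted.append(cand)
    # Every candidate matched: the verify pass yields one bonus token.
    accepted.append(target_tokens[k])
    return accepted


# Toy "models" over integer tokens: the draft always guesses n + 1; the
# target agrees, except that after 3 it jumps to 10.
def draft(prefix, k):
    out, last = [], prefix[-1]
    for _ in range(k):
        last += 1
        out.append(last)
    return out

def target(prefix, candidates):
    seq, out = list(prefix), []
    for cand in candidates + [None]:
        nxt = 10 if seq[-1] == 3 else seq[-1] + 1
        out.append(nxt)
        seq.append(cand if cand is not None else nxt)
    return out

print(speculative_step(draft, target, [5], k=4))        # all 4 accepted + 1 bonus token
print(speculative_step(draft, target, [1, 2, 3], k=4))  # draft wrong at the first position
```

The payoff and the cost are both visible here: when the models agree you emit k+1 tokens for one target pass, and when they disagree you ran the draft model for nothing — which is exactly why acceptance rate and VRAM headroom decide whether this is a win.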

A Practical Starting Point (GGUF on 24GB)

If you’re running GGUF models on a 24GB card and you want “works reliably” before “chase the last 10%,” here are the knobs I start with:

  • context: --ctx-size 8192 (bigger is nice until it isn’t)
  • batching: --batch-size 512 (throughput vs latency tradeoff)
  • offload: --n-gpu-layers as high as you can without OOM

Then I increase concurrency and context slowly and watch tail latency and VRAM behavior.
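When budgeting, the back-of-envelope KV-cache arithmetic is worth doing before the OOM does it for you. A sketch (the model shape below is a hypothetical 7B-class GQA configuration, not a specific model card — substitute your model's real layer and head counts):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Rough KV-cache footprint: K and V tensors, per layer, per KV head,
    per context position. bytes_per_elem: 2 for f16, ~1 for a q8_0 cache."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# Hypothetical 7B-class GQA shape: 28 layers, 4 KV heads, head_dim 128.
b = kv_cache_bytes(n_layers=28, n_kv_heads=4, head_dim=128, n_ctx=16384)
print(f"{b / 2**30:.2f} GiB")  # f16 cache for 16k total context
```

Note that (as I understand current llama-server behavior) `--ctx-size` is the total context shared across `--parallel` slots, so budget against the total, not per-request — and remember this cache sits on top of the model weights and the draft model if you're running speculative decoding.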

Failure Modes I Actually Hit

These are the ones that cost me time:

  • Version drift: a kernel update or ROCm update that “sort of works” and then fails under load. Fix: pin versions, upgrade intentionally, write it down.
  • VRAM lies: a model fits until you add concurrency and a real context window. Fix: budget KV cache and set explicit limits (model len, parallelism).
  • Container visibility: the pod is “Running” but doesn’t have usable /dev/kfd access. Fix: validate the device nodes in a trivial pod before deploying the model server.

Next Steps

At this point the work is less “can I make ROCm run?” and more “can I keep known-good configs easy to reuse?”

The current follow-through items are:

  • better benchmark baselines for warm vs cold behavior,
  • clearer backend-specific docs for gfx1100,
  • and keeping the docs/playground/demo surfaces in sync with the configs that are actually working in the cluster.

If you want the deeper war story version (dual nodes, MLC-LLM compilation/JIT, and the real “what broke first” timeline), see: Deploying MLC-LLM on Dual RX 7900 XTX GPUs.
