
Getting Gemma 4 Running on a Radeon 7900 XTX (with and without TurboQuant)

April 4, 2026·8 min read

Tags: lab · gemma4 · amd · radeon · 7900xtx · rocm · vllm · turboquant · gptq


I wanted a Gemma 4 serving path that was real, not a benchmark screenshot:

  • managed by FlexInfer,
  • reconciled through GitOps,
  • reachable through LiteLLM / OpenWebUI,
  • and stable enough that I could restart the runtime without losing the plot.

That turned out to be two different problems hiding under one label:

  1. Get Gemma 4 serving cleanly on Radeon with the boring path.
  2. See how far I could push TurboQuant on ROCm without lying to myself about what was production-ready.

The first one is working now. The second one is working enough to be interesting, but still clearly experimental.

TL;DR

  • google/gemma-4-E4B-it now serves cleanly on my RX 7900 XTX through the unified FlexInfer runtime.
  • The stable path is vLLM + TRITON_ATTN + float16 KV cache, exposed as:
    • gemma4-e4b
    • gemma4-e4b-fast
  • The experimental long-context path runs on a second Radeon node with TurboQuant KV cache compression, exposed as:
    • gemma4-e4b-long
  • The fast lane is currently around 1,887 prompt tok/s on a warmed ~10k prompt.
  • The long TurboQuant lane is currently around 268 prompt tok/s on a warmed ~30k prompt.
  • The big lesson: Gemma 4 on Radeon is mostly a runtime integration problem; TurboQuant on Radeon is still a semantics and prefill-performance problem.
  • In parallel, GPTQ work is moving on the quantization pipeline side, but that is not the serving path I’m using for Gemma 4 today.

Context: what “working” meant here

This wasn’t “I ran a local script once.” I wanted the same deployment path I use for the rest of the platform:

  • Model CRs in FlexInfer,
  • a unified gfx1100 runtime image,
  • rollout via GitLab -> Flux,
  • model discovery through LiteLLM,
  • and sane public model IDs that don’t turn OpenWebUI into a junk drawer.

The hardware split I ended up with is:

| Model ID | Node | Path | Intent |
| --- | --- | --- | --- |
| gemma4-e4b | cblevins-7900xtx | vLLM TRITON_ATTN + float16 KV | default alias |
| gemma4-e4b-fast | cblevins-7900xtx | vLLM TRITON_ATTN + float16 KV | lower-latency text generation |
| gemma4-e4b-long | cblevins-5930k | vLLM CUSTOM + kvCacheCodec=turboquant | long-context experimental lane |

That split matters. It let me stop pretending one configuration should do everything well on one consumer GPU.

The first blocker: upstream drift, not just “missing support”

The original problem wasn’t just “FlexInfer doesn’t know Gemma 4.” It was a moving target:

  • Hugging Face Gemma 4 models landed ahead of stable vLLM releases.
  • The model family mixes text, vision, and audio towers depending on the checkpoint.
  • The runtime stack wanted transformers features newer than the versions pinned by some downstream packages.

The checkpoint I focused on is google/gemma-4-E4B-it. That was the right starting point because:

  • it is much smaller than the 31B Gemma 4 checkpoints,
  • it fits the 24 GB Radeon story better,
  • and it is still complex enough to flush out the real runtime issues.

The first useful discovery was that vLLM main already had Gemma 4 model support, while the stable release line I started from did not. That shifted the project from “find a fork” to “make the source path actually run on my ROCm base image.”

Phase 1: make the boring serving path real

The stable path is not exotic now:

  • unified runtime
  • vLLM on ROCm
  • Gemma parser/tool-calling enabled
  • no eager mode
  • managed via FlexInfer CRDs

What took time was getting there without leaving a pile of one-off debug pods behind.

1) Fix the runtime stack before the model

I had to stop thinking in terms of “model bugs” and look at the runtime as a platform:

  • source-built vLLM against the AMD base image,
  • patched compatibility issues between the base PyTorch build and current vLLM,
  • made the runtime image install transformers from a new enough source tree for Gemma 4,
  • and hardened the managed runtime path so Flux and the controller weren’t fighting hot fixes.

One of the most important boring wins was compile cache reuse. On the managed warm restart path, torch.compile + warmup dropped from roughly 232s down to about 11s, and total engine init dropped from about 239s to about 15s once the cache path was stable.
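The mechanics of that warm-restart win are mundane: keep the compile artifacts on a path that survives pod restarts. A minimal sketch of the idea, assuming a persistent mount at `/models/compile-cache` (that path is illustrative; `TORCHINDUCTOR_CACHE_DIR` and `VLLM_CACHE_ROOT` are the standard env vars, but verify against your pinned versions):

```python
import os

# Sketch: route torch.compile and vLLM cache artifacts to a persistent
# volume so a warm restart reuses compiled kernels instead of rebuilding.
# The mount path is an assumption for illustration.

def compile_cache_env(cache_root: str) -> dict:
    """Env vars that pin compile artifacts to a stable, persistent path."""
    return {
        # torch.compile (inductor) kernel cache
        "TORCHINDUCTOR_CACHE_DIR": os.path.join(cache_root, "inductor"),
        # vLLM's own cache root (compiled graphs, etc.)
        "VLLM_CACHE_ROOT": os.path.join(cache_root, "vllm"),
    }

# Must be set before engine init, e.g. in the runtime entrypoint.
os.environ.update(compile_cache_env("/models/compile-cache"))
```

The important property is that the same cache path is injected by the controller on every rollout, so the cache key stays stable across restarts.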

That mattered more than any one flashy optimization because it made iteration tolerable.

2) Stop treating aliases like an afterthought

I also had to clean up the user-facing model IDs so OpenWebUI and LiteLLM reflected how the system actually worked.

The current public contract is intentionally small:

  • gemma4-e4b
  • gemma4-e4b-fast
  • gemma4-e4b-long

That sounds cosmetic, but it fixed real operator pain:

  • old Qwen/Gemma aliases stopped cluttering the UI,
  • the default alias stopped inheriting the old low-context assumptions,
  • and it became obvious which lane was “fast” versus “long.”

Phase 2: TurboQuant on ROCm

This was the fun part, and also the part where “works” needs scare quotes unless I am careful.

The public TurboQuant story looked promising on paper: KV cache compression with meaningful memory savings. But on ROCm and Gemma 4, this turned into a sequence of very specific failures:

  • CPU LAPACK assumptions in the quantizer path,
  • Gemma 4 model-type mismatches (gemma4_text vs gemma4),
  • mixed 256/512 head-dimension behavior,
  • shared-KV semantics not lining up with upstream assumptions,
  • prompt-format issues that looked like backend corruption until I fixed the harness,
  • and then the real long-context problem: prefill speed.

This is why I split the long lane out instead of trying to silently swap the default path underneath users.

What is actually working now

The managed TurboQuant canary is real:

  • it runs through the same unified FlexInfer runtime family,
  • it is exposed through the normal Model CRD path,
  • and it serves through LiteLLM as gemma4-e4b-long.

Current tuned profile:

| Model ID | maxModelLen | maxNumBatchedTokens | gpuMemoryUtilization |
| --- | --- | --- | --- |
| gemma4-e4b-fast | 16384 | 512 | 0.92 |
| gemma4-e4b-long | 32768 | 160 | 0.80 |
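For readers who think in vLLM terms rather than CRD fields: the profile above maps roughly onto standard vLLM engine arguments. The mapping here is my sketch of what the controller wires up, not FlexInfer's actual translation layer:

```python
# Illustrative mapping from FlexInfer CRD fields to vLLM engine kwargs
# (equivalently, `vllm serve --max-model-len ... --max-num-batched-tokens ...`).

fast_lane = {
    "model": "google/gemma-4-E4B-it",
    "max_model_len": 16384,          # maxModelLen
    "max_num_batched_tokens": 512,   # maxNumBatchedTokens
    "gpu_memory_utilization": 0.92,  # gpuMemoryUtilization
}

long_lane = {
    "model": "google/gemma-4-E4B-it",
    "max_model_len": 32768,
    "max_num_batched_tokens": 160,   # raised from 128 during tuning
    "gpu_memory_utilization": 0.80,  # headroom for the TurboQuant codec path
}
```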

The key tuning win so far was small but real: raising gemma4-e4b-long from maxNumBatchedTokens=128 to 160 improved the warmed long-context lane from roughly 223 to 268 prompt tok/s on a ~30k prompt, and improved the ~10k leg from roughly 340 to 389 prompt tok/s.

That is not “TurboQuant solved.” It is “the managed experimental path is now good enough to iterate on with numbers.”

What the current numbers look like

These are from the current managed setup, using the benchmark workflow I added in FlexInfer for fast-vs-long comparisons:

| Model ID | Prompt tokens | Elapsed | Prompt tok/s | Completion tok/s |
| --- | --- | --- | --- | --- |
| gemma4-e4b | 73 | 0.880 s | 82.96 | 57.96 |
| gemma4-e4b-fast | 10035 | 5.317 s | 1887.31 | 12.04 |
| gemma4-e4b-long | 10035 | 25.802 s | 388.92 | 2.48 |
| gemma4-e4b-long | 30035 | 112.088 s | 267.96 | 0.57 |
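One reason I trust this table: the throughput column is derivable, not hand-entered. Prompt tok/s is just prompt tokens divided by elapsed time, and every row checks out:

```python
# Sanity check: prompt tok/s == prompt_tokens / elapsed_seconds, per row.
rows = [
    ("gemma4-e4b",       73,    0.880,   82.96),
    ("gemma4-e4b-fast",  10035, 5.317,   1887.31),
    ("gemma4-e4b-long",  10035, 25.802,  388.92),
    ("gemma4-e4b-long",  30035, 112.088, 267.96),
]
for model_id, prompt_tokens, elapsed_s, reported in rows:
    computed = prompt_tokens / elapsed_s
    assert abs(computed - reported) < 0.5, (model_id, computed, reported)
```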

The headline is straightforward:

  • the fast lane is good
  • the long lane is functional
  • and the long lane’s prefill path is still where most of the pain lives

That is exactly the kind of split I would rather document honestly than hide behind a single marketing number.

The hard lesson: correctness first, then speed

I lost a lot of time until I forced myself to separate three classes of failure:

  1. runtime crashes
  2. semantic corruption
  3. slow but correct execution

TurboQuant on Gemma 4 hit all three at different moments.

Some fixes that mattered:

  • gating TurboQuant KV-spec rewriting so it did not leak into normal TRITON_ATTN paths,
  • fixing shared-KV handling so the cache path matched what vLLM expected,
  • routing instruction-tuned -it models through chat-style prompting in the debug harness,
  • and keeping the experimental path on its own model ID so the stable lane stayed boring.
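The chat-prompting fix deserves a concrete shape, because the failure mode is so misleading: feed an instruction-tuned checkpoint a raw completion prompt and the garbage output looks like backend corruption. A minimal sketch of the harness-side routing, assuming a transformers-style tokenizer (the helper name is mine):

```python
# Sketch: route "-it" checkpoints through the model's chat template
# (e.g. a transformers AutoTokenizer) instead of raw completion prompts.

def build_prompt(tokenizer, user_text: str) -> str:
    """Use the chat template when one exists; fall back to plain text."""
    if getattr(tokenizer, "chat_template", None):
        return tokenizer.apply_chat_template(
            [{"role": "user", "content": user_text}],
            tokenize=False,
            add_generation_prompt=True,
        )
    # Base (non-instruct) models: plain completion prompt is correct.
    return user_text
```

Once the debug harness did this consistently, a whole class of "TurboQuant corrupted the output" reports evaporated.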

The result is that I can now say something precise:

  • Gemma 4 on Radeon is production-ish on the stable path
  • Gemma 4 + TurboQuant on Radeon is a valid experimental long-context lane
  • TurboQuant is not yet the default path I would hand to every user

The in-progress GPTQ work

There is another thread happening in parallel that matters, but it is easy to describe badly if I collapse it into the Gemma story.

The GPTQ work is currently on the quantization pipeline side, not the Gemma 4 serving path above.

The latest Claude-side progress has been in hardening the loader/materialization behavior for GPTQModel-based quantization flows, especially around Qwen3.5. The recent deltas I wanted reflected here are:

  • fix: align GPTQ loader path with pinned API
  • fix: support current GPTQ before-load hook
  • fix(quantization): force qwen3.5 text loader
  • fix(gptq): materialize meta-backed shard loads
  • fix(gptq): always attempt assign shard loads

Those changes are aimed at a real class of issues I hit earlier:

  • loader/API drift in GPTQModel,
  • meta-backed tensor materialization failures,
  • before-load hooks moving under us,
  • and model definitions that did not line up cleanly with the instantiated text-only model layout.

In other words: the GPTQ thread is getting more boring in the right way. The loader path is being pinned down, shard loads are being forced to materialize more predictably, and the text-only Qwen3.5 path is being handled more explicitly instead of hoping the generic loader will guess correctly.

The practical takeaway is:

  • GPTQ is moving forward
  • the work is making the quantization pipeline less brittle
  • but it is not the thing currently serving Gemma 4 in my cluster

That distinction matters because I do not want to imply “Gemma 4 is running via GPTQ now” when it isn’t.

What I would tell someone trying this on Radeon

If you want the shortest honest version of the path:

  1. Start with the stable non-TurboQuant path.
  2. Get the model serving through your real control plane, not a throwaway script.
  3. Make warm restarts cheap before chasing clever kernels.
  4. Split fast and long into different managed profiles.
  5. Only then start paying the TurboQuant tax.

And if you are working on consumer AMD GPUs, treat a few things as platform contracts, not per-app guesses:

  • your ROCm/PyTorch/vLLM version set,
  • your attention backend choice,
  • your allocator settings,
  • and your GitOps rollout path.
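Concretely, a "platform contract" can be as simple as one pinned set of env vars that every runtime inherits. The values below are illustrative, not prescriptive: `VLLM_ATTENTION_BACKEND` is a real vLLM knob and `PYTORCH_HIP_ALLOC_CONF` is the ROCm analog of `PYTORCH_CUDA_ALLOC_CONF`, but which allocator options your build supports is something to verify, not assume:

```python
import os

# Sketch: pin the platform contract once, centrally, instead of letting
# each deployment guess. Values here are illustrative.
PLATFORM_CONTRACT = {
    "VLLM_ATTENTION_BACKEND": "TRITON_ATTN",
    # ROCm allocator config; option support varies by PyTorch/ROCm build.
    "PYTORCH_HIP_ALLOC_CONF": "expandable_segments:True",
}

def apply_contract(env: dict = PLATFORM_CONTRACT) -> None:
    for key, value in env.items():
        # setdefault: never silently override an operator's explicit choice
        os.environ.setdefault(key, value)
```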

That sounds less exciting than “TurboQuant on Radeon,” but it is the difference between a demo and a service.

What’s next

My next queue is pretty clear:

  • push the long-context TurboQuant lane further without regressing stability,
  • keep measuring prefill separately from generation so I stop fooling myself with the wrong number,
  • continue the GPTQ pipeline hardening until the loader/materialization path is boring,
  • and keep the living status docs up to date so this work stays legible.

More concretely:

  • for the stable lane, I want to find the actual batching ceiling on the 7900 XTX without giving back the operational stability I already have;
  • for the TurboQuant lane, I want to spend time on prefill behavior and paged decompress fallback, not just keep bumping manifest knobs blindly;
  • for GPTQ, I want the quantization path to become predictable enough that it can be written about as a boring workflow instead of a heroic debugging story.
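"Measuring prefill separately from generation" has a simple mechanical form when you stream tokens: time-to-first-token is dominated by prefill, and everything after the first token is generation. A sketch of the split (the function and the synthetic timestamps are mine, not the FlexInfer benchmark workflow):

```python
# Sketch: split a streamed response into a prefill-dominated phase
# (time to first token) and a pure generation phase, so the two are
# never averaged into one misleading number.

def split_phases(start: float, token_times: list) -> dict:
    """token_times: arrival timestamps of streamed completion tokens."""
    ttft = token_times[0] - start                  # dominated by prefill
    gen_window = token_times[-1] - token_times[0]  # generation only
    gen_tps = (len(token_times) - 1) / gen_window if gen_window > 0 else 0.0
    return {"ttft_s": ttft, "generation_tok_s": gen_tps}

# Synthetic example: first token lands after 2 s of prefill,
# then one token every 0.1 s for ten more tokens.
stats = split_phases(0.0, [2.0 + 0.1 * i for i in range(11)])
print(stats)  # ttft_s == 2.0, generation_tok_s == 10.0
```

On the long TurboQuant lane, this split is what makes the 267.96 prompt tok/s vs 0.57 completion tok/s table readable as two separate problems instead of one blended average.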

If you are building on AMD consumer GPUs, this is the version I would recommend:

  • use the stable path first
  • keep the experimental path isolated
  • measure everything
  • and don’t mistake “it returned text once” for “the platform is done”

Because on Radeon, especially with newer model families, the real work is rarely “does it run?” It is “can I still trust it after the next restart, rollout, or context-length bump?”
