State of the Platform, July 2026

July 1, 2026 · 10 min read

lab · flexinfer · loom · loom-core · mills · flexdeck · mentatlab · platform · gpu


Every few months the platform accumulates enough change that the product pages, docs, and my own mental model drift apart. This is the periodic re-sync — and this quarter it was overdue. I sat down expecting to write about two features and found roughly 6,000 commits across the core platform repos (loom-core alone landed 757 feat commits; a healthy chunk of the GitOps volume is Flux image automation doing its job, which is itself one of the quarter's stories).

So this is the honest walk down the stack: what shipped between April and July, with dates and measurements where I have them, and plain "this layer held still" where it did.

TL;DR

Mills closed Phase 1. The autonomous work loop completed its first real merge via the harvester-vm substrate in early June, and by quarter-end had a real-work merge KPI, fail-closed health gates, and an autonomy circuit breaker. The system that plans work can now also be stopped by the system that watches it.
FlexInfer went multimodal. Text inference was joined by a voice lane (TTS at 0.207 RTF, Whisper ASR, speaker diarization), image generation on a 2015-era GTX 980 Ti, an embeddings/rerank plane on a Radeon VII, a /v1/rag codebase-answer service, and a finetune lane that borrows serving GPUs through a new GPULease CRD.
A flight recorder appeared. loom-flightdeck is a new Phoenix app: context ledger, stall board, cost attribution, and a quality lane that joins agent sessions to their GitLab MR outcomes.
The dashboards became a control plane. FlexDeck shipped a six-slice Loom section (plans, Mills, fleet, stall board), Redis-backed RBAC, and the public benchmarks API that now feeds this site's homepage — the demo-data fallback era is over.
The context ceiling was the measurement of the quarter: a model that loads at 96K tokens was coherent to 60K, degraded past 73K, and confidently wrong at 96K. Lane cards now advertise measured ceilings, not load ceilings.
No new hardware. The quarter's capacity gains came from software: GPU leasing, per-model ports, priority classes, and giving an eleven-year-old NVIDIA card a job.

The Mental Model (One Addition)

The stack is still layered, but the quarter forced me to name a layer I'd been hand-waving:

FlexInfer — run and customize models inside the cluster boundary.
Loom / Loom Core — govern how agents reach tools and context.
Mills + Plan Store + Flightdeck — turn intent into coordinated, gated, recorded work.
FlexDeck + HUD + Companion — see and operate all of it, from a browser or a phone.
fi-fhir — prove the pattern on sensitive healthcare data instead of demo data.

The addition is the word recorded. The work loop only became trustworthy when it grew a flight recorder.

Runtime: FlexInfer Stopped Being a Text Server

A year ago FlexInfer served chat completions. This quarter it picked up most of the other modalities:

Voice lane (June): Kokoro TTS behind /v1/audio/speech with a measured real-time factor of 0.207, a production Whisper ASR Model CR, and pyannote speaker diarization running on the Radeon VII — the full conversational loop validated end to end.
Image generation (2026-06-20): Stable Diffusion on a GTX 980 Ti. That's a 2015 Maxwell card that was gathering dust; it now owns the image lane with an 18-step DPM++ 2M Karras scheduler and black-image guards. Not every lane needs a flagship GPU — it needs a dedicated GPU.
RAG as a front-door route (2026-06-25): a codebase-answer service behind /v1/rag, indexing multiple repos, with a retrieval-quality gate (RQGate, 2026-06-29) so retrieval regressions fail a kill-test instead of degrading silently.
Finetuning on serving hardware (2026-06-20): a new GPULease CRD lets a training job park-and-hold a serving GPU with priority thresholds and preemption policy — QLoRA finetunes now run on the same cards that serve, without the two workloads fighting.
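The lease decision in the finetuning item can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the GPULease controller: the field names (`preempt_threshold`, `preemptible`), the workload names, and the three outcomes are hypothetical stand-ins for "priority thresholds and preemption policy".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lease:
    holder: str
    priority: int             # priority of the workload holding the GPU
    preempt_threshold: int    # hypothetical: minimum priority allowed to evict the holder
    preemptible: bool = True  # hypothetical stand-in for the preemption policy


def decide(current: Optional[Lease], requester: str, priority: int) -> str:
    """Return 'grant', 'wait', or 'preempt' for a request on one GPU."""
    if current is None or current.holder == requester:
        return "grant"        # free GPU, or the holder renewing its own lease
    if not current.preemptible:
        return "wait"
    if priority >= current.preempt_threshold and priority > current.priority:
        return "preempt"
    return "wait"
```

With a finetune job holding the card at priority 40 and a threshold of 80, a priority-60 batch job waits while a priority-100 chat lane preempts, which is the "without the two workloads fighting" property in miniature.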

Model customization matured alongside: the Gemma4 pipeline (abliteration + GPTQ quantization of the 26B-A4B MoE and 31B dense variants) got per-layer checkpoint resume, speculative decoding shipped with self-quantized 4-bit draft experts, and LoRA adapters and model catalogs became CRDs (ai.flexinfer/v1alpha2) with full lifecycle state machines instead of config conventions.

Two honesty mechanisms are worth calling out, because they're the runtime's version of quality gates:

The context ceiling. A model that loads at 96K tokens is not a model that works at 96K. On a daily-driver lane the measured boundary was coherent at 60K, degraded past 73K, confidently wrong at 96K. The context-needle-bench tool (now with depth grids and multi-needle modes) turned that from an anecdote into a lane-card policy. Details: Finding the real context ceiling.
The Goodhart guard (2026-06-26): the autotune loop can now veto its own throughput gains when they regress a workload class. An optimizer that can't be told "no" will optimize the benchmark instead of the fleet.
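The veto in the Goodhart guard reduces to a per-class comparison. The sketch below is illustrative only: the function name, the workload class labels, and the 5% tolerance are assumptions, and the real autotune loop may weigh classes differently.

```python
def guard_verdict(baseline: dict[str, float], candidate: dict[str, float],
                  max_regression: float = 0.05) -> tuple[bool, list[str]]:
    """Accept a tuning candidate only if no workload class regresses too far.

    Both dicts map workload class -> throughput (higher is better).
    Returns (accepted, classes_that_regressed).
    """
    regressed = [
        cls for cls, base in baseline.items()
        # A class missing from the candidate counts as a full regression.
        if candidate.get(cls, 0.0) < base * (1.0 - max_regression)
    ]
    aggregate_gain = sum(candidate.values()) > sum(baseline.values())
    # The aggregate gain is necessary but never sufficient.
    return (aggregate_gain and not regressed), sorted(regressed)
```

The design point is that the headline number cannot buy its way past a regression: a candidate that lifts total throughput but drops one class below tolerance is rejected, and the offending class is named.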

And the crowd-pleaser: gaming mode is real. A declarative GamingSession CR drains inference on a node and flips it into a hardware-accelerated Sunshine/Moonlight streaming host; deleting the CR hands the GPU back. The final slice (runbook, node-mode metrics, opt-in idle auto-revert) merged 2026-07-01, gated on live-migrating the always-on gemma4-26B chat primary off the node with zero chat gap. War story: Gaming Mode: a declarative GPU node. Steady-state allocation: the two-lane architecture.

Context Plane: Loom Core v0.10.0

Loom Core released v0.10.0 (2026-05-18) and kept compounding:

The canonical registry now defines 55 MCP servers (47 in February) synced across eight coding assistants, backed by ~70 Go-native MCP server binaries. The design bet from the registry post keeps paying off — every new surface this quarter attached to the same registry instead of forking config.
Weaver (rebranded from Orchestra in April) matured into the query/synthesis layer: retries with exponential backoff, per-query correlation, tool validation at startup, and its own daemon metrics.
A cross-process event bus landed (May): session lifecycle, presence transitions, and per-tool-call telemetry with per-subscriber backpressure tracking — the nervous system the HUD and Flightdeck now listen to.
The spawn driver grew a multi-turn control loop for headless agents, full telemetry (per-tool-call, cost estimation with cached-token accounting), budget enforcement, and a REST control plane reachable from the phone.
Semantic recall got faster and more resilient: a dedicated Qdrant instance on GPU-node-local NVMe for sub-millisecond embedding search, plus write-path embedding fallback to a secondary provider so context capture never blocks on a cold model.
The devbox sandbox hardened both backends (Docker and K8s): async image builds, pod lifecycle pruning, tar-pipe sync exclusions, and a quality gate that now correctly awaits async builds — the fix that unblocked Mills canaries.
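For readers who have not implemented it, the "retries with exponential backoff" behaviour in the Weaver item looks roughly like this. It is a generic sketch, not Weaver's code: the base delay, the cap, and the use of full jitter are assumptions added for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base: float = 0.2, cap: float = 5.0) -> list[float]:
    """Upper bound for each wait: base * 2**n, clamped to cap."""
    return [min(cap, base * (2 ** n)) for n in range(attempts)]


def retry(call: Callable[[], T], attempts: int = 4,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(), retrying on any exception with jittered exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    delays = backoff_delays(attempts - 1)
    for n in range(attempts):
        try:
            return call()
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time up to the exponential bound,
            # so many clients retrying at once do not stay synchronized.
            sleep(random.uniform(0, delays[n]))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the function testable without real waiting; production callers can leave the default.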

The editor and mobile surfaces kept pace: the VS Code extension shipped v1.1.0 (2026-05-04) with the Mills view, and Flightdeck stall/health-divergence surfacing and first-class Weaver queries are already merged for v1.2. Loom Companion (iOS) rebuilt Mills as a first-class tab (2026-06-29) on the frozen 18-endpoint v1 mobile API.

The Work Loop: Mills Merged Something Real

In the May post most of Mills was labeled "v2 direction." The quarter's arc, in order:

Hive v2 shipped in phases over one sprint (May 2–5): squads, audit foundation, policy gating, cross-repo registry, council debate mode, cost preview — then the rename to Mills.
First autonomous merge on real infrastructure (early June): the A2 kill-test ran the full loop on the harvester-vm spawn substrate, with RBAC scoped down from hypervisor-admin to a namespaced ServiceAccount that can touch VMs and nothing else.
Plan Store S1–S7 (June 20–27): plans became first-class, worktree-resilient entities with lifecycle, search, slice file-claims, and a markdown mirror — converged with the Mills backlog so council output, GitLab imports, and human plans land in one store. The HUD grew a Plans board where a plan or slice can be handed off to a session, an agent, or Mills from a drawer.
Pattern Loom (2026-06-28): vetted work archetypes become taste-gated templates stamped into backlog items, with engrams populated from green stamps — the system now learns from work that passed its own gates.
The guardrails arrived last, on purpose (June 30 – July 1): fail-closed infrastructure health gates, a real-work merge KPI (mills_autonomous_merges_real, excluding canary heartbeats), an alert that fires if only canaries are merging, and an autonomy circuit breaker enforced in the fan-out pipeline paths. The "manager who can stop the line" from the May post is now code with a runbook.
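The three guardrails in the last item compose in a simple way. The sketch below is an assumption-laden illustration, not the Mills implementation: the `Merge` record, the consecutive-failure trip rule, and the `allow_fanout` name are all hypothetical. Only the ideas (exclude canaries from the KPI, alert on canary-only merging, fail closed) come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Merge:
    item_id: str
    is_canary: bool  # canary heartbeats prove the loop runs, not that it does work


def real_merges(merges: list[Merge]) -> int:
    """KPI: count merges that are not canary heartbeats."""
    return sum(1 for m in merges if not m.is_canary)


def canary_only_alert(merges: list[Merge]) -> bool:
    """Fire when the loop is merging, but nothing it merges is real work."""
    return len(merges) > 0 and real_merges(merges) == 0


class CircuitBreaker:
    """Trips open after `threshold` consecutive failures; stays open until reset."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0
        self.open = False

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1
        if self.failures >= self.threshold:
            self.open = True

    def reset(self) -> None:
        self.failures = 0
        self.open = False

    def allow_fanout(self, infra_healthy: Optional[bool]) -> bool:
        # Fail closed: unknown health (None) is treated the same as unhealthy.
        return (not self.open) and infra_healthy is True
```

Note that a success after the breaker has tripped does not close it again; in this sketch only an explicit `reset()` does, which matches the idea of a human (or a runbook) restarting the line.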

Alongside Mills, loom-flightdeck emerged as a separate Phoenix app — the flight recorder: a context ledger with cost attribution and session drill-down, a stall board (which the HUD and FlexDeck now both surface as an "N blocked" signal), continuous ingestion of session corpora, and a quality lane that labels per-session process friction and joins sessions to their GitLab MR outcomes. It even runs counterfactual scoring benchmarks — predicted-vs-actual verification of its own labels.

What I still owe: the Mills metrics report. Time from backlog item to merged change, cost per merged change, gate pass rate, post-merge regression rate. The instrumentation now exists end to end; the honest write-up is next quarter's job, after enough real (non-canary) runs accumulate.
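To pin down what those four numbers mean before the report exists, here is one possible set of definitions. Everything in it is an assumption: the `Run` record and its fields are hypothetical, and the choice to charge failed attempts to the changes that did merge is one convention among several.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Run:
    created: datetime            # backlog item picked up
    merged: Optional[datetime]   # None if the change never merged
    cost_usd: float
    gates_passed: bool
    regressed_after_merge: bool = False


def report(runs: list[Run]) -> dict[str, float]:
    """Compute the four headline metrics over a batch of real (non-canary) runs."""
    merged = [r for r in runs if r.merged is not None]
    if not merged:
        raise ValueError("need at least one merged change")
    hours = [(r.merged - r.created).total_seconds() / 3600 for r in merged]
    return {
        "median_hours_to_merge": median(hours),
        # Failed attempts still cost money, so they are charged to the merges that landed.
        "cost_per_merged_change": sum(r.cost_usd for r in runs) / len(merged),
        "gate_pass_rate": sum(r.gates_passed for r in runs) / len(runs),
        "post_merge_regression_rate": sum(r.regressed_after_merge for r in merged) / len(merged),
    }
```

The median is used for time-to-merge because a single stalled item would otherwise dominate a mean; that too is a choice, not something the post commits to.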

Visibility: The Dashboards Grew a Control Plane

FlexDeck had its biggest quarter since the original visualization work:

A unified Loom section shipped in six slices (2026-06-30 → 07-01): plans, Mills, fleet/projects consolidation, Flightdeck stall board + context ledger, and RBAC-gated Mills ops controls. One pane of glass over the whole work loop.
RBAC went live (mid-June): Redis-backed durable user store, enforced login, rotated tokens. The dashboard stopped trusting the LAN.
A Stack Explorer (early June) inventories the workspace from the GitLab API: library adoption, workload health classification, and verification that what's declared in Flux actually binds in the cluster.
The public benchmarks API shipped (2026-06-27) — ConfigMap-served with value+unit passthrough for the non-LLM lanes. This site's homepage benchmark panel now shows live fleet numbers; the demo-data fallback it launched with is finally just a fallback.

The HUD itself got the instrument-panel overhaul (dark-ocean theme, April), degraded-state banners driven by fleet snapshots, session grouping by conversation, and Claude Code chapter ingestion — long sessions now render as navigable chapters instead of one scroll.

MentatLab and fi-fhir: Held Still, Mostly On Purpose

MentatLab wasn't frozen — checkpoint-resume landed in the orchestrator and the M16.6 visual-token pass (2026-06-23) standardized the Mission Control shell — but it's been feature-stable since early May. It remains the DAG design surface; its next push likely comes when Mills pipelines want richer visual run control than the HUD provides.

fi-fhir held steady: Source Profiles, the HL7v2/FHIR pipeline, and the docs/playground surfaces are unchanged. When the layers under a healthcare proof path are churning this hard, stability is the contribution.

Infrastructure: No New Hardware, Lots of New Leverage

The cluster is 22 declared K3s nodes plus the Harvester substrate, and the GPU fleet is the same four personal machines it was in April. Everything below was software:

Flux image automation rolled out to seven services — CI builds a timestamp-tagged image, Flux notices, commits the bump, and deploys. Deployment lag went from hours to minutes, and it's why the GitOps commit count looks inflated: the robots are committing.
The MCP gateway WebSocket fix (2026-06-22) ended a chronic failure mode: ingress-nginx reloads were severing long-lived WebSocket connections, causing synchronized close-1006 bursts across 17 MCP servers every time a pod rolled. The /ws route now goes straight from the Cloudflare tunnel to the gateway, bypassing ingress reload churn entirely.
Off-LAN operations got real: mills.flexinfer.ai and the MCP gateway are reachable through Cloudflare Access with service tokens — the autonomous loop can be supervised (and stopped) from anywhere.
Longhorn grew restore-state guardrails (volume lifecycle classification, stale-replica gates) after an etcd incident forced some live node re-pooling mid-June. The cluster got tended, not just used.
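The Flux image automation item at the top of this list depends on one small decision: which tag counts as "newest". The sketch below shows the filter-then-order idea in plain Python. It is not Flux configuration, and the tag scheme (`main-` followed by a 14-digit timestamp) is a hypothetical example rather than the scheme this cluster uses.

```python
import re
from typing import Iterable, Optional

# Hypothetical tag scheme: "main-<YYYYMMDDHHMMSS>", e.g. "main-20260701093000".
TAG_PATTERN = re.compile(r"^main-(\d{14})$")


def latest_tag(tags: Iterable[str]) -> Optional[str]:
    """Pick the tag with the highest embedded timestamp; ignore anything else."""
    best: Optional[tuple[int, str]] = None
    for tag in tags:
        match = TAG_PATTERN.match(tag)
        if match is None:
            continue  # e.g. "latest" or a feature-branch build
        candidate = (int(match.group(1)), tag)
        if best is None or candidate > best:
            best = candidate
    return best[1] if best else None
```

Timestamp tags make this ordering unambiguous where a mutable tag like `latest` cannot, which is why the automation can commit the bump without a human choosing the version.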

Operational Reality: The Gates Fired

One incident this week is the platform philosophy eating its own cooking. Every temporary risk acceptance in this site's CI — trivy acceptances and kube-linter waivers — carries an expiry date. On 2026-06-30 all of them lapsed at once; on 2026-07-01 the security gate went red for every merge request, including an unrelated copy change. Zero new findings.

Two things happened next, and both were the system working: the lapse forced the actual dependency-upgrade slice to ship the same day, and the remaining acceptances were re-dated one month out — not indefinitely — so this fires again in August if the tracked issues don't close. An allowlist without a date is just a decision nobody has to defend again.
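The mechanism is small enough to show. The following is a sketch of a dated-acceptance check, assuming a hypothetical `Waiver` record and made-up finding identifiers; it is not the CI job itself, and it treats the expiry date as the last valid day.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Waiver:
    finding_id: str  # hypothetical identifier: a scanner finding or linter check name
    expires: date    # last day the acceptance is valid


def gate(findings: set[str], waivers: list[Waiver], today: date) -> list[str]:
    """Return the findings that block the merge; an empty list means green."""
    active = {w.finding_id for w in waivers if today <= w.expires}
    return sorted(findings - active)
```

With every waiver dated 2026-06-30, the same unchanged set of findings passes on June 30 and blocks on July 1. That is the "zero new findings, gate red" behaviour described above: nothing about the code changed, only the date.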

What I'm Watching Next Quarter

The Mills metrics report — the numbers from "What I'll Watch," published honestly, including the ugly ones.
Real-work merge rate — the KPI exists; now the question is whether the line trends up without the circuit breaker tripping.
Voice + RAG lanes graduating from "validated end to end" to "daily drivers with SLOs," the way the text lanes did last year.
Loom extension v1.2 — Flightdeck and Weaver surfaces are merged and waiting.
MentatLab ↔ Mills — whether DAG run control becomes the operator surface for Mills pipelines.
The August waiver expiry — either the tracked issues close or I write another paragraph like the one above.

The theme, more than ever: less "more agents doing more things," more "a visible floor where every promotion has evidence — and a recorder that remembers what actually happened."
