FlexInfer
FlexInfer is a Kubernetes-native inference control plane for private and hybrid AI workloads. It manages model lifecycle, GPU-aware scheduling, serverless activation, and OpenAI-compatible routing without pushing runtime control out to a shared SaaS layer.
Runtime control you can actually operate
The current product surface is broader than the old site copy suggests. FlexInfer now covers lifecycle, routing, scheduling, artifact delivery, and advanced runtime features on one control plane.
Serverless OpenAI endpoints
Expose model backends behind OpenAI-compatible APIs with cold-start handling and routing controls built in.
Operationally legible rollout
Use Helm, GitOps, Prometheus, and explicit runtime controls instead of hidden side-channel deployment logic.
Mixed-GPU pragmatism
Support AMD, NVIDIA, and CPU-oriented paths with backend-specific images, tuning, and placement constraints.
Advanced runtime surface
Ship quantization, LoRA, OCI catalogs, image generation, and federation on the same control-plane model.
What the platform covers today
These capability clusters map directly to the current upstream FlexInfer docs and roadmap, not to an aspirational product sketch.
Runtime lifecycle
FlexInfer treats model runtime as a first-class Kubernetes workload instead of an ad hoc pod template.
- Single v1alpha2 Model CRD for model lifecycle management
- Backend plugins for vLLM, MLC-LLM, llama.cpp, Ollama, diffusers, ComfyUI, and related runtimes
- Cache-gated rollout so pods do not become ready before artifacts are prepared
- Flash-loader preload path for faster model staging from PVC to tmpfs
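As a concrete illustration of the lifecycle model above, the sketch below builds a minimal Model custom resource as a Python dict. The API group, field names, and the `cacheGate` flag are illustrative assumptions, not the authoritative v1alpha2 schema.

```python
# Hedged sketch: a v1alpha2 Model custom resource expressed as a Python dict.
# Group name, spec fields, and values are assumptions for illustration only.

def make_model_cr(name: str, backend: str, source: str) -> dict:
    """Build a minimal Model manifest for a given backend plugin."""
    return {
        "apiVersion": "flexinfer.example/v1alpha2",  # assumed API group
        "kind": "Model",
        "metadata": {"name": name},
        "spec": {
            "backend": backend,   # e.g. "vllm", "llama.cpp", "ollama"
            "source": source,     # artifact location staged before readiness
            "cacheGate": True,    # assumed flag: hold readiness until cached
        },
    }

cr = make_model_cr("llama-3-8b", "vllm", "pvc://models/llama-3-8b")
print(cr["spec"]["backend"])
```

A manifest like this would normally be applied with standard tooling (kubectl, Helm, GitOps), which is the point of the cache-gated, CRD-first design.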
Routing and activation
The proxy surface is designed for real application traffic, not just lab demos.
- OpenAI-compatible proxy endpoint for chat, completions, embeddings, and image flows
- Scale-to-zero activation with queueing, cold-start budgets, and bounded retries
- Routing strategies for session affinity, prefix locality, and least-loaded dispatch
- Multipart request handling for image-editing flows, with model-aware request extraction
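The scale-to-zero behavior in the list above can be sketched in a few lines: queue the request, wake the backend, wait within a cold-start budget, and retry a bounded number of times. Function names and timings here are illustrative, not FlexInfer's actual implementation.

```python
import time

# Hedged sketch of scale-to-zero activation with a cold-start budget and
# bounded retries. All names and numbers are illustrative assumptions.

def activate_and_dispatch(is_ready, wake, budget_s=2.0, retries=3, poll_s=0.01):
    """Wake a scaled-to-zero backend and wait until it reports ready."""
    for _ in range(retries):
        wake()                        # ask the control plane to scale up
        deadline = time.monotonic() + budget_s
        while time.monotonic() < deadline:
            if is_ready():
                return True           # backend warm: dispatch the request
            time.sleep(poll_s)
    return False                      # budget exhausted: surface an error upstream

# Toy backend that becomes ready only after the second wake call.
state = {"wakes": 0}
ok = activate_and_dispatch(lambda: state["wakes"] >= 2,
                           lambda: state.update(wakes=state["wakes"] + 1),
                           budget_s=0.05)
print(ok)  # True: second retry finds the backend warm
```

The bounded loop is what keeps cold starts from turning into unbounded queueing under load.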
GPU-aware placement
Placement decisions combine node facts, runtime constraints, and model demand instead of static node pinning alone.
- Node agent labels GPU vendor, architecture, VRAM, and capacity hints
- Scheduler extender scores nodes with benchmark results and runtime telemetry
- Shared GPU groups support priority-based preemption and anti-thrashing controls
- KV-cache and free-VRAM hints help reduce avoidable placement mistakes under load
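The placement signals above can be combined into a single ranking, as in this sketch: filter out nodes where the model does not fit, then weight benchmark throughput, VRAM headroom, and KV-cache locality. The weights and field names are illustrative assumptions, not the scheduler extender's real scoring formula.

```python
# Hedged sketch of GPU-aware node scoring. Weights and fields are assumptions.

def score_node(node: dict, model_vram_gb: float) -> float:
    if node["free_vram_gb"] < model_vram_gb:
        return float("-inf")          # hard filter: model does not fit
    headroom = node["free_vram_gb"] - model_vram_gb
    return (node["benchmark"] * 1.0   # measured throughput on this GPU class
            + headroom * 0.5          # prefer headroom to avoid thrashing
            + (10.0 if node["kv_cache_warm"] else 0.0))  # prefix locality bonus

nodes = [
    {"name": "amd-mi300", "free_vram_gb": 64, "benchmark": 80, "kv_cache_warm": False},
    {"name": "nv-a100",   "free_vram_gb": 24, "benchmark": 70, "kv_cache_warm": True},
    {"name": "nv-t4",     "free_vram_gb": 8,  "benchmark": 30, "kv_cache_warm": False},
]
best = max(nodes, key=lambda n: score_node(n, model_vram_gb=16))
print(best["name"])  # amd-mi300: raw throughput and headroom outweigh the warm cache
```

Separating the hard VRAM filter from the soft scoring terms mirrors how Kubernetes scheduler extenders split filtering from prioritization.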
Model supply chain and advanced features
The platform now ships several higher-order capabilities that the site barely mentions today.
- Quantization pipelines and validation for GGUF, AWQ, GPTQ, EXL2, and FP8
- OCI ModelCatalog support for Harbor, GHCR, and ECR-backed model delivery
- LoRAAdapter hot-swap for vLLM-based adapter workflows
- Cluster, FederatedModel, and GlobalProxy resources for multi-cluster execution
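To make the OCI catalog idea concrete, here is a minimal sketch of splitting an OCI-style model reference of the kind a ModelCatalog entry might point at. This is generic registry/repository/tag parsing (digest references are not handled), not FlexInfer's actual catalog code.

```python
# Hedged sketch: parse an OCI-style model reference into its parts.
# Handles tag references only; "@sha256:..." digest refs are out of scope.

def parse_oci_ref(ref: str) -> dict:
    host, _, rest = ref.partition("/")
    repo, _, tag = rest.partition(":")
    return {"registry": host, "repository": repo, "tag": tag or "latest"}

print(parse_oci_ref("ghcr.io/acme/llama-3-8b:q4_k_m"))
```

The same reference shape works against Harbor, GHCR, or ECR, which is what lets one ModelCatalog abstraction sit in front of all three.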
Control plane shape, delivery loop, and operator visibility
The runtime story is now explicit: one view for the control plane and its in-cluster serving boundary, one for the Git-driven rollout loop, and one for the observability stack that keeps the service legible under load.
Control plane, activation, and runtime boundary
Current product shape: CRDs and policy at the control plane, OpenAI-compatible proxying at the edge, and GPU-aware model execution inside the cluster.
Deployment and operations loop
Git-driven delivery path for GPU workloads and product services running on the platform.
Inference observability stack
Metrics layers needed to keep inference performance, GPU health, and application behavior visible in production.
Where FlexInfer fits best
FlexInfer is a strong fit when the runtime boundary matters as much as the model itself: private data, mixed GPU hardware, predictable rollouts, and product teams that want normal ops tooling instead of bespoke scripts.
Private or hybrid AI deployments, internal platforms, and teams running sensitive workloads on Kubernetes.
Helm-friendly, GitOps-compatible, and instrumented enough to support real rollout, rollback, and troubleshooting loops.
Text, embeddings, image-generation, quantized artifacts, and multi-cluster execution all sit on the same product surface.
Context and orchestration around the runtime
FlexInfer is the runtime anchor. Loom Core and MentatLab cover the adjacent context-governance and operator-UX layers.
Loom Core
Use Loom Core when agent and tool access need policy-aware context routing around the runtime layer FlexInfer provides.
Loom Core product →
MentatLab
Use MentatLab when operators need DAG orchestration UX and run visibility on top of the private platform stack.
MentatLab product →