Product

FlexInfer

FlexInfer is a Kubernetes-native inference control plane for private and hybrid AI workloads. It manages model lifecycle, GPU-aware scheduling, serverless activation, and OpenAI-compatible routing without pushing runtime control out to a shared SaaS layer.

Why it matters

Runtime control you can actually operate

FlexInfer covers model lifecycle, routing, scheduling, artifact delivery, and advanced runtime features on a single control plane.

Serverless OpenAI endpoints

Expose model backends behind OpenAI-compatible APIs with cold-start handling and routing controls built in.

Operationally legible rollout

Use Helm, GitOps, Prometheus, and explicit runtime controls instead of hidden side-channel deployment logic.

Mixed-GPU pragmatism

Support AMD, NVIDIA, and CPU-oriented paths with backend-specific images, tuning, and placement constraints.

Advanced runtime surface

Ship quantization, LoRA, OCI catalogs, image generation, and federation on the same control-plane model.

Feature map

What the platform covers today

These feature clusters map directly to the current upstream FlexInfer docs and roadmap, not an aspirational product sketch.

Shipped surface

Runtime lifecycle

FlexInfer treats model runtime as a first-class Kubernetes workload instead of an ad hoc pod template.

  • Single v1alpha2 Model CRD for model lifecycle management
  • Backend plugins for vLLM, MLC-LLM, llama.cpp, Ollama, diffusers, ComfyUI, and related runtimes
  • Cache-gated rollout so pods do not become ready before artifacts are prepared
  • Flash-loader preload path for faster model staging from PVC to tmpfs
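
As an illustration, a Model resource might look like the sketch below. This is hypothetical: the API group, field names, and values are assumptions for illustration, not FlexInfer's documented schema.

```yaml
# Hypothetical Model manifest; the group name and spec fields are
# illustrative assumptions, not the documented schema.
apiVersion: flexinfer.example/v1alpha2
kind: Model
metadata:
  name: llama-3-8b
spec:
  backend: vllm                 # one of the backend plugins listed above
  source: hf://meta-llama/Meta-Llama-3-8B-Instruct
  cacheGate: true               # pods stay unready until artifacts land
  preload:
    flashLoader: true           # stage weights from PVC into tmpfs
```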
Shipped surface

Routing and activation

The proxy surface is designed for real application traffic, not just lab demos.

  • OpenAI-compatible proxy endpoint for chat, completions, embeddings, and image flows
  • Scale-to-zero activation with queueing, cold-start budgets, and bounded retries
  • Routing strategies for session affinity, prefix locality, and least-loaded dispatch
  • Multipart request handling for image-editing and model-aware request extraction
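
The interplay of session affinity and least-loaded dispatch can be sketched in a few lines. This is a minimal illustration of the routing idea, not FlexInfer's actual code; the field names and tie-breaking rule are assumptions.

```python
# Sketch of least-loaded dispatch with session affinity. Replica fields
# and the KV-cache tie-breaker are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    in_flight: int = 0          # requests currently being served
    kv_cache_free: float = 1.0  # fraction of KV cache still free

def pick_replica(replicas, session_id=None, affinity=None):
    """Prefer the replica pinned to this session; otherwise pick the
    least-loaded replica, breaking ties on free KV cache."""
    affinity = affinity if affinity is not None else {}
    if session_id in affinity:
        return affinity[session_id]
    best = min(replicas, key=lambda r: (r.in_flight, -r.kv_cache_free))
    if session_id is not None:
        affinity[session_id] = best  # pin future requests in this session
    return best

replicas = [Replica("vllm-0", in_flight=3), Replica("vllm-1", in_flight=1)]
affinity = {}
first = pick_replica(replicas, "sess-a", affinity)   # least loaded: vllm-1
second = pick_replica(replicas, "sess-a", affinity)  # affinity keeps vllm-1
```

Affinity keeps a session on the replica that already holds its prefix in KV cache; new sessions fall through to the load-based score.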
Shipped surface

GPU-aware placement

Placement decisions combine node facts, runtime constraints, and model demand instead of static node pinning alone.

  • A node agent labels each node with GPU vendor, architecture, VRAM, and capacity hints
  • Scheduler extender scores nodes with benchmark results and runtime telemetry
  • Shared GPU groups support priority-based preemption and anti-thrashing controls
  • KV-cache and free-VRAM hints help reduce avoidable placement mistakes under load
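
The node-agent labels and a constraint that consumes them might look like the following. The label keys are hypothetical; only the mechanism (labels published by an agent, consumed by the scheduler) comes from the list above.

```yaml
# Hypothetical labels the node agent might publish; the key names under
# flexinfer.example/ are illustrative assumptions.
apiVersion: v1
kind: Node
metadata:
  name: gpu-worker-1
  labels:
    flexinfer.example/gpu-vendor: nvidia
    flexinfer.example/gpu-arch: ampere
    flexinfer.example/gpu-vram-gb: "24"
```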
Shipped surface

Model supply chain and advanced features

Beyond basic serving, the platform ships several higher-order supply-chain and runtime capabilities.

  • Quantization pipelines and validation for GGUF, AWQ, GPTQ, EXL2, and FP8
  • OCI ModelCatalog support for Harbor, GHCR, and ECR-backed model delivery
  • LoRAAdapter hot-swap for vLLM-based adapter workflows
  • Cluster, FederatedModel, and GlobalProxy resources for multi-cluster execution
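
A ModelCatalog pointing at an OCI registry might be declared as follows. This is a hedged sketch: the API group, field names, and registry path are assumptions, not the documented schema.

```yaml
# Hypothetical ModelCatalog manifest; field names and the registry path
# are illustrative assumptions, not the documented schema.
apiVersion: flexinfer.example/v1alpha2
kind: ModelCatalog
metadata:
  name: team-models
spec:
  registry: ghcr.io/acme/models   # Harbor, GHCR, and ECR-backed delivery
  formats: [gguf, awq, fp8]       # quantized artifact types to index
```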
Architecture

Control plane shape, delivery loop, and operator visibility

The runtime story is explicit across three views: one for the control plane and in-cluster serving boundary, one for Git-driven rollout, and one for the observability stack that keeps the service legible under load.

Architecture

Control plane, activation, and runtime boundary

Current product shape: CRDs and policy at the control plane, OpenAI-compatible proxying at the edge, and GPU-aware model execution inside the cluster.

FlexInfer architecture diagram showing applications hitting an OpenAI-compatible proxy, control-plane services managing model state and routing policy, and GPU worker nodes serving inference workloads inside the cluster.
Architecture

Deployment and operations loop

Git-driven delivery path for GPU workloads and product services running on the platform.

GitOps flow from GitLab through CI and Flux into a K3s cluster on Harvester with GPU nodes and product workloads.
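
In Flux terms, the delivery loop can reconcile FlexInfer itself as a HelmRelease. The HelmRelease API is standard Flux; the chart name, namespace, and repository reference below are assumptions for illustration, not published artifacts.

```yaml
# Flux HelmRelease sketch; chart name, namespace, and sourceRef are
# illustrative assumptions, not published artifacts.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: flexinfer
  namespace: flexinfer-system
spec:
  interval: 10m
  chart:
    spec:
      chart: flexinfer
      sourceRef:
        kind: HelmRepository
        name: flexinfer
```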
Architecture

Inference observability stack

Metrics layers needed to keep inference performance, GPU health, and application behavior visible in production.

Layered observability diagram connecting application signals, inference metrics, GPU telemetry, and infrastructure monitoring.
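
With the Prometheus Operator, the inference-metrics layer can be wired up with a ServiceMonitor. The API kind is standard; the selector labels and port name are assumptions about how FlexInfer services expose metrics.

```yaml
# ServiceMonitor sketch (Prometheus Operator API); the selector labels
# and port name are illustrative assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: flexinfer-inference
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: flexinfer
  endpoints:
    - port: metrics
      interval: 15s
```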
Fit

Where FlexInfer fits best

FlexInfer is a strong fit when the runtime boundary matters as much as the model itself: private data, mixed GPU hardware, predictable rollouts, and product teams that want normal ops tooling instead of bespoke scripts.

Best for

Private or hybrid AI deployments, internal platforms, and teams running sensitive workloads on Kubernetes.

Operational model

Helm-friendly, GitOps-compatible, and instrumented enough to support real rollout, rollback, and troubleshooting loops.

Runtime breadth

Text, embeddings, image-generation, quantized artifacts, and multi-cluster execution all sit on the same product surface.

Related surfaces

Context and orchestration around the runtime

FlexInfer is the runtime anchor. Loom Core and MentatLab cover the adjacent context-governance and operator-UX layers.

Companion surface

Loom Core

Use Loom Core when agent and tool access need policy-aware context routing around the runtime layer FlexInfer provides.

Loom Core product
Companion surface

MentatLab

Use MentatLab when operators need DAG orchestration UX and run visibility on top of the private platform stack.

MentatLab product