
Building Practical AI Agents

November 15, 2025 · 4 min read

lab · agents · llm · workflows · loom-core · mentatlab

“AI agents” is overloaded. A lot of demos are a prompt plus a loop plus vibes. I care about agents that close a real loop:

  • produce an artifact (a summary, a report, a PR),
  • verify it against explicit criteria,
  • and revise until it’s acceptable (or fail loudly with a reason).

If you can’t explain the loop, instrument it, and test it with real inputs, you don’t have an agent. You have a generator.

TL;DR

  • Make the loop explicit: draft -> verify -> revise with a max revision budget.
  • Verify against something concrete: schemas, checklists, citations, invariants, or policy gates.
  • Log state transitions and the reasons for revisions. “It got better” is not a debug strategy.
  • Treat tools as untrusted I/O: timeouts, retries, and bounded outputs.
  • Keep the agent state visible to operators. If work, blockers, and memory are opaque, the system will feel smarter than it is.
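"Treat tools as untrusted I/O" can be a one-screen wrapper. A minimal sketch (names hypothetical; the elapsed-time check only flags slow calls after the fact, a real deadline needs a thread or subprocess):

```python
import time

def call_tool(tool, *args, retries=3, timeout_s=10.0, max_output_chars=20_000):
    """Call an untrusted tool with bounded retries, a budget, and bounded output."""
    last_error = None
    for attempt in range(retries):
        start = time.monotonic()
        try:
            result = tool(*args)
        except Exception as exc:  # untrusted tool: any failure is retryable
            last_error = exc
            continue
        if time.monotonic() - start > timeout_s:
            # too slow: record it and retry rather than trusting a stale result
            last_error = TimeoutError(f"tool exceeded {timeout_s}s on attempt {attempt + 1}")
            continue
        return str(result)[:max_output_chars]  # bound the output size
    raise RuntimeError(f"tool failed after {retries} attempts: {last_error}")
```

The point is that every failure mode has a name: a retry count, a timeout reason, a truncation. Those are the things worth logging.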

The loop matters more than the model

The model choice matters. The loop matters more.

When an agent is reliable, it is usually because the control structure is boring:

  • the input shape is constrained,
  • the output artifact is inspectable,
  • verification is explicit,
  • retries are bounded,
  • and failure is visible.

That pattern works whether the artifact is a summary, a docs patch, a deployment plan, or a DAG run in an operator UI.

Here is the smallest useful version:

from typing import TypedDict

class SpecState(TypedDict):
    task: str
    draft: str
    verified: str
    needs_revision: bool
    revision_count: int
    max_revisions: int

from langgraph.graph import StateGraph, END

def _build_graph(self):
    workflow = StateGraph(SpecState)

    workflow.add_node("draft_step", self.draft_node)
    workflow.add_node("verify_step", self.verify_node)
    workflow.add_node("revise_step", self.revise_node)

    workflow.set_entry_point("draft_step")
    workflow.add_edge("draft_step", "verify_step")
    workflow.add_conditional_edges(
        "verify_step",
        self.should_revise,
        {"revise": "revise_step", "end": END}
    )
    workflow.add_edge("revise_step", "verify_step")

    return workflow.compile()

The point is not LangGraph specifically. The point is that the behavior is a state machine instead of an improvised conversation.
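The conditional edge above routes through a `should_revise` function the snippet doesn't show. A minimal sketch, consistent with the `SpecState` fields:

```python
def should_revise(self, state: dict) -> str:
    """Route back to revision only while the budget allows it."""
    if state["needs_revision"] and state["revision_count"] < state["max_revisions"]:
        return "revise"
    return "end"  # approved, or out of budget: stop and surface what we have
```

The budget check is the part that makes the loop a state machine rather than a potential infinite conversation.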

Current examples from my stack

This pattern now shows up in a few different places across the projects on this site:

1) News Analyzer: the pipeline is the product

The News Analyzer demo is the most literal example. It is an agentic content pipeline:

  • collect inputs,
  • extract content,
  • draft a summary,
  • enrich or verify,
  • and deliver an artifact.

That is useful because the artifact is concrete and the pipeline stages are visible. You can inspect where it failed instead of treating the whole system like magic.

2) Loom Core: agent work becomes observable

In Loom Core, the interesting part is not “the agent can do things.” The interesting part is that sessions, tasks, memory, workflows, and server health are all visible in the HUD.

That matters operationally. Once agents touch repos, infra, or long-running workflows, you need to know:

  • what the agent believes it is doing,
  • what is blocked,
  • what tool is slow,
  • and what decision was recorded.

Without that, the agent may still be productive, but it won’t be trustworthy.

3) MentatLab: orchestration is an operator surface

MentatLab takes the same core idea and puts it in an operator-facing DAG workflow surface. That is a very different UX from a chat window, but the reliability pattern is the same:

  • explicit steps,
  • clear handoffs,
  • visible state,
  • approvals where needed,
  • and an artifact at the end.

If a system affects real workflows, I want to be able to point at the graph and say where the risk lives.

Verification is the gate, not the epilogue

The reliability gain comes from the verification step. The verification model acts as a critic: it checks a concrete rubric and either approves or returns actionable feedback.

from langchain_core.messages import SystemMessage, HumanMessage

def verify_node(self, state: SpecState) -> SpecState:
    """Quality verification using orchestrator model."""
    response = self.verify_model.invoke([
        SystemMessage(content="Review this draft. If acceptable, output APPROVED. If issues, output REVISE: <feedback>"),
        HumanMessage(content=f"Task: {state['task']}\n\nDraft:\n{state['draft']}")
    ])

    if "APPROVED" in str(response.content):
        return {"verified": state['draft'], "needs_revision": False}
    return {"needs_revision": True, "verified": str(response.content)}

This is the difference between “pretty output” and “a system you can lean on at 6am.”
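For completeness, the revise step consumes the feedback that `verify_node` stored in `verified` and counts the attempt against the budget. A sketch, assuming `revise_model` is a plain prompt-in, text-out callable rather than the LangChain message interface above:

```python
def revise_node(self, state: dict) -> dict:
    """Apply the verifier's feedback and count the attempt against the budget."""
    feedback = state["verified"]  # verify_node stored the REVISE feedback here
    revised = self.revise_model(
        f"Task: {state['task']}\n\nDraft:\n{state['draft']}\n\nFeedback:\n{feedback}"
    )
    return {"draft": revised, "revision_count": state["revision_count"] + 1}
```

Incrementing `revision_count` here, rather than in the verifier, keeps the budget tied to actual revision work.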

The verifier does not need to be an LLM every time. In good systems, verification is often cheaper and stricter than generation:

  • JSON schema validation
  • required citations
  • contract checks
  • diff review rules
  • policy or RBAC checks
  • bounded retries with a hard stop

The key is that approval has a shape. “Looks okay” is not a shape.
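"Approval has a shape" can be literal. A sketch of a cheap, strict structural verifier (the field names and limits are hypothetical):

```python
def verify_report(artifact: dict) -> tuple[bool, list[str]]:
    """Approve only if the artifact satisfies explicit structural criteria."""
    problems = []
    for field in ("title", "summary", "citations"):
        if field not in artifact:
            problems.append(f"missing field: {field}")
    if not artifact.get("citations"):
        problems.append("at least one citation is required")
    if len(artifact.get("summary", "")) > 2000:
        problems.append("summary exceeds 2000 characters")
    return (not problems, problems)
```

Checks like this cost nothing to run, and the `problems` list doubles as the revision feedback.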

What I instrument now

If I expect an agent to do real work, I want enough telemetry to debug it like a service:

  • state transitions
  • revision count and revision reasons
  • tool latency and failure rate
  • artifact size and validation failures
  • stop conditions and timeout reasons

This is also why I like bounded tools and visible task queues. When the artifact is bad, I want to know whether the problem was the model, the tool, the policy, or the workflow design.
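A minimal version of that telemetry is one structured record per state transition, so a run can be replayed later. A sketch (event field names are my own):

```python
import json
import logging
import time

logger = logging.getLogger("agent")

def log_transition(run_id: str, from_node: str, to_node: str, **fields) -> dict:
    """Emit one structured record per state transition."""
    record = {
        "ts": time.time(),
        "run_id": run_id,
        "from": from_node,
        "to": to_node,
        **fields,  # e.g. revision_count, revision_reason, tool_latency_ms
    }
    logger.info(json.dumps(record))
    return record
```

With records like this, "why did run X revise three times?" becomes a query instead of an archaeology project.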

Many “agent” products are really RAG pipelines with a nicer UI. That can still be useful. But it is a different thing. The practical distinction for me is simple:

  • if the system can verify and revise against explicit criteria, it starts to behave like an agent,
  • if it only retrieves context and generates once, it behaves like a generator with better inputs.

Concrete practices that helped:

  1. Start simple: small state + small tools beats a mega-prompt.
  2. Make the artifact inspectable: summaries, plans, PRs, tickets, and DAG steps are easier to debug than vibes.
  3. Test with real tasks: synthetic tasks hide the failure modes you care about.
  4. Be honest about models: smaller models are great for drafting and extraction; verification often needs a stricter model or a narrower rubric.
  5. Keep operator state visible: HUDs, task lists, traces, and workflow graphs are not polish. They are safety features.
