What is the haimaker.ai auto-router?

It is a routing layer you reach by sending model: 'haimaker/auto'. For each request it checks what the prompt actually needs (vision, tool use, long context), matches the prompt against rules you define by example, and sends it to the right model. Simple work goes to cheap, fast models; hard work goes to frontier models. You change one line of code and the router handles the rest.

How much can the auto-router save on LLM costs?

Most teams send 80 to 90% of their traffic to a frontier model when a cheaper model would handle it fine. Routing that traffic to the right tier commonly cuts spend by a large margin while keeping quality on the requests that matter. The auto-router also defaults to the cheapest capable model on fallback and proposes strictly-cheaper rules from your own traffic, so it can only lower your bill, never raise it.

Is the auto-router good for AI agents and high-volume workloads?

Yes. Agentic frameworks like Hermes Agent and OpenClaw fire hundreds of repeated calls a day (heartbeats, cron jobs, classification steps, tool loops), and most of those are simple work paying frontier prices. The auto-router detects those repeated clusters and routes them to cheaper models automatically, which is where the largest savings show up.

Does the auto-router add latency or call an LLM to decide?

No. Routing is fully deterministic and runs in-process with a version-pinned embedding model, with no network call, no LLM, and nothing random in the request path. Identical prompts hit an exact-match cache and skip embedding entirely. The only LLM involved is an offline tuner that runs once a day, never while your request is in flight.

Auto-Router, Improved: Cut Your LLM Bill Without Touching Your Code

You are almost certainly overpaying for your LLM calls, not because you picked the wrong model, but because you picked one model for everything. “What is the capital of France” runs on the same frontier model as a multi-file refactor, and you pay frontier prices for both.

Haimaker’s auto-router fixes that with one line of code. Set model: "haimaker/auto", and every request goes to the right model: cheap, fast models for the easy work, frontier models for the hard work. Your code does not change, and quality on the requests that matter does not move.

The improved auto-router makes that routing sharper, makes the default cheaper, and adds the part we are most excited about: it now learns cost-saving rules from your own traffic, finding savings you would never have hunted down by hand.

Save more without changing stacks

Most teams route everything to a frontier model out of caution, and 80 to 90% of those requests would run just as well on something cheaper. The auto-router is built so it can only ever save you money: when it falls back, it picks the cheapest capable model, not the biggest one, and the rules it proposes from your traffic are always strictly cheaper than what you pay today. There is no setup where it quietly costs you more.

Right model, no quality loss

Cheaper only counts if the answers stay good. Before picking anything, the router checks what the request needs: an image forces a vision model, a long document forces one with room for it, a tool call forces one that handles tools. Then it matches the prompt against your rules so coding prompts land on a coding model and analysis prompts land on a reasoning model. Hard work never gets dumped on a weak model to shave a few cents.

You teach the router by example, not by keyword. Give a rule a few sample prompts and it matches new ones by meaning, so “fix this bug in my code” and “debug this function” land together even though they share no words. No brittle string matching, no surprise misroutes.

Automatically find missed savings

Once a day, offline, the router scans your traffic for clusters of repeated requests that are quietly overpaying: the heartbeat check firing every minute, the 2am cron report, the classification prompt sent ten thousand times a day. When it finds one, it proposes a cheaper model and shows you the projected monthly savings, sample prompts, and its reasoning. Approve it in one click, or let trusted cases apply on their own within strict guardrails. Either way, hunting down wasted spend stops being your job, and if anything regresses, it reverts on its own.

Built for agents

This is where the auto-router earns its keep. Agents generate huge volumes of LLM calls, and most are not hard. An autonomous agent like Hermes Agent running scheduled briefings, or a coding agent like OpenClaw grinding through a task loop, fires constant background traffic: tool-use steps, status checks, summarization passes, retries. Pointed at one frontier model, every call costs top dollar.

Point that agent at haimaker/auto and the cheap, repetitive calls drop to cheap, fast models while the genuinely hard reasoning steps stay on a frontier model. The more an agent runs, the more repetition it produces, and repetition is exactly what the traffic learning catches. If you run agents, this is the easiest cost lever you have: swap the model name, keep your loop, watch the bill drop. Swapping it can be literal: npx -y @haimaker/connect writes the native settings for Claude Code, Codex, OpenClaw, Hermes, Cline, Kilo Code, or opencode and pins them to haimaker/auto in one shot, so there’s nothing to hand-edit.

No latency, no lock-in, full visibility

Routing happens in-process and adds no meaningful latency. No AI classifier second-guesses your request, nothing in the decision is random, and the same prompt always routes to the same model. Every decision is written to your logs, and repeated prompts skip the work through an exact-match cache. You can also test before you trust it: a sandbox shows which model the router would pick and why, without making a real call.

How it works under the hood

For the engineers who want the mechanics, here is what happens on every request, in order.

Capability detection runs first. The router inspects the request and filters the model pool to models that can serve it: vision when there is an image, tool use when there are tools, structured output for JSON schemas, long context when the token estimate runs past a model’s window (with a 10% safety buffer so you never get a context-length error at the provider). If a rule matches but its target cannot handle the request, that rule is skipped.

Example-based rules match by meaning. A rule is a set of example prompts plus a target model. The router turns the examples into a single reference point and measures how close an incoming prompt is to each rule. The best match above its threshold wins. The default threshold is 0.80 and you tune it per rule: lower matches more loosely, higher fires only on near-identical prompts. Timestamps, UUIDs, and long IDs are stripped before matching, so templated traffic stays stable. This replaced the old keyword engine entirely, which removes the v1 false-positive class where words like “error” or “export” pulled unrelated prompts into a coding rule.

The fallback picks the cheapest capable model. When the default model gets filtered out, usually by a long-context request, the router selects the cheapest model in the pool that can still serve the request rather than the largest-context one. In our own routing, that single tiebreak was the difference between sending fallback traffic to a flagship and sending it to a budget model that did the job.

The tuner learns offline, never in the request path. Captured traffic is clustered, classified by an offline judge as trivial, standard, or complex, and mapped to a strictly-cheaper capable model. A proposal auto-applies only when every guardrail holds: the cluster is trivial and tight, the traffic currently goes to your default or fallback (never a rule you wrote), there are at least 50 requests across 3 or more days, the cheaper target passes every capability check the cluster needed, and auto-apply is toggled on. For 7 days after a rule applies, the router watches its error rate and auto-reverts if infrastructure errors double against baseline. Every change lands in a changelog with one-click revert.

The request path stays deterministic. Embeddings are computed in-process with a version-pinned model: no network, no per-request cost, the same text producing the same vector every time. The judge only ever runs in the daily offline tuner.

Your data stays yours. To learn from traffic, the router stores the last user message of each routed request, normalized, capped at 2KB, with IDs stripped, deduplicated into hit counts, and deleted after 30 days. It is used only for your router’s own tuning and never shared across routers. Turn off traffic capture in settings and routing keeps working, it just stops learning.

Already on v1? Your rules keep working. The migration is small and documented in the auto-router docs.

The bottom line

Routing got more accurate, the default fallback got cheaper, and the router now finds your wasted spend instead of leaving it to you. The request path is as fast and deterministic as ever. If you run agents or high-volume workloads, swapping your model name for haimaker/auto is the easiest dollar you will save this quarter.

READ THE AUTO-ROUTER DOCS