Hermes Agent from Nous Research is a self-improving CLI agent: persistent memory, automated skill creation, 47+ built-in tools, and gateways into 15+ messaging platforms. None of that matters if the model behind it fumbles tool arguments or loses the thread halfway through a workflow.

Hermes is OpenAI-compatible, so it runs on basically any provider with a /v1/chat/completions endpoint. That’s a lot of choice. Here’s how to narrow it down.

The quick answer

| Model | Input / Output (per 1M) | Context | Best for |
| --- | --- | --- | --- |
| Claude Sonnet 4.6 | $3 / $15 | 1M | The reliable default — autonomous loops, tool chains |
| Claude Opus 4.6 | ~$5 / $25 | 200K | Zero-failure work: SSH, Docker, production edits |
| GPT-5.4 Codex | premium tier | 400K | Heavy multi-file coding inside Hermes |
| Gemini 3.1 Pro | ~$1.25 / $10 | 1M+ | Long-context research, codebase Q&A |
| DeepSeek V3.2 | ~$0.27 / M | 128K | Low-cost coding and reasoning fallback |
| MiniMax M2.5 | ~$0.12 / $1 | 200K+ | Budget instances, high-volume routing |
| GLM-4.7 / GLM-5 | sub-dollar | 128K+ | Cheap general-purpose agent work |
| Kimi K2.5 | cheap | 256K | Long chats, agentic workflows on a budget |
| Gemma 4 8B (Ollama) | $0 (local) | 128K | Private, offline, no API bill |

If you don’t have a reason to pick something else, start with Claude Sonnet 4.6. It has the best ratio of tool-calling reliability to cost, and the 1M context window means Hermes’ loops rarely have to drop state.

What actually matters for a Hermes model

Benchmarks don’t tell you much here. For Hermes specifically, watch four things:

  • Tool-schema adherence — Hermes hands the model 47+ tools with strict argument shapes. A model that hallucinates a parameter name breaks the loop. Claude and GPT-5-class models are the most disciplined; smaller open models drift.
  • Long-loop stability — agentic runs can be 20+ steps. Cheaper models tend to “loop” — repeating a failed action instead of recovering. Reasoning-capable models avoid this.
  • Context headroom — tool outputs, file contents, and prior steps all stay in the prompt. Aim for 64K+ usable context; 1M is comfortable.
  • Cost per run — Hermes runs are token-heavy. A model that’s 50x cheaper per token is 50x cheaper per overnight automation. That math is why budget models exist in this list.
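The cost-per-run point is just arithmetic. A minimal sketch, using the prices from the table above and invented round-number token counts for a hypothetical overnight job:

```python
# Illustrative cost math for token-heavy agent runs.
# Prices come from the comparison table; the token counts per run and the
# number of runs are made-up round numbers, not Hermes measurements.
PRICES = {  # USD per 1M tokens: (input, output)
    "claude-sonnet-4.6": (3.00, 15.00),
    "minimax-m2.5": (0.12, 1.00),
}

def run_cost(model, input_tokens, output_tokens):
    pin, pout = PRICES[model]
    return (input_tokens * pin + output_tokens * pout) / 1_000_000

# A hypothetical overnight automation: 200 runs, ~40K in / ~4K out each.
for model in PRICES:
    nightly = 200 * run_cost(model, 40_000, 4_000)
    print(f"{model}: ${nightly:.2f} per night")
```

Same workload, roughly a 20x difference in the nightly bill; that gap is the entire case for the budget tier.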

Best overall — Claude Sonnet 4.6

At $3/$15 per million tokens with a 1M-token window, Sonnet 4.6 is the model most Hermes deployments should run by default. Tool calls land correctly, it recovers gracefully when a command fails, and it holds context across the kind of 30-message workflow Hermes is built for. If you only configure one model, configure this one.

If you're committed to Claude but don't need the 1M window, Claude 3.7 Sonnet (also $3/$15) is the older sibling and still excellent for autonomous loops — pick it over Sonnet 4.6 when the smaller context is enough.

Best for coding — GPT-5.4 Codex or Claude Opus 4.6

When Hermes is doing real engineering work — multi-file refactors, debugging, writing code that has to run — step up to a coding-tuned flagship. GPT-5.4 Codex is tuned for exactly this and handles large diffs well. Claude Opus 4.6 (~$5/$25) is the choice when a single mistake is expensive: it’s the model to put behind Hermes when the agent has SSH access or is touching production.

Both are pricey. Don’t run them as your default — route to them only for tasks that need the horsepower, and keep a cheaper model for everything else.

Best for long context and research — Gemini 3.1 Pro

Gemini 3.1 Pro’s 1M+ context window means you can drop an entire repository into a Hermes session and ask it to find the bug. For document-heavy work, codebase Q&A, or summarizing long logs, nothing else competes on raw context length, and at ~$1.25/$10 it’s cheaper than the Claude or GPT flagships.

Best budget — MiniMax M2.5, DeepSeek V3.2, GLM

This is where the real savings live. MiniMax M2.5 at roughly $0.12/$1 per million tokens is the cheapest model that still behaves in Hermes’ multi-tool loops — fine for message classification, routing, simple edits, and most day-to-day automation. DeepSeek V3.2 (~$0.27/M) is the low-cost coding and reasoning fallback. GLM-4.7 / GLM-5 sit in the same sub-dollar tier for general-purpose agent work, and Kimi K2.5 is worth a look for long-running chats thanks to its large window.

The standard pattern: run a budget model as your Hermes default, and override to Sonnet or a Codex model only when a task earns it. It's common to see 60–90% of the bill disappear from that one change.
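That routing pattern fits in a few lines. Everything here is hypothetical — the task tags, the escalation map, and the idea that you'd wire this into your own dispatch layer rather than into Hermes itself — but it shows the shape of "cheap default, expensive override":

```python
# Hypothetical routing sketch: cheap default, escalate only when a task earns it.
# Model IDs follow the haimaker-style naming used later in this guide;
# the task tags are invented for illustration.
DEFAULT_MODEL = "minimax/minimax-m2-5"

ESCALATIONS = {
    "multi-file-refactor": "openai/gpt-5-4-codex",
    "production-edit": "anthropic/claude-opus-4-6",
    "long-context-research": "google/gemini-3-1-pro",
}

def pick_model(task_tag: str) -> str:
    """Fall through to the budget default unless the task is on the escalation list."""
    return ESCALATIONS.get(task_tag, DEFAULT_MODEL)

print(pick_model("chat-triage"))       # budget default
print(pick_model("production-edit"))   # escalates to Opus
```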

Best local and self-hosted models for Hermes

If you want zero API cost or you’re handling data that can’t leave your machine, run a local model through Ollama. Hermes treats it like any other OpenAI-compatible endpoint.

  • Gemma 4 8B — runs on any Mac with 16GB unified memory. Good for classification, message routing, boilerplate, and simple edits.
  • Qwen3.5 27B — needs ~32GB but is meaningfully stronger on code and reasoning; the best local pick if you have the RAM.
  • Llama 3.3 70B — strongest open model here, but you’ll want a serious GPU (or a lot of patience) to run it locally.

Point Hermes at Ollama:

ollama pull gemma4

Then run hermes model, pick Custom endpoint, and enter:

  • Base URL: http://localhost:11434/v1
  • Model: gemma4:latest
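Because Hermes speaks plain OpenAI-style HTTP, you can see exactly what it would send to Ollama. A self-contained sketch that builds, but deliberately does not send, the request:

```python
# Minimal sketch of the OpenAI-style request that lands on Ollama's
# /v1/chat/completions endpoint; nothing here is Hermes-specific.
import json
from urllib.request import Request

BASE_URL = "http://localhost:11434/v1"

payload = {
    "model": "gemma4:latest",
    "messages": [{"role": "user", "content": "ping"}],
}
req = Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urlopen(req) would hit the local Ollama server; we only build the request here.
print(req.full_url)
```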

Local models won’t match a frontier flagship on hard multi-step work — keep a cloud model configured as a fallback for the tasks that need it.

Hermes-compatible models and context requirements

Hermes works with any provider exposing /v1/chat/completions — Anthropic, OpenAI, Google, xAI, DeepSeek, MiniMax, GLM (Z.ai), Moonshot (Kimi), OpenRouter, Together, a private vLLM box, Ollama, or haimaker.ai for all of them through one key. The practical requirement isn’t a brand, it’s capability: a model that follows tool schemas, recovers from errors, and carries at least ~64K of usable context. Anything below ~16K context will spend most of its window on Hermes’ own scaffolding and struggle to do useful work.
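To make the scaffolding point concrete: assuming, as an invented figure, that the system prompt plus 47 tool schemas cost on the order of 12K tokens, usable context shrinks fast at small windows:

```python
# SCAFFOLDING_TOKENS is an assumption for illustration,
# not a measured Hermes figure.
SCAFFOLDING_TOKENS = 12_000

def usable_context(window: int) -> int:
    """Tokens left for tool outputs, files, and history after fixed overhead."""
    return max(window - SCAFFOLDING_TOKENS, 0)

for window in (16_000, 64_000, 1_000_000):
    print(f"{window:>9} window -> {usable_context(window):>9} usable")
```

At 16K, three quarters of the window is gone before the first tool output arrives, which is why the ~64K floor matters.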

How to switch models in Hermes Agent

Hermes makes model selection a one-liner:

hermes model

Pick Custom endpoint, then enter the base URL and model identifier when prompted. Hermes stores the choice and uses it for every subsequent run. If you’re pointing at a slower provider, set HERMES_STREAM_READ_TIMEOUT (and related timeout env vars) so long agentic steps don’t get cut off.
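The timeout is an ordinary environment variable. The value below is an arbitrary example, not a recommended default, and any related `HERMES_*` timeout variables your setup needs would be set the same way:

```shell
# 600 seconds is an illustrative choice — tune it to your provider's latency.
export HERMES_STREAM_READ_TIMEOUT=600
echo "$HERMES_STREAM_READ_TIMEOUT"
```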

Set up haimaker.ai with Hermes Agent

The simplest way to use every model above without juggling a separate account and API key per provider is to point Hermes at haimaker.ai once. One key, one base URL, and you can switch between Sonnet, GPT-5.4 Codex, Gemini 3.1 Pro, MiniMax, DeepSeek, GLM, and Kimi by changing a single string.

  1. Create an account and grab an API key at app.haimaker.ai.

  2. In your terminal, run:

    hermes model
    
  3. Choose Custom endpoint.

  4. Enter the connection details:

    • Base URL: https://api.haimaker.ai/v1
    • API key: your haimaker.ai key
    • Model: the model you want, e.g. anthropic/claude-sonnet-4-6, openai/gpt-5-4-codex, google/gemini-3-1-pro, minimax/minimax-m2-5, deepseek/deepseek-v3-2, zai/glm-4-7, or moonshot/kimi-k2-5
  5. Run hermes — the agent now routes through haimaker.ai. To switch models later, run hermes model again and change the model string; the key and base URL stay the same.
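The one-key pattern means a model swap is literally a one-string diff. A sketch with a placeholder key, building request dicts without sending anything:

```python
# Sketch of the one-key pattern: base URL and key never change,
# only the model string does. The key below is a placeholder.
BASE_URL = "https://api.haimaker.ai/v1"
API_KEY = "YOUR_HAIMAKER_KEY"  # placeholder, not a real credential

def chat_request(model: str, prompt: str) -> dict:
    """Build (but don't send) an OpenAI-style chat completion request."""
    return {
        "url": f"{BASE_URL}/chat/completions",
        "headers": {"Authorization": f"Bearer {API_KEY}"},
        "body": {"model": model, "messages": [{"role": "user", "content": prompt}]},
    }

cheap = chat_request("minimax/minimax-m2-5", "triage this inbox")
strong = chat_request("anthropic/claude-sonnet-4-6", "refactor this module")
print(cheap["body"]["model"], "->", strong["body"]["model"])
```

Everything but `body["model"]` is identical between the two requests, which is the whole point of routing through one gateway.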

Want to see pricing and benchmarks side by side before you pick? Compare every model in one place at haimaker.ai.

GET $10 FREE CREDITS ON HAIMAKER


Related: Hermes Agent Pricing: what it costs to run · How to add a custom provider to Hermes Agent · Hermes Agent vs Codex CLI