Hermes Agent from Nous Research is a self-improving CLI agent: persistent memory, automated skill creation, 47+ built-in tools, and gateways into 15+ messaging platforms. None of that matters if the model behind it fumbles tool arguments or loses the thread halfway through a workflow.
Hermes is OpenAI-compatible, so it runs on basically any provider with a /v1/chat/completions endpoint. That’s a lot of choice. Here’s how to narrow it down.
The quick answer
| Model | Input / Output (per 1M) | Context | Best for |
|---|---|---|---|
| Claude Sonnet 4.6 | $3 / $15 | 1M | The reliable default — autonomous loops, tool chains |
| Claude Opus 4.6 | ~$5 / $25 | 200K | Zero-failure work: SSH, Docker, production edits |
| GPT-5.4 Codex | premium tier | 400K | Heavy multi-file coding inside Hermes |
| Gemini 3.1 Pro | ~$1.25 / $10 | 1M+ | Long-context research, codebase Q&A |
| DeepSeek V3.2 | ~$0.27 / M | 128K | Low-cost coding and reasoning fallback |
| MiniMax M2.5 | ~$0.12 / $1 | 200K+ | Budget instances, high-volume routing |
| GLM-4.7 / GLM-5 | sub-dollar | 128K+ | Cheap general-purpose agent work |
| Kimi K2.5 | cheap | 256K | Long chats, agentic workflows on a budget |
| Gemma 4 8B (Ollama) | $0 (local) | 128K | Private, offline, no API bill |
If you don’t have a reason to pick something else, start with Claude Sonnet 4.6. It has the best ratio of tool-calling reliability to cost, and the 1M context window means Hermes’ loops rarely have to drop state.
What actually matters for a Hermes model
Benchmarks don’t tell you much here. For Hermes specifically, watch four things:
- Tool-schema adherence — Hermes hands the model 47+ tools with strict argument shapes. A model that hallucinates a parameter name breaks the loop. Claude and GPT-5-class models are the most disciplined; smaller open models drift.
- Long-loop stability — agentic runs can be 20+ steps. Cheaper models tend to “loop” — repeating a failed action instead of recovering. Reasoning-capable models avoid this.
- Context headroom — tool outputs, file contents, and prior steps all stay in the prompt. Aim for 64K+ usable context; 1M is comfortable.
- Cost per run — Hermes runs are token-heavy. A model that’s 50x cheaper per token is 50x cheaper per overnight automation. That math is why budget models exist in this list.
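That cost math is easy to sanity-check yourself. Here's a minimal sketch using the per-1M-token prices from the table above; the token counts for the example run are illustrative assumptions, not measured Hermes numbers:

```python
# Back-of-envelope cost of a single agentic run.
# Prices are (input $/1M, output $/1M) from the comparison table above.
PRICES = {
    "claude-sonnet-4-6": (3.00, 15.00),
    "minimax-m2-5": (0.12, 1.00),
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one run at the given token counts."""
    inp, out = PRICES[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

# Assumed token-heavy overnight run: 2M input (tool outputs get re-fed
# into the prompt every step), 200K output.
sonnet = run_cost("claude-sonnet-4-6", 2_000_000, 200_000)
minimax = run_cost("minimax-m2-5", 2_000_000, 200_000)
print(f"Sonnet: ${sonnet:.2f}  MiniMax: ${minimax:.2f}  ratio: {sonnet / minimax:.0f}x")
```

Under those assumptions the same run costs about $9.00 on Sonnet and about $0.44 on MiniMax, roughly a 20x gap, which is the whole argument for routing cheap by default.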
Best overall — Claude Sonnet 4.6
At $3/$15 per million tokens with a 1M-token window, Sonnet 4.6 is the model most Hermes deployments should run by default. Tool calls land correctly, it recovers gracefully when a command fails, and it holds context across the kind of 30-message workflow Hermes is built for. If you only configure one model, configure this one.
If your provider doesn't offer Sonnet 4.6, Claude 3.7 Sonnet (also $3/$15) is the older sibling and still excellent for autonomous loops. Since it costs the same, prefer Sonnet 4.6 and its 1M window when you can; pick 3.7 only if you don't need that headroom.
Best for coding — GPT-5.4 Codex or Claude Opus 4.6
When Hermes is doing real engineering work — multi-file refactors, debugging, writing code that has to run — step up to a coding-tuned flagship. GPT-5.4 Codex is tuned for exactly this and handles large diffs well. Claude Opus 4.6 (~$5/$25) is the choice when a single mistake is expensive: it’s the model to put behind Hermes when the agent has SSH access or is touching production.
Both are pricey. Don’t run them as your default — route to them only for tasks that need the horsepower, and keep a cheaper model for everything else.
Best for long context and research — Gemini 3.1 Pro
Gemini 3.1 Pro’s 1M+ context window means you can drop an entire repository into a Hermes session and ask it to find the bug. For document-heavy work, codebase Q&A, or summarizing long logs, nothing else competes on raw context length, and at ~$1.25/$10 it’s cheaper than the Claude or GPT flagships.
Best budget — MiniMax M2.5, DeepSeek V3.2, GLM
This is where the real savings live. MiniMax M2.5 at roughly $0.12/$1 per million tokens is the cheapest model that still behaves in Hermes’ multi-tool loops — fine for message classification, routing, simple edits, and most day-to-day automation. DeepSeek V3.2 (~$0.27/M) is the low-cost coding and reasoning fallback. GLM-4.7 / GLM-5 sit in the same sub-dollar tier for general-purpose agent work, and Kimi K2.5 is worth a look for long-running chats thanks to its large window.
The standard pattern: run a budget model as your Hermes default, and override to Sonnet or a Codex model only when a task earns it. Most people see 60–90% of their bill disappear from that one change.
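The routing logic behind that pattern is trivial to express. A minimal sketch, where the task labels and the choice of which model handles which tier are illustrative assumptions rather than Hermes configuration:

```python
# "Budget default, escalate when earned" routing sketch.
# Task labels and tier assignments are assumptions for illustration.
BUDGET_DEFAULT = "minimax/minimax-m2-5"

ESCALATIONS = {
    "multi_file_refactor": "openai/gpt-5-4-codex",      # heavy coding
    "production_change":   "anthropic/claude-opus-4-6",  # zero-failure work
    "repo_qa":             "google/gemini-3-1-pro",      # long-context research
}

def pick_model(task_kind: str) -> str:
    """Route to a flagship only for tasks that earn it; stay cheap otherwise."""
    return ESCALATIONS.get(task_kind, BUDGET_DEFAULT)

print(pick_model("classify_message"))   # falls through to the budget default
print(pick_model("production_change"))  # escalates to Opus
```

Everything not explicitly escalated falls through to the budget model, which is exactly why the bill drops: classification, routing, and simple edits dominate most workloads by volume.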
Best local and self-hosted models for Hermes
If you want zero API cost or you’re handling data that can’t leave your machine, run a local model through Ollama. Hermes treats it like any other OpenAI-compatible endpoint.
- Gemma 4 8B — runs on any Mac with 16GB unified memory. Good for classification, message routing, boilerplate, and simple edits.
- Qwen3.5 27B — needs ~32GB but is meaningfully stronger on code and reasoning; the best local pick if you have the RAM.
- Llama 3.3 70B — strongest open model here, but you’ll want a serious GPU (or a lot of patience) to run it locally.
Point Hermes at Ollama:
```shell
ollama pull gemma4
```

Then run `hermes model`, pick Custom endpoint, and enter:

- Base URL: `http://localhost:11434/v1`
- Model: `gemma4:latest`
Local models won’t match a frontier flagship on hard multi-step work — keep a cloud model configured as a fallback for the tasks that need it.
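Before handing the endpoint to Hermes, it helps to know what "OpenAI-compatible" means on the wire. This sketch builds (but does not send) the chat-completions request Hermes-style clients POST to the endpoint; the model name and prompt are placeholders, and you'd send the JSON body with any HTTP client to smoke-test your Ollama instance:

```python
import json

def chat_request(base_url: str, model: str, prompt: str):
    """Build the URL and JSON body for an OpenAI-compatible chat call."""
    url = f"{base_url}/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # set True for token streaming
    }
    return url, json.dumps(payload)

url, body = chat_request("http://localhost:11434/v1", "gemma4:latest", "Say hello")
print(url)
```

Any endpoint that accepts this shape and returns a standard `choices` array will work as a Hermes backend, which is why the local and cloud setups are interchangeable.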
Hermes-compatible models and context requirements
Hermes works with any provider exposing /v1/chat/completions — Anthropic, OpenAI, Google, xAI, DeepSeek, MiniMax, GLM (Z.ai), Moonshot (Kimi), OpenRouter, Together, a private vLLM box, Ollama, or haimaker.ai for all of them through one key. The practical requirement isn’t a brand, it’s capability: a model that follows tool schemas, recovers from errors, and carries at least ~64K of usable context. Anything below ~16K context will spend most of its window on Hermes’ own scaffolding and struggle to do useful work.
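To see why small windows choke, here's a rough headroom estimate. The scaffolding and per-step token figures are assumptions for illustration, not measured Hermes numbers:

```python
# Rough agent-loop headroom: how many steps fit in a context window
# after the agent's own scaffolding. Both constants are assumed values.
SCAFFOLDING_TOKENS = 12_000  # system prompt + tool schemas (assumption)
AVG_STEP_TOKENS = 3_000      # tool output + model reply per step (assumption)

def usable_steps(context_window: int) -> int:
    """Loop steps that fit before the window is exhausted."""
    return max(0, (context_window - SCAFFOLDING_TOKENS) // AVG_STEP_TOKENS)

for window in (16_000, 64_000, 200_000):
    print(f"{window:>7}-token window -> ~{usable_steps(window)} steps")
```

Under those assumptions a 16K window leaves room for barely one step after scaffolding, while 64K supports a realistic multi-step run, which matches the ~64K recommendation above.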
How to switch models in Hermes Agent
Hermes makes model selection a one-liner:

```shell
hermes model
```

Pick Custom endpoint, then enter the base URL and model identifier when prompted. Hermes stores the choice and uses it for every subsequent run. If you're pointing at a slower provider, set `HERMES_STREAM_READ_TIMEOUT` (and related timeout env vars) so long agentic steps don't get cut off.
Set up haimaker.ai with Hermes Agent
The simplest way to use every model above without juggling a separate account and API key per provider is to point Hermes at haimaker.ai once. One key, one base URL, and you can switch between Sonnet, GPT-5.4 Codex, Gemini 3.1 Pro, MiniMax, DeepSeek, GLM, and Kimi by changing a single string.
1. Create an account and grab an API key at app.haimaker.ai.
2. In your terminal, run:

   ```shell
   hermes model
   ```

3. Choose Custom endpoint.
4. Enter the connection details:
   - Base URL: `https://api.haimaker.ai/v1`
   - API key: your haimaker.ai key
   - Model: the model you want, e.g. `anthropic/claude-sonnet-4-6`, `openai/gpt-5-4-codex`, `google/gemini-3-1-pro`, `minimax/minimax-m2-5`, `deepseek/deepseek-v3-2`, `zai/glm-4-7`, or `moonshot/kimi-k2-5`
5. Run `hermes` — the agent now routes through haimaker.ai. To switch models later, run `hermes model` again and change the model string; the key and base URL stay the same.
Want to see pricing and benchmarks side by side before you pick? Compare every model in one place at haimaker.ai.
Related: Hermes Agent Pricing: what it costs to run · How to add a custom provider to Hermes Agent · Hermes Agent vs Codex CLI