What is the exact cost of running this model?

Input tokens cost $0.04 per million and output tokens cost $0.19 per million, making it one of the most affordable models for high-volume agents.

How large is the context window for memory?

The model supports up to 131K tokens, which is ideal for Hermes' persistent cross-session memory and long conversation logs.

Does it support Hermes' tool-use features?

Yes, it fully supports function calling and the MCP protocol, which are essential for Hermes to interact with external platforms and shell environments.

gpt-oss-120b for Hermes Agent: Pricing, Setup, and What It's Good At

Current as of April 2026. gpt-oss-120b is a highly efficient model for Hermes Agent users who need massive context and reliable tool-calling without the flagship price tag. At $0.04 per million input tokens, it is built for long-running autonomous loops that monitor platforms like Slack and Discord 24/7.

Specs


Provider	OpenAI
Input cost	$0.04 / M tokens
Output cost	$0.19 / M tokens
Context window	131K tokens
Max output	N/A tokens
Parameters	N/A
Features	function_calling, reasoning

What it’s good at

Tool-Use Reliability

It executes Hermes’ 47 built-in tools and MCP protocols with high precision, rarely failing on the JSON syntax required for complex shell commands.

Massive Context for Memory

The 131K context window allows Hermes to maintain a persistent identity and remember user preferences across weeks of multi-platform interactions.

Where it falls short

Reasoning Latency

The internal reasoning steps can cause a 2-3 second delay, which is noticeable when users expect instant replies on Telegram or WhatsApp.

Proprietary Constraints

Unlike Llama-based models, you cannot fine-tune this for specific persona quirks, leaving your agent’s personality feeling somewhat generic.

Best use cases with Hermes Agent

Cross-Platform Automation — It excels at monitoring a Slack channel and autonomously triggering deployments on Modal or Docker based on the conversation history.
Long-Term Autonomous Monitoring — The low $0.19 output cost makes it sustainable to keep an agent running indefinitely to manage persistent cross-session memory.

Not ideal for

Local-Only Privacy — Since this is an OpenAI-hosted model, it is not suitable for users running Hermes on isolated Mac local or private Docker setups.
High-Speed Chatbots — The reasoning overhead makes it poorly suited for rapid-fire messaging where sub-second response times are the priority.

Hermes Agent setup

Point your provider to OpenAI and use the model ID openai/gpt-oss-120b. Ensure your API quota is sufficient for high-frequency tool polling if you are running autonomous loops.

Hermes makes custom endpoints easy. Run:

hermes model

Choose Custom endpoint from the menu. Enter the base URL and model identifier when prompted:

Base URL: https://api.haimaker.ai/v1
Model: openai/gpt-oss-120b

Hermes stores the selection and uses it for all subsequent agent runs across whatever platforms you have wired up (Telegram, Discord, Slack, etc.). Tune HERMES_STREAM_READ_TIMEOUT and related env vars if you’re hitting slow providers.

How it compares

vs gpt-4o-mini — gpt-oss-120b provides significantly better reasoning for complex MCP tool chains despite being in a similar low-cost tier.
vs Llama 3.1 70B — While Llama is better for local privacy, gpt-oss-120b offers a larger 131K context window and more stable function calling for autonomous tasks.

Bottom line

This is the best value-to-performance model for Hermes Agent users who prioritize reliable tool execution and long-term memory over local hosting.

TRY GPT-OSS-120B IN HERMES

For more, see our Hermes local-LLM setup guide.