Current as of April 2026. GLM-4.6 is a budget-friendly powerhouse for Hermes Agent users who need massive context windows without GPT-4o's price tag. It pairs solid reasoning with a 205K context window, making it a strong contender for long-term autonomous memory.
## Specs

| Spec | Value |
| --- | --- |
| Provider | Zhipu AI |
| Input cost | $0.39 / M tokens |
| Output cost | $1.90 / M tokens |
| Context window | 205K tokens |
| Max output | 131K tokens |
| Parameters | N/A |
| Features | function_calling, reasoning |
## What it’s good at

### Massive Output Capacity

With a 131K max output token limit, it handles extremely long reasoning chains and multi-platform summaries that would choke smaller models.

### Cost-to-Context Efficiency

At $0.39 per million input tokens, you get a 205K context window, which is significantly cheaper than running large-scale memory tasks on Claude 3.5 Sonnet.
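To put the pricing in concrete terms, here is a quick back-of-the-envelope estimate. The per-token rates come from the specs above; the token counts are hypothetical workloads, not Hermes measurements:

```python
# GLM-4.6 listed rates: $0.39 / M input tokens, $1.90 / M output tokens.
INPUT_RATE = 0.39 / 1_000_000
OUTPUT_RATE = 1.90 / 1_000_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a single request."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A near-full 200K-token context with a 50K-token response:
print(round(request_cost(200_000, 50_000), 3))  # → 0.173
```

Even a request that nearly fills the context window comes in well under a quarter, which is what makes the model viable for always-on memory workloads.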
## Where it falls short

### Latency Outside Asia

Users outside the APAC region often experience higher response times, which can slow down real-time interactions on platforms like Telegram or Slack.

### Tool-Calling Reliability

While it supports function calling, it occasionally struggles with complex MCP tool sequences compared to more polished models like GPT-4o.
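Since tool-calling reliability is the weak spot, it can pay to validate the model's emitted tool calls before executing them and re-prompt on failure. A minimal sketch of that guard; the tool names, schemas, and helper are hypothetical illustrations, not Hermes internals:

```python
import json

# Hypothetical example: required argument names per tool.
TOOL_SCHEMAS = {
    "read_file": {"path"},
    "send_message": {"platform", "text"},
}

def validate_tool_call(name: str, raw_args: str) -> dict:
    """Parse and sanity-check a model-emitted tool call before running it.

    Raises ValueError on unknown tools, malformed JSON, or missing
    arguments, so the caller can ask the model to retry instead of
    executing a hallucinated call.
    """
    if name not in TOOL_SCHEMAS:
        raise ValueError(f"unknown tool: {name}")
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError as e:
        raise ValueError(f"malformed arguments: {e}")
    missing = TOOL_SCHEMAS[name] - args.keys()
    if missing:
        raise ValueError(f"missing arguments: {sorted(missing)}")
    return args
```

A rejected call costs one cheap retry round-trip, which is usually a better trade than silently executing a malformed MCP sequence.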
## Best use cases with Hermes Agent

- Long-term memory logging — The 205K context window allows Hermes to retain weeks of conversation history from Discord or Slack without losing the thread.
- High-volume messaging triage — Its low cost ($1.90 / M output tokens) makes it ideal for sorting and summarizing hundreds of messages across 15+ platforms.
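For high-volume triage, the practical pattern is to batch messages so each summarization call stays inside the context window. A rough sketch using a crude 4-characters-per-token estimate; the heuristic and budget numbers are illustrative assumptions, not Hermes internals:

```python
def estimate_tokens(text: str) -> int:
    """Crude heuristic: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)

def batch_messages(messages: list[str], budget: int = 200_000) -> list[list[str]]:
    """Greedily pack messages into batches under a token budget,
    leaving headroom below the 205K window for the prompt and reply."""
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for msg in messages:
        cost = estimate_tokens(msg)
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(msg)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Each batch then becomes one summarization request, so a backlog of hundreds of messages collapses into a handful of cheap calls.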
## Not ideal for

- Critical shell commands — Its reasoning can sometimes hallucinate paths or environment variables during complex local terminal operations.
- Ultra-low latency chat — The network overhead to Zhipu’s servers makes it feel sluggish for fast-paced back-and-forth messaging.
## Hermes Agent setup

Set the base URL to Zhipu’s API endpoint and increase your timeout settings to account for the model’s high-context processing time.
Hermes makes custom endpoints easy. Run:

```
hermes model
```

Choose Custom endpoint from the menu. Enter the base URL and model identifier when prompted:

- Base URL: `https://api.haimaker.ai/v1`
- Model: `z-ai/glm-4.6`
Hermes stores the selection and uses it for all subsequent agent runs across whatever platforms you have wired up (Telegram, Discord, Slack, etc.). Tune `HERMES_STREAM_READ_TIMEOUT` and related env vars if you’re hitting slow providers.
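If responses stall against Zhipu's slower endpoints, raising the stream read timeout before launching Hermes is the usual fix. A sketch for a POSIX shell; `HERMES_STREAM_READ_TIMEOUT` is mentioned above, but the value here is an illustrative guess, not a recommended default:

```shell
# Allow longer gaps between streamed chunks when the provider is slow.
# 300 seconds is an illustrative value; tune it for your network.
export HERMES_STREAM_READ_TIMEOUT=300
```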
## How it compares

- vs GPT-4o-mini — GPT-4o-mini is cheaper at $0.15 / M input tokens but lacks the massive 205K context and 131K output capacity of GLM-4.6.
- vs Claude 3 Haiku — Haiku is faster for tool-calling, but GLM-4.6 offers better reasoning depth for complex cross-platform automation tasks.
## Bottom line

GLM-4.6 is the best choice for Hermes users who prioritize huge memory and low cost over raw speed and Western server proximity.
For more, see our Hermes local-LLM setup guide.