What is the exact pricing for GLM-4.7 Flash?

It costs $0.06 per 1 million input tokens and $0.4 per 1 million output tokens.

How large is the context window?

The model supports up to 203,000 tokens for input and 32,000 tokens for output.

Does it support vision for multi-platform image handling?

Yes, it has native vision capabilities, allowing Hermes to process screenshots sent via Telegram or Discord.

GLM-4.7 Flash for Hermes Agent: Pricing, Setup, and What It's Good At

Current as of April 2026. GLM-4.7 Flash from Zhipu AI is a budget-focused workhorse for Hermes Agent, offering a massive 203K context window at a fraction of the cost of Western competitors. It is designed for developers who need to process high volumes of messaging traffic across Discord and Slack without breaking the bank.

Specs


Provider	Zhipu AI
Input cost	$0.06 / M tokens
Output cost	$0.40 / M tokens
Context window	203K tokens
Max output	32K tokens
Parameters	N/A
Features	function_calling, vision, reasoning

What it’s good at

Extreme Cost Efficiency

At $0.06 per million input tokens and $0.4 per million output tokens, it is significantly cheaper than GPT-4o-mini for high-frequency tool use.

Massive Context Window

The 203K token limit allows Hermes to maintain deep cross-session memory and ingest large message histories from multiple platforms simultaneously.

Reliable Function Calling

It handles Hermes’ 47 built-in tools with surprising stability, maintaining valid JSON structures during autonomous multi-step tasks.

Where it falls short

Inconsistent Latency

Users outside of mainland China may experience variable response times when connecting to Zhipu’s API endpoints, which can lag autonomous loops.

Reasoning Depth

While good for routing, it can struggle with complex logic when chain-loading multiple MCP tools in a single turn.

Best use cases with Hermes Agent

High-Volume Multi-Platform Monitoring — The low cost and 203K context make it ideal for watching dozens of Telegram and Discord channels to trigger specific shell commands.
Persistent Memory Management — It can ingest weeks of interaction history within its context window to maintain a consistent identity across different messaging platforms.

Not ideal for

High-Stakes Shell Operations — Its reasoning can occasionally hallucinate parameter values for complex CLI tools compared to larger, more expensive models.
Low-Latency Real-Time Chat — The geographic distance to Zhipu servers often results in a 2-3 second delay that disrupts the flow of real-time Slack conversations.

Hermes Agent setup

Ensure you use the correct Zhipu AI base URL in your Hermes config and set the max_tokens to accommodate the 32K output limit if performing long-form data summarization.

Hermes makes custom endpoints easy. Run:

hermes model

Choose Custom endpoint from the menu. Enter the base URL and model identifier when prompted:

Base URL: https://api.haimaker.ai/v1
Model: z-ai/glm-4.7-flash

Hermes stores the selection and uses it for all subsequent agent runs across whatever platforms you have wired up (Telegram, Discord, Slack, etc.). Tune HERMES_STREAM_READ_TIMEOUT and related env vars if you’re hitting slow providers.

How it compares

vs GPT-4o-mini — GLM-4.7 Flash is cheaper on input ($0.06 vs $0.15) and offers a larger context window (203K vs 128K) for better long-term memory.
vs Gemini 1.5 Flash — Gemini has a larger 1M context, but GLM-4.7 Flash often feels more decisive when executing Hermes’ built-in shell and filesystem tools.

Bottom line

If you are running a high-traffic Hermes Agent on a budget and need deep memory, GLM-4.7 Flash is the most economical way to get 200K+ context and reliable tool use.

TRY GLM-4.7 FLASH IN HERMES

For more, see our Hermes local-LLM setup guide.