What are the exact token costs?

Input tokens cost $0.39 per million and output tokens cost $2.34 per million.

How large is the context window for memory?

It supports up to 262,144 tokens, allowing for extremely deep persistent memory in Hermes Agent.

Can it handle images for automation?

Yes, it has native vision features for processing visual data from 15+ messaging platforms.

Qwen3.5 397B A17B for Hermes Agent: Pricing, Setup, and What It's Good At

Current as of April 2026. Qwen3.5 397B A17B is a high-reasoning powerhouse with a massive 262K context window, making it a serious contender for long-running Hermes Agent sessions. At $0.39 per million input tokens, it provides a cost-effective way to feed large amounts of persistent memory into your autonomous loops.

Specs


Provider	Qwen (Alibaba)
Input cost	$0.39 / M tokens
Output cost	$2.34 / M tokens
Context window	262K tokens
Max output	66K tokens
Parameters	N/A
Features	function_calling, vision, reasoning

What it’s good at

Robust Tool Execution

The model handles Hermes’s 47+ built-in tools with high precision, maintaining parameter accuracy even when chaining multiple MCP calls in a single turn.

Massive Context for Memory

The 262K context window allows Hermes to maintain a massive cross-session memory buffer, ensuring the agent doesn’t lose its persona or task history during week-long runs.

Vision-Enabled Reasoning

Native vision support allows the agent to interpret screenshots from desktop environments or messaging platforms when text-based scraping is insufficient.

Where it falls short

Response Latency

Due to its scale, the time-to-first-token is higher than smaller models, which can make real-time platforms like WhatsApp feel sluggish.

Proprietary Constraints

Unlike its open-weight siblings, this variant is proprietary, which might be a dealbreaker for users requiring full local control over their agent’s weights.

Best use cases with Hermes Agent

Multi-Platform Orchestration — It excels at tracking state across Discord, Slack, and SSH simultaneously without losing the thread of the autonomous objective.
Complex MCP Tool Chains — The 66K output limit ensures the model can generate long, complex sequences of tool calls and reasoning logs without being truncated.

Not ideal for

Low-Latency Notification Bots — The overhead of a 397B model is overkill for simple ‘if-this-then-that’ messaging tasks where speed is the priority.
Strictly Local Deployment — This specific version is hosted and proprietary, making it unsuitable for air-gapped or purely local Hermes setups.

Hermes Agent setup

Configure your provider endpoint to use the qwen/qwen3.5-397b-a17b ID and ensure your timeout settings are increased to accommodate the model’s high reasoning overhead during deep tool-use cycles.

Hermes makes custom endpoints easy. Run:

hermes model

Choose Custom endpoint from the menu. Enter the base URL and model identifier when prompted:

Base URL: https://api.haimaker.ai/v1
Model: qwen/qwen3.5-397b-a17b

Hermes stores the selection and uses it for all subsequent agent runs across whatever platforms you have wired up (Telegram, Discord, Slack, etc.). Tune HERMES_STREAM_READ_TIMEOUT and related env vars if you’re hitting slow providers.

How it compares

vs Llama 3.1 405B — Qwen is significantly cheaper at $0.39/$2.34 compared to Llama’s typical $5.00+ pricing on many providers, while offering comparable tool-use reliability.
vs Claude 3.5 Sonnet — Sonnet is faster for messaging, but Qwen’s 66K output limit is vastly superior for generating long autonomous execution logs that would hit Sonnet’s 8K cap.

Bottom line

A top-tier choice for complex, long-running autonomous agents that need to juggle multiple platforms and massive memory buffers without the premium price tag of western frontier models.

TRY QWEN3.5 397B A17B IN HERMES

For more, see our Hermes local-LLM setup guide.