What is the exact pricing for Grok 2 Vision?

It costs $2.00 per million input tokens and $10.00 per million output tokens.

How large is the context window for this model?

The model supports a 33,000 token context window and can output up to 33,000 tokens.

Can it handle MCP tools in Hermes?

Yes, it supports function calling which allows it to interface with any MCP servers or built-in Hermes tools.

Grok 2 Vision for Hermes Agent: Pricing, Setup, and What It's Good At

Current as of April 2026. Grok 2 Vision brings visual intelligence to the xAI lineup for Hermes Agent, offering a middle-ground price point of $2 per million input tokens. It is designed for agents that need to interpret screenshots or images across messaging platforms like Telegram and Discord while maintaining tool-use reliability.

Specs


Provider	xAI
Input cost	$2.00 / M tokens
Output cost	$10 / M tokens
Context window	33K tokens
Max output	33K tokens
Parameters	N/A
Features	function_calling, vision, web_search

What it’s good at

Reliable Vision-to-Tool Pipeline

The model excels at taking a visual input, such as a dashboard screenshot, and accurately mapping that data to specific Hermes tool arguments.

Fast Inference for Autonomous Loops

Latency is consistently low, which prevents the Hermes agent from timing out during complex multi-step autonomous runs involving visual analysis.

Where it falls short

Restrictive Context Window

The 33K token limit is narrow for Hermes users who need to maintain deep cross-session memory or large MCP tool definitions.

Proprietary Constraints

Unlike Llama-based models, you cannot fine-tune or deeply steer the model’s persona to fit specific identity requirements in Hermes.

Best use cases with Hermes Agent

Visual Platform Monitoring — Hermes can monitor a video feed or UI via screenshots and use shell tools to react when specific visual triggers occur.
Image-Based Data Entry — Users can drop photos of documents into Slack and have Hermes automatically parse them into structured tool calls for external databases.

Not ideal for

Long-Term Memory Sessions — A 33K context window will truncate your persistent memory and learning loops much faster than models with 128K+ windows.
Budget Text-Only Automation — If your agent doesn’t need to see, you are paying a premium over models like GPT-4o-mini that handle text-only reasoning for a fraction of the cost.

Hermes Agent setup

Set the provider to xAI and ensure your API key has vision permissions enabled. You must use the xai/grok-2-vision identifier to ensure Hermes correctly formats image buffers in the request payload.

Hermes makes custom endpoints easy. Run:

hermes model

Choose Custom endpoint from the menu. Enter the base URL and model identifier when prompted:

Base URL: https://api.x.ai/v1
Model: xai/grok-2-vision

Hermes stores the selection and uses it for all subsequent agent runs across whatever platforms you have wired up (Telegram, Discord, Slack, etc.). Tune HERMES_STREAM_READ_TIMEOUT and related env vars if you’re hitting slow providers.

How it compares

vs GPT-4o-mini — At $0.15/$0.60 per 1M tokens, GPT-4o-mini is vastly cheaper for simple vision tasks, though Grok 2 Vision feels more robust for autonomous tool-use.
vs Claude 3.5 Sonnet — Sonnet is more expensive at $3/$15 but offers a 200K context window and superior reasoning for complex multi-platform coordination.

Bottom line

Grok 2 Vision is a solid choice for Hermes agents that require visual perception on a budget, provided you can work within the 33K token context limit.

TRY GROK 2 VISION IN HERMES

For more, see our Hermes local-LLM setup guide.