Current as of April 2026. GPT-5 Image Mini is OpenAI’s specialized vision-first model designed for agents that need to see and reason simultaneously. Within Hermes, it excels at interpreting UI screenshots from Slack or Discord and mapping those visuals directly to tool calls without intermediate text steps.

Specs

ProviderOpenAI
Input cost$2.50 / M tokens
Output cost$2.00 / M tokens
Context window400K tokens
Max output128K tokens
ParametersN/A
Featuresfunction_calling, vision, reasoning, web_search

What it’s good at

Visual Tool-Use Precision

It maps visual elements to Hermes’ 47 tools with higher precision than standard models, making UI-based automation across platforms like Telegram and Slack highly reliable.

Deep Context Retention

The 400K context window allows Hermes to maintain long-running cross-platform threads, ensuring the agent’s persistent identity and memory remain intact during month-long sessions.

Dense Output Capacity

A 128K max output limit allows the agent to generate complex, multi-step execution plans or long-form reports across different messaging channels in a single turn.

Where it falls short

Higher Input Overhead

At $2.50 per million input tokens, it is significantly more expensive than GPT-4o-mini for simple text-based messaging tasks that do not utilize the vision weights.

Proprietary Constraints

The model’s closed nature means you cannot fine-tune the identity or memory loop logic specifically for the Hermes architecture, unlike open-weight alternatives.

Best use cases with Hermes Agent

  • Visual UI Monitoring — Watching a Discord channel for specific graph screenshots and triggering shell scripts or MCP tools based on visual data analysis.
  • Cross-Platform Coordination — Navigating web tools that lack APIs by using vision to identify buttons and converting those pixels into actionable Hermes tool calls.

Not ideal for

  • Simple Notification Relays — Using a $2.50/1M input model just to move text between Telegram and Slack is a waste of resources when cheaper text-only models exist.
  • High-Volume Log Parsing — For scanning thousands of lines of shell output, the vision-optimized weights provide no benefit over cheaper, text-centric models.

Hermes Agent setup

Configure your OpenAI API key with vision permissions and set the max_tokens parameter to at least 100K to prevent the agent from truncating complex multi-platform execution plans.

Hermes makes custom endpoints easy. Run:

hermes model

Choose Custom endpoint from the menu. Enter the base URL and model identifier when prompted:

  • Base URL: https://api.haimaker.ai/v1
  • Model: openai/gpt-5-image-mini

Hermes stores the selection and uses it for all subsequent agent runs across whatever platforms you have wired up (Telegram, Discord, Slack, etc.). Tune HERMES_STREAM_READ_TIMEOUT and related env vars if you’re hitting slow providers.

How it compares

  • vs GPT-4o-mini — GPT-4o-mini is vastly cheaper at $0.15/1M input, making it the better choice for text-heavy Hermes workflows that don’t require high-fidelity image reasoning.
  • vs Gemini 1.5 Flash — Gemini 1.5 Flash offers a 1M context window versus GPT-5 Image Mini’s 400K, but OpenAI’s tool-calling reliability is generally more stable for Hermes’ 47 built-in tools.

Bottom line

The best choice for Hermes agents that live in visual chat apps or web browsers, provided you can justify the $2.50/1M input cost over cheaper text-only alternatives.

TRY GPT-5 IMAGE MINI IN HERMES


For more, see our Hermes local-LLM setup guide.