Current as of April 2026. GPT-4o-audio-preview is a specialized variant for Hermes users who need native voice processing without the latency of separate STT/TTS pipelines. It brings OpenAI’s top-tier tool-use reliability to audio-centric workflows across platforms like WhatsApp and Telegram.
Specs
| Provider | OpenAI |
| Input cost | $2.50 / M tokens |
| Output cost | $10 / M tokens |
| Context window | 128K tokens |
| Max output | 16K tokens |
| Parameters | N/A |
| Features | function_calling |
What it’s good at
Native Audio Reasoning
It processes tone and inflection directly, which is vital for Hermes agents that need to interpret the emotional context of voice memos.
Tool-Use Stability
It inherits the robust function-calling capabilities of the GPT-4 family, ensuring Hermes can reliably trigger its 47 built-in tools during autonomous runs.
Where it falls short
Premium Pricing
At $10 per million output tokens, it is significantly more expensive than standard models for agents that primarily process text.
Preview Limitations
As a preview model, it may face more frequent rate limits or API instability during long-running autonomous sessions compared to the stable GPT-4o branch.
Best use cases with Hermes Agent
- Voice-First Messaging — Perfect for Hermes instances running on WhatsApp where users interact via voice notes rather than typing.
- Accessible Automation — Enables hands-free control of shell commands and platform monitoring through direct audio input and output.
Not ideal for
- Text-Only Workflows — You are paying a massive premium for audio capabilities that go unused if your agent only monitors Slack or Discord text.
- High-Volume Background Tasks — The $10/M output cost makes it prohibitively expensive for persistent, high-frequency autonomous logging or monitoring.
Hermes Agent setup
Set the model ID to openai/gpt-4o-audio-preview and ensure your API key has audio modality permissions enabled. Configure Hermes to pass audio buffers directly to the model to minimize latency in voice-to-tool execution.
Hermes makes custom endpoints easy. Run:
hermes model
Choose Custom endpoint from the menu. Enter the base URL and model identifier when prompted:
- Base URL:
https://api.haimaker.ai/v1 - Model:
openai/gpt-4o-audio-preview
Hermes stores the selection and uses it for all subsequent agent runs across whatever platforms you have wired up (Telegram, Discord, Slack, etc.). Tune HERMES_STREAM_READ_TIMEOUT and related env vars if you’re hitting slow providers.
How it compares
- vs GPT-4o-mini — Mini is vastly cheaper at $0.60/M output for standard tool-use but lacks the native audio reasoning required for processing voice notes directly.
- vs Claude 3.5 Sonnet — Sonnet provides superior reasoning for complex MCP tool chains but requires a separate Whisper pipeline for audio, which increases total latency.
Bottom line
This is the go-to model for Hermes users building voice-activated autonomous agents, provided the budget supports the $10/M output cost.
For more, see our Hermes local-LLM setup guide.