Current as of March 2026. Grok 2 Vision adds image understanding to the Grok 2 base at the same $2/$10 price point. The catch is the context window drops to 33K — which is genuinely tight for a vision model, since image tokens eat into that budget fast.
Specs
| Spec | Value |
| --- | --- |
| Provider | xAI |
| Input cost | $2.00 / M tokens |
| Output cost | $10.00 / M tokens |
| Context window | 33K tokens |
| Max output | 33K tokens |
| Parameters | N/A |
| Features | function_calling, vision, web_search |
What it’s good at
Pricing
$2/M input for a vision model is hard to beat. GPT-4o charges $5/M for the same capability. If you’re processing a lot of images and tight on budget, this is one of the cheapest options available.
Inference speed
Responses come back quickly for vision tasks, which matters for UI automation agents where you need fast screen-read cycles.
OpenAI compatibility
The API follows the OpenAI spec closely enough that it works as a drop-in replacement in most agent frameworks. The config change is minimal.
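To make the drop-in claim concrete, here is a minimal sketch of the request body you'd send. It assumes the standard OpenAI chat-completions shape with multimodal content parts; the prompt text and image URL are placeholders. Only the base URL (`https://api.x.ai/v1`) and model ID change versus a stock OpenAI setup.

```python
import json

# Assumed endpoint for an OpenAI-compatible client; swap this in for
# api.openai.com in whatever framework you use.
BASE_URL = "https://api.x.ai/v1"

def vision_request(prompt: str, image_url: str) -> dict:
    """Build a standard OpenAI-style chat-completions body for Grok 2 Vision."""
    return {
        "model": "grok-2-vision",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }

body = vision_request("What does this screenshot show?",
                      "https://example.com/screenshot.png")
print(json.dumps(body, indent=2))
```

POST this body to `{BASE_URL}/chat/completions` with your xAI key as a bearer token, exactly as you would against OpenAI.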
Where it falls short
Context window
33K tokens is the real problem here. A single high-resolution image can consume a significant chunk of that budget, leaving little room for conversation history or long system prompts. More than any other limitation, this is what pushes people to a different model.
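Some back-of-envelope math shows how fast the budget disappears. The per-image and system-prompt token counts below are assumptions for illustration (high-res images on OpenAI-style vision APIs commonly land in the low thousands of tokens); substitute real numbers from your provider's usage stats.

```python
# Assumed numbers -- replace with your own measurements.
CONTEXT_WINDOW = 32_768
SYSTEM_PROMPT = 2_000      # assumed agent system prompt
TOKENS_PER_IMAGE = 1_800   # assumed cost of one high-res image
RESERVED_OUTPUT = 4_000    # room left for the model's reply

def remaining_history_budget(n_images: int) -> int:
    """Tokens left for conversation history after fixed costs."""
    return (CONTEXT_WINDOW - SYSTEM_PROMPT - RESERVED_OUTPUT
            - n_images * TOKENS_PER_IMAGE)

for n in (1, 4, 8):
    print(f"{n} images -> {remaining_history_budget(n)} tokens for history")
```

Under these assumptions, eight images leave only about 12K tokens for everything else, which is why long agentic loops don't fit.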
Function calling reliability
It misses arguments or ignores JSON schema constraints more often than Claude 3.5 Sonnet. If your tool-calling schema is complex, budget for retry handling.
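A minimal sketch of that retry handling, assuming your framework exposes the raw tool-call arguments as a JSON string. `call_model` and `REQUIRED_KEYS` are hypothetical stand-ins for your actual client call and schema; the idea is just to validate before executing and re-ask on failure.

```python
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"query", "max_results"}  # example schema requirement

def parse_tool_args(raw: str) -> Optional[dict]:
    """Return parsed args if they are valid JSON with all required keys, else None."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(args, dict) or not REQUIRED_KEYS <= args.keys():
        return None
    return args

def call_with_retries(call_model: Callable[[], str], max_attempts: int = 3) -> dict:
    """Re-invoke the model until its tool-call arguments pass validation."""
    for _ in range(max_attempts):
        args = parse_tool_args(call_model())
        if args is not None:
            return args
    raise RuntimeError(f"tool call failed validation after {max_attempts} attempts")
```

For complex schemas, a full JSON Schema validator in place of the key check catches constraint violations (types, enums) as well.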
Best use cases with OpenClaw
- High-volume OCR — The $2/$10 price makes bulk document and image processing economical. If you’re processing thousands of receipts, invoices, or screenshots, the cost math works out.
- Simple visual agents — Short “see and click” tasks where the conversation history stays short and the visual cues are unambiguous. Don’t try to build a long agentic loop on 33K tokens.
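The bulk-OCR cost math, sketched out. The per-document token counts are assumptions for illustration; the prices are the $2/$10 per million from the specs table.

```python
INPUT_PRICE = 2.00 / 1_000_000    # $ per input token
OUTPUT_PRICE = 10.00 / 1_000_000  # $ per output token

def batch_cost(n_docs: int, in_tokens: int, out_tokens: int) -> float:
    """Total dollar cost for a batch of uniform documents."""
    return n_docs * (in_tokens * INPUT_PRICE + out_tokens * OUTPUT_PRICE)

# Assumed: 10,000 receipts at ~1,500 input tokens (image + prompt)
# and ~300 output tokens each -- roughly $60 for the whole batch.
print(f"${batch_cost(10_000, 1_500, 300):.2f}")
```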
Not ideal for
- Long-form document reasoning — The 33K limit prevents loading multiple large images or maintaining a lengthy conversation history simultaneously.
- Complex multi-step reasoning — Hallucination rate increases on multi-step tasks compared to Sonnet. The base reasoning capability just isn’t as strong.
Run it through Haimaker
Skip juggling API keys. One Haimaker key gives you access to every model on the platform. Tell OpenClaw:
```
Add Haimaker as a custom provider to my OpenClaw config. Use these details:
- Provider name: haimaker
- Base URL: https://api.haimaker.ai/v1
- API key: [PASTE YOUR HAIMAKER API KEY HERE]
- API type: openai-completions

Add the auto-router model:
- haimaker/auto (reasoning: false, context: 128000, max tokens: 32000)

Create an alias "auto" for easy switching. Apply the config when done.
```
Or skip model selection entirely — Haimaker’s auto-router picks the best model for each task so you don’t have to.
OpenClaw setup
Add xAI as a provider in your OpenClaw config: point the base URL at api.x.ai/v1 and make sure your API key is funded via the xAI console. The config registers grok-2-vision under the xai provider, so you reference the model as xai/grok-2-vision.
```json
{
  "models": {
    "mode": "merge",
    "providers": {
      "xai": {
        "baseUrl": "https://api.x.ai/v1",
        "apiKey": "YOUR-XAI-API-KEY",
        "api": "openai-completions",
        "models": [
          {
            "id": "grok-2-vision",
            "name": "Grok 2 Vision",
            "cost": {
              "input": 2,
              "output": 10
            },
            "contextWindow": 32768,
            "maxTokens": 32768
          }
        ]
      }
    }
  }
}
```
How it compares
- vs GPT-4o — GPT-4o is more reliable for complex tool use but costs 2.5x more for input tokens.
- vs Claude 3.5 Sonnet — Sonnet has a much larger 200K context window and superior coding logic for a higher price.
- vs Gemini 1.5 Flash — Flash is cheaper and offers a 1M context window, though Grok 2 Vision often handles OCR with better accuracy.
Bottom line
Grok 2 Vision earns its place for high-volume, budget-conscious image processing. The 33K context limit is a real constraint though — if your use case needs any conversational depth or multiple images per request, it’ll bite you.
For setup instructions, see our API key guide. For all available models, see the complete models guide.