Current as of March 2026. Grok 2 Vision adds image understanding to the Grok 2 base at the same $2/$10 price point. The catch is the context window drops to 33K — which is genuinely tight for a vision model, since image tokens eat into that budget fast.
Specs
| Spec | Value |
| --- | --- |
| Provider | xAI |
| Input cost | $2.00 / M tokens |
| Output cost | $10.00 / M tokens |
| Context window | 33K tokens |
| Max output | 33K tokens |
| Parameters | N/A |
| Features | function_calling, vision, web_search |
What it’s good at
Pricing
$2/M input for a vision model is hard to beat. GPT-4o charges $5/M for the same capability. If you’re processing a lot of images and tight on budget, this is one of the cheapest options available.
Inference speed
Responses come back quickly for vision tasks, which matters for UI automation agents where you need fast screen-read cycles.
OpenAI compatibility
The API follows the OpenAI spec closely enough that it works as a drop-in replacement in most agent frameworks. The config change is minimal.
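To make the drop-in claim concrete, here is a minimal sketch of the request body you'd send. It assumes the standard OpenAI chat-completions shape with multimodal content parts; the prompt text and image URL are placeholders. Only the base URL (`https://api.x.ai/v1`) and model ID change versus a stock OpenAI setup.

```python
import json

# Assumed endpoint for an OpenAI-compatible client; swap this in for
# api.openai.com in whatever framework you use.
BASE_URL = "https://api.x.ai/v1"

def vision_request(prompt: str, image_url: str) -> dict:
    """Build a standard OpenAI-style chat-completions body for Grok 2 Vision."""
    return {
        "model": "grok-2-vision",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }

body = vision_request("What does this screenshot show?",
                      "https://example.com/screenshot.png")
print(json.dumps(body, indent=2))
```

POST this body to `{BASE_URL}/chat/completions` with your xAI key as a bearer token, exactly as you would against OpenAI.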
Where it falls short
Context window
33K tokens is the real problem here. A single high-resolution image can consume a significant chunk of that budget, leaving little room for conversation history or long system prompts. More than any other limitation, this is what pushes people to a different model.
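Some back-of-envelope math shows how fast the budget disappears. The per-image and system-prompt token counts below are assumptions for illustration (high-res images on OpenAI-style vision APIs commonly land in the low thousands of tokens); substitute real numbers from your provider's usage stats.

```python
# Assumed numbers -- replace with your own measurements.
CONTEXT_WINDOW = 32_768
SYSTEM_PROMPT = 2_000      # assumed agent system prompt
TOKENS_PER_IMAGE = 1_800   # assumed cost of one high-res image
RESERVED_OUTPUT = 4_000    # room left for the model's reply

def remaining_history_budget(n_images: int) -> int:
    """Tokens left for conversation history after fixed costs."""
    return (CONTEXT_WINDOW - SYSTEM_PROMPT - RESERVED_OUTPUT
            - n_images * TOKENS_PER_IMAGE)

for n in (1, 4, 8):
    print(f"{n} images -> {remaining_history_budget(n)} tokens for history")
```

Under these assumptions, eight images leave only about 12K tokens for everything else, which is why long agentic loops don't fit.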
Function calling reliability
It misses arguments or ignores JSON schema constraints more often than Claude 3.5 Sonnet. If your tool-calling schema is complex, budget for retry handling.
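A minimal sketch of that retry handling, assuming your framework exposes the raw tool-call arguments as a JSON string. `call_model` and `REQUIRED_KEYS` are hypothetical stand-ins for your actual client call and schema; the idea is just to validate before executing and re-ask on failure.

```python
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"query", "max_results"}  # example schema requirement

def parse_tool_args(raw: str) -> Optional[dict]:
    """Return parsed args if they are valid JSON with all required keys, else None."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(args, dict) or not REQUIRED_KEYS <= args.keys():
        return None
    return args

def call_with_retries(call_model: Callable[[], str], max_attempts: int = 3) -> dict:
    """Re-invoke the model until its tool-call arguments pass validation."""
    for _ in range(max_attempts):
        args = parse_tool_args(call_model())
        if args is not None:
            return args
    raise RuntimeError(f"tool call failed validation after {max_attempts} attempts")
```

For complex schemas, a full JSON Schema validator in place of the key check catches constraint violations (types, enums) as well.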
Best use cases with OpenClaw
- High-volume OCR — The $2/$10 price makes bulk document and image processing economical. If you’re processing thousands of receipts, invoices, or screenshots, the cost math works out.
- Simple visual agents — Short “see and click” tasks where the conversation history stays short and the visual cues are unambiguous. Don’t try to build a long agentic loop on 33K tokens.
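The bulk-OCR cost math, sketched out. The per-document token counts are assumptions for illustration; the prices are the $2/$10 per million from the specs table.

```python
INPUT_PRICE = 2.00 / 1_000_000    # $ per input token
OUTPUT_PRICE = 10.00 / 1_000_000  # $ per output token

def batch_cost(n_docs: int, in_tokens: int, out_tokens: int) -> float:
    """Total dollar cost for a batch of uniform documents."""
    return n_docs * (in_tokens * INPUT_PRICE + out_tokens * OUTPUT_PRICE)

# Assumed: 10,000 receipts at ~1,500 input tokens (image + prompt)
# and ~300 output tokens each -- roughly $60 for the whole batch.
print(f"${batch_cost(10_000, 1_500, 300):.2f}")
```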
Not ideal for
- Long-form document reasoning — The 33K limit prevents loading multiple large images or maintaining a lengthy conversation history simultaneously.
- Complex multi-step reasoning — Hallucination rate increases on multi-step tasks compared to Sonnet. The base reasoning capability just isn’t as strong.
Run it through Haimaker
Skip juggling API keys. One Haimaker key gives you access to every model on the platform. Tell OpenClaw:
```
Add Haimaker as a custom provider to my OpenClaw config. Use these details:
- Provider name: haimaker
- Base URL: https://api.haimaker.ai/v1
- API key: [PASTE YOUR HAIMAKER API KEY HERE]
- API type: openai-completions

Add the auto-router model:
- haimaker/auto (reasoning: false, context: 128000, max tokens: 32000)

Create an alias "auto" for easy switching. Apply the config when done.
```
Or skip model selection entirely — Haimaker’s auto-router picks the best model for each task so you don’t have to.
OpenClaw setup
Add xAI as a provider in your OpenClaw config: point the base URL at api.x.ai/v1 and make sure your API key is funded via the xAI console. The config registers grok-2-vision under the xai provider, so you reference the model as xai/grok-2-vision.
```json
{
  "models": {
    "mode": "merge",
    "providers": {
      "xai": {
        "baseUrl": "https://api.x.ai/v1",
        "apiKey": "YOUR-XAI-API-KEY",
        "api": "openai-completions",
        "models": [
          {
            "id": "grok-2-vision",
            "name": "Grok 2 Vision",
            "cost": {
              "input": 2,
              "output": 10
            },
            "contextWindow": 32768,
            "maxTokens": 32768
          }
        ]
      }
    }
  }
}
```
How it compares
- vs GPT-4o — GPT-4o is more reliable for complex tool use but costs 2.5x more for input tokens.
- vs Claude 3.5 Sonnet — Sonnet has a much larger 200K context window and superior coding logic for a higher price.
- vs Gemini 1.5 Flash — Flash is cheaper and offers a 1M context window, though Grok 2 Vision often handles OCR with better accuracy.
Bottom line
Grok 2 Vision earns its place for high-volume, budget-conscious image processing. The 33K context limit is a real constraint though — if your use case needs any conversational depth or multiple images per request, it’ll bite you.
For setup instructions, see our API key guide. For all available models, see the complete models guide.