Note: Clawdbot is now called OpenClaw (website: openclaw.ai). This guide works with all versions. For new installs, use npm install -g openclaw.
Running your AI agent locally means your data never leaves your network. No API calls to external servers, no compliance paperwork, no per-token billing. The tradeoff is hardware investment and setup complexity.
Here's how to make it work.
Why self-host?
Three reasons keep coming up:
Privacy and compliance. Healthcare, legal, finance — regulated industries can't always send data to cloud APIs. HIPAA, GDPR, SOC2 audits get a lot simpler when prompts stay on-premise.
Cost at scale. If you're running thousands of requests per day, local inference can be 10-50x cheaper than cloud APIs. The break-even point is lower than most people think.
Latency control. No network round-trips. No provider rate limits. Your GPU, your queue.
Hardware requirements
Local LLMs need serious compute. Here's what actually works:
Consumer hardware (hobbyist/dev)
| Setup | VRAM | Models | Performance |
|---|---|---|---|
| RTX 4090 | 24GB | 7B-13B at full precision, 70B quantized | ~40 tok/s on 7B |
| RTX 4080 | 16GB | 7B at full precision, 13B quantized | ~30 tok/s on 7B |
| M2/M3 Max | 32-96GB unified | Up to 70B with offloading | ~20 tok/s on 7B |
A single RTX 4090 (~$1,600) handles most local use cases. For 70B+ models at reasonable speed, you need multiple GPUs or cloud rentals.
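A quick way to sanity-check whether a model fits your card: multiply the parameter count (in billions) by the bytes per weight for your quantization, then add roughly 15-20% for the KV cache and runtime overhead. These are ballpark estimates, not guarantees:
# ~0.5 bytes/weight at 4-bit, ~1 byte at 8-bit, ~2 bytes at fp16
echo "7 * 2 * 1.2" | bc     # 7B at fp16   -> ~17 GB, fits a 24GB card
echo "70 * 0.5 * 1.2" | bc  # 70B at 4-bit -> ~42 GB, needs 2x 24GB or partial CPU offload
This is also why a 70B model at q4 on a single 24GB card relies on partial CPU offload, which shows up as the lower throughput numbers later in this guide.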
Production hardware (enterprise/heavy use)
| Setup | VRAM | Models | Performance |
|---|---|---|---|
| 2x RTX 4090 | 48GB | 70B quantized | ~25 tok/s |
| A100 40GB | 40GB | 70B at 4-bit | ~50 tok/s |
| 2x A100 80GB | 160GB | 70B at full precision | ~80 tok/s |
| H100 80GB | 80GB | 70B at 8-bit | ~120 tok/s |
GPU rental vs ownership
Buy if you're running inference 8+ hours daily. A 4090 pays for itself in 3-6 months versus cloud API costs at moderate volume.
Rent for burst capacity or experimenting. Options:
- RunPod: $0.44/hr for RTX 4090, $1.99/hr for A100
- Vast.ai: Variable pricing, often cheaper
- Lambda Labs: $1.25/hr for A100, good availability
At 1000 requests/day with a 70B model, cloud APIs run $30-100/day. A rented A100 at $2/hr costs $48/day and handles the same load with room to spare.
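The arithmetic behind that comparison, with assumed request sizes (adjust for your own traffic):
# Assume ~1,000 requests/day averaging ~1,000-1,500 generated tokens each,
# i.e. roughly 1-1.5M output tokens/day.
echo "2 * 24" | bc   # rented A100 at $2/hr, running 24h -> $48/day, flat
# Cloud APIs bill per token, so the same traffic lands in the $30-100/day
# range quoted above and keeps growing with volume.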
Best local models by use case
Not all open-source models are equal. Here's what works for different tasks:
General assistant
Llama 3.3 70B — Meta's best open model. Strong reasoning, good instruction following. Needs 40GB+ VRAM for decent quantization.
Qwen 2.5 72B — Alibaba's entry. Slightly better at multilingual tasks. Similar hardware requirements.
Mistral Large — Good balance of capability and speed. The 123B version competes with GPT-4.
Coding
DeepSeek Coder V2 — Purpose-built for code. Handles completions, debugging, and multi-file edits well. The 33B version runs on a single 4090.
CodeLlama 70B — Meta's code-focused variant. Solid for general dev work but showing its age.
Qwen 2.5 Coder 32B — Punches above its weight. Good option if VRAM is tight.
Document analysis
Llama 3.3 70B with extended context — Handles up to 128K tokens. Good for legal docs, contracts, research papers.
Mixtral 8x22B — Mixture-of-experts architecture. Only activates 39B parameters per forward pass, so it's faster than you'd expect.
Small and fast
Llama 3.2 3B — Runs on almost anything. Good for simple tasks, routing, and classification.
Phi-3 Mini — Microsoft's small model. 3.8B parameters, surprisingly capable for its size.
Setting up Ollama
Ollama is the easiest path to local inference. It handles model downloads and quantization, and serves an OpenAI-compatible API.
Install
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from https://ollama.com/download
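A quick sanity check after installing (the second command applies to Linux installs done via the script, which registers a background service):
ollama --version          # confirms the CLI is on your PATH
systemctl status ollama   # Linux only: checks the Ollama service is running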
Pull models
# General purpose
ollama pull llama3.3:70b-instruct-q4_K_M
# Coding
ollama pull deepseek-coder-v2:33b
# Small and fast
ollama pull llama3.2:3b
The tag after the colon specifies quantization. q4_K_M is a good balance of quality and VRAM usage. Use q8_0 for better quality if you have the memory.
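You can confirm what you actually pulled (parameter count, quantization, context length) with Ollama's built-in inspection commands:
ollama show llama3.3:70b-instruct-q4_K_M   # prints architecture, quantization, context length
ollama list                                # lists pulled models with their on-disk size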
Start the server
ollama serve
By default, it listens on http://localhost:11434. Test it:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.3:70b-instruct-q4_K_M",
"prompt": "Hello!"
}'
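Ollama also exposes an OpenAI-compatible API under /v1, which is what the OpenClaw provider config later in this guide points at. You can hit it directly to confirm it behaves like a standard chat completions endpoint:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3:70b-instruct-q4_K_M",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'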
Docker setup for production
For production deployments, containerize everything:
# docker-compose.yml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  openclaw:
    image: openclaw/openclaw:latest
    depends_on:
      - ollama
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    ports:
      - "3000:3000"
    volumes:
      - ./openclaw-config:/root/.openclaw
    restart: unless-stopped

volumes:
  ollama_data:
Start it:
docker-compose up -d
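The gpu reservation in the compose file only works if the host has the NVIDIA Container Toolkit installed. A quick way to verify the host setup (the CUDA image tag here is just an example):
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi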
Pre-pull models in the container:
docker-compose exec ollama ollama pull llama3.3:70b-instruct-q4_K_M
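Once the pull finishes, confirm the server can see the model:
curl http://localhost:11434/api/tags   # lists models available to the Ollama server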
Integrating with OpenClaw
Add Ollama as a custom provider in ~/.openclaw/openclaw.json:
{
  "models": {
    "mode": "merge",
    "providers": {
      "ollama": {
        "baseUrl": "http://localhost:11434/v1",
        "apiKey": "ollama",
        "api": "openai-completions",
        "models": [
          {
            "id": "llama3.3:70b-instruct-q4_K_M",
            "name": "Llama 3.3 70B",
            "reasoning": false,
            "input": ["text"],
            "cost": {
              "input": 0,
              "output": 0,
              "cacheRead": 0,
              "cacheWrite": 0
            },
            "contextWindow": 128000,
            "maxTokens": 4096
          },
          {
            "id": "deepseek-coder-v2:33b",
            "name": "DeepSeek Coder V2",
            "reasoning": false,
            "input": ["text"],
            "cost": {
              "input": 0,
              "output": 0,
              "cacheRead": 0,
              "cacheWrite": 0
            },
            "contextWindow": 64000,
            "maxTokens": 4096
          }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "models": {
        "ollama/llama3.3:70b-instruct-q4_K_M": {
          "alias": "llama-local"
        },
        "ollama/deepseek-coder-v2:33b": {
          "alias": "coder-local"
        }
      }
    }
  }
}
Apply the config:
openclaw gateway config.apply --file ~/.openclaw/openclaw.json
Now switch models in OpenClaw:
/model llama-local
Performance benchmarks
Real-world numbers from testing common agent tasks:
Latency (time to first token)
| Model | Local (4090) | Cloud API |
|---|---|---|
| 7B | 50ms | 200-400ms |
| 13B | 80ms | 300-500ms |
| 70B | 200ms | 400-800ms |
Local wins on latency every time. No network overhead.
Throughput (tokens per second)
| Model | Local (4090) | Local (A100) | Cloud API |
|---|---|---|---|
| 7B | 40 tok/s | 80 tok/s | 50-100 tok/s |
| 13B | 25 tok/s | 60 tok/s | 40-80 tok/s |
| 70B (q4) | 12 tok/s | 50 tok/s | 30-60 tok/s |
Cloud APIs can be faster for large models if you don't have top-tier hardware. But you're paying per token.
Cost comparison (1M tokens/day)
| Setup | Monthly cost |
|---|---|
| Claude Sonnet 4 | ~$540 |
| GPT-4o | ~$300 |
| Local 4090 (owned) | ~$40 (electricity) |
| Local 4090 (rented) | ~$320 |
| A100 (rented) | ~$600 |
The math favors local if you own the hardware and have consistent volume. Renting only makes sense for burst workloads or testing.
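For reference, the electricity figure above assumes roughly a 450W sustained system draw and $0.12/kWh; plug in your own numbers:
echo "0.45 * 24 * 30 * 0.12" | bc   # kW x hours/day x days x $/kWh -> ~$39/month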
When to use local vs cloud
Use local when:
- Compliance requires it. Data can't leave your network. Full stop.
- Volume is high. 500K+ tokens/day makes local cost-effective.
- Latency matters. Real-time applications benefit from no network hops.
- You need predictability. No rate limits, no provider outages.
Use cloud when:
- You need frontier capabilities. Claude Opus 4.5 and GPT-4 are still ahead of open-source for complex reasoning.
- Volume is low. Under 100K tokens/day, cloud is simpler.
- Burst capacity. Spinning up GPUs for a one-time project isn't worth it.
- You want zero maintenance. No driver updates, no OOM errors, no thermal throttling.
Hybrid approach
Most people land on a mix: local models for simple stuff, cloud APIs for the hard problems.
OpenClaw makes this easy. Configure both your local Ollama instance and a cloud provider like Haimaker, then switch between them as needed:
{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://localhost:11434/v1",
        "apiKey": "ollama",
        "api": "openai-completions",
        "models": [
          {
            "id": "llama3.3:70b-instruct-q4_K_M",
            "name": "Llama 3.3 70B Local"
          }
        ]
      },
      "haimaker": {
        "baseUrl": "https://api.haimaker.ai/v1",
        "apiKey": "${HAIMAKER_API_KEY}",
        "api": "openai-completions"
      }
    }
  },
  "agents": {
    "defaults": {
      "models": {
        "ollama/llama3.3:70b-instruct-q4_K_M": { "alias": "local" },
        "haimaker/claude-sonnet-4": { "alias": "cloud" }
      }
    }
  }
}
In chat, switch models with /model local or /model cloud. Use local for quick questions and drafts, cloud when you need stronger reasoning or tool use.
Haimaker gives you access to Claude, GPT-4, Gemini, and a bunch of cheaper alternatives through one API. Useful when your local model hits its limits or you're away from your hardware.
Troubleshooting
Out of memory errors
Reduce batch size or use a smaller quantization:
# Switch from q4 to q3
ollama pull llama3.3:70b-instruct-q3_K_M
Or enable partial GPU offloading via Ollama's Modelfile, as sketched below.
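A minimal sketch, assuming your Ollama version accepts num_gpu (the number of layers placed on the GPU) as a Modelfile parameter; the layer count is illustrative and should be tuned to whatever fits your card:
# Build a variant that keeps only some layers on the GPU; the rest run from system RAM
cat > Modelfile <<'EOF'
FROM llama3.3:70b-instruct-q4_K_M
PARAMETER num_gpu 40
EOF
ollama create llama3.3-70b-partial -f Modelfile
ollama run llama3.3-70b-partial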
Slow generation
Check GPU utilization with nvidia-smi. If it's not maxed out, the bottleneck is probably CPU or memory bandwidth. On Apple Silicon, ensure you're using the Metal backend.
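To watch utilization while a request is in flight:
watch -n 1 nvidia-smi   # low GPU% alongside high CPU usage usually means layers spilled to system RAM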
Model not found in OpenClaw
Make sure the model name in your config matches exactly what Ollama reports:
ollama list
Use the full name including the tag.
Getting started
- Install Ollama and pull a model
- Add it to your OpenClaw config
- Test with simple prompts
- Scale up hardware as needed
Start small. A 7B model on consumer hardware is enough to validate your workflow before investing in bigger iron.
For more on configuring custom providers in OpenClaw, see our guide on integrating custom LLM providers. Visit openclaw.ai to get started with your own AI agent.
