Note: Clawdbot is now called OpenClaw (website: openclaw.ai). This guide works with all versions. For new installs, use npm install -g openclaw.
Running your AI agent locally means your data never leaves your network. No API calls to external servers, no compliance paperwork, no per-token billing. The tradeoff is hardware investment and setup complexity.
Here's how to make it work.
Why self-host?
Three reasons keep coming up:
Privacy and compliance. Healthcare, legal, finance — regulated industries can't always send data to cloud APIs. HIPAA, GDPR, SOC2 audits get a lot simpler when prompts stay on-premise.
Cost at scale. If you're running thousands of requests per day, local inference can be 10-50x cheaper than cloud APIs. The break-even point is lower than most people think.
Latency control. No network round-trips. No provider rate limits. Your GPU, your queue.
Hardware requirements
Local LLMs need serious compute. Here's what actually works:
Consumer hardware (hobbyist/dev)
| Setup | VRAM | Models | Performance |
|---|---|---|---|
| RTX 4090 | 24GB | 7B-13B at full precision, 70B quantized | ~40 tok/s on 7B |
| RTX 4080 | 16GB | 7B at full precision, 13B quantized | ~30 tok/s on 7B |
| M2/M3 Max | 32-96GB unified | Up to 70B with offloading | ~20 tok/s on 7B |
A single RTX 4090 (~$1,600) handles most local use cases. For 70B+ models at reasonable speed, you need multiple GPUs or cloud rentals.
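A quick way to sanity-check whether a model fits your card: multiply the parameter count (in billions) by the bytes per weight for your quantization, then add roughly 15-20% for the KV cache and runtime overhead. These are ballpark estimates, not guarantees:
# ~0.5 bytes/weight at 4-bit, ~1 byte at 8-bit, ~2 bytes at fp16
echo "7 * 2 * 1.2" | bc     # 7B at fp16   -> ~17 GB, fits a 24GB card
echo "70 * 0.5 * 1.2" | bc  # 70B at 4-bit -> ~42 GB, needs 2x 24GB or partial CPU offload
This is also why a 70B model at q4 on a single 24GB card relies on partial CPU offload, which shows up as the lower throughput numbers later in this guide.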
Production hardware (enterprise/heavy use)
| Setup | VRAM | Models | Performance |
|---|---|---|---|
| 2x RTX 4090 | 48GB | 70B quantized | ~25 tok/s |
| A100 40GB | 40GB | 70B at 4-bit | ~50 tok/s |
| 2x A100 80GB | 160GB | 70B at full precision | ~80 tok/s |
| H100 80GB | 80GB | 70B at 8-bit | ~120 tok/s |
GPU rental vs ownership
Buy if you're running inference 8+ hours daily. A 4090 pays for itself in 3-6 months versus cloud API costs at moderate volume.
Rent for burst capacity or experimenting. Options:
- RunPod: $0.44/hr for RTX 4090, $1.99/hr for A100
- Vast.ai: Variable pricing, often cheaper
- Lambda Labs: $1.25/hr for A100, good availability
At 1000 requests/day with a 70B model, cloud APIs run $30-100/day. A rented A100 at $2/hr costs $48/day and handles the same load with room to spare.
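The arithmetic behind that comparison, with assumed request sizes (adjust for your own traffic):
# Assume ~1,000 requests/day averaging ~1,000-1,500 generated tokens each,
# i.e. roughly 1-1.5M output tokens/day.
echo "2 * 24" | bc   # rented A100 at $2/hr, running 24h -> $48/day, flat
# Cloud APIs bill per token, so the same traffic lands in the $30-100/day
# range quoted above and keeps growing with volume.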
Best local models by use case
Not all open-source models are equal. Here's what works for different tasks:
General assistant
Llama 3.3 70B — Meta's best open model. Strong reasoning, good instruction following. Needs 40GB+ VRAM for decent quantization.
Qwen 2.5 72B — Alibaba's entry. Slightly better at multilingual tasks. Similar hardware requirements.
Mistral Large — Good balance of capability and speed. The 123B version competes with GPT-4.
Coding
DeepSeek Coder V2 — Purpose-built for code. Handles completions, debugging, and multi-file edits well. The 33B version runs on a single 4090.
CodeLlama 70B — Meta's code-focused variant. Solid for general dev work but showing its age.
Qwen 2.5 Coder 32B — Punches above its weight. Good option if VRAM is tight.
Document analysis
Llama 3.3 70B with extended context — Handles up to 128K tokens. Good for legal docs, contracts, research papers.
Mixtral 8x22B — Mixture-of-experts architecture. Only activates 39B parameters per forward pass, so it's faster than you'd expect.
Small and fast
Llama 3.2 3B — Runs on almost anything. Good for simple tasks, routing, and classification.
Phi-3 Mini — Microsoft's small model. 3.8B parameters, surprisingly capable for its size.
Setting up Ollama
Ollama is the easiest path to local inference. It handles model downloads and quantization, and serves an OpenAI-compatible API.
Install
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from https://ollama.com/download
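A quick sanity check after installing (the second command applies to Linux installs done via the script, which registers a background service):
ollama --version          # confirms the CLI is on your PATH
systemctl status ollama   # Linux only: checks the Ollama service is running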
Pull models
# General purpose
ollama pull llama3.3:70b-instruct-q4_K_M
# Coding
ollama pull deepseek-coder-v2:33b
# Small and fast
ollama pull llama3.2:3b
The tag after the colon specifies quantization. q4_K_M is a good balance of quality and VRAM usage. Use q8_0 for better quality if you have the memory.
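You can confirm what you actually pulled (parameter count, quantization, context length) with Ollama's built-in inspection commands:
ollama show llama3.3:70b-instruct-q4_K_M   # prints architecture, quantization, context length
ollama list                                # lists pulled models with their on-disk size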
Start the server
ollama serve
By default, it listens on http://localhost:11434. Test it:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.3:70b-instruct-q4_K_M",
"prompt": "Hello!"
}'
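Ollama also exposes an OpenAI-compatible API under /v1, which is what the OpenClaw provider config later in this guide points at. You can hit it directly to confirm it behaves like a standard chat completions endpoint:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3:70b-instruct-q4_K_M",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'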
Docker setup for production
For production deployments, containerize everything:
# docker-compose.yml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  openclaw:
    image: openclaw/openclaw:latest
    depends_on:
      - ollama
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    ports:
      - "3000:3000"
    volumes:
      - ./openclaw-config:/root/.openclaw
    restart: unless-stopped

volumes:
  ollama_data:
Start it:
docker-compose up -d
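The gpu reservation in the compose file only works if the host has the NVIDIA Container Toolkit installed. A quick way to verify the host setup (the CUDA image tag here is just an example):
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi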
Pre-pull models in the container:
docker-compose exec ollama ollama pull llama3.3:70b-instruct-q4_K_M
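Once the pull finishes, confirm the server can see the model:
curl http://localhost:11434/api/tags   # lists models available to the Ollama server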
Integrating with OpenClaw
Add Ollama as a custom provider in ~/.openclaw/openclaw.json:
{
  "models": {
    "mode": "merge",
    "providers": {
      "ollama": {
        "baseUrl": "http://localhost:11434/v1",
        "apiKey": "ollama",
        "api": "openai-completions",
        "models": [
          {
            "id": "llama3.3:70b-instruct-q4_K_M",
            "name": "Llama 3.3 70B",
            "reasoning": false,
            "input": ["text"],
            "cost": {
              "input": 0,
              "output": 0,
              "cacheRead": 0,
              "cacheWrite": 0
            },
            "contextWindow": 128000,
            "maxTokens": 4096
          },
          {
            "id": "deepseek-coder-v2:33b",
            "name": "DeepSeek Coder V2",
            "reasoning": false,
            "input": ["text"],
            "cost": {
              "input": 0,
              "output": 0,
              "cacheRead": 0,
              "cacheWrite": 0
            },
            "contextWindow": 64000,
            "maxTokens": 4096
          }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "models": {
        "ollama/llama3.3:70b-instruct-q4_K_M": {
          "alias": "llama-local"
        },
        "ollama/deepseek-coder-v2:33b": {
          "alias": "coder-local"
        }
      }
    }
  }
}
Apply the config:
openclaw gateway config.apply --file ~/.openclaw/openclaw.json
Now switch models in OpenClaw:
/model llama-local
Performance benchmarks
Real-world numbers from testing common agent tasks:
Latency (time to first token)
| Model | Local (4090) | Cloud API |
|---|---|---|
| 7B | 50ms | 200-400ms |
| 13B | 80ms | 300-500ms |
| 70B | 200ms | 400-800ms |
Local wins on latency every time. No network overhead.
Throughput (tokens per second)
| Model | Local (4090) | Local (A100) | Cloud API |
|---|---|---|---|
| 7B | 40 tok/s | 80 tok/s | 50-100 tok/s |
| 13B | 25 tok/s | 60 tok/s | 40-80 tok/s |
| 70B (q4) | 12 tok/s | 50 tok/s | 30-60 tok/s |
Cloud APIs can be faster for large models if you don't have top-tier hardware. But you're paying per token.
Cost comparison (1M tokens/day)
| Setup | Monthly cost |
|---|---|
| Claude Sonnet 4 | ~$540 |
| GPT-4o | ~$300 |
| Local 4090 (owned) | ~$40 (electricity) |
| Local 4090 (rented) | ~$320 |
| A100 (rented) | ~$600 |
The math favors local if you own the hardware and have consistent volume. Renting only makes sense for burst workloads or testing.
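For reference, the electricity figure above assumes roughly a 450W sustained system draw and $0.12/kWh; plug in your own numbers:
echo "0.45 * 24 * 30 * 0.12" | bc   # kW x hours/day x days x $/kWh -> ~$39/month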
When to use local vs cloud
Use local when:
- Compliance requires it. Data can't leave your network. Full stop.
- Volume is high. 500K+ tokens/day makes local cost-effective.
- Latency matters. Real-time applications benefit from no network hops.
- You need predictability. No rate limits, no provider outages.
Use cloud when:
- You need frontier capabilities. Claude Opus 4.5 and GPT-4 are still ahead of open-source for complex reasoning.
- Volume is low. Under 100K tokens/day, cloud is simpler.
- Burst capacity. Spinning up GPUs for a one-time project isn't worth it.
- You want zero maintenance. No driver updates, no OOM errors, no thermal throttling.
Hybrid approach
Most people land on a mix: local models for simple stuff, cloud APIs for the hard problems.
OpenClaw makes this easy. Configure both your local Ollama instance and a cloud provider like Haimaker, then switch between them as needed:
{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://localhost:11434/v1",
        "apiKey": "ollama",
        "api": "openai-completions",
        "models": [
          {
            "id": "llama3.3:70b-instruct-q4_K_M",
            "name": "Llama 3.3 70B Local"
          }
        ]
      },
      "haimaker": {
        "baseUrl": "https://api.haimaker.ai/v1",
        "apiKey": "${HAIMAKER_API_KEY}",
        "api": "openai-completions"
      }
    }
  },
  "agents": {
    "defaults": {
      "models": {
        "ollama/llama3.3:70b-instruct-q4_K_M": { "alias": "local" },
        "haimaker/claude-sonnet-4": { "alias": "cloud" }
      }
    }
  }
}
In chat, switch models with /model local or /model cloud. Use local for quick questions and drafts, cloud when you need stronger reasoning or tool use.
Haimaker gives you access to Claude, GPT-4, Gemini, and a bunch of cheaper alternatives through one API. Useful when your local model hits its limits or you're away from your hardware.
Troubleshooting
Out of memory errors
Reduce batch size or use a smaller quantization:
# Switch from q4 to q3
ollama pull llama3.3:70b-instruct-q3_K_M
Or enable partial GPU offloading via Ollama's Modelfile, as sketched below.
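A minimal sketch, assuming your Ollama version accepts num_gpu (the number of layers placed on the GPU) as a Modelfile parameter; the layer count is illustrative and should be tuned to whatever fits your card:
# Build a variant that keeps only some layers on the GPU; the rest run from system RAM
cat > Modelfile <<'EOF'
FROM llama3.3:70b-instruct-q4_K_M
PARAMETER num_gpu 40
EOF
ollama create llama3.3-70b-partial -f Modelfile
ollama run llama3.3-70b-partial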
Slow generation
Check GPU utilization with nvidia-smi. If it's not maxed out, the bottleneck is probably CPU or memory bandwidth. On Apple Silicon, ensure you're using the Metal backend.
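To watch utilization while a request is in flight:
watch -n 1 nvidia-smi   # low GPU% alongside high CPU usage usually means layers spilled to system RAM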
Model not found in OpenClaw
Make sure the model name in your config matches exactly what Ollama reports:
ollama list
Use the full name including the tag.
Getting started
- Install Ollama and pull a model
- Add it to your OpenClaw config
- Test with simple prompts
- Scale up hardware as needed
Start small. A 7B model on consumer hardware is enough to validate your workflow before investing in bigger iron.
For more on configuring custom providers in OpenClaw, see our guide on integrating custom LLM providers. Visit openclaw.ai to get started with your own AI agent.
