Running models locally means no API keys, no usage bills, and no sending proprietary code to someone else’s servers. It also means slower responses and lower quality on hard problems. Whether that tradeoff makes sense depends on what you’re doing.

Since Ollama became an official OpenClaw provider in March 2026, the setup is simpler than it used to be. And the Qwen3.5 family changed the math on what local hardware can actually do.

Model rankings

Current local models ranked for coding work in OpenClaw, based on SWE-bench scores, tool-calling reliability, and real-world agent performance:

| Model | Parameters | Activation | VRAM needed | SWE-bench | Speed (RTX 4090) | Best for |
|---|---|---|---|---|---|---|
| Qwen3 Coder Plus | 72B | 72B (dense) | 48GB+ | 70.6% | ~25 t/s | Hardest coding tasks, full agent loops |
| Qwen3.5 27B | 27B | 27B (dense) | 20GB+ | 72.4% | ~40 t/s | Best quality-to-size ratio for coding |
| Qwen3.5 35B-A3B | 35B | 3B (MoE) | 16GB+ | — | ~112 t/s | Speed-critical work, high throughput |
| Qwen3.5 9B | 9B | 9B (dense) | 8GB+ | — | ~80 t/s | Entry-level hardware, simple tasks |
| Llama 3.3 70B | 70B | 70B (dense) | 48GB+ | — | ~20 t/s | General coding, good instruction following |
| Qwen3 32B | 32B | 32B (dense) | 24GB+ | — | ~30 t/s | Solid all-rounder, widely tested |

Qwen3.5 27B hitting 72.4% on SWE-bench puts it in the same range as GPT-5 Mini — an open-weight model on a single consumer GPU matching a cloud model you’d pay per token to use.

The 35B-A3B is the wildcard. It’s a mixture-of-experts model that only activates 3B parameters per forward pass, so it runs at 112 tokens/second on an RTX 3090. Quality is lower than the 27B dense model on hard problems, but for file reads, boilerplate generation, and simple edits it’s fast enough to feel like a cloud API.

Hardware requirements

Local model quality scales with model size, and model size scales with hardware needs. The tiers:

8–16GB VRAM (entry level)

RTX 3070/4060 or 16GB unified memory (M1/M2 MacBook Pro). Enough for Qwen3.5 9B and the 35B-A3B MoE model. The 9B handles simple tasks and code summarization. The 35B-A3B uses far less memory than its parameter count suggests because of sparse activation.

Models: Qwen3.5 9B, Qwen3.5 35B-A3B

24–32GB VRAM (sweet spot)

RTX 4090 or 32GB unified memory (M2/M3 Pro/Max). This is where local models become practical for real work. Qwen3.5 27B runs comfortably here and its SWE-bench score rivals cloud models.

Models: Qwen3.5 27B, Qwen3 32B

48GB+ VRAM (premium)

2x A6000, A100, or 64GB+ unified memory (M2/M3 Ultra). Qwen3 Coder Plus lives here — it resolves real GitHub issues at a 70.6% rate on SWE-bench. Most people don’t have this hardware at home, but if you do, the gap between local and cloud gets small.

Models: Qwen3 Coder Plus, Llama 3.3 70B

If you’re on an M-series Mac, unified memory works well for inference and Apple has been optimizing Metal for LLM workloads. A 32GB M3 Pro runs the 27B model comfortably.
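A rough way to sanity-check these tiers: weight memory is roughly parameter count times bytes per parameter at your quantization level, plus overhead for the KV cache and runtime buffers. This is a back-of-the-envelope sketch, and the 20% overhead factor is an assumption, not a measured value:

```shell
# Rough VRAM estimate for a dense model: params (billions) x quantized bits / 8,
# times an assumed ~1.2x for KV cache and runtime buffers.
params_b=27   # e.g. Qwen3.5 27B
bits=4        # e.g. Q4 quantization
awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "~%.1f GB\n", p * b / 8 * 1.2 }'
# → ~16.2 GB
```

That lines up with the 20GB+ figure in the table once you leave headroom for a real context window. Note the estimate tracks total parameters, not active ones, which is why the 35B-A3B MoE model still needs 16GB despite activating only 3B parameters per token.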

Setting up Ollama

Ollama is the simplest way to run local models. Install it, pull a model, and you have an OpenAI-compatible API running on localhost.

# Install
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (pick one based on your hardware)
ollama pull qwen3.5:27b          # Best quality, needs 20GB+ VRAM
ollama pull qwen3.5:35b-a3b      # Fast MoE model, runs on 16GB
ollama pull qwen3.5:9b           # Lightweight, runs on 8GB
ollama pull qwen3-coder-plus     # Premium, needs 48GB+

Ollama serves an API at http://localhost:11434 by default.
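A quick way to confirm the server is up and see which models you have pulled (the fallback echo is just for illustration):

```shell
# List locally available models via Ollama's native API;
# prints a notice instead if the server isn't running.
curl -sf http://localhost:11434/api/tags || echo "Ollama is not running on :11434"
```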

Tip from r/LocalLLaMA: Several users report better performance switching from Ollama to llama.cpp directly for the 27B and larger models. Ollama adds convenience, but llama.cpp gives you more control over quantization and memory allocation. Start with Ollama — switch to llama.cpp if you hit performance walls.
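If you do outgrow Ollama, llama.cpp's bundled llama-server exposes a similar OpenAI-compatible endpoint. A minimal sketch — the GGUF path is a placeholder and the flag values are starting points, not tuned settings:

```shell
# Serve a GGUF model over an OpenAI-compatible API with llama.cpp's llama-server.
# -c sets the context length, -ngl the number of layers offloaded to the GPU.
if command -v llama-server >/dev/null; then
  llama-server -m ./qwen3.5-27b-q4_k_m.gguf -c 32768 -ngl 99 --port 8080
else
  echo "llama-server not installed; see the llama.cpp repository for builds"
fi
```

Point the `baseUrl` in your OpenClaw config at port 8080 instead of 11434 if you go this route.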

OpenClaw configuration

Since Ollama is now an official provider, the setup is straightforward. Run the onboarding wizard:

openclaw onboard --auth-choice ollama

Or add Ollama manually in ~/.openclaw/openclaw.json:

{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://localhost:11434/v1",
        "api": "openai-completions",
        "models": [
          {
            "id": "qwen3.5:27b",
            "name": "Qwen3.5 27B",
            "reasoning": false,
            "contextWindow": 131072,
            "maxTokens": 8192
          }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": { "primary": "ollama/qwen3.5:27b" },
      "models": {
        "ollama/qwen3.5:27b": { "alias": "qwen-local" }
      }
    }
  }
}

Switch to your local model:

/model qwen-local
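To verify the wiring independently of OpenClaw, you can hit the same OpenAI-compatible endpoint directly. The model name must match one you've actually pulled; the fallback echo is just for illustration:

```shell
# Send a one-off chat completion through Ollama's OpenAI-compatible endpoint.
curl -sf http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3.5:27b", "messages": [{"role": "user", "content": "Say hi"}]}' \
  || echo "request failed: is Ollama running and the model pulled?"
```

If this returns a completion but OpenClaw still fails, the problem is in your config, not the server.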

What local models handle well

After running Qwen3.5 27B locally for several weeks, a few things hold up:

  • Reading and summarizing code. Ask it what a function does and it gives you a solid answer. Not as nuanced as Sonnet 4.6, but good enough for navigating unfamiliar codebases.
  • Code generation for common patterns. Boilerplate, CRUD operations, config files, test scaffolding. It writes functional code on the first try most of the time. The 27B model handles multi-step generation better than any local model a year ago.
  • File operations and simple refactoring. Listing files, searching for patterns, renaming variables across a file. Mechanical tasks that don’t require deep reasoning.
  • Agentic tool calling. Qwen3.5 models specifically improved their function-calling reliability. The 27B scores 72.2 on BFCL-V4 (tool use benchmarks), which is better than some cloud models from a year ago.

Where local models fall short

  • Multi-file refactors. Anything that requires holding context across 5+ files gets unreliable. The model either loses track or makes inconsistent changes. Cloud models with 200K context windows still have a massive advantage here.
  • Complex debugging. If the bug requires reasoning through multiple abstraction layers, local models suggest surface-level fixes when the problem runs deeper. This is where Claude Opus and GPT-5 earn their price.
  • Speed on dense models. The 27B model runs at about 40 tokens/second on an RTX 4090; cloud APIs give you 80–150 t/s. The difference is noticeable during long code generation. (The 35B-A3B MoE model at 112 t/s is the exception.)
  • Very long context. Qwen3.5 supports up to 256K tokens in theory, but inference quality degrades on consumer hardware past 32K. Keep context windows realistic in your config.

The hybrid approach

Most people who try local models end up with a hybrid setup: local for the cheap stuff, cloud for the hard stuff.

{
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/qwen3.5:27b",
        "thinking": "anthropic/claude-sonnet-4-6-20260514"
      }
    }
  }
}

The local model handles file reads, simple edits, and boilerplate — maybe 60-70% of a typical coding session. Sonnet handles the debugging, architecture decisions, and multi-file work. Your API bill drops to a few dollars a day instead of $20-50.

Switch manually when you know a task needs more capability:

/model sonnet

Or use Haimaker’s auto-router to handle the routing for you. The auto-router detects task complexity and sends hard problems to cloud models automatically, so you don’t have to think about when to switch.

Troubleshooting

Model loads slowly or crashes. You’re probably out of memory. Try a smaller quantization: ollama pull qwen3.5:27b-q4_K_M uses less memory at a small quality cost. The Q4_K_M quantization is the sweet spot for most people — minimal quality loss, significant memory savings.

Tool calls fail. Set "reasoning": false in your model config and stick to Qwen3.5 models — they handle OpenClaw’s tool-calling format more reliably than Mistral or older Llama models. If tool calls still break, update Ollama to the latest version. The official provider integration fixed several edge cases.

Context window errors. Set contextWindow accurately in your config. For Qwen3.5 models, 131072 (128K) is a safe default on 24GB+ VRAM hardware. On 16GB, stick to 32768 to avoid quality degradation.
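For example, a model entry tuned for 16GB hardware might look like this, following the shape of the earlier config (the values are suggestions, not measured limits):

```json
{
  "id": "qwen3.5:35b-a3b",
  "name": "Qwen3.5 35B-A3B",
  "reasoning": false,
  "contextWindow": 32768,
  "maxTokens": 8192
}
```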

Slow generation speed. If you’re getting under 20 t/s on the 27B model, check whether other processes are using your GPU. Close any browser tabs running WebGL or video. On Mac, Activity Monitor → GPU History will show what’s competing for unified memory.
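On NVIDIA hardware, nvidia-smi shows which processes are holding VRAM (guarded here in case the tool isn't present):

```shell
# Show per-process GPU memory usage on NVIDIA cards.
if command -v nvidia-smi >/dev/null; then
  nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
else
  echo "nvidia-smi not found (not an NVIDIA system?)"
fi
```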



For model pricing comparisons, see cheapest models for OpenClaw. For reducing token costs on cloud models, see cutting costs by 96% with QMD.