The right local coding model is the largest one that fits entirely in your GPU’s VRAM, with room left for context. That last part matters more than people expect. A 32B model that runs at 60 tokens per second when it fits in VRAM drops to 1-2 tokens per second the moment it spills into system RAM. For a coding agent that loops through dozens of tool calls, that is the difference between usable and abandoned.

So the question is not “what is the best Ollama coding model,” it is “what is the best model my card can hold.” This guide maps current NVIDIA RTX consumer GPUs to the coding model that fits, the speed to expect, and the point where you should stop forcing local inference and reach for a cloud model.

For the model rankings on their own merits, see best Ollama models for coding agents. This post is the hardware companion to it.

Quick map: GPU to model

GPUVRAMBest coding model to pullRough speed
RTX 509032GBqwen3-coder:30b (Q4/Q5)~110 t/s (30B MoE, 32K ctx)
RTX 409024GBqwen3-coder:30b (Q4_K_M)~105 t/s
RTX 309024GBqwen3-coder:30b (Q4_K_M)~87 t/s
RTX 5070 Ti / 4070 Ti Super16GBqwen3:14b (Q8)~50 t/s
RTX 4060 Ti 16GB16GBqwen3:14b (Q8)~22-34 t/s
RTX 3060 12GB12GBqwen3:8binteractive
8GB cards (3050, 4060)8GBgemma4:e4b, qwen3:8b (Q4)light tasks

Numbers are practical ballparks for token generation at a working context size, not peak marketing figures. Prompt processing is much faster on every card.

32GB: RTX 5090

The 5090 is the first consumer NVIDIA card that runs a 30B coding model the way you actually want to use it. With 32GB of GDDR7 and around 1.8 TB/s of memory bandwidth, it holds Qwen3 Coder 30B at Q4_K_M (about 18GB) or the higher-quality Q5_K_M (about 22GB) with enough left over for a long context window.

ollama pull qwen3-coder:30b

Expect the 30B MoE to generate around 110 tokens per second on a 32K context, and a dense 32B model at Q4 to land near 60 tokens per second. Token generation is memory-bandwidth bound, and the 5090’s bandwidth is roughly 77 percent higher than the 4090’s, which is where most of its lead comes from. This is the card to buy if you want a local coding agent that keeps up with your typing and your tool calls.

Even on a 5090, hard multi-file tasks are better handled by a frontier cloud model. haimaker gives you one endpoint to run Qwen3 Coder locally for the cheap work and escalate to cloud models when the task gets expensive in attention instead of tokens.

ROUTE LOCAL AND CLOUD MODELS WITH HAIMAKER

24GB: RTX 4090 and RTX 3090

Both 24GB cards run the same model: Qwen3 Coder 30B-A3B at Q4_K_M. At roughly 18GB it leaves about 6GB for KV cache and context, which is workable for real agent sessions.

ollama pull qwen3-coder:30b

The split between these two cards is speed and price, not capability. The 4090 generates tokens about 20 percent faster than the 3090 on a 30B model. The 3090, bought used, is the best VRAM-per-dollar value in local AI right now: it fits the exact same 30B coding model for a fraction of a new card’s price, which makes it the smart pick for learning, light use, or a tight budget.

Q5_K_M (about 22GB) technically fits 24GB, but it leaves almost no room for context. Stick with Q4_K_M unless you are running short prompts and know why you need the extra precision.

16GB: RTX 5070 Ti, 4070 Ti Super, 4060 Ti 16GB

Sixteen gigabytes is the line where 30B models stop being practical. A 30B model at Q4 is about 18GB, so it spills out of a 16GB card into system RAM and the speed collapses. Stay in the 14B class and you get a fast, fully-resident model with room for context.

ollama pull qwen3:14b

Qwen3 14B at Q8 fits comfortably and produces strong code. On an RTX 4060 Ti 16GB expect around 22-34 tokens per second; on a GDDR7 RTX 5060 Ti the same model runs closer to 50 tokens per second thanks to higher memory bandwidth. Other solid 16GB picks are phi4:14b and deepseek-r1:14b. This tier is genuinely useful for code explanation, small edits, test drafts, and config generation, and it is where most developers without a flagship GPU should land.

12GB: RTX 3060 12GB and similar

At 12GB the sweet spot is the 8B class. Qwen3 8B runs at interactive speeds and handles the everyday agent work: explaining unfamiliar code, writing small functions, drafting tests, summarizing logs.

ollama pull qwen3:8b

You can squeeze a 14B model at Q4 (about 8GB) onto a 12GB card, but context space gets tight quickly. For a smooth agent loop, an 8B model that stays fully resident usually feels better than a 14B model fighting for memory.

8GB: entry-level cards

Eight gigabytes runs lightweight models only: Gemma 4 E4B or an 8B model at Q4.

ollama pull gemma4:e4b

Set expectations accordingly. This tier is good for code explanation, boilerplate, and small single-file edits. It is not the place to run a coding agent through a multi-file refactor. Smaller models lose the thread once the agent starts opening files, revising patches, and juggling tool output.

The rule that beats every benchmark

Keep the model fully in VRAM. Everything else is secondary.

  • A 32B model at ~60 t/s in VRAM becomes ~1-2 t/s the moment it spills to system RAM
  • Quantization is how you make a model fit: Q4_K_M is the practical default, Q5/Q8 only when you have spare VRAM
  • Prompt processing is fast on every card; token generation is the bottleneck, and it scales with memory bandwidth
  • A smaller model that stays resident beats a bigger model that swaps, every time

When you are picking a quant, leave headroom for context. A model that fits at 0K context but overflows at 16K will swap mid-session, which is the worst case.

When to use a cloud model instead

Local is best when privacy, cost, or offline work matters. It is not automatically better for every task, and no consumer GPU changes that. Reach for a cloud model when:

  • The task spans many files
  • The bug is subtle and needs strong reasoning
  • You need reliable tool calling
  • The change will touch production
  • You do not have time to review every generated line

The practical setup is local-first, not local-only. Run Qwen3 Coder or a 14B model on your RTX card for cheap private work, then escalate to a stronger cloud model when the task gets hard.

How to route local and cloud together

  • haimaker.ai — one endpoint that lets you run your local Ollama model for simple coding-agent work and fall back to frontier cloud models for the hard tasks, without juggling separate API keys
  • Ollama — runs the local model on your RTX card with an OpenAI-compatible API on localhost
  • Your coding agent — point OpenCode, OpenClaw, or any OpenAI-compatible agent at the local endpoint for cheap work and the cloud endpoint for everything else

ROUTE LOCAL AND CLOUD MODELS WITH HAIMAKER

The bottom line

Buy the card that holds the model you want, then run the largest quant that leaves room for context. A 32GB RTX 5090 runs Qwen3 Coder 30B with headroom. A 24GB 3090 or 4090 runs the same 30B at a lower price. A 16GB card runs a fast 14B. Below that, you get a capable assistant for small tasks and a reason to keep a cloud model on standby for the hard ones.


For the model rankings on their own merits, see best Ollama models for coding agents. For OpenClaw-specific local setup, see best local models for OpenClaw.