Gemma 4 12B is Google's 12-billion-parameter open model, released June 3, 2026 under the Apache 2.0 license. It is natively multimodal (text, vision, and audio) with an encoder-free design, and Google says its benchmark performance nears the 26B model while using less than half the memory.

How much memory do I need to run Gemma 4 12B locally?

Google lists 16GB of VRAM or unified memory as the floor for running Gemma 4 12B on a consumer laptop. At the default 4-bit quantization the weights are roughly 8GB, so 16GB leaves room for context and other apps. 24GB or more is comfortable for long coding sessions.

Can I use Gemma 4 12B with OpenCode?

Yes. Run Gemma 4 12B through Ollama, then add Ollama as a custom provider in ~/.config/opencode/opencode.jsonc pointing at http://localhost:11434/v1. Restart OpenCode and pick the model with /models. No API key or internet connection is needed once the model is pulled.

How to Run Gemma 4 12B Locally with Ollama and OpenCode

Google released Gemma 4 12B on June 3, 2026, and it slots into the spot most local coders actually want: big enough to reason well, small enough to run on a laptop you already own.

The headline is the new middle size. The previous Gemma 4 lineup gave you a tiny edge model and a 26B flagship, with a gap in between. The 12B fills it. Google says its benchmarks approach the 26B model while using less than half the memory, and the whole thing fits inside the 16GB floor that most recent Macs and mid-range GPUs already clear.

This guide covers what changed in the 12B variant, then walks through running it with Ollama and connecting it to OpenCode as a local, private coding assistant.

What’s new in Gemma 4 12B

12 billion parameters, positioned between the E4B edge model and the 26B Mixture-of-Experts flagship.
Native multimodal input. Text, vision, and audio go into the same model. The architecture is encoder-free: images run through a lightweight embedding module, and raw audio is projected straight into the text token space. Fewer moving parts than a bolted-on vision encoder.
Reasoning that nears the 26B model. Google reports the 12B’s benchmark performance approaching its 26B variant, at under half the memory footprint.
Multi-Token Prediction (MTP) drafters for lower latency, which helps interactive use where you’re waiting on every token.
Built for agentic workflows, so tool-calling and multi-step coding loops are first-class rather than an afterthought.
Apache 2.0 license. Commercial use, fine-tuning, redistribution. The Gemma 4 family has now passed 150 million downloads.

Weights are on Hugging Face and Kaggle, with day-one support across Transformers, llama.cpp, MLX, SGLang, and vLLM. Ollama is built on llama.cpp and GGUF, so the 12B runs there the same way the earlier Gemma 4 models did.

Gemma 4 12B vs the rest of the family

Variant	Best for	Memory
E4B (edge)	Phones, embedded and on-device apps	Minimal
12B (new)	Local coding plus vision/audio on a laptop	16GB+
26B MoE (flagship)	Heavier reasoning, large multi-file work	24GB+

If you’re on a lighter machine or just want the smaller default model, the older Gemma 4 + Ollama + OpenCode setup still applies. The 12B is the upgrade you reach for when you have the memory headroom and want noticeably stronger reasoning without jumping to the 26B.

What you need

A Mac with Apple Silicon (M1–M5) and at least 16GB of unified memory, or a PC with a 16GB+ GPU
Homebrew on macOS
OpenCode installed (see opencode.ai)

Google’s stated floor is 16GB. At that level you can run the 12B comfortably for everyday work. If you have 24GB or more, long sessions and bigger context windows stop being a worry.

Step 1: Install Ollama

On macOS:

brew install --cask ollama-app
open -a Ollama

On Linux:

curl -fsSL https://ollama.com/install.sh | sh

Wait for the menu bar icon (macOS) or the service to start, then confirm the server is up:

ollama list

The local API runs at http://localhost:11434.

Step 2: Pull Gemma 4 12B

ollama pull gemma4:12b

At the default 4-bit quantization the download is roughly 8GB. Verify it landed:

ollama list
# NAME             ID              SIZE      MODIFIED
# gemma4:12b       ...             ~8 GB     ...

Run a quick sanity check:

ollama run gemma4:12b "Write a small TypeScript function that debounces a callback"

Confirm the GPU is doing the work:

ollama ps
# Should show a CPU/GPU split, e.g. 12%/88% CPU/GPU

On Apple Silicon, recent Ollama builds use Apple’s MLX backend automatically, so you don’t need to configure anything for acceleration.

Tag not found? Gemma 4 12B is brand new, so if gemma4:12b isn’t in the Ollama registry yet, pull the official GGUF from Hugging Face and import it, or update Ollama (brew upgrade ollama-app) and retry.

Step 3: Connect Gemma 4 12B to OpenCode

OpenCode reads its config from ~/.config/opencode/opencode.jsonc. Add Ollama as a custom provider:

{
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://localhost:11434/v1"
      },
      "models": {
        "gemma4:12b": {}
      }
    }
  }
}

Ollama doesn’t validate keys, but OpenCode still expects an auth entry. Add a placeholder to ~/.local/share/opencode/auth.json:

{
  "ollama": {
    "type": "api",
    "key": "ollama"
  }
}

Restart OpenCode, run /models, and switch to ollama/gemma4:12b. You now have a coding assistant that never sends a line of your code off the machine.

Step 4: Keep the model warm

By default, Ollama unloads a model after about five minutes idle, which means a cold start every time you come back to the terminal. Keep it loaded:

launchctl setenv OLLAMA_KEEP_ALIVE "-1"

Restart Ollama for it to take effect. To persist across reboots, add this to ~/.zshrc:

export OLLAMA_KEEP_ALIVE="-1"

On the Ollama menu bar icon you can also enable Launch at Login so the server is ready before you are.

What Gemma 4 12B handles well in OpenCode

The extra parameters and the MTP drafters show up most in the work that used to feel marginal on the smaller model:

Multi-step edits. It holds a plan across a few files better than the 8B did, so small refactors land more often on the first try.
Code explanation and review. Ask what a module does or where a bug might hide, and the answers are sharper.
Boilerplate and scaffolding. Config files, test stubs, route handlers, and CRUD layers come out clean.
Vision input. Because the 12B is multimodal, you can hand it a screenshot of an error dialog or a UI mockup and ask for a fix or a component, without standing up a separate vision model.

Where it still falls short

Large, cross-cutting refactors. Coordinated changes across a dozen files still drift. The 12B does this better than the 8B did, but it hasn’t fixed the problem.
The hardest debugging. Bugs that span several layers of abstraction or need deep domain knowledge are where a frontier cloud model still earns its place.
Very long context on 16GB. The model supports large windows, but quality degrades under memory pressure on a 16GB machine. Keep inputs reasonable, or move up to 24GB+.

Go hybrid: local Gemma 4 12B plus cloud models

The setup most people settle on is local for the routine 70% and cloud for the hard 30%. Here’s where to get each:

haimaker.ai — one API key for Claude Opus, GPT-5, Gemini Pro, and hundreds of other models, with unified pricing and benchmarks so you can compare before you route.
Ollama — your local Gemma 4 12B, free and private, for everyday edits and reads.
Provider APIs directly — if you only ever need one cloud vendor and want to manage keys per provider yourself.

Add Haimaker alongside Ollama in OpenCode:

{
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://localhost:11434/v1"
      },
      "models": {
        "gemma4:12b": {}
      }
    },
    "haimaker": {
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "https://api.haimaker.ai/v1"
      },
      "models": {
        "anthropic/claude-sonnet-4-6": {},
        "openai/gpt-5": {},
        "google/gemini-2.5-pro": {}
      }
    }
  }
}

Add your Haimaker key to auth.json:

{
  "ollama": {
    "type": "api",
    "key": "ollama"
  },
  "haimaker": {
    "type": "api",
    "key": "YOUR_HAIMAKER_API_KEY"
  }
}

Use Gemma 4 12B for the quick stuff, then /models over to Sonnet or GPT-5 when a task gets hard. Your cloud bill drops to a fraction of running everything on a frontier model. To skip the manual switching, Haimaker’s auto-router can detect task complexity and pick the model for you.

GET YOUR HAIMAKER API KEY

Troubleshooting

Provider not showing in /models. Restart OpenCode after editing config. It doesn’t reload opencode.jsonc while running.

“Model not found.” Run ollama list and match the model ID exactly, usually gemma4:12b. If the tag isn’t in the registry yet, see the note in Step 2 about importing the Hugging Face GGUF.

Authentication errors with Ollama. The placeholder "key": "ollama" in auth.json is enough. OpenCode just needs an entry to exist.

Slow generation. Make sure you’re on a recent Ollama build for MLX acceleration on Apple Silicon (ollama --version). Close memory-heavy apps. On 16GB, a few browser tabs running video can push you into swap.

Quality drops on long prompts. That’s memory pressure on a 16GB machine. Keep context inputs modest, or move to 24GB+ for headroom.

Useful Ollama commands

Command	Description
`ollama list`	List downloaded models
`ollama ps`	Show running models and memory usage
`ollama run gemma4:12b`	Interactive chat
`ollama stop gemma4:12b`	Unload from memory
`ollama pull gemma4:12b`	Update to the latest version
`ollama rm gemma4:12b`	Delete the model

The bottom line

Gemma 4 12B is the local model a lot of people were waiting for: multimodal, Apache-licensed, and strong enough to handle the bulk of day-to-day coding on a 16GB laptop. Run it through Ollama, point OpenCode at it, and you have a private assistant for the routine work. Keep a cloud model a /models switch away for the hard problems, and you keep the speed and privacy of running local without hitting its reasoning ceiling.

New to local setups? Start with the Gemma 4 + OpenCode guide for the smaller default model, or the Gemma 4 + OpenClaw setup if OpenClaw is your agent. For cloud pricing, see the model catalog.