Google released Gemma 4 12B on June 3, 2026, and it slots into the spot most local coders actually want: big enough to reason well, small enough to run on a laptop you already own.
The headline is the new middle size. The previous Gemma 4 lineup gave you a tiny edge model and a 26B flagship, with a gap in between. The 12B fills it. Google says its benchmarks approach the 26B model while using less than half the memory, and the whole thing fits inside the 16GB floor that most recent Macs and mid-range GPUs already clear.
This guide covers what changed in the 12B variant, then walks through running it with Ollama and connecting it to OpenCode as a local, private coding assistant.
What’s new in Gemma 4 12B
- 12 billion parameters, positioned between the E4B edge model and the 26B Mixture-of-Experts flagship.
- Native multimodal input. Text, vision, and audio go into the same model. The architecture is encoder-free: images run through a lightweight embedding module, and raw audio is projected straight into the text token space. Fewer moving parts than a bolted-on vision encoder.
- Reasoning that nears the 26B model. Google reports the 12B’s benchmark performance approaching its 26B variant, at under half the memory footprint.
- Multi-Token Prediction (MTP) drafters for lower latency, which helps interactive use where you’re waiting on every token.
- Built for agentic workflows, so tool-calling and multi-step coding loops are first-class rather than an afterthought.
- Apache 2.0 license. Commercial use, fine-tuning, redistribution. The Gemma 4 family has now passed 150 million downloads.
Weights are on Hugging Face and Kaggle, with day-one support across Transformers, llama.cpp, MLX, SGLang, and vLLM. Ollama is built on llama.cpp and GGUF, so the 12B runs there the same way the earlier Gemma 4 models did.
Gemma 4 12B vs the rest of the family
| Variant | Best for | Memory |
|---|---|---|
| E4B (edge) | Phones, embedded and on-device apps | Minimal |
| 12B (new) | Local coding plus vision/audio on a laptop | 16GB+ |
| 26B MoE (flagship) | Heavier reasoning, large multi-file work | 24GB+ |
If you’re on a lighter machine or just want the smaller default model, the older Gemma 4 + Ollama + OpenCode setup still applies. The 12B is the upgrade you reach for when you have the memory headroom and want noticeably stronger reasoning without jumping to the 26B.
What you need
- A Mac with Apple Silicon (M1–M5) and at least 16GB of unified memory, or a PC with a 16GB+ GPU
- Homebrew on macOS
- OpenCode installed (see opencode.ai)
Google’s stated floor is 16GB. At that level you can run the 12B comfortably for everyday work. If you have 24GB or more, long sessions and bigger context windows stop being a worry.
Step 1: Install Ollama
On macOS:
brew install --cask ollama-app
open -a Ollama
On Linux:
curl -fsSL https://ollama.com/install.sh | sh
Wait for the menu bar icon (macOS) or the service to start, then confirm the server is up:
ollama list
The local API runs at http://localhost:11434.
Step 2: Pull Gemma 4 12B
ollama pull gemma4:12b
At the default 4-bit quantization the download is roughly 8GB. Verify it landed:
ollama list
# NAME ID SIZE MODIFIED
# gemma4:12b ... ~8 GB ...
Run a quick sanity check:
ollama run gemma4:12b "Write a small TypeScript function that debounces a callback"
Confirm the GPU is doing the work:
ollama ps
# Should show a CPU/GPU split, e.g. 12%/88% CPU/GPU
On Apple Silicon, recent Ollama builds use Apple’s MLX backend automatically, so you don’t need to configure anything for acceleration.
Tag not found? Gemma 4 12B is brand new, so if
gemma4:12bisn’t in the Ollama registry yet, pull the official GGUF from Hugging Face and import it, or update Ollama (brew upgrade ollama-app) and retry.
Step 3: Connect Gemma 4 12B to OpenCode
OpenCode reads its config from ~/.config/opencode/opencode.jsonc. Add Ollama as a custom provider:
{
"provider": {
"ollama": {
"npm": "@ai-sdk/openai-compatible",
"options": {
"baseURL": "http://localhost:11434/v1"
},
"models": {
"gemma4:12b": {}
}
}
}
}
Ollama doesn’t validate keys, but OpenCode still expects an auth entry. Add a placeholder to ~/.local/share/opencode/auth.json:
{
"ollama": {
"type": "api",
"key": "ollama"
}
}
Restart OpenCode, run /models, and switch to ollama/gemma4:12b. You now have a coding assistant that never sends a line of your code off the machine.
Step 4: Keep the model warm
By default, Ollama unloads a model after about five minutes idle, which means a cold start every time you come back to the terminal. Keep it loaded:
launchctl setenv OLLAMA_KEEP_ALIVE "-1"
Restart Ollama for it to take effect. To persist across reboots, add this to ~/.zshrc:
export OLLAMA_KEEP_ALIVE="-1"
On the Ollama menu bar icon you can also enable Launch at Login so the server is ready before you are.
What Gemma 4 12B handles well in OpenCode
The extra parameters and the MTP drafters show up most in the work that used to feel marginal on the smaller model:
- Multi-step edits. It holds a plan across a few files better than the 8B did, so small refactors land more often on the first try.
- Code explanation and review. Ask what a module does or where a bug might hide, and the answers are sharper.
- Boilerplate and scaffolding. Config files, test stubs, route handlers, and CRUD layers come out clean.
- Vision input. Because the 12B is multimodal, you can hand it a screenshot of an error dialog or a UI mockup and ask for a fix or a component, without standing up a separate vision model.
Where it still falls short
- Large, cross-cutting refactors. Coordinated changes across a dozen files still drift. The 12B does this better than the 8B did, but it hasn’t fixed the problem.
- The hardest debugging. Bugs that span several layers of abstraction or need deep domain knowledge are where a frontier cloud model still earns its place.
- Very long context on 16GB. The model supports large windows, but quality degrades under memory pressure on a 16GB machine. Keep inputs reasonable, or move up to 24GB+.
Go hybrid: local Gemma 4 12B plus cloud models
The setup most people settle on is local for the routine 70% and cloud for the hard 30%. Here’s where to get each:
- haimaker.ai — one API key for Claude Opus, GPT-5, Gemini Pro, and hundreds of other models, with unified pricing and benchmarks so you can compare before you route.
- Ollama — your local Gemma 4 12B, free and private, for everyday edits and reads.
- Provider APIs directly — if you only ever need one cloud vendor and want to manage keys per provider yourself.
Add Haimaker alongside Ollama in OpenCode:
{
"provider": {
"ollama": {
"npm": "@ai-sdk/openai-compatible",
"options": {
"baseURL": "http://localhost:11434/v1"
},
"models": {
"gemma4:12b": {}
}
},
"haimaker": {
"npm": "@ai-sdk/openai-compatible",
"options": {
"baseURL": "https://api.haimaker.ai/v1"
},
"models": {
"anthropic/claude-sonnet-4-6": {},
"openai/gpt-5": {},
"google/gemini-2.5-pro": {}
}
}
}
}
Add your Haimaker key to auth.json:
{
"ollama": {
"type": "api",
"key": "ollama"
},
"haimaker": {
"type": "api",
"key": "YOUR_HAIMAKER_API_KEY"
}
}
Use Gemma 4 12B for the quick stuff, then /models over to Sonnet or GPT-5 when a task gets hard. Your cloud bill drops to a fraction of running everything on a frontier model. To skip the manual switching, Haimaker’s auto-router can detect task complexity and pick the model for you.
Sign up at haimaker.ai and browse the full model catalog.
Troubleshooting
Provider not showing in /models. Restart OpenCode after editing config. It doesn’t reload opencode.jsonc while running.
“Model not found.” Run ollama list and match the model ID exactly, usually gemma4:12b. If the tag isn’t in the registry yet, see the note in Step 2 about importing the Hugging Face GGUF.
Authentication errors with Ollama. The placeholder "key": "ollama" in auth.json is enough. OpenCode just needs an entry to exist.
Slow generation. Make sure you’re on a recent Ollama build for MLX acceleration on Apple Silicon (ollama --version). Close memory-heavy apps. On 16GB, a few browser tabs running video can push you into swap.
Quality drops on long prompts. That’s memory pressure on a 16GB machine. Keep context inputs modest, or move to 24GB+ for headroom.
Useful Ollama commands
| Command | Description |
|---|---|
ollama list | List downloaded models |
ollama ps | Show running models and memory usage |
ollama run gemma4:12b | Interactive chat |
ollama stop gemma4:12b | Unload from memory |
ollama pull gemma4:12b | Update to the latest version |
ollama rm gemma4:12b | Delete the model |
The bottom line
Gemma 4 12B is the local model a lot of people were waiting for: multimodal, Apache-licensed, and strong enough to handle the bulk of day-to-day coding on a 16GB laptop. Run it through Ollama, point OpenCode at it, and you have a private assistant for the routine work. Keep a cloud model a /models switch away for the hard problems, and you keep the speed and privacy of running local without hitting its reasoning ceiling.
New to local setups? Start with the Gemma 4 + OpenCode guide for the smaller default model, or the Gemma 4 + OpenClaw setup if OpenClaw is your agent. For cloud pricing, see the model catalog.