Llama doesn’t fit neatly into the standard model-roundup format. Meta’s open-weight strategy means Llama shows up in two very different places: as cheap long-context cloud endpoints, and as the de facto standard for self-hosted OpenClaw setups.
Short version: Llama 4 Scout as the cloud default, Llama 4 Maverick when you need 1M tokens of context, Llama 3.3 70B for self-hosted. Llama 3 70B Instruct is still in API catalogs, but there’s no reason to pick it over Scout.
The quick answer
| Model | Input / Output ($ per M tokens) | Context | Best For |
|---|---|---|---|
| Llama 4 Scout | $0.08 / $0.30 | 328K | Default cloud Llama |
| Llama 4 Maverick | $0.15 / $0.60 | 1M | Long-context cloud work |
| Llama 3.3 70B (Ollama) | $0 (self-hosted) | 128K | Privacy, air-gapped, high volume |
| Llama 3 70B Instruct | $0.51 / $0.74 | 8K | Legacy — use Scout |
Start with Llama 4 Scout for cloud workloads. If you’re running local, pull Llama 3.3 70B through Ollama.
Llama 4 Scout — the cloud default
$0.08/M input, $0.30/M output, 328K context, 16K output cap. Scout is Meta’s answer to the cost-optimized tier — priced below DeepSeek and GLM Flash, with a much larger context window.
The 328K context is the real differentiator. Most budget models cap around 128K–200K; Scout can hold roughly twice as much code in working memory, which matters for codebase-wide refactors and large-document analysis.
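For a sense of scale, filling Scout’s entire window is still cheap at these rates. A quick back-of-envelope check:
# input cost of one request that fills Scout's full 328K window at $0.08/M tokens
echo "328000 * 0.08 / 1000000" | bc -l    # ≈ $0.026 per full-context call
Output tokens cost more per token, but at the 16K cap they add at most another half cent or so.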
Tool calling works but isn’t at Sonnet-level reliability. Function arguments are usually well-formed, but Scout occasionally misses optional parameters or invents schema fields. For OpenClaw’s interactive coding loops, you’ll want to keep a higher-quality fallback configured (Sonnet 4.6 is the usual pairing).
The 16K output cap is the weakness. Scout can’t regenerate large files in one pass — you’ll need to chunk your asks. For most refactor workflows this is fine; for full-file rewrites, reach for a model with a higher output limit.
Llama 4 Maverick — the 1M-context model
$0.15/M input, $0.60/M output, 1M context, 16K output cap. Maverick is Scout with more context, at roughly 2x the price.
The jump from 328K to 1M is only worth it if you actually need it. Most OpenClaw tasks fit comfortably in Scout’s 328K window. The scenarios where Maverick earns its premium:
- Monorepo-wide reasoning. You want the model to see the entire service, not a subset.
- Long conversation history. Agent sessions that span hours and need full context of previous tool calls.
- Large document analysis. Feeding a full codebase, ADRs, and design docs together.
If you’re doing anything shorter, Scout’s 328K is plenty and you’re paying for context you’ll never use.
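If you’re not sure which side of the line a project falls on, a rough size check settles it. This sketch assumes the common ~4 characters-per-token heuristic and that you’re feeding source files more or less verbatim; adjust the extensions to your stack:
# rough token estimate for a codebase: bytes of source divided by ~4 chars per token
find . \( -name '*.py' -o -name '*.ts' -o -name '*.go' \) -print0 | xargs -0 cat | wc -c | awk '{printf "~%d tokens\n", $1 / 4}'
Comfortably under ~328K and Scout covers it; well past that is Maverick territory.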
Llama 3.3 70B (self-hosted via Ollama)
Free in per-token terms but gated by hardware. You need 40GB+ of VRAM for the Q4_K_M quantization, which in practice means a server with an H100 or A100, or a machine splitting the model across two 24GB consumer GPUs. Apple Silicon with 64GB+ unified memory works but is slower.
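The 40GB figure is easy to sanity-check: Q4_K_M lands around 4.5–5 bits per weight, so the weights alone sit in the low 40s of gigabytes before KV cache and runtime overhead:
# approximate weight memory for a 70B model at Q4_K_M (~4.8 bits/weight effective)
echo "70 * 4.8 / 8" | bc -l    # ≈ 42 GB, plus KV cache on top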
Llama 3.3 70B is what you deploy when cloud isn’t an option:
- Privacy-sensitive workloads. Medical records, legal discovery, proprietary codebases that can’t leave the building.
- Air-gapped environments. Compliance requirements that prohibit any outbound connection.
- High-volume batch processing. At enough throughput, the amortized cost of self-hosting beats per-token cloud pricing.
The 128K context is narrower than Scout’s 328K, but for self-hosted use it’s rarely the bottleneck. The bottleneck is inference speed — expect 20-40 tokens/second on a single H100, which is slower than cloud.
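If you’d rather measure than trust ballpark numbers, Ollama prints generation stats when you pass --verbose; the prompt here is just a throwaway:
# --verbose reports prompt-eval and generation rates (tokens/s) after the response
ollama run llama3.3:70b-instruct-q4_K_M "Explain the tradeoffs of Q4 quantization in two sentences." --verbose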
Reliability for self-hosted is what you make it. Ollama crashes occasionally on long runs. For production, wrap it in a supervisor or run it through haimaker’s gateway with a cloud fallback configured. See our local LLM setup guide for hardware specifics.
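A supervisor doesn’t have to be fancy. The sketch below is the bare-minimum restart loop; a real deployment would use systemd, launchd, or whatever process manager you already run:
# naive supervisor: restart the server if it exits; replace with systemd/launchd in production
while true; do
  ollama serve
  echo "ollama exited, restarting in 5s" >&2
  sleep 5
done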
Llama 3 70B Instruct — skip it
Still in the OpenRouter catalog at $0.51/M input, $0.74/M output, 8K context. There’s no scenario where this is the right choice over Scout. Scout is 6x cheaper on input with 40x the context. If you have Llama 3 70B in an existing config, swap it out.
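Finding stragglers is a one-liner. The id below is OpenRouter’s slug for the old model; adjust if your provider names it differently:
# flag any legacy Llama 3 70B references still in the config
grep -n "llama-3-70b" ~/.openclaw/openclaw.json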
Setup in OpenClaw
Llama isn’t a built-in provider. Two routes, depending on where you’re running it.
Cloud through haimaker.ai
All Llama 4 cloud models are available through haimaker.ai:
{
  "models": {
    "providers": {
      "haimaker": {
        "baseUrl": "https://api.haimaker.ai/v1",
        "apiKey": "your-haimaker-api-key",
        "api": "openai-completions"
      }
    }
  }
}
Then add the models to your allowlist:
{
  "agents": {
    "defaults": {
      "models": {
        "meta-llama/llama-4-scout": {},
        "meta-llama/llama-4-maverick": {}
      }
    }
  }
}
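Before pointing agents at it, a quick smoke test confirms the key and model ids resolve. This assumes the endpoint speaks the OpenAI chat-completions dialect, as the api field above implies, and that your key lives in a HAIMAKER_API_KEY environment variable:
# one tiny completion through haimaker's OpenAI-compatible endpoint
curl -s https://api.haimaker.ai/v1/chat/completions \
  -H "Authorization: Bearer $HAIMAKER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/llama-4-scout", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 10}'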
Self-hosted through Ollama
Install Ollama and pull Llama 3.3:
brew install ollama
brew services start ollama    # or run `ollama serve` in a separate terminal
ollama pull llama3.3:70b-instruct-q4_K_M
Then add Ollama as a provider in ~/.openclaw/openclaw.json:
{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://localhost:11434/v1",
        "apiKey": "ollama",
        "api": "openai-completions"
      }
    }
  }
}
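Same smoke test locally; Ollama serves an OpenAI-compatible endpoint at the baseUrl above, and the model name must match the tag you pulled:
# one tiny completion against the local Ollama server
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.3:70b-instruct-q4_K_M", "messages": [{"role": "user", "content": "ping"}]}'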
Add the model to the allowlist under meta-llama/llama3.3:70b (or whatever tag you pulled) and apply config. Full walkthrough in our self-hosted local LLMs guide.
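If you’re not sure of the exact tag string, ask Ollama what it has pulled; the allowlist entry needs to match:
# list locally pulled models and their exact tags
curl -s http://localhost:11434/api/tags
# or simply: ollama list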
What I’d do
Run Llama 4 Scout as your default cloud Llama. The 328K context and $0.08/M pricing make it genuinely useful for cost-constrained workloads.
Reach for Maverick only when you know you need the full 1M context. Don’t pay for it speculatively.
Use Llama 3.3 70B self-hosted when you have a real reason — privacy, compliance, or volume. Self-hosting is a commitment: hardware, reliability, updates. Don’t do it casually.
Llama’s strength isn’t that any single model beats the Western flagships. It’s that the open-weight option exists at all. If your threat model or compliance story requires running inference inside your own infrastructure, Llama is the answer. If you’re just cost-optimizing and don’t need self-hosting, GLM-4.7 or DeepSeek V3.2 are better picks per dollar.
For self-hosted setup, see our local LLMs guide. For a full comparison of open-weight options, see the local models roundup.