MiniMax just dropped M2.5, and the numbers are weird.

This model beats Claude Opus 4.6 on SWE-bench Pro and SWE-bench Verified. It ships with native prompt caching at $0.06/M tokens for cache reads. And it costs roughly one-tenth to one-twentieth of comparable models on a per-token basis.

For developers building AI-powered products, this changes the calculus on every routing decision.

The benchmarks that matter

SWE-bench isn’t a toy benchmark. It tests real-world software engineering: take a real GitHub issue and write the patch that resolves it, verified against the repository’s test suite. M2.5 doesn’t just compete here; it leads.

With results on SWE-bench Multilingual, SWE-bench Pro, and Multi-SWE-bench to back it up, MiniMax is explicitly positioning M2.5 as the model for developers who ship code for a living.

The practical implication: if you’re building coding assistants, automated PR review tools, or any dev-focused AI workflow, M2.5 belongs in your routing matrix.

Built for agents, not just chat

This matters most for the agent use case. M2.5 was designed from the ground up for autonomous AI systems:

  • Tool-calling chains: Reliable multi-step tool execution for complex workflows
  • Long-horizon planning: Extended reasoning without losing the thread
  • Multilingual programming: Code in any language without performance degradation

Most models do okay at single-turn tasks. Agents need models that maintain coherence across dozens of tool calls while tracking internal state. M2.5’s architecture optimizes for exactly this.
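To make the shape of that workload concrete, here is a minimal sketch of a multi-step tool-calling loop. The `call_model` client, tool names, and message format are illustrative stand-ins, not MiniMax’s actual API:

```python
# Minimal agent loop sketch. `call_model` stands in for any chat client
# (e.g., an OpenAI-compatible endpoint); tools and stop logic are toy.

def run_agent(task, call_model, tools, max_steps=100):
    """Drive a multi-step tool-calling loop until the model finishes."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)        # model picks: answer or tool call
        if reply.get("tool") is None:
            return reply["content"]         # final answer ends the loop
        result = tools[reply["tool"]](**reply["args"])  # run requested tool
        messages.append({"role": "assistant", "content": f"call {reply['tool']}"})
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("agent exceeded max_steps without finishing")

# Stubbed model for the sketch: one tool call, then a final answer.
def fake_model(messages):
    if len(messages) == 1:
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"tool": None, "content": "The sum is 5."}

print(run_agent("What is 2 + 3?", fake_model, {"add": lambda a, b: a + b}))
```

The point of the skeleton: every step is another round trip through the model, so per-call coherence and per-call speed both compound over the run.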

The economics stack up

Pricing is where this gets interesting for cost-conscious developers:

| Metric | MiniMax M2.5 | Claude Opus 4 | GPT-4o |
|---|---|---|---|
| Input tokens | $0.30/M | ~$15/M | $5/M |
| Output tokens | $1.20/M | ~$75/M | $15/M |
| Prompt cache reads | $0.06/M | N/A | N/A |
| Cache writes | $0.375/M | N/A | N/A |
| Output speed | ~100 TPS | ~30 TPS | ~60 TPS |

The raw token math is compelling. But for agent workloads, speed matters even more than price: at roughly three times the output speed of Opus-class models, M2.5 compounds its advantage in any workflow that chains multiple model calls.

For a 100-step agent task generating a few hundred output tokens per step, the per-call latency savings add up to minutes per task end to end.
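To sanity-check those economics on your own workload, the table’s rates reduce to simple arithmetic. The prices and speeds below come from the table above; the workload shape (steps, tokens per step) is an assumption, not a measurement:

```python
# Back-of-envelope cost and latency per task, using the rates in the
# pricing table above. Workload numbers (steps, tokens) are illustrative.

def task_cost(input_tok, output_tok, in_price, out_price):
    """Dollar cost of one task; prices are quoted per million tokens."""
    return (input_tok * in_price + output_tok * out_price) / 1e6

def gen_latency(output_tok, tps):
    """Seconds spent generating output at a given tokens/sec rate."""
    return output_tok / tps

# Example: 100-step agent task, ~2K input / ~500 output tokens per step.
steps, inp, out = 100, 2_000, 500
total_in, total_out = steps * inp, steps * out

m25 = task_cost(total_in, total_out, 0.30, 1.20)   # MiniMax M2.5
opus = task_cost(total_in, total_out, 15.0, 75.0)  # Opus-class rates
gpt = task_cost(total_in, total_out, 5.0, 15.0)    # GPT-4o rates
print(f"M2.5 ${m25:.2f} | Opus-class ${opus:.2f} | GPT-4o ${gpt:.2f}")
# → M2.5 $0.12 | Opus-class $6.75 | GPT-4o $1.75

# Generation time for the same 50K output tokens at the table's speeds.
print(f"M2.5 {gen_latency(total_out, 100):.0f}s | "
      f"Opus-class {gen_latency(total_out, 30):.0f}s")
# → M2.5 500s | Opus-class 1667s
```

Swap in your own token counts; the ratios are what matter, and they hold across workload sizes because the formula is linear.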

Two access paths, same model

MiniMax ships M2.5 in two variants:

  • M2.5: Standard version, full capability
  • M2.5-lightning: Same results, optimized for speed

Both versions support automatic prompt caching with no configuration required. The model weights are fully open-sourced on HuggingFace, which means you can also self-host if you prefer not to use the API.
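One way to gauge what that caching is worth: blend the cache-read, cache-write, and fresh-input rates by the fraction of your prompt each covers. The rates come from the pricing table above; the hit and write fractions in the example are assumptions about a typical agent workload:

```python
# Effective input price per million tokens when part of each prompt is
# served from cache. Rates are from the pricing table above.

CACHE_READ = 0.06    # $/M tokens: cached prefix re-read
CACHE_WRITE = 0.375  # $/M tokens: first write of a reusable prefix
FRESH_INPUT = 0.30   # $/M tokens: uncached input

def blended_input_price(hit_rate, write_rate):
    """Average $/M input tokens.

    hit_rate: fraction of input tokens read from cache
    write_rate: fraction of input tokens newly written to cache
    The remainder is billed at the fresh-input rate.
    """
    fresh = 1.0 - hit_rate - write_rate
    return hit_rate * CACHE_READ + write_rate * CACHE_WRITE + fresh * FRESH_INPUT

# Agent loops re-send the same system prompt and history every step, so
# hit rates climb fast. Example: 80% cached, 10% newly written.
print(f"effective input price: ${blended_input_price(0.80, 0.10):.4f}/M")
```

Note the design trade-off the rates imply: a cache write ($0.375/M) costs more than fresh input ($0.30/M), so caching only pays off for prefixes that are actually re-read, which is exactly the multi-step agent pattern.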

Where this fits in your routing stack

For Haimaker users, M2.5 opens up new routing opportunities:

  1. Coding workloads: Route complex code generation to M2.5 instead of Claude and save 80-90% at comparable output quality
  2. Agent workflows: M2.5’s speed advantage compounds on multi-step tasks—more agent steps per second, lower latency end-to-end
  3. Caching optimization: High-volume applications with repeated context benefit from native caching that other providers charge more for
  4. Fallback routing: Add M2.5 as a cost-effective fallback when primary models hit rate limits

The combination of SOTA coding performance, agent-native design, and aggressive pricing makes M2.5 a first-tier option for production workloads—not just a “cheap alternative.”
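The fallback pattern (item 4 above) can be sketched in a few lines. The exception type and model clients here are placeholders, not a real provider SDK:

```python
# Sketch of rate-limit fallback routing. `RateLimited`, `primary`, and
# `fallback` are stand-ins for whatever your provider clients raise/expose.

class RateLimited(Exception):
    """Raised by a client when the provider returns a 429."""

def route_with_fallback(prompt, primary, fallback):
    """Try the primary model; on a rate limit, retry on the fallback."""
    try:
        return primary(prompt)
    except RateLimited:
        return fallback(prompt)

# Stubs: the primary is rate-limited, the fallback (e.g. M2.5) answers.
def primary(prompt):
    raise RateLimited("429 from primary provider")

def fallback(prompt):
    return f"[m2.5] {prompt}"

print(route_with_fallback("fix the failing test", primary, fallback))
# → [m2.5] fix the failing test
```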

The catch

Nothing is free. A few considerations:

  • M2.5 is newer, which means less real-world testing in production environments
  • The model weights are open source, but optimal deployment requires vLLM or SGLang infrastructure
  • Benchmark performance doesn’t always translate 1:1 to specific use cases

Start with low-volume routing, measure against your own quality bar, then scale up.
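One low-risk way to do that: deterministically route a small slice of traffic by hashing a stable request or user ID, so the same request always hits the same model and quality comparisons stay apples-to-apples. The model names and percentage below are placeholders:

```python
import hashlib

# Deterministic canary split: send a fixed percentage of requests to the
# candidate model, keyed on a stable ID. Model names are placeholders.

def pick_model(request_id, canary_pct=5, candidate="minimax-m2.5",
               incumbent="current-model"):
    """Hash the ID into a 0-99 bucket; low buckets go to the candidate."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return candidate if bucket < canary_pct else incumbent

# The observed split converges on ~canary_pct over many IDs.
share = sum(pick_model(f"req-{i}") == "minimax-m2.5"
            for i in range(1000)) / 1000
print(f"canary share: {share:.1%}")
```

Hash-based bucketing beats random sampling here because routing is reproducible: a request that looked bad in the canary can be replayed and will route the same way.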

How to try it

Haimaker already supports MiniMax M2.5 routing. Set up rules for your coding workloads, add it to your agent routing matrix, or use it as a fallback for high-volume production traffic.

The $10 in free credits cover enough tokens to run a solid benchmark against your current model mix and see where the savings show up.

Start routing at haimaker.ai.