inclusionai/ling-2.6-1t — 262K context, $0.30/1M

Parameters	1.0T
Context Window	262K tokens
Max Output	33K tokens
Input Price	$0.30 / 1M tokens
Output Price	$2.50 / 1M tokens
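As a rough illustration of what the listed prices mean per request, the sketch below multiplies token counts by the two rates above. The function name and the sample token counts are hypothetical, and it assumes billing is strictly linear in tokens with no other fees.

```python
# Listed prices in USD per 1M tokens (taken from the pricing figures above).
INPUT_PRICE_PER_M = 0.30
OUTPUT_PRICE_PER_M = 2.50


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in USD at the listed rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 100K prompt tokens and 4K completion tokens.
    print(f"${estimate_cost_usd(100_000, 4_000):.4f}")
```

Because output tokens cost roughly eight times as much as input tokens here, long completions tend to dominate the bill even when the prompt is large.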

Overview

Ling 2.6 1T is a chat model by inclusionAI with 1025.7B (about 1T) parameters. It supports a 262K-token context window and function calling.

Model Card

🤗 Hugging Face | 🤖 ModelScope | 🐙 OpenRouter

Ling-2.6-1T: A Trillion-Parameter Comprehensive Flagship Model for Complex Tasks

Today, we are thrilled to open-source Ling-2.6-1T from the Ling family.

Tailored for real-world, complex scenarios, this trillion-parameter model introduces targeted optimizations across inference efficiency, token overhead, and agentic capabilities, making it highly effective for coding and daily workflows.

Key upgrades in Ling-2.6-1T include:

High Inference Efficiency: By adopting a hybrid architecture combining MLA and Linear Attention, we dramatically reduce latency and VRAM footprint for long contexts. It delivers superior throughput and lower per-token computational costs without sacrificing expressivity, ensuring real-time responsiveness for complex reasoning and tool calling.

Lower Token Overhead via "Fast Thinking": We introduce a Contextual Process Redundancy Suppression reward strategy during post-training. This reduces reliance on verbose chains-of-thought (CoT), utilizing a "fast thinking" mechanism to reach answers directly and compress output costs while maintaining top-tier intelligence.

Reliable Multi-Step Execution: With enhanced reasoning, agentic coding, and instruction following, Ling-2.6-1T achieves open-source SOTA on execution-heavy benchmarks, including AIME26, SWE-bench Verified, BFCL-V4, TAU2-Bench, and IFBench.

Production-Ready for Agent Workflows: Designed for end-to-end engineering, from code generation to bug fixing, Ling-2.6-1T integrates seamlessly with mainstream agent frameworks like Claude Code, OpenClaw, OpenCode, and CodeBuddy, effortlessly handling multi-tool, multi-step constraints in enterprise environments.

Unlocking Robust Intelligence with Superior Efficiency

On Artificial Analysis, Ling-2.6-1T achieved an Intelligence Index of 34 with approximately 16M output tokens, representing a significant generational leap over the previous Ling-1T. This positioning underscores its ability to deliver high-tier intelligence with optimized token consumption.

Enhancing Execution Stability for Complex Multi-Step Tasks

Ling-2.6-1T demonstrates balanced excellence across reasoning, coding, and tool-calling, achieving open-source SOTA status on multiple execution-heavy benchmarks:

Advanced Reasoning: Significantly leads non-thinking models on AIME26, showcasing superior complex problem-solving capabilities.
First-Tier Agent Execution: Ranks among the top models on SWE-bench Verified, TAU2-Bench, Claw-Eval, BFCL-V4, and PinchBench, proving high reliability in real-world workflows.
Context & Constraints: Strong performance on MRCR (16K–256K) and IFBench ensures logical consistency and precision under complex instructions and long contexts.

Note: If you are interested in the previous version, please visit the past model collections on Hugging Face or ModelScope.

Quickstart

🔌 API Usage

https://openrouter.ai/inclusionai/ling-2.6-1t:free

https://zenmux.ai/inclusionai/ling-2.6-1t

Deployment

SGLang

Environment Preparation

pip install uv
uv venv ~/my_ling_env
source ~/my_ling_env/bin/activate
uv pip install "sglang-kernel>=0.4.1"
uv pip install "sglang[all]>=0.5.10.post1" --prerelease=allow

Run Inference

Here is an example of running Ling-2.6-1T on 8 GPUs, where the server port is ${PORT}:

Server

1. Standard Inference (Without MTP)

sglang serve \
  --model-path inclusionAI/Ling-2.6-1T \
  --tp-size 8 \
  --max-running-requests 32 \
  --mem-fraction-static 0.92 \
  --chunked-prefill-size 8192 \
  --context-length 262144 \
  --trust-remote-code \
  --model-loader-extra-config '{"enable_multithread_load":"true","num_threads":64}' \
  --tool-call-parser qwen25

2. Inference with MTP (Multi-Token Prediction)

The current official SGLang implementation of MTP contains a bug. For better inference performance, we recommend installing our patched version. Our fix is currently under review and is expected to be merged into the official SGLang library shortly.

Install our SGLang

git clone -b ling_2_6 git@github.com:antgroup/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python"

Start server

sglang serve \
  --model-path inclusionAI/Ling-2.6-1T \
  --tp-size 8 \
  --max-running-requests 32 \
  --mem-fraction-static 0.92 \
  --chunked-prefill-size 8192 \
  --context-length 262144 \
  --trust-remote-code \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mamba-scheduler-strategy extra_buffer \
  --mamba-full-memory-ratio 1.4 \
  --model-loader-extra-config '{"enable_multithread_load":"true","num_threads":64}' \
  --tool-call-parser qwen25

Client

curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
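The same request can be issued from Python. The following is a minimal sketch using only the standard library; it assumes the server exposes the same OpenAI-compatible `/v1/chat/completions` route as the curl call above, and the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, prompt: str,
                       model: str = "auto") -> urllib.request.Request:
    """Build a POST request mirroring the curl example above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, prompt: str, model: str = "auto",
         timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message.
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat(f"http://{master_ip}:{port}", "What is the capital of France?")` would return the reply as a string, where `master_ip` and `port` stand in for the `${MASTER_IP}` and `${PORT}` values used in the curl command.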

More usage can be found here

vLLM

Environment Preparation

pip install uv
uv venv ~/my_ling_env
source ~/my_ling_env/bin/activate
git clone https://github.com/vllm-project/vllm.git
cd vllm

VLLM_USE_PRECOMPILED=1 uv pip install --editable . --torch-backend=auto

Run Inference

Server

vllm serve $MODEL_PATH \
    --port $PORT \
    --served-model-name my_model \
    --trust-remote-code --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.85

Client

curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my_model", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'

Limitations & Future Plans

While Ling-2.6-1T excels in reasoning and agentic efficiency, our future development will focus on:

Intelligence-Efficiency Balance: Further optimizing token efficiency for knowledge-intensive tasks.
Long-Range Consistency: Enhancing global consistency in long-term planning and complex information retrieval.
Dynamic Alignment: Refining cross-lingual alignment to eliminate occasional language-switching offsets under complex instructions.

We remain committed to pushing the boundaries of model performance to enhance delivery efficiency across all complex scenarios.

License

This code repository is licensed under the MIT License.

Features & Capabilities

Mode	chat
Context Window	262,144 tokens
Max Output	32,768 tokens
Function Calling	Supported
Vision	-
Reasoning	-
Web Search	-
URL Context	-

Technical Details

Architecture	BailingMoeV2_5ForCausalLM
Model Type	bailing_hybrid
Library	transformers

API Usage

from openai import OpenAI

client = OpenAI(
    base_url="https://api.haimaker.ai/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="inclusionai/ling-2.6-1t",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
)

print(response.choices[0].message.content)
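The feature list marks function calling as supported. A tool-enabled request might look like the sketch below, assuming the endpoint accepts the standard OpenAI `tools` parameter and returns `tool_calls` in the usual shape (this page does not confirm either). The `get_weather` tool and all helper names are hypothetical.

```python
import json

# Hypothetical tool definition in the OpenAI "tools" JSON-schema format.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_request(prompt: str) -> dict:
    """Keyword arguments for client.chat.completions.create()."""
    return {
        "model": "inclusionai/ling-2.6-1t",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [WEATHER_TOOL],
    }


def parse_tool_calls(message) -> list:
    """Return (name, arguments) pairs from an assistant message, if any."""
    calls = getattr(message, "tool_calls", None) or []
    # Arguments arrive as a JSON string and need decoding before use.
    return [(c.function.name, json.loads(c.function.arguments)) for c in calls]


def ask_with_tools(client, prompt: str) -> list:
    """Send the request with an OpenAI SDK client and parse any tool calls."""
    response = client.chat.completions.create(**build_request(prompt))
    return parse_tool_calls(response.choices[0].message)
```

Passing the `client` object from the example above to `ask_with_tools(client, "What's the weather in Paris?")` would return a list of requested calls. An empty list means the model answered directly instead of calling a tool.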

Frequently Asked Questions

What is the context window of Ling 2.6 1T?

Ling 2.6 1T (inclusionai/ling-2.6-1t) has a 262,144-token context window and supports up to 32,768 output tokens per request.
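To make these two limits concrete, the sketch below computes how large a prompt can be while leaving room for the reply. It assumes prompt and output tokens share the same 262,144-token window, which is the common convention but is not stated on this page. The function names are hypothetical.

```python
CONTEXT_WINDOW = 262_144  # total tokens per request, as listed
MAX_OUTPUT = 32_768       # listed cap on generated tokens


def max_prompt_tokens(reserved_output: int = MAX_OUTPUT) -> int:
    """Largest prompt that still leaves `reserved_output` tokens for the reply."""
    if not 0 <= reserved_output <= MAX_OUTPUT:
        raise ValueError(f"reserved_output must be between 0 and {MAX_OUTPUT}")
    return CONTEXT_WINDOW - reserved_output


def fits(prompt_tokens: int, reserved_output: int = MAX_OUTPUT) -> bool:
    """Check whether a prompt of the given size fits alongside the reply budget."""
    return 0 <= prompt_tokens <= max_prompt_tokens(reserved_output)


if __name__ == "__main__":
    print(max_prompt_tokens())  # 262144 - 32768 = 229376
```

Under that assumption, reserving the full output budget leaves 229,376 tokens for the prompt. Reserving less output frees up correspondingly more room for input.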

How much does Ling 2.6 1T cost?

Ling 2.6 1T is priced at $0.30 per 1M input tokens and $2.50 per 1M output tokens when accessed via the haimaker.ai OpenAI-compatible API.

What features does Ling 2.6 1T support?

Ling 2.6 1T supports function calling.

How do I use Ling 2.6 1T via API?

Send requests to https://api.haimaker.ai/v1/chat/completions with model "inclusionai/ling-2.6-1t" using any OpenAI-compatible SDK. Authentication uses a Bearer API key from https://app.haimaker.ai.
