inclusionai/ling-2.6-1t
Ling 2.6 1T (inclusionai/ling-2.6-1t) is a 1025.7B-parameter bailing_hybrid model from inclusionAI with a 262,144-token context window and a 32,768-token maximum output, priced at $0.30 per 1M input tokens and $2.50 per 1M output tokens. It is available via the haimaker.ai OpenAI-compatible API.
Ling 2.6 1T is a chat model from inclusionAI with 1025.7B parameters and a 262K-token context window. It supports function calling.
Today, we are thrilled to open-source Ling-2.6-1T from the Ling family.
Tailored for real-world, complex scenarios, this trillion-parameter model introduces targeted optimizations across inference efficiency, token overhead, and agentic capabilities, making it highly effective for coding and daily workflows.
Key upgrades in Ling-2.6-1T include:
Ling-2.6-1T demonstrates balanced excellence across reasoning, coding, and tool-calling, achieving open-source SOTA status on multiple execution-heavy benchmarks:
Advanced Reasoning: Significantly leads non-thinking models on AIME26*, showcasing superior complex problem-solving capabilities.
First-Tier Agent Execution: Ranks among the top models on SWE-bench Verified, TAU2-Bench, Claw-Eval, BFCL-V4, and PinchBench*, proving high reliability in real-world workflows.
Context & Constraints: Strong performance on MRCR (16K–256K) and IFBench* ensures logical consistency and precision under complex instructions and long contexts.
Note: If you are interested in previous versions, please visit the past model collections on Hugging Face or ModelScope.
https://openrouter.ai/inclusionai/ling-2.6-1t:free
https://zenmux.ai/inclusionai/ling-2.6-1t
pip install uv
uv venv ~/my_ling_env
source ~/my_ling_env/bin/activate
uv pip install "sglang-kernel>=0.4.1"
uv pip install "sglang[all]>=0.5.10.post1" --prerelease=allow
Here is an example of running Ling-2.6-1T on 8 GPUs, where the server port is ${PORT}:
Server
1. Standard Inference (Without MTP)
sglang serve \
--model-path inclusionAI/Ling-2.6-1T \
--tp-size 8 \
--max-running-requests 32 \
--mem-fraction-static 0.92 \
--chunked-prefill-size 8192 \
--context-length 262144 \
--trust-remote-code \
--model-loader-extra-config '{"enable_multithread_load":"true","num_threads":64}' \
--tool-call-parser qwen25
2. Inference with MTP (Multi-Token Prediction)
_The current official SGLang implementation of MTP contains a bug. For better inference performance, we recommend installing our patched version. Our fix is currently under review and is expected to be merged into the official SGLang library shortly._
Install our patched SGLang build:
git clone -b ling_2_6 git@github.com:antgroup/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python"
sglang serve \
--model-path inclusionAI/Ling-2.6-1T \
--tp-size 8 \
--max-running-requests 32 \
--mem-fraction-static 0.92 \
--chunked-prefill-size 8192 \
--context-length 262144 \
--trust-remote-code \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mamba-scheduler-strategy extra_buffer \
--mamba-full-memory-ratio 1.4 \
--model-loader-extra-config '{"enable_multithread_load":"true","num_threads":64}' \
--tool-call-parser qwen25
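The MTP flags above set up a single draft chain (`--speculative-eagle-topk 1`) with 3 draft steps. As a rough way to reason about what such a configuration can gain, the sketch below computes the expected number of tokens emitted per verification pass. It relies on a common simplifying assumption: each drafted token is accepted independently with probability `accept_rate`. The function name and the acceptance rates are hypothetical illustrations, not measurements for Ling-2.6-1T.

```python
def expected_tokens_per_step(num_draft_steps: int, accept_rate: float) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumes a single draft chain (top-k = 1) and that each drafted token is
    accepted independently with probability `accept_rate`. The target model
    always contributes one token, so the result is
    1 + a + a^2 + ... + a^num_draft_steps.
    """
    if num_draft_steps < 0:
        raise ValueError("num_draft_steps must be non-negative")
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    return sum(accept_rate ** i for i in range(num_draft_steps + 1))


# Hypothetical acceptance rates, for illustration only.
for rate in (0.5, 0.7, 0.9):
    print(f"accept_rate={rate}: {expected_tokens_per_step(3, rate):.3f} tokens per pass")
```

Under this model, 3 draft steps can never yield more than 4 tokens per pass, which matches `--speculative-num-draft-tokens 4`. Real acceptance rates depend on the workload.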
Client
curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
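The same request can be issued from Python. This is a minimal sketch using only the standard library, assuming the server exposes the OpenAI-style `/v1/chat/completions` route used in the curl command. The helper names `build_chat_request` and `send_chat_request` are hypothetical, and the base URL passed in should be built from your own `MASTER_IP` and `PORT` values.

```python
import json
import urllib.request


def build_chat_request(base_url: str, prompt: str, model: str = "auto") -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request, timeout: float = 600.0) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running, `send_chat_request(build_chat_request("http://<MASTER_IP>:<PORT>", "What is the capital of France?"))` should return the model's reply as a string.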
More usage can be found here
pip install uv
uv venv ~/my_ling_env
source ~/my_ling_env/bin/activate
git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 uv pip install --editable . --torch-backend=auto
vllm serve $MODEL_PATH \
--port $PORT \
--served-model-name my_model \
--trust-remote-code --tensor-parallel-size 8 \
--gpu-memory-utilization 0.85
Client
curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "my_model", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
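OpenAI-compatible servers typically list the model names they accept at `GET /v1/models`. A minimal sketch, assuming that route is enabled on the server started above, for looking up the value to put in the request's `model` field (with `--served-model-name my_model`, the list should contain `my_model`). Both helper names are hypothetical.

```python
import json
import urllib.request


def served_model_ids(models_response: dict) -> list[str]:
    """Extract model ids from an OpenAI-style `GET /v1/models` response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def fetch_served_model_ids(base_url: str, timeout: float = 30.0) -> list[str]:
    """Ask a running server which model names it accepts."""
    url = f"{base_url.rstrip('/')}/v1/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return served_model_ids(json.load(resp))
```

A request whose `model` value is not in this list may be rejected by the server, so checking it first can save a confusing error.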
While Ling-2.6-1T excels in reasoning and agentic efficiency, our future development will focus on:
This code repository is licensed under the MIT License.
| Property | Value |
| --- | --- |
| Mode | chat |
| Context Window | 262,144 tokens |
| Max Output | 32,768 tokens |
| Function Calling | Supported |
| Vision | - |
| Reasoning | - |
| Web Search | - |
| URL Context | - |
| Architecture | BailingMoeV2_5ForCausalLM |
| Model Type | bailing_hybrid |
| Library | transformers |
from openai import OpenAI

client = OpenAI(
    base_url="https://api.haimaker.ai/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="inclusionai/ling-2.6-1t",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
)
print(response.choices[0].message.content)

Ling 2.6 1T (inclusionai/ling-2.6-1t) has a 262,144-token context window and supports up to 32,768 output tokens per request.
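The two limits interact when prompts get long. The sketch below clamps a requested `max_tokens` value to the published limits. It assumes that prompt and completion share the 262,144-token window, which this page does not state explicitly, and it expects the caller to supply the prompt's token count. The helper name `max_tokens_for` is hypothetical.

```python
CONTEXT_WINDOW = 262_144  # tokens, from the model listing
MAX_OUTPUT = 32_768       # tokens, from the model listing


def max_tokens_for(prompt_tokens: int, requested: int = MAX_OUTPUT) -> int:
    """Clamp a requested `max_tokens` so the request fits the published limits.

    Assumption: prompt and completion share the context window, so the
    completion can use at most CONTEXT_WINDOW - prompt_tokens tokens.
    """
    if prompt_tokens < 0 or requested < 0:
        raise ValueError("token counts must be non-negative")
    remaining = CONTEXT_WINDOW - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt does not fit in the context window")
    return min(requested, MAX_OUTPUT, remaining)
```

For example, a 250,000-token prompt leaves room for at most 12,144 output tokens under this assumption, while a 1,000-token prompt is limited only by the 32,768-token output cap.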
Ling 2.6 1T is priced at $0.30 per 1M input tokens and $2.50 per 1M output tokens when accessed via the haimaker.ai OpenAI-compatible API.
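A quick way to turn these rates into a per-request estimate is sketched below. The function name is hypothetical, and the calculation covers only the listed input and output token prices; any other charges are outside its scope.

```python
INPUT_USD_PER_MILLION = 0.30   # listed price per 1M input tokens
OUTPUT_USD_PER_MILLION = 2.50  # listed price per 1M output tokens


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    total = input_tokens * INPUT_USD_PER_MILLION + output_tokens * OUTPUT_USD_PER_MILLION
    return total / 1_000_000


# A 200,000-token prompt with a 4,000-token reply.
print(f"${estimate_cost_usd(200_000, 4_000):.4f}")
```

That example works out to about $0.07: $0.06 for the input and $0.01 for the output.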
Ling 2.6 1T supports function calling.
Send requests to https://api.haimaker.ai/v1/chat/completions with model "inclusionai/ling-2.6-1t" using any OpenAI-compatible SDK. Authentication uses a Bearer API key from https://app.haimaker.ai.
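A function-calling request against this endpoint might look like the sketch below. It assumes the endpoint accepts the OpenAI-style `tools` field and returns `tool_calls` in the usual chat-completions shape. The `get_weather` tool and both helper names are hypothetical and serve only to show the structure. The request is built but not sent.

```python
import json
import urllib.request

HAIMAKER_CHAT_URL = "https://api.haimaker.ai/v1/chat/completions"

# Hypothetical tool definition, used only to illustrate the schema shape.
GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_call_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build a chat request that offers one tool and authenticates with a Bearer key."""
    payload = {
        "model": "inclusionai/ling-2.6-1t",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [GET_WEATHER_TOOL],
    }
    return urllib.request.Request(
        url=HAIMAKER_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def extract_tool_calls(response_body: dict) -> list[tuple[str, dict]]:
    """Return (function name, parsed arguments) pairs from a chat completion body."""
    message = response_body["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [(c["function"]["name"], json.loads(c["function"]["arguments"])) for c in calls]
```

If the model decides to call the tool, `extract_tool_calls` yields the function name and its decoded JSON arguments. A plain text reply yields an empty list.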