qwen/qwen3-30b-a3b-instruct-2507

Qwen3 30B A3B Instruct 2507 (qwen/qwen3-30b-a3b-instruct-2507) is a qwen3_moe model from Qwen with a 262,144-token context window and 262,144 max output tokens, priced at $0.09/1M input and $0.30/1M output tokens. Available via the haimaker.ai OpenAI-compatible API.
Qwen3 30B A3B Instruct 2507 is a chat model by Qwen. It supports a 262K-token context window and function calling.
We introduce the updated version of the Qwen3-30B-A3B non-thinking mode, named Qwen3-30B-A3B-Instruct-2507, featuring the following key enhancements:
This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
| | Deepseek-V3-0324 | GPT-4o-0327 | Gemini-2.5-Flash Non-Thinking | Qwen3-235B-A22B Non-Thinking | Qwen3-30B-A3B Non-Thinking | Qwen3-30B-A3B-Instruct-2507 |
|--- | --- | --- | --- | --- | --- | --- |
| Knowledge | | | | | | |
| MMLU-Pro | 81.2 | 79.8 | 81.1 | 75.2 | 69.1 | 78.4 |
| MMLU-Redux | 90.4 | 91.3 | 90.6 | 89.2 | 84.1 | 89.3 |
| GPQA | 68.4 | 66.9 | 78.3 | 62.9 | 54.8 | 70.4 |
| SuperGPQA | 57.3 | 51.0 | 54.6 | 48.2 | 42.2 | 53.4 |
| Reasoning | | | | | | |
| AIME25 | 46.6 | 26.7 | 61.6 | 24.7 | 21.6 | 61.3 |
| HMMT25 | 27.5 | 7.9 | 45.8 | 10.0 | 12.0 | 43.0 |
| ZebraLogic | 83.4 | 52.6 | 57.9 | 37.7 | 33.2 | 90.0 |
| LiveBench 20241125 | 66.9 | 63.7 | 69.1 | 62.5 | 59.4 | 69.0 |
| Coding | | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 45.2 | 35.8 | 40.1 | 32.9 | 29.0 | 43.2 |
| MultiPL-E | 82.2 | 82.7 | 77.7 | 79.3 | 74.6 | 83.8 |
| Aider-Polyglot | 55.1 | 45.3 | 44.0 | 59.6 | 24.4 | 35.6 |
| Alignment | | | | | | |
| IFEval | 82.3 | 83.9 | 84.3 | 83.2 | 83.7 | 84.7 |
| Arena-Hard v2* | 45.6 | 61.9 | 58.3 | 52.0 | 24.8 | 69.0 |
| Creative Writing v3 | 81.6 | 84.9 | 84.6 | 80.4 | 68.1 | 86.0 |
| WritingBench | 74.5 | 75.5 | 80.5 | 77.0 | 72.2 | 85.5 |
| Agent | | | | | | |
| BFCL-v3 | 64.7 | 66.5 | 66.1 | 68.0 | 58.6 | 65.1 |
| TAU1-Retail | 49.6 | 60.3# | 65.2 | 65.2 | 38.3 | 59.1 |
| TAU1-Airline | 32.0 | 42.8# | 48.0 | 32.0 | 18.0 | 40.0 |
| TAU2-Retail | 71.1 | 66.7# | 64.3 | 64.9 | 31.6 | 57.0 |
| TAU2-Airline | 36.0 | 42.0# | 42.5 | 36.0 | 18.0 | 38.0 |
| TAU2-Telecom | 34.0 | 29.8# | 16.9 | 24.6 | 18.4 | 12.3 |
| Multilingualism | | | | | | |
| MultiIF | 66.5 | 70.4 | 69.4 | 70.2 | 70.8 | 67.9 |
| MMLU-ProX | 75.8 | 76.2 | 78.3 | 73.2 | 65.1 | 72.0 |
| INCLUDE | 80.1 | 82.1 | 83.8 | 75.6 | 67.8 | 71.9 |
| PolyMATH | 32.2 | 25.5 | 41.9 | 27.0 | 23.3 | 43.1 |
*: For reproducibility, we report the win rates evaluated by GPT-4.1.
\#: Results were generated using GPT-4o-20241120, as access to the native function calling API of GPT-4o-0327 was unavailable.
The code for Qwen3-MoE is included in the latest Hugging Face transformers, and we advise you to use the latest version of transformers.
With transformers<4.51.0, you will encounter the following error: `KeyError: 'qwen3_moe'`
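A quick pre-flight check can catch this before the model load fails. The sketch below compares a version string against the 4.51.0 minimum quoted above. The helper names `parse_release` and `meets_minimum` are hypothetical, and the parser only looks at the leading numeric components, so suffixes such as `.dev0` are ignored.

```python
import re

MIN_TRANSFORMERS = "4.51.0"  # minimum version quoted in the text above


def parse_release(version: str, width: int = 3) -> tuple:
    """Return the leading numeric components of a version string, zero-padded to `width`."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    parts = [int(part) for part in match.group(0).split(".")]
    parts += [0] * (width - len(parts))  # "4.51" is treated as "4.51.0"
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """True when `installed` is at least `minimum` (numeric release parts only)."""
    return parse_release(installed) >= parse_release(minimum)


if __name__ == "__main__":
    from importlib.metadata import PackageNotFoundError, version

    try:
        installed = version("transformers")
    except PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if meets_minimum(installed) else "too old for qwen3_moe"
        print(f"transformers {installed}: {status}")
```

Running the file prints the installed version and whether it clears the minimum; the two helpers can also be imported into a setup script.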
The following code snippet illustrates how to use the model to generate content based on given inputs.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B-Instruct-2507"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
# keep only the newly generated tokens (drop the prompt tokens)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)
```
For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.5` to create an OpenAI-compatible API endpoint:

SGLang:

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Instruct-2507 --context-length 262144
```

vLLM:

```shell
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 --max-model-len 262144
```
For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.
Qwen3 excels in tool calling capabilities. We recommend using Qwen-Agent to make the best use of the agentic abilities of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity.
To define the available tools, you can use the MCP configuration file, use the integrated tool of Qwen-Agent, or integrate other tools by yourself.
```python
from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-30B-A3B-Instruct-2507',

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            'time': {
                'command': 'uvx',
                'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
            },
            "fetch": {
                "command": "uvx",
                "args": ["mcp-server-fetch"]
            }
        }
    },
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```
To support ultra-long context processing (up to 1 million tokens), we integrate two key techniques:
For full technical details, see the Qwen2.5-1M Technical Report.
NOTE: To effectively process a 1 million token context, users will require approximately 240 GB of total GPU memory. This accounts for model weights, KV-cache storage, and peak activation memory demands.
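As a rough sizing aid, the figure above can be turned into a per-GPU estimate. This is back-of-the-envelope arithmetic, not a measurement: it assumes the roughly 240 GB is spread evenly across the tensor-parallel ranks and has to fit inside the fraction of each card that the server is allowed to use. The defaults of 4 GPUs and a 0.85 utilization fraction mirror the vLLM launch command in this document; the function name is hypothetical.

```python
def min_gpu_capacity_gb(total_gb: float = 240.0,
                        num_gpus: int = 4,
                        usable_fraction: float = 0.85) -> float:
    """Smallest per-GPU capacity (in GB) that fits an even share of `total_gb`.

    Assumes an even split across GPUs and that the whole share must fit
    within `usable_fraction` of each card.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    per_gpu_share = total_gb / num_gpus
    return per_gpu_share / usable_fraction


if __name__ == "__main__":
    for gpus in (4, 8):
        print(f"{gpus} GPUs: about {min_gpu_capacity_gb(num_gpus=gpus):.1f} GB each")
```

Under these assumptions, four GPUs need about 70.6 GB each, which is why 80 GB-class cards are a natural fit for the 4-way configuration.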
Download the model and replace the content of your config.json with config_1m.json, which includes the config for length extrapolation and sparse attention.
```shell
export MODELNAME=Qwen3-30B-A3B-Instruct-2507
huggingface-cli download Qwen/${MODELNAME} --local-dir ${MODELNAME}
mv ${MODELNAME}/config.json ${MODELNAME}/config.json.bak
mv ${MODELNAME}/config_1m.json ${MODELNAME}/config.json
```
After updating the config, proceed with either vLLM or SGLang for serving the model.
To run Qwen with 1M context support using vLLM, first install it:

```shell
pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
```
Then launch the server with Dual Chunk Flash Attention enabled:
```shell
VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
vllm serve ./Qwen3-30B-A3B-Instruct-2507 \
  --tensor-parallel-size 4 \
  --max-model-len 1010000 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 131072 \
  --enforce-eager \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.85
```
##### Key Parameters
| Parameter | Purpose |
|--------|--------|
| `VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN` | Enables the custom attention kernel for long-context efficiency |
| `--max-model-len 1010000` | Sets maximum context length to ~1M tokens |
| `--enable-chunked-prefill` | Allows chunked prefill for very long inputs (avoids OOM) |
| `--max-num-batched-tokens 131072` | Controls batch size during prefill; balances throughput and memory |
| `--enforce-eager` | Disables CUDA graph capture (required for dual chunk attention) |
| `--max-num-seqs 1` | Limits concurrent sequences due to extreme memory usage |
| `--gpu-memory-utilization 0.85` | Sets the fraction of GPU memory to be used for the model executor |
For SGLang, first clone and install the specialized branch:

```shell
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```
Launch the server with DCA support:
```shell
python3 -m sglang.launch_server \
    --model-path ./Qwen3-30B-A3B-Instruct-2507 \
    --context-length 1010000 \
    --mem-frac 0.75 \
    --attention-backend dual_chunk_flash_attn \
    --tp 4 \
    --chunked-prefill-size 131072
```
##### Key Parameters
| Parameter | Purpose |
|---------|--------|
| `--attention-backend dual_chunk_flash_attn` | Activates Dual Chunk Flash Attention |
| `--context-length 1010000` | Defines max input length |
| `--mem-frac 0.75` | The fraction of the memory used for static allocation (model weights and KV cache memory pool). Use a smaller value if you see out-of-memory errors. |
| `--tp 4` | Tensor parallelism size (matches model sharding) |
| `--chunked-prefill-size 131072` | Prefill chunk size for handling long inputs without OOM |
1. The VRAM reserved for the KV cache is insufficient.
   - vLLM: Consider reducing `max_model_len` or increasing `tensor_parallel_size` and `gpu_memory_utilization`. Alternatively, you can reduce `max_num_batched_tokens`, although this may significantly slow down inference.
   - SGLang: Consider reducing `context-length` or increasing `tp` and `mem-frac`. Alternatively, you can reduce `chunked-prefill-size`, although this may significantly slow down inference.
2. The VRAM reserved for activation weights is insufficient. You can try lowering `gpu_memory_utilization` or `mem-frac`, but be aware that this might reduce the VRAM available for the KV cache.
3. The input is too lengthy. Consider using a shorter sequence or increasing `max_model_len` or `context-length`.
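The "input is too lengthy" case can be checked on the client side before a request is sent. The sketch below assumes the serving limit (`max_model_len` in vLLM, `context-length` in SGLang) covers the prompt and the generated tokens together, and that the prompt's token count has already been obtained from the tokenizer. The helper names are hypothetical.

```python
DEFAULT_CONTEXT_LIMIT = 262_144  # native context window stated in this document


def fits_context(prompt_tokens: int,
                 max_new_tokens: int,
                 context_limit: int = DEFAULT_CONTEXT_LIMIT) -> bool:
    """True when the prompt plus the requested output fits within the limit."""
    if prompt_tokens < 0 or max_new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + max_new_tokens <= context_limit


def max_new_tokens_allowed(prompt_tokens: int,
                           context_limit: int = DEFAULT_CONTEXT_LIMIT) -> int:
    """Largest output budget that still fits after the prompt (never negative)."""
    return max(0, context_limit - prompt_tokens)


if __name__ == "__main__":
    prompt_len = 250_000
    if not fits_context(prompt_len, 16_384):
        budget = max_new_tokens_allowed(prompt_len)
        print(f"Prompt leaves room for only {budget} new tokens")
```

For a server launched with the 1M configuration, pass `context_limit=1_010_000` to match the launch commands in this document.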
We test the model on a 1M version of the RULER benchmark.
| Model Name | Acc avg | 4k | 8k | 16k | 32k | 64k | 96k | 128k | 192k | 256k | 384k | 512k | 640k | 768k | 896k | 1000k |
|---------------------------------------------|---------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|-------|
| Qwen3-30B-A3B (Non-Thinking) | 72.0 | 97.1 | 96.1 | 95.0 | 92.2 | 82.6 | 79.7 | 76.9 | 70.2 | 66.3 | 61.9 | 55.4 | 52.6 | 51.5 | 52.0 | 50.9 |
| Qwen3-30B-A3B-Instruct-2507 (Full Attention) | 86.8 | 98.0 | 96.7 | 96.9 | 97.2 | 93.4 | 91.0 | 89.1 | 89.8 | 82.5 | 83.6 | 78.4 | 79.7 | 77.6 | 75.7 | 72.8 |
| Qwen3-30B-A3B-Instruct-2507 (Sparse Attention) | 86.8 | 98.0 | 97.1 | 96.3 | 95.1 | 93.6 | 92.5 | 88.1 | 87.7 | 82.9 | 85.7 | 80.7 | 80.0 | 76.9 | 75.5 | 72.2 |
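The `Acc avg` column is consistent with an unweighted mean over the 15 context lengths, which can be checked directly from the rows above. The snippet only reuses numbers from the table; the variable names are illustrative.

```python
from statistics import fmean

# Per-length accuracies (4k ... 1000k), copied from the RULER table above.
ruler_scores = {
    "Qwen3-30B-A3B (Non-Thinking)": [
        97.1, 96.1, 95.0, 92.2, 82.6, 79.7, 76.9, 70.2,
        66.3, 61.9, 55.4, 52.6, 51.5, 52.0, 50.9,
    ],
    "Qwen3-30B-A3B-Instruct-2507 (Full Attention)": [
        98.0, 96.7, 96.9, 97.2, 93.4, 91.0, 89.1, 89.8,
        82.5, 83.6, 78.4, 79.7, 77.6, 75.7, 72.8,
    ],
    "Qwen3-30B-A3B-Instruct-2507 (Sparse Attention)": [
        98.0, 97.1, 96.3, 95.1, 93.6, 92.5, 88.1, 87.7,
        82.9, 85.7, 80.7, 80.0, 76.9, 75.5, 72.2,
    ],
}

# "Acc avg" values as reported in the table.
reported_avg = {
    "Qwen3-30B-A3B (Non-Thinking)": 72.0,
    "Qwen3-30B-A3B-Instruct-2507 (Full Attention)": 86.8,
    "Qwen3-30B-A3B-Instruct-2507 (Sparse Attention)": 86.8,
}

# Mean of each row, rounded to one decimal place like the table.
computed_avg = {name: round(fmean(scores), 1) for name, scores in ruler_scores.items()}

if __name__ == "__main__":
    for name, value in computed_avg.items():
        print(f"{name}: computed {value}, reported {reported_avg[name]}")
```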
To achieve optimal performance, we recommend the following settings:
- …, TopP=0.8, TopK=20, and MinP=0.
- … parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
- … field with only the choice letter, e.g., `"answer": "C"`.

If you find our work helpful, feel free to give us a cite.
```
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```

| Property | Value |
| --- | --- |
| Mode | chat |
| Context Window | 262,144 tokens |
| Max Output | 262,144 tokens |
| Function Calling | Supported |
| Vision | - |
| Reasoning | - |
| Web Search | - |
| URL Context | - |
| Architecture | Qwen3MoeForCausalLM |
| Model Type | qwen3_moe |
| Library | transformers |
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.haimaker.ai/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="qwen/qwen3-30b-a3b-instruct-2507",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
)
print(response.choices[0].message.content)
```

Qwen3 30B A3B Instruct 2507 (qwen/qwen3-30b-a3b-instruct-2507) has a 262,144-token context window and supports up to 262,144 output tokens per request.
Qwen3 30B A3B Instruct 2507 is priced at $0.09 per 1M input tokens and $0.30 per 1M output tokens when accessed via the haimaker.ai OpenAI-compatible API.
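To make the pricing concrete, the sketch below estimates the cost of a single request from its token counts, using only the two rates quoted above. The function name is hypothetical, and `Decimal` is used so that the small per-token amounts do not pick up floating-point noise.

```python
from decimal import Decimal

INPUT_USD_PER_MILLION = Decimal("0.09")   # rate quoted above
OUTPUT_USD_PER_MILLION = Decimal("0.30")  # rate quoted above
ONE_MILLION = Decimal(1_000_000)


def request_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimated cost in USD for one request at the listed per-token rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (Decimal(input_tokens) * INPUT_USD_PER_MILLION
             + Decimal(output_tokens) * OUTPUT_USD_PER_MILLION)
    return total / ONE_MILLION


if __name__ == "__main__":
    # Example: a 200,000-token prompt with a 4,000-token reply.
    print(f"${request_cost_usd(200_000, 4_000):.4f}")
```

The example works out to 200,000 × $0.09/1M + 4,000 × $0.30/1M = $0.0180 + $0.0012 = $0.0192.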
Qwen3 30B A3B Instruct 2507 supports function calling.
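With an OpenAI-compatible API, function calling is typically driven by a `tools` list in the request and a `tool_calls` entry in the reply. The sketch below shows only the local half: a tool schema in the OpenAI chat-completions format and a dispatcher that executes one returned call. The `count_words` tool, the `TOOL_REGISTRY` mapping, and the sample call are hypothetical, and no request is sent.

```python
import json


def count_words(text: str) -> int:
    """Hypothetical tool: count whitespace-separated words."""
    return len(text.split())


# Maps tool names (as advertised to the model) to local Python callables.
TOOL_REGISTRY = {"count_words": count_words}

# Tool schema to pass as `tools=` in a chat-completions request.
tools = [
    {
        "type": "function",
        "function": {
            "name": "count_words",
            "description": "Count the words in a piece of text.",
            "parameters": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
        },
    }
]


def dispatch_tool_call(tool_call: dict) -> dict:
    """Run one tool call and build the `tool` message to send back to the model."""
    function = tool_call["function"]
    handler = TOOL_REGISTRY[function["name"]]
    arguments = json.loads(function["arguments"])  # arguments arrive as a JSON string
    result = handler(**arguments)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }


# A made-up tool call, shaped like an entry of `message.tool_calls`.
sample_call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "count_words", "arguments": '{"text": "hello from Qwen3"}'},
}

if __name__ == "__main__":
    print(dispatch_tool_call(sample_call))
```

In a full loop, the returned message would be appended to `messages` after the assistant turn that requested the tool, and the conversation sent back to the model for its final answer.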
Send requests to https://api.haimaker.ai/v1/chat/completions with model "qwen/qwen3-30b-a3b-instruct-2507" using any OpenAI-compatible SDK. Authentication uses a Bearer API key from https://app.haimaker.ai.
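For environments without an SDK, the same request can be assembled with the Python standard library. The sketch below only builds the request object so the Bearer header and JSON body are visible; sending it is left commented out. The helper name `build_chat_request` is hypothetical. `top_p=0.8` follows the TopP value recommended in this document, and whether the hosted endpoint honours it is an assumption.

```python
import json
import urllib.request

API_URL = "https://api.haimaker.ai/v1/chat/completions"
MODEL_ID = "qwen/qwen3-30b-a3b-instruct-2507"


def build_chat_request(api_key: str, user_message: str,
                       top_p: float = 0.8) -> urllib.request.Request:
    """Build (but do not send) a chat-completions POST request."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_message}],
        "top_p": top_p,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # Bearer API key authentication
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request("YOUR_API_KEY", "Hello, how are you?")

# To actually send the request (needs a real key and network access):
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
#     print(body["choices"][0]["message"]["content"])
```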