
cogito 671b v2.1

deepcogito/cogito-v2.1-671b
Chat · MIT · Deepcogito · Reasoning · Released Oct 2025 · Updated Nov 2025

cogito 671b v2.1 (deepcogito/cogito-v2.1-671b) is a deepseek_v3 671.0B-parameter model from Deepcogito with a 128,000-token context window and 128,000 max output tokens, priced at $1.25/1M input and $1.25/1M output tokens. Available via the haimaker.ai OpenAI-compatible API.

Parameters: 671.0B
Context Window: 128K tokens
Max Output: 128K tokens
Input Price: $1.25 / 1M tokens
Output Price: $1.25 / 1M tokens

Overview

Cogito V2.1 671B is a chat model from Deepcogito with 671.0B parameters and a 128K-token context window. It supports reasoning.

Model Card


Cogito v2.1 - 671B MoE

Blog Post, GitHub

The Cogito v2.1 LLMs are instruction tuned generative models. All models are released under an open license for commercial use.

  • Cogito v2.1 models are hybrid reasoning models. Each model can answer directly (standard LLM), or self-reflect before answering (like reasoning models).
  • The LLMs are trained using Iterated Distillation and Amplification (IDA), a scalable and efficient alignment strategy for superintelligence using iterative self-improvement.
  • The models have been optimized for coding, STEM, instruction following, general helpfulness and tool calling capabilities.
  • This model is trained in over 30 languages and supports a context length of 128k.

Evaluations

Here is the model performance on some standard industry benchmarks:

[Figures: v2-1-benchmark-1, v2-1-benchmark-2, v2-1-benchmark-3]

For detailed evaluations, please refer to the Blog Post.

Usage

This checkpoint is a 671B parameter Mixture of Experts model in BF16 format, consuming approximately 1.3 TB for parameters. You will need at least 8 B200s (1 node) or 16 H200s (2 nodes) to run this model. For serving on 8 H200s, use the quantized version: deepcogito/cogito-671b-v2.1-FP8.
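The sizing above follows from simple arithmetic: BF16 stores each parameter in 2 bytes and FP8 in 1. Here is a rough sketch that counts weights only (no KV cache, activations, or runtime overhead). The per-GPU memory figures (192 GB for a B200, 141 GB for an H200) are assumptions, not taken from the model card, so check them against your hardware.

```python
# Rough weight-memory estimate; ignores KV cache, activations, and runtime overhead.
PARAMS = 671e9  # parameter count from the model card

BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}

# Assumed per-GPU memory in GB (not from the model card; verify for your hardware).
GPU_MEMORY_GB = {"B200": 192, "H200": 141}


def weight_size_gb(dtype: str) -> float:
    """Size of the weights alone, in decimal gigabytes."""
    return PARAMS * BYTES_PER_PARAM[dtype] / 1e9


def weights_fit(dtype: str, gpu: str, count: int) -> bool:
    """True if the weights alone fit in the combined memory of `count` GPUs."""
    return weight_size_gb(dtype) <= GPU_MEMORY_GB[gpu] * count


for dtype, gpu, count in [("bf16", "B200", 8), ("bf16", "H200", 16),
                          ("bf16", "H200", 8), ("fp8", "H200", 8)]:
    print(dtype, gpu, count, weights_fit(dtype, gpu, count))
```

Under these assumptions the BF16 weights (about 1,342 GB) do not fit on 8 H200s, which is consistent with the recommendation to use the FP8 checkpoint there.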

To download and cache the model:

pip install transformers hf_transfer accelerate vllm
hf download deepcogito/cogito-671b-v2.1

With HuggingFace pipeline

import torch
from transformers import pipeline

model_id = "deepcogito/cogito-671b-v2.1"
pipe = pipeline("text-generation", model=model_id, model_kwargs={"dtype": "auto"}, device_map="auto")

messages = [
    {"role": "system", "content": "Always respond in 1-2 words."},
    {"role": "user", "content": "Who created you?"},
]

# without reasoning
outputs = pipe(messages, max_new_tokens=512, tokenizer_encode_kwargs={"enable_thinking": False})
print(outputs[0]["generated_text"][-1])
# {'role': 'assistant', 'content': 'Deep Cogito'}

# with reasoning
outputs = pipe(messages, max_new_tokens=512, tokenizer_encode_kwargs={"enable_thinking": True})
print(outputs[0]["generated_text"][-1])
# {'role': 'assistant', 'content': 'The question is asking about my creator. I know that I\'m Cogito, an AI assistant created by Deep Cogito, which is an AI research lab. The question is very direct and can be answered very briefly. Since the user has specified to always respond in 1-2 words, I should keep my answer extremely concise.\n\nThe most accurate 2-word answer would be "Deep Cogito" - this names the organization that created me without any unnecessary details. "Deep Cogito" is two words, so it fits the requirement perfectly.\n</think>\nDeep Cogito'}
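As the output above shows, with thinking enabled the generated text holds the reasoning trace, then a closing `</think>` tag, then the final answer. Below is a minimal sketch of separating the two, assuming that layout holds for every response. `split_reasoning` is a hypothetical helper, not part of transformers.

```python
def split_reasoning(text: str, end_tag: str = "</think>") -> tuple[str, str]:
    """Split generated text into (reasoning, answer).

    If no closing tag is present (thinking disabled), the reasoning is empty.
    """
    reasoning, sep, answer = text.rpartition(end_tag)
    if not sep:
        return "", text.strip()
    return reasoning.strip(), answer.strip()


sample = "The user wants 1-2 words.\n</think>\nDeep Cogito"
reasoning, answer = split_reasoning(sample)
print(answer)
```

Splitting on the last occurrence of the tag keeps the helper safe if the reasoning itself happens to mention the tag.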

With HuggingFace AutoModel

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepcogito/cogito-671b-v2.1"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "system", "content": "Always respond in 1-2 words."},
    {"role": "user", "content": "Who created you?"}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

# To enable reasoning, set enable_thinking=True above.

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Tool Calling with HuggingFace

Cogito models support tool calling (single, parallel, multiple and parallel_multiple) in both standard and extended thinking modes.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepcogito/cogito-671b-v2.1"

model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def get_current_temperature(location: str) -> float:
    """
    Get the current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, Country"
    Returns:
        The current temperature at the specified location in the specified units, as a float.
    """
    return 22.

def generate(messages):
    global tokenizer, model
    prompt = tokenizer.apply_chat_template(
        messages,
        tools=[get_current_temperature],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )
    # To enable reasoning, set enable_thinking=True above.

    model_inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

    generated_ids = model.generate(**model_inputs, max_new_tokens=512)
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response

messages = [{"role": "user", "content": "whats the temperature in Paris?"}]
response = generate(messages)

This will result in the output -

'<|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>get_current_temperature\n```json\n{"location":"Paris, France"}\n```<|tool▁call▁end|><|tool▁calls▁end|><|end▁of▁sentence|>'

You can then generate text from this input as normal. If the model generates a tool call, you should add it to the chat like so:
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})

Then call the tool and append the result with the tool role. After that, you can call generate() again to let the model use the tool result in the chat:
messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"})
response = generate(messages)

This should result in the string -
The current temperature in Paris is 22.0 degrees.<|end▁of▁sentence|>
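The two manual steps above (recording the assistant's tool call, then appending the tool result) can be wrapped in a small dispatcher. This is a sketch under assumptions: `TOOLS` and `run_tool_call` are hypothetical names, and the tool call is assumed to be already parsed into the `{"name": ..., "arguments": ...}` dict shown above.

```python
def get_current_temperature(location: str) -> float:
    """Stub tool, as in the example above."""
    return 22.0


# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"get_current_temperature": get_current_temperature}


def run_tool_call(messages: list, tool_call: dict) -> list:
    """Append the assistant tool call and the tool's result to the chat."""
    messages.append({
        "role": "assistant",
        "tool_calls": [{"type": "function", "function": tool_call}],
    })
    # Look up the tool by name and call it with the model-provided arguments.
    result = TOOLS[tool_call["name"]](**tool_call["arguments"])
    messages.append({
        "role": "tool",
        "name": tool_call["name"],
        "content": str(result),
    })
    return messages


messages = [{"role": "user", "content": "whats the temperature in Paris?"}]
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
run_tool_call(messages, tool_call)
print(messages[-1])
```

After this, the updated `messages` list can be passed to `generate()` again, as in the steps above.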

With vLLM

from transformers import AutoTokenizer
from vllm import SamplingParams, LLM

model_id = "deepcogito/cogito-671b-v2.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, tensor_parallel_size=8, gpu_memory_utilization=0.95, max_model_len=16384)
sampling_params = SamplingParams(temperature=0.6, max_tokens=8192)

prompts = ["who created you?", "how are you doing?"]

prompts = [
    tokenizer.apply_chat_template(
        [{"role": "system", "content": "Always respond in 1-2 words."}, {"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )
    for prompt in prompts
]

# To enable reasoning, set enable_thinking=True above.

out = llm.generate(prompts, sampling_params=sampling_params)
print([res.outputs[0].text for res in out])


Tool Calling with vLLM

from vllm import LLM, SamplingParams

def get_current_temperature(location: str) -> float:
    """
    Get the current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, Country"
    Returns:
        The current temperature at the specified location in the specified units, as a float.
    """
    return 22.  # A real function should probably actually get the temperature!

model_id = "deepcogito/cogito-671b-v2.1"

llm = LLM(model=model_id, gpu_memory_utilization=0.9, tensor_parallel_size=8, max_model_len=16384)
sampling_params = SamplingParams(temperature=0.6, max_tokens=512)

tokenizer = llm.get_tokenizer()

def generate_output(messages):
    global tokenizer, llm, sampling_params
    prompt = tokenizer.apply_chat_template(
        messages,
        tools=[get_current_temperature],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )
    response = llm.generate(prompt, sampling_params)
    return response[0].outputs[0].text

messages = [{"role": "user", "content": "whats the temperature today?"}]
response = generate_output(messages)
print(response)

# 'I\'d be happy to check the temperature for you. Could you please let me know which location you\'re interested in? Please provide the city and country (e.g., "New York, USA").'

messages.append({"role": "assistant", "content": 'I\'d be happy to check the temperature for you. Could you please let me know which location you\'re interested in? Please provide the city and country (e.g., "New York, USA").'})
messages.append({"role": "user", "content": "I live in San Francisco."})

response = generate_output(messages)
print(response)

# '<|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>get_current_temperature<|tool▁sep|>{"location": "San Francisco, USA"}<|tool▁call▁end|><|tool▁calls▁end|>'

tool_calls = [{"type": "function", "function": {"name": "get_current_temperature", "arguments": {"location": "San Francisco, USA"}}}]
messages.append({"role": "assistant", "tool_calls": tool_calls})

messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"})
response = generate_output(messages)
print(response)

# The current temperature in San Francisco, USA is 22°C.


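In the vLLM example, the tool call is copied by hand from the raw output into `tool_calls`. Below is a sketch of doing that step programmatically, assuming the raw text always follows the marker layout printed in that example (`function<|tool▁sep|>name<|tool▁sep|>JSON arguments`). `parse_tool_calls` is a hypothetical helper, not part of vLLM.

```python
import json
import re

# Marker layout taken from the raw output in the example; assumed stable across responses.
TOOL_CALL_RE = re.compile(
    r"<\|tool▁call▁begin\|>function<\|tool▁sep\|>(?P<name>[^<]+)"
    r"<\|tool▁sep\|>(?P<args>.*?)<\|tool▁call▁end\|>",
    re.DOTALL,
)


def parse_tool_calls(text: str) -> list:
    """Extract tool calls from raw model output in the chat-message format."""
    return [
        {
            "type": "function",
            "function": {
                "name": m.group("name").strip(),
                "arguments": json.loads(m.group("args")),
            },
        }
        for m in TOOL_CALL_RE.finditer(text)
    ]


raw = (
    '<|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>'
    'get_current_temperature<|tool▁sep|>{"location": "San Francisco, USA"}'
    '<|tool▁call▁end|><|tool▁calls▁end|>'
)
print(parse_tool_calls(raw))
```

An empty list means the model answered in plain text, so the response can be appended as an ordinary assistant message instead.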

NOTE: We initiate the response with "<think>\n" at the beginning of every output when thinking is enabled. This is because hybrid models can be brittle at times, and adding a "<think>\n" ensures that the model does indeed respect thinking.

License

This repository and the model weights are licensed under the MIT License.

Contact

If you would like to reach out to our team, send an email to contact@deepcogito.com.

Features & Capabilities

Mode: chat
Context Window: 128,000 tokens
Max Output: 128,000 tokens
Function Calling: -
Vision: -
Reasoning: Supported
Web Search: -
URL Context: -

Technical Details

Architecture: DeepseekV3ForCausalLM
Model Type: deepseek_v3
Base Model: deepseek-ai/DeepSeek-V3-Base
Library: transformers

API Usage

from openai import OpenAI

client = OpenAI(
    base_url="https://api.haimaker.ai/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="deepcogito/cogito-v2.1-671b",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
)

print(response.choices[0].message.content)

Frequently Asked Questions

What is the context window of cogito 671b v2.1?

cogito 671b v2.1 (deepcogito/cogito-v2.1-671b) has a 128,000-token context window and supports up to 128,000 output tokens per request.

How much does cogito 671b v2.1 cost?

cogito 671b v2.1 is priced at $1.25 per 1M input tokens and $1.25 per 1M output tokens when accessed via the haimaker.ai OpenAI-compatible API.
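As a quick illustration of the pricing above, here is a small cost estimate. It assumes billing is strictly per token at the listed rates and that any reasoning tokens are counted as output tokens. Neither assumption is confirmed on this page.

```python
PRICE_PER_MILLION_INPUT = 1.25   # USD, from the pricing above
PRICE_PER_MILLION_OUTPUT = 1.25  # USD, from the pricing above


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated request cost in USD at the listed per-token prices."""
    return (input_tokens * PRICE_PER_MILLION_INPUT
            + output_tokens * PRICE_PER_MILLION_OUTPUT) / 1_000_000


# A 20,000-token prompt with a 2,000-token reply.
print(f"${estimate_cost(20_000, 2_000):.4f}")  # prints $0.0275
```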

What features does cogito 671b v2.1 support?

cogito 671b v2.1 supports reasoning.

How do I use cogito 671b v2.1 via API?

Send requests to https://api.haimaker.ai/v1/chat/completions with model "deepcogito/cogito-v2.1-671b" using any OpenAI-compatible SDK. Authentication uses a Bearer API key from https://app.haimaker.ai.
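For clients without an OpenAI SDK, the same request can be built with the Python standard library. This is a sketch that only constructs the request. The payload shape is assumed to follow the OpenAI chat completions format, and `build_request` is a hypothetical helper name.

```python
import json
import urllib.request

API_URL = "https://api.haimaker.ai/v1/chat/completions"


def build_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": "deepcogito/cogito-v2.1-671b",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # Bearer API key auth
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request("YOUR_API_KEY", "Hello, how are you?")
# To send it: urllib.request.urlopen(req), then JSON-decode the response body.
```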
