rekaai/reka-flash-3
Reka Flash 3 (rekaai/reka-flash-3) is a 20.9B-parameter Llama-architecture model from Reka AI with a 65,536-token context window and up to 65,536 output tokens, priced at $0.10 per 1M input tokens and $0.20 per 1M output tokens. It is available via the haimaker.ai OpenAI-compatible API.
Reka Flash 3 is a 21B-parameter general-purpose reasoning model that was trained from scratch. It was trained on synthetic and public datasets for supervised finetuning, followed by RLOO (REINFORCE Leave-One-Out) with model-based and rule-based rewards. It performs competitively with proprietary models such as OpenAI o1-mini, making it a good foundation for applications that require low latency or on-device deployment. It is currently the best open model in its size category.
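For readers unfamiliar with RLOO, the core idea can be sketched in a few lines: sample several responses per prompt, score each one, and use the mean reward of the other samples as the baseline for each response. This is a generic illustration of the leave-one-out baseline, not Reka's training code; the function name and the reward values are hypothetical.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for k sampled responses to one prompt."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # The baseline for sample i is the mean reward of the other k - 1 samples.
    return [r - (total - r) / (k - 1) for r in rewards]


# Hypothetical rewards for three sampled responses to the same prompt.
advantages = rloo_advantages([1.0, 0.0, 0.0])
print(advantages)
```

Responses that score above the average of their siblings get a positive advantage, and the advantages for one prompt always sum to zero.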
Try it out at Reka Space.
Reka Flash 3 powers Nexus, our new platform for organizations to create and manage AI workers. Nexus workers have native deep research capabilities and can browse the web, execute code, and analyse internal files including documents, images, videos and audio. Learn more about Nexus at getnexus.reka.ai.
For ease of deployment, the model is released in a Llama-compatible format. You may use any library compatible with Llama to run the model.
import transformers

# Load the tokenizer and model; dtype and device placement are chosen automatically.
tokenizer = transformers.AutoTokenizer.from_pretrained("RekaAI/reka-flash-3")
model = transformers.AutoModelForCausalLM.from_pretrained("RekaAI/reka-flash-3", torch_dtype='auto', device_map='auto')

# Build a single-turn conversation and render it with the built-in chat template.
prompt = {"role": "human", "content": "Write a poem about large language model."}
text = tokenizer.apply_chat_template([prompt], tokenize=False, add_generation_prompt=True)

# Tokenize the prompt, generate up to 512 new tokens, and decode the result.
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**model_inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
docker run --rm -it --network=host --gpus '"device=0"' --shm-size=10.24gb vllm/vllm-openai:latest serve RekaAI/reka-flash-3 --dtype auto -tp 1
Reka Flash 3 uses the cl100k_base tokenizer and adds no additional special tokens. Its prompt format is as follows:
human: this is round 1 prompt <sep> assistant: this is round 1 response <sep> ...
Generation should stop on seeing the string <sep> or the special token <|endoftext|>.
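Most serving stacks accept stop strings directly, but when working with raw decoded output the stopping rule can be applied after the fact. The sketch below assumes the stop string is the <sep> round separator from the prompt format; the helper name is hypothetical.

```python
from typing import Sequence

# Assumption: <sep> is the round separator used as the stop string.
STOP_STRINGS = ("<sep>", "<|endoftext|>")


def truncate_at_stop(text: str, stops: Sequence[str] = STOP_STRINGS) -> str:
    """Cut decoded text at the earliest stop marker, if any is present."""
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].rstrip()


print(truncate_at_stop("Hello there. <sep> human: next question"))
```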
A system prompt can be added by prepending it to the first user round.
human: You are a friendly assistant blah ... this is round 1 user prompt <sep> assistant: this is round 1 response <sep> ...
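To make the format concrete, here is a minimal sketch that renders a message list into this layout by hand. The exact whitespace around <sep> and the trailing "assistant:" generation prompt are assumptions read off the examples above, and the function name is hypothetical; the tokenizer's built-in chat template remains the authoritative source.

```python
from typing import Dict, List, Optional


def build_prompt(messages: List[Dict[str, str]],
                 system_prompt: Optional[str] = None) -> str:
    """Render chat messages as 'human: ... <sep> assistant: ... <sep> ...'."""
    parts = []
    for i, message in enumerate(messages):
        # Treat OpenAI-style "user" and Reka-style "human" as the same role.
        role = "human" if message["role"] in ("user", "human") else "assistant"
        content = message["content"]
        # The system prompt is prepended to the first user round.
        if i == 0 and system_prompt and role == "human":
            content = f"{system_prompt} {content}"
        parts.append(f"{role}: {content} <sep>")
    # Assumed generation prompt: leave the next assistant turn open.
    parts.append("assistant:")
    return " ".join(parts)


print(build_prompt([{"role": "user", "content": "Hi"}],
                   system_prompt="You are a friendly assistant."))
```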
For multi-round conversations, it is recommended to drop the reasoning traces in the previous assistant round to save tokens for the model to think.
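One way to do this is to strip the reasoning span from earlier assistant messages before resending the history. The sketch below assumes the trace is wrapped in <reasoning>...</reasoning> tags, which this page does not confirm; the tag names are parameters so they can be changed, and both helper names are hypothetical.

```python
import re
from typing import Dict, List


def strip_reasoning(text: str,
                    open_tag: str = "<reasoning>",
                    close_tag: str = "</reasoning>") -> str:
    """Remove every open_tag...close_tag span (tag names are an assumption)."""
    pattern = re.escape(open_tag) + r".*?" + re.escape(close_tag)
    return re.sub(pattern, "", text, flags=re.DOTALL).strip()


def prune_history(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return a copy of the history with reasoning dropped from assistant turns."""
    pruned = []
    for message in messages:
        if message["role"] == "assistant":
            message = {**message, "content": strip_reasoning(message["content"])}
        pruned.append(message)
    return pruned


history = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<reasoning>Add the numbers.</reasoning> 4"},
]
print(prune_history(history))
```

The original list is left unmodified, so the full traces can still be logged or inspected separately.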
If you are using Hugging Face Transformers or vLLM, the built-in chat_template handles prompt formatting automatically.
Reka Flash thinks before it produces an output. We use
This model is primarily built for English, and you should consider it an English-only model. However, it can understand and converse in other languages to some degree.
| Property | Value |
| --- | --- |
| Mode | chat |
| Context Window | 65,536 tokens |
| Max Output | 65,536 tokens |
| Function Calling | - |
| Vision | - |
| Reasoning | Supported |
| Web Search | - |
| URL Context | - |
| Architecture | LlamaForCausalLM |
| Model Type | llama |
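Because the context window and the maximum output are the same size, a long prompt reduces how much the model can generate. The sketch below assumes that prompt tokens and generated tokens share the 65,536-token window, which is typical for decoder-only models but is not stated on this page; the function name is hypothetical.

```python
CONTEXT_WINDOW = 65_536  # tokens, from the table above
MAX_OUTPUT = 65_536      # tokens, from the table above


def max_new_tokens_for(prompt_tokens: int) -> int:
    """Largest output budget that still fits next to the prompt."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    # Assumption: prompt and output tokens share one context window.
    remaining = CONTEXT_WINDOW - prompt_tokens
    return max(0, min(MAX_OUTPUT, remaining))


print(max_new_tokens_for(1_536))  # 64000
```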
from openai import OpenAI

client = OpenAI(
    base_url="https://api.haimaker.ai/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="rekaai/reka-flash-3",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
)
print(response.choices[0].message.content)

Reka Flash 3 (rekaai/reka-flash-3) has a 65,536-token context window and supports up to 65,536 output tokens per request.
Reka Flash 3 is priced at $0.10 per 1M input tokens and $0.20 per 1M output tokens when accessed via the haimaker.ai OpenAI-compatible API.
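As a quick worked example of these rates, the sketch below estimates the cost of a single request from its token counts. The function name and the example token counts are hypothetical, and actual billing may round or meter differently.

```python
INPUT_USD_PER_MILLION = 0.10   # price per 1M input tokens
OUTPUT_USD_PER_MILLION = 0.20  # price per 1M output tokens


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated request cost in USD at the listed per-token rates."""
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


# Hypothetical request: 50,000 prompt tokens and 10,000 generated tokens.
print(f"${estimate_cost_usd(50_000, 10_000):.4f}")
```

Since reasoning tokens are generated tokens, long thinking traces would presumably count toward the output side of this estimate.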
Reka Flash 3 supports reasoning: it produces a thinking trace before its final output.
Send requests to https://api.haimaker.ai/v1/chat/completions with model "rekaai/reka-flash-3" using any OpenAI-compatible SDK. Authentication uses a Bearer API key from https://app.haimaker.ai.
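For environments without an SDK, the same request can be assembled with the Python standard library. This is a minimal sketch: the endpoint, model name, and Bearer scheme come from the text above, while the HAIMAKER_API_KEY environment variable and the helper name are hypothetical.

```python
import json
import os
import urllib.request

ENDPOINT = "https://api.haimaker.ai/v1/chat/completions"


def build_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat completions POST request."""
    payload = {
        "model": "rekaai/reka-flash-3",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# HAIMAKER_API_KEY is a hypothetical variable name for this sketch.
request = build_request(os.environ.get("HAIMAKER_API_KEY", "YOUR_API_KEY"),
                        "Hello, how are you?")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it. The response body should then follow the OpenAI chat completions shape, which is an assumption based on the endpoint being described as OpenAI-compatible.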