microsoft/phi-4

phi 4 (microsoft/phi-4) is a Microsoft model of the phi3 model type with a 16,384-token context window and up to 16,384 output tokens, priced at $0.07 per 1M input tokens and $0.14 per 1M output tokens. It is available via the haimaker.ai OpenAI-compatible API.
| | |
|-------------------------|-------------------------------------------------------------------------------|
| Developers | Microsoft Research |
| Description | phi-4 is a state-of-the-art open model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning. phi-4 underwent a rigorous enhancement and alignment process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures. |
| Architecture | 14B parameters, dense decoder-only Transformer model |
| Inputs | Text, best suited for prompts in the chat format |
| Context length | 16K tokens |
| GPUs | 1920 H100-80G |
| Training time | 21 days |
| Training data | 9.8T tokens |
| Outputs | Generated text in response to input |
| Dates | October 2024 – November 2024 |
| Status | Static model trained on an offline dataset with cutoff dates of June 2024 and earlier for publicly available data |
| Release date | December 12, 2024 |
| License | MIT |
| | |
|-------------------------------|-------------------------------------------------------------------------|
| Primary Use Cases | Our model is designed to accelerate research on language models, for use as a building block for generative AI powered features. It is intended for general-purpose AI systems and applications (primarily in English) that require: (1) memory- or compute-constrained environments; (2) latency-bound scenarios; (3) reasoning and logic. |
| Out-of-Scope Use Cases | Our model is not specifically designed or evaluated for all downstream purposes, thus: (1) Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. (2) Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case, including the model’s focus on English. (3) Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under. |
Our training data is an extension of the data used for Phi-3 and includes a wide variety of sources from:
Multilingual data constitutes about 8% of our overall data. We are focusing on the quality of data that could potentially improve the reasoning ability for the model, and we filter the publicly available documents to contain the correct level of knowledge.
We evaluated phi-4 using OpenAI’s SimpleEval and our own internal benchmarks to understand the model’s capabilities, more specifically:
phi-4 has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated synthetic datasets. The overall technique employed to do the safety alignment is a combination of SFT (Supervised Fine-Tuning) and iterative DPO (Direct Preference Optimization), including publicly available datasets focusing on helpfulness and harmlessness as well as various questions and answers targeted to multiple safety categories.
Prior to release, phi-4 followed a multi-faceted evaluation approach. Quantitative evaluation was conducted with multiple open-source safety benchmarks and in-house tools utilizing adversarial conversation simulation. For qualitative safety evaluation, we collaborated with the independent AI Red Team (AIRT) at Microsoft to assess safety risks posed by phi-4 in both average and adversarial user scenarios. In the average user scenario, AIRT emulated typical single-turn and multi-turn interactions to identify potentially risky behaviors. The adversarial user scenario tested a wide range of techniques aimed at intentionally subverting the model’s safety training including jailbreaks, encoding-based attacks, multi-turn attacks, and adversarial suffix attacks.
Please refer to the technical report for more details on safety alignment.
To understand the capabilities, we compare phi-4 with a set of models over OpenAI’s SimpleEval benchmark.
The table below gives a high-level overview of model quality on representative benchmarks; higher numbers indicate better performance:
| Category | Benchmark | phi-4 (14B) | phi-3 (14B) | Qwen 2.5 (14B instruct) | GPT-4o-mini | Llama-3.3 (70B instruct) | Qwen 2.5 (72B instruct) | GPT-4o |
|------------------------------|---------------|-----------|-----------------|----------------------|----------------------|--------------------|-------------------|-----------------|
| Popular Aggregated Benchmark | MMLU | 84.8 | 77.9 | 79.9 | 81.8 | 86.3 | 85.3 | 88.1 |
| Science | GPQA | 56.1 | 31.2 | 42.9 | 40.9 | 49.1 | 49.0 | 50.6 |
| Math | MGSM | 80.6 | 53.5 | 79.6 | 86.5 | 89.1 | 87.3 | 90.4 |
| Math | MATH | 80.4 | 44.6 | 75.6 | 73.0 | 66.3* | 80.0 | 74.6 |
| Code Generation | HumanEval | 82.6 | 67.8 | 72.1 | 86.2 | 78.9* | 80.4 | 90.6 |
| Factual Knowledge | SimpleQA | 3.0 | 7.6 | 5.4 | 9.9 | 20.9 | 10.2 | 39.4 |
| Reasoning | DROP | 75.5 | 68.3 | 85.5 | 79.3 | 90.2 | 76.7 | 80.9 |
\* These scores are lower than those reported by Meta, perhaps because simple-evals has a strict formatting requirement that Llama models have particular trouble following. We use the simple-evals framework because it is reproducible, but Meta reports 77 for MATH and 88 for HumanEval on Llama-3.3-70B.
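To make the comparison with phi-3 concrete, the sketch below computes the per-benchmark difference between the phi-4 (14B) and phi-3 (14B) columns. The scores are copied from the table above; the variable names are illustrative only.

```python
# Scores copied from the phi-4 (14B) and phi-3 (14B) columns of the table.
phi4 = {"MMLU": 84.8, "GPQA": 56.1, "MGSM": 80.6, "MATH": 80.4,
        "HumanEval": 82.6, "SimpleQA": 3.0, "DROP": 75.5}
phi3 = {"MMLU": 77.9, "GPQA": 31.2, "MGSM": 53.5, "MATH": 44.6,
        "HumanEval": 67.8, "SimpleQA": 7.6, "DROP": 68.3}

# Difference in points (phi-4 minus phi-3), rounded to one decimal place.
deltas = {name: round(phi4[name] - phi3[name], 1) for name in phi4}

for name, delta in deltas.items():
    print(f"{name:<10} {delta:+.1f}")
```

On these numbers, phi-4 scores higher than phi-3 on every listed benchmark except SimpleQA.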
Given the nature of the training data, phi-4 is best suited for prompts using the chat format as follows:

```
<|im_start|>system<|im_sep|>
You are a medieval knight and must provide explanations to modern people.<|im_end|>
<|im_start|>user<|im_sep|>
How should I explain the Internet?<|im_end|>
<|im_start|>assistant<|im_sep|>
```
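The layout above can also be produced programmatically. The following is a minimal sketch, assuming every turn follows exactly the pattern displayed (`<|im_start|>`, the role, `<|im_sep|>`, a line break, the content, `<|im_end|>`). `render_phi4_prompt` is a hypothetical helper name, and the line breaks mirror how the format is displayed here; in practice the tokenizer's own chat template should be treated as authoritative.

```python
def render_phi4_prompt(messages):
    """Render chat messages into the phi-4 prompt layout displayed above."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}<|im_sep|>\n"
            f"{message['content']}<|im_end|>\n"
        )
    # Leave an open assistant turn so the model generates the reply.
    parts.append("<|im_start|>assistant<|im_sep|>")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a medieval knight and must provide explanations to modern people."},
    {"role": "user", "content": "How should I explain the Internet?"},
]
prompt = render_phi4_prompt(messages)
print(prompt)
```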
With transformers:

```python
import transformers

# Load phi-4 as a text-generation pipeline that accepts chat messages.
pipeline = transformers.pipeline(
    "text-generation",
    model="microsoft/phi-4",
    model_kwargs={"torch_dtype": "auto"},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a medieval knight and must provide explanations to modern people."},
    {"role": "user", "content": "How should I explain the Internet?"},
]

# The pipeline returns the whole conversation; the last entry is the new assistant message.
outputs = pipeline(messages, max_new_tokens=128)
print(outputs[0]["generated_text"][-1])
```
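The final `print` shows the whole last message as a dictionary. A minimal sketch for pulling out just the reply text follows, assuming the pipeline returns the conversation as a list of `role`/`content` dictionaries, as it does for chat-style input. `extract_reply` is a hypothetical helper, and the sample data is a hand-written stand-in with the same shape, not real model output.

```python
def extract_reply(outputs):
    """Return the text of the last message in the first generated conversation."""
    conversation = outputs[0]["generated_text"]
    last_message = conversation[-1]
    if last_message["role"] != "assistant":
        raise ValueError("last message is not an assistant reply")
    return last_message["content"]


# Hand-written stand-in shaped like the pipeline output (not real model output).
sample_outputs = [{"generated_text": [
    {"role": "user", "content": "How should I explain the Internet?"},
    {"role": "assistant", "content": "(placeholder reply)"},
]}]
print(extract_reply(sample_outputs))
```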
Like other language models, phi-4 can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
phi-4 is not intended to support multilingual use. For code, phi-4's training data is based in Python and uses common packages such as typing, math, random, collections, datetime, and itertools. If the model generates Python scripts that utilize other packages or scripts in other languages, we strongly recommend users manually verify all API uses.

| | |
|------------------|------------------|
| Mode | chat |
| Context Window | 16,384 tokens |
| Max Output | 16,384 tokens |
| Function Calling | - |
| Vision | - |
| Reasoning | - |
| Web Search | - |
| URL Context | - |
| Architecture | Phi3ForCausalLM |
| Model Type | phi3 |
| Languages | en |
| Library | transformers |
```python
from openai import OpenAI

# Point the OpenAI SDK at the haimaker.ai OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.haimaker.ai/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="microsoft/phi-4",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
)
print(response.choices[0].message.content)
```

phi 4 (microsoft/phi-4) has a 16,384-token context window and supports up to 16,384 output tokens per request.
phi 4 is priced at $0.07 per 1M input tokens and $0.14 per 1M output tokens when accessed via the haimaker.ai OpenAI-compatible API.
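The limits and prices quoted above lend themselves to a quick pre-flight estimate. The sketch below is illustrative only: `estimate_cost_usd` and `max_output_budget` are hypothetical helper names, token counts are assumed to be known in advance, and the budget calculation assumes that prompt and output tokens share the 16,384-token context window, which is typical for decoder-only models but is not stated explicitly here.

```python
CONTEXT_WINDOW = 16_384          # tokens, as listed for microsoft/phi-4
MAX_OUTPUT = 16_384              # tokens, as listed for microsoft/phi-4
INPUT_USD_PER_MILLION = 0.07     # listed input price
OUTPUT_USD_PER_MILLION = 0.14    # listed output price


def estimate_cost_usd(input_tokens, output_tokens):
    """Estimate the cost of one request in US dollars."""
    total = input_tokens * INPUT_USD_PER_MILLION + output_tokens * OUTPUT_USD_PER_MILLION
    return total / 1_000_000


def max_output_budget(prompt_tokens):
    """Output tokens still available, assuming prompt and output share the window."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    return max(0, min(MAX_OUTPUT, CONTEXT_WINDOW - prompt_tokens))


print(f"${estimate_cost_usd(12_000, 2_000):.6f}")  # $0.001120
print(max_output_budget(12_000))                   # 4384
```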
Send requests to https://api.haimaker.ai/v1/chat/completions with model "microsoft/phi-4" using any OpenAI-compatible SDK. Authentication uses a Bearer API key from https://app.haimaker.ai.
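For clients that do not use an SDK, the request described above can be assembled by hand. The sketch below builds, but does not send, such a request with Python's standard library. The JSON body follows the usual OpenAI chat-completions shape, which is an assumption based on the endpoint being described as OpenAI-compatible, and `build_chat_request` is a hypothetical helper name.

```python
import json
import urllib.request

API_URL = "https://api.haimaker.ai/v1/chat/completions"


def build_chat_request(api_key, user_message, model="microsoft/phi-4"):
    """Build a POST request with Bearer authentication and a JSON chat payload."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request("YOUR_API_KEY", "Hello, how are you?")
print(request.get_method(), request.full_url)

# Sending is left commented out because it needs a real key and network access:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```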