Introducing haimaker GPU Benchmarking
San Francisco, CA – December 18, 2025
In today's AI infrastructure market, "peak throughput" is a vanity metric. To run profitable AI infrastructure, you need to solve a multi-variable equation involving hardware costs, power consumption, and shifting model demand.
At haimaker.ai, we use real-time market data and deep hardware benchmarking to optimize this equation. Here is how we think about inference performance and the methodology behind our benchmarking suite.
Three Dimensions of Inference Optimization
Maximizing the return on AI infrastructure requires balancing three distinct variables:
1. Global Model Marketplace
The demand for specific LLMs changes constantly as new open-source models are released and user behavior evolves. We monitor the global model mix and token pricing in real-time. This allows our orchestration layer to adjust model placement and node management based on what the market actually wants to pay for.
2. Hardware
The GPU is your primary CapEx cost, but raw specs are often misleading. For example, the "decode" phase of inference is usually limited by memory bandwidth, not computational TFLOPS. We currently partner with NVIDIA and Tenstorrent to benchmark how specific architectures handle real-world LLM workloads versus their marketed peaks.
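To make the bandwidth point concrete, here is a back-of-the-envelope sketch of the decode ceiling. The model size, precision, and bandwidth figures are illustrative placeholders, not measurements from our suite:

```python
# Roofline-style estimate: at batch size 1, every generated token must stream
# all model weights from GPU memory, so decode throughput is bounded by
# memory bandwidth rather than TFLOPS.

def decode_ceiling_tokens_per_s(param_count: float, bytes_per_param: float,
                                mem_bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode throughput, in tokens per second."""
    weight_bytes = param_count * bytes_per_param
    return (mem_bandwidth_gb_s * 1e9) / weight_bytes

# Example: a 70B-parameter model held in FP8 (1 byte/param) on a GPU with
# roughly 3.3 TB/s of memory bandwidth -> about 47 tokens/s per stream.
print(decode_ceiling_tokens_per_s(70e9, 1.0, 3300))
```

Batching recovers compute utilization by amortizing the same weight traffic across many concurrent streams, which is exactly why the concurrency testing described below matters.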
3. Efficiency Levers: Quantization & Power
Once the hardware is on the rack, we use two primary levers to improve the performance-per-watt:
Quantization: By reducing precision (e.g., from FP16 to FP8 or INT4), we shrink the memory footprint and speed up computation. This lets larger models fit on fewer or cheaper GPUs without significant quality loss (a rough footprint sketch follows this list).
Power Management: Enterprise GPUs don't have to run at max wattage. We find the "sweet spot" where reducing the power limit significantly cuts energy costs while only marginally affecting throughput.
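As a rough illustration of that footprint math (weights only; KV cache and activations are ignored, so real memory requirements are higher):

```python
# Approximate weight-memory footprint at different precisions.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

def weight_footprint_gb(param_count: float, precision: str) -> float:
    return param_count * BYTES_PER_PARAM[precision] / 1e9

for precision in ("FP16", "FP8", "INT4"):
    print(precision, round(weight_footprint_gb(70e9, precision)), "GB")
# FP16 ~ 140 GB, FP8 ~ 70 GB, INT4 ~ 35 GB: the difference between spanning
# two 80 GB GPUs and fitting comfortably on one.
```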
Power Profile Management
A critical efficiency lever is managing the power consumption of the GPU itself. Enterprise-grade GPUs do not have to operate at a fixed wattage: their power limits can be configured dynamically, trading peak performance for greater energy efficiency. For inference workloads that are not constantly compute-saturated, reducing the power limit from its maximum often yields only a minor decrease in throughput but a substantial reduction in energy consumption.
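A minimal sketch of how such a sweep can be scripted. The nvidia-smi calls are standard (changing the limit requires administrator privileges); run_inference_benchmark is a hypothetical stand-in for whatever load generator you use:

```python
import subprocess

def set_power_limit(gpu_index: int, watts: int) -> None:
    # nvidia-smi -pl sets the board power limit (needs admin privileges).
    subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)],
                   check=True)

def read_power_draw(gpu_index: int) -> float:
    # Query the current board power draw in watts.
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

def run_inference_benchmark() -> float:
    # Hypothetical hook: drive a fixed workload and return tokens/s.
    raise NotImplementedError("plug in your own load generator")

# Sweep power caps to find the knee where tokens/s barely drops
# while watts fall sharply.
for cap_watts in (700, 600, 500, 400, 300):
    set_power_limit(0, cap_watts)
    tps = run_inference_benchmark()
    print(f"{cap_watts} W cap: {tps:.0f} tokens/s at {read_power_draw(0):.0f} W")
```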
haimaker.ai Benchmarking Methodology
Standard benchmarks often fail to capture how a system behaves under a real-world concurrent load. Our suite is designed to find the actual breaking point of a configuration.
4×4 Test Matrix
Inference costs vary wildly between the input (prefill) and output (decode) phases, so we sweep both input and output token lengths to mirror production usage:
Input Token Lengths: 128, 512, 1024, 2048 tokens
Output Token Lengths: 128, 512, 1024, 2048 tokens
This creates 16 distinct combinations (enumerated in the sketch after this list) that represent:
Short queries with brief responses (128 × 128): Quick Q&A, command interfaces
Medium context with detailed responses (512 × 1024): Documentation queries, technical support
Large context processing (2048 × 2048): Document analysis, complex reasoning tasks
All intermediate combinations capturing the spectrum of real-world usage
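Enumerating the matrix is straightforward; a minimal sketch:

```python
from itertools import product

TOKEN_LENGTHS = (128, 512, 1024, 2048)

# Every (input, output) pair in the 4x4 matrix -> 16 test cases.
test_matrix = list(product(TOKEN_LENGTHS, TOKEN_LENGTHS))
assert len(test_matrix) == 16

for prefill_tokens, decode_tokens in test_matrix:
    print(f"prefill={prefill_tokens:>4}  decode={decode_tokens:>4}")
```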
Dynamic Concurrency Scaling
Instead of testing a fixed number of users, our system automatically ramps up concurrency until it hits a "Stopping Condition":
Throughput Degradation: The point where adding more requests actually lowers the total tokens per second.
Failure Threshold: When the request error rate exceeds 10%.
This allows us to identify the true performance ceiling for any hardware/model combination.
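A minimal sketch of that ramp logic, assuming a hypothetical run_at_concurrency hook that drives n concurrent request streams and reports aggregate throughput and error rate (the geometric ramp here is an assumption, not our exact production policy):

```python
def find_performance_ceiling(run_at_concurrency, start: int = 1,
                             growth: int = 2, max_error_rate: float = 0.10):
    """Ramp concurrency until a stopping condition is hit.

    Returns the last concurrency level that improved throughput and the
    best aggregate tokens/s observed.
    """
    best_tps = 0.0
    best_concurrency = start
    concurrency = start
    while True:
        tps, error_rate = run_at_concurrency(concurrency)
        if error_rate > max_error_rate:   # Failure Threshold
            return best_concurrency, best_tps
        if tps < best_tps:                # Throughput Degradation
            return best_concurrency, best_tps
        best_tps, best_concurrency = tps, concurrency
        concurrency *= growth             # ramp up and try again
```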
Core Metrics for TCO
We focus on metrics that directly impact the bottom line and the user experience; the sketch after this list shows how they can be computed from request logs:
Tokens Per Second (TPS): Measured separately for Input and Output to calculate accurate TCO.
kWh per Million Tokens (kWh/MT): The most important metric for operational profitability.
Time to First Token (TTFT): Our proxy for "snappiness" and responsiveness (P50 through P99).
Time Per Output Token (TPOT): Determines the perceived streaming speed for the user.
End-to-End Latency (E2E): The total round-trip time for a request.
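A minimal sketch of how these metrics fall out of per-request timestamps and power readings. The record format is hypothetical, not our internal schema:

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class RequestRecord:
    start: float          # request sent (seconds)
    first_token: float    # first streamed token received
    end: float            # final token received
    output_tokens: int

def ttft(r: RequestRecord) -> float:
    return r.first_token - r.start                      # Time to First Token

def tpot(r: RequestRecord) -> float:
    # Time Per Output Token, averaged over the decode phase.
    return (r.end - r.first_token) / max(r.output_tokens - 1, 1)

def e2e(r: RequestRecord) -> float:
    return r.end - r.start                              # End-to-End latency

def output_tps(records: list[RequestRecord], wall_clock_s: float) -> float:
    return sum(r.output_tokens for r in records) / wall_clock_s

def p50_p99(values: list[float]) -> tuple[float, float]:
    cuts = quantiles(values, n=100)                     # 99 percentile cuts
    return cuts[49], cuts[98]

def kwh_per_million_tokens(avg_power_w: float, tokens_per_s: float) -> float:
    kwh_per_second = avg_power_w / 1000 / 3600
    return kwh_per_second / tokens_per_s * 1e6
```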
From Metrics to Business Decisions
We maintain versioned reference datasets—built from real-world text corpora and validated for exact token counts—to ensure every test is reproducible. We also capture the full system metadata, from CUDA versions to thermal settings.
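As an illustration of the token-count validation step, a sketch assuming a Hugging Face tokenizer for the model under test (the function and its arguments are illustrative, not our internal tooling):

```python
from transformers import AutoTokenizer

def validate_reference_prompt(model_id: str, text: str,
                              expected_tokens: int) -> bool:
    """Check that a reference prompt tokenizes to exactly the declared length."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    actual = len(tokenizer.encode(text, add_special_tokens=False))
    return actual == expected_tokens
```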
This data-driven approach moves AI infrastructure from guesswork to financial modeling. It allows our partners to make informed decisions on:
Hardware Acquisition: Which GPU provides the best ROI for their specific workload?
Capacity Planning: How many nodes are actually required to meet an SLA?
Operating Margins: What is the true cost of energy per million tokens?
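The last question lends itself to simple arithmetic. An illustrative sketch, with every input a placeholder rather than a quoted or measured figure:

```python
def cost_per_million_output_tokens(gpu_price_usd: float,
                                   amortization_hours: float,
                                   avg_power_kw: float,
                                   electricity_usd_per_kwh: float,
                                   tokens_per_second: float) -> float:
    capex_per_hour = gpu_price_usd / amortization_hours
    energy_per_hour = avg_power_kw * electricity_usd_per_kwh
    tokens_per_hour = tokens_per_second * 3600
    return (capex_per_hour + energy_per_hour) / tokens_per_hour * 1e6

# e.g. a $30,000 accelerator amortized over 3 years of full utilization,
# drawing 0.6 kW at $0.10/kWh while sustaining 2,000 output tokens/s:
print(cost_per_million_output_tokens(30_000, 3 * 365 * 24, 0.6, 0.10, 2_000))
# -> roughly $0.17 per million output tokens
```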
Total Cost of Ownership (TCO) is more than just the price of the chip; CapEx is often only half of the story. By integrating market demand with hardware-level efficiency, haimaker.ai ensures that AI infrastructure isn't just fast—it's profitable.