Peak throughput is a vanity metric.
Running profitable AI infrastructure means solving a harder problem: balancing hardware costs, power consumption, and shifting model demand. The GPU that looks best on paper might lose money in production.
At haimaker.ai, we benchmark for profitability. Here's how.
Three variables that matter
1. Model demand changes constantly
Which LLMs people actually want changes week to week. New open-source releases shift traffic. Pricing changes redirect volume. We track the global model mix and token pricing in real time, then adjust where we run each model based on what the market is willing to pay.
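A rough sketch of that placement decision, with made-up prices, throughputs, and power draws (this is illustrative, not our production logic):

```python
# Illustrative only: pick which model to serve on a GPU by expected margin.
# Prices, throughputs, and power draws are made-up placeholders.

ELECTRICITY_USD_PER_KWH = 0.12

candidates = [
    # (model, market price per 1M output tokens, measured tokens/sec, watts)
    ("llama-70b", 0.60, 450, 650),
    ("llama-8b",  0.05, 4200, 400),
    ("mixtral",   0.30, 900, 600),
]

def margin_per_gpu_hour(price_per_m_tok, tps, watts):
    revenue = tps * 3600 / 1e6 * price_per_m_tok          # USD earned per hour
    energy_cost = watts / 1000 * ELECTRICITY_USD_PER_KWH  # USD of electricity per hour
    return revenue - energy_cost

best = max(candidates, key=lambda c: margin_per_gpu_hour(*c[1:]))
print(f"serve {best[0]}: ${margin_per_gpu_hour(*best[1:]):.2f}/GPU-hour before CapEx")
```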
2. GPU specs are misleading
The GPU is your biggest capital expense, but TFLOPS don't tell you much about inference performance. The decode phase (generating output tokens) is usually memory-bandwidth limited, not compute limited. A GPU with lower theoretical performance can beat a more expensive one on actual inference throughput.
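A back-of-envelope way to see why: during decode, each output token streams roughly the full set of weights from memory, so memory bandwidth divided by model size gives an upper bound on single-stream decode speed. The figures below are illustrative round numbers, not vendor measurements:

```python
# Back-of-envelope decode ceiling: each output token reads ~all weights once
# (ignoring KV cache and batching), so bandwidth, not TFLOPS, sets the limit.

def decode_ceiling_tps(params_billions, bytes_per_param, mem_bandwidth_gbs):
    model_bytes = params_billions * 1e9 * bytes_per_param
    return mem_bandwidth_gbs * 1e9 / model_bytes  # tokens/sec upper bound

# 70B model in FP16 on a ~2 TB/s card vs a ~3.3 TB/s card
print(decode_ceiling_tps(70, 2, 2000))   # ~14 tok/s per stream
print(decode_ceiling_tps(70, 2, 3300))   # ~23 tok/s per stream
# Same model quantized to INT4 on the cheaper card
print(decode_ceiling_tps(70, 0.5, 2000)) # ~57 tok/s per stream
```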
We benchmark NVIDIA and Tenstorrent hardware against real LLM workloads, not marketing specs.
3. Efficiency beats raw speed
Once hardware is deployed, two levers improve cost efficiency:
Quantization: Reducing precision from FP16 to FP8 or INT4 shrinks the memory footprint, and because decode is bandwidth-bound, moving fewer bytes per token speeds up generation too. Larger models fit on cheaper hardware. Quality loss is usually negligible.
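A quick weight-only footprint estimate shows why this matters for hardware choice (KV cache and activations add more on top; sizes are rough):

```python
# Rough weight-only footprint at different precisions (KV cache and
# activations are extra). Sizes are estimates, not measured numbers.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_gb(params_billions, precision):
    return params_billions * BYTES_PER_PARAM[precision]

for p in BYTES_PER_PARAM:
    print(f"70B @ {p}: ~{weight_gb(70, p):.0f} GB")
# FP16 needs multiple 80 GB cards; INT4 fits on one with room left for KV cache.
```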
Power management: Enterprise GPUs don't need to run at max wattage. We sweep power caps to find the point where cutting power significantly reduces energy costs while barely affecting throughput. For inference workloads that aren't constantly saturated, the savings are substantial.
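A minimal sketch of such a sweep. `run_benchmark` is a placeholder for whatever load generator you use, and setting the cap with nvidia-smi requires root and a value inside the card's supported range:

```python
# Sketch of a power-cap sweep; replace run_benchmark() with real measurements.
import subprocess

def set_power_cap(watts, gpu=0):
    subprocess.run(["nvidia-smi", "-i", str(gpu), "-pl", str(watts)], check=True)

def run_benchmark(cap):
    # Placeholder: drive the inference server at this cap and measure
    # (tokens/sec, average watts).
    return 1000.0, float(cap)

results = []
for cap in (450, 400, 350, 300, 250):
    set_power_cap(cap)
    tps, avg_watts = run_benchmark(cap)
    results.append((cap, tps, tps / (avg_watts / 1000)))  # tokens/sec per kW

# Keep the lowest cap whose throughput stays within a few percent of the best.
```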
Benchmarking methodology
Standard benchmarks test ideal conditions. We test breaking points.
The 4×4 token matrix
Inference costs differ dramatically between input (prefill) and output (decode). We test 16 combinations:
Input lengths: 128, 512, 1024, 2048 tokens
Output lengths: 128, 512, 1024, 2048 tokens
This covers quick Q&A (128×128) through document analysis (2048×2048) and everything in between.
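Generating the matrix is trivial; the value is in running all 16 cells on every configuration:

```python
# The 16 prefill/decode combinations we sweep (token counts from the post).
from itertools import product

INPUT_LENGTHS = [128, 512, 1024, 2048]
OUTPUT_LENGTHS = [128, 512, 1024, 2048]

test_matrix = list(product(INPUT_LENGTHS, OUTPUT_LENGTHS))
for prompt_tokens, output_tokens in test_matrix:
    print(f"prefill={prompt_tokens:>4}  decode={output_tokens:>4}")
```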
Finding the ceiling
Instead of testing at a fixed concurrency, we ramp up concurrent requests until something breaks:
- Throughput degradation: Adding more requests lowers total tokens/second
- Failure threshold: Error rate exceeds 10%
This tells you the actual capacity of a configuration, not a theoretical maximum.
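In sketch form, the ramp logic looks something like this, where `measure` stands in for whatever harness fires the requests and reports throughput and errors:

```python
# Sketch of the ramp: keep adding concurrent request streams until total
# tokens/sec drops or the error rate passes 10%. `measure` takes a concurrency
# level and returns (tokens/sec, error_rate).

MAX_ERROR_RATE = 0.10

def find_ceiling(measure, start=1, step=8, tolerance=0.95):
    best_tps, ceiling, n = 0.0, start, start
    while True:
        tps, err = measure(n)
        if err > MAX_ERROR_RATE or tps < tolerance * best_tps:
            return ceiling, best_tps  # last stable concurrency and its throughput
        if tps > best_tps:
            best_tps, ceiling = tps, n
        n += step
```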
Metrics that affect your costs
- Tokens per second (TPS): Measured separately for input and output
- kWh per million tokens: The number that determines operational margins
- Time to first token (TTFT): How responsive the system feels (we track P50 through P99)
- Time per output token (TPOT): Streaming speed perception
- End-to-end latency: Total round-trip time
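Two of those headline numbers are simple to compute once a run finishes; the inputs below are illustrative:

```python
# kWh per million tokens and TTFT percentiles from a finished run.
from statistics import quantiles

def kwh_per_million_tokens(avg_watts, duration_s, total_tokens):
    kwh = avg_watts * duration_s / 3_600_000  # watt-seconds to kWh
    return kwh * 1e6 / total_tokens

ttft_ms = [180, 195, 210, 240, 310, 890]   # per-request TTFT samples, illustrative
cuts = quantiles(ttft_ms, n=100)           # stdlib percentile cut points
print("P50:", cuts[49], "P99:", cuts[98])
print("kWh/M tokens:", kwh_per_million_tokens(620, 3600, 2_500_000))
```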
From benchmarks to decisions
We version our test datasets and capture full system metadata (CUDA versions, thermal settings, the works). Every test is reproducible.
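A stripped-down example of that kind of metadata capture (the fields shown are a subset, and the dataset tag is a placeholder):

```python
# Minimal run-metadata capture; the real record is much broader.
import json, platform, subprocess

def capture_metadata():
    gpu = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version,power.limit",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return {
        "host": platform.node(),
        "kernel": platform.release(),
        "gpu": gpu,
        "dataset_version": "v3",  # placeholder tag for the versioned test set
    }

print(json.dumps(capture_metadata(), indent=2))
```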
This data answers practical questions:
- Which GPU gives best ROI for a specific workload?
- How many nodes do we actually need to meet an SLA?
- What's the true energy cost per million tokens?
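Answering that last question, and seeing how CapEx and energy split, is simple arithmetic once the benchmark numbers exist. Every input below is a made-up round number:

```python
# Illustrative cost per million output tokens: amortized CapEx plus electricity.

def cost_per_million_tokens(gpu_price_usd, amortize_years, utilization,
                            tokens_per_sec, avg_watts, usd_per_kwh):
    hours = amortize_years * 365 * 24 * utilization
    capex_per_hour = gpu_price_usd / hours
    energy_per_hour = avg_watts / 1000 * usd_per_kwh
    tokens_per_hour = tokens_per_sec * 3600
    return (capex_per_hour + energy_per_hour) * 1e6 / tokens_per_hour

print(cost_per_million_tokens(
    gpu_price_usd=25_000, amortize_years=3, utilization=0.6,
    tokens_per_sec=900, avg_watts=600, usd_per_kwh=0.12,
))
```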
Total cost of ownership is more than the chip price. CapEx is often only half the story. By matching hardware efficiency with market demand, we make sure AI infrastructure is profitable, not just fast.
