NVIDIA H100 80GB HBM3 (8x) - llama-2-70b-hf

November 13, 2025 at 04:02 AM

Dataset: reference (v1.0)

Best Performance

Best Output TPS: 668.64 tok/s (peak generation speed)
Best Input TPS: 855.76 tok/s (peak prefill speed)
Best Energy Efficiency: 0.79 kWh/MT (energy cost per 1M tokens)
Best TTFT (P95): 40.51 ms (lowest time to first token)
Best E2E (P95): 2,912.99 ms (lowest end-to-end request latency)
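
The kWh/MT metric converts average power draw and token throughput into energy per million tokens. A minimal sketch of that conversion (the wattage below is hypothetical; the report does not state measured power, nor whether input tokens are counted toward the total):

```python
def kwh_per_million_tokens(avg_power_watts: float, tokens_per_second: float) -> float:
    """Energy per 1M tokens: watts / (tok/s) gives joules per token; 1 kWh = 3.6e6 J."""
    joules_per_token = avg_power_watts / tokens_per_second
    return joules_per_token * 1e6 / 3.6e6

# Hypothetical example: a 3,600 W average draw at 1,000 tok/s costs 1.0 kWh/MT.
print(kwh_per_million_tokens(3600, 1000))
```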

Test Matrix Results

Performance across different input/output token combinations and concurrency levels

Input Tokens | Output Tokens | Concurrency | Output TPS | Input TPS | Energy Cost (kWh/MT) | TTFT Mean (ms) | TTFT P95 (ms) | E2E P95 (ms) | Success Rate

Best run for Output TPS:
128 | 1,024 | 32x | 668.64 | 115.25 | 1.31 | 163.02 | 193.83 | 31,169.23 | 87.5%

All runs:
128 | 128 | 1x | 29.43 | 34.52 | 6.15 | 147.99 | 147.99 | 3,504.32 | 100.0%
128 | 128 | 2x | 65.52 | 63.99 | 4.27 | 70.22 | 79.98 | 3,899.72 | 100.0%
128 | 128 | 4x | 92.38 | 97.51 | 3.64 | 79.25 | 122.47 | 3,885.31 | 75.0%
128 | 512 | 1x | 33.83 | 13.85 | 9.17 | 40.51 | 40.51 | 8,780.17 | 100.0%
128 | 512 | 2x | 64.06 | 30.27 | 4.70 | 89.09 | 108.96 | 8,191.36 | 100.0%
128 | 512 | 4x | 104.92 | 32.60 | 4.47 | 85.77 | 98.60 | 15,581.87 | 100.0%
128 | 512 | 8x | 174.72 | 58.74 | 3.68 | 78.10 | 125.54 | 15,222.39 | 87.5%
128 | 1,024 | 1x | 34.19 | 4.07 | 12.09 | 55.28 | 55.28 | 29,950.71 | 100.0%
128 | 1,024 | 2x | 67.92 | 8.29 | 8.34 | 45.78 | 46.21 | 30,153.24 | 100.0%
128 | 1,024 | 4x | 82.60 | 16.53 | 6.51 | 54.70 | 76.25 | 30,633.94 | 100.0%
128 | 1,024 | 8x | 242.34 | 33.18 | 3.67 | 66.85 | 98.30 | 30,932.80 | 100.0%
128 | 1,024 | 16x | 439.00 | 61.24 | 2.03 | 105.45 | 151.26 | 31,542.78 | 93.8%
128 | 2,048 | 1x | 33.65 | 41.47 | 7.50 | 41.31 | 41.31 | 2,912.99 | 100.0%
128 | 2,048 | 2x | 43.91 | 4.14 | 10.65 | 41.52 | 41.56 | 58,283.80 | 100.0%
128 | 2,048 | 4x | 40.99 | 8.47 | 9.74 | 54.66 | 75.52 | 51,697.89 | 100.0%
512 | 128 | 1x | 33.28 | 128.97 | 2.54 | 112.77 | 112.77 | 3,846.58 | 100.0%
512 | 128 | 2x | 66.48 | 258.37 | 1.74 | 78.50 | 110.84 | 3,846.07 | 100.0%
512 | 128 | 4x | 107.10 | 510.45 | 1.16 | 102.34 | 132.85 | 3,870.43 | 100.0%
512 | 128 | 8x | 221.67 | 855.76 | 0.80 | 119.40 | 181.02 | 4,035.02 | 87.5%
512 | 512 | 1x | 34.07 | 33.00 | 6.68 | 103.47 | 103.47 | 15,029.21 | 100.0%
512 | 512 | 2x | 67.10 | 65.20 | 3.35 | 58.78 | 72.45 | 15,257.07 | 100.0%
512 | 512 | 4x | 104.53 | 127.71 | 2.68 | 157.52 | 205.73 | 15,480.23 | 100.0%
512 | 512 | 8x | 227.49 | 220.11 | 2.15 | 116.99 | 187.28 | 15,708.55 | 87.5%
512 | 1,024 | 1x | 33.48 | 16.22 | 9.02 | 104.30 | 104.30 | 30,585.09 | 100.0%
512 | 1,024 | 2x | 33.62 | 134.35 | 2.61 | 60.31 | 74.88 | 7,011.20 | 100.0%
512 | 1,024 | 4x | 100.31 | 48.52 | 4.26 | 57.55 | 75.87 | 30,619.09 | 75.0%
512 | 2,048 | 1x | 33.80 | 8.19 | 10.81 | 46.46 | 46.46 | 60,584.47 | 100.0%
512 | 2,048 | 2x | 35.28 | 16.43 | 8.90 | 76.00 | 102.77 | 57,650.63 | 100.0%
512 | 2,048 | 4x | 59.52 | 24.12 | 6.41 | 55.47 | 73.48 | 58,067.01 | 75.0%
1,024 | 1,024 | 1x | 33.84 | 32.22 | 6.77 | 46.04 | 46.04 | 30,257.88 | 100.0%
1,024 | 1,024 | 2x | 33.68 | 32.20 | 6.85 | 150.75 | 150.75 | 30,401.76 | 50.0%
1,024 | 2,048 | 1x | 33.46 | 15.93 | 9.08 | 50.02 | 50.02 | 61,203.36 | 100.0%
1,024 | 2,048 | 2x | 39.63 | 32.15 | 6.75 | 150.59 | 152.99 | 58,273.75 | 100.0%
1,024 | 2,048 | 4x | 48.83 | 32.02 | 5.64 | 187.43 | 308.36 | 59,536.00 | 50.0%
2,048 | 128 | 1x | 31.92 | 490.77 | 0.79 | 278.85 | 278.85 | 4,010.34 | 100.0%
2,048 | 128 | 2x | 31.67 | 482.68 | 0.86 | 286.56 | 286.56 | 4,040.62 | 50.0%
2,048 | 512 | 1x | 32.99 | 126.79 | 2.77 | 284.03 | 284.03 | 15,522.11 | 100.0%
2,048 | 512 | 2x | 65.57 | 250.93 | 1.42 | 299.23 | 313.38 | 15,614.53 | 100.0%
2,048 | 512 | 4x | 118.78 | 488.53 | 1.03 | 294.57 | 505.33 | 16,038.52 | 100.0%
2,048 | 512 | 8x | 187.48 | 716.34 | 1.10 | 500.67 | 807.84 | 16,381.89 | 75.0%
2,048 | 1,024 | 1x | 33.62 | 64.62 | 4.62 | 44.91 | 44.91 | 30,454.31 | 100.0%
2,048 | 1,024 | 2x | 33.47 | 318.58 | 1.24 | 53.40 | 53.40 | 6,092.78 | 50.0%
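
Note that the headline "best" numbers do not all come from clean runs: the 668.64 output-TPS run succeeded only 87.5% of the time. A minimal sketch of picking the fastest run subject to a success-rate floor, using a few rows transcribed from the matrix above:

```python
# A few rows from the matrix above:
# (input_tok, output_tok, concurrency, output_tps, kwh_per_mt, success_pct)
rows = [
    (128, 1024, 32, 668.64, 1.31, 87.5),
    (128, 1024, 16, 439.00, 2.03, 93.8),
    (128, 1024, 8, 242.34, 3.67, 100.0),
    (512, 128, 8, 221.67, 0.80, 87.5),
]

def best_run(rows, min_success=90.0):
    """Fastest run (by output TPS) among runs meeting the success-rate floor."""
    eligible = [r for r in rows if r[5] >= min_success]
    return max(eligible, key=lambda r: r[3])

# With a 90% floor, the 16x run wins; the 32x headline run is filtered out.
print(best_run(rows))
```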

Hardware Configuration

GPU Manufacturer: NVIDIA
GPU Model: NVIDIA H100 80GB HBM3
GPU Count: 8
GPU Memory (Total): 632 GB
GPU Driver: 570.195.03
CUDA Version: Unknown
Compute Capability: 9.0
Power Limit (per GPU): 700 W
CPU Model: Intel(R) Xeon(R) Platinum 8480+
RAM: 1,772 GB

Software Configuration

Inference Framework: vLLM
Framework Version: 0.11.0
OS: Ubuntu
OS Version: 22.04.5 LTS (Jammy Jellyfish)
Kernel Version: 5.15.0-88-generic
Python Version: 3.10.12

Model Configuration

Provider: meta-llama
Model Name: llama-2-70b-hf
Quantization: FP16

Inference Configuration

Runtime parameters used across all benchmark runs

Max Model Length: 8192
Tensor Parallel Size: 1
Pipeline Parallel Size: 1
GPU Memory Utilization: 0.95
Temperature: 0.70
Top-P: 1.00
Top-K: -1
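
These runtime parameters map onto vLLM keyword arguments. A sketch of the corresponding engine and sampling settings, assuming vLLM 0.11.0 as listed above (the dtype value is an assumption inferred from the FP16 quantization line; the report does not state it):

```python
# Hypothetical reconstruction of this report's runtime parameters in
# vLLM-style keyword arguments; would be passed as LLM(**engine_args)
# and SamplingParams(**sampling_args).
engine_args = {
    "model": "meta-llama/Llama-2-70b-hf",  # Provider / Model Name
    "max_model_len": 8192,
    "tensor_parallel_size": 1,
    "pipeline_parallel_size": 1,
    "gpu_memory_utilization": 0.95,
    "dtype": "float16",                    # assumed from "Quantization: FP16"
}

sampling_args = {
    "temperature": 0.70,
    "top_p": 1.00,
    "top_k": -1,                           # in vLLM, -1 disables top-k filtering
}

print(engine_args["model"], sampling_args["temperature"])
```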