NVIDIA H100 80GB HBM3 (8x) - llama-2-70b-hf

November 13, 2025 at 04:02 AM

Dataset: reference (v1.0)

Best Performance

Best Output TPS: 668.64 tok/s (peak generation speed)
Best Input TPS: 855.76 tok/s (peak prefill speed)
Best Energy Efficiency: 0.79 kWh/MT (energy cost per 1M tokens)
Best TTFT (P95): 40.51 ms (lowest time to first token)
Best E2E (P95): 2,912.99 ms (lowest end-to-end request latency)
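
The kWh/MT metric converts average power draw and token throughput into energy per million tokens. A minimal sketch of that conversion (the wattage below is hypothetical; the report does not state measured power, nor whether input tokens are counted toward the total):

```python
def kwh_per_million_tokens(avg_power_watts: float, tokens_per_second: float) -> float:
    """Energy per 1M tokens: watts / (tok/s) gives joules per token; 1 kWh = 3.6e6 J."""
    joules_per_token = avg_power_watts / tokens_per_second
    return joules_per_token * 1e6 / 3.6e6

# Hypothetical example: a 3,600 W average draw at 1,000 tok/s costs 1.0 kWh/MT.
print(kwh_per_million_tokens(3600, 1000))
```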

Test Matrix Results

Performance across different input/output token combinations and concurrency levels

Input Tokens | Output Tokens | Concurrency | Output TPS | Input TPS | Energy Cost (kWh/MT) | TTFT Mean (ms) | TTFT P95 (ms) | E2E P95 (ms) | Success Rate

Best run for Output TPS:
128 | 1,024 | 32x | 668.64 | 115.25 | 1.31 | 163.02 | 193.83 | 31,169.23 | 87.5%

All runs:
128 | 128 | 1x | 29.43 | 34.52 | 6.15 | 147.99 | 147.99 | 3,504.32 | 100.0%
128 | 128 | 2x | 65.52 | 63.99 | 4.27 | 70.22 | 79.98 | 3,899.72 | 100.0%
128 | 128 | 4x | 92.38 | 97.51 | 3.64 | 79.25 | 122.47 | 3,885.31 | 75.0%
128 | 512 | 1x | 33.83 | 13.85 | 9.17 | 40.51 | 40.51 | 8,780.17 | 100.0%
128 | 512 | 2x | 64.06 | 30.27 | 4.70 | 89.09 | 108.96 | 8,191.36 | 100.0%
128 | 512 | 4x | 104.92 | 32.60 | 4.47 | 85.77 | 98.60 | 15,581.87 | 100.0%
128 | 512 | 8x | 174.72 | 58.74 | 3.68 | 78.10 | 125.54 | 15,222.39 | 87.5%
128 | 1,024 | 1x | 34.19 | 4.07 | 12.09 | 55.28 | 55.28 | 29,950.71 | 100.0%
128 | 1,024 | 2x | 67.92 | 8.29 | 8.34 | 45.78 | 46.21 | 30,153.24 | 100.0%
128 | 1,024 | 4x | 82.60 | 16.53 | 6.51 | 54.70 | 76.25 | 30,633.94 | 100.0%
128 | 1,024 | 8x | 242.34 | 33.18 | 3.67 | 66.85 | 98.30 | 30,932.80 | 100.0%
128 | 1,024 | 16x | 439.00 | 61.24 | 2.03 | 105.45 | 151.26 | 31,542.78 | 93.8%
128 | 2,048 | 1x | 33.65 | 41.47 | 7.50 | 41.31 | 41.31 | 2,912.99 | 100.0%
128 | 2,048 | 2x | 43.91 | 4.14 | 10.65 | 41.52 | 41.56 | 58,283.80 | 100.0%
128 | 2,048 | 4x | 40.99 | 8.47 | 9.74 | 54.66 | 75.52 | 51,697.89 | 100.0%
512 | 128 | 1x | 33.28 | 128.97 | 2.54 | 112.77 | 112.77 | 3,846.58 | 100.0%
512 | 128 | 2x | 66.48 | 258.37 | 1.74 | 78.50 | 110.84 | 3,846.07 | 100.0%
512 | 128 | 4x | 107.10 | 510.45 | 1.16 | 102.34 | 132.85 | 3,870.43 | 100.0%
512 | 128 | 8x | 221.67 | 855.76 | 0.80 | 119.40 | 181.02 | 4,035.02 | 87.5%
512 | 512 | 1x | 34.07 | 33.00 | 6.68 | 103.47 | 103.47 | 15,029.21 | 100.0%
512 | 512 | 2x | 67.10 | 65.20 | 3.35 | 58.78 | 72.45 | 15,257.07 | 100.0%
512 | 512 | 4x | 104.53 | 127.71 | 2.68 | 157.52 | 205.73 | 15,480.23 | 100.0%
512 | 512 | 8x | 227.49 | 220.11 | 2.15 | 116.99 | 187.28 | 15,708.55 | 87.5%
512 | 1,024 | 1x | 33.48 | 16.22 | 9.02 | 104.30 | 104.30 | 30,585.09 | 100.0%
512 | 1,024 | 2x | 33.62 | 134.35 | 2.61 | 60.31 | 74.88 | 7,011.20 | 100.0%
512 | 1,024 | 4x | 100.31 | 48.52 | 4.26 | 57.55 | 75.87 | 30,619.09 | 75.0%
512 | 2,048 | 1x | 33.80 | 8.19 | 10.81 | 46.46 | 46.46 | 60,584.47 | 100.0%
512 | 2,048 | 2x | 35.28 | 16.43 | 8.90 | 76.00 | 102.77 | 57,650.63 | 100.0%
512 | 2,048 | 4x | 59.52 | 24.12 | 6.41 | 55.47 | 73.48 | 58,067.01 | 75.0%
1,024 | 1,024 | 1x | 33.84 | 32.22 | 6.77 | 46.04 | 46.04 | 30,257.88 | 100.0%
1,024 | 1,024 | 2x | 33.68 | 32.20 | 6.85 | 150.75 | 150.75 | 30,401.76 | 50.0%
1,024 | 2,048 | 1x | 33.46 | 15.93 | 9.08 | 50.02 | 50.02 | 61,203.36 | 100.0%
1,024 | 2,048 | 2x | 39.63 | 32.15 | 6.75 | 150.59 | 152.99 | 58,273.75 | 100.0%
1,024 | 2,048 | 4x | 48.83 | 32.02 | 5.64 | 187.43 | 308.36 | 59,536.00 | 50.0%
2,048 | 128 | 1x | 31.92 | 490.77 | 0.79 | 278.85 | 278.85 | 4,010.34 | 100.0%
2,048 | 128 | 2x | 31.67 | 482.68 | 0.86 | 286.56 | 286.56 | 4,040.62 | 50.0%
2,048 | 512 | 1x | 32.99 | 126.79 | 2.77 | 284.03 | 284.03 | 15,522.11 | 100.0%
2,048 | 512 | 2x | 65.57 | 250.93 | 1.42 | 299.23 | 313.38 | 15,614.53 | 100.0%
2,048 | 512 | 4x | 118.78 | 488.53 | 1.03 | 294.57 | 505.33 | 16,038.52 | 100.0%
2,048 | 512 | 8x | 187.48 | 716.34 | 1.10 | 500.67 | 807.84 | 16,381.89 | 75.0%
2,048 | 1,024 | 1x | 33.62 | 64.62 | 4.62 | 44.91 | 44.91 | 30,454.31 | 100.0%
2,048 | 1,024 | 2x | 33.47 | 318.58 | 1.24 | 53.40 | 53.40 | 6,092.78 | 50.0%
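
Note that the headline "best" numbers do not all come from clean runs: the 668.64 output-TPS run succeeded only 87.5% of the time. A minimal sketch of picking the fastest run subject to a success-rate floor, using a few rows transcribed from the matrix above:

```python
# A few rows from the matrix above:
# (input_tok, output_tok, concurrency, output_tps, kwh_per_mt, success_pct)
rows = [
    (128, 1024, 32, 668.64, 1.31, 87.5),
    (128, 1024, 16, 439.00, 2.03, 93.8),
    (128, 1024, 8, 242.34, 3.67, 100.0),
    (512, 128, 8, 221.67, 0.80, 87.5),
]

def best_run(rows, min_success=90.0):
    """Fastest run (by output TPS) among runs meeting the success-rate floor."""
    eligible = [r for r in rows if r[5] >= min_success]
    return max(eligible, key=lambda r: r[3])

# With a 90% floor, the 16x run wins; the 32x headline run is filtered out.
print(best_run(rows))
```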

Hardware Configuration

GPU Manufacturer: NVIDIA
GPU Model: NVIDIA H100 80GB HBM3
GPU Count: 8
GPU Memory (Total): 632 GB
GPU Driver: 570.195.03
CUDA Version: Unknown
Compute Capability: 9.0
Power Limit (per GPU): 700 W
CPU Model: Intel(R) Xeon(R) Platinum 8480+
RAM: 1,772 GB

Software Configuration

Inference Framework: vLLM
Framework Version: 0.11.0
OS: Ubuntu
OS Version: 22.04.5 LTS (Jammy Jellyfish)
Kernel Version: 5.15.0-88-generic
Python Version: 3.10.12

Model Configuration

Provider: meta-llama
Model Name: llama-2-70b-hf
Quantization: FP16

Inference Configuration

Runtime parameters used across all benchmark runs

Max Model Length: 8192
Tensor Parallel Size: 1
Pipeline Parallel Size: 1
GPU Memory Utilization: 0.95
Temperature: 0.70
Top-P: 1.00
Top-K: -1
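
These runtime parameters map onto vLLM keyword arguments. A sketch of the corresponding engine and sampling settings, assuming vLLM 0.11.0 as listed above (the dtype value is an assumption inferred from the FP16 quantization line; the report does not state it):

```python
# Hypothetical reconstruction of this report's runtime parameters in
# vLLM-style keyword arguments; would be passed as LLM(**engine_args)
# and SamplingParams(**sampling_args).
engine_args = {
    "model": "meta-llama/Llama-2-70b-hf",  # Provider / Model Name
    "max_model_len": 8192,
    "tensor_parallel_size": 1,
    "pipeline_parallel_size": 1,
    "gpu_memory_utilization": 0.95,
    "dtype": "float16",                    # assumed from "Quantization: FP16"
}

sampling_args = {
    "temperature": 0.70,
    "top_p": 1.00,
    "top_k": -1,                           # in vLLM, -1 disables top-k filtering
}

print(engine_args["model"], sampling_args["temperature"])
```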