Gemma-4 31B on vLLM with RTX 6000 PRO Blackwell: Benchmark

Model Overview

Gemma-4-31B-it-FP8 is a 30.7B parameter dense Transformer built by Google DeepMind, designed for frontier-level reasoning, coding, multimodal understanding, and agentic workflows. It supports a 256K-token context window, accepts text, image, and video input, and features a hybrid attention mechanism interleaving local sliding-window and full global attention with Proportional RoPE for long-context performance. The model includes native function calling, toggleable thinking mode for step-by-step reasoning, and multilingual support across 140+ languages.

This FP8-block checkpoint was quantized by RedHatAI using LLM Compressor, compressing the weights and activations of linear operators within transformer blocks to the FP8 data type while keeping the vision tower, embedding, and output head layers in their original precision.

HF path: https://huggingface.co/RedHatAI/gemma-4-31B-it-FP8-block

Model Name	Gemma/Gemma-4 31B it
HF Checkpoint	RedHatAI/gemma-4-31B-it-FP8-block
Quantization	FP-8
Max model length (context)	4096

Hardware

GPU Name	RTX PRO 6000 Blackwell
No of GPUS	1
VRAM	96GB
CPU	24 vCPU with 218 GB RAM

Serving Engine

Engine	vLLM
Version	0.20
Configs	`'max_model_len': 4096, 'gpu_memory_utilization': 0.9, 'enable_prefix_caching': True, 'language_model_only': True, 'max_num_batched_tokens': 4096, 'max_num_seqs': 21, 'enable_chunked_prefill': True`
Endpoint	`/v1/chat/completions`

Dataset / Workload

Name	ShareGPT sample
Unique Prompts	128 prompts per concurrency level
Conversation Shape	Multi-turn response per request
Languages	Multilingual with en/zh/ru/th/ko/fr/pl/ja
Max output tokens	1024
Temperature per prompt	0.2
Streaming	ON
Concurrency Levels	12, 16, 20, 24

Performance Charts

Request Concurrency Timings in Charts

Concurrency Level	Time Range in Chart
12	17:19 - 17:23
16	17:24 - 17:27
20	17:27 - 17:30
24	17:30 - 17:33

Token Throughput

Time To First Token (TTFT)

Queue Time

E2E Latency

TPOT Latency

Verdict

Showing Avg and Max only.

Figures are aggregated across the whole sweep, so Avg ≈ behavior across concurrency 12→24 and Max ≈ behavior at the top end (~concurrency 24) — not broken out per level.

Category	Metric	Avg	Max	Unit	What it tells you
Throughput	Prompt	444	708	tok/s	Prefill rate; scales well under load
	Output	380	548	tok/s	Generation rate; the user-facing number
	Total	823	1,168	tok/s	Combined; peaks ~42% above average
Load	Active requests	13.7	21	reqs	Requests being processed concurrently
	Queued requests	0.41	3	reqs	Near-empty queue → server keeping up
E2E latency	p50	27.1	35	s	Typical full-response time
	p95	40.1	51.3	s	Slow-tail full-response time
	p99	41.9	58.3	s	Worst-case full-response time
TTFT	p50	0.73	3.65	s	Median time to first token — snappy
	p95	2.82	15.6	s	Tail first-token wait
	p99	3.65	19.1	s	Worst-case first-token wait spikes hard

Summary:

Server stayed healthy and unsaturated across the whole sweep. The clearest evidence is the queue: it averaged 0.41 and peaked at only 3 requests while ~14–21 requests were actively processing. Requests were being worked on, not waiting — the scheduler had headroom even at concurrency 24.
Throughput scaled cleanly to a 1,168 tok/s total peak (548 tok/s output). The gap between avg and max is just the ramp from concurrency 12 up to 24, which is expected.
High E2E latency (p50 27s, p99 58s) is driven by long generations, not server inefficiency. At 380 output tok/s across 14 active requests (27 tok/s per request), responses are roughly 700+ output tokens each. Long outputs naturally mean long end-to-end times; the near-zero queue confirms it isn't a backlog problem.
The one genuine stress signal is tail TTFT. Median first-token is excellent (~0.73s), but p99 spikes to 19s at peak load — under heaviest concurrency, a small fraction of requests wait noticeably before streaming starts. This is the first thing that degrades as you push higher.

Result

The vLLM server and Gemma-4 model didn't break a sweat. Near-zero queueing and ~1.17k tok/s throughput at concurrency 24 mean there's room to climb — high E2E latency is just long generations, and the only metric flexing under load is tail TTFT (p99 ~19s).

Complete Walkthrough Video

https://youtu.be/88ZbdsJYDxk

Gemma-4 31B + vLLM on RTX 6000 PRO : 1.17k tokens/sec and still asking for more

Model Overview

Hardware

Serving Engine

Dataset / Workload