Gemma-4 31B + vLLM on RTX PRO 6000: 1.17k tokens/sec and still asking for more
Throughput, latency, and queue depth for Gemma-4 31B served on vLLM under progressive load, from concurrency 12 to 24. The numbers that matter: 1.17k tok/s peak, ~0.7s median TTFT, and tail latency as the one thing to watch.

Model Overview
Gemma-4-31B-it-FP8 is a 30.7B parameter dense Transformer built by Google DeepMind, designed for frontier-level reasoning, coding, multimodal understanding, and agentic workflows. It supports a 256K-token context window, accepts text, image, and video input, and features a hybrid attention mechanism interleaving local sliding-window and full global attention with Proportional RoPE for long-context performance. The model includes native function calling, toggleable thinking mode for step-by-step reasoning, and multilingual support across 140+ languages.
This FP8-block checkpoint was quantized by RedHatAI using LLM Compressor, compressing the weights and activations of linear operators within transformer blocks to the FP8 data type while keeping the vision tower, embedding, and output head layers in their original precision.
HF path: https://huggingface.co/RedHatAI/gemma-4-31B-it-FP8-block
| Model Name | Gemma/Gemma-4 31B it |
|---|---|
| HF Checkpoint | RedHatAI/gemma-4-31B-it-FP8-block |
| Quantization | FP8 |
| Max model length (context) | 4096 |
Hardware
| GPU Name | RTX PRO 6000 Blackwell |
|---|---|
| Number of GPUs | 1 |
| VRAM | 96 GB |
| CPU | 24 vCPU with 218 GB RAM |
Serving Engine
| Engine | vLLM |
|---|---|
| Version | 0.20 |
| Configs | 'max_model_len': 4096, 'gpu_memory_utilization': 0.9, 'enable_prefix_caching': True, 'language_model_only': True, 'max_num_batched_tokens': 4096, 'max_num_seqs': 21, 'enable_chunked_prefill': True |
| Endpoint | /v1/chat/completions |
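For readers who want to reproduce the setup, the Configs row above can be turned into a launch command. The following is a minimal sketch, not the author's launch script: it assumes each config key corresponds to a `vllm serve` flag of the same name with underscores replaced by hyphens, and that boolean options are passed as bare flags. That mapping should be checked against `vllm serve --help` for the installed version. The helper name `to_serve_command` is hypothetical.

```python
# Engine settings copied from the Configs row of the Serving Engine table.
ENGINE_CONFIG = {
    "max_model_len": 4096,
    "gpu_memory_utilization": 0.9,
    "enable_prefix_caching": True,
    "language_model_only": True,
    "max_num_batched_tokens": 4096,
    "max_num_seqs": 21,
    "enable_chunked_prefill": True,
}

MODEL = "RedHatAI/gemma-4-31B-it-FP8-block"


def to_serve_command(model: str, config: dict) -> list[str]:
    """Build an argument list for `vllm serve` from a config dict.

    Assumption: key `foo_bar` maps to flag `--foo-bar`; True booleans
    become bare flags and False booleans are omitted.
    """
    cmd = ["vllm", "serve", model]
    for key, value in config.items():
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            if value:
                cmd.append(flag)
        else:
            cmd.extend([flag, str(value)])
    return cmd


print(" ".join(to_serve_command(MODEL, ENGINE_CONFIG)))
```

Building the command from the same dictionary that is reported in the table keeps the published config and the launched config from drifting apart.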
Dataset / Workload
| Name | ShareGPT sample |
|---|---|
| Unique Prompts | 128 prompts per concurrency level |
| Conversation Shape | Multi-turn response per request |
| Languages | Multilingual with en/zh/ru/th/ko/fr/pl/ja |
| Max output tokens | 1024 |
| Temperature per prompt | 0.2 |
| Streaming | ON |
| Concurrency Levels | 12, 16, 20, 24 |
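The workload above amounts to a fixed-concurrency sweep: at each level, 128 prompts are sent while a cap limits how many are in flight at once. The sketch below illustrates that pattern with an `asyncio.Semaphore`. It is an assumption about how such a sweep could be driven, not the tool used for this benchmark, and `fake_request` is a hypothetical stand-in for a streaming call to `/v1/chat/completions`.

```python
import asyncio

CONCURRENCY_LEVELS = [12, 16, 20, 24]
PROMPTS_PER_LEVEL = 128


async def fake_request(prompt: str) -> str:
    """Hypothetical stand-in for a streaming chat-completions request."""
    await asyncio.sleep(0.001)
    return f"response to {prompt}"


async def run_level(prompts, concurrency, send=fake_request):
    """Send every prompt while keeping at most `concurrency` in flight."""
    gate = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def one(prompt):
        nonlocal in_flight, peak
        async with gate:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await send(prompt)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(one(p) for p in prompts))
    return results, peak


async def run_sweep():
    """Run each concurrency level in turn; report (responses, peak in flight)."""
    summary = {}
    for level in CONCURRENCY_LEVELS:
        prompts = [f"prompt-{i}" for i in range(PROMPTS_PER_LEVEL)]
        results, peak = await run_level(prompts, level)
        summary[level] = (len(results), peak)
    return summary


if __name__ == "__main__":
    print(asyncio.run(run_sweep()))
```

Swapping `fake_request` for a real HTTP client call would turn this into a working load driver; the semaphore is what defines "concurrency" in the table.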
Performance Charts
Request Concurrency Timings in Charts
| Concurrency Level | Time Range in Chart |
|---|---|
| 12 | 17:19 - 17:23 |
| 16 | 17:24 - 17:27 |
| 20 | 17:27 - 17:30 |
| 24 | 17:30 - 17:33 |
[Charts: Token Throughput, Time To First Token (TTFT), Queue Time, E2E Latency, TPOT Latency]
Verdict
Showing Avg and Max only.
Figures are aggregated across the whole sweep, so Avg ≈ behavior across concurrency 12→24 and Max ≈ behavior at the top end (~concurrency 24) — not broken out per level.
| Category | Metric | Avg | Max | Unit | What it tells you |
|---|---|---|---|---|---|
| Throughput | Prompt | 444 | 708 | tok/s | Prefill rate; scales well under load |
| Throughput | Output | 380 | 548 | tok/s | Generation rate; the user-facing number |
| Throughput | Total | 823 | 1,168 | tok/s | Combined; peaks ~42% above average |
| Load | Active requests | 13.7 | 21 | reqs | Requests being processed concurrently |
| Load | Queued requests | 0.41 | 3 | reqs | Near-empty queue → server keeping up |
| E2E latency | p50 | 27.1 | 35 | s | Typical full-response time |
| E2E latency | p95 | 40.1 | 51.3 | s | Slow-tail full-response time |
| E2E latency | p99 | 41.9 | 58.3 | s | Worst-case full-response time |
| TTFT | p50 | 0.73 | 3.65 | s | Median time to first token; snappy |
| TTFT | p95 | 2.82 | 15.6 | s | Tail first-token wait |
| TTFT | p99 | 3.65 | 19.1 | s | Worst-case first-token wait spikes hard |
Summary:
Server stayed healthy and unsaturated across the whole sweep. The clearest evidence is the queue: it averaged 0.41 and peaked at only 3 requests while ~14–21 requests were actively processing. Requests were being worked on, not waiting — the scheduler had headroom even at concurrency 24.
Throughput scaled cleanly to a 1,168 tok/s total peak (548 tok/s output). The gap between avg and max is just the ramp from concurrency 12 up to 24, which is expected.
High E2E latency (p50 27s, p99 58s) is driven by long generations, not server inefficiency. At 380 output tok/s across ~14 active requests, each request streams at about 27 tok/s; over a 27s median response that works out to roughly 700+ output tokens, which fits under the 1024-token output cap. Long outputs naturally mean long end-to-end times, and the near-zero queue confirms it isn't a backlog problem.
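The token estimate in the paragraph above is a back-of-the-envelope calculation, and it can be reproduced from three numbers in the Verdict table. This sketch assumes the whole median E2E time is spent decoding (TTFT is ignored) and that output throughput is shared evenly across active requests, so the result is a rough upper estimate.

```python
# Averages taken from the Verdict table.
AVG_OUTPUT_TOK_S = 380      # aggregate generation rate
AVG_ACTIVE_REQUESTS = 13.7  # requests decoding concurrently
P50_E2E_S = 27.1            # median end-to-end latency
MAX_OUTPUT_TOKENS = 1024    # cap from the workload table


def per_request_rate(total_tok_s: float, active: float) -> float:
    """Tokens per second seen by one request, assuming an even share."""
    return total_tok_s / active


def estimated_output_tokens(rate_tok_s: float, e2e_s: float) -> float:
    """Approximate tokens per response if all of E2E were decode time."""
    return rate_tok_s * e2e_s


rate = per_request_rate(AVG_OUTPUT_TOK_S, AVG_ACTIVE_REQUESTS)
tokens = estimated_output_tokens(rate, P50_E2E_S)

print(f"per-request rate: {rate:.1f} tok/s")
print(f"estimated tokens per response: {tokens:.0f}")
print(f"implied time per output token: {1000 / rate:.0f} ms")
```

The estimate lands in the 700 to 800 token range, below the 1024-token cap, which is consistent with the claim that response length rather than queueing dominates E2E latency.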
The one genuine stress signal is tail TTFT. Median first-token is excellent (~0.73s), but p99 spikes to 19s at peak load — under heaviest concurrency, a small fraction of requests wait noticeably before streaming starts. This is the first thing that degrades as you push higher.
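One possible reading of the tail TTFT, offered as an interpretation rather than something measured here: the peak of 21 active requests equals the configured `max_num_seqs` of 21, so at concurrency 24 up to three requests would have no sequence slot and would wait in the queue, which matches the peak queue depth of 3. The short sketch below makes that arithmetic explicit; `expected_overflow` is a hypothetical helper, not a vLLM function.

```python
MAX_NUM_SEQS = 21                    # from the Serving Engine config
CONCURRENCY_LEVELS = [12, 16, 20, 24]


def expected_overflow(concurrency: int, max_num_seqs: int = MAX_NUM_SEQS) -> int:
    """Requests that cannot get a sequence slot when all clients are busy."""
    return max(0, concurrency - max_num_seqs)


overflow = {level: expected_overflow(level) for level in CONCURRENCY_LEVELS}

for level, waiting in overflow.items():
    print(f"concurrency {level}: up to {waiting} request(s) waiting for a slot")
```

If this reading holds, the first three levels never exceed the slot limit and only the top level can queue, so rerunning concurrency 24 with a higher `max_num_seqs` would be one way to test whether the p99 TTFT spike comes from slot waiting.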
Result
The vLLM server and Gemma-4 model didn't break a sweat. Near-zero queueing and ~1.17k tok/s throughput at concurrency 24 mean there's room to climb — high E2E latency is just long generations, and the only metric flexing under load is tail TTFT (p99 ~19s).




