Gemma-4 31B + vLLM on RTX PRO 6000: 1.17k tokens/sec and still asking for more

Throughput, latency, and queue depth for Gemma-4 31B served on vLLM under progressive load, from concurrency 12 to 24. The numbers that matter: 1.17k tok/s peak, ~0.7s median TTFT, and tail latency as the one thing to watch.

HexGrid.cloud is a managed inference platform for deploying and fine-tuning open-source AI models. It gives developers and AI teams a faster way to deploy models like Llama, Qwen, Gemma, DeepSeek, embedding models, rerankers, and other production inference workloads on dedicated GPU infrastructure. Instead of stitching together cloud GPUs, serving frameworks, storage, gateways, certificates, authentication, logs, and billing systems yourself, HexGrid.cloud provides a unified deployment path.

Model Overview

Gemma-4-31B-it-FP8 is a 30.7B parameter dense Transformer built by Google DeepMind, designed for frontier-level reasoning, coding, multimodal understanding, and agentic workflows. It supports a 256K-token context window, accepts text, image, and video input, and features a hybrid attention mechanism interleaving local sliding-window and full global attention with Proportional RoPE for long-context performance. The model includes native function calling, toggleable thinking mode for step-by-step reasoning, and multilingual support across 140+ languages.

This FP8-block checkpoint was quantized by RedHatAI using LLM Compressor, compressing the weights and activations of linear operators within transformer blocks to the FP8 data type while keeping the vision tower, embedding, and output head layers in their original precision.

HF path: https://huggingface.co/RedHatAI/gemma-4-31B-it-FP8-block

Model Name: Gemma/Gemma-4 31B it
HF Checkpoint: RedHatAI/gemma-4-31B-it-FP8-block
Quantization: FP8
Max model length (context): 4096

Hardware

GPU Name: RTX PRO 6000 Blackwell
Number of GPUs: 1
VRAM: 96 GB
CPU: 24 vCPU with 218 GB RAM

Serving Engine

Engine: vLLM
Version: 0.20
Configs: 'max_model_len': 4096, 'gpu_memory_utilization': 0.9, 'enable_prefix_caching': True, 'language_model_only': True, 'max_num_batched_tokens': 4096, 'max_num_seqs': 21, 'enable_chunked_prefill': True
Endpoint: /v1/chat/completions
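
The config above is written as Python keyword arguments. As a minimal sketch, the snippet below turns that dict into a `vllm serve` command line, assuming the usual vLLM convention that a snake_case engine argument maps to a `--kebab-case` flag and that a `True` value becomes a bare flag. The helper name `to_cli_args` is illustrative, and whether every flag (in particular `--language-model-only`) is accepted depends on the vLLM version in use, so check `vllm serve --help` before relying on it.

```python
import shlex

# Values copied from the "Configs" row above.
CONFIG = {
    "max_model_len": 4096,
    "gpu_memory_utilization": 0.9,
    "enable_prefix_caching": True,
    "language_model_only": True,
    "max_num_batched_tokens": 4096,
    "max_num_seqs": 21,
    "enable_chunked_prefill": True,
}
MODEL = "RedHatAI/gemma-4-31B-it-FP8-block"


def to_cli_args(config):
    """Turn a config dict into CLI flags (snake_case -> --kebab-case).

    True becomes a bare flag; False is skipped in this sketch, which
    leaves the server default in place; other values follow the flag.
    """
    args = []
    for key, value in config.items():
        flag = "--" + key.replace("_", "-")
        if value is True:
            args.append(flag)
        elif value is False:
            continue
        else:
            args.extend([flag, str(value)])
    return args


command = ["vllm", "serve", MODEL] + to_cli_args(CONFIG)
print(shlex.join(command))
```

Note that `max_num_seqs` is 21, which is below the highest concurrency level tested (24); that matters when reading the queue numbers later on.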

Dataset / Workload

Name: ShareGPT sample
Unique Prompts: 128 prompts per concurrency level
Conversation Shape: Multi-turn response per request
Languages: Multilingual with en/zh/ru/th/ko/fr/pl/ja
Max output tokens: 1024
Temperature per prompt: 0.2
Streaming: ON
Concurrency Levels: 12, 16, 20, 24
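
The post does not say which load-generation tool produced this workload. As a rough sketch of the same shape, the snippet below builds an OpenAI-compatible chat request with the sampling settings from the table and replays a prompt set at each concurrency level using an `asyncio.Semaphore`. The names `build_payload`, `run_level`, and `run_sweep` are hypothetical, and `send` is a placeholder for whatever async HTTP client actually posts to `/v1/chat/completions`.

```python
import asyncio

# Sampling settings copied from the workload table above.
MAX_OUTPUT_TOKENS = 1024
TEMPERATURE = 0.2
CONCURRENCY_LEVELS = [12, 16, 20, 24]


def build_payload(model, messages):
    """Request body for an OpenAI-compatible /v1/chat/completions call."""
    return {
        "model": model,
        "messages": messages,
        "max_tokens": MAX_OUTPUT_TOKENS,
        "temperature": TEMPERATURE,
        "stream": True,
    }


async def run_level(send, payloads, concurrency):
    """Send every payload while keeping at most `concurrency` in flight."""
    gate = asyncio.Semaphore(concurrency)

    async def one(payload):
        async with gate:
            return await send(payload)

    return await asyncio.gather(*(one(p) for p in payloads))


async def run_sweep(send, payloads, levels=CONCURRENCY_LEVELS):
    """Run the same payloads once per concurrency level, in order."""
    results = {}
    for level in levels:
        results[level] = await run_level(send, payloads, level)
    return results
```

Because `send` is injected, the same runner can be exercised with a stub before pointing it at a real endpoint.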

Performance Charts

Request Concurrency Timings in Charts

| Concurrency Level | Time Range in Chart |
|---|---|
| 12 | 17:19 - 17:23 |
| 16 | 17:24 - 17:27 |
| 20 | 17:27 - 17:30 |
| 24 | 17:30 - 17:33 |

Token Throughput

Time To First Token (TTFT)

Queue Time

E2E Latency

TPOT Latency
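
For readers less familiar with the latency metrics charted in this section (TTFT, queue time, E2E latency, TPOT), the sketch below derives them from per-request timestamps. It assumes the commonly used definitions; the function name `stream_metrics` and its arguments are illustrative, and the post does not state exactly how its monitoring stack measures each value (server-side metrics, for example, usually start the clock at request arrival on the server, not at client send).

```python
def stream_metrics(sent_at, started_at, token_times):
    """Derive per-request latency metrics from timestamps in seconds.

    sent_at:     when the request was sent
    started_at:  when the server began processing it (left the queue)
    token_times: arrival time of each streamed output token, in order
    """
    if not token_times:
        raise ValueError("need at least one output token")
    first, last = token_times[0], token_times[-1]
    n = len(token_times)
    return {
        "queue_time": started_at - sent_at,
        "ttft": first - sent_at,
        "e2e": last - sent_at,
        # TPOT averages the gaps after the first token, so it needs n >= 2.
        "tpot": (last - first) / (n - 1) if n > 1 else None,
    }
```

Under these definitions E2E latency is roughly TTFT plus TPOT times the number of remaining tokens, which is why long generations dominate E2E even when TTFT is small.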


Verdict

Showing Avg and Max only.

Figures are aggregated across the whole sweep, so Avg ≈ behavior across concurrency 12→24 and Max ≈ behavior at the top end (~concurrency 24) — not broken out per level.

| Category | Metric | Avg | Max | Unit | What it tells you |
|---|---|---|---|---|---|
| Throughput | Prompt | 444 | 708 | tok/s | Prefill rate; scales well under load |
| Throughput | Output | 380 | 548 | tok/s | Generation rate; the user-facing number |
| Throughput | Total | 823 | 1,168 | tok/s | Combined; peaks ~42% above average |
| Load | Active requests | 13.7 | 21 | reqs | Requests being processed concurrently |
| Load | Queued requests | 0.41 | 3 | reqs | Near-empty queue → server keeping up |
| E2E latency | p50 | 27.1 | 35 | s | Typical full-response time |
| E2E latency | p95 | 40.1 | 51.3 | s | Slow-tail full-response time |
| E2E latency | p99 | 41.9 | 58.3 | s | Worst-case full-response time |
| TTFT | p50 | 0.73 | 3.65 | s | Median time to first token; snappy |
| TTFT | p95 | 2.82 | 15.6 | s | Tail first-token wait |
| TTFT | p99 | 3.65 | 19.1 | s | Worst-case first-token wait spikes hard |
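
A quick arithmetic check of the throughput rows, using only the numbers in the table. The average prompt and output rates add up to the average total (within rounding), while the two maxima add up to more than the reported total peak. One plausible reading, stated here as an assumption, is that the prefill and decode peaks did not fall in the same sampling interval.

```python
# Avg and Max throughput in tok/s, copied from the table above.
prompt = {"avg": 444, "max": 708}
output = {"avg": 380, "max": 548}
total = {"avg": 823, "max": 1168}

# Averages add up (to within rounding): 444 + 380 = 824 vs. 823 reported.
avg_sum = prompt["avg"] + output["avg"]

# Maxima do not: 708 + 548 = 1256, above the reported 1168 total peak.
max_sum = prompt["max"] + output["max"]

# Peak total relative to average total, as a percentage.
peak_over_avg_pct = (total["max"] / total["avg"] - 1) * 100

print(avg_sum, max_sum, round(peak_over_avg_pct, 1))  # prints: 824 1256 41.9
```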

Summary:

  • Server stayed healthy across the whole sweep, with almost no queueing. The clearest evidence is the queue: it averaged 0.41 and peaked at only 3 requests while ~14–21 requests were actively processing. Requests were being worked on, not waiting. Note that the peak of 21 active requests equals the configured max_num_seqs of 21, so the peak queue of 3 at concurrency 24 is consistent with that cap (24 − 21 = 3).

  • Throughput scaled cleanly to a 1,168 tok/s total peak (548 tok/s output). The gap between avg and max is just the ramp from concurrency 12 up to 24, which is expected.

  • High E2E latency (p50 27s, p99 58s) is driven by long generations, not server inefficiency. At 380 output tok/s across 14 active requests (27 tok/s per request), responses are roughly 700+ output tokens each. Long outputs naturally mean long end-to-end times; the near-zero queue confirms it isn't a backlog problem.

  • The one genuine stress signal is tail TTFT. Median first-token is excellent (~0.73s), but p99 spikes to 19s at peak load — under heaviest concurrency, a small fraction of requests wait noticeably before streaming starts. This is the first thing that degrades as you push higher.
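
The per-request estimate in the E2E bullet above can be reproduced from the table's averages. This is a back-of-the-envelope sketch: it treats every active request as decoding at the same rate for the whole median duration and ignores TTFT, so the result is an order-of-magnitude figure, not a measured output length.

```python
# Averages copied from the Verdict table.
output_tok_s_avg = 380      # average output throughput, tok/s
active_requests_avg = 13.7  # average concurrently active requests
e2e_p50_avg_s = 27.1        # median end-to-end latency, seconds

# Per-request decode rate: aggregate output rate shared by active requests.
per_request_tok_s = output_tok_s_avg / active_requests_avg

# Rough tokens per response: per-request rate times median duration.
est_tokens_per_response = per_request_tok_s * e2e_p50_avg_s

print(round(per_request_tok_s, 1), round(est_tokens_per_response))  # prints: 27.7 752
```

The estimate lands in the same "700+" range as the summary and stays below the 1024-token output cap, which fits the reading that long generations drive E2E latency.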

Result

The vLLM server and Gemma-4 model didn't break a sweat. Near-zero queueing and ~1.17k tok/s throughput at concurrency 24 mean there's room to climb — high E2E latency is just long generations, and the only metric flexing under load is tail TTFT (p99 ~19s).


Complete Walkthrough Video

https://youtu.be/88ZbdsJYDxk