# Gemma-4 31B + vLLM on RTX 6000 PRO : 1.17k tokens/sec and still asking for more

## Model Overview

**Gemma-4-31B-it-FP8** is a **30.7B parameter dense Transformer** built by Google DeepMind, designed for frontier-level reasoning, coding, multimodal understanding, and agentic workflows. It supports a **256K-token context window**, accepts text, image, and video input, and features a hybrid attention mechanism interleaving local sliding-window and full global attention with Proportional RoPE for long-context performance. The model includes native function calling, toggleable thinking mode for step-by-step reasoning, and multilingual support across 140+ languages.

This FP8-block checkpoint was quantized by **RedHatAI** using LLM Compressor, compressing the weights and activations of linear operators within transformer blocks to the FP8 data type while keeping the vision tower, embedding, and output head layers in their original precision.

HF path: [https://huggingface.co/RedHatAI/gemma-4-31B-it-FP8-block](https://huggingface.co/RedHatAI/gemma-4-31B-it-FP8-block)

| Model Name | Gemma/Gemma-4 31B it |
| --- | --- |
| **HF Checkpoint** | RedHatAI/gemma-4-31B-it-FP8-block |
| **Quantization** | FP-8 |
| **Max model length (context)** | 4096 |

* * *

## Hardware

| GPU Name | RTX PRO 6000 Blackwell |
| --- | --- |
| **No of GPUS** | 1 |
| **VRAM** | 96GB |
| **CPU** | 24 vCPU   with  218 GB RAM |

* * *

## Serving Engine

| Engine | vLLM |
| --- | --- |
| **Version** | 0.20 |
| **Configs** | `'max_model_len': 4096, 'gpu_memory_utilization': 0.9, 'enable_prefix_caching': True, 'language_model_only': True, 'max_num_batched_tokens': 4096, 'max_num_seqs': 21, 'enable_chunked_prefill': True` |
| **Endpoint** | `/v1/chat/completions` |

* * *

## Dataset / Workload

| Name | ShareGPT sample |
| --- | --- |
| **Unique Prompts** | 128 prompts per concurrency level |
| **Conversation Shape** | Multi-turn response per request |
| **Languages** | Multilingual with en/zh/ru/th/ko/fr/pl/ja |
| **Max output tokens** | 1024 |
| **Temperature per prompt** | 0.2 |
| **Streaming** | ON |
| **Concurrency Levels** | 12, 16, 20, 24 |

* * *

## Performance Charts

### Request Concurrency Timings in Charts

| Concurrency Level | Time Range in Chart |
| --- | --- |
| **12** | 17:19 - 17:23 |
| **16** | 17:24 - 17:27 |
| **20** | 17:27 - 17:30 |
| **24** | 17:30 - 17:33 |

### Token Throughput

![](https://cdn.hashnode.com/uploads/covers/6a22b1a041d5b05f16273b50/01c53087-8401-49fa-9b77-8f25ffccbe02.png align="center")

![](https://cdn.hashnode.com/uploads/covers/6a22b1a041d5b05f16273b50/d4743eb1-801a-45c0-b831-d7da754ea81a.png align="center")

### Time To First Token (TTFT)

![](https://cdn.hashnode.com/uploads/covers/6a22b1a041d5b05f16273b50/7ca1a1b0-e7a3-429e-aea6-a5551ab05f19.png align="center")

### Queue Time

![](https://cdn.hashnode.com/uploads/covers/6a22b1a041d5b05f16273b50/e056a6ea-8ba3-48ac-b0b7-8ac5d36b4ab3.png align="center")

### E2E Latency

![](https://cdn.hashnode.com/uploads/covers/6a22b1a041d5b05f16273b50/2b69dcf9-e674-4e2c-ba49-f2ba320be13a.png align="center")

### TPOT Latency

![](https://cdn.hashnode.com/uploads/covers/6a22b1a041d5b05f16273b50/55884e32-002b-4a19-a0a2-95493045cbef.png align="center")

* * *

## Verdict

Showing **Avg** and **Max** only.

Figures are aggregated across the whole sweep, so **Avg ≈ behavior across concurrency 12→24** and **Max ≈ behavior at the top end (~concurrency 24)** — not broken out per level.

| Category | Metric | Avg | Max | Unit | What it tells you |
| --- | --- | --- | --- | --- | --- |
| **Throughput** | Prompt | 444 | 708 | tok/s | Prefill rate; scales well under load |
|  | Output | 380 | 548 | tok/s | Generation rate; the user-facing number |
|  | **Total** | **823** | **1,168** | tok/s | Combined; peaks ~42% above average |
| **Load** | Active requests | 13.7 | 21 | reqs | Requests being processed concurrently |
|  | Queued requests | 0.41 | 3 | reqs | Near-empty queue → server keeping up |
| **E2E latency** | p50 | 27.1 | 35 | s | Typical full-response time |
|  | p95 | 40.1 | 51.3 | s | Slow-tail full-response time |
|  | p99 | 41.9 | 58.3 | s | Worst-case full-response time |
| **TTFT** | p50 | 0.73 | 3.65 | s | Median time to first token — snappy |
|  | p95 | 2.82 | 15.6 | s | Tail first-token wait |
|  | p99 | 3.65 | 19.1 | s | Worst-case first-token wait spikes hard |

### Summary:

*   **Server stayed healthy and unsaturated** across the whole sweep. The clearest evidence is the queue: it averaged **0.41** and peaked at only **3** requests while ~14–21 requests were actively processing. Requests were being worked on, not waiting — the scheduler had headroom even at concurrency 24.
    
*   **Throughput scaled cleanly** to a **1,168 tok/s** total peak (**548 tok/s** output). The gap between avg and max is just the ramp from concurrency 12 up to 24, which is expected.
    
*   **High E2E latency (p50 27s, p99 58s) is driven by long generations, not server inefficiency.** At 380 output tok/s across 14 active requests (27 tok/s per request), responses are roughly 700+ output tokens each. Long outputs naturally mean long end-to-end times; the near-zero queue confirms it isn't a backlog problem.
    
*   **The one genuine stress signal is tail TTFT.** Median first-token is excellent (~0.73s), but p99 spikes to **19s** at peak load — under heaviest concurrency, a small fraction of requests wait noticeably before streaming starts. This is the first thing that degrades as you push higher.
    

### Result

> The vLLM server and Gemma-4 model didn't break a sweat. Near-zero queueing and ~1.17k tok/s throughput at concurrency 24 mean there's room to climb — high E2E latency is just long generations, and the only metric flexing under load is tail TTFT (p99 ~19s).

* * *

## Complete Walkthrough Video

%[https://youtu.be/88ZbdsJYDxk]