What an LLM Server Feels Like Under Load

I had a couple of days, an RTX 3060 with 12 GB of VRAM running on a Proxmox node, and a question I was curious about: what does running an LLM at home actually feel like to use? Alone with the model. With five other people. With fifty. At what point does it stop feeling fast and start feeling broken?

I ran the same experiments against two 7B models — Qwen 2.5 and Llama-2, both 4-bit quantized.

The setup

One Proxmox node with one 3060 passed through. vLLM 0.11.0 in a Docker container. 4-bit quantized weights (AWQ for Qwen, GPTQ for Llama-2). Prefix caching enabled.

docker run -d --name vllm \
    --restart unless-stopped \
    --gpus all --ipc=host \
    -v vllm-models:/root/.cache/huggingface \
    -p 8000:8000 \
    vllm/vllm-openai:v0.11.0 \
    --model <MODEL> \
    --max-model-len <CTX> \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 96 \
    --enable-prefix-caching

Driver scripts ran from a different machine. Three benchmarks per model: single-user baseline, concurrency sweep, prefix caching test. A small dashboard on /metrics watched KV cache, queue depth, preemptions, and TTFT live. Full data tables are in “The numbers, for engineers” near the end.

Scripts at github.com/clipod/vllm-probe if you want to clone and run them against your own vLLM instance.
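
For a sense of what a concurrency sweep boils down to, here is a minimal sketch. It is not the code from vllm-probe: the function names are mine, and `fake_send` is a stand-in for a real streaming HTTP request that returns how many output tokens it received. The levels match the ones in the tables in this post.

```python
import asyncio
import time
from typing import Awaitable, Callable

SendFn = Callable[[int], Awaitable[int]]  # takes a request index, returns output token count

async def run_level(send: SendFn, concurrency: int) -> dict:
    """Fire `concurrency` requests at once and report total throughput."""
    start = time.perf_counter()
    token_counts = await asyncio.gather(*(send(i) for i in range(concurrency)))
    elapsed = time.perf_counter() - start
    total = sum(token_counts)
    return {"concurrency": concurrency, "tokens": total, "tok_per_s": total / elapsed}

async def sweep(send: SendFn, levels=(1, 2, 4, 8, 16, 32, 64, 96)) -> list[dict]:
    """Run each concurrency level in turn, one after the other."""
    return [await run_level(send, level) for level in levels]

async def fake_send(i: int) -> int:
    # Hypothetical placeholder: pretend the server took 10 ms and returned 200 tokens.
    await asyncio.sleep(0.01)
    return 200

if __name__ == "__main__":
    for row in asyncio.run(sweep(fake_send, levels=(1, 4))):
        print(row)
```

Swapping `fake_send` for a coroutine that streams from the server’s OpenAI-compatible endpoint is the part that matters in practice; the harness around it stays this small.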

One user, no one else around

The first word appears in about 40 ms; tokens then stream at 60–80 per second, faster than reading speed. Both models feel instant. Qwen baselines at 59 tok/s decode with 43 ms TTFT. Llama-2 GPTQ at 79 tok/s with 37 ms TTFT. From the user’s chair, indistinguishable.

The wait before the first word

Time-to-first-token grows as concurrent users go up. Under 200 ms feels responsive; up to a second is fine for an LLM (ChatGPT routinely sits there). Past three seconds users start to wonder; past ten, some give up.

Concurrent users	Qwen TTFT p95	Llama-2 TTFT p95
1	40 ms	30 ms
8	140 ms	120 ms
32	280 ms	340 ms
64	570 ms	590 ms
96	770 ms	740 ms

Both models stay under a second even at 96 concurrent users. Users notice the wait at peak load but don’t reload.
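
If you want to compute a TTFT p95 column like this from your own runs, the arithmetic is small. A minimal sketch, assuming you recorded a send timestamp and per-token arrival timestamps for each request. The function names are mine, and the percentile is nearest-rank, which can differ slightly from whatever method your tooling uses.

```python
import math

def ttft_seconds(request_sent: float, token_times: list[float]) -> float:
    """Time from sending the request to the first streamed token."""
    return token_times[0] - request_sent

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

if __name__ == "__main__":
    # Hypothetical timings for three requests: (send time, token arrival times), in seconds.
    runs = [
        (0.0, [0.25, 0.30, 0.35]),
        (0.0, [0.50, 0.55, 0.60]),
        (0.0, [0.75, 0.80, 0.85]),
    ]
    ttfts = [ttft_seconds(sent, times) for sent, times in runs]
    print("p95 TTFT:", percentile(ttfts, 95))
```

With only a handful of samples, p95 is effectively the worst request, so it is worth collecting a few hundred before trusting the number.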

The text starts typing slower

Per-user decode rate is the perceived typing speed. Above 10 tok/s feels conversational; below 5 feels like buffering.

Concurrent users	Qwen per-user tok/s	Llama-2 per-user tok/s
1	59	79
8	54	65
32	33	32
64	19	17
96	13	10

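At 96, Llama-2 is sitting right on the 10 tok/s conversational line and Qwen is only just above it. Neither has reached the 5 tok/s buffering line, but both are heading there. Total throughput climbs with concurrency, but per-user throughput drops to make room for the new arrivals.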
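
The per-user rate is easy to compute from one user’s stream. A minimal sketch using the two thresholds from this section (10 tok/s and 5 tok/s). The “in between” label for the gap between them is my own, and the function names are hypothetical.

```python
def per_user_rate(tokens_received: int, first_token_time: float, last_token_time: float) -> float:
    """Decode rate one user sees: tokens after the first, divided by streaming time."""
    elapsed = last_token_time - first_token_time
    if tokens_received < 2 or elapsed <= 0:
        return 0.0
    return (tokens_received - 1) / elapsed

def typing_feel(tok_per_s: float) -> str:
    """Map a per-user decode rate to the comfort bands used in this section."""
    if tok_per_s > 10:
        return "conversational"
    if tok_per_s < 5:
        return "buffering"
    return "in between"

if __name__ == "__main__":
    rate = per_user_rate(tokens_received=201, first_token_time=0.0, last_token_time=4.0)
    print(rate, typing_feel(rate))
```

Measuring from the first token rather than from the request keeps TTFT out of the decode number, so the two symptoms stay separate.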

The silent pause

Tokens are arriving, you’re reading along, and then… nothing. Five seconds. Ten seconds. The cursor is still there, no error. Then tokens resume mid-sentence as if nothing happened.

This is preemption. vLLM holds each user’s KV cache in GPU memory. When the pool fills and an active user needs more memory, the scheduler evicts the youngest active user to free room. The evicted user waits, then re-prefills from scratch when there’s space. Tokens already on screen stay; new tokens just stop flowing for the duration.

It’s a hard symptom to operate around:

No error is raised. The HTTP connection stays open. Monitoring sees a healthy server.
Client timeouts misdiagnose it. Many libraries hang up after 30 seconds of silence. Your user sees an error from a healthy server.
It hides in averages. Most users still get fine responses; some randomly get the pause.
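
Since the server reports nothing, the client is one place to catch the pause. A minimal sketch, assuming you log an arrival timestamp for every streamed token. The function name and the 2-second threshold are my own choices, not anything vLLM defines.

```python
def find_stalls(token_times: list[float], threshold_s: float = 2.0) -> list[tuple[int, float]]:
    """Return (token_index, gap_seconds) for every inter-token gap longer than threshold_s.

    token_index is the index of the token that arrived after the silence.
    """
    stalls = []
    for i in range(1, len(token_times)):
        gap = token_times[i] - token_times[i - 1]
        if gap > threshold_s:
            stalls.append((i, gap))
    return stalls

if __name__ == "__main__":
    # Hypothetical stream: steady tokens, then a 7-second silence before token 3.
    arrivals = [0.0, 0.5, 1.0, 8.0, 8.5]
    print(find_stalls(arrivals))
```

Logging the longest gap per request, rather than only the mean decode rate, is what keeps this from hiding in averages.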

I forced this state on Llama-2 — long prompts, climbing concurrency, watching the KV gauge:

Concurrent users	KV cache	Preemptions	What users feel
4	28%	0	Smooth
8	45%	0	Smooth
16	65%	0	Smooth
24	97%	0	Still smooth — but on the edge
32	100%	11	Random pauses for some

The dangerous spot is 24, where the pool is at 97% with zero preemptions. Everything still feels fine. Add a few more users and the pool tips over: by 32 it is full and 11 preemptions have occurred. Capacity planning needs to design for the cliff at 32, not the comfort at 24.

The two models diverge sharply here. Qwen never produces the pause on this card — even at 48 concurrent users with 12,000-token prompts, its KV pool peaked at 47%. Llama-2 produces it at modest concurrency. The reason is in the section after next.

The cold-user problem

Two users hit your service in the same minute. Identical questions, identical system prompt. The first user waits 2.5 seconds for the first word. The second waits 0.15 seconds.

That’s roughly a 17x gap in perceived speed for an identical request.

The reason is prefix caching. vLLM stores K/V vectors for sequences it’s recently processed. When the second user’s request arrives with the same system prompt, vLLM reuses the cached vectors instead of recomputing them.

Model	Cold user TTFT	Warm user TTFT	Speedup
Qwen 2.5	2.48 s	0.15 s	17x
Llama-2 GPTQ	1.40 s	0.12 s	12x

Product implication: the order of content in your system prompt determines whether caching helps you. Static instructions at the front, per-user data at the end → most users feel like the second user. Per-user data at the front (user name, today’s date, session token) → every user is the first user.

Prompt design earns or burns the speedup. It’s a UX decision living in the prompt template, not in the infrastructure.
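
The ordering effect is easy to see without a GPU. A minimal sketch, assuming a made-up template and using characters as a rough proxy for tokens. As I understand it, real prefix caching matches on blocks of tokens, so treat the fractions as illustrative only. All names here are hypothetical.

```python
STATIC_INSTRUCTIONS = "You are a support assistant. Answer briefly and cite the docs."

def cache_friendly(user_name: str, question: str) -> str:
    # Static text first, per-user data last: the long prefix is identical across users.
    return f"{STATIC_INSTRUCTIONS}\nUser: {user_name}\nQuestion: {question}"

def cache_hostile(user_name: str, question: str) -> str:
    # Per-user data first: two users' prompts diverge almost immediately.
    return f"User: {user_name}\n{STATIC_INSTRUCTIONS}\nQuestion: {question}"

def shared_prefix_fraction(a: str, b: str) -> float:
    """Fraction of prompt `a` that matches `b` character-for-character from the start."""
    shared = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        shared += 1
    return shared / len(a) if a else 0.0

if __name__ == "__main__":
    good = shared_prefix_fraction(cache_friendly("alice", "hi"), cache_friendly("bob", "hi"))
    bad = shared_prefix_fraction(cache_hostile("alice", "hi"), cache_hostile("bob", "hi"))
    print(f"static-first shares {good:.0%}, user-first shares {bad:.0%}")
```

Running a check like this over a sample of real prompts is a cheap way to find out whether a template change quietly turned every user into a cold user.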

Why two 7B models behave differently to users

Same card. Same 12 GB. Same 4-bit quantization. Same Marlin kernel. Both ~7B parameters. But Qwen never hit the memory wall in any run here, while Llama-2 starts preempting at 32 concurrent users once prompts get long. The gap traces to one architectural choice: how the model stores its key-value memory.

Qwen 2.5 uses Grouped Query Attention. Its attention layers share K and V across groups of query heads. 4 KV heads across 28 layers, ~56 KB per token in fp16.

Llama-2 uses full Multi-Head Attention. Each query head gets its own K and V. 32 KV heads across 32 layers, ~512 KB per token. About 9x more memory per token.

The 6.6 GB KV pool on this card holds either ~120,000 tokens for Qwen or ~13,200 for Llama-2. Same physical memory, ~9x fewer concurrent slots for Llama-2. That’s the entire user-experience gap, packed into one architectural decision.
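
The per-token figures fall out of a one-line formula. A minimal sketch: the layer and KV-head counts are the ones quoted above, while the head dimension of 128 is not stated in this post and is my reading of the published model configs. I also read “6.6 GB” as GiB, which is an assumption.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # 2x for K and V, stored per layer, per KV head, per head dimension, in fp16 by default.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

QWEN = kv_bytes_per_token(layers=28, kv_heads=4)     # 57,344 bytes = 56 KiB
LLAMA2 = kv_bytes_per_token(layers=32, kv_heads=32)  # 524,288 bytes = 512 KiB

POOL_BYTES = 6.6 * 1024**3  # assumption: the 6.6 GB pool interpreted as GiB

def pool_tokens(bytes_per_token: int, pool_bytes: float = POOL_BYTES) -> int:
    """How many tokens of KV cache fit in the pool."""
    return int(pool_bytes // bytes_per_token)

if __name__ == "__main__":
    print("Qwen tokens in pool:   ", pool_tokens(QWEN))
    print("Llama-2 tokens in pool:", pool_tokens(LLAMA2))
    print("ratio:", round(LLAMA2 / QWEN, 1))
```

The results land in the same ballpark as the ~120,000 and ~13,200 quoted above. The exact figures depend on how the pool size is rounded and on block-level allocation inside vLLM.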

The lesson: architecture decides UX more than hardware on consumer cards. Two 7B models that look interchangeable on a spec sheet have different walls. Choosing your serving model is a UX decision, not just a quality decision.

The numbers, for engineers

Full data for verification.

Single-user baseline (200 output tokens per request, 10 runs kept after warmup):

Metric	Qwen 2.5 7B AWQ	Llama-2 7B GPTQ
TTFT mean	43 ms	37 ms
Total time mean	3.41 s	2.55 s
Decode tok/s mean	59.1	79.3

Throughput vs concurrency (200 max output tokens, varied prompts):

Concurrency	Qwen tok/s	Qwen per-user	Qwen TTFT p95	Llama tok/s	Llama per-user	Llama TTFT p95
1	58.5	58.9	0.04 s	78.8	79.4	0.03 s
2	102.0	57.7	0.07 s	150.9	76.5	0.05 s
4	217.6	55.7	0.10 s	284.8	72.7	0.08 s
8	410.1	53.6	0.14 s	505.5	65.2	0.12 s
16	760.8	49.9	0.17 s	810.4	53.1	0.21 s
32	999.0	32.5	0.28 s	971.0	31.7	0.34 s
64	1157.5	18.8	0.57 s	1009.9	16.6	0.59 s
96	1184.7	12.9	0.77 s	869.8	10.4	0.74 s

Qwen plateaus around 1185 tok/s. Llama-2 peaks at 1010 at concurrency 64 and drops to 870 at 96 — preemption overhead becoming visible.

Memory-wall test (Llama-2 GPTQ, ~2,000-token prompts, 500 max output tokens):

Concurrency	Peak KV %	Preemptions	Wait queue	Total throughput (tok/s)
4	28%	0	0	137
8	45%	0	0	179
16	65%	0	0	212
24	97%	0	0	217
32	100%	11	6	215

Qwen on a comparable long-prompt workload (12,000-token prompts at concurrency 48) peaked at 47% KV with zero preemptions. The wall is architectural.

Prefix caching (30 requests, concurrency 8, ~2,000-character system prompt):

Metric	Qwen unique	Qwen shared	Llama-2 unique	Llama-2 shared
TTFT p50	2.48 s	0.15 s	1.40 s	0.12 s
Prefix hit rate	0%	94.3%	0%	94.3%

Caveat on the GPTQ Llama-2 prefix run: the shared-prompt requests produced only 52 output tokens across 30 requests, while the unique-prompt requests produced 1,251. At temperature 0, decoding is deterministic, and the GPTQ-quantized weights happen to land on a “short polite answer plus EOS” output for that specific shared prefix — same architecture as the AWQ build but slightly different rounding, different greedy path. Hit rate and TTFT improvement are valid; the throughput speedup is not, since the two scenarios generated very different amounts of output.

What this doesn’t cover

Two specific models on one 12 GB card. Different model, quantization, prompts, or hardware would shift the boundaries.

Production deployments touch a longer list this didn’t: failure modes, multi-tenant isolation, model upgrades, autoscaling, cost, monitoring discipline. Each is its own question.

What I take from it

Capacity planning is not “tokens per second peak.” It’s “how many users feel served.” The peak number answers a different question from the one a product owner is asking.

When self-hosting LLMs for production, model architecture caps your concurrency as much as the GPU does. Two 7B models on the same card with the same quantization can support very different numbers of concurrent users. Pick the model with capacity planning in mind, not just answer quality.

Watch the silent symptoms. Preemption-as-pause and cold-user latency don’t show up as errors. They show up as weird user feedback that’s hard to reproduce. Instrument for them — preemption counts and prefix cache hit rates over time save weeks of debugging.
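
A minimal sketch of the preemption half of that instrumentation: parse the Prometheus text that /metrics returns and flag the two conditions from the memory-wall test. The two metric names are assumptions on my part, so check them against your own /metrics output. The 0.9 warning threshold is also just a starting guess. The parser assumes samples carry no trailing timestamp.

```python
KV_USAGE = "vllm:kv_cache_usage_perc"       # assumed name; gauge in the range 0..1
PREEMPTIONS = "vllm:num_preemptions_total"  # assumed name; monotonically increasing counter

def parse_metrics(text: str) -> dict[str, float]:
    """Sum samples by metric name from Prometheus text format, ignoring labels."""
    totals: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        try:
            totals[name] = totals.get(name, 0.0) + float(value)
        except ValueError:
            continue  # not a numeric sample line
    return totals

def pause_warnings(metrics: dict[str, float], last_preemptions: float, kv_warn: float = 0.9) -> list[str]:
    """Flag a nearly full KV pool and any preemptions since the previous poll."""
    warnings = []
    if metrics.get(KV_USAGE, 0.0) >= kv_warn:
        warnings.append("KV pool near full")
    if metrics.get(PREEMPTIONS, 0.0) > last_preemptions:
        warnings.append("new preemptions since last poll")
    return warnings
```

Fetch the text with something like `urllib.request.urlopen` against the server’s /metrics endpoint and call `pause_warnings` on a timer, carrying the previous preemption count forward. The KV warning is what would have fired at 24 users, before anyone felt a pause.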

Prefix caching is leverage, but only if you design for it. Static prefixes earn their keep. Per-user data at the front of the prompt makes every user cold.

Running this changed what I’d ask in a design review. “Can our hardware serve N users?” sounded like a single question; now it sounds like four — first-token latency, decode rate, preemption frequency, cold-user gap. Each has a different threshold, none of them visible on a peak-throughput chart.