Venkat Nithin Chinni

What an LLM Server Feels Like Under Load

10 min read

I had a couple of days, an RTX 3060 with 12 GB of VRAM running on a Proxmox node, and a question I was curious about: what does running an LLM at home actually feel like to use? Alone with the model. With five other people. With fifty. At what point does it stop feeling fast and start feeling broken?

I ran the same experiments against two 7B models — Qwen 2.5 and Llama-2, both 4-bit quantized.

The setup

One Proxmox node with one 3060 passed through. vLLM 0.11.0 in a Docker container. 4-bit quantized weights (AWQ for Qwen, GPTQ for Llama-2). Prefix caching enabled.

```shell
docker run -d --name vllm \
    --restart unless-stopped \
    --gpus all --ipc=host \
    -v vllm-models:/root/.cache/huggingface \
    -p 8000:8000 \
    vllm/vllm-openai:v0.11.0 \
    --model <MODEL> \
    --max-model-len <CTX> \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 96 \
    --enable-prefix-caching
```

Driver scripts running from a different machine. Three benchmarks per model: single-user baseline, concurrency sweep, prefix caching test. A small dashboard on /metrics watched KV cache, queue depth, preemptions, and TTFT live. Full data tables in the appendix.

Scripts at github.com/clipod/vllm-probe if you want to clone and run them against your own vLLM instance.
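For readers who want to watch /metrics without cloning anything, here is a minimal sketch of a parser for the Prometheus text format that vLLM serves there. It is not taken from the repository above. The metric names in the sample are illustrative assumptions (real names vary between vLLM versions), and the function sums all label sets per name, which is fine for gauges and counters but meaningless for histogram buckets.

```python
import re

# One sample line of the Prometheus text format: name, optional {labels}, value.
# Simplification: assumes label values do not contain a closing brace.
SAMPLE_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)')


def parse_metrics(text, prefix="vllm:"):
    """Sum sample values per metric name, keeping names that start with prefix."""
    totals = {}
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # HELP and TYPE comment lines
        match = SAMPLE_LINE.match(line)
        if match and match.group(1).startswith(prefix):
            name, value = match.group(1), float(match.group(2))
            totals[name] = totals.get(name, 0.0) + value
    return totals


# Hypothetical scrape; check your own /metrics output for the actual names.
SAMPLE_TEXT = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="m"} 6.0
vllm:num_preemptions_total{model_name="m"} 11.0
process_cpu_seconds_total 12.5
"""

print(parse_metrics(SAMPLE_TEXT))
```

Against a live server, the text would come from something like `urllib.request.urlopen("http://<host>:8000/metrics").read().decode()`, polled every second or two.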

One user, no one else around

The first word appears in about 40 ms; tokens then stream at 60–80 per second, faster than reading speed. Both models feel instant. Qwen baselines at 59 tok/s decode with 43 ms TTFT. Llama-2 GPTQ at 79 tok/s with 37 ms TTFT. From the user’s chair, indistinguishable.
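Both numbers can be derived from nothing more than the arrival time of each streamed token. The sketch below shows one plausible way to compute them; the function name and the exact definition of decode rate (tokens after the first, divided by the time since the first) are assumptions, not necessarily what the benchmark scripts use.

```python
def stream_timings(request_sent, token_times):
    """Return (ttft_seconds, decode_tok_per_s) from monotonic timestamps.

    request_sent: time the request was sent.
    token_times: arrival time of each streamed token, in order.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_sent
    decode_window = token_times[-1] - token_times[0]
    # Decode rate excludes the first token, whose latency is the prefill.
    if decode_window > 0:
        rate = (len(token_times) - 1) / decode_window
    else:
        rate = float("nan")
    return ttft, rate


# Synthetic stream: first token after 40 ms, then one token every 12.5 ms.
times = [0.04 + 0.0125 * i for i in range(201)]
ttft, rate = stream_timings(0.0, times)
print(f"TTFT {ttft * 1000:.0f} ms, decode {rate:.0f} tok/s")
```

In a real client the timestamps would come from `time.monotonic()` called as each chunk of the streaming response arrives.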

The wait before the first word

Time-to-first-token grows as concurrent users go up. Under 200 ms feels responsive; up to a second is fine for an LLM (ChatGPT routinely sits there). Past three seconds users start to wonder; past ten, some give up.

| Concurrent users | Qwen TTFT p95 | Llama-2 TTFT p95 |
|---|---|---|
| 1 | 40 ms | 30 ms |
| 8 | 140 ms | 120 ms |
| 32 | 280 ms | 340 ms |
| 64 | 570 ms | 590 ms |
| 96 | 770 ms | 740 ms |

Both models stay under a second even at 96 concurrent users. Users notice the wait at peak load but don’t reload.

The text starts typing slower

Per-user decode rate is the perceived typing speed. Above 10 tok/s feels conversational; below 5 feels like buffering.

| Concurrent users | Qwen per-user tok/s | Llama-2 per-user tok/s |
|---|---|---|
| 1 | 59 | 79 |
| 8 | 54 | 65 |
| 32 | 33 | 32 |
| 64 | 19 | 17 |
| 96 | 13 | 10 |

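At 96, Llama-2 sits right on the 10 tok/s conversational threshold and Qwen is only slightly above it. Neither has reached the buffering line yet, but both are heading there. Total throughput climbs with concurrency, but per-user throughput drops to make room for the new arrivals.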
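The thresholds from this section and the previous one can be written down as two small lookup functions, which is handy when annotating benchmark output. This is only a sketch: the text leaves the 1 to 3 second TTFT band and the 5 to 10 tok/s decode band unnamed, so the labels "noticeable" and "slow" are assumptions filled in for completeness.

```python
def ttft_feel(seconds):
    """Map time-to-first-token to the perceived-wait bands described above."""
    if seconds < 0.2:
        return "responsive"
    if seconds <= 1.0:
        return "fine"
    if seconds <= 3.0:
        return "noticeable"  # assumed label; the text does not name this band
    if seconds <= 10.0:
        return "users start to wonder"
    return "some give up"


def decode_feel(tok_per_s):
    """Map per-user decode rate to the perceived typing-speed bands."""
    if tok_per_s > 10:
        return "conversational"
    if tok_per_s >= 5:
        return "slow"  # assumed label; the text does not name this band
    return "buffering"


# Rows from the tables: Qwen at 1 user and at 64 users.
for users, ttft_s, rate in [(1, 0.04, 59), (64, 0.57, 19)]:
    print(users, ttft_feel(ttft_s), decode_feel(rate))
```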

The silent pause

Tokens are arriving, you’re reading along, and then… nothing. Five seconds. Ten seconds. The cursor is still there, no error. Then tokens resume mid-sentence as if nothing happened.

This is preemption. vLLM holds each user’s KV cache in GPU memory. When the pool fills and an active user needs more memory, the scheduler evicts the youngest active user to free room. The evicted user waits, then re-prefills from scratch when there’s space. Tokens already on screen stay; new tokens just stop flowing for the duration.
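A toy model makes the mechanism concrete. The sketch below is not vLLM's scheduler (which allocates memory in blocks and has more policy than this); it only illustrates the rule described above, with hypothetical names and a pool counted in tokens: when one more decode step would not fit, the most recently admitted sequence is dropped and must start over later.

```python
def decode_step(running, pool_tokens):
    """Advance every running sequence by one token, evicting the youngest if needed.

    running: list of [name, tokens_held] pairs, oldest admission first.
    pool_tokens: total KV capacity, counted in tokens for simplicity.
    Returns the names evicted during this step.
    """
    evicted = []
    # Every surviving sequence needs room for one more token.
    while running and sum(held for _, held in running) + len(running) > pool_tokens:
        name, _ = running.pop()  # the last admitted sequence is evicted first
        evicted.append(name)     # its cached tokens are dropped; it re-prefills later
    for entry in running:
        entry[1] += 1
    return evicted


running = [["alice", 40], ["bob", 40], ["carol", 18]]
print(decode_step(running, pool_tokens=100))
print(running)
```

In this run the pool holds 98 tokens and three more are needed, so the newest sequence is evicted while the two older ones keep streaming. From the evicted user's chair, that is the silent pause.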

It’s a hard symptom to operate around, because nothing surfaces as an error.

I forced this state on Llama-2 — long prompts, climbing concurrency, watching the KV gauge:

| Concurrent users | KV cache | Preemptions | What users feel |
|---|---|---|---|
| 4 | 28% | 0 | Smooth |
| 8 | 45% | 0 | Smooth |
| 16 | 65% | 0 | Smooth |
| 24 | 97% | 0 | Still smooth, but on the edge |
| 32 | 100% | 11 | Random pauses for some |

The dangerous spot is 24, where the pool is at 97% with zero preemptions. Everything still feels fine. One more user and the pool tips over. Capacity planning needs to design for the cliff at 32, not the comfort at 24.
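One way to make the edge visible is to alert on KV usage before preemptions appear, rather than on preemptions alone. The sketch below classifies a single monitoring sample; the function name and the 90% threshold are assumptions chosen for illustration, not values from the experiment.

```python
def pool_state(kv_usage, preemptions_delta, edge_threshold=0.90):
    """Classify one monitoring sample of the KV pool.

    kv_usage: fraction of the KV pool in use, 0.0 to 1.0.
    preemptions_delta: preemptions since the previous sample.
    edge_threshold: assumed alerting threshold, to be tuned per workload.
    """
    if preemptions_delta > 0:
        return "preempting"
    if kv_usage >= edge_threshold:
        return "on the edge"
    return "headroom"


# The rows from the table above, as (users, KV usage, preemptions).
for users, usage, preempted in [(4, 0.28, 0), (16, 0.65, 0), (24, 0.97, 0), (32, 1.00, 11)]:
    print(users, pool_state(usage, preempted))
```

With this rule the 24-user row raises a warning even though users have not yet felt anything, which is the point at which adding capacity or capping admissions is still cheap.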

The two models diverge sharply here. Qwen never produces the pause on this card — even at 48 concurrent users with 12,000-token prompts, its KV pool peaked at 47%. Llama-2 produces it at modest concurrency. The reason is in the section after next.

The cold-user problem

Two users hit your service in the same minute. Identical questions, identical system prompt. The first user waits 2.5 seconds for the first word. The second waits 0.15 seconds.

That’s roughly a 17x gap in perceived speed for an identical request.

The reason is prefix caching. vLLM stores K/V vectors for sequences it’s recently processed. When the second user’s request arrives with the same system prompt, vLLM reuses the cached vectors instead of recomputing them.

| Model | Cold user TTFT | Warm user TTFT | Speedup |
|---|---|---|---|
| Qwen 2.5 | 2.48 s | 0.15 s | 17x |
| Llama-2 GPTQ | 1.40 s | 0.12 s | 12x |

Product implication: the order of content in your system prompt determines whether caching helps you. Static instructions at the front, per-user data at the end → most users feel like the second user. Per-user data at the front (user name, today’s date, session token) → every user is the first user.

Prompt design earns or burns the speedup. It’s a UX decision living in the prompt template, not in the infrastructure.
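The ordering effect is easy to see by measuring how much of two users' prompts is identical from the start. The sketch below uses hypothetical helper names and compares characters, whereas vLLM matches on blocks of tokens, so treat the character count as a rough proxy for what the cache can reuse.

```python
from os.path import commonprefix  # works character-wise on plain strings


def build_prompt(static_instructions, per_user_data, static_first=True):
    """Join the two prompt parts, with the static part first by default."""
    parts = [static_instructions, per_user_data]
    if not static_first:
        parts.reverse()
    return "\n\n".join(parts)


def shared_prefix_chars(a, b):
    """Number of leading characters two prompts have in common."""
    return len(commonprefix([a, b]))


STATIC = "You are a helpful assistant. " * 10  # stand-in for a long system prompt

good = [build_prompt(STATIC, f"User: {name}") for name in ("Alice", "Bob")]
bad = [build_prompt(STATIC, f"User: {name}", static_first=False) for name in ("Alice", "Bob")]

print("static first:", shared_prefix_chars(*good), "shared characters")
print("user data first:", shared_prefix_chars(*bad), "shared characters")
```

With the static block first, the whole system prompt is shared; with the user's name first, the prompts diverge after a handful of characters and nothing downstream can be reused.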

Why two 7B models behave differently to users

Same card. Same 12 GB. Same 4-bit quantization. Same Marlin kernel. Both ~7B parameters. But Qwen never preempted, even at 48 concurrent users with 12,000-token prompts, while Llama-2 started preempting at 32 users with ~2,000-token prompts. The gap traces to one architectural choice: how the model stores its key-value memory.

Qwen 2.5 uses Grouped Query Attention. Its attention layers share K and V across groups of query heads. 4 KV heads across 28 layers, ~56 KB per token in fp16.

Llama-2 uses full Multi-Head Attention. Each query head gets its own K and V. 32 KV heads across 32 layers, ~512 KB per token. About 9x more memory per token.

The 6.6 GB KV pool on this card holds either ~120,000 tokens for Qwen or ~13,200 for Llama-2. Same physical memory, ~9x fewer concurrent slots for Llama-2. That’s the entire user-experience gap, packed into one architectural decision.
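The per-token figures follow from a one-line formula: two vectors (K and V) per KV head per layer, each `head_dim` elements wide. The sketch below reproduces the arithmetic. The head dimension of 128 is taken from the public model configs rather than from this post, and reading "6.6 GB" as GiB is an assumption, so the token counts land in the same ballpark as the figures above rather than matching them exactly.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim=128, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers.

    head_dim=128 is an assumption from the public model configs;
    dtype_bytes=2 corresponds to fp16.
    """
    # Factor 2: one K vector and one V vector per KV head per layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


qwen = kv_bytes_per_token(layers=28, kv_heads=4)
llama = kv_bytes_per_token(layers=32, kv_heads=32)
pool_bytes = 6.6 * 1024**3  # assumes the 6.6 GB pool is measured in GiB

print(f"Qwen 2.5: {qwen // 1024} KiB per token, ~{int(pool_bytes // qwen):,} tokens in the pool")
print(f"Llama-2:  {llama // 1024} KiB per token, ~{int(pool_bytes // llama):,} tokens in the pool")
print(f"Ratio: {llama / qwen:.1f}x")
```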

The lesson: architecture decides UX more than hardware on consumer cards. Two 7B models that look interchangeable on a spec sheet have different walls. Choosing your serving model is a UX decision, not just a quality decision.

The numbers, for engineers

Full data for verification.

Single-user baseline (200 output tokens, 10 kept after warmup):

| Metric | Qwen 2.5 7B AWQ | Llama-2 7B GPTQ |
|---|---|---|
| TTFT mean | 43 ms | 37 ms |
| Total time mean | 3.41 s | 2.55 s |
| Decode tok/s mean | 59.1 | 79.3 |

Throughput vs concurrency (200 max output tokens, varied prompts):

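| Concurrency | Qwen tok/s | Qwen per-user | Qwen TTFT p95 | Llama tok/s | Llama per-user | Llama TTFT p95 |
|---|---|---|---|---|---|---|
| 1 | 58.5 | 58.9 | 0.04 s | 78.8 | 79.4 | 0.03 s |
| 2 | 102.0 | 57.7 | 0.07 s | 150.9 | 76.5 | 0.05 s |
| 4 | 217.6 | 55.7 | 0.10 s | 284.8 | 72.7 | 0.08 s |
| 8 | 410.1 | 53.6 | 0.14 s | 505.5 | 65.2 | 0.12 s |
| 16 | 760.8 | 49.9 | 0.17 s | 810.4 | 53.1 | 0.21 s |
| 32 | 999.0 | 32.5 | 0.28 s | 971.0 | 31.7 | 0.34 s |
| 64 | 1157.5 | 18.8 | 0.57 s | 1009.9 | 16.6 | 0.59 s |
| 96 | 1184.7 | 12.9 | 0.77 s | 869.8 | 10.4 | 0.74 s |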

Qwen plateaus around 1185 tok/s. Llama-2 peaks at 1010 at concurrency 64 and drops to 870 at 96 — preemption overhead becoming visible.

Memory-wall test (Llama-2 GPTQ, ~2,000-token prompts, 500 max output tokens):

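| Concurrency | Peak KV % | Preemptions | Wait queue | Total throughput |
|---|---|---|---|---|
| 4 | 28% | 0 | 0 | 137 tok/s |
| 8 | 45% | 0 | 0 | 179 tok/s |
| 16 | 65% | 0 | 0 | 212 tok/s |
| 24 | 97% | 0 | 0 | 217 tok/s |
| 32 | 100% | 11 | 6 | 215 tok/s |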

Qwen on a comparable long-prompt workload (12,000-token prompts at concurrency 48) peaked at 47% KV with zero preemptions. The wall is architectural.

Prefix caching (30 requests, concurrency 8, ~2,000-character system prompt):

| Metric | Qwen unique | Qwen shared | Llama-2 unique | Llama-2 shared |
|---|---|---|---|---|
| TTFT p50 | 2.48 s | 0.15 s | 1.40 s | 0.12 s |
| Prefix hit rate | 0% | 94.3% | 0% | 94.3% |

Caveat on the GPTQ Llama-2 prefix run: the shared-prompt requests produced only 52 output tokens across 30 requests, while the unique-prompt requests produced 1,251. At temperature 0, decoding is deterministic, and the GPTQ-quantized weights happen to land on a “short polite answer plus EOS” output for that specific shared prefix — same architecture as the AWQ build but slightly different rounding, different greedy path. Hit rate and TTFT improvement are valid; the throughput speedup is not, since the two scenarios generated very different amounts of output.

What this doesn’t cover

Two specific models on one 12 GB card. Different model, quantization, prompts, or hardware would shift the boundaries.

Production deployments touch a longer list this didn’t: failure modes, multi-tenant isolation, model upgrades, autoscaling, cost, monitoring discipline. Each is its own question.

What I take from it

Capacity planning is not “tokens per second peak.” It’s “how many users feel served.” The peak number is an upper bound on a different question than the one a product owner is asking.

When self-hosting LLMs for production, model architecture caps your concurrency as much as the GPU does. Two 7B models on the same card with the same quantization can support very different numbers of concurrent users. Pick the model with capacity planning in mind, not just answer quality.

Watch the silent symptoms. Preemption-as-pause and cold-user latency don’t show up as errors. They show up as weird user feedback that’s hard to reproduce. Instrument for them — preemption counts and prefix cache hit rates over time save weeks of debugging.

Prefix caching is leverage, but only if you design for it. Static prefixes earn their keep. Per-user data at the front of the prompt makes every user cold.

Running this changed what I’d ask in a design review. “Can our hardware serve N users?” sounded like a single question; now it sounds like four — first-token latency, decode rate, preemption frequency, cold-user gap. Each has a different threshold, none of them visible on a peak-throughput chart.

Have thoughts, corrections, or counterarguments? Reach me at venk@nith.in
