Running large language models (LLMs) like Qwen3-235B with Ollama on a multi-GPU setup involves a complex dance of memory allocation. Why does a model sometimes use only 50% of GPU VRAM even with high context and batch sizes? Why are layers offloaded to the CPU? Let’s break it all down using a backpacking trip analogy.
🧠 The Cast of Characters
- GPUs: Super-strong hikers with small, fast backpacks (VRAM).
- CPU: Slower hiker with a huge but sluggish backpack (RAM).
- Model Layers: Heavy, fixed-weight books of knowledge the hikers must carry.
- Context Window (`--ctx-size`): A journal of everything seen so far.
- KV Cache: The actual memory used to store the keys and values for every token, so the model can attend back to them.
🏕️ The Journey Begins
Imagine you’re preparing a team of hikers to cross a massive range (run inference). Each hiker (GPU) needs to carry:
- A set of model weights (books)
- The attention journal (KV cache)
- Some workbench tools (temporary buffers)
But there’s a catch: the backpacks (VRAM) are limited. So you must balance what gets carried and where.
📦 VRAM Usage Breakdown
1. Model Weights (`--n-gpu-layers`)
Model weights are static and predictable. If you set `--n-gpu-layers 80`, Ollama tries to fit 80 layers of the model into GPU VRAM. If they don’t all fit, it offloads the remaining layers to the CPU.
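To get a feel for the scale, here is a back-of-the-envelope sketch in Python. Every figure in it is an assumption for illustration, not a real Qwen3-235B measurement; the point is only that weight memory grows linearly with the number of layers you put on the GPUs.

```python
# Back-of-the-envelope estimate of weight VRAM. All numbers below are
# illustrative assumptions, not real Qwen3-235B figures.
model_size_gib = 140      # assumed size of the quantized weights
n_layers = 94             # assumed number of transformer layers

per_layer_gib = model_size_gib / n_layers
print(f"~{per_layer_gib:.2f} GiB per layer")
print(f"80 layers on GPU would need ~{80 * per_layer_gib:.0f} GiB of VRAM in total")
```

If that total exceeds what your GPUs have free after the KV cache and buffers are reserved, the leftover layers ride along in the CPU hiker’s slow backpack.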
2. KV Cache (grows with `--ctx-size` × #layers)
This is dynamic and memory-hungry. Every token in the context adds keys and values that must be cached at every layer:
- More context = more keys and values to store
- A bigger `--batch-size` doesn’t grow the cache itself, but it does inflate the temporary compute buffers used during prompt processing
Apart from the weights, the KV cache is usually the single biggest contributor to VRAM usage during long prompts and completions.
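As a rough formula, the cache stores one key vector and one value vector per token, per layer, per KV head. Here is a minimal sizing sketch; the hyperparameters are placeholders, not confirmed Qwen3-235B values.

```python
# Minimal KV-cache sizing sketch for a standard transformer with grouped-query
# attention. Hyperparameters are placeholders, not confirmed Qwen3-235B values.
def kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # 2 = one key tensor + one value tensor; bytes_per_value=2 assumes fp16
    return 2 * ctx * n_layers * n_kv_heads * head_dim * bytes_per_value

for ctx in (8192, 32768):
    gib = kv_cache_bytes(ctx, n_layers=94, n_kv_heads=4, head_dim=128) / 2**30
    print(f"ctx={ctx}: ~{gib:.1f} GiB of KV cache")
```

With these assumed numbers the cache grows from roughly 1.5 GiB at an 8K context to almost 6 GiB at 32K, before a single extra layer of weights is loaded.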
3. Temporary Buffers
These are needed for intermediate operations like activations, attention weights, etc. They’re small compared to the others but still contribute.
📉 Why VRAM Usage Sometimes Drops
Let’s say you run the following config:
```
--ctx-size 32768
--batch-size 512
--n-gpu-layers 24
--tensor-split 3,3,3,3,3,3,3,3
```
You’d expect high VRAM usage, but you only see 40% usage per GPU. Why?
Answer: because Ollama prioritizes safety and cache room.
- The huge `--ctx-size` and `--batch-size` mean Ollama reserves a lot of VRAM up front for the KV cache and compute buffers.
- It keeps the number of GPU layers low so generation can’t run out of space partway through.
- The “unused” VRAM looks idle, but much of it is reserved headroom, especially for long outputs.
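Here is a toy model of that trade-off, not Ollama’s actual allocator: reserve the KV cache and buffers first, then see how many layers there is room left for. All figures are assumptions.

```python
# Toy model of the trade-off (not Ollama's real allocator): the more VRAM
# reserved for the KV cache, the fewer layers fit. All figures are assumed.
def layers_that_fit(vram_gib, kv_reserve_gib, buffer_gib=2.0, per_layer_gib=1.5):
    return max(0, int((vram_gib - kv_reserve_gib - buffer_gib) / per_layer_gib))

for ctx, kv_gib in [(8192, 1.5), (32768, 6.0)]:   # assumed per-GPU reservations
    print(f"ctx={ctx}: ~{layers_that_fit(24, kv_gib)} layers fit on a 24 GiB GPU")
```

Shrink the context and the same GPU suddenly has room for several more layers, which is exactly the lever the tuning tips below pull.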
🧠 Deep Dive: Attention and the KV Cache
Every time the model predicts a new token, it performs “self-attention”: comparing this token to every other token that came before it.
So with a context size of 32,768, processing a full prompt means computing attention scores over up to a 32,768 × 32,768 grid of token pairs, and, more importantly for VRAM, caching the keys and values of every one of those tokens. Multiply that by 80 to 100 layers and you get gigabytes of memory just for remembering what the model has already seen.
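To put a number on that, here is an illustrative sketch (assuming fp16, i.e. 2 bytes per value) of the attention-score matrix, which backends compute in chunks rather than storing whole; the memory that actually persists in VRAM is the KV cache from the earlier sketch.

```python
# Illustrative only, assuming fp16 (2 bytes per value).
ctx = 32768
score_matrix_gib = ctx * ctx * 2 / 2**30   # one head's full score matrix
print(f"{ctx} x {ctx} score matrix: ~{score_matrix_gib:.0f} GiB per head "
      "(computed in chunks, never stored whole)")
# What persists for the whole request is the KV cache, sized as shown earlier.
```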
🧪 How to Max Out GPU Efficiency
- Reduce `--ctx-size` to 8192 or 4096
- Increase `--n-gpu-layers` gradually
- Monitor with `nvidia-smi` and watch for out-of-memory (OOM) errors (see the monitoring sketch after this list)
- Fine-tune `--tensor-split` if one GPU has more VRAM than the others
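If you’d rather watch per-GPU memory from a script than eyeball `nvidia-smi`, a small sketch using the `nvidia-ml-py` bindings (the `pynvml` module) could look like this; treat it as a starting point, not anything Ollama-specific.

```python
# Print per-GPU memory usage via NVML (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):   # older bindings return bytes
            name = name.decode()
        print(f"GPU {i} ({name}): {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB")
finally:
    pynvml.nvmlShutdown()
```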
Example Optimal Config
```
--ctx-size 8192
--batch-size 64
--n-gpu-layers 80
--tensor-split 12,12,12,12,12,12,12,16
```
This balances layer load across GPUs while keeping enough memory for a medium-length prompt.
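As a rough illustration of what those `--tensor-split` ratios imply (the exact layer assignment is up to Ollama/llama.cpp internals, so treat this as ratio math, not a guarantee):

```python
# Ratio math only: how the example --tensor-split values divide 80 GPU layers.
splits = [12, 12, 12, 12, 12, 12, 12, 16]
n_gpu_layers = 80

total = sum(splits)
shares = [n_gpu_layers * s / total for s in splits]
print([f"{s:.1f}" for s in shares])   # ~9.6 layers each, ~12.8 on the last GPU
```

The GPU with the larger share simply carries a few more layers, which is what you want when it has more free VRAM than the rest.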
🔚 Closing Thoughts
Understanding how VRAM is used in Ollama lets you avoid slowdowns, crashes, and wasted capacity. If your model feels sluggish or your GPUs look underused, try tweaking `--ctx-size` and `--n-gpu-layers` in tandem.
The key is to treat your GPUs like hikers: don’t overload them, but don’t let them coast either.
Happy inferencing!