Running large language models (LLMs) like Qwen3-235B with Ollama on a multi-GPU setup involves a complex dance of memory allocation. Why does a model sometimes use only 50% of GPU VRAM even with high context and batch sizes? Why are layers offloaded to the CPU? Let’s break it all down using a backpacking trip analogy.
🧠 The Cast of Characters
- GPUs: Super-strong hikers with small, fast backpacks (VRAM).
- CPU: Slower hiker with a huge but sluggish backpack (RAM).
- Model Layers: Heavy, fixed-weight books of knowledge the hikers must carry.
- Context Window (`--ctx-size`): A journal of everything seen so far.
- KV Cache: The actual memory used to store the keys and values for every token, so the model can attend back to them.
🏕️ The Journey Begins
Imagine you’re preparing a team of hikers to cross a massive range (run inference). Each hiker (GPU) needs to carry:
- A set of model weights (books)
- The attention journal (KV cache)
- Some workbench tools (temporary buffers)
But there’s a catch: the backpacks (VRAM) are limited. So you must balance what gets carried and where.
📦 VRAM Usage Breakdown
1. Model Weights (`--n-gpu-layers`)
Model weights are static and predictable. If you set `--n-gpu-layers 80`, Ollama tries to fit 80 layers of the model into GPU VRAM. If they don’t all fit, it offloads the remaining layers to the CPU.
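To get a feel for the scale, here is a back-of-the-envelope sketch in Python. Every figure in it is an assumption for illustration, not a real Qwen3-235B measurement; the point is only that weight memory grows linearly with the number of layers you put on the GPUs.

```python
# Back-of-the-envelope estimate of weight VRAM. All numbers below are
# illustrative assumptions, not real Qwen3-235B figures.
model_size_gib = 140      # assumed size of the quantized weights
n_layers = 94             # assumed number of transformer layers

per_layer_gib = model_size_gib / n_layers
print(f"~{per_layer_gib:.2f} GiB per layer")
print(f"80 layers on GPU would need ~{80 * per_layer_gib:.0f} GiB of VRAM in total")
```

If that total exceeds what your GPUs have free after the KV cache and buffers are reserved, the leftover layers ride along in the CPU hiker’s slow backpack.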
2. KV Cache (grows with `--ctx-size` × #layers)
This is dynamic and memory-hungry. Every token in the context adds keys and values that must be cached at every layer:
- More context = more keys and values to store
- A bigger `--batch-size` doesn’t grow the cache itself, but it does inflate the temporary compute buffers used during prompt processing
Apart from the weights, the KV cache is usually the single biggest contributor to VRAM usage during long prompts and completions.
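As a rough formula, the cache stores one key vector and one value vector per token, per layer, per KV head. Here is a minimal sizing sketch; the hyperparameters are placeholders, not confirmed Qwen3-235B values.

```python
# Minimal KV-cache sizing sketch for a standard transformer with grouped-query
# attention. Hyperparameters are placeholders, not confirmed Qwen3-235B values.
def kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # 2 = one key tensor + one value tensor; bytes_per_value=2 assumes fp16
    return 2 * ctx * n_layers * n_kv_heads * head_dim * bytes_per_value

for ctx in (8192, 32768):
    gib = kv_cache_bytes(ctx, n_layers=94, n_kv_heads=4, head_dim=128) / 2**30
    print(f"ctx={ctx}: ~{gib:.1f} GiB of KV cache")
```

With these assumed numbers the cache grows from roughly 1.5 GiB at an 8K context to almost 6 GiB at 32K, before a single extra layer of weights is loaded.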
3. Temporary Buffers
These are needed for intermediate operations like activations, attention weights, etc. They’re small compared to the others but still contribute.
📉 Why VRAM Usage Sometimes Drops
Let’s say you run the following config:
```
--ctx-size 32768
--batch-size 512
--n-gpu-layers 24
--tensor-split 3,3,3,3,3,3,3,3
```
You’d expect high VRAM usage, but you only see 40% usage per GPU. Why?
Answer: because Ollama prioritizes safety and cache room.
- The huge `--ctx-size` and `--batch-size` mean Ollama reserves a lot of VRAM up front for the KV cache and compute buffers.
- It keeps the number of GPU layers low so generation can’t run out of space partway through.
- The “unused” VRAM looks idle, but much of it is reserved headroom, especially for long outputs.
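Here is a toy model of that trade-off, not Ollama’s actual allocator: reserve the KV cache and buffers first, then see how many layers there is room left for. All figures are assumptions.

```python
# Toy model of the trade-off (not Ollama's real allocator): the more VRAM
# reserved for the KV cache, the fewer layers fit. All figures are assumed.
def layers_that_fit(vram_gib, kv_reserve_gib, buffer_gib=2.0, per_layer_gib=1.5):
    return max(0, int((vram_gib - kv_reserve_gib - buffer_gib) / per_layer_gib))

for ctx, kv_gib in [(8192, 1.5), (32768, 6.0)]:   # assumed per-GPU reservations
    print(f"ctx={ctx}: ~{layers_that_fit(24, kv_gib)} layers fit on a 24 GiB GPU")
```

Shrink the context and the same GPU suddenly has room for several more layers, which is exactly the lever the tuning tips below pull.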
🧠 Deep Dive: Attention and the KV Cache
Every time the model predicts a new token, it performs “self-attention”: comparing this token to every other token that came before it.
So with a context size of 32,768, processing a full prompt means computing attention scores over up to a 32,768 × 32,768 grid of token pairs, and, more importantly for VRAM, caching the keys and values of every one of those tokens. Multiply that by 80 to 100 layers and you get gigabytes of memory just for remembering what the model has already seen.
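To put a number on that, here is an illustrative sketch (assuming fp16, i.e. 2 bytes per value) of the attention-score matrix, which backends compute in chunks rather than storing whole; the memory that actually persists in VRAM is the KV cache from the earlier sketch.

```python
# Illustrative only, assuming fp16 (2 bytes per value).
ctx = 32768
score_matrix_gib = ctx * ctx * 2 / 2**30   # one head's full score matrix
print(f"{ctx} x {ctx} score matrix: ~{score_matrix_gib:.0f} GiB per head "
      "(computed in chunks, never stored whole)")
# What persists for the whole request is the KV cache, sized as shown earlier.
```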
🧪 How to Max Out GPU Efficiency
- Reduce `--ctx-size` to 8192 or 4096
- Increase `--n-gpu-layers` gradually
- Monitor with `nvidia-smi` and watch for out-of-memory (OOM) errors (see the monitoring sketch after this list)
- Fine-tune `--tensor-split` if one GPU has more VRAM than the others
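If you’d rather watch per-GPU memory from a script than eyeball `nvidia-smi`, a small sketch using the `nvidia-ml-py` bindings (the `pynvml` module) could look like this; treat it as a starting point, not anything Ollama-specific.

```python
# Print per-GPU memory usage via NVML (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):   # older bindings return bytes
            name = name.decode()
        print(f"GPU {i} ({name}): {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB")
finally:
    pynvml.nvmlShutdown()
```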
Example Optimal Config
```
--ctx-size 8192
--batch-size 64
--n-gpu-layers 80
--tensor-split 12,12,12,12,12,12,12,16
```
This balances layer load across GPUs while keeping enough memory for a medium-length prompt.
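As a rough illustration of what those `--tensor-split` ratios imply (the exact layer assignment is up to Ollama/llama.cpp internals, so treat this as ratio math, not a guarantee):

```python
# Ratio math only: how the example --tensor-split values divide 80 GPU layers.
splits = [12, 12, 12, 12, 12, 12, 12, 16]
n_gpu_layers = 80

total = sum(splits)
shares = [n_gpu_layers * s / total for s in splits]
print([f"{s:.1f}" for s in shares])   # ~9.6 layers each, ~12.8 on the last GPU
```

The GPU with the larger share simply carries a few more layers, which is what you want when it has more free VRAM than the rest.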
🔚 Closing Thoughts
Understanding how VRAM is used in Ollama lets you avoid slowdowns, crashes, and wasted capacity. If your model feels sluggish or your GPUs look underused, try tweaking `--ctx-size` and `--n-gpu-layers` in tandem.
The key is to treat your GPUs like hikers: don’t overload them, but don’t let them coast either.
Happy inferencing!