Local LLMs 101: What Really Happens When You Run an AI Model on Your Own Machine

You’re a developer. You write code, you build projects, and lately, you’ve seen people running these “local LLMs” on their desktops — Mistral, LLaMA, Gemma — tossing around terms like quantization and KV cache as if they’re obvious.

So you try it. You download 10 gigabytes of something, run a script, and… it works. But what actually happened?

You didn’t just run a black box. You just made your GPU perform a small miracle. Let’s open that box and see what’s inside.

1. What Does “Running a Model” Actually Mean?

When you “run” a model, you’re doing inference — which is a fancy way of saying, “use what the model already knows to make predictions.”

There’s no learning happening now. That already happened during training, months ago, on hundreds of GPUs and mountains of text. Inference is just the playback phase — the part where your model shows what it’s learned.

You type something like:

“Explain quantum physics like I’m five.”

The model takes that text, breaks it into little chunks, and starts guessing what comes next, one token at a time — like finishing your sentences, but at machine speed.

Running a model = guess next token → add it → repeat until you stop.

It’s all just prediction — the world’s most sophisticated autocomplete.
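Here’s the whole idea as a few lines of sketch-level Python. Everything in it is a stand-in (there’s no real `model` or `tokenize` here); it just shows the shape of the loop:

```python
# Conceptual sketch only: model, tokenize, and detokenize are stand-ins, not a real library.
def run_model(prompt, model, tokenize, detokenize, max_new_tokens=200, stop_id=2):
    tokens = tokenize(prompt)                    # text -> list of token IDs
    for _ in range(max_new_tokens):
        next_token = model.predict_next(tokens)  # guess the most likely next token
        tokens.append(next_token)                # add it to the sequence
        if next_token == stop_id:                # repeat until a stop signal
            break
    return detokenize(tokens)                    # token IDs -> text
```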

2. A Model Is More Than One File

A model isn’t a single magic .bin file. It’s a small orchestra of moving parts:

  • Weights: Billions of numbers — the model’s “knowledge.”
  • Architecture: The wiring diagram — how information flows layer by layer.
  • Tokenizer: The translator that turns your words into numerical pieces called tokens.
  • Config: The instruction manual describing shapes, roles, and settings.

If the model were a human brain:

  • The architecture is the anatomy.
  • The weights are the memories and reflexes.
  • The tokenizer is language comprehension.
  • The config is the medical chart that says how it’s all assembled.
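You can peek at the “instruction manual” yourself. A minimal sketch with Hugging Face transformers (the model name is just an example; any model on the Hub works the same way):

```python
from transformers import AutoConfig

# Downloads only the config (a small JSON file), not the multi-gigabyte weights.
cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")  # example model
print(cfg.num_hidden_layers, cfg.hidden_size, cfg.vocab_size)           # e.g. 32, 4096, 32000
```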

3. Tokens: The Model’s Alphabet Soup

Tokens aren’t words. They’re fragments — slices of text the model actually sees.

“hello” might be one token or two.
“internationalization” could be six.
“🐍” might take more than one.

The tokenizer decides how text is split. It’s like using a weird paper shredder that chops language into chunks the model understands.

And the model doesn’t see text — it sees numbers. Each token becomes an integer ID that points to a learned embedding in its vocabulary.

Think of it as Morse code for machines: it’s not about letters, it’s about patterns of meaning.
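You can watch the shredder at work. A small sketch with a Hugging Face tokenizer (the model name is just an example; different models split text differently):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")  # example model
print(tok.tokenize("internationalization"))           # several sub-word fragments
print(tok.encode("hello", add_special_tokens=False))  # integer IDs, not text
```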

4. Context Window: The Model’s Short-Term Memory

A model can’t remember forever. Its context window defines how many tokens it can “see” at once — usually 2,000, 8,000, 32,000, or more.

When the conversation gets longer than that, older tokens fall off the edge. It’s like a chalkboard — every new sentence pushes the old ones out of view.

Each token you add also grows the KV cache, the model’s internal memory of attention states. And that eats GPU memory fast. Rough rule of thumb: around 0.5 MB per token on a 7B model.

So yes — chatting with your model literally fills up its brain until it can’t think anymore.
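Where does that 0.5 MB figure come from? A back-of-the-envelope sketch for an assumed LLaMA-7B-like shape (32 layers, 32 KV heads, head size 128, FP16 cache; models with grouped-query attention keep fewer KV heads and need less):

```python
# Per-token KV-cache size for an assumed LLaMA-7B-like shape.
layers, kv_heads, head_dim = 32, 32, 128
bytes_per_value = 2                                              # FP16
per_token = 2 * layers * kv_heads * head_dim * bytes_per_value   # 2 = one key + one value per layer
print(per_token / 1024**2)          # 0.5 MB per token
print(per_token * 4096 / 1024**3)   # a full 4K-token chat: ~2 GB of cache
```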

5. Step by Step: What Happens During Generation

  1. Take the prompt (in tokens).
  2. Feed it through all layers of the model.
  3. The model calculates a probability for every possible next token.
  4. It picks one (based on decoding strategy).
  5. Adds it to the sequence.
  6. Repeats until a stop signal.

Every single token is generated one at a time. It’s like a pianist playing one note after another — except each note depends on every note played so far.

That’s why long responses take longer: each new token has to attend to everything generated so far, so every extra word costs a little more than the last.
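Here’s that loop made concrete with Hugging Face transformers, written out by hand so you can see each step (the model name is an example; in practice `model.generate()` does all of this for you, including managing the KV cache):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-Instruct-v0.2"  # example model; use whatever fits your GPU
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

ids = tok("Explain quantum physics like I'm five.", return_tensors="pt").input_ids.to(model.device)
past = None
for _ in range(100):
    # First pass: feed the whole prompt. After that: only the newest token, thanks to the KV cache.
    out = model(input_ids=ids if past is None else ids[:, -1:], past_key_values=past, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy: take the most likely token
    ids = torch.cat([ids, next_id], dim=-1)                      # add it to the sequence
    if next_id.item() == tok.eos_token_id:                       # stop signal
        break
print(tok.decode(ids[0], skip_special_tokens=True))
```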

6. Inside the Transformer: The Secret Engine of Language

At the core of every modern LLM is the Transformer — a structure built to handle sequences.

You can imagine it as a massive factory with multiple conveyor belts (layers), each refining your sentence into a more detailed understanding.

Each layer contains:

  • Self-Attention: Figures out which previous tokens matter right now. Like your brain deciding which part of a paragraph to focus on.
  • Feed-Forward (MLP): Adds nonlinear reasoning and pattern recognition.
  • RoPE (Rotary Position Embeddings): A mathematical trick that tells the model where each token is in the sequence — so “dog bites man” ≠ “man bites dog.”

A 7B model has around 32 layers; a 70B model closer to 80. Each layer is like a microscope zooming in, seeing relationships the last one missed.
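If you want to see the factory floor, here’s a heavily simplified sketch of one decoder layer in PyTorch. It leaves out details real models rely on (RoPE inside the attention, causal masking, RMSNorm instead of LayerNorm, grouped-query attention, KV caching), but the attention-plus-feed-forward-plus-residuals shape is the real thing:

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Simplified pre-norm decoder layer; real LLMs add RoPE, causal masks, and more."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp_norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h)    # self-attention: which earlier tokens matter now?
        x = x + attn_out                    # residual connection keeps the original signal
        x = x + self.mlp(self.mlp_norm(x))  # feed-forward: nonlinear pattern recognition
        return x
```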

7. Model Size vs. VRAM: The Eternal Struggle

Model names like 7B, 13B, and 70B refer to how many parameters (weights) they have — billions of them.

More parameters = smarter model, but also more memory and compute.

Your GPU must fit both:

  • Model weights (the bulk of it)
  • KV cache (grows with your conversation length)

Example:
7B model in FP16 precision → ~14 GB
Same model in 4-bit quantization → ~3.5 GB
Add 2 GB for KV cache during a long chat, and you’re suddenly near your GPU limit.

If VRAM is the kitchen counter, the model is a giant bowl of noodles. You can’t cook more than what fits.
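The arithmetic behind those numbers is simple enough to do yourself. A rough sketch (weights only; the KV cache, activations, and framework overhead come on top):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Rough estimate of how much memory the weights alone need."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

print(weight_memory_gb(7, 16))  # FP16 7B  -> ~13 GiB (the "~14 GB" figure above)
print(weight_memory_gb(7, 4))   # 4-bit 7B -> ~3.3 GiB (~3.5 GB once you add quantization overhead)
```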

8. Quantization: Making Big Brains Fit Small GPUs

Quantization is how we shrink huge models to fit smaller hardware. It reduces the precision of each weight — from 16 bits down to 8 or even 4.

You can think of it like compressing an image: the pixels get blurrier, but the shape is still there. The model “thinks” roughly the same, just with slightly fuzzier math.

  • FP16 / BF16: Full precision, highest quality.
  • INT8: Half the size, minimal quality loss.
  • INT4 / NF4: Quarter the size, still surprisingly good.

For most developers, 4-bit quantization is the sweet spot: You lose a tiny bit of accuracy, but your GPU stops crying.
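In practice you rarely quantize anything by hand; you either download a pre-quantized file (GGUF, GPTQ, EXL2) or let a library do it on the fly. A sketch of on-the-fly 4-bit loading with transformers and bitsandbytes, assuming both are installed and using an example model name:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the actual math in 16-bit
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",   # example model
    quantization_config=bnb,
    device_map="auto",
)
```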

9. Decoding: How Models Choose Their Words

Once the model has probabilities for the next token, how does it choose one?

  • Greedy: Always pick the top token. (Robot mode.)
  • Temperature: Add randomness; higher = more varied and creative, lower = more predictable and focused.
  • Top-k / Top-p: Pick from a subset of likely options, not just the top one.
  • Repetition penalties: Stop it from looping forever.

It’s like DJ’ing for a language model — you’re tuning how wild or predictable its rhythm feels.
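Runtimes expose these as knobs (temperature, top_p, repetition_penalty), but the mechanics are short enough to sketch by hand. A minimal temperature + top-p sampler over a 1-D logits vector:

```python
import torch

def sample_next(logits: torch.Tensor, temperature: float = 0.8, top_p: float = 0.9) -> int:
    """Pick the next token ID from a 1-D logits vector using temperature + nucleus sampling."""
    probs = torch.softmax(logits / temperature, dim=-1)        # temperature reshapes the distribution
    sorted_probs, sorted_ids = probs.sort(descending=True)
    keep = sorted_probs.cumsum(dim=-1) - sorted_probs < top_p  # smallest set covering top_p of the mass
    keep[0] = True                                             # always keep at least the top token
    filtered = sorted_probs * keep
    choice = torch.multinomial(filtered / filtered.sum(), num_samples=1)
    return sorted_ids[choice].item()
```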

10. What Happens When You “Load a Model”

  1. It reads the config (architecture, shapes, vocab).
  2. Loads weights into VRAM.
  3. Initializes the tokenizer.
  4. Performs a “warmup” pass (first run is slower).
  5. Starts inference — generating one token at a time.

The first output you see? That’s the moment your GPU becomes a brain. Every token after that is just math echoing through a billion tiny circuits.
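Here’s what those five steps look like with llama-cpp-python, which hides most of them behind one constructor (the model path below is a placeholder for whatever GGUF file you’ve downloaded):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/your-model-q4.gguf",  # placeholder path to a quantized GGUF file
    n_ctx=4096,                              # context window
    n_gpu_layers=-1,                         # offload all layers to the GPU if they fit
)
out = llm("Explain quantum physics like I'm five.", max_tokens=128, temperature=0.7)
print(out["choices"][0]["text"])
```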

11. Serving Models: Turning Math Into APIs

Once you can run inference, you can serve your model for others (or yourself):

  • vLLM: High-throughput, production-grade, perfect for chatbots.
  • llama.cpp: Lightweight and portable, runs on laptops.
  • ExLlama v2/v3: Great for 4-bit LLaMA models.
  • FastAPI / Flask: Wrap your local model into an HTTP endpoint.

Think of this like turning your local GPU into a mini OpenAI server — same power, zero API fees.
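The simplest version of “serving” is a thin HTTP wrapper. A minimal FastAPI sketch, assuming you already have some `generate(prompt, max_new_tokens)` helper of your own (hypothetical here) wrapping the loop from earlier:

```python
from fastapi import FastAPI
from pydantic import BaseModel

from my_model import generate  # hypothetical helper: your own wrapper around the generation loop

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 128

@app.post("/generate")
def generate_endpoint(req: GenerateRequest):
    return {"text": generate(req.prompt, max_new_tokens=req.max_tokens)}

# Run with: uvicorn server:app --port 8000
```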

12. Tradeoffs Everywhere

| Decision | Win | Cost |
| --- | --- | --- |
| Bigger model | More knowledge, better reasoning | More VRAM & slower |
| Longer context | More memory of conversation | More latency & heat |
| Heavier quantization | Smaller size, faster | Some loss of nuance |
| Running locally | Full control, privacy | Setup effort & maintenance |

There’s no “best,” only “best for your use case.”

13. Why Run Locally?

Because you want control, privacy, and speed.

  • No one else sees your data.
  • You can tune decoding, context, precision — everything.
  • No per-token billing.
  • Instant response, no network lag.

Cloud AI is like renting a supercar — impressive but expensive. Local AI is like building your own garage. Harder upfront, but it’s yours forever.

14. Common Pitfalls

  • Out of Memory (OOM): Model too big or context too long. Quantize or shrink.
  • Weird gibberish: You used a base model with a chat prompt, or missed the chat template.
  • Slow output: Offloading to CPU, missing drivers, or no FlashAttention.
  • Unsafe downloads: Random .bin files — stick to verified safetensors.

Running local LLMs is a bit like overclocking a GPU — you learn by crashing.

15. Chat Templates: The Hidden Formatting Rule

Most chat-tuned models don’t understand plain text. They expect messages in a structured conversation format, known as a chat template.

Think of it as a screenplay — each character’s name must be labeled, or the model won’t know who’s speaking.


<|system|>
You are a helpful assistant.
<|user|>
Write a haiku about transformers.
<|assistant|>

Each tag is like a colored sticky note:

  • System: sets the stage — tone, instructions, personality.
  • User: delivers input.
  • Assistant: where the model writes the reply.

Skip these, and the model forgets who’s talking — like actors on stage without scripts. Chat templates keep the dialogue coherent.

Base (non-chat) models don’t need these, but chat-tuned models live by them. If your outputs make no sense, check your template first; a mismatched template is like serving the model food it can’t digest.
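You almost never have to type those tags by hand. With Hugging Face transformers, the tokenizer ships the model’s own template and can apply it for you (the model name is an example; the exact tags in the output depend on the model):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # example chat-tuned model
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a haiku about transformers."},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # the properly tagged text the model actually expects
```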

16. Running Local: Quick Checklist

  1. Pick a chat-tuned model that fits your VRAM.
  2. Choose precision: FP16 (quality) or 4-bit (practical).
  3. Install a runtime (vLLM, llama.cpp, PyTorch).
  4. Load the model, check tokens/sec and memory use.
  5. Apply the right chat template.
  6. Tune decoding (temperature, top-p, repetition penalty).
  7. Serve it locally if you like.

That’s it. You’re now running your own AI model — no cloud, no subscription.

17. Glossary (Keep It Simple)

| Term | Meaning |
| --- | --- |
| Token | A chunk of text, represented as an integer. |
| Context window | How many tokens the model can “see.” |
| KV cache | Memory of what’s been said so far. |
| Quantization | Shrinking model weights by lowering precision. |
| RoPE | Math that encodes token order. |
| Transformer | The architecture behind most modern LLMs. |

18. The Big Picture

Running a local LLM isn’t about showing off your GPU. It’s about understanding how all the moving parts—tokens, context, weights, and precision—work together to turn math into language.

Once you grasp those fundamentals, the mystery disappears. What feels like “intelligence” is really a precise system of probabilities running at scale.

From there, it’s all about control. Adjust context size, precision, or temperature, and you can shape how the model behaves. The more you experiment, the clearer it becomes: running an LLM locally is less about power, and more about mastery.