How Large Language Models Work: A Beginner's Guide
In recent years, large language models (LLMs) like ChatGPT have taken the world by storm. From writing essays to generating code, these AI systems are becoming increasingly capable. But how do they actually work? Let’s break it down in simple, developer-friendly terms.
What is Machine Learning?
Machine learning is a type of AI that learns to map inputs to outputs (A → B). Here are some examples:
| Input (A) | Output (B) | Application |
|---|---|---|
| Email text | Spam or not? | Spam filtering |
| Audio | Text transcript | Speech recognition |
| English sentence | Chinese sentence | Machine translation |
| Ad + user info | Click or not? | Online advertising |
| Image + radar | Position of cars | Self-driving cars |
| Phone image | Defect or not? | Visual inspection |
Supervised learning — learning from labeled input-output pairs — also lies at the heart of generative AI systems like ChatGPT.
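The A → B idea can be made concrete with a toy example. Below is a minimal sketch of supervised learning: a tiny keyword-count "spam filter" trained on made-up labeled examples (all data and the scoring rule are illustrative, not how production filters work):

```python
from collections import Counter

# Labeled examples: input text (A) -> label (B). All made up for illustration.
training_data = [
    ("win a free prize now", "spam"),
    ("limited offer claim your prize", "spam"),
    ("meeting moved to 3pm", "not spam"),
    ("lunch tomorrow?", "not spam"),
]

# "Training": count how often each word appears under each label.
word_counts = {"spam": Counter(), "not spam": Counter()}
for text, label in training_data:
    word_counts[label].update(text.split())

def classify(text):
    # "Prediction": pick the label whose training words overlap most with the input.
    scores = {label: sum(counts[w] for w in text.split())
              for label, counts in word_counts.items()}
    return max(scores, key=scores.get)

print(classify("claim your free prize"))  # -> "spam"
```

The mechanism is crude, but the shape is the same as in LLM training: learn a mapping from inputs to outputs using labeled examples.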
What Is a Large Language Model?
A large language model is an AI system trained to understand and generate human language. At its core, it does one thing: predict the next word given the words before it.
Over billions of examples, it learns the patterns of how language flows — grammar, reasoning, facts, tone, and more.
Learning by Prediction: The Core Idea
Take the sentence:
“My favorite drink is lychee bubble tea.”
From this one sentence, training derives a series of input-output pairs:
- Input: "My favorite drink" → Output: "is"
- Input: "My favorite drink is" → Output: "lychee"
- Input: "My favorite drink is lychee" → Output: "bubble"
- Input: "My favorite drink is lychee bubble" → Output: "tea"
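Deriving these pairs is mechanical. Here is a short sketch that splits a sentence into (prefix, next word) training pairs; real LLMs operate on subword tokens rather than whole words, but the idea is the same:

```python
# Derive (prefix, next-word) training pairs from a single sentence.
# Word-level for simplicity; real LLMs use subword tokenization.
sentence = "My favorite drink is lychee bubble tea"
words = sentence.split()

pairs = [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]
for prefix, next_word in pairs:
    print(f"{prefix!r} -> {next_word!r}")
```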
This process is repeated billions of times using text from books, websites, conversations, code, and more. The model learns language patterns through sheer volume of examples.
Supervised Learning: How LLMs Are Trained
The technique used to train LLMs is supervised learning: learning from labeled examples. In this case, the “label” is the correct next word.
The model is shown a phrase and asked to predict the next word. If it’s wrong, it adjusts its internal parameters slightly. After repeating this billions of times, it becomes very good at predicting natural language continuations.
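That "predict, compare, adjust" loop can be sketched with a toy model whose "parameters" are just a table of scores. This is not how real LLMs work internally (they use neural networks trained with gradient descent), but it shows the same feedback loop: predict, and nudge the parameters when wrong.

```python
from collections import defaultdict

# Toy corpus: consecutive word pairs serve as (context, correct next word).
corpus = "my favorite drink is lychee bubble tea".split()
pairs = [(corpus[i], corpus[i + 1]) for i in range(len(corpus) - 1)]

scores = defaultdict(float)  # the model's "parameters": a score per (context, word)
vocab = set(corpus)

for epoch in range(5):  # repeat the loop many times
    for context, target in pairs:
        # Predict: the highest-scoring candidate word for this context.
        predicted = max(vocab, key=lambda w: scores[(context, w)])
        if predicted != target:
            # Wrong: adjust the parameters slightly toward the correct answer.
            scores[(context, target)] += 0.1
            scores[(context, predicted)] -= 0.1

# After training, the model predicts the right continuation.
predicted = max(vocab, key=lambda w: scores[("bubble", w)])
print(predicted)  # -> "tea"
```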
Why LLMs Are So Powerful: Scale
Two factors have made LLMs dramatically more capable than earlier AI systems:
- More data: Vast amounts of digital text from across the internet.
- Bigger models: Advances in computing power allow training much larger neural networks.
Unlike older AI systems that plateau in performance, LLMs keep improving as you add more data and increase model size. This is the scaling hypothesis — and it’s changed the entire field of AI.
The Role of Neural Networks
LLMs use deep learning and neural networks to understand language. A neural network is a computational model loosely inspired by how the human brain processes information. It learns complex patterns and relationships between words and concepts.
Modern LLMs have billions to trillions of parameters — the internal settings the model adjusts during training to get better at predictions.
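To make "parameters" concrete, here is a back-of-the-envelope count for a tiny fully connected network (the layer sizes are arbitrary toy numbers, nothing like a real LLM's architecture):

```python
# Parameters are the adjustable numbers in the network: weights and biases.
layer_sizes = [512, 2048, 512]  # input -> hidden -> output (toy sizes)

params = 0
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    params += n_in * n_out + n_out  # weight matrix + bias vector per layer

print(f"{params:,} parameters")  # ~2.1 million for this toy network
```

Even this toy network has about two million parameters; modern LLMs scale the same counting exercise up to billions or trillions.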
Prompting the Model: How ChatGPT Responds
Once trained, an LLM takes a prompt (input text) and generates a continuation:
Prompt: “The capital of France is”
Completion: “Paris”
Because the model has seen so many examples, it can produce coherent, contextually relevant, and detailed responses.
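Generation works autoregressively: the model predicts one token, appends it to the prompt, and repeats. The sketch below fakes the model with a hard-coded lookup (`predict_next` is a stand-in; in a real LLM it is the trained neural network), but the loop structure is the real one:

```python
# Stand-in for a trained model: maps the text so far to the next token.
completions = {
    "The capital of France is": "Paris",
    "The capital of France is Paris": ".",
}

def predict_next(text):
    return completions.get(text, "<end>")

prompt = "The capital of France is"
while True:
    token = predict_next(prompt)
    if token == "<end>":
        break
    prompt += " " + token  # append the prediction and predict again

print(prompt)  # -> "The capital of France is Paris ."
```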
Beyond Prediction: Fine-Tuning and Safety
Base LLMs are great at predicting text, but they need extra work to be helpful and safe assistants. After initial training, developers fine-tune models using:
- Instruction tuning: Training on examples of following user instructions.
- Reinforcement Learning from Human Feedback (RLHF): Human raters score responses, and the model learns to generate more preferred answers.
- Safety layers and content filters: Preventing harmful, biased, or misleading outputs.
These steps help ensure the model is not just capable, but also responsible and aligned with human values.
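For a rough sense of what fine-tuning data looks like, here are made-up examples in a plausible shape; actual formats vary between labs, and these field names are illustrative only:

```python
# Instruction tuning: (instruction, input, desired output) examples. Made up.
instruction_example = {
    "instruction": "Summarize this email in one sentence.",
    "input": "Hi team, the launch has moved from Friday to Monday because...",
    "output": "The launch has been postponed from Friday to Monday.",
}

# RLHF preference data: human raters mark the better of two responses. Made up.
preference_example = {
    "prompt": "Explain recursion to a beginner.",
    "chosen": "Recursion is when a function solves a problem by calling itself on a smaller piece of it...",
    "rejected": "Recursion is recursion.",
}

print(instruction_example["output"])
```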
Key Takeaways
- LLMs are trained to predict the next word using supervised learning.
- They become powerful through scale — more data and larger neural networks.
- Fine-tuning transforms a raw language model into a useful, instruction-following assistant.
- At the core of every LLM is a simple but powerful idea: learn from data to understand and generate human language.
Large language models are transforming how we interact with technology — and understanding how they work gives you a foundation for building with them intelligently.