Deep Dive into LLMs like ChatGPT
A comprehensive introduction to how large language models work, covering the entire pipeline from internet data collection through tokenization, neural network training, and inference.
Andrej Karpathy walks through the complete pipeline of building an LLM like ChatGPT, aimed at a general audience. The video demystifies what happens behind the text box where you type your prompts.
The Pre-training Pipeline
Data Collection
Building an LLM starts with downloading and processing the internet. The FineWeb dataset exemplifies this process: roughly 44 terabytes of filtered text derived from Common Crawl, which has been crawling the web since 2007 and has archived around 2.7 billion pages.
The filtering pipeline removes:
- Malware, spam, and adult content (URL filtering)
- HTML markup, keeping only text (text extraction)
- Pages not confidently classified as English, using a 65% language-score threshold (language filtering)
- Personally identifiable information like addresses and SSNs
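A rough sketch of what one pass over a page might look like in Python. The fastText language-ID model and the 0.65 English threshold follow the FineWeb write-up; the blocklist, the regex-based text extraction, and the PII scrub are deliberately toy stand-ins for far more careful production steps:

```python
# Toy filtering pass. fastText's lid.176.bin is a real language-ID model;
# everything else here is a simplified stand-in for the real pipeline stages.
import re
import fasttext  # pip install fasttext; lid.176.bin is downloadable from fasttext.cc

lang_model = fasttext.load_model("lid.176.bin")
BLOCKLIST = {"spam.example", "malware.example"}  # stand-in URL blocklist

def keep_page(url: str, html: str) -> str | None:
    """Return cleaned text if the page survives every filter, else None."""
    if any(domain in url for domain in BLOCKLIST):           # URL filtering
        return None
    text = re.sub(r"<[^>]+>", " ", html)                     # crude text extraction
    labels, probs = lang_model.predict(text.replace("\n", " "))
    if labels[0] != "__label__en" or probs[0] < 0.65:        # language filtering
        return None
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]", text)   # toy PII scrub
```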
Tokenization
Neural networks need a finite set of symbols in a one-dimensional sequence. Raw text gets converted through byte pair encoding:
- Start with raw UTF-8 bytes (256 possible values)
- Find common consecutive byte pairs
- Merge them into new symbols
- Repeat until the vocabulary reaches ~100,000 tokens
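A toy version of that merge loop in Python (the training text and the number of merges are made up; real tokenizers run over huge corpora and perform on the order of 100,000 merges):

```python
# Toy byte pair encoding: start from UTF-8 bytes, repeatedly merge the most
# frequent adjacent pair into a brand-new symbol.
from collections import Counter

def most_common_pair(ids):
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)   # replace the pair with the new symbol
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("hello hello world".encode("utf-8"))   # raw bytes: values 0..255
for new_id in range(256, 259):                    # real tokenizers do ~100,000 merges
    pair = most_common_pair(ids)
    ids = merge(ids, pair, new_id)
    print(f"merged {pair} -> {new_id}, sequence is now {len(ids)} symbols long")
```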
GPT-4's tokenizer has a vocabulary of 100,277 tokens; "Hello world" maps to just two of them. The Tiktokenizer web app lets you explore how any text maps to tokens.
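To poke at the same encoding locally rather than in the browser, the open-source tiktoken library exposes the cl100k_base tokenizer used by GPT-4 (reaching for tiktoken here is my own choice of tool, not something the video walks through):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")     # the GPT-4 tokenizer
tokens = enc.encode("Hello world")
print(tokens)                                  # two token ids
print(enc.n_vocab)                             # vocabulary size: 100,277
print([enc.decode([t]) for t in tokens])       # the text chunk behind each token
```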
Neural Network Training
The network learns statistical patterns of how tokens follow each other. Training works by:
- Taking random windows of tokens from the dataset (up to about 8,000 tokens long)
- Predicting which token comes next
- Comparing prediction to actual next token
- Adjusting parameters to increase correct token's probability
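The shape of this objective is easy to write down. Below is a toy sketch in PyTorch, not code from the video: the "model" is just a single table of parameters mapping the current token straight to next-token logits, and all sizes are made up. Real LLMs put a deep Transformer between input and output instead.

```python
# One toy training step: predict token t+1 from token t, then nudge the parameters.
import torch
import torch.nn.functional as F

vocab_size, window, batch = 1000, 32, 8
logits_table = torch.randn(vocab_size, vocab_size, requires_grad=True)  # the "knobs"
opt = torch.optim.AdamW([logits_table], lr=1e-2)

data = torch.randint(0, vocab_size, (10_000,))   # stand-in for a tokenized dataset

# take random windows and line up each position with the token that follows it
ix = torch.randint(0, len(data) - window - 1, (batch,))
inputs  = torch.stack([data[i : i + window] for i in ix])          # tokens t
targets = torch.stack([data[i + 1 : i + window + 1] for i in ix])  # tokens t+1
logits = logits_table[inputs]                                      # (batch, window, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()   # which way should each knob move to raise the correct token's probability?
opt.step()        # move the knobs a little in that direction
```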
Modern networks have billions of parameters, the knobs that get tuned during training. The Transformer architecture passes input tokens through stacked layers of attention and multi-layer perceptrons and outputs a probability distribution over the next token.
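To make that forward pass concrete, here is a miniature sketch in PyTorch that uses a single off-the-shelf encoder layer with a causal mask as a stand-in for a real many-layer Transformer; the sizes are toy values, and real models also add positional information:

```python
# Miniature "forward pass": token ids in, a probability distribution over the vocabulary out.
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64

embed = nn.Embedding(vocab_size, d_model)   # token id -> vector
block = nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=256, batch_first=True)
head  = nn.Linear(d_model, vocab_size)      # vector -> logits over the vocabulary

tokens = torch.randint(0, vocab_size, (1, 16))                            # a window of 16 token ids
causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))   # attend only to the past
hidden = block(embed(tokens), src_mask=causal)                            # attention + MLP layers
probs  = torch.softmax(head(hidden), dim=-1)                              # (1, 16, vocab_size)
print(probs[0, -1].topk(5))   # 5 most likely next tokens (untrained, so arbitrary)
```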
Inference
Generating text means sampling from the model's learned distributions:
- Start with input tokens
- Get probability distribution over all possible next tokens
- Sample one token (flip a biased coin)
- Append sampled token to input
- Repeat
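A sketch of that loop in PyTorch, with an untrained random stand-in where a real trained network would go so the snippet runs on its own:

```python
# Minimal sampling loop. `model` is anything that maps (1, seq_len) token ids
# to (1, seq_len, vocab_size) logits; here a random stand-in keeps it runnable.
import torch

vocab_size = 1000
model = lambda ids: torch.randn(1, ids.size(1), vocab_size)   # stand-in for a trained network

@torch.no_grad()
def generate(tokens, num_new_tokens):
    for _ in range(num_new_tokens):
        logits = model(tokens)                                 # forward pass
        probs = torch.softmax(logits[:, -1, :], dim=-1)        # distribution for the next token
        next_token = torch.multinomial(probs, num_samples=1)   # the biased coin flip
        tokens = torch.cat([tokens, next_token], dim=1)        # append and repeat
    return tokens

prompt = torch.randint(0, vocab_size, (1, 5))                  # stand-in input tokens
print(generate(prompt, num_new_tokens=10))
```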
This stochastic process produces text that's statistically similar to training data but not identical—remixes rather than copies.
Key Insight
When you use ChatGPT, you're doing inference on a model trained months ago. The parameters stay fixed: you provide tokens, and it completes the sequence based on patterns learned during training.
Related
- andrej-karpathy-were-summoning-ghosts-not-building-animals - Karpathy's perspective on what LLMs actually are