Building an LLM from Scratch¶

Welcome to this hands-on tutorial where you'll learn how to build a tiny Large Language Model (LLM) from scratch. This guide is designed for CS undergraduates who want to understand what's really happening inside models like GPT, from the ground up.

What You'll Learn¶

By the end of this tutorial, you will:

Understand the core concepts behind Large Language Models
Know how to collect, clean, and tokenize text data
Implement the Transformer architecture using only Python and NumPy — no PyTorch or TensorFlow
Train a tiny language model on a small dataset
Evaluate your model using standard metrics like perplexity and Q&A accuracy

Prerequisites¶

Python 3.10+ with NumPy installed
Basic knowledge of:
- Linear algebra (matrix multiplication, dot products)
- Calculus (derivatives, chain rule)
- Probability and statistics
- Python programming

Project Structure¶

llm-tutorial/
├── data/                  # Training data
│   └── corpus.txt         # Q&A training corpus
├── docs/                  # Tutorial documentation (this site)
│   ├── index.md
│   ├── 01_introduction.md
│   ├── 02_data_preprocessing.md
│   ├── 03_model_architecture.md
│   ├── 04_training.md
│   └── 05_evaluation.md
├── src/                   # Source code
│   ├── tokenizer.py       # Byte-Pair Encoding tokenizer
│   ├── data_preprocessing.py
│   ├── model.py           # Transformer model in NumPy
│   ├── train.py           # Training loop with dropout
│   ├── evaluate.py        # Evaluation metrics
│   └── utils.py           # Utility functions
├── run_pipeline.py        # End-to-end pipeline script
├── generate_corpus.py     # Q&A corpus generator
├── mkdocs.yml
└── agent.md

Quick Start¶

Run the entire pipeline (data → tokenize → train → evaluate → Q&A) in one command:

python -m venv .venv && source .venv/bin/activate
pip install numpy
python run_pipeline.py              # uses data/corpus.txt by default
python run_pipeline.py --epochs 10  # train longer for better results

How to Use This Tutorial¶

Work through the chapters in order. Each chapter includes:

Conceptual explanations of the theory
Code walkthroughs with detailed comments
Runnable source code in the src/ folder

Let's get started with Chapter 1: Introduction to LLMs!