Chapter 5: Evaluation¶

After training our tiny LLM, we need to measure how well it performs. In this chapter we implement two standard evaluation metrics — perplexity and BLEU score — and show how to generate text from the model.

1. Perplexity¶

Perplexity (PPL) is the most common metric for language models. It measures how "surprised" the model is by the test data. Lower is better.

\[\text{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(x_i \mid x_{<i})\right)\]

Intuitively:

PPL = 1 → the model is perfectly confident and always correct
PPL = V (vocabulary size) → the model is random-guessing
A good small LM on English text might achieve PPL ≈ 20–50

Implementation¶

def perplexity(model, token_ids, seq_length=128):
    """
    Compute perplexity of the model on a sequence of token IDs.
    """
    total_log_prob = 0.0
    total_tokens = 0

    for i in range(0, len(token_ids) - seq_length, seq_length):
        input_seq = np.array([token_ids[i : i + seq_length]])
        target_seq = np.array([token_ids[i + 1 : i + seq_length + 1]])

        logits = model.forward(input_seq)            # (1, seq_len, vocab)
        probs = softmax(logits)                       # (1, seq_len, vocab)

        # Gather probabilities of correct tokens
        for t in range(seq_length):
            p = probs[0, t, target_seq[0, t]]
            total_log_prob += np.log(max(p, 1e-9))
            total_tokens += 1

    avg_neg_log_prob = -total_log_prob / total_tokens
    return np.exp(avg_neg_log_prob)

Interpreting Perplexity¶

PPL Range	Interpretation
1–10	Excellent (likely overfitting on small data)
10–50	Good for a small model
50–200	Reasonable for a tiny model on diverse text
200+	Poor — model hasn't learned much

2. BLEU Score¶

BLEU (Bilingual Evaluation Understudy) measures the quality of generated text by comparing it to reference text. It's widely used for machine translation and text generation.

BLEU computes the precision of n-gram overlaps between the generated text and reference:

\[\text{BLEU} = BP \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right)\]

where:

\(p_n\) = modified n-gram precision (clipped by reference counts)
\(w_n = 1/N\) (uniform weights, typically \(N=4\))
\(BP\) = brevity penalty to penalize short outputs:

\[BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}\]

Implementation¶

from collections import Counter

def compute_ngrams(tokens, n):
    """Extract n-grams from a token list."""
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

def bleu_score(reference_tokens, generated_tokens, max_n=4):
    """
    Compute BLEU score between reference and generated token lists.
    """
    if len(generated_tokens) == 0:
        return 0.0

    # Brevity penalty
    c = len(generated_tokens)
    r = len(reference_tokens)
    bp = np.exp(1 - r / c) if c <= r else 1.0

    # N-gram precisions
    log_precisions = 0.0
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(compute_ngrams(reference_tokens, n))
        gen_ngrams = Counter(compute_ngrams(generated_tokens, n))

        # Clipped counts
        clipped = 0
        total = 0
        for ngram, count in gen_ngrams.items():
            clipped += min(count, ref_ngrams.get(ngram, 0))
            total += count

        if total == 0:
            return 0.0

        precision = clipped / total
        if precision == 0:
            return 0.0

        log_precisions += (1.0 / max_n) * np.log(precision)

    return bp * np.exp(log_precisions)

BLEU Score Ranges¶

BLEU	Quality
0.6+	Very high overlap (near-exact match)
0.3–0.6	Good quality
0.1–0.3	Understandable but imperfect
< 0.1	Poor

Note

BLEU was designed for machine translation. For Q&A evaluation, checking whether the expected answer appears in the model's output is often more practical. Perplexity remains the best single metric for overall model quality.

3. Text Generation & Question Answering¶

To evaluate qualitatively, we can generate text from the model. We use autoregressive decoding: starting from a prompt, we predict one token at a time and append it to the sequence.

For question answering, we format the prompt as Q: <question> A: and let the model complete the answer. We stop generation at the first newline (which separates Q&A pairs in our training data).

Answer a Question¶

def answer_question(model, tokenizer, question, max_tokens=60,
                    temperature=0.1, top_k=3):
    prompt = f"Q: {question} A:"
    full = generate(model, tokenizer, prompt,
                    max_tokens=max_tokens,
                    temperature=temperature, top_k=top_k)

    # Extract answer after "A:"
    if " A:" in full:
        answer = full.split(" A:", 1)[1].strip()
    else:
        answer = full[len(prompt):].strip()

    # Stop at newline (next Q&A pair boundary)
    if "\n" in answer:
        answer = answer.split("\n")[0].strip()

    return answer

Sampling Strategies¶

Greedy Decoding¶

Always pick the highest-probability token:

next_token = np.argmax(probs)

Simple but tends to produce repetitive text.

Temperature Sampling¶

Scale logits before softmax to control randomness:

\[P(x) = \text{softmax}(z / T)\]

\(T = 1.0\): normal sampling
\(T < 1.0\): sharper distribution (more confident)
\(T > 1.0\): flatter distribution (more creative)

Top-k Sampling¶

Only sample from the top \(k\) most likely tokens:

top_k_idx = np.argsort(probs)[-k:]
top_k_probs = probs[top_k_idx]
top_k_probs /= top_k_probs.sum()  # renormalize
next_token = np.random.choice(top_k_idx, p=top_k_probs)

Top-p (Nucleus) Sampling¶

Sample from the smallest set of tokens whose cumulative probability exceeds \(p\):

sorted_idx = np.argsort(probs)[::-1]
cumulative = np.cumsum(probs[sorted_idx])
cutoff = np.searchsorted(cumulative, p) + 1
top_p_idx = sorted_idx[:cutoff]
top_p_probs = probs[top_p_idx]
top_p_probs /= top_p_probs.sum()
next_token = np.random.choice(top_p_idx, p=top_p_probs)

Full Generation Function¶

def generate(model, tokenizer, prompt, max_tokens=100,
             temperature=1.0, top_k=50):
    token_ids = tokenizer.encode(prompt)
    # Remove EOS token from encode (we'll add it when done)
    if token_ids[-1] == 3:
        token_ids = token_ids[:-1]

    for _ in range(max_tokens):
        # Use last max_seq_len tokens as context
        context = token_ids[-model.max_seq_len:]
        x = np.array([context])

        logits = model.forward(x)
        next_logits = logits[0, -1, :]  # logits for last position

        # Temperature
        next_logits = next_logits / temperature

        # Softmax
        probs = softmax(next_logits)

        # Top-k
        if top_k > 0:
            top_k_idx = np.argsort(probs)[-top_k:]
            mask = np.zeros_like(probs)
            mask[top_k_idx] = probs[top_k_idx]
            probs = mask / mask.sum()

        # Sample
        next_token = np.random.choice(len(probs), p=probs)

        if next_token == 3:  # EOS
            break

        token_ids.append(next_token)

    return tokenizer.decode(token_ids)

4. Putting It All Together¶

import numpy as np
from model import TinyLLM
from tokenizer import BPETokenizer
from evaluate import perplexity, bleu_score, generate, answer_question

# Load trained model and tokenizer
model = TinyLLM(
    vocab_size=332, d_model=64, n_heads=4,
    n_layers=2, d_ff=256, max_seq_len=64,
)
tokenizer = BPETokenizer(vocab_size=332)

# --- Perplexity ---
test_ids = tokenizer.encode("Q: What sound does a cat make? A: A cat makes a meow sound.")
ppl = perplexity(model, test_ids, seq_length=64)
print(f"Perplexity: {ppl:.2f}")

# --- Question Answering ---
questions = [
    "What sound does a cat make?",
    "What is 2 plus 3?",
    "What is the opposite of hot?",
    "What day comes after Monday?",
]
for q in questions:
    ans = answer_question(model, tokenizer, q)
    print(f"Q: {q}")
    print(f"A: {ans}")

5. What to Expect from Our Tiny Model¶

Our ~121K parameter model trained on a Q&A corpus will:

Perplexity: Around 4 after 120 epochs of training
Q&A accuracy: 5/5 on trained questions; may struggle with unseen questions
Generation: Produce well-formed Q&A answers that follow the training patterns

This is expected! A tiny model can memorize a small Q&A corpus very well but won't generalize to arbitrary questions. Real LLMs use billions of parameters and terabytes of data. The purpose of this tutorial is to understand the mechanics, not to build a production model.

Summary¶

In this chapter we implemented:

Tool	Purpose
`perplexity()`	Measures how well the model predicts test data
`bleu_score()`	Compares generated vs reference text (n-gram overlap)
`generate()`	Autoregressive text generation with temperature and top-k
`answer_question()`	Q&A wrapper: formats prompt and extracts answer

What's Next?¶

Congratulations — you've built an LLM from scratch! Here are some directions to explore:

Scale up: Increase model size, data, and training time
Add dropout: Regularize to prevent overfitting
Implement KV-caching: Speed up generation by caching key/value states
Try different architectures: Add rotary positional embeddings (RoPE), grouped query attention, etc.
Use a real dataset: Train on books, Wikipedia, or code
Port more to C++: Move the entire forward pass to C++ for speed

Happy building! 🚀