Training a Document Classifier
1. Neural Networks Learn Through Parameters (θ)
- A neural network is just a stack of mathematical operations using parameters (called θ, theta).
- These parameters = weights that are learned and adjusted during training.
- The goal is to tweak θ so your predictions (ŷ) get closer to the actual labels (y).
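To make θ concrete, here is a minimal sketch (assuming a PyTorch-style setup; the 5,000-feature input and the 4 classes are made-up placeholders) showing that the parameters are just the weight and bias tensors the framework tracks for you:

```python
import torch.nn as nn

# A tiny document classifier: bag-of-words vector in, one raw score per class out.
# The sizes (5,000 input features, 4 classes) are arbitrary placeholders.
model = nn.Linear(in_features=5000, out_features=4)

# θ is simply every learnable tensor the model holds.
for name, p in model.named_parameters():
    print(name, tuple(p.shape))   # weight: (4, 5000), bias: (4,)
```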
2. What Is a Loss Function? (Hint: It Measures Mistakes)
- A loss function measures how far off the model is from the correct answer.
- Think of it like this:
  - High loss = bad predictions
  - Low loss = model doing well
- We don't manually teach the model what's wrong; the loss function tells it where it messed up.
3. Enter Cross-Entropy Loss
- Used for classification tasks, especially when you want the model to pick between multiple categories.
- Based on comparing:
- True distribution (y): The correct class (e.g., "sports")
- Predicted distribution (ŷ): The probabilities the model assigns to each class after softmax
How it works:
- Your model spits out logits (raw scores for each class).
- Apply softmax: turns logits into a probability distribution (all values between 0 and 1, summing to 1).
- Cross-entropy loss measures how well your predicted distribution matches the correct class.
- It punishes confident wrong answers more than unsure ones.
Formula-wise (simplified):

```
Cross-Entropy = -log(P(correct class))
```

- If the model is 90% sure the answer is correct: low loss.
- If it's 10% sure or confident in the wrong class: high loss.
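A quick numeric sketch of that behavior (assuming PyTorch; the logits below are invented purely to contrast a confident-right prediction with a confident-wrong one):

```python
import torch
import torch.nn.functional as F

target = torch.tensor([0])                          # the correct class is index 0

confident_right = torch.tensor([[4.0, 0.0, 0.0]])   # high score on the correct class
confident_wrong = torch.tensor([[0.0, 4.0, 0.0]])   # high score on the wrong class

# cross_entropy applies softmax internally, then takes -log(P(correct class)).
print(F.cross_entropy(confident_right, target))     # small loss (about 0.04)
print(F.cross_entropy(confident_wrong, target))     # large loss (about 4.04)
```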
4. Monte Carlo Sampling
- Fancy phrase for: "When we don't know the full distribution, just average over examples."
- It's how we approximate the "true" loss across a batch of training samples.
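In code that is just a mean over per-example losses; here is a minimal sketch (assuming PyTorch, with random stand-in data so there is something to average):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 4)                  # a batch of 32 predictions over 4 classes
labels = torch.randint(0, 4, (32,))          # their "true" classes

per_example = F.cross_entropy(logits, labels, reduction="none")  # 32 individual losses
batch_estimate = per_example.mean()          # Monte Carlo estimate of the true loss

print(per_example.shape, batch_estimate)
```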
5. Optimization: How the Model Learns
The way we minimize the loss is through:
Gradient Descent
- Iteratively update parameters to reduce loss:
```
θ ← θ - η * ∇Loss
```

- θ = current weights
- η = learning rate (how big a step to take)
- ∇Loss = gradient (slope of the loss function)
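Written out by hand, the update looks like the sketch below (a toy one-parameter loss, not how you would normally train; in practice an optimizer performs this step for you):

```python
import torch

# One parameter and a toy loss: Loss(θ) = (θ - 3)^2, minimized at θ = 3.
theta = torch.tensor(5.0, requires_grad=True)
eta = 0.1                                  # learning rate

loss = (theta - 3) ** 2
loss.backward()                            # fills theta.grad with ∇Loss = 2(θ - 3)

with torch.no_grad():
    theta -= eta * theta.grad              # θ ← θ - η * ∇Loss
    theta.grad.zero_()

print(theta)                               # moved from 5.0 toward 3.0 (now 4.6)
```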
Steps in Practice:
- Forward pass: Run inputs through the model → get predictions → compute loss.
- Backward pass: Calculate gradients using `.backward()`.
- Update parameters: Use an optimizer like `SGD` to move θ in the right direction.
- Repeat.
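Those steps strung together into a loop (a minimal sketch; the model shape, batch, and learning rate are placeholders rather than recommendations):

```python
import torch
import torch.nn as nn

model = nn.Linear(100, 4)                            # toy classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 100)                             # fake batch of document vectors
y = torch.randint(0, 4, (32,))                       # fake labels

for step in range(10):
    logits = model(x)                                # forward pass
    loss = loss_fn(logits, y)                        # compute loss
    optimizer.zero_grad()                            # clear old gradients
    loss.backward()                                  # backward pass: compute gradients
    optimizer.step()                                 # optimizer moves θ downhill
    # ...and repeat
```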
6. Logits → Softmax → Argmax
- Logits = raw model output per class
- Softmax = converts logits to probabilities
- Argmax = picks the class with the highest probability
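The same chain in code (a short sketch; the logits are made up):

```python
import torch

logits = torch.tensor([[2.0, 0.5, -1.0]])       # raw scores for 3 classes
probs = torch.softmax(logits, dim=-1)           # probabilities that sum to 1
predicted = torch.argmax(probs, dim=-1)         # index of the highest probability

print(probs)        # roughly tensor([[0.79, 0.18, 0.04]])
print(predicted)    # tensor([0])
```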
7. Learning Rate Schedulers & Gradient Clipping
- Scheduler: Reduces the learning rate after each epoch (to fine-tune learning).
- Gradient clipping: Prevents gradients from exploding (very large values that destabilize learning).
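Both usually show up as one-liners in the training loop; here is a sketch (the choice of StepLR, the decay factor, and the clipping threshold are arbitrary examples, not prescriptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(100, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Halve the learning rate after every epoch.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

for epoch in range(3):
    logits = model(torch.randn(32, 100))
    loss = F.cross_entropy(logits, torch.randint(0, 4, (32,)))
    optimizer.zero_grad()
    loss.backward()
    # Rescale gradients if their overall norm exceeds 1.0, so one bad batch
    # cannot blow up the update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()                              # learning rate drops once per epoch
    print(epoch, scheduler.get_last_lr())
```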
8. Train / Validation / Test Sets
- Train: Used to learn parameters
- Validation: Used to tune hyperparameters (like learning rate, # of neurons, etc.)
- Test: Final evaluation to check real-world performance
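One common way to carve up a dataset (a sketch using torch.utils.data.random_split; the 80/10/10 split and the fake data are just illustrative choices):

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset: 1,000 "documents" as 100-dim vectors with 4 possible labels.
data = TensorDataset(torch.randn(1000, 100), torch.randint(0, 4, (1000,)))

train_set, val_set, test_set = random_split(data, [800, 100, 100])
print(len(train_set), len(val_set), len(test_set))   # 800 100 100
```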
Recap – Bitty Style
| Concept | What It Means in Simple Terms |
|---|---|
| θ (parameters) | The knobs the model turns to improve itself |
| Loss function | Tells the model how bad it did |
| Cross-entropy | Measures how far off the prediction is from the truth |
| Softmax | Turns scores into probabilities |
| Argmax | Picks the class with the highest probability |
| Gradient descent | The step-by-step update method to reduce errors |
| Optimizer (SGD) | Applies those updates in practice |
| Learning rate | Controls how fast you update the weights |
| Gradient clipping | Keeps things stable when learning gets wild |
| Train/Val/Test | Each has its job in helping the model grow & generalize |
| Monte Carlo sampling | Averaging over samples to estimate things we can't measure exactly |