Training a Document Classifier


🧠 1. Neural Networks Learn Through Parameters (θ)

  • A neural network is just a stack of mathematical operations using parameters (called θ, theta).
  • These parameters = the weights that are learned and adjusted during training.
  • The goal is to tweak θ so your predictions (ŷ) get closer to the actual labels (y).
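
To make θ concrete, here's a minimal sketch (assuming PyTorch, since these notes later use .backward() and SGD) that lists the learnable parameters of a tiny made-up classifier:

```python
import torch.nn as nn

# Hypothetical tiny classifier: 100 input features -> 3 document classes.
model = nn.Linear(100, 3)

# Everything printed here is part of θ: a weight matrix and a bias vector
# that training will nudge toward better predictions.
for name, param in model.named_parameters():
    print(name, tuple(param.shape))  # weight (3, 100), bias (3,)
```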

📉 2. What Is a Loss Function? (Hint: It Measures Mistakes)

  • A loss function measures how far off the model is from the correct answer.

  • Think of it like this:

    → High loss = bad predictions 😖

    → Low loss = model doing well 😎

  • We don't manually teach the model what's wrong; the loss function tells it where it messed up.


🎯 3. Enter Cross-Entropy Loss

  • Used for classification tasks, especially when you want the model to pick between multiple categories.
  • Based on comparing:
    • True distribution (y): The correct class (e.g., "sports")
    • Predicted distribution (ŷ): The probabilities the model assigns to each class after softmax

🔸 How it works:

  1. Your model spits out logits (raw scores for each class).
  2. Apply softmax: turns logits into a probability distribution (all values between 0 and 1, summing to 1).
  3. Cross-entropy loss measures how well your predicted distribution matches the correct class.
  4. It punishes confident wrong answers more than unsure ones.

📌 Formula-wise (simplified):

Cross-Entropy = -log(P(correct class))

If the model is 90% sure the answer is correct: low loss.

If it's 10% sure or confident in the wrong class: high loss.
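
Here's a small sketch of that formula in PyTorch (hypothetical logits and a made-up 3-class setup), showing that -log(P(correct class)) matches the built-in cross-entropy loss:

```python
import torch
import torch.nn.functional as F

# Made-up logits for one document over 3 classes: [sports, politics, tech]
logits = torch.tensor([[2.0, 0.1, -1.0]])
target = torch.tensor([0])  # the true class is "sports"

probs = F.softmax(logits, dim=-1)                # logits -> probabilities
manual_loss = -torch.log(probs[0, target[0]])    # -log(P(correct class))
builtin_loss = F.cross_entropy(logits, target)   # softmax + -log in one call

print(manual_loss.item(), builtin_loss.item())   # the two values match
```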


📚 4. Monte Carlo Sampling

  • Fancy phrase for: "When we don't know the full distribution, just average over examples."
  • It's how we approximate the "true" loss across a batch of training samples.
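
As a sketch (assuming PyTorch and a made-up batch), the "Monte Carlo" part is just the mean over per-example losses:

```python
import torch
import torch.nn.functional as F

# Made-up batch of 4 documents over 3 classes.
logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])

# Per-example losses, then the batch mean: a Monte Carlo estimate of the
# expected loss under the (unknown) true data distribution.
per_example = F.cross_entropy(logits, labels, reduction="none")
estimate = per_example.mean()  # what the default reduction="mean" computes
```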

๐Ÿ› ๏ธ 5. Optimization: How the Model Learnsโ€‹

The way we minimize the loss is through:

๐Ÿ” Gradient Descentโ€‹

  • Iteratively update parameters to reduce loss:
ฮธ โ† ฮธ - ฮท * โˆ‡Loss

  • θ = current weights
  • η = learning rate (how big a step to take)
  • ∇Loss = gradient (slope of the loss function)
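
A bare-bones sketch of that update rule in PyTorch (toy parameter and loss, purely illustrative):

```python
import torch

theta = torch.tensor([1.0, -2.0], requires_grad=True)  # θ: current weights
eta = 0.1                                               # η: learning rate

loss = (theta ** 2).sum()   # stand-in loss with its minimum at θ = 0
loss.backward()             # fills theta.grad with ∇Loss

with torch.no_grad():
    theta -= eta * theta.grad   # θ ← θ - η * ∇Loss
    theta.grad.zero_()          # clear the gradient before the next step
```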

✅ Steps in Practice:

  1. Forward pass: Run inputs through the model → get predictions → compute loss.
  2. Backward pass: Calculate gradients using .backward().
  3. Update parameters: Use an optimizer like SGD to move ฮธ in the right direction.
  4. Repeat.
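
Put together, a minimal training-loop sketch might look like this (hypothetical tiny model and fake data standing in for a real DataLoader):

```python
import torch
import torch.nn as nn

model = nn.Linear(100, 3)                 # made-up classifier: 100 features -> 3 classes
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(8, 100)              # fake batch of 8 documents
labels = torch.randint(0, 3, (8,))

for step in range(10):
    optimizer.zero_grad()                 # clear old gradients
    logits = model(inputs)                # 1. forward pass
    loss = loss_fn(logits, labels)        #    ...and compute the loss
    loss.backward()                       # 2. backward pass: compute gradients
    optimizer.step()                      # 3. update parameters (θ ← θ - η * ∇Loss)
```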

🧮 6. Logits → Softmax → Argmax

  • Logits = raw model output per class
  • Softmax = converts logits to probabilities
  • Argmax = picks the class with the highest probability
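
In code, the chain looks like this (a small sketch with made-up logits):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1.5, 0.3, -0.7]])   # raw scores for 3 classes
probs = F.softmax(logits, dim=-1)           # probabilities that sum to 1
pred = torch.argmax(probs, dim=-1)          # index of the most likely class
# Note: argmax over the raw logits gives the same answer, since softmax preserves order.
```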

🔄 7. Learning Rate Schedulers & Gradient Clipping

  • Scheduler: Adjusts (usually reduces) the learning rate as training progresses, e.g., after each epoch, to fine-tune learning.
  • Gradient clipping: Prevents gradients from exploding (i.e., becoming very large values that destabilize training).
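
A sketch of both ideas in PyTorch (reusing the made-up model from above; StepLR and clip_grad_norm_ are one common choice each, not the only options):

```python
import torch
import torch.nn as nn

model = nn.Linear(100, 3)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Multiply the learning rate by 0.9 after every epoch.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)

for epoch in range(3):
    inputs, labels = torch.randn(8, 100), torch.randint(0, 3, (8,))
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    # Cap the overall gradient norm so one bad batch can't blow up the weights.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()   # learning rate: 0.1 -> 0.09 -> 0.081 -> ...
```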

🧪 8. Train / Validation / Test Sets

  • Train: Used to learn parameters
  • Validation: Used to tune hyperparameters (like learning rate, # of neurons, etc.)
  • Test: Final evaluation to check real-world performance
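
One way to sketch the split (hypothetical 1,000-document dataset, 80/10/10):

```python
import torch
from torch.utils.data import TensorDataset, random_split

features = torch.randn(1000, 100)         # made-up document feature vectors
labels = torch.randint(0, 3, (1000,))
dataset = TensorDataset(features, labels)

# Train: learn θ.  Validation: tune hyperparameters.  Test: final check.
train_set, val_set, test_set = random_split(dataset, [800, 100, 100])
```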

✅ Recap – Bitty Style

Concept → What It Means in Simple Terms

  • θ (parameters): The knobs the model turns to improve itself
  • Loss function: Tells the model how bad it did
  • Cross-entropy: Measures how far off the prediction is from the truth
  • Softmax: Turns scores into probabilities
  • Argmax: Picks the class with the highest probability
  • Gradient descent: The step-by-step update method to reduce errors
  • Optimizer (SGD): Applies those updates in practice
  • Learning rate: Controls how fast you update the weights
  • Gradient clipping: Keeps things stable when learning gets wild
  • Train/Val/Test: Each has its job in helping the model grow & generalize
  • Monte Carlo sampling: Averaging over samples to estimate things we can't measure exactly