Activation Functions & Derivatives
Activation functions are what make neural networks more than just glorified linear regressions. They add non-linearity, which gives neural networks their superpower: the ability to learn complex patterns and behaviors. Their derivatives are critical for training the network effectively through backpropagation.
Why Do We Use Activation Functions?
Neurons in a network receive inputs and combine them using weighted sums. But if we just kept doing weighted sums from layer to layer, the whole network would collapse into one big linear transformation. That's boring!
Activation functions inject non-linearity into the mix, letting networks:
- Learn curves and twists
- Make more nuanced decisions
- Model reality more effectively
Common Activation Functions
Let's break them down with intuition, math, and practical insights.
1. Sigmoid Function
- Formula: $\sigma(x) = \frac{1}{1 + e^{-x}}$
- Output range: (0, 1)
It looks like an S-curve, softly squashing large negative or positive values toward 0 or 1.
Use Case:
Perfect for binary classification, where the output behaves like a probability: "how likely is it a cat?"
Limitations:
- Vanishing gradient problem: at extreme values, the slope becomes nearly 0, so the network stops learning efficiently.
- Not zero-centered: outputs are always positive, so weight updates can zigzag inefficiently.
Derivative:
- $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$
- Max slope is at $x = 0$, where $\sigma(0) = 0.5$ and the slope $= 0.25$
- At $x \gg 0$ or $x \ll 0$, the slope is nearly 0, so learning is slow
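To make this concrete, here is a minimal NumPy sketch (not from the original article; the helper names are my own) of the sigmoid and its derivative:

```python
import numpy as np

def sigmoid(x):
    # Squashes any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)); peaks at 0.25 when x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_derivative(0.0))   # 0.25 -> the steepest point
print(sigmoid_derivative(10.0))  # ~4.5e-05 -> vanishing gradient territory
```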
2. Tanh Function
- Formula: $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
- Output range: (-1, 1)
Similar shape to sigmoid, but centered around 0, so outputs can be negative too.
Use Case:
Hidden layers in older architectures, especially when zero-centering helps optimization.
Limitations:
- Still suffers from vanishing gradients at the extremes.
Derivative:
- $\tanh'(x) = 1 - \tanh^2(x)$
- Max slope $= 1$ at $x = 0$
- At $x \gg 0$ or $x \ll 0$, $\tanh(x) \approx \pm 1$ and the slope is nearly 0
- Still fades at the extremes, but its peak gradient (1 vs. sigmoid's 0.25) gives better gradient flow than sigmoid.
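A similar sketch for tanh, leaning on NumPy's built-in np.tanh (the derivative helper name is my own):

```python
import numpy as np

def tanh_derivative(x):
    # tanh'(x) = 1 - tanh(x)^2; peaks at 1 when x = 0.
    return 1.0 - np.tanh(x) ** 2

print(np.tanh(0.0), tanh_derivative(0.0))  # 0.0 1.0 -> strongest gradient at the center
print(np.tanh(3.0), tanh_derivative(3.0))  # ~0.995 ~0.0099 -> gradient fading at the extreme
```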
3. ReLU (Rectified Linear Unit)
- Formula: $\text{ReLU}(x) = \max(0, x)$
- Output range: [0, ∞)
ReLU is the tough-love coach of activation functions: if you're below 0, it gives you nothing; if you're above 0, it lets you grow.
Use Case:
The most popular choice for hidden layers in modern neural nets (especially in deep learning).
Limitations:
- Dying ReLU problem: if a neuron's inputs stay negative, its gradient is 0 and it may stop updating altogether.
Derivative:
- $\text{ReLU}'(x) = 1$ for $x > 0$, $0$ for $x < 0$ (the kink at $x = 0$ is usually treated as 0)
Fast and simple, but non-zero only when $x > 0$.
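Here is one way to sketch ReLU and the derivative used in practice (assuming the common convention of a 0 slope at x = 0; the names are mine):

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x): negatives are clipped to 0, positives pass through.
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 1 where x > 0, else 0 (the kink at x = 0 is conventionally given slope 0).
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))             # [0. 0. 3.]
print(relu_derivative(x))  # [0. 0. 1.]
```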
4. Leaky ReLU
- Formula: $\text{LeakyReLU}(x) = x$ if $x > 0$, otherwise $\alpha x$ (with a small $\alpha$, commonly 0.01)
A softer ReLU variant that lets a small gradient flow when $x < 0$.
Use Case:
Prevents "dead neurons" in ReLU networks. Great if you want stability but don't want to sacrifice performance.
Derivative:
- $\text{LeakyReLU}'(x) = 1$ for $x > 0$, $\alpha$ (e.g. 0.01) for $x < 0$
Always has at least a small slope, so it never completely shuts down.
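A quick sketch using the commonly chosen slope of 0.01 for negative inputs (an illustrative default, matching the summary table below):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # x for positive inputs, alpha * x for negative ones.
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    # Slope is 1 for positive inputs and alpha for negative ones, so it is never exactly 0.
    return np.where(x > 0, 1.0, alpha)

x = np.array([-5.0, 2.0])
print(leaky_relu(x))             # [-0.05  2.  ]
print(leaky_relu_derivative(x))  # [0.01 1.  ]
```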
5. Linear Activation
- Formula: $f(x) = x$
- Output range: (-∞, ∞)
Use Case:
Only for output layers in regression problems (e.g., predicting house prices).
Don't Use In Hidden Layers:
If you stack layers with no activation, it's just matrix multiplication, and the whole stack collapses into one linear transformation:
$W_2(W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2)$
So your entire network becomes as dumb as a linear regression. No thanks.
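To see the collapse numerically, here is a small made-up check that two stacked bias-free linear layers equal a single layer whose weight matrix is the product of the two:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3,))      # a random input vector
W1 = rng.normal(size=(4, 3))   # "hidden" layer weights, no activation
W2 = rng.normal(size=(2, 4))   # output layer weights

two_layers = W2 @ (W1 @ x)     # pass through both layers
one_layer = (W2 @ W1) @ x      # a single equivalent linear layer

print(np.allclose(two_layers, one_layer))  # True: the stack collapsed into one transformation
```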
Why Derivatives Matter (Backpropagation)
Backpropagation uses derivatives to figure out how much each weight contributed to the error. That's how the network learns!
- A steep slope → a big correction
- A flat slope → tiny or no learning
That's why ReLU and Leaky ReLU are preferred: they don't suffer from tiny derivatives.
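As a rough, hypothetical illustration of why flat slopes hurt: the chain rule multiplies local slopes layer by layer, so a small slope repeated across layers shrinks the gradient toward nothing, while ReLU's slope of 1 preserves it.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Local slopes at a moderately large pre-activation value, x = 2.5.
x = 2.5
sigmoid_slope = sigmoid(x) * (1.0 - sigmoid(x))  # ~0.07
relu_slope = 1.0 if x > 0 else 0.0               # 1.0

# The chain rule multiplies these slopes across layers; here, 10 layers deep.
print(sigmoid_slope ** 10)  # ~2.8e-12 -> effectively vanished
print(relu_slope ** 10)     # 1.0      -> the signal survives
```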
Summary Table: Activation Functions & Their Derivatives
Function | Output Range | Derivative | Zero-Centered | Vanishing Gradient | Notes
---|---|---|---|---|---
Sigmoid | (0, 1) | $\sigma(x)(1 - \sigma(x))$ | No | Yes | Good for binary output, slow to train
Tanh | (-1, 1) | $1 - \tanh^2(x)$ | Yes | Yes (less) | Better gradient flow than sigmoid
ReLU | [0, ∞) | 0 if $x < 0$, 1 if $x > 0$ | No | No (but can die) | Fast, most commonly used
Leaky ReLU | (-∞, ∞) | $\alpha$ (e.g. 0.01) if $x < 0$, 1 if $x > 0$ | No | No | Solves the dying-ReLU problem
Linear (identity) | (-∞, ∞) | 1 | Yes | No | Only use for the regression output layer
TL;DR:
- Activation functions add non-linearity; without them, you can't learn complex patterns.
- Their derivatives are what fuel learning during backpropagation.
- Each function behaves differently:
  - Sigmoid: slow, but probabilistic
  - Tanh: better gradient flow
  - ReLU: fast, sparse, simple
  - Leaky ReLU: robust against dead neurons
  - Linear: useful in the output layer for continuous prediction
Think of it like this:
> "Use ReLU to think, use linear to speak."