Activation Functions for Humans: Sigmoid, Tanh, and ReLU (Deep Dive)
What Are Activation Functions?
Activation functions are what make neural networks non-linear and capable of learning complex patterns. They control whether a neuron "fires" and how strong its output signal is, based on the weighted input.
In simpler terms:
> Think of them like dimmer switches for your brain cells, deciding how much light (signal) should come through.
In technical terms: given an input value $z$, the activation function $f$ transforms it to produce the neuron's output $a = f(z)$.
What Role Do They Play in Training?
Activation functions are applied during forward propagation, after computing the weighted input:

$$z = w \cdot x + b$$

Then:

$$a = f(z)$$

where $f$ is the activation function.
They allow the network to:
- Learn complex patterns
- Avoid collapsing into just a linear regression model
- Enable gradient descent to work during backpropagation
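To make the forward pass above concrete, here is a minimal NumPy sketch of a single neuron. The input values, weights, and bias are made up for illustration, and sigmoid stands in for whatever activation $f$ you choose.

```python
import numpy as np

def sigmoid(z):
    # Squash the pre-activation into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical single neuron with three inputs (values chosen arbitrarily)
x = np.array([0.5, -1.2, 2.0])   # inputs
w = np.array([0.4, 0.1, -0.3])   # weights
b = 0.2                          # bias

z = np.dot(w, x) + b   # weighted input: z = w . x + b
a = sigmoid(z)         # neuron output: a = f(z)
print(z, a)            # -0.32, then roughly 0.42
```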
Refresher: What is Gradient Descent?
Gradient descent is how neural networks learn. Imagine you're walking downhill in fog, trying to find the lowest point (lowest error).
Steps:
- Compute how steep the slope is (gradient)
- Take a small step downhill
- Repeat
If the slope (gradient) is near zero, your steps get smaller and smaller; that's the vanishing gradient problem.
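As a rough illustration (not an example from these notes), here is gradient descent on a toy one-dimensional loss; the loss function, starting point, and learning rate are all invented for the sketch.

```python
# Toy loss L(w) = (w - 3)^2, whose minimum sits at w = 3
def loss(w):
    return (w - 3) ** 2

def grad(w):
    return 2 * (w - 3)   # slope (gradient) of the loss at w

w = 0.0     # start somewhere on the hill
lr = 0.1    # learning rate: how big each downhill step is
for step in range(50):
    w -= lr * grad(w)   # take a small step downhill

print(w, loss(w))   # w ends up very close to 3, the lowest point
```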
Comparing Activation Functions
1. Sigmoid Function
- Output: 0 to 1
- Looks like an S-curve
- Used often in the output layer for binary classification
Pros:
- Gives probability-like outputs
- Intuitive interpretation
Cons:
- Not zero-centered, so gradients can zigzag during training
- Vanishing gradient for very large or very small values of $z$
Veer Notes: "If we move the weight to the right, the slope increases, but the change is so minimal in extremes that it becomes bad for lots of calculations."
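A quick numerical sketch of the sigmoid and its gradient makes the flat-extremes point visible; the sample inputs below are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)   # derivative of the sigmoid

for z in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(f"z={z:6.1f}  sigmoid={sigmoid(z):.5f}  grad={sigmoid_grad(z):.5f}")
# The gradient peaks at 0.25 around z = 0 and is nearly 0 at z = +/-10:
# that flatness at the extremes is the vanishing gradient problem.
```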
2. Tanh Function (Hyperbolic Tangent)
- Output: -1 to 1
- Also an S-curve
- Often used in hidden layers
Pros:
- Zero-centered, so learning is faster
- Stronger gradients than sigmoid
Cons:
- Still suffers from vanishing gradients (though less severely than sigmoid)
Veer Notes: "Since it goes from -1 to 1 we get a more detailed view of success vs. failure. It's useful in hidden layers."
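For comparison, a small sketch (again with arbitrary inputs) showing that tanh is zero-centered and has a larger peak gradient than sigmoid, while still flattening out for large inputs.

```python
import numpy as np

for z in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    # tanh output lies in (-1, 1) and is symmetric around 0
    print(f"z={z:5.1f}  tanh={np.tanh(z):7.4f}  grad={1 - np.tanh(z)**2:.4f}")
# The gradient 1 - tanh(z)^2 is 1.0 at z = 0 (vs. 0.25 for sigmoid),
# but it still shrinks toward 0 as |z| grows.
```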
3. ReLU (Rectified Linear Unit)
- Output: 0 to ∞
- Very simple: zero if input is negative, identity if positive
Pros:
- Fast to compute
- Doesn't squash gradients for positive inputs, so it avoids the vanishing gradient problem
- Works well in deep networks
Cons:
- Dying ReLU problem: if $z < 0$ too often, the neuron outputs 0 forever (its gradient is also 0, so it stops updating)
Veer Notes: "It doesn't respond when results are negative, but if they're positive, it just keeps going. That's why it's strong, but it can die."
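A minimal ReLU sketch with made-up inputs, showing that negative inputs produce both zero output and zero gradient, which is what lets a neuron "die".

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)   # 1 for positive inputs, 0 otherwise

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))        # [0.  0.  0.  0.5 3. ]
print(relu_grad(z))   # [0. 0. 0. 1. 1.]  <- zero gradient whenever z <= 0,
                      #    so a neuron stuck in that region stops learning
```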
4. Leaky ReLU
- Same as ReLU, but negative inputs get a small slope instead of flat zero
Pros:
- Fixes dying ReLU
- Lets the gradient flow even for negative $z$
Veer Notes: "So that's why you use Leaky ReLU: it keeps neurons alive that would otherwise die when they get negative inputs."
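A sketch of Leaky ReLU with an assumed slope of 0.01 for negative inputs (a common choice, but not specified in the notes above).

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # alpha is the small slope applied to negative inputs
    return np.where(z > 0, z, alpha * z)

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(leaky_relu(z))   # [-0.03  -0.005  0.5  3.]
# Negative inputs still produce a small, non-zero output (and gradient alpha),
# so the neuron keeps receiving updates instead of dying.
```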
Summary Table
| Function   | Output Range | Zero-Centered | Vanishing Gradients? | Best Used For                         |
|------------|--------------|---------------|----------------------|---------------------------------------|
| Sigmoid    | (0, 1)       | No            | Yes                  | Output layer (binary classification)  |
| Tanh       | (-1, 1)      | Yes           | Sometimes            | Hidden layers (classic networks)      |
| ReLU       | [0, ∞)       | No            | Rarely               | Hidden layers (modern deep nets)      |
| Leaky ReLU | (-∞, ∞)      | No            | No                   | Deeper nets with risk of dying ReLUs  |
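To back up the "Vanishing Gradients?" column, here is a rough, computed-on-the-spot comparison of gradient magnitudes at a moderate and a large input; the values are illustrative, not measured from any trained network.

```python
import numpy as np

for z in [0.5, 6.0]:
    s = 1.0 / (1.0 + np.exp(-z))
    print(f"z = {z}")
    print("  sigmoid grad:", s * (1 - s))            # ~0.24 at 0.5, ~0.002 at 6
    print("  tanh grad:   ", 1 - np.tanh(z) ** 2)    # ~0.79 at 0.5, ~2e-5 at 6
    print("  relu grad:   ", 1.0 if z > 0 else 0.0)  # stays 1 for any positive z
    print("  leaky grad:  ", 1.0 if z > 0 else 0.01) # stays 1 for any positive z
```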
Final Analogy (Veer-Style)
- Sigmoid = Like deciding how much you agree (0 = no, 1 = yes)
- Tanh = Like saying how strongly you agree or disagree (-1 = no way, +1 = absolutely)
- ReLU = Like only listening to good news. Bad input? Silent. Good input? Amplify it.
- Leaky ReLU = Like having a tiny backup mic for when the good news guy gets too quiet.
Your Takeaway
> All activation functions help train the model by shaping how information flows.
>
> Sigmoid is great for binary answers. Tanh adds balance for inner reasoning. ReLU says "yes loudly" or "nothing at all," and Leaky ReLU is its backup plan.
>
> The better the activation, the more efficient your learning becomes, just like switching from a spoon to a shovel when digging your ideas deeper.