
🧠 Activation Functions & Derivatives

Activation functions are what make neural networks more than just glorified linear regressions. They add non-linearity, which gives neural networks their superpower: the ability to learn complex patterns and behaviors. Their derivatives are critical for training the network effectively through backpropagation.


๐Ÿ” Why Do We Use Activation Functions?โ€‹

Neurons in a network receive inputs and combine them using weighted sums. But if we just kept doing weighted sums from layer to layer, the whole network would collapse into one big linear transformation. That's boring!

Activation functions inject non-linearity into the mix, letting networks:

  • Learn curves and twists
  • Make more nuanced decisions
  • Model reality more effectively

๐Ÿ” Common Activation Functionsโ€‹

Let's break them down with intuitive visuals, math, and practical insights.


1. 🌀 Sigmoid Function

  • Formula: σ(x) = 1 / (1 + e^(−x))
  • Output range: (0, 1)

It looks like an S-curve, softly squashing large negative or positive values toward 0 or 1.

✅ Use Case:

Perfect for binary classification, where the output behaves like a probability: "how likely is it a cat?"

โŒ Limitations:โ€‹

  • Vanishing gradient problem: at extreme values, the slope becomes nearly 0, so the network stops learning efficiently.
  • Not zero-centered: everything is always positive, so updates can zigzag inefficiently.

🧮 Derivative:

  • σ′(x) = σ(x)(1 − σ(x))
  • Max slope is at x = 0, where σ(0) = 0.5, and slope = 0.25
  • At x → −∞ or x → +∞, slope ≈ 0 → slow learning
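
To make those numbers concrete, here is a minimal NumPy sketch (function names are my own) that evaluates the sigmoid and its derivative at a few points. It confirms the peak slope of 0.25 at x = 0 and the near-zero slope at the extremes.

```python
import numpy as np

def sigmoid(x):
    # Squash any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)); peaks at 0.25 when x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, 0.0, 10.0])
print(sigmoid(x))             # [~0.00005, 0.5, ~0.99995]
print(sigmoid_derivative(x))  # [~0.00005, 0.25, ~0.00005] -> nearly flat at the extremes
```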

2. 🧿 Tanh Function

  • Formula: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
  • Output range: (-1, 1)

Similar shape to sigmoid, but centered around 0, so outputs can be negative too.

✅ Use Case:

Hidden layers in older architectures, especially when zero-centering helps optimization.

โŒ Limitations:โ€‹

  • Still suffers from vanishing gradients at the extremes.

🧮 Derivative:

  • tanh′(x) = 1 − tanh²(x)
  • Max slope = 1 at x = 0 (since tanh(0) = 0, the derivative there is 1 − 0² = 1)
  • Still fades at extremes, but performs better than sigmoid.
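
Same idea for tanh, again as a small NumPy sketch with illustrative names: the derivative peaks at 1 (instead of 0.25) at x = 0, but still collapses toward 0 far from the origin.

```python
import numpy as np

def tanh_derivative(x):
    # tanh'(x) = 1 - tanh(x)^2; equals 1 at x = 0, fades toward 0 at the extremes
    return 1.0 - np.tanh(x) ** 2

x = np.array([-5.0, 0.0, 5.0])
print(np.tanh(x))          # [-0.9999, 0.0, 0.9999] -> zero-centered outputs
print(tanh_derivative(x))  # [~0.0002, 1.0, ~0.0002] -> higher peak than sigmoid, still vanishes
```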

3. ⚡ ReLU (Rectified Linear Unit)

  • Formula: f(x) = max(0, x)
  • Output range: [0, ∞)

ReLU is the tough-love coach of activation functions: if you're below 0, it gives you nothing. If you're above 0, it lets you grow.

✅ Use Case:

Most popular choice for hidden layers in modern neural nets (especially in deep learning).

โŒ Limitations:โ€‹

  • Dying ReLU problem: once a neuron gets stuck with negative inputs, it may stop updating.

🧮 Derivative:

f′(x) = 1 if x > 0, and 0 otherwise. Fast and simple, but non-zero only when x > 0.
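
A quick NumPy sketch (names are illustrative) of ReLU and its derivative; note the common convention of treating the slope at exactly x = 0 as 0.

```python
import numpy as np

def relu(x):
    # Pass positive values through, clip negatives to 0
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 1 where x > 0, 0 elsewhere (the kink at x = 0 is conventionally assigned 0)
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))             # [0. 0. 3.]
print(relu_derivative(x))  # [0. 0. 1.]
```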


4. 💧 Leaky ReLU

  • Formula: f(x) = x if x > 0, otherwise f(x) = αx (with a small slope α, typically 0.01)

A soft ReLU variant that lets a small gradient flow when x < 0.

✅ Use Case:

Prevents "dead neurons" in ReLU networks. Great if you want stability but donโ€™t want to sacrifice performance.

🧮 Derivative:

f′(x) = 1 if x > 0, and α otherwise. Always has a small slope, so it never completely shuts down.
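
A minimal sketch of Leaky ReLU, assuming the common choice α = 0.01 for the negative slope (the exact value is a tunable hyperparameter, not fixed by the definition):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but negative inputs are scaled by alpha instead of zeroed out
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    # 1 for x > 0, alpha otherwise -- never exactly 0, so the neuron keeps receiving gradient
    return np.where(x > 0, 1.0, alpha)

x = np.array([-4.0, 0.0, 2.0])
print(leaky_relu(x))             # [-0.04  0.    2.  ]
print(leaky_relu_derivative(x))  # [0.01  0.01  1.  ]
```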


5. ๐Ÿ” Linear Activationโ€‹

  • Formula: f(x) = x
  • Output range: (−∞, ∞)

✅ Use Case:

Only for output layers in regression problems (e.g., predicting house prices).

โŒ Donโ€™t Use In Hidden Layers:โ€‹

If you stack layers with no activation, it's just matrix multiplication → it collapses into one linear transformation:

W2 · (W1 · x) = (W2 · W1) · x = W · x

So your entire network becomes as dumb as a linear regression. No thanks.
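
You can verify the collapse numerically. A small sketch with randomly chosen weight matrices (the shapes are arbitrary) shows that two stacked linear layers compute exactly the same function as a single one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation in between: y = W2 @ (W1 @ x)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=(3,))

two_layers = W2 @ (W1 @ x)

# ...is exactly one linear layer with combined weight matrix W = W2 @ W1
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True -> the extra depth bought us nothing
```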


๐Ÿ” Why Derivatives Matter (Backpropagation)โ€‹

Backpropagation uses derivatives to figure out how much each weight contributed to the error, and in which direction to correct it. That's how the network learns!

  • A steep slope → big correction
  • A flat slope → tiny or no learning

That's why ReLU and Leaky ReLU are preferred: their slope stays at 1 for positive inputs, so the gradient doesn't shrink away.
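
To see why a flat slope stalls learning, here is a toy, hand-rolled weight-update sketch (all names and values are illustrative, not from any particular framework). The activation's local slope multiplies into the gradient, so a near-zero slope shrinks the update to almost nothing:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.1              # learning rate
upstream_grad = 1.0   # gradient arriving from the loss (illustrative value)
inp = 1.0             # input feeding this weight

for pre_activation in (0.0, 10.0):
    s = sigmoid(pre_activation)
    local_slope = s * (1.0 - s)              # sigmoid derivative at this point
    weight_update = lr * upstream_grad * local_slope * inp
    print(pre_activation, weight_update)     # 0.0 -> 0.025 (learns), 10.0 -> ~0.0000045 (almost frozen)
```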


โš–๏ธ Summary Table: Activation Functions & Their Derivativesโ€‹

| Function | Output Range | Derivative | Zero-Centered | Vanishing Gradient | Notes |
| --- | --- | --- | --- | --- | --- |
| Sigmoid | (0, 1) | σ(x)(1 − σ(x)) | ❌ No | ✅ Yes | Good for binary output, slow to train |
| Tanh | (−1, 1) | 1 − tanh²(x) | ✅ Yes | ✅ Yes (less) | Better gradient flow than sigmoid |
| ReLU | [0, ∞) | 0 if x < 0, 1 if x > 0 | ❌ No | ❌ No (but can die) | Fast, most commonly used |
| Leaky ReLU | (−∞, ∞) | 0.01 if x < 0, 1 if x > 0 | ❌ No | ❌ No | Solves the dying ReLU problem |
| Linear (identity) | (−∞, ∞) | 1 | ✅ Yes | ❌ No | Only use for regression output layer |

🧠 TL;DR:

  • Activation functions add non-linearity; without them, you can't learn complex patterns.
  • Their derivatives are what fuel learning during backpropagation.
  • Each function behaves differently:
    • Sigmoid: slow, but probabilistic
    • Tanh: better gradient flow
    • ReLU: fast, sparse, simple
    • Leaky ReLU: robust against dead neurons
    • Linear: useful in output layer for continuous prediction

👉 Think of it like this:

> "Use ReLU to think, use linear to speak." ๐Ÿง ๐Ÿ—ฃ๏ธ >