Weight Initialization in Neural Networks
You've learned how to do forward propagation, backpropagation, and gradient descent. But now comes a crucial design decision: how do we initialize the weights?
What Happens If You Initialize All Weights to Zero?
Summary:
If all weights in a layer are initialized to 0, all neurons in that layer will compute the exact same output. As a result, during training:
- The gradients for each neuron will be the same.
- All neurons will update identically.
- There will be no diversity, and they will learn the same features.
This is known as the symmetry problem, and it's bad because it defeats the purpose of having multiple neurons.
Why This Happens:
Neurons are designed to specialize. If you initialize them the same way and give them the same inputs, they'll just echo each other.
Result:
The model fails to learn complex representations: with every hidden neuron computing the same function, each layer effectively has just one neuron, so the network is barely more expressive than a linear model.
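To see this concretely, here is a minimal sketch (the data, layer sizes, learning rate, and loss are made-up choices for illustration): a tiny 3-4-1 network with a sigmoid hidden layer is trained by plain gradient descent from all-zero weights, and every row of `W1` stays identical the whole time.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
X = np.random.randn(3, 5)                       # 3 input features, 5 examples (made-up data)
Y = np.random.randn(1, 5)                       # made-up regression targets

W1, b1 = np.zeros((4, 3)), np.zeros((4, 1))     # hidden layer: 4 neurons, all weights zero
W2, b2 = np.zeros((1, 4)), np.zeros((1, 1))     # output layer
lr = 0.1

for _ in range(300):                            # plain gradient descent on mean squared error
    A1 = sigmoid(W1 @ X + b1)                   # all 4 rows of A1 are identical
    A2 = W2 @ A1 + b2
    dZ2 = (A2 - Y) / X.shape[1]
    dW2, db2 = dZ2 @ A1.T, dZ2.sum(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)          # sigmoid derivative
    dW1, db1 = dZ1 @ X.T, dZ1.sum(axis=1, keepdims=True)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(W1)                                       # nonzero now, but every row is the same
print(np.allclose(W1, W1[0]))                   # True -- symmetry was never broken
```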
Random Initialization to the Rescue!
What's Different:
Instead of setting weights to 0, we randomly initialize them with small values, for example by drawing from a standard normal distribution and multiplying by 0.01:
```python
import numpy as np

W1 = np.random.randn(layer_dims[1], layer_dims[0]) * 0.01  # layer_dims[l] = number of units in layer l
```
This ensures:
- Each neuron starts with different weights
- The model breaks symmetry
- Neurons can learn different features
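To apply the same recipe to every layer of a deeper network, a small helper along the lines of the sketch below can build the whole parameter set; the function name, the `params` dictionary layout, and the `seed` argument are illustrative choices, not a fixed API.

```python
import numpy as np

def initialize_parameters_random(layer_dims, scale=0.01, seed=None):
    """Small random weights and zero biases for every layer.

    layer_dims: layer sizes, e.g. [n_x, n_h1, n_h2, n_y].
    """
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        # Different random values per neuron break the symmetry; the small
        # scale keeps sigmoid/tanh units out of their flat (saturated) regions.
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * scale
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))   # biases can safely start at zero
    return params

params = initialize_parameters_random([3, 4, 1], seed=0)
print(params["W1"].shape, params["W2"].shape)            # (4, 3) (1, 4)
```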
Why Small Values?
If weights are too large:
- Activations like `sigmoid` and `tanh` saturate (flatten out)
- Gradients vanish, so learning slows down
If weights are too small:
- It might learn slowly, but at least it learns safely (the sketch below compares the two weight scales)
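A quick way to see the large-weight failure mode is to compare `tanh` activations under a large and a small weight scale. With the large scale, most pre-activations land in the flat tails of `tanh`, where the local derivative 1 - tanh²(z) is close to zero, which is exactly where gradients vanish. The layer sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 1000))             # 500 input features, 1000 examples

for scale in (1.0, 0.01):
    W = rng.standard_normal((100, 500)) * scale  # a layer of 100 neurons
    A = np.tanh(W @ X)                           # activations
    local_grad = 1 - A ** 2                      # derivative of tanh at each pre-activation
    print(f"scale={scale:>5}: mean |activation| = {np.abs(A).mean():.3f}, "
          f"mean local gradient = {local_grad.mean():.3f}")

# Typical output: with scale 1.0 the activations sit near +/-1 and the local
# gradient is close to zero (saturation); with scale 0.01 it stays close to 1.
```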
Bonus: Xavier and He Initialization
Modern deep learning uses smarter strategies:
- Xavier initialization for `tanh`
- He initialization for `ReLU`
These scale the random weights by a factor based on the number of connections into (and, for Xavier, out of) each layer, so that activations and gradients keep a roughly constant variance as they flow through the network.
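In code, both schemes simply change the factor used to scale the random matrix: about sqrt(1/n_in) for Xavier (some variants use sqrt(2/(n_in + n_out))) and sqrt(2/n_in) for He. The helper below is a sketch of that idea; the function name and arguments are illustrative.

```python
import numpy as np

def initialize_layer(n_in, n_out, method="he", seed=None):
    """Return a weight matrix of shape (n_out, n_in) scaled for stable gradient flow."""
    rng = np.random.default_rng(seed)
    if method == "xavier":               # suited to tanh / sigmoid layers
        scale = np.sqrt(1.0 / n_in)
    elif method == "he":                 # suited to ReLU layers
        scale = np.sqrt(2.0 / n_in)
    else:
        raise ValueError(f"unknown method: {method}")
    return rng.standard_normal((n_out, n_in)) * scale

W_tanh = initialize_layer(500, 100, method="xavier", seed=0)
W_relu = initialize_layer(500, 100, method="he", seed=0)
print(W_tanh.std(), W_relu.std())        # roughly sqrt(1/500) ~ 0.045 and sqrt(2/500) ~ 0.063
```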
Slide Visual Recap
Slide 1: Zero Initialization
- All neurons receive same input → same output → same gradient
- All weights update identically → no symmetry breaking
Slide 2: Random Initialization
- Weights are random and small
- Each neuron starts unique
- Diverse features can be learned
- Model trains effectively
TL;DR
| Initialization | Symmetry Broken? | Risk of Vanishing Gradient | Learning Outcome |
| --- | --- | --- | --- |
| Zeros | No | No | All neurons same |
| Random Large | Yes | Yes | Unstable training |
| Random Small | Yes | Safer | Stable training |
Codex Rule of Thumb:
> "Don't let neurons be clones. Randomize to specialize."