Batch Normalization's Effectiveness
Diving Deep into the Mechanisms
Beyond Internal Covariate Shift: A Deeper Look at Batch Normalization
Batch Normalization (BN) has become an essential technique in deep learning, widely used because it accelerates training and improves model performance. The original 2015 paper, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” attributed BN’s effectiveness primarily to its reduction of Internal Covariate Shift (ICS). Later research, however, has challenged this explanation, suggesting that BN’s real benefits come from other factors.
In this post, we look at the findings of the paper “How does batch normalization help optimization?” (link), which investigates the actual mechanisms by which BN aids optimization. I replicated several of its key experiments, and the results help explain why BN works so well, often contradicting the original ICS explanation.
Replicating the Experiment: My Setup
To test these claims, I set up several experiments using a common deep learning architecture and dataset (a minimal code sketch of this setup follows the list):
- Model: A PyTorch VGG model (configuration A). I compared two versions: one with Batch Normalization after the 4th convolutional layer, and one without.
- Initialization: Xavier uniform initialization for all network weights, so that both variants start from statistically identical weights and the comparison stays fair.
- Loss Function: Cross-Entropy loss, the standard choice for multi-class classification.
- Optimizer: Stochastic Gradient Descent (SGD) was used for training.
- Dataset: CIFAR-10. This is a popular dataset of 32x32 color images across 10 different categories.
- Learning Rate: A fixed learning rate of 0.01 was used throughout training.
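To make the setup concrete, here is a minimal sketch of how the two variants could be built in PyTorch. The helper name build_vgg_a and the exact placement of the single BatchNorm2d (between the 4th convolution and its ReLU) are my own assumptions; details such as data augmentation and batch size may differ from the original runs.

```python
# Minimal sketch of the setup described above. Helper names and the exact
# placement of the single BatchNorm2d are assumptions, not the author's code.
import torch
import torch.nn as nn
from torchvision import datasets, transforms

# VGG configuration A: numbers are conv output channels, "M" is 2x2 max-pooling.
VGG_A = [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"]

def build_vgg_a(with_bn: bool) -> nn.Module:
    layers, in_ch, conv_idx = [], 3, 0
    for v in VGG_A:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
            continue
        layers.append(nn.Conv2d(in_ch, v, kernel_size=3, padding=1))
        conv_idx += 1
        if with_bn and conv_idx == 4:        # single BN, right after the 4th conv
            layers.append(nn.BatchNorm2d(v))
        layers.append(nn.ReLU(inplace=True))
        in_ch = v
    # A 32x32 CIFAR-10 image is reduced to 1x1x512 after the five poolings.
    model = nn.Sequential(*layers, nn.Flatten(), nn.Linear(512, 10))
    # Identical Xavier uniform initialization for both variants.
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.xavier_uniform_(m.weight)
            nn.init.zeros_(m.bias)
    return model

train_set = datasets.CIFAR10("./data", train=True, download=True,
                             transform=transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = build_vgg_a(with_bn=True)            # with_bn=False for the baseline
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # fixed learning rate
```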
The ICS Puzzle: Questioning the Original Explanation
The original Batch Normalization paper proposed that BN works by reducing Internal Covariate Shift (ICS): the change in the distribution of a layer’s inputs as the parameters of the preceding layers are updated during training. My experiments, however, matched the findings of the paper I was replicating and revealed a surprising result:
In my tests, the batch-normalized model almost always exhibited higher internal covariate shift than the model without batch normalization, as the graph below shows:
[Figure: measured internal covariate shift, with vs. without Batch Normalization]
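For reference, one simple way to quantify this kind of shift at a single layer is to track how much that layer’s input statistics (mean and standard deviation) move between consecutive training steps. This is a simplified proxy of my own, not the paper’s exact ICS measure, and the hook-based helper below is illustrative only.

```python
# Simplified proxy for internal covariate shift at one layer: how far the
# mean/std of that layer's inputs drift between consecutive training steps.
# Illustrative only; not the paper's exact ICS definition.
import torch
import torch.nn as nn

class ActivationShiftTracker:
    def __init__(self, layer: nn.Module):
        self.prev_stats = None
        self.shift_history = []
        layer.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        x = inputs[0].detach()
        stats = torch.stack([x.mean(), x.std()])
        if self.prev_stats is not None:
            # L2 distance between the current and previous (mean, std) pair.
            self.shift_history.append(torch.norm(stats - self.prev_stats).item())
        self.prev_stats = stats

# Usage (model from the setup sketch above): attach the tracker to the layer
# whose input distribution you want to watch; the index is hypothetical, so
# inspect print(model) to pick the layer that follows the normalization point.
watched_layer = model[10]
tracker = ActivationShiftTracker(watched_layer)
# After the training loop runs, tracker.shift_history holds one value per step.
```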
Even with this higher ICS, Batch Normalization clearly and significantly improved performance:
[Figure: training performance with and without Batch Normalization]
This large gap strongly suggests that reducing ICS is not the main reason Batch Normalization works so well. So if it isn’t ICS, what actually makes Batch Normalization effective?
The Real Reasons: Better Gradient Prediction and Smoother Loss Landscapes
The paper “How does batch normalization help optimization?” argues that Batch Normalization helps for two main reasons: it makes gradients more predictive, and it produces a more stable, smoother loss landscape.
Better Gradient Prediction
Batch Normalization greatly improves gradient predictiveness: the direction the gradient points now remains a reliable indicator of where to go at subsequent steps. I measured this as the L2 norm of the difference between the old and new gradients. The graph shows that the batch-normalized model’s gradients are clearly more predictive, especially at the start of training:
[Figure: gradient predictiveness (L2 distance between successive gradients), with vs. without Batch Normalization]
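As a rough sketch of this measurement, the gradients from consecutive mini-batch steps can be flattened into vectors and the L2 norm of their difference recorded; small values mean the current gradient remains a good predictor after a parameter update. This simplifies the paper’s protocol, which probes several step lengths along the gradient direction, and the helper below is my own.

```python
# Sketch: L2 norm of the difference between gradients at consecutive SGD steps,
# as a simple proxy for gradient predictiveness. Simplification of the paper's
# protocol; model, criterion, optimizer, train_loader come from the setup sketch.
import torch

def flat_grad(model) -> torch.Tensor:
    """Concatenate all parameter gradients into a single vector."""
    return torch.cat([p.grad.detach().flatten()
                      for p in model.parameters() if p.grad is not None])

prev_grad, predictiveness = None, []
for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    g = flat_grad(model)
    if prev_grad is not None:
        predictiveness.append(torch.norm(g - prev_grad).item())
    prev_grad = g
    optimizer.step()
```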
Because we can trust the gradient direction more, optimizers can take bigger, more effective steps. This helps the model learn faster.
Smoother Loss Surfaces
Another key effect concerns how Batch Normalization reshapes the loss landscape. BN tends to produce a smoother loss surface, with fewer sharp cliffs and flat plateaus, so the loss and its gradients change more gradually as the parameters move. This makes it easier for the optimizer to make steady progress toward a good solution.
[Figure: loss landscape smoothness, with vs. without Batch Normalization]
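One way to probe this, loosely following the paper’s loss-landscape measurement, is to step along the current gradient direction with several step sizes and record how much the loss varies; a smaller spread indicates a smoother landscape. The sketch below is my own simplification and evaluates a single mini-batch.

```python
# Sketch: probe the loss along the (negative) gradient direction at several
# step sizes on one mini-batch. A wide spread between the smallest and largest
# loss suggests a rougher landscape. My simplification of the paper's measure.
import torch

def loss_along_gradient(model, criterion, images, labels, step_sizes):
    params = [p for p in model.parameters() if p.requires_grad]
    model.zero_grad()
    criterion(model(images), labels).backward()
    grads = [p.grad.detach().clone() for p in params]
    originals = [p.detach().clone() for p in params]
    losses = []
    with torch.no_grad():
        for eta in step_sizes:
            for p, g, p0 in zip(params, grads, originals):
                p.copy_(p0 - eta * g)              # move along -gradient
            losses.append(criterion(model(images), labels).item())
        for p, p0 in zip(params, originals):       # restore the original weights
            p.copy_(p0)
    return losses

# Example usage (model/criterion/train_loader from the setup sketch):
# images, labels = next(iter(train_loader))
# losses = loss_along_gradient(model, criterion, images, labels,
#                              step_sizes=[0.01, 0.05, 0.1, 0.25, 0.5])
# spread = max(losses) - min(losses)   # smaller spread -> smoother landscape
```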
A smoother loss landscape and more predictive gradients allow higher learning rates and make training less sensitive to hyperparameter choices. This robustness is one of the biggest practical benefits of using Batch Normalization.
Summary
Batch Normalization was originally believed to work by reducing Internal Covariate Shift, but empirical evidence and later research, including the experiments replicated here, point to different mechanisms. Its real strength appears to lie in making gradients more predictive and in smoothing the loss landscape. Together, these effects yield faster, more stable training that is less sensitive to specific settings, which is what makes Batch Normalization a key technique in today’s deep learning.