
Curse of Batch Normalization

source link: https://towardsdatascience.com/curse-of-batch-normalization-8e6dd20bc304?gi=d3691a36cad

What are some drawbacks of using batch normalization?

May 15 · 6 min read


Photo by Freddie Collins on Unsplash

Batch Normalization is indeed one of the major breakthroughs in the field of deep learning, and it has been one of the hot topics of discussion among researchers in the past few years. It is a widely adopted technique that enables faster and more stable training and has become one of the most influential methods. However, despite its versatility, there are still some points holding this method back, as we are going to discuss in this article, which shows that there is still room for improvement in normalization methods.

Why do we use Batch Normalization?

Before discussing anything else, we should first know what batch normalization is, how it works, and discuss its use cases.

What Batch Normalization is

During training, the output distribution of each intermediate activation layer shifts at every iteration as we update the preceding weights. This phenomenon is referred to as internal covariate shift (ICS). So a natural thing to do, if I want to prevent this from happening, is to fix all the distributions. In simple words, if my distributions are shifting around, I'll just clamp them and not let them shift, which helps gradient optimization, prevents vanishing gradients, and lets my neural network train faster. Reducing this internal covariate shift was the key principle driving the development of batch normalization.

How it works

Batch Normalization normalizes the output of the previous layer by subtracting the empirical batch mean and dividing by the empirical batch standard deviation. This pushes the activations toward zero mean and unit variance, so the data looks closer to a Gaussian distribution.

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

where $\mu_B$ and $\sigma_B^2$ are the batch mean and batch variance, respectively.

$$y_i = \gamma \hat{x}_i + \beta$$

Then we learn a new mean and variance through two learnable parameters, γ and β. So, in short, you can think of batch normalization as something that helps you control the first and second moments of the distribution of the batch.
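To make the two steps above concrete, here is a minimal NumPy sketch of the batch-norm forward pass for a (batch, features) input. The function name, the epsilon value, and the toy data are illustrative assumptions, not code from the original article.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization forward pass (illustrative sketch).

    x     : activations from the previous layer, shape (N, D)
    gamma : learnable scale, shape (D,)
    beta  : learnable shift, shape (D,)
    """
    mu = x.mean(axis=0)                     # batch mean per feature
    var = x.var(axis=0)                     # batch variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    return gamma * x_hat + beta             # scale and shift with learnable parameters

# Toy example: 4 samples, 3 features, with an arbitrary mean and scale
x = np.random.randn(4, 3) * 10 + 5
gamma, beta = np.ones(3), np.zeros(3)       # identity scale/shift at initialization
y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0), y.std(axis=0))        # roughly zero mean, unit std per feature
```

With γ = 1 and β = 0, the output is simply the standardized batch; during training, γ and β let the network recover whatever first and second moments work best for the next layer.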

Figure: feature distributions output by an intermediate convolution layer of a VGG-16 network, (1) before, without any normalization, and (2) after applying batch normalization.

Benefits

I'll list some of the benefits of using batch normalization, but I won't go into much detail, as there are already tons of articles covering that.

  • Faster convergence.
  • Decreases the importance of initial weights.
  • Robust to hyperparameters.
  • Requires less data for generalization.

Figure: 1. Faster convergence, 2. Robust to hyperparameters.
