
Neural Networks Intuitions: 8. Translation Invariance in Object Detectors

source link: https://towardsdatascience.com/neural-networks-intuitions-8-translation-invariance-in-object-detectors-21db6e27879c?gi=9b024f136e5e


May 3 · 4 min read

Hello everyone!

This article is a short one and focuses on a subtle but often overlooked concept in object detectors, especially in single-shot detectors: Translation Invariance.

Let’s understand what translation invariance is and what makes an image classifier/object detector translation invariant.

*Note: This article assumes you have background knowledge of how single-stage and two-stage detectors work :-)

Translation Invariance:

Translation in computer vision means displacement in space, and invariance means the property of being unchanged under that displacement.

Therefore when we say an image classifier or an object detector is translation invariant, it means:

An image classifier can predict a class accurately regardless of where the class (more specifically, the pattern) is located along the image's spatial dimensions. Similarly, a detector can detect an object irrespective of where it is present in the image.

Let us look at an example of each problem to make things clear.

[Figure] Image Classification: untranslated and translated versions of an image from the MNIST dataset.
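To make this concrete, here is a minimal sketch (using a synthetic 28x28 array as a stand-in for a real MNIST digit) showing that translation merely relocates the pixel content without altering it:

```python
import numpy as np

# A toy 28x28 "image" with a bright 5x5 patch standing in for an MNIST digit.
image = np.zeros((28, 28), dtype=np.float32)
image[4:9, 4:9] = 1.0

# Translate the patch 6 pixels down and 10 pixels to the right.
translated = np.roll(image, shift=(6, 10), axis=(0, 1))

# The pixel content is unchanged; only its location differs.
print(image.sum() == translated.sum())           # True
print(np.argwhere(image > 0).mean(axis=0))       # patch centre ~ (6, 6)
print(np.argwhere(translated > 0).mean(axis=0))  # patch centre ~ (12, 16)
```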

[Figure] Object Detection: translated versions of a dog, the object to be detected in the image.

In this article, we will consider only convolutional neural networks, whether classifiers or detectors, and see whether they are translation invariant or not!

Translation Invariance in Convolutional Classifiers:

Are CNNs translation invariant? If so, what makes them invariant to translation?

Firstly, CNNs are not completely translation invariant, but only to some extent. Next, it is pooling that makes them translation invariant, not the convolution operation (applying filters) itself.
Note that the above statement applies only to classifiers, not to object detectors.

If we read Hinton's paper on translation invariance in CNNs, he clearly states that the pooling layer was introduced to reduce computational complexity and that translation invariance was only a by-product of it.

One can make CNNs completely translation invariant by training on the right kind of data, i.e., augmenting the training set so that every pattern appears at every possible location, although this may not be 100% feasible.

Note: I won’t be addressing the question of how pooling makes CNNs translation invariant. You can check it out in the links below :-)

http://cs231n.stanford.edu/reports/2016/pdfs/107_Report.pdf
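Still, the basic effect is easy to see in code. The deliberately contrived PyTorch sketch below shifts the input by one pixel; because the shift stays inside a single 2x2 max-pooling window, it is absorbed entirely (shifts that cross window boundaries would only be partially absorbed):

```python
import torch
import torch.nn.functional as F

# A toy 8x8 input with a single bright pixel at (row 4, col 6).
x = torch.zeros(1, 1, 8, 8)
x[..., 4, 6] = 1.0
x_shifted = torch.roll(x, shifts=1, dims=-1)   # pixel moves to col 7

pool = lambda t: F.max_pool2d(t, kernel_size=2)

# Columns 6 and 7 fall inside the same 2x2 pooling window, so the pooled
# outputs are identical even though the raw inputs differ.
print(torch.equal(x, x_shifted))               # False
print(torch.equal(pool(x), pool(x_shifted)))   # True
```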

Translation Invariance in Two-stage Detectors:

Two-stage object detectors have the following components:

  1. Region proposal stage
  2. Classification stage

The first stage predicts the locations of objects of interest (i.e., region proposals) and the second stage classifies those region proposals.

We can see that the first stage predicts foreground object locations, which means the problem is now reduced to image classification, performed by the second stage. This reduction makes a two-stage detector translation invariant without introducing any explicit changes to the neural network architecture.

This decoupling of the object's class prediction from the object's bounding box prediction is what makes a two-stage detector translation invariant!
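To make the decoupling explicit, here is a schematic sketch of the two-stage loop. The names `propose_regions` and `classify_crop` are hypothetical stand-ins, not a real Faster R-CNN API:

```python
import numpy as np

def crop_and_resize(image, box, size=32):
    """Stand-in for RoI pooling: crop the proposal and resample to a fixed size."""
    x0, y0, x1, y1 = box
    crop = image[y0:y1, x0:x1]
    rows = np.linspace(0, crop.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, crop.shape[1] - 1, size).astype(int)
    return crop[np.ix_(rows, cols)]

def detect(image, propose_regions, classify_crop):
    """Schematic two-stage loop: stage 1 localizes, stage 2 only classifies."""
    detections = []
    for box in propose_regions(image):       # stage 1: class-agnostic proposals
        crop = crop_and_resize(image, box)   # fixed-size input, location removed
        label, score = classify_crop(crop)   # stage 2: plain image classification
        if label != "background":
            detections.append((box, label, score))
    return detections
```

Since `classify_crop` only ever sees fixed-size crops, the object's position in the full image is invisible to the second stage.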

Translation Invariance in Single-stage Detectors:

Now that we have looked into two-stage detectors, we know that a single-stage detector needs to couple box and class predictions. One way of doing that is to make dense predictions (anchors) on a feature map, i.e., at every grid cell or group of cells on the feature map.

Read the following article, where I explain anchors in depth: Neural Networks Intuitions: 5. Anchors and Object Detection.

Since these dense predictions are made by convolving filters over feature maps, the network can detect the same pattern when it occurs at a different location on the feature map.
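Here is a minimal sketch of such a dense prediction head (loosely in the spirit of SSD/RetinaNet-style heads; the channel counts and feature map size are made up for illustration):

```python
import torch
import torch.nn as nn

num_anchors, num_classes = 3, 20

# The same 3x3 filters slide over the feature map, so every grid cell
# gets its own class scores and box offsets from identical weights.
feature_map = torch.randn(1, 256, 13, 13)   # output of some backbone
cls_head = nn.Conv2d(256, num_anchors * num_classes, 3, padding=1)
box_head = nn.Conv2d(256, num_anchors * 4, 3, padding=1)

print(cls_head(feature_map).shape)  # torch.Size([1, 60, 13, 13])
print(box_head(feature_map).shape)  # torch.Size([1, 12, 13, 13])
```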

For example, let us consider a neural network trained to detect dogs in an image. The filters in the final conv layer are responsible for recognizing these dog patterns.

We feed data to the network such that the dog always appears on the left side of the image, and test it with an image where the dog appears on the right side.

[Figure] The dog pattern learned by a filter in the last conv layer.

One of the filters in the last layer learns the above dog pattern, and since the same filter is convolved throughout and a prediction is made at every location on the feature map, it recognizes the same dog pattern in a different location!
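We can simulate this with a small PyTorch sketch, using a random patch as a stand-in for the learned dog filter: the same filter is slid over the entire image, and the peak response simply moves along with the pattern:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# A random 5x5 "dog pattern", used directly as the filter (a matched filter).
pattern = torch.rand(1, 1, 5, 5)

def embed(top, left, size=32):
    """Place the pattern at a given position in an otherwise blank image."""
    img = torch.zeros(1, 1, size, size)
    img[..., top:top + 5, left:left + 5] = pattern
    return img

for name, img in [("left", embed(10, 3)), ("right", embed(10, 24))]:
    response = F.conv2d(img, pattern)   # the same filter slid everywhere
    width = response.shape[-1]          # 28 for a 32x32 input
    row, col = divmod(response.flatten().argmax().item(), width)
    print(f"dog on the {name}: strongest response at (row={row}, col={col})")
```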

Finally, to answer the question of why filters make detectors translation invariant but not classifiers:

Filters in conv nets learn local features in an image rather than taking in the global context. Since the problem of object detection is to detect local features (objects) in the image, and not to make predictions from the entire feature map (which is what happens in the case of an image classifier), filters help make detectors invariant to translation.
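This contrast shows up directly in the output shapes of the two kinds of heads (a toy sketch with made-up channel counts):

```python
import torch
import torch.nn as nn

feat = torch.randn(1, 256, 13, 13)   # a backbone feature map

# Classifier head: collapses the whole spatial grid into one global answer.
classifier_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, 20)
)

# Detector head: keeps the grid and answers locally at every cell.
detector_head = nn.Conv2d(256, 20, kernel_size=1)

print(classifier_head(feat).shape)   # torch.Size([1, 20])
print(detector_head(feat).shape)     # torch.Size([1, 20, 13, 13])
```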

That’s all in this eighth instalment of my series. I hope you folks were able to get a good grasp of translation invariance in general and of what makes a detector invariant to object translation in images. Please feel free to correct me if I am wrong :-)

Cheers!

