What is R-CNN? - Summarizing Regions with CNN Features

December 20th, 2021

R-CNN was first proposed in a paper by Girshick et al. in 2013 and quickly became well known. The system consists of several major modules whose combination outperformed the state of the art (SOTA) at the time. On the PASCAL VOC 2010–12 datasets, R-CNN achieved an mAP of about 53%. On the 200-class ILSVRC2013 dataset, it achieved an mAP of 31.4%, about 7 percentage points above the second-best result of 24.3%.


@dipanks

Dipankar Das

Aspiring Applied Scientist


Original Paper — Rich feature hierarchies for accurate object detection and semantic segmentation

Object detection and image segmentation are two of the most heavily researched topics in computer vision. In this article, we take a look at one of the famous yet ‘old’ algorithms in object detection: R-CNN. It was first proposed in a paper by Girshick et al. in 2013 and quickly became well known.

Overview

R-CNN combines several major modules: a region proposal stage, a CNN feature extractor, and per-class SVM classifiers. Together, these modules gave results that were better than SOTA at the time.

Major Modules

The first module, the Region Proposal module, generates category-independent region proposals. Several region proposal methods are available; the authors chose Selective Search (fast mode) so that their results could be compared with earlier detection work. R-CNN itself is agnostic to the choice of proposal method.

The second module, the Feature Extraction module, uses a CNN to extract features from each proposed region of interest. Because the CNN expects a fixed-size input, the image data in each region is first converted to a compatible form: regardless of the region's size or aspect ratio, all pixels in a tight bounding box around it are warped to the required size.

The features are extracted by forward propagating a mean-subtracted 227×227 RGB image through five convolutional layers and two fully connected layers, which yields a 4096-dimensional feature vector for each proposal.
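The warping step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `warp_region` is a hypothetical helper, nearest-neighbour sampling is used to avoid extra dependencies, and the per-channel `mean` values are made up rather than taken from the paper.

```python
import numpy as np

def warp_region(image, box, mean, size=227):
    """Crop box=(x1, y1, x2, y2) from an H x W x 3 image and warp it to size x size.

    Aspect ratio is deliberately ignored, as described in the text above.
    Nearest-neighbour sampling is an assumption made to keep the sketch short.
    """
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2].astype(np.float32)
    h, w = crop.shape[:2]
    # For each output row/column, pick the source row/column it maps back to.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    warped = crop[rows[:, None], cols[None, :]]
    # Subtract a per-channel mean (hypothetical values supplied by the caller).
    return warped - np.asarray(mean, dtype=np.float32)

# Usage with a random stand-in image and an arbitrary proposal box.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)
patch = warp_region(image, (50, 20, 250, 120), mean=[120.0, 115.0, 100.0])
print(patch.shape)  # (227, 227, 3)
```

A 200×100 region and a 50×300 region both come out as 227×227 here, which is what lets one fixed CNN architecture process every proposal.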

Prediction

At test time, selective search is run on each image to extract around 2000 region proposals. Each proposal is warped and forward propagated through the CNN to obtain a feature vector. Then, for each object class, the feature vector is scored by the SVM trained for that class, which decides whether the region belongs to the class or not. Now, what happens when several scored regions overlap the same object? In such cases, greedy non-maximum suppression is applied independently for each class: a region is rejected if its intersection-over-union (IoU) overlap with a higher-scoring selected region is larger than a learned threshold.
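The greedy non-maximum suppression step can be sketched as follows for a single class. The function names and the `threshold=0.3` default are assumptions for illustration; the text above only says that the threshold is learned.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an (M, 4) array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, threshold=0.3):
    """Return indices of the boxes kept, highest score first."""
    order = np.argsort(scores)[::-1]  # candidates sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Drop every remaining box that overlaps the selected one too much.
        order = rest[iou(boxes[best], boxes[rest]) <= threshold]
    return keep

boxes = np.array([[10, 10, 110, 110],
                  [20, 20, 120, 120],
                  [200, 200, 300, 300]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = greedy_nms(boxes, scores)
print(kept)  # [0, 2]
```

The first two boxes have an IoU of about 0.68, so the lower-scoring one is suppressed, while the third box does not overlap either and survives.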

Training

Pretraining: The authors discriminatively pre-trained the CNN on a large auxiliary dataset using image-level annotations only, that is, without bounding-box labels.
Domain-specific fine-tuning: To adapt the CNN to the new task (detection) and the new domain (warped proposal windows), the authors continued SGD training of the CNN parameters using only warped region proposals. Aside from replacing the CNN's original classification layer with a randomly initialized (N+1)-way layer, for N object classes plus background, the architecture was unchanged.
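To feed an (N+1)-way classifier, each warped proposal needs a label. A minimal sketch of one way to assign them is shown below. It assumes the rule from the original paper, as I read it, that a proposal counts as a positive for a class when its IoU with a ground-truth box of that class is at least 0.5 and as background otherwise; the function names, the `pos_iou` parameter, and the example boxes are hypothetical.

```python
import numpy as np

def pairwise_iou(proposals, gt_boxes):
    """IoU matrix of shape (P, G) for boxes given as (x1, y1, x2, y2)."""
    x1 = np.maximum(proposals[:, None, 0], gt_boxes[None, :, 0])
    y1 = np.maximum(proposals[:, None, 1], gt_boxes[None, :, 1])
    x2 = np.minimum(proposals[:, None, 2], gt_boxes[None, :, 2])
    y2 = np.minimum(proposals[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_p = (proposals[:, 2] - proposals[:, 0]) * (proposals[:, 3] - proposals[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_p[:, None] + area_g[None, :] - inter)

def label_proposals(proposals, gt_boxes, gt_classes, pos_iou=0.5):
    """Return one label per proposal: 0 for background, 1..N for object classes."""
    overlaps = pairwise_iou(proposals, gt_boxes)
    best_gt = overlaps.argmax(axis=1)   # best-matching ground-truth box
    best_iou = overlaps.max(axis=1)
    return np.where(best_iou >= pos_iou, gt_classes[best_gt], 0)

gt_boxes = np.array([[0, 0, 100, 100], [200, 200, 300, 300]], dtype=float)
gt_classes = np.array([3, 7])  # hypothetical class ids in 1..N
proposals = np.array([[10, 10, 100, 100],
                      [150, 150, 260, 260],
                      [400, 400, 450, 450]], dtype=float)
labels = label_proposals(proposals, gt_boxes, gt_classes)
print(labels.tolist())  # [3, 0, 0]
```

Only the first proposal overlaps a ground-truth box strongly enough (IoU 0.81) to inherit its class; the other two fall into the background class.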

Results

On the PASCAL VOC 2010–12 datasets, R-CNN achieved an mAP of about 53%.
On the 200-class ILSVRC2013 dataset, R-CNN achieved an mAP of 31.4%, about 7 percentage points above the second-best result of 24.3%.

