MDETR: Modulated Detection for End-to-End Multi-Modal Understanding
This repository contains code and links to pre-trained models for MDETR (Modulated DETR), for pre-training on data with aligned text, images, and box annotations, as well as for fine-tuning on tasks requiring fine-grained understanding of image and text.
We show big gains on the phrase grounding task (Flickr30k), Referring Expression Comprehension (RefCOCO, RefCOCO+ and RefCOCOg) as well as Referring Expression Segmentation (PhraseCut, CLEVR Ref+). We also achieve competitive performance on visual question answering (GQA, CLEVR).
TL;DR. We depart from the fixed, frozen object detector approach of several popular vision + language pre-trained models and achieve true end-to-end multi-modal understanding by training our detector in the loop. In addition, we only detect objects that are relevant to the given text query, where the class labels for the objects are just the relevant words in the text query. This allows us to expand our vocabulary to anything found in free-form text, making it possible to detect and reason over novel combinations of object classes and attributes.
For details, please see the paper: MDETR - Modulated Detection for End-to-End Multi-Modal Understanding by Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve and Nicolas Carion.
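To get a quick feel for modulated detection, here is a minimal inference sketch. It assumes the repository exposes a `mdetr_resnet101` entry point via torch.hub with a `return_postprocessor` option, and that the model uses the two-pass `encode_and_save` forward interface shown in the project's demo notebook; the image URL, caption, and confidence threshold below are purely illustrative.

```python
import requests
import torch
import torchvision.transforms as T
from PIL import Image

# The "mdetr_resnet101" entry point, the return_postprocessor flag, and the
# two-pass encode_and_save forward used below are assumptions based on the
# project's demo notebook; check hubconf.py for the exact names.
model, postprocessor = torch.hub.load(
    "ashkamath/mdetr:main", "mdetr_resnet101",
    pretrained=True, return_postprocessor=True,
)
model.eval()

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # illustrative image
im = Image.open(requests.get(url, stream=True).raw)
caption = "two cats lying on a pink couch"

transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
img = transform(im).unsqueeze(0)

with torch.no_grad():
    # First pass encodes the image and caption, second pass decodes the boxes.
    memory_cache = model(img, [caption], encode_and_save=True)
    outputs = model(img, [caption], encode_and_save=False, memory_cache=memory_cache)

# Keep queries that are unlikely to be "no object"; the "labels" of the kept boxes
# correspond to spans of the caption rather than a fixed category list.
probas = 1 - outputs["pred_logits"].softmax(-1)[0, :, -1]
keep = probas > 0.7  # illustrative confidence threshold
print(outputs["pred_boxes"][0, keep])  # normalized (cx, cy, w, h) boxes
```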
Usage
The requirements file has all the dependencies that are needed by MDETR.
We provide instructions for installing the dependencies via conda. First, clone the repository locally:
git clone https://github.com/ashkamath/mdetr.git
Make a new conda env and activate it:
conda create -n mdetr_env python=3.8
conda activate mdetr_env
Install the packages listed in requirements.txt:
pip install -r requirements.txt
Multinode training
Distributed training is available via Slurm and submitit:
pip install submitit
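As a rough sketch of how a Slurm job can be submitted with submitit (the actual launch goes through the repository's training scripts; the node counts, partition name, and the `train()` placeholder below are illustrative assumptions):

```python
import submitit

def train():
    # Placeholder: in practice this would call the repository's training entry point.
    print("training MDETR...")

# AutoExecutor generates the Slurm submission files and stores logs in this folder.
executor = submitit.AutoExecutor(folder="slurm_logs")
executor.update_parameters(
    nodes=2,                # multinode: two nodes
    tasks_per_node=8,       # typically one task per GPU
    gpus_per_node=8,
    timeout_min=24 * 60,
    slurm_partition="dev",  # hypothetical partition name
)
job = executor.submit(train)
print(job.job_id)
```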
Pre-training
The links to the data, steps for data preparation, and the script for running finetuning can be found in the Pretraining Instructions. We also provide pre-trained model weights for MDETR trained on our combined aligned dataset of 1.3 million images paired with text.
The models are summarized in the following table. Note that the performance reported is "raw", without any fine-tuning. For each dataset, we report the class-agnostic box AP@50, which measures how well the model finds the boxes mentioned in the text. All performances are reported on the respective validation sets of each dataset.
| | Backbone | GQA AP | Flickr AP | Flickr R@1 | Refcoco AP | Refcoco R@1 | Refcoco+ R@1 | Refcocog R@1 | url | size |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | R101 | 58.9 | 75.6 | 82.5 | 60.3 | 72.1 | 58.0 | 55.7 | model | 3GB |
| 2 | ENB3 | 59.5 | 76.6 | 82.9 | 57.6 | 70.2 | 56.7 | 53.8 | model | 2.4GB |
| 3 | ENB5 | 59.9 | 76.4 | 83.7 | 61.8 | 73.4 | 58.8 | 57.1 | model | 2.7GB |
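Each "model" entry above links to a downloadable checkpoint. Below is a short sketch of inspecting and loading one of these files, assuming the usual PyTorch checkpoint layout with the weights stored under a `"model"` key (common for DETR-style repositories); the file name is hypothetical.

```python
import torch

# Hypothetical local path to one of the checkpoints linked in the table above.
ckpt_path = "pretrained_resnet101_checkpoint.pth"

checkpoint = torch.load(ckpt_path, map_location="cpu")
# Assumption: weights are stored under a "model" key; fall back to the raw dict otherwise.
state_dict = checkpoint.get("model", checkpoint)
print(f"{len(state_dict)} parameter tensors in the checkpoint")

# The weights can then be loaded into a model instance, e.g. one created via torch.hub
# as in the sketch above (strict=False tolerates head mismatches across tasks).
# model.load_state_dict(state_dict, strict=False)
```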
Downstream tasks
Phrase grounding on Flickr30k
Instructions for data preparation and the script to run evaluation can be found in the Flickr30k Instructions.
AnyBox protocol
| Backbone | Pre-training Image Data | Val R@1 | Val R@5 | Val R@10 | Test R@1 | Test R@5 | Test R@10 | url | size |
|---|---|---|---|---|---|---|---|---|---|
| Resnet-101 | COCO+VG+Flickr | 82.5 | 92.9 | 94.9 | 83.4 | 93.5 | 95.3 | model | 3GB |
| EfficientNet-B3 | COCO+VG+Flickr | 82.9 | 93.2 | 95.2 | 84.0 | 93.8 | 95.6 | model | 2.4GB |
| EfficientNet-B5 | COCO+VG+Flickr | 83.6 | 93.4 | 95.1 | 84.3 | 93.9 | 95.8 | model | 2.7GB |

MergedBox protocol
| Backbone | Pre-training Image Data | Val R@1 | Val R@5 | Val R@10 | Test R@1 | Test R@5 | Test R@10 | url | size |
|---|---|---|---|---|---|---|---|---|---|
| Resnet-101 | COCO+VG+Flickr | 82.3 | 91.8 | 93.7 | 83.8 | 92.7 | 94.4 | model | 3GB |

Referring expression comprehension on RefCOCO, RefCOCO+, RefCOCOg
Instructions for data preparation and the script to run finetuning and evaluation can be found in the Referring Expression Instructions.
RefCOCO
| Backbone | Pre-training Image Data | Val | TestA | TestB | url | size |
|---|---|---|---|---|---|---|
| Resnet-101 | COCO+VG+Flickr | 86.75 | 89.58 | 81.41 | model | 3GB |
| EfficientNet-B3 | COCO+VG+Flickr | 87.51 | 90.40 | 82.67 | model | 2.4GB |

RefCOCO+
| Backbone | Pre-training Image Data | Val | TestA | TestB | url | size |
|---|---|---|---|---|---|---|
| Resnet-101 | COCO+VG+Flickr | 79.52 | 84.09 | 70.62 | model | 3GB |
| EfficientNet-B3 | COCO+VG+Flickr | 81.13 | 85.52 | 72.96 | model | 2.4GB |

RefCOCOg
| Backbone | Pre-training Image Data | Val | Test | url | size |
|---|---|---|---|---|---|
| Resnet-101 | COCO+VG+Flickr | 81.64 | 80.89 | model | 3GB |
| EfficientNet-B3 | COCO+VG+Flickr | 83.35 | 83.31 | model | 2.4GB |

Referring expression segmentation on PhraseCut
Instructions for data preparation and the script to run finetuning and evaluation can be found in the PhraseCut Instructions.
| Backbone | M-IoU | Precision @0.5 | Precision @0.7 | Precision @0.9 | url | size |
|---|---|---|---|---|---|---|
| Resnet-101 | 53.1 | 56.1 | 38.9 | 11.9 | model | 1.5GB |
| EfficientNet-B3 | 53.7 | 57.5 | 39.9 | 11.9 | model | 1.2GB |

Visual question answering on GQA
Instructions for data preparation and the scripts to run finetuning and evaluation can be found in the GQA Instructions.
| Backbone | Test-dev | Test-std | url | size |
|---|---|---|---|---|
| Resnet-101 | 62.48 | 61.99 | model | 3GB |
| EfficientNet-B5 | 62.95 | 62.45 | model | 2.7GB |

Long-tailed few-shot object detection
Instructions for data preparation and the scripts to run finetuning and evaluation can be found in the LVIS Instructions.
| Data | AP | AP50 | APr | APc | APf | url | size |
|---|---|---|---|---|---|---|---|
| 1% | 16.7 | 25.8 | 11.2 | 14.6 | 19.5 | model | 3GB |
| 10% | 24.2 | 38.0 | 20.9 | 24.9 | 24.3 | model | 3GB |
| 100% | 22.5 | 35.2 | 7.4 | 22.7 | 25.0 | model | 3GB |

Synthetic datasets
Instructions to reproduce our results on CLEVR-based datasets are available in the CLEVR instructions.
| Overall Accuracy | Count | Exist | Compare Number | Query Attribute | Compare Attribute | url | size |
|---|---|---|---|---|---|---|---|
| 99.7 | 99.3 | 99.9 | 99.4 | 99.9 | 99.9 | model | 446MB |
License
MDETR is released under the Apache 2.0 license. Please see the LICENSE file for more information.