
O-DRUM @ CVPR 2022

Workshop on Open-Domain Retrieval Under a Multi-Modal Setting

in conjunction with CVPR 2022, New Orleans, June 20


Information Retrieval (IR) is an essential aspect of the internet era, and improvements in IR algorithms directly lead to a better search experience for the end-user. IR also serves as a vital component in many natural language processing tasks such as open-domain question answering and knowledge- and commonsense-based question answering. Recent advances in visual representation learning have also enabled image retrieval applications that have become a vital part of knowledge-based and commonsense visual question answering. Many datasets and IR algorithms have been developed to deal with input queries from a single modality, such as document retrieval from text queries, image retrieval from text queries, text retrieval from video queries, etc. However, in many cases, the query may be multi-modal: for instance, an image of a milkshake and a complementary textual description “restaurants near me” should return potential matches of nearby restaurants serving milkshakes. Similarly, sick patients may be able to input their signs and symptoms (for instance, photographs of swelling and natural language descriptions of fever) in order to retrieve more information about their condition. Such functionality is desirable in situations where each modality communicates partial, yet vital, information about the required output.
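
To make the multi-modal query setting concrete, below is a minimal illustrative sketch (not a method proposed or endorsed by the workshop) of retrieval from an image-plus-text query. It assumes the Hugging Face transformers implementation of CLIP with the publicly available "openai/clip-vit-base-patch32" checkpoint, a blank placeholder query image, a toy candidate list, and a simple average-fusion rule; all of these choices are illustrative assumptions.

```python
# Illustrative sketch: retrieval from a multi-modal (image + text) query.
# Assumed: Hugging Face CLIP checkpoint, placeholder image, average fusion.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"  # assumed checkpoint for illustration
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

# Query: a photo plus a complementary textual description.
query_image = Image.new("RGB", (224, 224))  # placeholder; replace with the query photo
query_text = "restaurants near me"

# Toy candidate documents to rank (in practice: listings, captions, passages, etc.).
candidates = [
    "Diner serving milkshakes and burgers, 0.5 miles away",
    "Hardware store with garden supplies",
    "Ice cream parlor with thick shakes around the corner",
]

with torch.no_grad():
    # Embed both query modalities and the candidates into CLIP's shared space.
    img_inputs = processor(images=query_image, return_tensors="pt")
    txt_inputs = processor(text=[query_text] + candidates, return_tensors="pt", padding=True)
    image_emb = model.get_image_features(**img_inputs)   # shape (1, d)
    text_embs = model.get_text_features(**txt_inputs)    # shape (1 + n, d)

    # L2-normalize, then fuse the two query embeddings by simple averaging.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)
    query_emb = (image_emb + text_embs[:1]) / 2
    query_emb = query_emb / query_emb.norm(dim=-1, keepdim=True)

    # Rank candidates by cosine similarity to the fused query.
    scores = (query_emb @ text_embs[1:].T).squeeze(0)
    ranking = scores.argsort(descending=True)

for rank, idx in enumerate(ranking.tolist(), start=1):
    print(f"{rank}. {candidates[idx]} (score={scores[idx]:.3f})")
```

The average-fusion rule is only one possible way to combine modalities; how best to fuse partial information from each modality is exactly the kind of open question the workshop addresses.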

O-DRUM 2022 seeks to address this emerging topic area of research. The workshop aims to bring together researchers from information retrieval, natural language processing, computer vision, and knowledge representation and reasoning to address information retrieval with queries that may come from multiple modalities (such as text, images, videos, audio, etc.), or multiple formats (paragraphs, tables, charts, etc.).

Call for Papers

We invite submissions related to the broad topic area of multi-modal retrieval, including but not limited to the following topic areas:

  • Retrieval from multi-modal queries or retrieval of multi-modal information.
  • New datasets or task design for open-domain retrieval from multi-modal queries, and multi-modal reasoning requiring external knowledge.
  • Modification or augmentation of existing benchmarks such as OK-VQA, VisualNews, Web-QA, etc.
  • Commentary and analysis on evaluation metrics in IR tasks, and proposals for new evaluation metrics.
  • New methods and empirical results for multi-modal retrieval.
  • Faster, efficient, or scalable algorithms for retrieval.
  • Methods which learn from web data and knowledge bases by retrieval, rather than from fixed sources.
  • Retrieval methods aiding other tasks such as image and video captioning, visual grounding, VQA, image generation, graphics, etc.
  • Use of Retrieval as a means for data augmentation/data generation in unsupervised/few-shot/zero-shot learning.

We encourage submissions of two types:

  • Extended abstracts (4 pages + unlimited references).
  • Long papers (maximum of 8 pages + unlimited references).

Submissions should be anonymized and formatted using the CVPR 2022 template. Accepted papers will be presented as posters during the workshop, where attendees, invited speakers and organizers can engage in discussion. We plan to highlight the best 3 papers via spotlight talks during the workshop session. We will give authors of all accepted papers an option to opt-in or opt-out of CVPR proceedings.

Important Dates:

  • Submission Deadline: April 08, 2022 (Friday), 23:59 PDT
  • Notification of Decision: 2nd week of April
  • Camera Ready Deadline: April 19, 2022 (Tuesday), 23:59 PDT
  • Submission website (CMT): https://cmt3.research.microsoft.com/ODRUM2022
  • Non-Proceedings Papers: Beyond April 8, we'll continue to accept submissions; however, they won't be eligible for proceedings (you can still present at the workshop).

Invited Speakers

Schedule coming soon.

Aniruddha Kembhavi
Allen Institute for AI
Website

Dr. Kembhavi leads PRIOR, the computer vision team at the Allen Institute for AI. He is also an Affiliate Associate Professor in the Computer Science & Engineering department at the University of Washington. His research interests are in problems at the intersection of vision, language, and embodiment.

Danqi Chen
Princeton University
Website

Dr. Chen is an Assistant Professor of Computer Science at Princeton University and co-lead of the Princeton NLP Group. She is also part of the larger Princeton AIML group and affiliated with the Princeton Center for Statistics and Machine Learning (CSML). Her broad interests are in natural language processing and machine learning, and her research is mostly driven by two goals: (1) developing effective and fundamental methods for learning representations of language and knowledge, and their interplay, and (2) building practical systems including question answering, information extraction, and conversational agents.

Diane Larlus
NAVER Labs Europe
Website

Dr. Larlus is a Principal Research Scientist at NAVER Labs Europe working on computer vision and machine learning, and a chair holder on lifelong representation learning within the MIAI AI research institute of Grenoble, working towards a semantic understanding of visual scenes. Her current interests are in lifelong learning, continual domain adaptation, and instance-level, semantic, and cross-modal visual search.

Mohit Bansal
University of North Carolina
Website

Dr. Bansal is the John R. & Louise S. Parker Associate Professor and the Director of the MURGe-Lab (in the UNC-NLP Group) in the Computer Science department at UNC Chapel Hill. His research expertise is in statistical natural language processing and machine learning, with a particular focus on multimodal, grounded, and embodied semantics (i.e., language with vision and speech, for robotics), human-like language generation and Q&A/dialogue, and interpretable and generalizable deep learning.

Xin (Eric) Wang
University of California, Santa Cruz
Website

Dr. Wang is an Assistant Professor of Computer Science and Engineering at UC Santa Cruz. His research interests include Natural Language Processing, Computer Vision, and Machine Learning, with an emphasis on building embodied AI agents that can communicate with humans using natural language to perform real-world multimodal tasks.

Organizers


Please contact Man Luo ([email protected]) or Tejas Gokhale ([email protected]) for additional details.
The workshop is supported by NSF grant 2132724 as part of Research, Education, and Outreach activities.

Website maintained by Tejas Gokhale

