Computer Science > Computer Vision and Pattern Recognition

[Submitted on 7 Sep 2023]

Evaluation and Mitigation of Agnosia in Multimodal Large Language Models

While Multimodal Large Language Models (MLLMs) are widely used for a variety of vision-language tasks, they sometimes misinterpret visual inputs or fail to follow textual instructions even in straightforward cases, leading to irrelevant responses, mistakes, and ungrounded claims. This behavior is analogous to a phenomenon in neuropsychology known as agnosia, an inability to correctly process sensory modalities and recognize things (e.g., objects, colors, relations). In our study, we adapt this concept to define "agnosia in MLLMs", and our goal is to comprehensively evaluate and mitigate it. Inspired by the diagnosis and treatment process in neuropsychology, we propose a novel framework, EMMA (Evaluation and Mitigation of Multimodal Agnosia). In EMMA, we develop an evaluation module that automatically creates fine-grained and diverse visual question answering examples to assess the extent of agnosia in MLLMs. We also develop a mitigation module that reduces agnosia in MLLMs through multimodal instruction tuning on fine-grained conversations. To verify the effectiveness of our framework, we evaluate and analyze agnosia in seven state-of-the-art MLLMs using 9K test samples. The results reveal that most of them exhibit agnosia across various aspects and to various degrees. We further develop a fine-grained instruction set and tune MLLMs to mitigate agnosia, which leads to notable improvement in accuracy.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2309.04041 [cs.CV]
	(or arXiv:2309.04041v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2309.04041

Submission history

From: Jiaying Lu
[v1] Thu, 7 Sep 2023 22:59:56 UTC (4,225 KB)
