
[2309.04041] Evaluation and Mitigation of Agnosia in Multimodal Large Language M...

source link: https://arxiv.org/abs/2309.04041

Computer Science > Computer Vision and Pattern Recognition

[Submitted on 7 Sep 2023]

Evaluation and Mitigation of Agnosia in Multimodal Large Language Models

Although Multimodal Large Language Models (MLLMs) are widely used for a variety of vision-language tasks, they sometimes misinterpret visual inputs or fail to follow textual instructions even in straightforward cases, leading to irrelevant responses, mistakes, and ungrounded claims. This behavior is analogous to a phenomenon in neuropsychology known as agnosia, an inability to correctly process sensory modalities and recognize things (e.g., objects, colors, relations). In our study, we adapt this concept to define "agnosia in MLLMs", and we aim to comprehensively evaluate and mitigate it. Inspired by the diagnosis and treatment process in neuropsychology, we propose a novel framework, EMMA (Evaluation and Mitigation of Multimodal Agnosia). In EMMA, we develop an evaluation module that automatically creates fine-grained and diverse visual question answering examples to comprehensively assess the extent of agnosia in MLLMs. We also develop a mitigation module that reduces agnosia in MLLMs through multimodal instruction tuning on fine-grained conversations. To verify the effectiveness of our framework, we evaluate and analyze agnosia in seven state-of-the-art MLLMs using 9K test samples. The results reveal that most of them exhibit agnosia across various aspects and to varying degrees. We further develop a fine-grained instruction set and tune MLLMs to mitigate agnosia, which leads to a notable improvement in accuracy.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as: arXiv:2309.04041 [cs.CV]
  (or arXiv:2309.04041v1 [cs.CV] for this version)
  https://doi.org/10.48550/arXiv.2309.04041

Submission history

From: Jiaying Lu
[v1] Thu, 7 Sep 2023 22:59:56 UTC (4,225 KB)
