NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?

Jiaxuan Li; Junwen Mo; MinhDuc Vo; Akihiro Sugimoto; Hideki Nakayama

arXiv:2411.17794·cs.CV·November 28, 2024

Open Access

TL;DR

This paper introduces NEMO, a benchmark that evaluates how well multimodal large language models recognize attribute-modified objects. The evaluation reveals pronounced performance gaps and offers insight into the models' limitations.

Contribution

The paper presents NEMO, a novel benchmark of 900 images of original and attribute-modified fruits paired with 2,700 questions (open, multiple-choice, and unsolvable), designed to systematically assess MLLMs' reasoning when recognizing attribute-modified objects.

Findings

01. MLLMs show notable performance gaps on NEMO.

02. Stronger vision encoders improve recognition, but MLLMs still lag behind standalone vision models.

03. Larger LLMs can weaken vision encoders during fine-tuning.

Abstract

Multimodal Large Language Models (MLLMs) have made notable advances in visual understanding, yet their ability to recognize objects modified by specific attributes remains an open question. To address this, we explore MLLMs' reasoning capabilities in object recognition, ranging from commonsense to beyond-commonsense scenarios. We introduce a novel benchmark, NEMO, which comprises 900 images of origiNal fruits and their corresponding attributE-MOdified ones, along with a set of 2,700 questions of open, multiple-choice, and unsolvable types. We assess 26 recent open-source and commercial models using our benchmark. The findings highlight pronounced performance gaps in recognizing objects in NEMO and reveal distinct answer preferences across different models. Although stronger vision encoders improve performance, MLLMs still lag behind standalone vision encoders. Interestingly,…
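As a rough illustration of how results on a benchmark shaped like this could be tallied, the sketch below groups per-question outcomes by image kind (original vs. attribute-modified) and question type, then reports the accuracy difference between the two image kinds. It is not the authors' evaluation code. The `Record` fields, the function names, and the reading of "performance gap" as original-minus-modified accuracy are all assumptions made for this example, as is the even split of three questions per image (inferred only from 2,700 / 900). The sample records are made up and are not results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass

# Question types named in the abstract.
QUESTION_TYPES = ("open", "multiple-choice", "unsolvable")

# 2,700 questions over 900 images works out to 3 per image.
# Assumption: one question of each type per image.
QUESTIONS_PER_IMAGE = 2700 // 900


@dataclass(frozen=True)
class Record:
    """One model answer on one question (hypothetical schema)."""
    image_id: str       # hypothetical identifier
    modified: bool      # True for an attribute-modified image
    question_type: str  # one of QUESTION_TYPES
    correct: bool       # whether the model's answer was judged correct


def accuracy_by_group(records):
    """Return accuracy keyed by (modified, question_type)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        key = (r.modified, r.question_type)
        totals[key] += 1
        hits[key] += int(r.correct)
    return {key: hits[key] / totals[key] for key in totals}


def original_vs_modified_gap(records, question_type):
    """Accuracy on original images minus accuracy on modified images."""
    acc = accuracy_by_group(records)
    return acc[(False, question_type)] - acc[(True, question_type)]


# Toy data for illustration only; not taken from the paper.
toy_records = [
    Record("img-001", False, "open", True),
    Record("img-002", False, "open", True),
    Record("img-003", True, "open", True),
    Record("img-004", True, "open", False),
]
toy_gap = original_vs_modified_gap(toy_records, "open")
```

On the toy data, accuracy is 1.0 on original images and 0.5 on modified ones, so `toy_gap` is 0.5. Keeping the grouping key as a `(modified, question_type)` pair makes it straightforward to compare question types as well as image kinds.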

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Linguistics and Terminology Studies

Methods: Sparse Evolutionary Training