NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?
Jiaxuan Li, Junwen Mo, MinhDuc Vo, Akihiro Sugimoto, Hideki Nakayama

TL;DR
This paper introduces NEMO, a benchmark that evaluates how well multimodal large language models (MLLMs) recognize attribute-modified objects. It reveals significant performance gaps and offers insight into the models' limitations.
Contribution
The paper presents NEMO, a novel benchmark with 900 images and 2,700 questions, to systematically assess MLLMs' reasoning in recognizing attribute-modified objects.
Findings
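Across the 26 open-source and commercial models evaluated, MLLMs show pronounced performance gaps in recognizing objects in NEMO, along with distinct answer preferences across models.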
Stronger vision encoders improve recognition, but MLLMs still lag behind standalone vision models.
Larger LLMs can weaken vision encoders during fine-tuning.
Abstract
Multimodal Large Language Models (MLLMs) have made notable advances in visual understanding, yet their ability to recognize objects modified by specific attributes remains an open question. To address this, we explore MLLMs' reasoning capabilities in object recognition, ranging from commonsense to beyond-commonsense scenarios. We introduce a novel benchmark, NEMO, which comprises 900 images of origiNal fruits and their corresponding attributE-MOdified ones, along with a set of 2,700 questions of open, multiple-choice, and unsolvable types. We assess 26 recent open-source and commercial models using our benchmark. The findings highlight pronounced performance gaps in recognizing objects in NEMO and reveal distinct answer preferences across different models. Although stronger vision encoders improve performance, MLLMs still lag behind standalone vision encoders. Interestingly,…
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Linguistics and Terminology Studies
Methods: Sparse Evolutionary Training
