GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
Rang Li, Lei Li, Shuhuai Ren, Hao Tian, Shuhao Gu, Shicheng Li, Zihao Yue, Yudong Wang, Wenhan Ma, Zhe Yang, Jingyuan Ma, Zhifang Sui, Fuli Luo

TL;DR
GroundingME is a comprehensive benchmark that evaluates multimodal large language models' ability to perform complex visual grounding tasks, revealing significant gaps in current models and proposing strategies for improvement.
Contribution
The paper introduces GroundingME, a multi-dimensional benchmark that rigorously assesses MLLMs' visual grounding capabilities across real-world challenges, highlighting their limitations and potential enhancement methods.
Findings
Even the best-performing model evaluated achieves only 45.1% accuracy on GroundingME.
Most models score 0% on rejection tasks, indicating poor recognition of ungroundable queries.
Data-mixture training improves rejection accuracy from 0% to 27.9%.
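To make the accuracy and rejection figures above concrete, here is a minimal sketch of how a grounding benchmark with a rejection category could be scored. It assumes the common convention that a prediction counts as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5, and that a rejection sample is correct only when the model returns no box. This summary does not state GroundingME's exact scoring protocol, so the threshold and the names `iou`, `is_correct`, and `accuracy` are illustrative assumptions, not the authors' implementation.

```python
from typing import Optional, Sequence, Tuple

# Axis-aligned box as (x1, y1, x2, y2); None means "no box" (a rejection).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred: Optional[Box], gt: Optional[Box], thr: float = 0.5) -> bool:
    """Assumed rule: ungroundable queries (gt is None) require a rejection;
    groundable queries require a box with IoU >= thr."""
    if gt is None:
        return pred is None
    if pred is None:
        return False
    return iou(pred, gt) >= thr


def accuracy(
    preds: Sequence[Optional[Box]],
    gts: Sequence[Optional[Box]],
    thr: float = 0.5,
) -> float:
    """Fraction of samples judged correct under the assumed rule."""
    if not gts:
        return 0.0
    hits = sum(is_correct(p, g, thr) for p, g in zip(preds, gts))
    return hits / len(gts)


# A model that always outputs a box scores 0 on ungroundable queries.
always_box = [(0.0, 0.0, 10.0, 10.0), (5.0, 5.0, 8.0, 8.0)]
print(accuracy(always_box, [None, None]))
```

Under this rule, a model that never abstains gets every rejection sample wrong regardless of how well it localizes objects elsewhere, which is one way a 0% rejection score can arise.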
Abstract
Visual grounding, localizing objects from natural language descriptions, represents a critical bridge between language and vision understanding. While multimodal large language models (MLLMs) achieve impressive scores on existing benchmarks, a fundamental question remains: can MLLMs truly visually ground with human-like sophistication, or are they merely pattern-matching on simplified datasets? Current benchmarks fail to capture real-world complexity where humans effortlessly navigate intricate references and recognize when grounding is impossible. To rigorously assess MLLMs' true capabilities, we introduce GroundingME, a benchmark that systematically challenges models across four critical dimensions: (1) Discriminative: distinguishing highly similar objects, (2) Spatial: understanding complex relational descriptions, (3) Limited: handling occlusions or tiny objects, and (4) Rejection:…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
