GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

Rang Li; Lei Li; Shuhuai Ren; Hao Tian; Shuhao Gu; Shicheng Li; Zihao Yue; Yudong Wang; Wenhan Ma; Zhe Yang; Jingyuan Ma; Zhifang Sui; Fuli Luo

arXiv:2512.17495 · cs.CV · March 24, 2026

TL;DR

GroundingME is a comprehensive benchmark that evaluates multimodal large language models' ability to perform complex visual grounding tasks, revealing significant gaps in current models and proposing strategies for improvement.

Contribution

The paper introduces GroundingME, a multi-dimensional benchmark that rigorously assesses MLLMs' visual grounding capabilities across real-world challenges, highlighting their limitations and potential enhancement methods.

Findings

01. Even the best evaluated model reaches only 45.1% accuracy on GroundingME.

02. Most models score 0% on rejection tasks: they fail to recognize queries that cannot be grounded in the image.

03. Data-mixture training improves rejection accuracy from 0% to 27.9%.
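
To make the accuracy and rejection figures above concrete, here is a minimal sketch of how a grounding benchmark with ungroundable queries could be scored. It is not the paper's evaluation code. The IoU threshold of 0.5, the convention that `None` means "no such object", and the names `iou`, `is_correct`, and `accuracy` are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

# Axis-aligned box as (x1, y1, x2, y2); this layout is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred: Optional[Box], gt: Optional[Box], thr: float = 0.5) -> bool:
    """Score one example.

    gt is None   -> ungroundable query; correct only if the model also rejects.
    gt is a box  -> correct only if the model predicts a box with IoU >= thr.
    The 0.5 default threshold is an assumed, commonly used value.
    """
    if gt is None:
        return pred is None
    if pred is None:
        return False
    return iou(pred, gt) >= thr


def accuracy(
    preds: Sequence[Optional[Box]],
    gts: Sequence[Optional[Box]],
    thr: float = 0.5,
) -> float:
    """Fraction of examples scored correct; 0.0 for an empty input."""
    if not gts:
        return 0.0
    hits = sum(is_correct(p, g, thr) for p, g in zip(preds, gts))
    return hits / len(gts)
```

Under this scoring, a model that always outputs some box gets every ungroundable example wrong, which is one way a 0% rejection score can arise.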

Abstract

Visual grounding, localizing objects from natural language descriptions, represents a critical bridge between language and vision understanding. While multimodal large language models (MLLMs) achieve impressive scores on existing benchmarks, a fundamental question remains: can MLLMs truly visually ground with human-like sophistication, or are they merely pattern-matching on simplified datasets? Current benchmarks fail to capture real-world complexity where humans effortlessly navigate intricate references and recognize when grounding is impossible. To rigorously assess MLLMs' true capabilities, we introduce GroundingME, a benchmark that systematically challenges models across four critical dimensions: (1) Discriminative: distinguishing highly similar objects, (2) Spatial: understanding complex relational descriptions, (3) Limited: handling occlusions or tiny objects, and (4) Rejection:…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques