Beyond Referring Expressions: Scenario Comprehension Visual Grounding
Ruozhen He, Nisarg A. Shah, Qihua Dong, Zilin Xiao, Jaywon Koo, and Vicente Ordonez

TL;DR
This paper introduces a new benchmark, RSC, for scenario-based visual grounding that requires understanding roles, intentions, and context, revealing limitations of current models.
Contribution
The paper presents RSC, a challenging scenario comprehension benchmark with detailed annotations, and proposes ScenGround, a curriculum reasoning method for improved performance.
Findings
Current models struggle with scenario-based queries, where the target must be inferred from roles, intentions, and relational context rather than explicit naming.
Curriculum training enhances model performance on difficult cases.
ScenGround improves transferability to standard benchmarks.
Abstract
Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position, which expose distinct failure modes and support fine-grained analysis. RSC…
