RefBench-PRO: Perceptual and Reasoning Oriented Benchmark for Referring Expression Comprehension
Tianyi Gao, Hao Li, Han Fang, Xin Wei, Xiaodong Dong, Hongbo Sun, Ye Yuan, Zhongjiang He, Jinglin Xu, Jingmin Xin, Hao Sun

TL;DR
RefBench-PRO is a new comprehensive benchmark for referring expression comprehension that evaluates perception and reasoning capabilities of vision-language models through diverse, challenging tasks and an automated data-generation pipeline.
Contribution
The paper introduces a benchmark that decomposes REC into perception and reasoning across six challenging sub-tasks, and proposes Ref-R1, an RL-based learning scheme for improved localization accuracy.
Findings
RefBench-PRO poses greater challenges for MLLMs in both perception and reasoning.
Ref-R1 improves localization accuracy under complex reasoning conditions.
The benchmark enables interpretable evaluation of MLLMs on REC tasks.
Abstract
Referring Expression Comprehension (REC) is a vision-language task that localizes a specific image region based on a textual description. Existing REC benchmarks primarily evaluate perceptual capabilities and lack interpretable scoring mechanisms, so they cannot reveal the grounding capability of Multi-modal Large Language Models (MLLMs) across different cognitive abilities. To address this limitation, we introduce RefBench-PRO, a comprehensive REC benchmark that decomposes referring expressions into two core dimensions, i.e., perception and reasoning, and further subdivides them into six progressively challenging tasks: attribute, position, interaction, commonsense, relation, and reject. We also develop a fully automated data-generation pipeline that produces diverse referring expressions across these six sub-dimensions. Furthermore, we propose Ref-R1, an RL-based learning scheme,…
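To make the idea of interpretable, per-task scoring concrete, here is a minimal sketch of how accuracy could be broken down by sub-task. It is not the paper's evaluation code: the IoU threshold of 0.5 is the common REC convention and is assumed here, the treatment of reject samples (a prediction counts as correct only when the model returns no box) is an assumption, and the names `iou`, `is_correct`, and `per_task_accuracy` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Sequence, Tuple

Box = Sequence[float]  # (x1, y1, x2, y2) in pixels; hypothetical format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred: Optional[Box], gt: Optional[Box], thr: float = 0.5) -> bool:
    """Assumed rule: IoU >= thr for normal samples, no box for reject samples."""
    if gt is None:  # reject sample: the expression matches nothing in the image
        return pred is None
    if pred is None:  # model refused although a target exists
        return False
    return iou(pred, gt) >= thr


def per_task_accuracy(
    samples: Iterable[Tuple[str, Optional[Box], Optional[Box]]],
    thr: float = 0.5,
) -> Dict[str, float]:
    """Accuracy per sub-task from (task, predicted_box, ground_truth_box) triples."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for task, pred, gt in samples:
        totals[task] += 1
        hits[task] += int(is_correct(pred, gt, thr))
    return {task: hits[task] / totals[task] for task in totals}


# Toy data, invented purely for illustration.
samples = [
    ("attribute", (0, 0, 10, 10), (0, 0, 10, 10)),    # exact match
    ("attribute", (0, 0, 10, 10), (20, 20, 30, 30)),  # no overlap
    ("position", (0, 0, 10, 10), (0, 0, 10, 5)),      # IoU exactly 0.5
    ("reject", None, None),                           # correct refusal
    ("reject", (0, 0, 5, 5), None),                   # box on a reject sample
]
scores = per_task_accuracy(samples)
```

Reporting one number per sub-task, rather than a single pooled accuracy, is what lets a reader see whether a model fails on, say, commonsense expressions while doing well on attribute ones.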
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling
