RefBench-PRO: Perceptual and Reasoning Oriented Benchmark for Referring Expression Comprehension

Tianyi Gao; Hao Li; Han Fang; Xin Wei; Xiaodong Dong; Hongbo Sun; Ye Yuan; Zhongjiang He; Jinglin Xu; Jingmin Xin; Hao Sun

arXiv:2512.06276 · cs.CV · December 16, 2025

TL;DR

RefBench-PRO is a new comprehensive benchmark for referring expression comprehension that evaluates perception and reasoning capabilities of vision-language models through diverse, challenging tasks and an automated data-generation pipeline.

Contribution

It introduces a novel benchmark decomposing REC into perception and reasoning, with six challenging sub-tasks, and proposes Ref-R1, an RL-based learning scheme for improved localization accuracy.

Findings

01 RefBench-PRO reveals greater challenges in perception and reasoning for MLLMs.

02 Ref-R1 improves localization accuracy under complex reasoning conditions.

03 The benchmark enables interpretable evaluation of MLLMs on REC tasks.

Abstract

Referring Expression Comprehension (REC) is a vision-language task that localizes a specific image region based on a textual description. Existing REC benchmarks primarily evaluate perceptual capabilities and lack interpretable scoring mechanisms, so they cannot reveal the grounding capability of Multimodal Large Language Models (MLLMs) across different cognitive abilities. To address this limitation, we introduce RefBench-PRO, a comprehensive REC benchmark that decomposes referring expressions into two core dimensions, perception and reasoning, and further subdivides them into six progressively challenging tasks: attribute, position, interaction, commonsense, relation, and reject. We also develop a fully automated data-generation pipeline that produces diverse referring expressions across these six sub-dimensions. Furthermore, we propose Ref-R1, an RL-based learning scheme,…
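
To make the idea of per-task, interpretable scoring concrete, the following is a minimal sketch and not the paper's evaluation code. It assumes the box-overlap convention common in REC work (a prediction counts as correct when its IoU with the ground-truth box is at least 0.5) and reports accuracy separately for each sub-task. The function names, the `(task, predicted_box, ground_truth_box)` sample layout, and the treatment of "reject" samples (ground truth is `None`, and the prediction is correct only if the model also returns no box) are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def per_task_accuracy(samples, threshold=0.5):
    """Accuracy per sub-task for (task, predicted_box, ground_truth_box) samples.

    Assumption (not from the paper): a ground truth of None marks a sample
    with no valid target, which is scored as correct only when the
    prediction is also None.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for task, pred, gt in samples:
        totals[task] += 1
        if gt is None:
            correct = pred is None
        else:
            correct = pred is not None and iou(pred, gt) >= threshold
        hits[task] += int(correct)
    return {task: hits[task] / totals[task] for task in totals}


# Toy data for illustration only; these are not benchmark results.
samples = [
    ("attribute", (0, 0, 10, 10), (0, 0, 10, 10)),
    ("position", (0, 0, 2, 2), (1, 1, 3, 3)),
    ("reject", None, None),
    ("reject", (0, 0, 1, 1), None),
]
accuracy = per_task_accuracy(samples)
```

Reporting one number per sub-task, rather than a single aggregate, is what lets a reader see whether a model fails on perception-oriented or reasoning-oriented expressions.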

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling