Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks
Qihua Dong, Kuo Yang, Lin Ju, Handong Zhao, Yitian Zhang, Yizhou Wang, Huimin Zeng, Jianglin Lu, Yun Fu

TL;DR
Ref-Adv introduces a challenging new benchmark for referring expression comprehension that emphasizes complex reasoning and grounding, revealing current models' reliance on shortcuts and highlighting areas for future improvement.
Contribution
The paper presents Ref-Adv, a novel REC benchmark designed to challenge models with nontrivial expressions and hard distractors, promoting genuine visual reasoning and grounding.
Findings
Models perform well on existing benchmarks but struggle on Ref-Adv.
Ref-Adv exposes reliance on shortcuts in current models.
Comprehensive analysis reveals gaps in visual reasoning capabilities.
Abstract
Referring Expression Comprehension (REC) links language to region-level visual perception. Standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) have progressed rapidly with multimodal LLMs but remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reasoning demand; (ii) images often contain few distractors, making the target easy to find; and (iii) redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning. We introduce Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target. The dataset contains referring expressions on real images, curated with hard distractors and annotated with reasoning facets including negation. We conduct comprehensive ablations (word order…
Peer Reviews
Decision: ICLR 2026 Poster
1. The benchmark is generated based on a sound rationale. This seems effective at revealing weaknesses in current REC models, although these models tend to overfit the traditional benchmarks. 2. The evaluation is comprehensive and overall convincing.
1. One thing to keep in mind is that the GPT-4o-generated captions are somewhat biased compared to the human captions, in that GPT-4o apparently generates more negations, whereas humans usually avoid negations unintentionally. This is not necessarily bad, since in real applications users may need to query with negations. 2. The analysis of experimental results is minimal. In particular, I wish the authors had analyzed the performance on captions containing negations.
It proposes an anti-shortcut design: bag-of-words shuffling and descriptor deletion both cause larger drops than on legacy benchmarks. This supports the need for compositional, order-aware grounding. Its data construction is considerable, with a strong filtering pipeline and coverage of negation. The final benchmark is processed with strict three-human-annotator agreement. From the experiments, the benchmark exposes failure modes that legacy REC underestimates, creating a clear diagnostic “stress tes
The paper's coverage analysis is limited. There is no thorough breakdown of the 2,833 images / 5,000 instances (categories/attributes/relations/occlusion, long tails) or side-by-side coverage comparison with the traditional, widely used RefCOCO/+/g benchmarks. The CoT conclusion on RefCOCO conflicts with prior work in a top venue. ARGUS [1] reports that grounded CoT improves MLLM performance on RefCOCO/+/g, but this paper finds that CoT can hurt performance on RefCOCO, making the conclusion unconvincing. There are
The paper not only points out the saturation of classic benchmarks but also deeply analyzes the three specific causes with data and examples. The proposed four-stage data pipeline is outstanding. The two-stage LLM process, particularly "Similarity Judgement" and "Minimally Sufficient Expression Generation," is cleverly designed. The extremely strict three-annotator verification process ensures the dataset's exceptionally high quality and trustworthiness. The CoT experiment is an especially ins
1. Limitation of an Evaluation-Only Benchmark: Ref-Adv (5k samples) is an evaluation benchmark, not a training set. It excels at diagnosing the flaws of current models but does not provide a pathway for models to learn how to solve these complex reasoning tasks. Given its high construction cost (low keep rate), how to scale this up into a training set is an open question. 2. The paper repeatedly claims to test "visual reasoning" and "multi-step reasoning." However, the scope of reasoning evalua
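The bag-of-words shuffling and descriptor-deletion ablations mentioned in the reviews above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenization, the fixed seed, and the example expression are all assumptions made for illustration.

```python
import random


def shuffle_words(expression: str, seed: int = 0) -> str:
    """Return the expression with its whitespace-separated words in random order.

    Hypothetical helper: a model that scores the same on the shuffled text
    is likely ignoring word order.
    """
    words = expression.split()
    rng = random.Random(seed)  # fixed seed so the perturbation is reproducible
    rng.shuffle(words)
    return " ".join(words)


def delete_descriptor(expression: str, descriptor: str) -> str:
    """Remove one descriptor phrase and collapse the leftover whitespace.

    Hypothetical helper: if accuracy barely changes after deletion,
    the descriptor was redundant for identifying the target.
    """
    return " ".join(expression.replace(descriptor, " ").split())


def accuracy_drop(acc_original: float, acc_perturbed: float) -> float:
    """Difference in accuracy between original and perturbed expressions."""
    return acc_original - acc_perturbed


# Illustrative expression, not taken from the dataset.
original = "the red cup left of the plate"
shuffled = shuffle_words(original)
shortened = delete_descriptor(original, "red")  # "the cup left of the plate"
```

Under this reading, a larger `accuracy_drop` on a benchmark suggests that its expressions depend more on word order and on each descriptor being present.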
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
