Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks

Qihua Dong; Kuo Yang; Lin Ju; Handong Zhao; Yitian Zhang; Yizhou Wang; Huimin Zeng; Jianglin Lu; Yun Fu

arXiv:2602.23898·cs.CV·March 2, 2026

TL;DR

Ref-Adv introduces a challenging new benchmark for referring expression comprehension that emphasizes complex reasoning and grounding, revealing current models' reliance on shortcuts and highlighting areas for future improvement.

Contribution

The paper presents Ref-Adv, a novel REC benchmark designed to challenge models with nontrivial expressions and hard distractors, promoting genuine visual reasoning and grounding.

Findings

1. Models perform well on existing benchmarks but struggle on Ref-Adv.
2. Ref-Adv exposes reliance on shortcuts in current models.
3. Comprehensive analysis reveals gaps in visual reasoning capabilities.

Abstract

Referring Expression Comprehension (REC) links language to region-level visual perception. Standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) have progressed rapidly with multimodal LLMs but remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reasoning demand; (ii) images often contain few distractors, making the target easy to find; and (iii) redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning. We introduce Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target. The dataset contains referring expressions on real images, curated with hard distractors and annotated with reasoning facets including negation. We conduct comprehensive ablations (word order…
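To make the word-order ablation mentioned in the abstract concrete, the following is a minimal sketch of how such a test could be scored. It assumes the conventional REC criterion used on RefCOCO-style benchmarks, where a predicted box counts as correct when its IoU with the ground-truth box is at least 0.5; whether Ref-Adv uses exactly this protocol is an assumption. The names `iou`, `shuffle_words`, `accuracy`, and the `predict` callable are hypothetical and do not come from the paper or its code.

```python
import random
from typing import Callable, Sequence

# Boxes are (x1, y1, x2, y2) in pixel coordinates; this layout is an assumption.
Box = tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def shuffle_words(expression: str, rng: random.Random) -> str:
    """Bag-of-words perturbation: same words, random order."""
    words = expression.split()
    rng.shuffle(words)
    return " ".join(words)


def accuracy(
    predict: Callable[[object, str], Box],  # hypothetical model wrapper
    samples: Sequence[tuple[object, str, Box]],  # (image, expression, gt_box)
    threshold: float = 0.5,
    shuffle: bool = False,
    seed: int = 0,
) -> float:
    """Fraction of samples whose predicted box reaches the IoU threshold."""
    rng = random.Random(seed)
    hits = 0
    for image, expression, gt_box in samples:
        text = shuffle_words(expression, rng) if shuffle else expression
        if iou(predict(image, text), gt_box) >= threshold:
            hits += 1
    return hits / len(samples) if samples else 0.0
```

Comparing `accuracy(predict, samples)` with `accuracy(predict, samples, shuffle=True)` gives the drop caused by destroying word order. A model that relies on bag-of-words shortcuts would be expected to lose little under shuffling, while a benchmark that demands order-aware grounding should show a larger gap.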

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The benchmark is generated based on a sound rationale. This seems effective at revealing weaknesses in current REC models, although these models tend to overfit the traditional benchmarks.
2. The evaluation is comprehensive and overall convincing.

Weaknesses

1. One thing to keep in mind is that the GPT-4o-generated captions are somewhat biased compared to human captions: GPT-4o apparently generates more negations, whereas humans usually avoid negations unintentionally. This is not necessarily bad, since in real applications users may need to query with negations.
2. The analysis of experimental results is minimal. In particular, I wish the authors could analyze the performance on captions containing negations.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

It proposes an anti-shortcut design, including bag-of-words shuffling and descriptor deletion, both of which cause larger drops than on legacy benchmarks. This supports the need for compositional, order-aware grounding. Its data construction is considerable, with a strong filter pipeline and coverage of negation. The final benchmark is processed with strict 3-human-annotator agreement. From the experiments, the benchmark exposes failure modes that legacy REC underestimates, creating a clear diagnostic “stress tes

Weaknesses

The paper's coverage analysis is limited. There is no thorough breakdown of the 2,833 images / 5,000 instances (categories/attributes/relations/occlusion, long tails) or side-by-side coverage versus the traditional, widely used RefCOCO/+/g benchmarks. The CoT conclusion on RefCOCO conflicts with prior work in a top venue. ARGUS [1] reports that grounded CoT improves MLLM performance on RefCOCO/+/g, but this paper finds that CoT can hurt performance on RefCOCO, making the conclusion not convincing. There are

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The paper not only points out the saturation of classic benchmarks but also deeply analyzes the three specific causes with data and examples. The proposed four-stage data pipeline is outstanding. The two-stage LLM process, particularly "Similarity Judgement" and "Minimally Sufficient Expression Generation," is cleverly designed. The extremely strict three-annotator verification process ensures the dataset's exceptionally high quality and trustworthiness. The CoT experiment is an especially ins

Weaknesses

1. Limitation of an Evaluation-Only Benchmark: Ref-Adv (5k samples) is an evaluation benchmark, not a training set. It excels at diagnosing the flaws of current models but does not provide a pathway for models to learn how to solve these complex reasoning tasks. Given its high construction cost (low keep rate), how to scale this up into a training set is an open question.
2. The paper repeatedly claims to test "visual reasoning" and "multi-step reasoning." However, the scope of reasoning evalua

Code & Models

Datasets

dddraxxx/ref-adv-s
dataset · 231 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques