KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge

Guanghao Jin; Jingpei Wu; Tianpei Guo; Yiyi Niu; Weidong Zhou; Guoyang Liu

arXiv:2508.14080 · cs.LG · August 21, 2025

TL;DR

KnowDR-REC is a new benchmark designed to evaluate multimodal models' reasoning with real-world knowledge, incorporating fine-grained annotations, negative samples, and novel metrics to assess robustness and interpretability.

Contribution

The paper introduces KnowDR-REC, a comprehensive benchmark with real-world knowledge, negative samples, and new evaluation metrics for assessing multimodal models' reasoning capabilities.

Findings

1. Existing models struggle with knowledge-driven visual grounding.

2. Many models rely on shortcut correlations rather than genuine reasoning.

3. The benchmark reveals a gap between textual understanding and visual grounding.

Abstract

Referring Expression Comprehension (REC) is a popular multimodal task that aims to accurately detect target objects within a single image based on a given textual expression. However, due to the limitations of earlier models, traditional REC benchmarks either rely solely on intra-image cues or lack sufficiently fine-grained instance annotations, making them inadequate for evaluating the reasoning capabilities of Multi-modal Large Language Models (MLLMs). To address this gap, we propose a new benchmark, KnowDR-REC, characterized by three key features: Firstly, it is built upon real-world knowledge, requiring fine-grained multimodal reasoning across text and image. Secondly, the dataset includes elaborately constructed negative samples via fine-grained expression editing, designed to evaluate a model's robustness and anti-hallucination ability. Lastly, we introduce three novel evaluation…
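
To make the task setup above concrete, the following is a minimal sketch of how a REC prediction is conventionally scored, extended with a rejection check for negative samples. It is not the paper's evaluation code, and it does not reproduce the three metrics the abstract mentions. The box format `(x1, y1, x2, y2)`, the use of `None` to mean "no target in the image", the 0.5 IoU threshold, and the names `iou`, `is_correct`, and `accuracy` are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels, with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred: Optional[Box], gold: Optional[Box],
               thresh: float = 0.5) -> bool:
    """Score one sample.

    Assumption: gold is None for a negative sample (the edited expression
    matches nothing in the image), and the model signals rejection by
    returning None instead of a box.
    """
    if gold is None:
        return pred is None          # correct only if the model abstains
    if pred is None:
        return False                 # missed a target that does exist
    return iou(pred, gold) >= thresh


def accuracy(preds: Sequence[Optional[Box]],
             golds: Sequence[Optional[Box]],
             thresh: float = 0.5) -> float:
    """Fraction of samples scored correct; 0.0 for an empty set."""
    if not golds:
        return 0.0
    hits = sum(is_correct(p, g, thresh) for p, g in zip(preds, golds))
    return hits / len(golds)
```

Under these assumptions, a model that always outputs a box scores zero on every negative sample, which is one simple way to surface the hallucination behavior the benchmark is designed to probe.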

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning