KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge
Guanghao Jin, Jingpei Wu, Tianpei Guo, Yiyi Niu, Weidong Zhou, Guoyang Liu

TL;DR
KnowDR-REC is a new benchmark designed to evaluate multimodal models' reasoning with real-world knowledge, incorporating fine-grained annotations, negative samples, and novel metrics to assess robustness and interpretability.
Contribution
The paper introduces KnowDR-REC, a REC benchmark built on real-world knowledge. It adds negative samples constructed through fine-grained expression editing and new evaluation metrics for assessing the reasoning capabilities of multimodal models.
Findings
Existing models struggle with knowledge-driven visual grounding.
Many models rely on shortcut correlations rather than genuine reasoning.
The benchmark reveals a gap between textual understanding and visual grounding.
Abstract
Referring Expression Comprehension (REC) is a popular multimodal task that aims to accurately detect target objects within a single image based on a given textual expression. However, due to the limitations of earlier models, traditional REC benchmarks either rely solely on intra-image cues or lack sufficiently fine-grained instance annotations, making them inadequate for evaluating the reasoning capabilities of Multimodal Large Language Models (MLLMs). To address this gap, we propose a new benchmark, KnowDR-REC, characterized by three key features. First, it is built upon real-world knowledge, requiring fine-grained multimodal reasoning across text and image. Second, the dataset includes carefully constructed negative samples, created via fine-grained expression editing, that are designed to evaluate a model's robustness and anti-hallucination ability. Lastly, we introduce three novel evaluation…
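To make the task setup concrete, here is a minimal sketch of how a REC prediction is commonly scored. It is not the paper's evaluation code, and it does not reproduce the benchmark's novel metrics, which are not specified in the text above. It assumes the conventional criterion that a prediction is correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5, and it assumes that a negative sample (an edited expression with no matching object) is represented by a ground truth of `None` and is counted correct only when the model also returns `None`. The names `iou`, `is_correct`, and `accuracy` are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# A box is (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred: Optional[Box], gt: Optional[Box], thr: float = 0.5) -> bool:
    """Assumed scoring rule: IoU >= thr for positives, rejection for negatives."""
    if gt is None:
        # Negative sample: the only correct answer is "no object".
        return pred is None
    if pred is None:
        # The model rejected an expression that does have a target.
        return False
    return iou(pred, gt) >= thr


def accuracy(
    preds: Sequence[Optional[Box]],
    gts: Sequence[Optional[Box]],
    thr: float = 0.5,
) -> float:
    """Fraction of samples judged correct under the assumed rule."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not gts:
        return 0.0
    return sum(is_correct(p, g, thr) for p, g in zip(preds, gts)) / len(gts)
```

Under this rule, a model that always outputs some box can score well on positive samples but fails every negative sample, which is the kind of hallucination behavior that negative samples are meant to expose.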
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning
