Interpretable Zero-shot Referring Expression Comprehension with Query-driven Scene Graphs

Yike Wu; Necva Bolucu; Stephen Wan; Dadong Wang; Jiahao Xia; Jian Zhang

arXiv:2603.25004·cs.CV·March 27, 2026

Open Access

TL;DR

This paper introduces SGREC, a novel zero-shot referring expression comprehension method that uses query-driven scene graphs and large language models to improve interpretability and accuracy in locating objects in images.

Contribution

SGREC leverages scene graphs as structured intermediaries to enhance visual understanding and interpretability in zero-shot REC tasks, bridging vision and language models effectively.

Findings

01. Achieves top-1 accuracy of 66.78% on RefCOCO val

02. Outperforms existing zero-shot REC methods on multiple benchmarks

03. Provides interpretable explanations for object localization decisions
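
For readers unfamiliar with how a top-1 accuracy figure like the one above is usually computed in REC, the following is a minimal sketch. It assumes the common convention that a predicted box counts as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5; this page does not state the threshold the paper uses, and the function names `iou` and `top1_accuracy` as well as the toy boxes are hypothetical.

```python
# Minimal sketch of an IoU-thresholded top-1 accuracy for REC.
# Assumption: boxes are (x1, y1, x2, y2) and a hit means IoU >= 0.5,
# a common convention that is NOT confirmed by this page.

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def top1_accuracy(predicted, ground_truth, threshold=0.5):
    """Fraction of queries whose single predicted box meets the IoU threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predicted, ground_truth))
    return hits / len(ground_truth)


# Toy data (hypothetical): one exact match, one near miss.
predicted = [(0, 0, 10, 10), (0, 0, 10, 10)]
ground_truth = [(0, 0, 10, 10), (8, 8, 20, 20)]
accuracy = top1_accuracy(predicted, ground_truth)
```

In this toy case the first prediction matches exactly and the second overlaps the target only slightly, so half of the queries count as correct.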

Abstract

Zero-shot referring expression comprehension (REC) aims to locate target objects in images given natural language queries without relying on task-specific training data, demanding strong visual understanding capabilities. Existing Vision-Language Models (VLMs), such as CLIP, commonly address zero-shot REC by directly measuring feature similarities between textual queries and image regions. However, these methods struggle to capture fine-grained visual details and understand complex object relationships. Meanwhile, although Large Language Models (LLMs) excel at high-level semantic reasoning, their inability to directly abstract visual features into textual semantics limits their application in REC tasks. To overcome these limitations, we propose SGREC, an interpretable zero-shot REC method leveraging query-driven scene graphs as structured intermediaries. Specifically, we first employ a…
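
The similarity-based baseline that the abstract contrasts SGREC with can be illustrated roughly as follows. This is a minimal sketch of the general idea only, not the SGREC method and not any specific CLIP API: it assumes the query and each candidate region have already been embedded into a shared vector space by some encoder, and the names `cosine_similarity` and `pick_region` and the toy vectors are hypothetical.

```python
import math

# Sketch of similarity-based zero-shot REC: score every candidate region
# against the query embedding and return the best-scoring region.
# Assumption: embeddings come from some shared text/image encoder
# (for example a CLIP-style model); here they are hand-written toy vectors.

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_region(query_vec, region_vecs):
    """Return (index of best region, list of per-region scores)."""
    scores = [cosine_similarity(query_vec, r) for r in region_vecs]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Toy embeddings (hypothetical): one query, three candidate regions.
query = [1.0, 0.0, 1.0]
regions = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.1, 0.9],  # nearly parallel to the query
    [0.5, 0.5, 0.0],  # partially aligned
]
best_index, scores = pick_region(query, regions)
```

Because each region is scored in isolation against the whole query, a ranking of this kind has no explicit notion of relations between objects, which is the limitation the abstract points to.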

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning