CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions
Runtao Liu, Chenxi Liu, Yutong Bai, Alan Yuille

TL;DR
This paper introduces CLEVR-Ref+, a synthetic dataset for diagnosing visual reasoning in referring expression tasks, and proposes IEP-Ref, a modular network that exposes its intermediate reasoning steps and predicts no foreground for false-premise expressions.
Contribution
The paper presents a new diagnostic dataset CLEVR-Ref+ and a modular network IEP-Ref that improves interpretability and robustness in referring expression comprehension.
Findings
IEP-Ref outperforms other models on CLEVR-Ref+
The segmentation module can be attached to any intermediate module to reveal the reasoning process step by step
IEP-Ref correctly predicts no-foreground for false-premise expressions
Abstract
Referring object detection and referring image segmentation are important tasks that require joint understanding of visual information and natural language. Yet there has been evidence that current benchmark datasets suffer from bias, and current state-of-the-art models cannot be easily evaluated on their intermediate reasoning process. To address these issues and complement similar efforts in visual question answering, we build CLEVR-Ref+, a synthetic diagnostic dataset for referring expression comprehension. The precise locations and attributes of the objects are readily available, and the referring expressions are automatically associated with functional programs. The synthetic nature allows control over dataset bias (through sampling strategy), and the modular programs enable intermediate reasoning ground truth without human annotators. In addition to evaluating several…
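As a minimal sketch of how a functional program can supply intermediate reasoning ground truth, the toy executor below runs a chain of filter steps over a hand-written scene and records which objects remain after each step. The scene, the step names (`scene`, `filter_color`, `filter_shape`), and `run_program` are illustrative assumptions, not the dataset's actual function catalog or the authors' code.

```python
from typing import Any, Dict, List, Tuple

Obj = Dict[str, Any]
# A program is a chain of (function_name, argument) steps (hypothetical format).
Program = List[Tuple[str, str]]

# Hypothetical toy scene; a real scene would also carry object locations.
SCENE: List[Obj] = [
    {"id": 0, "color": "red", "shape": "cube", "size": "large"},
    {"id": 1, "color": "blue", "shape": "sphere", "size": "small"},
    {"id": 2, "color": "red", "shape": "sphere", "size": "small"},
]


def run_program(program: Program, scene: List[Obj]) -> List[List[int]]:
    """Execute each step in order and record the object ids selected after it."""
    current = list(scene)
    trace: List[List[int]] = []
    for name, arg in program:
        if name == "scene":
            # Start (or restart) from the full set of objects.
            current = list(scene)
        elif name.startswith("filter_"):
            # "filter_color" keeps objects whose "color" equals the argument.
            attr = name[len("filter_"):]
            current = [o for o in current if o[attr] == arg]
        else:
            raise ValueError(f"unknown function: {name}")
        trace.append([o["id"] for o in current])
    return trace


# "The red sphere": every step yields a checkable intermediate result.
red_sphere: Program = [
    ("scene", ""),
    ("filter_color", "red"),
    ("filter_shape", "sphere"),
]
trace = run_program(red_sphere, SCENE)

# A false-premise expression ("the green cube") refers to nothing in the scene,
# so the final selection is empty, which corresponds to a no-foreground target.
green_cube: Program = [
    ("scene", ""),
    ("filter_color", "green"),
    ("filter_shape", "cube"),
]
fp_trace = run_program(green_cube, SCENE)

print(trace)
print(fp_trace)
```

Because the scene attributes are known exactly, each entry of the trace can serve as ground truth for the corresponding reasoning step without human annotation.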
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
