Doubly Right Object Recognition: A Why Prompt for Visual Rationales
Chengzhi Mao, Revant Teotia, Amrutha Sundar, Sachit Menon, Junfeng Yang, Xin Wang, Carl Vondrick

TL;DR
This paper introduces a benchmark for evaluating whether visual recognition models can produce correct rationales alongside their predictions, and proposes a method to improve rationale accuracy through a 'why prompt' that transfers language model rationales to visual models.
Contribution
The paper presents the 'doubly right' object recognition benchmark and a novel 'why prompt' method that enhances visual models' ability to generate correct rationales, improving interpretability.
Findings
State-of-the-art visual models, such as CLIP, often produce incorrect rationales for their categorical predictions.
Transferring language model rationales improves visual model explanations.
The 'why prompt' enhances zero-shot transfer to unseen tasks.
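One plausible way for a contrastive image-text model to emit a label together with a rationale is to score the image against text prompts that name both, then take the best match. The sketch below illustrates that idea only. The prompt template, the helper names (`build_prompts`, `predict_label_and_rationale`), and the toy embedding vectors are all assumptions made for illustration; they are not taken from the paper or from any CLIP library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def build_prompts(label_to_rationales,
                  template="a photo of a {label} because there is {rationale}"):
    """Expand every (label, rationale) pair into one text prompt.

    The template wording is hypothetical, chosen only for this example.
    """
    return [
        (label, rationale, template.format(label=label, rationale=rationale))
        for label, rationales in label_to_rationales.items()
        for rationale in rationales
    ]


def predict_label_and_rationale(image_embedding, prompt_embeddings):
    """Return the (label, rationale) whose prompt embedding best matches the image.

    prompt_embeddings is a list of ((label, rationale), vector) pairs. In a real
    system the vectors would come from the model's text encoder; here they are
    supplied by the caller.
    """
    best_pair, _ = max(
        prompt_embeddings,
        key=lambda item: cosine(image_embedding, item[1]),
    )
    return best_pair


# Toy usage with made-up two-dimensional embeddings.
prompts = build_prompts({"dog": ["fur"], "cat": ["whiskers"]})
toy_vectors = {("dog", "fur"): [0.9, 0.1], ("cat", "whiskers"): [0.1, 0.9]}
prompt_embeddings = [((label, rat), toy_vectors[(label, rat)]) for label, rat, _ in prompts]
prediction = predict_label_and_rationale([1.0, 0.0], prompt_embeddings)
```

Scoring the joint prompt means the label and the rationale are chosen together, so a model can pick the right label while attaching a rationale that does not hold for the image, which is the failure mode the findings describe.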
Abstract
Many visual recognition models are evaluated only on their classification accuracy, a metric for which they obtain strong performance. In this paper, we investigate whether computer vision models can also provide correct rationales for their predictions. We propose a "doubly right" object recognition benchmark, where the metric requires the model to simultaneously produce both the right labels as well as the right rationales. We find that state-of-the-art visual models, such as CLIP, often provide incorrect rationales for their categorical predictions. However, by transferring the rationales from language models into visual representations through a tailored dataset, we show that we can learn a "why prompt," which adapts large visual representations to produce correct rationales. Visualizations and empirical experiments show that our prompts significantly improve performance on…
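The abstract describes a metric that counts a prediction as correct only when both the label and the rationale are right. A minimal sketch of such a metric follows. It assumes each example carries one true label and a set of acceptable rationales, and that a rationale is judged by exact string membership in that set; the function name `doubly_right_breakdown` and the four outcome names are hypothetical, and the benchmark's actual matching procedure may differ.

```python
from collections import Counter

OUTCOMES = (
    "right_label_right_rationale",  # the "doubly right" case
    "right_label_wrong_rationale",
    "wrong_label_right_rationale",
    "wrong_label_wrong_rationale",
)


def doubly_right_breakdown(predictions, ground_truth):
    """Return the fraction of examples falling into each of the four outcomes.

    predictions:  list of (predicted_label, predicted_rationale)
    ground_truth: list of (true_label, set_of_valid_rationales)
    Exact string matching is an assumption made for this sketch.
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not predictions:
        raise ValueError("at least one example is required")

    counts = Counter()
    for (pred_label, pred_rationale), (true_label, valid) in zip(predictions, ground_truth):
        label_part = "right" if pred_label == true_label else "wrong"
        rationale_part = "right" if pred_rationale in valid else "wrong"
        counts[f"{label_part}_label_{rationale_part}_rationale"] += 1

    total = len(predictions)
    return {name: counts[name] / total for name in OUTCOMES}


# Toy usage: one example per outcome.
predictions = [
    ("dog", "four legs"),
    ("dog", "wings"),
    ("cat", "whiskers"),
    ("bird", "feathers"),
]
ground_truth = [
    ("dog", {"four legs", "fur"}),
    ("dog", {"four legs"}),
    ("dog", {"whiskers"}),
    ("car", {"wheels"}),
]
breakdown = doubly_right_breakdown(predictions, ground_truth)
```

Plain classification accuracy is the sum of the two right-label fractions, so the gap between accuracy and the doubly right fraction shows how often a correct label is paired with an incorrect rationale.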
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
