TL;DR
This paper introduces SOrT, a contrastive gradient learning method that enhances VQA models' consistency by better understanding and ranking relevant sub-questions, leading to improved reasoning accuracy and visual grounding.
Contribution
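The paper contributes two things: a gradient-based interpretability approach for identifying the sub-questions most relevant to a reasoning question, and Sub-question Oriented Tuning (SOrT), a contrastive gradient learning approach that improves VQA model consistency and sub-question relevance understanding.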
Findings
SOrT improves model consistency by up to 6.5 percentage points.
SOrT enhances visual grounding accuracy.
Gradient-based interpretability helps evaluate sub-question relevance.
Abstract
Recent research in Visual Question Answering (VQA) has revealed state-of-the-art models to be inconsistent in their understanding of the world -- they answer seemingly difficult questions requiring reasoning correctly but get simpler associated sub-questions wrong. These sub-questions pertain to lower level visual concepts in the image that models ideally should understand to be able to answer the higher level question correctly. To address this, we first present a gradient-based interpretability approach to determine the questions most strongly correlated with the reasoning question on an image, and use this to evaluate VQA models on their ability to identify the relevant sub-questions needed to answer a reasoning question. Next, we propose a contrastive gradient learning based approach called Sub-question Oriented Tuning (SOrT) which encourages models to rank relevant sub-questions…
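The ranking idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how gradient vectors are compared or which loss is used, so everything here is an assumption made for illustration: cosine similarity as the relevance score, a hinge-style margin loss as the contrastive objective, and the hypothetical names `cosine_similarity`, `rank_sub_questions`, and `margin_ranking_loss`. The gradient vectors are plain lists of floats standing in for whatever gradient-based representation the model produces for each question.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sub_questions(
    reasoning_grad: Sequence[float],
    candidate_grads: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Sort candidate questions by similarity of their gradient vector to the
    reasoning question's gradient vector, most similar first.
    Assumption: similarity of gradient vectors is used as a proxy for relevance."""
    scored = [
        (question, cosine_similarity(reasoning_grad, grad))
        for question, grad in candidate_grads.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def margin_ranking_loss(
    reasoning_grad: Sequence[float],
    relevant_grad: Sequence[float],
    irrelevant_grad: Sequence[float],
    margin: float = 0.1,
) -> float:
    """Hinge-style contrastive loss (an assumed form, not taken from the paper):
    zero when the relevant sub-question scores at least `margin` above the
    irrelevant one, positive otherwise."""
    pos = cosine_similarity(reasoning_grad, relevant_grad)
    neg = cosine_similarity(reasoning_grad, irrelevant_grad)
    return max(0.0, margin - (pos - neg))


if __name__ == "__main__":
    # Toy vectors; in practice these would come from the VQA model.
    reasoning = [1.0, 0.0]
    candidates = {
        "relevant sub-question": [0.9, 0.1],
        "irrelevant question": [0.0, 1.0],
    }
    for question, score in rank_sub_questions(reasoning, candidates):
        print(f"{score:.3f}  {question}")
```

In this sketch, training would push the loss toward zero so that relevant sub-questions end up ranked above irrelevant questions for the same image; how the paper actually computes and optimizes this is not recoverable from the truncated abstract.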
Taxonomy
Methods: Interpretability
