Yin and Yang: Balancing and Answering Binary Visual Questions
Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, Devi Parikh

TL;DR
This paper proposes a visual verification approach for binary visual question answering using abstract scenes, balancing datasets to reduce language priors and emphasizing high-level semantics to improve understanding.
Contribution
It introduces a concept-based verification method for binary VQA on abstract scenes and demonstrates dataset balancing to control language priors, enhancing model understanding.
Findings
The approach matches the state of the art on unbalanced datasets.
It outperforms existing methods on balanced datasets.
Balanced datasets reduce language bias in VQA.
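One way to see why balancing reduces language bias is to score a baseline that never looks at the image. The sketch below is a toy illustration, not the paper's evaluation protocol: the function name `question_only_accuracy`, the example data, and the assumption that "balanced" means each question occurs equally often with "yes" and "no" are all hypothetical.

```python
from collections import Counter, defaultdict

def question_only_accuracy(examples):
    """Accuracy of a baseline that answers each question with its most
    frequent answer and ignores the image entirely.

    examples: list of (question, answer) pairs (hypothetical toy format).
    """
    answers_by_question = defaultdict(Counter)
    for question, answer in examples:
        answers_by_question[question][answer] += 1
    # For each question, the baseline gets the majority-answer count right.
    correct = sum(c.most_common(1)[0][1] for c in answers_by_question.values())
    return correct / len(examples)

# Toy data: the same question asked about four different scenes.
q = "is the dog sleeping?"
unbalanced = [(q, "yes"), (q, "yes"), (q, "yes"), (q, "no")]
balanced = [(q, "yes"), (q, "yes"), (q, "no"), (q, "no")]

print(question_only_accuracy(unbalanced))  # 0.75
print(question_only_accuracy(balanced))    # 0.5
```

On the toy balanced set the image-blind baseline falls to chance, so any accuracy above 50% has to come from the visual content.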
Abstract
The complex compositional structure of language makes problems at the intersection of vision and language challenging. But language also provides a strong prior that can result in good superficial performance, without the underlying models truly understanding the visual content. This can hinder progress in pushing the state of the art in the computer vision aspects of multi-modal AI. In this paper, we address binary Visual Question Answering (VQA) on abstract scenes. We formulate this problem as visual verification of the concepts inquired about in the questions. Specifically, we convert the question to a tuple that concisely summarizes the visual concept to be detected in the image. If the concept can be found in the image, the answer to the question is "yes", and otherwise "no". Abstract scenes play two roles: (1) They allow us to focus on the high-level semantics of the VQA task as opposed to the…
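The verification formulation described in the abstract can be sketched in a few lines. This is a minimal toy, not the authors' model: the (primary, relation, secondary) tuple layout, the stopword-based parser, and the representation of a scene as a set of known facts are all assumptions made for illustration. The abstract only says that the question is converted to a tuple and that the answer is "yes" when the concept is found in the image.

```python
from typing import NamedTuple, Set

class ConceptTuple(NamedTuple):
    # Hypothetical tuple layout; the abstract does not specify the fields.
    primary: str
    relation: str
    secondary: str

# Hypothetical function words to drop when parsing a question.
STOPWORDS = {"is", "the", "a", "an", "there"}

def question_to_tuple(question: str) -> ConceptTuple:
    """Toy parser for questions shaped like 'Is the X <relation> the Y?'."""
    tokens = [t for t in question.lower().rstrip("?").split()
              if t not in STOPWORDS]
    if len(tokens) < 3:
        raise ValueError(f"cannot parse question: {question!r}")
    # First content word is the primary object, last is the secondary
    # object, and everything in between is treated as the relation.
    return ConceptTuple(tokens[0], " ".join(tokens[1:-1]), tokens[-1])

def answer(question: str, scene: Set[ConceptTuple]) -> str:
    """Answer 'yes' if the queried concept is present in the scene."""
    return "yes" if question_to_tuple(question) in scene else "no"

# A toy abstract scene, described directly by the concepts it contains.
scene = {ConceptTuple("dog", "next to", "tree")}

print(answer("Is the dog next to the tree?", scene))  # yes
print(answer("Is the cat next to the tree?", scene))  # no
```

Here the "detection" step is an exact set lookup; a real system would replace it with a learned model that scores how well the tuple matches the image.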
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
