Barlow constrained optimization for Visual Question Answering
Abhishek Jha, Badri N. Patro, Luc Van Gool, Tinne Tuytelaars

TL;DR
This paper introduces COB, a regularization method for VQA that reduces redundancy in the joint embedding space, improving accuracy and interpretability by disentangling semantic concepts.
Contribution
The paper proposes a novel regularizer, formulated as a constrained optimization and based on Barlow's theory, that increases the information content of the VQA joint embedding space.
Findings
Improves VQA accuracy by 1.4% on VQA-CP v2
Enhances interpretability of the model
Reduces redundancy in the joint embedding space
Abstract
Visual question answering is a vision-and-language multimodal task that aims at predicting answers given samples from the question and image modalities. Most recent methods focus on learning a good joint embedding space of images and questions, either by improving the interaction between these two modalities or by making it a more discriminant space. However, how informative this joint space is has not been well explored. In this paper, we propose a novel regularization for VQA models, Constrained Optimization using Barlow's theory (COB), that improves the information content of the joint space by minimizing the redundancy. It reduces the correlation between the learned feature components and thereby disentangles semantic concepts. Our model also aligns the joint space with the answer embedding space, where we consider the answer and image+question as two different 'views' of what in…
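As an illustration of what "reducing the correlation between the learned feature components" can look like in practice, here is a minimal sketch of a Barlow Twins-style redundancy-reduction term computed between two views of a batch (for example, a joint image+question embedding and an answer embedding). The exact form of the COB regularizer and its constraint is not given in the abstract, so this formula is an assumption. The function names and the `off_diag_weight` value are hypothetical, chosen only for this example.

```python
import math


def standardize(batch):
    """Return the batch with every feature column at zero mean and unit variance."""
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    # A constant column has zero std; fall back to 1.0 to avoid dividing by zero.
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in batch) / n) or 1.0
        for j in range(d)
    ]
    return [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in batch]


def cross_correlation(view_a, view_b):
    """Cross-correlation matrix C, with C[i][j] averaged over the batch."""
    a, b = standardize(view_a), standardize(view_b)
    n, d = len(a), len(a[0])
    return [
        [sum(a[k][i] * b[k][j] for k in range(n)) / n for j in range(d)]
        for i in range(d)
    ]


def redundancy_reduction_loss(view_a, view_b, off_diag_weight=0.005):
    """Assumed Barlow Twins-style loss: align matching features, decorrelate the rest."""
    c = cross_correlation(view_a, view_b)
    d = len(c)
    # Diagonal terms pull corresponding features of the two views together.
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(d))
    # Off-diagonal terms penalize correlation between different features.
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return on_diag + off_diag_weight * off_diag


# Two uncorrelated features versus two fully redundant (identical) features.
decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
redundant = [[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0], [-2.0, -2.0]]
loss_decorrelated = redundancy_reduction_loss(decorrelated, decorrelated)
loss_redundant = redundancy_reduction_loss(redundant, redundant)
```

In this toy setup the batch with two identical feature columns incurs a larger penalty than the batch with uncorrelated columns, which is the behavior a redundancy-minimizing regularizer is meant to encourage.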
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
