TL;DR
This paper introduces a differential attention mechanism for visual question answering that leverages exemplars to better mimic human attention, leading to improved accuracy on benchmark datasets.
Contribution
It proposes an exemplar-based differential attention method that aligns more closely with human focus, enhancing VQA performance over traditional image-based attention approaches.
Findings
Outperforms existing image-based attention methods.
Achieves results competitive with state-of-the-art models.
Improves question-answering accuracy on benchmark datasets.
Abstract
In this paper we aim to answer questions about images, given a training set of question-answer pairs for a number of images. Several methods address this problem with image-based attention, that is, by focusing on a specific part of the image while answering the question. Humans do the same when solving this problem. However, the regions that previous systems focus on are not correlated with the regions that humans focus on, and this limits accuracy. We propose to solve this problem with an exemplar-based method: we obtain one or more supporting and opposing exemplars and use them to derive a differential attention region. This differential attention is closer to human attention than other image-based attention methods, and it also improves accuracy when answering questions. The method is…
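The abstract does not say how the supporting and opposing exemplars shape the attention region. The sketch below is one possible reading, not the authors' implementation. It assumes that attention is a softmax over question-region dot products, and that a triplet-style margin loss pulls the target's attended feature toward the supporting exemplar and pushes it away from the opposing one. The function names, the margin value, and the toy feature vectors are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(regions, question):
    """Return (attention weights, attended feature) for one image.

    regions: list of region feature vectors; question: one feature vector.
    Assumption: relevance is a plain dot product between region and question.
    """
    scores = [sum(r_d * q_d for r_d, q_d in zip(r, question)) for r in regions]
    weights = softmax(scores)
    attended = [sum(w * r[d] for w, r in zip(weights, regions))
                for d in range(len(question))]
    return weights, attended

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(target, supporting, opposing, margin=1.0):
    """Hinge loss: zero once the target is closer to the supporting
    exemplar than to the opposing one by at least `margin` (assumed form)."""
    return max(0.0, sq_dist(target, supporting) - sq_dist(target, opposing) + margin)

# Toy example: two regions per image, 2-D features (made-up numbers).
question = [1.0, 0.0]
_, target_feat = attend([[2.0, 0.0], [0.0, 2.0]], question)
_, support_feat = attend([[2.0, 0.1], [0.0, 1.0]], question)
_, oppose_feat = attend([[0.0, 2.0], [0.1, 0.0]], question)
loss = triplet_loss(target_feat, support_feat, oppose_feat)
```

Under these assumptions, minimizing `loss` over the attention parameters would move the target image's attended feature toward that of the supporting exemplar, which is one way a "differential" signal could influence where the model looks.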
