TL;DR
This paper introduces a weakly-supervised grounding method for visual question answering using a capsule module that improves object localization based on question cues, without relying on bounding box annotations.
Contribution
The authors propose a novel capsule-based module with query-based selection for weakly-supervised grounding in VQA, enhancing existing systems' ability to localize relevant objects.
Findings
Improved grounding accuracy on CLEVR-Answers and GQA datasets.
Comparable VQA performance with enhanced grounding capabilities.
Effective integration of capsule module into existing VQA architectures.
Abstract
The problem of grounding VQA tasks has seen increased attention in the research community recently, with most attempts focusing on solving this task by using pre-trained object detectors. However, pre-trained object detectors require bounding box annotations for detecting relevant objects in the vocabulary, which may not always be feasible for real-life large-scale applications. In this paper, we focus on a more relaxed setting: the grounding of relevant visual entities in a weakly supervised manner by training on the VQA task alone. To address this problem, we propose a visual capsule module with a query-based selection mechanism of capsule features that allows the model to focus on relevant regions based on the textual cues about visual information in the question. We show that integrating the proposed capsule module in existing VQA systems significantly improves their…
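The query-based selection idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes, purely for illustration, that each spatial position holds a fixed set of capsule vectors, that a hypothetical learned projection `proj` turns the question embedding into one score per capsule type, and that a softmax over those scores reweights the capsules. The names `select_capsules`, `grounding_map`, and `proj` are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_capsules(capsules, query, proj):
    """Reweight capsule features using the question embedding.

    capsules: list over spatial positions; each position is a list of
              C capsule vectors, each a list of D floats.
    query:    question embedding, a list of Q floats.
    proj:     C x Q matrix (assumed, hypothetical learned projection)
              giving one relevance score per capsule type.
    Returns (weights, masked_capsules).
    """
    # One score per capsule type: dot product of a projection row with the query.
    scores = [sum(w * q for w, q in zip(row, query)) for row in proj]
    weights = softmax(scores)
    # Scale every capsule vector by the weight of its capsule type.
    masked = [
        [[weights[c] * v for v in cap] for c, cap in enumerate(position)]
        for position in capsules
    ]
    return weights, masked


def grounding_map(masked):
    """Per-position relevance: sum of capsule vector norms (an assumed proxy)."""
    return [
        sum(math.sqrt(sum(v * v for v in cap)) for cap in position)
        for position in masked
    ]


# Toy input: two positions, two capsule types, 2-D capsule vectors.
capsules = [
    [[3.0, 4.0], [0.0, 0.0]],  # position 0 activates capsule type 0
    [[0.0, 0.0], [3.0, 4.0]],  # position 1 activates capsule type 1
]
query = [1.0, 0.0]                 # toy question embedding favouring type 0
proj = [[2.0, 0.0], [0.0, 2.0]]    # hypothetical projection weights

weights, masked = select_capsules(capsules, query, proj)
heat = grounding_map(masked)
```

In this toy case the question favours capsule type 0, so position 0 receives a higher score in `heat` than position 1. The point of the sketch is only that grounding can emerge from question-conditioned feature selection, without any bounding box supervision entering the computation.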
