Compositional Memory for Visual Question Answering
Aiwen Jiang, Fang Wang, Fatih Porikli, Yi Li

TL;DR
This paper introduces a novel compositional memory model for visual question answering that explicitly models the interaction between language and local image features over time, leading to improved accuracy.
Contribution
It proposes a dynamic episodic memory approach that fuses language and local visual features through attention, enhancing VQA performance beyond existing methods.
Findings
Achieved a 6% improvement on the DAQUAR dataset
Outperformed the state of the art on MSCOCO-VQA
Demonstrated effectiveness of explicit local feature modeling
Abstract
Visual Question Answering (VQA) has recently emerged as one of the most fascinating topics in computer vision. Many state-of-the-art methods naively feed holistic visual features together with language features into a Long Short-Term Memory (LSTM) module, neglecting the sophisticated interaction between them. This coarse modeling also precludes exploring finer-grained local features that contribute to question answering dynamically over time. This paper addresses this fundamental problem by directly modeling the temporal dynamics between language and all possible local image patches. When traversing the question words sequentially, our end-to-end approach explicitly fuses the features associated with the words and the ones available at multiple local patches in an attention mechanism, and further combines the fused information to generate dynamic messages, which we call episodes.…
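The word-by-word fusion described in the abstract can be illustrated with a small sketch. This is not the authors' architecture: it assumes plain dot-product attention, word and patch features of the same dimension, and no learned parameters or LSTM. The function names (`softmax`, `attend`, `build_episodes`) and the dictionary layout of an episode are hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(word_vec, patch_feats):
    """Score every local patch against one word and fuse the patches.

    Assumption: relevance is a plain dot product between the word feature
    and each patch feature (the paper's actual scoring may differ).
    """
    scores = [sum(w * p for w, p in zip(word_vec, patch)) for patch in patch_feats]
    weights = softmax(scores)
    dim = len(patch_feats[0])
    # Attention-weighted sum of patch features: the fused visual message.
    fused = [
        sum(wt * patch[i] for wt, patch in zip(weights, patch_feats))
        for i in range(dim)
    ]
    return weights, fused


def build_episodes(word_vecs, patch_feats):
    """Traverse the question words in order, producing one message per word."""
    episodes = []
    for word_vec in word_vecs:
        weights, fused = attend(word_vec, patch_feats)
        episodes.append({"weights": weights, "message": fused})
    return episodes


# Toy usage: two question words, two local patches, 2-D features.
patches = [[1.0, 0.0], [0.0, 1.0]]
words = [[2.0, 0.0], [0.0, 2.0]]
episodes = build_episodes(words, patches)
```

In this toy setup each word places most of its attention on the patch it aligns with, so the sequence of messages changes as the question is read. That is the intuition behind attending to local patches over time rather than using a single holistic image feature.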
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
