Compositional Memory for Visual Question Answering

Aiwen Jiang; Fang Wang; Fatih Porikli; Yi Li

arXiv:1511.05676 · cs.CV · November 19, 2015 · 37 cites

TL;DR

This paper introduces a novel compositional memory model for visual question answering that explicitly models the interaction between language and local image features over time, leading to improved accuracy.

Contribution

It proposes a dynamic episodic memory approach that fuses language and local visual features through attention, enhancing VQA performance beyond existing methods.

Findings

01

Achieved a 6% improvement on the DAQUAR dataset

02

Outperformed state-of-the-art on MSCOCO-VQA

03

Demonstrated effectiveness of explicit local feature modeling

Abstract

Visual Question Answering (VQA) has recently emerged as one of the most fascinating topics in computer vision. Many state-of-the-art methods naively feed holistic visual features together with language features into a Long Short-Term Memory (LSTM) module, neglecting the sophisticated interaction between them. This coarse modeling also prevents the exploration of finer-grained local features that contribute to question answering dynamically over time. This paper addresses this fundamental problem by directly modeling the temporal dynamics between language and all possible local image patches. When traversing the question words sequentially, our end-to-end approach explicitly fuses the features associated with the words and those available at multiple local patches in an attention mechanism, and further combines the fused information to generate dynamic messages, which we call episodes.…
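
The word-by-word fusion described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the bilinear scoring matrix `W`, the softmax attention, the feature dimensions, and the function name `episodes_for_question` are all assumptions chosen for illustration, and the sketch omits the recurrent state and the downstream question answering module. It only shows the general idea of attending over local patch features once per question word and emitting one fused vector (an "episode") per word.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def episodes_for_question(word_feats, patch_feats, W):
    """Attend over image patches once per question word.

    word_feats:  (T, d_w) array, one feature vector per question word.
    patch_feats: (P, d_v) array, one feature vector per local image patch.
    W:           (d_w, d_v) hypothetical bilinear scoring matrix (an assumption).

    Returns (episodes, attention) with shapes (T, d_v) and (T, P).
    """
    episodes, attention = [], []
    for w in word_feats:                      # traverse the question words in order
        scores = patch_feats @ (W.T @ w)      # (P,) relevance of each patch to this word
        a = softmax(scores)                   # attention weights over patches
        episodes.append(a @ patch_feats)      # attention-weighted patch summary
        attention.append(a)
    return np.stack(episodes), np.stack(attention)


# Toy inputs: 5 words, 9 patches, random features (illustrative sizes only).
rng = np.random.default_rng(0)
T, P, d_w, d_v = 5, 9, 8, 16
word_feats = rng.standard_normal((T, d_w))
patch_feats = rng.standard_normal((P, d_v))
W = rng.standard_normal((d_w, d_v)) * 0.1

episodes, attention = episodes_for_question(word_feats, patch_feats, W)
```

Each row of `attention` is a distribution over the patches for one word, and each row of `episodes` is the corresponding weighted patch summary. In a full model these per-word vectors would presumably be passed on to a sequence model along with the language features.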

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques