Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering

Chenyou Fan; Xiaofan Zhang; Shu Zhang; Wensheng Wang; Chi Zhang; Heng Huang

arXiv:1904.04357 · cs.CV · April 10, 2019 · 35 citations

TL;DR

This paper introduces an end-to-end VideoQA framework that combines a heterogeneous memory over appearance and motion features, a question memory, and multi-step multimodal attention, and reports state-of-the-art results on four VideoQA datasets.

Contribution

The paper presents a new end-to-end VideoQA model with heterogeneous memory, question memory, and multi-step reasoning for improved multimodal understanding.

Findings

1. Achieves state-of-the-art performance on four VideoQA datasets.
2. Effectively learns global context from appearance and motion features.
3. Enables iterative refinement of multimodal attention for better reasoning.

Abstract

In this paper, we propose a novel end-to-end trainable Video Question Answering (VideoQA) framework with three major components: 1) a new heterogeneous memory which can effectively learn global context information from appearance and motion features; 2) a redesigned question memory which helps understand the complex semantics of the question and highlights the queried subjects; and 3) a new multimodal fusion layer which performs multi-step reasoning by attending to relevant visual and textual hints with self-updated attention. Our VideoQA model first generates global context-aware visual and textual features by letting the current inputs interact with the memory contents. It then performs attentional fusion of the multimodal visual and textual representations to infer the correct answer. Multiple cycles of reasoning can be made to iteratively refine attention weights of the…
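
The multi-step reasoning idea in the abstract (attend to visual and textual features, update a state, then attend again with the updated state) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the names `attend` and `multi_step_reasoning`, the dot-product scoring, the zero initial state, and the fixed-weight averaging update are all assumptions chosen for brevity, and the memory components are left out entirely.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, items):
    """Dot-product attention: return (weights, weighted sum of items)."""
    scores = [sum(q * x for q, x in zip(query, item)) for item in items]
    weights = softmax(scores)
    dim = len(items[0])
    context = [sum(w * item[d] for w, item in zip(weights, items))
               for d in range(dim)]
    return weights, context


def multi_step_reasoning(visual, textual, steps=3):
    """Refine attention over both modalities for a fixed number of steps.

    Hypothetical update rule (an assumption, not taken from the paper):
    the new state is a fixed-weight average of the old state and the two
    attended contexts.
    """
    dim = len(visual[0])
    state = [0.0] * dim  # assumed initial state; gives uniform attention
    history = []
    for _ in range(steps):
        v_weights, v_ctx = attend(state, visual)
        t_weights, t_ctx = attend(state, textual)
        history.append((v_weights, t_weights))
        state = [0.5 * s + 0.25 * v + 0.25 * t
                 for s, v, t in zip(state, v_ctx, t_ctx)]
    return state, history


# Toy inputs: two visual features and two textual features of dimension 2.
visual = [[1.0, 0.0], [0.0, 1.0]]
textual = [[1.0, 0.0], [1.0, 0.0]]
state, history = multi_step_reasoning(visual, textual, steps=3)
```

In this toy setup the first step attends uniformly because the state starts at zero. Later steps shift visual attention toward the feature that agrees with the textual features, which is the self-updating behaviour the abstract describes in prose.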

Code & Models

Repositories

fanchenyou/HME-VideoQA (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques