MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

Difei Gao; Luowei Zhou; Lei Ji; Linchao Zhu; Yi Yang; Mike Zheng Shou

arXiv:2212.09522 · cs.CV · December 20, 2022 · 5 cites

TL;DR

MIST is a novel multi-modal transformer model designed for long-form VideoQA, efficiently selecting relevant frames and regions for multi-event reasoning, achieving state-of-the-art results with improved interpretability.

Contribution

The paper introduces MIST, a new model that adaptively selects relevant spatial-temporal segments for long-form VideoQA, addressing computational challenges and enhancing reasoning capabilities.

Findings

01. Achieves state-of-the-art performance on four VideoQA datasets.

02. Demonstrates superior computational efficiency compared to dense sampling methods.

03. Provides improved interpretability through selective attention mechanisms.

Abstract

To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, seeking answers from long-form videos with diverse and complex events is a must. Existing multi-modal VQA models achieve promising performance on images or short video clips, especially with the recent success of large-scale multi-modal pre-training. However, when extending these methods to long-form videos, new challenges arise. On the one hand, using a dense video sampling strategy is computationally prohibitive. On the other hand, methods relying on sparse sampling struggle in scenarios where multi-event and multi-granularity visual reasoning are required. In this work, we introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA. Specifically, MIST decomposes traditional dense spatial-temporal…
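
The selection idea described above (keeping only the video segments relevant to the question rather than attending densely over every frame) can be illustrated with a toy sketch. This is not the MIST implementation: the function names `cosine` and `select_segments`, the use of cosine similarity, and the hard top-k rule are all assumptions made for illustration, and the feature vectors are made-up numbers standing in for embeddings from a pre-trained encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_segments(question_vec, segment_vecs, k):
    """Return the indices of the k segments most similar to the question.

    Hypothetical helper: scores each segment against the question,
    keeps the k best, and returns them in temporal order.
    """
    scores = [cosine(question_vec, seg) for seg in segment_vecs]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Toy features (assumed values, not real embeddings).
question = [1.0, 0.0]
segments = [
    [0.9, 0.1],   # segment 0: close to the question
    [0.0, 1.0],   # segment 1: orthogonal
    [1.0, 0.0],   # segment 2: identical direction
    [-1.0, 0.0],  # segment 3: opposite direction
]

selected = select_segments(question, segments, k=2)
print(selected)  # [0, 2]
```

Only the selected segments would then be passed to the more expensive attention layers, which is where the savings over dense sampling come from. A learned, differentiable selection would replace the hard top-k here in any trainable model.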

Code & Models

Repositories

showlab/mist
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Byte Pair Encoding · Dense Connections · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Absolute Position Encodings