Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering

Min Peng; Chongyang Wang; Yuan Gao; Yu Shi; Xiang-Dong Zhou

arXiv:2109.04735 · cs.CV · September 13, 2021 · 5 cites

TL;DR

This paper introduces the Temporal Pyramid Transformer (TPT), a VideoQA model with multimodal interaction that captures multi-scale temporal and semantic information to improve question answering accuracy.

Contribution

The paper proposes the TPT model, which pairs a Question-specific Transformer (QT) module with a Visual Inference (VI) module. Both modules use multi-scale temporal features and multimodal attention to improve VideoQA performance.

Findings

01 Outperforms state-of-the-art methods on three VideoQA datasets

02 Effectively models multi-scale temporal and semantic interactions

03 Demonstrates significant accuracy improvements

Abstract

Video question answering (VideoQA) is challenging given its multimodal combination of visual understanding and natural language understanding. While existing approaches seldom leverage the appearance-motion information in the video at multiple temporal scales, the interaction between the question and the visual information for textual semantics extraction is frequently ignored. Targeting these issues, this paper proposes a novel Temporal Pyramid Transformer (TPT) model with multimodal interaction for VideoQA. The TPT model comprises two modules, namely Question-specific Transformer (QT) and Visual Inference (VI). Given the temporal pyramid constructed from a video, QT builds the question semantics from the coarse-to-fine multimodal co-occurrence between each word and the visual content. Under the guidance of such question-specific semantics, VI infers the visual clues from the…
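
The abstract does not spell out how the temporal pyramid is constructed. As a rough illustration only, one simple way to obtain coarse-to-fine temporal scales is to split a clip into 1, 2, 4, ... contiguous segments and average the per-frame features within each segment. The function names, the doubling scheme, and the use of mean pooling below are all assumptions made for this sketch; they are not taken from the paper or its repository.

```python
from typing import List

Vector = List[float]


def mean_pool(frames: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length feature vectors."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def build_temporal_pyramid(frames: List[Vector], num_levels: int) -> List[List[Vector]]:
    """Hypothetical pyramid: level k splits the clip into 2**k segments.

    Level 0 is the coarsest view (the whole clip pooled into one vector);
    each further level doubles the number of segments. This is an
    illustrative assumption, not the authors' construction.
    """
    pyramid = []
    n = len(frames)
    for level in range(num_levels):
        num_segments = 2 ** level
        if num_segments > n:
            raise ValueError("too many levels for this clip length")
        # Integer boundaries that partition the n frames into non-empty segments.
        bounds = [i * n // num_segments for i in range(num_segments + 1)]
        pyramid.append(
            [mean_pool(frames[bounds[i]:bounds[i + 1]]) for i in range(num_segments)]
        )
    return pyramid


# Four frames with one-dimensional toy features.
toy_frames = [[0.0], [1.0], [2.0], [3.0]]
toy_pyramid = build_temporal_pyramid(toy_frames, num_levels=3)
```

In this toy setup the three levels contain 1, 2, and 4 pooled vectors respectively. A model in the spirit of the abstract could then attend from each question word to the visual features at every level, starting from the coarsest.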

Code & Models

Repositories

trunpm/tpt-for-videoqa (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Softmax · Dropout · Dense Connections