Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering
Min Peng, Chongyang Wang, Yuan Gao, Yu Shi, Xiang-Dong Zhou

TL;DR
This paper introduces the Temporal Pyramid Transformer (TPT), a VideoQA model with multimodal interaction that captures temporal and semantic information at multiple scales to improve question answering accuracy.
Contribution
The paper proposes the TPT model, whose Question-specific Transformer (QT) and Visual Inference (VI) modules use multi-scale temporal features and multimodal attention between the question and the visual content.
Findings
Outperforms state-of-the-art methods on three VideoQA datasets
Effectively models multi-scale temporal and semantic interactions
Demonstrates significant accuracy improvements
Abstract
Video question answering (VideoQA) is challenging because it combines visual understanding with natural language understanding. Existing approaches seldom leverage the appearance-motion information in the video at multiple temporal scales, and the interaction between the question and the visual information for textual semantics extraction is frequently ignored. Targeting these issues, this paper proposes a novel Temporal Pyramid Transformer (TPT) model with multimodal interaction for VideoQA. The TPT model comprises two modules, namely Question-specific Transformer (QT) and Visual Inference (VI). Given the temporal pyramid constructed from a video, QT builds the question semantics from the coarse-to-fine multimodal co-occurrence between each word and the visual content. Under the guidance of such question-specific semantics, VI infers the visual clues from the…
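To make the idea of a temporal pyramid and coarse-to-fine word-to-visual interaction concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The pyramid layout (level k holds 2**k mean-pooled segments), the single-head attention without learned projections, the residual update of the word vectors, and all function names (`build_temporal_pyramid`, `attend`, `question_semantics`) are assumptions made only for illustration.

```python
import math


def mean_pool(vectors):
    """Average a non-empty list of equal-length feature vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def build_temporal_pyramid(frames, num_levels):
    """Assumed layout: level k holds 2**k segment features (coarse to fine)."""
    n = len(frames)
    pyramid = []
    for level in range(num_levels):
        num_segments = 2 ** level
        if num_segments > n:
            raise ValueError("not enough frames for this many levels")
        # Integer bounds split the frames into contiguous, non-empty segments.
        bounds = [i * n // num_segments for i in range(num_segments + 1)]
        pyramid.append(
            [mean_pool(frames[bounds[i]:bounds[i + 1]]) for i in range(num_segments)]
        )
    return pyramid


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Scaled dot-product attention of one query over keys (values = keys)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    context = [
        sum(w * key[i] for w, key in zip(weights, keys)) for i in range(len(query))
    ]
    return context, weights


def question_semantics(words, pyramid):
    """Refine each word vector level by level with attended visual context.

    The residual addition is an illustrative assumption, not the paper's rule.
    """
    refined = [list(w) for w in words]
    for level_features in pyramid:  # coarse level first, fine level last
        refined = [
            [x + c for x, c in zip(w, attend(w, level_features)[0])]
            for w in refined
        ]
    return refined


# Toy usage: four 2-D frame features and two 2-D word embeddings.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0]]
pyramid = build_temporal_pyramid(frames, num_levels=3)
refined_words = question_semantics(words, pyramid)
```

In this toy setup the coarsest level summarizes the whole clip in one vector, while finer levels let each word attend to shorter segments. A real model would use learned projections, multiple heads, and separate appearance and motion features.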
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Softmax · Dropout · Dense Connections
