ATM: Action Temporality Modeling for Video Question Answering

Junwen Chen; Jie Zhu; Yu Kong

arXiv:2309.02290·cs.CV·September 6, 2023

TL;DR

This paper introduces Action Temporality Modeling (ATM), which targets temporal and causal reasoning in VideoQA by improving motion representations and training action-centric visual-text embeddings. The authors report higher accuracy and more faithful temporal reasoning than previous methods.

Contribution

ATM rethinks optical flow for long-term temporality, uses contrastive learning for better action embeddings, and employs shuffled video fine-tuning to ensure faithful temporal reasoning.
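As a rough illustration of the shuffled-video idea, the sketch below only shows how a frame-shuffled copy of a clip could be produced. It is not the authors' implementation: the function name `shuffle_frames`, the `(T, ...)` array layout, and the toy input are assumptions, and this summary does not specify how the model is penalized for answering from the shuffled clip.

```python
import numpy as np


def shuffle_frames(video, rng):
    """Return a copy of `video` with frames in a random, non-identity order.

    `video` is assumed to have shape (T, ...), where T is the frame count.
    (Hypothetical helper for illustration; not taken from the paper.)
    """
    num_frames = video.shape[0]
    if num_frames < 2:
        # Nothing to shuffle for an empty or single-frame clip.
        return video.copy()
    identity = np.arange(num_frames)
    perm = rng.permutation(num_frames)
    # Redraw until the order actually changes, so temporal cues are destroyed.
    while np.array_equal(perm, identity):
        perm = rng.permutation(num_frames)
    return video[perm]


rng = np.random.default_rng(0)
video = np.arange(8).reshape(8, 1)  # 8 toy "frames", one value each
shuffled = shuffle_frames(video, rng)
```

The shuffled clip keeps the same appearance content as the original but loses its frame order, which is why it can serve as a probe for whether an answer depends on temporality or only on appearance.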

Findings

01. ATM outperforms previous methods in VideoQA accuracy
02. ATM demonstrates improved true temporality reasoning
03. Contrastive learning enhances action representation quality

Abstract

Despite significant progress in video question answering (VideoQA), existing methods fall short of questions that require causal/temporal reasoning across frames. This can be attributed to imprecise motion representations. We introduce Action Temporality Modeling (ATM) for temporality reasoning via three-fold uniqueness: (1) rethinking the optical flow and realizing that optical flow is effective in capturing the long horizon temporality reasoning; (2) training the visual-text embedding by contrastive learning in an action-centric manner, leading to better action representations in both vision and text modalities; and (3) preventing the model from answering the question given the shuffled video in the fine-tuning stage, to avoid spurious correlation between appearance and motion and hence ensure faithful temporality reasoning. In the experiments, we show that ATM outperforms previous…
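To make point (2) more concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss between paired video and text embeddings. This is a generic formulation, not necessarily the exact objective used in ATM: the function name `info_nce`, the temperature value, and the toy embeddings are assumptions, and the action-centric construction of the pairs is not shown.

```python
import numpy as np


def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    Row i of `video_emb` and row i of `text_emb` are treated as the positive
    pair; every other row in the batch acts as a negative.
    (Generic sketch; the temperature of 0.07 is an assumed default.)
    """
    # L2-normalize so that dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i.
        return -np.mean(np.diag(log_probs))

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


emb = np.eye(4)  # four toy embeddings, one per action
aligned_loss = info_nce(emb, emb)
mismatched_loss = info_nce(emb, np.roll(emb, 1, axis=0))
```

In this toy setup the aligned pairs yield a much smaller loss than the mismatched ones, which is the behavior a contrastive objective rewards: matching video and text embeddings are pulled together while non-matching ones are pushed apart.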

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Learning