ATM: Action Temporality Modeling for Video Question Answering
Junwen Chen, Jie Zhu, Yu Kong

TL;DR
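This paper introduces Action Temporality Modeling (ATM), an approach to temporal and causal reasoning in VideoQA that combines optical-flow-based motion representations, action-centric contrastive training of the visual-text embedding, and fine-tuning against shuffled videos. The authors report higher accuracy and more faithful temporal reasoning than previous methods.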
Contribution
ATM rethinks optical flow for long-term temporality, uses contrastive learning for better action embeddings, and employs shuffled video fine-tuning to ensure faithful temporal reasoning.
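The shuffled-video idea can be illustrated with a small sketch. This is not the authors' implementation: the function names `shuffle_frames` and `shuffled_answer_penalty`, the array layout (frames on axis 0), and the specific penalty form are all assumptions made for illustration. The sketch only shows the general pattern of destroying temporal order and then discouraging the model from still giving the correct answer.

```python
import numpy as np

def shuffle_frames(video, rng):
    """Return a copy of `video` with its frames (axis 0) in a changed order.

    Hypothetical helper: `video` is assumed to be an array shaped
    (num_frames, ...). The permutation is redrawn until it differs from
    the identity, so the temporal order is always disturbed.
    """
    n = video.shape[0]
    if n < 2:
        raise ValueError("need at least two frames to change the order")
    identity = np.arange(n)
    perm = rng.permutation(n)
    while np.array_equal(perm, identity):
        perm = rng.permutation(n)
    return video[perm], perm

def shuffled_answer_penalty(probs_shuffled, answer_idx):
    """Hypothetical penalty (an assumption, not the paper's loss).

    It is large when the model still assigns high probability to the
    correct answer after the frames were shuffled, i.e. when the answer
    did not actually depend on temporal order.
    """
    p = np.clip(probs_shuffled[answer_idx], 1e-12, 1.0 - 1e-12)
    return -np.log(1.0 - p)

# Usage: 8 dummy "frames" with 2 features each.
rng = np.random.default_rng(0)
video = np.arange(16).reshape(8, 2)
shuffled, perm = shuffle_frames(video, rng)
```

In a training loop, a term like this would be added to the usual answer loss on the unshuffled clip, so that only the correctly ordered video supports the answer.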
Findings
ATM outperforms previous methods in VideoQA accuracy
ATM demonstrates more faithful temporality reasoning
Contrastive learning enhances action representation quality
Abstract
Despite significant progress in video question answering (VideoQA), existing methods fall short on questions that require causal/temporal reasoning across frames. This can be attributed to imprecise motion representations. We introduce Action Temporality Modeling (ATM) for temporality reasoning via three-fold uniqueness: (1) rethinking the optical flow and realizing that optical flow is effective in capturing long-horizon temporality; (2) training the visual-text embedding by contrastive learning in an action-centric manner, leading to better action representations in both vision and text modalities; and (3) preventing the model from answering the question given the shuffled video in the fine-tuning stage, to avoid spurious correlation between appearance and motion and hence ensure faithful temporality reasoning. In the experiments, we show that ATM outperforms previous…
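Point (2) of the abstract can be made concrete with a generic contrastive objective. The abstract does not state the exact loss, so the sketch below uses a symmetric InfoNCE loss, a common choice for visual-text embedding training, as a stand-in. The function name `symmetric_info_nce`, the temperature value, and the assumption that row i of the text batch describes the action in row i of the video batch are all illustrative, not taken from the paper.

```python
import numpy as np

def _log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    Assumption: video_emb[i] and text_emb[i] form a positive pair (for
    example a clip and a phrase describing its action); every other
    pairing in the batch is treated as a negative.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature              # (batch, batch) cosine similarities
    idx = np.arange(len(v))
    loss_v2t = -_log_softmax(logits, axis=1)[idx, idx].mean()  # video -> text
    loss_t2v = -_log_softmax(logits, axis=0)[idx, idx].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)

# Usage with random stand-in embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned_loss = symmetric_info_nce(emb, emb)
mismatched_loss = symmetric_info_nce(emb, emb[::-1])
```

The loss is lower when matching pairs sit on the diagonal of the similarity matrix, which is the property that pulls each clip toward its own action description and away from the others in the batch.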
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Learning
