TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action
Jen-Hao Cheng, Vivian Wang, Huayu Wang, Huapeng Zhou, Yi-Hao Peng, Hou-I Liu, Hsiang-Wei Huang, Kuang-Ming Chen, Cheng-Yen Yang, Wenhao Chai, Yi-Ling Chen, Vibhav Vineet, Qin Cai, and Jenq-Neng Hwang

TL;DR
TEMPURA is a two-stage training framework that improves temporal understanding and causal reasoning in videos. It first learns to reconstruct missing events, then to segment videos into detailed, timestamp-aligned events, and it outperforms baseline models on temporal grounding tasks.
Contribution
The paper presents TEMPURA, a new approach combining masked event prediction and video segmentation for improved causal and temporal reasoning in videos, trained on a large-scale dataset.
Findings
Outperforms baseline models on temporal grounding tasks.
Effectively reconstructs missing events and generates causal explanations.
Enhances fine-grained temporal segmentation and understanding.
Abstract
Understanding causal event relationships and achieving fine-grained temporal grounding in videos remain challenging for vision-language models. Existing methods either compress video tokens to reduce temporal resolution, or treat videos as unsegmented streams, which obscures fine-grained event boundaries and limits the modeling of causal dependencies. We propose TEMPURA (Temporal Event Masked Prediction and Understanding for Reasoning in Action), a two-stage training framework that enhances video temporal understanding. TEMPURA first applies masked event prediction reasoning to reconstruct missing events and generate step-by-step causal explanations from dense event annotations, drawing inspiration from effective infilling techniques. TEMPURA then learns to perform video segmentation and dense captioning to decompose videos into non-overlapping events with detailed, timestamp-aligned…
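As an illustration of the masked event prediction idea described in the abstract, here is a minimal sketch of how a training example might be built from dense event annotations. It is not the authors' implementation: the `Event` type, the `MASK_TOKEN` placeholder, the prompt layout, and both helper functions are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    start: float  # start time in seconds
    end: float    # end time in seconds
    caption: str  # dense caption for this segment


# Hypothetical placeholder; the token actually used in training is not given in the source.
MASK_TOKEN = "[MASKED EVENT]"


def is_non_overlapping(events: List[Event]) -> bool:
    """Check that events have positive duration and do not overlap in time."""
    ordered = sorted(events, key=lambda e: e.start)
    valid_spans = all(e.start < e.end for e in ordered)
    disjoint = all(a.end <= b.start for a, b in zip(ordered, ordered[1:]))
    return valid_spans and disjoint


def build_masked_example(events: List[Event], mask_index: int) -> Tuple[str, str]:
    """Hide one event caption and return (prompt, target).

    The prompt lists every event with its timestamps, replacing the
    chosen caption with MASK_TOKEN; the target is the hidden caption.
    """
    if not 0 <= mask_index < len(events):
        raise IndexError("mask_index out of range")
    lines = []
    for i, event in enumerate(events):
        text = MASK_TOKEN if i == mask_index else event.caption
        lines.append(f"[{event.start:.1f}s - {event.end:.1f}s] {text}")
    return "\n".join(lines), events[mask_index].caption


# Made-up annotations, for illustration only.
events = [
    Event(0.0, 4.0, "crack eggs into a bowl"),
    Event(4.0, 9.5, "whisk the eggs"),
    Event(9.5, 15.0, "pour the mixture into a pan"),
]
prompt, target = build_masked_example(events, mask_index=1)
```

In this sketch the model would be asked to infer the hidden middle event from the surrounding context, and the non-overlap check mirrors the abstract's requirement that a video decompose into non-overlapping, timestamp-aligned events.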
Taxonomy
Topics: Semantic Web and Ontologies
