TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action

Jen-Hao Cheng, Vivian Wang, Huayu Wang, Huapeng Zhou, Yi-Hao Peng, Hou-I Liu, Hsiang-Wei Huang, Kuang-Ming Chen, Cheng-Yen Yang, Wenhao Chai, Yi-Ling Chen, Vibhav Vineet, Qin Cai, and Jenq-Neng Hwang

arXiv:2505.01583 · cs.CV · May 6, 2025

TL;DR

TEMPURA is a two-stage training framework that improves temporal understanding and causal reasoning in videos. It first learns to reconstruct missing events, then learns to segment videos into detailed, timestamp-aligned events, and it outperforms baseline models on temporal grounding tasks.

Contribution

The paper presents TEMPURA, an approach that combines masked event prediction with video segmentation and dense captioning to improve causal and temporal reasoning in videos, trained on a large-scale dataset.

Findings

01 Outperforms baseline models on temporal grounding tasks.

02 Effectively reconstructs missing events and generates causal explanations.

03 Enhances fine-grained temporal segmentation and understanding.

Abstract

Understanding causal event relationships and achieving fine-grained temporal grounding in videos remain challenging for vision-language models. Existing methods either compress video tokens to reduce temporal resolution, or treat videos as unsegmented streams, which obscures fine-grained event boundaries and limits the modeling of causal dependencies. We propose TEMPURA (Temporal Event Masked Prediction and Understanding for Reasoning in Action), a two-stage training framework that enhances video temporal understanding. TEMPURA first applies masked event prediction reasoning to reconstruct missing events and generate step-by-step causal explanations from dense event annotations, drawing inspiration from effective infilling techniques. TEMPURA then learns to perform video segmentation and dense captioning to decompose videos into non-overlapping events with detailed, timestamp-aligned…
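The two ideas named in the abstract, masking an event so it can be reconstructed and representing a video as non-overlapping, timestamp-aligned events, can be illustrated with a small data-preparation sketch. This is not the authors' implementation: the `Event` class, the `MASK_TOKEN` placeholder, the prompt layout, and the sample captions are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical placeholder text; the paper's actual masking format is not given here.
MASK_TOKEN = "<masked event>"


@dataclass
class Event:
    start: float  # start time in seconds
    end: float    # end time in seconds
    caption: str  # dense caption for this segment


def is_non_overlapping(events: List[Event]) -> bool:
    """Return True if no two events overlap once sorted by start time."""
    ordered = sorted(events, key=lambda e: e.start)
    return all(a.end <= b.start for a, b in zip(ordered, ordered[1:]))


def make_masked_sample(events: List[Event], mask_index: int) -> Tuple[str, str]:
    """Hide one event's caption and return (prompt, target caption)."""
    lines = []
    for i, e in enumerate(events):
        text = MASK_TOKEN if i == mask_index else e.caption
        lines.append(f"[{e.start:.1f}s - {e.end:.1f}s] {text}")
    return "\n".join(lines), events[mask_index].caption


# Made-up annotations, for illustration only.
events = [
    Event(0.0, 4.0, "A person cracks two eggs into a bowl."),
    Event(4.0, 9.5, "They whisk the eggs."),
    Event(9.5, 15.0, "They pour the mixture into a hot pan."),
]
prompt, target = make_masked_sample(events, mask_index=1)
```

In this sketch, `prompt` lists every segment with its timestamps but hides the middle caption, and `target` holds the caption a model would be asked to infer from the surrounding events.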

Code & Models

Repositories

andy-cheng/tempura
PyTorch · Official

Datasets

andaba/TEMPURA-VER
dataset · 26 dl

Taxonomy

Topics: Semantic Web and Ontologies