ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos
Jr-Jen Chen, Yu-Chien Liao, Hsi-Che Lin, Yu-Chu Yu, Yen-Chun Chen, Yu-Chiang Frank Wang

TL;DR
ReXTime is a new benchmark suite designed to evaluate AI models' ability to perform temporal reasoning across video segments, highlighting current limitations and providing a dataset for improving such reasoning.
Contribution
The paper introduces ReXTime, a novel benchmark with an automated pipeline for generating temporal reasoning questions, enabling large-scale evaluation and training of models on reasoning across video segments.
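As a rough illustration of what "reasoning across video segments" means for a single sample, the sketch below checks whether the segment a question refers to and the segment containing its answer are disjoint in time. The record layout, the field names (`question_span`, `answer_span`), and the overlap threshold are assumptions made for this example; they are not taken from the ReXTime data format or pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A video segment in seconds (hypothetical representation)."""
    start: float
    end: float


@dataclass(frozen=True)
class QASample:
    """Hypothetical QA record: where the question's event and the answer's event occur."""
    question: str
    answer: str
    question_span: Span
    answer_span: Span


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two time spans (0.0 when they do not overlap)."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def is_across_time(sample: QASample, max_iou: float = 0.0) -> bool:
    """True when the question and answer spans overlap by no more than max_iou."""
    return temporal_iou(sample.question_span, sample.answer_span) <= max_iou


# Usage: a made-up sample whose cause and effect lie in separate segments.
sample = QASample(
    question="Why does the person open the oven?",
    answer="To take out the tray they put in earlier.",
    question_span=Span(40.0, 45.0),
    answer_span=Span(5.0, 12.0),
)
print(is_across_time(sample))
```

With the default `max_iou` of 0.0, only samples whose two spans share no time at all count as across-time; a looser threshold would admit partially overlapping events.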
Findings
Frontier large language models outperform academic models but still lag behind human performance by a 14.3% accuracy gap.
The automated dataset generation pipeline effectively creates training data for across-time reasoning.
Empirical results show fine-tuning on generated data improves model performance.
Abstract
We introduce ReXTime, a benchmark designed to rigorously test AI models' ability to perform temporal reasoning within video events. Specifically, ReXTime focuses on reasoning across time, i.e. human-like understanding when the question and its corresponding answer occur in different video segments. This form of reasoning, requiring advanced understanding of cause-and-effect relationships across video segments, poses significant challenges to even the frontier multimodal large language models. To facilitate this evaluation, we develop an automated pipeline for generating temporal reasoning question-answer pairs, significantly reducing the need for labor-intensive manual annotations. Our benchmark includes 921 carefully vetted validation samples and 2,143 test samples, each manually curated for accuracy and relevance. Evaluation results show that while frontier large language models…
Taxonomy
Topics: Human Pose and Action Recognition · Video Analysis and Summarization
