Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos

Songtao Jiang; Sibo Song; Chenyi Zhou; Yuan Wang; Ruizhe Chen; Tongkun Guan; Ruilin Luo; Yan Zhang; Zhihang Tang; Yuchong Sun; Hang Zhang; Zhibo Yang; Shuai Bai; Junyang Lin; Zuozhu Liu

arXiv:2603.17693 · cs.CV · March 19, 2026

Open Access

TL;DR

This paper introduces SynRL, a post-training framework that uses synthetic videos of simple geometric shapes to teach models fundamental temporal primitives (direction, speed, and state tracking). These skills transfer to complex real-world scenarios and improve performance on a range of video reasoning benchmarks.

Contribution

The paper presents a synthetic-data approach to teaching temporal primitives that transfers to real-world video reasoning tasks and is reported to surpass existing methods in both efficiency and performance.

Findings

01  SynRL improves performance across 15 benchmarks.

02  Synthetic data outperforms real-world data on certain tasks.

03  Fundamental temporal skills transfer effectively from synthetic to real videos.

Abstract

The transition from image to video understanding requires vision-language models (VLMs) to shift from recognizing static patterns to reasoning over temporal dynamics such as motion trajectories, speed changes, and state transitions. Yet current post-training methods fall short due to two critical limitations: (1) existing datasets often lack temporal-centricity, where answers can be inferred from isolated keyframes rather than requiring holistic temporal integration; and (2) training data generated by proprietary models contains systematic errors in fundamental temporal perception, such as confusing motion directions or misjudging speeds. We introduce SynRL, a post-training framework that teaches models temporal primitives, the fundamental building blocks of temporal understanding including direction, speed, and state tracking. Our key insight is that these abstract primitives, learned…
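To make the idea of a temporal primitive concrete, here is a minimal sketch of how a synthetic clip with exact direction and speed labels could be produced. It is an illustration under stated assumptions, not the paper's actual generator: the names `render_clip`, `recover_label`, `Clip`, and `DIRECTIONS`, the grid size, and the four-direction vocabulary are all hypothetical. The point it shows is that labels taken from the generator's own parameters cannot carry the perception errors that the abstract attributes to model-generated annotations.

```python
from dataclasses import dataclass

# Hypothetical direction vocabulary (unit steps in x, y); not taken from the paper.
DIRECTIONS = {"right": (1, 0), "left": (-1, 0), "down": (0, 1), "up": (0, -1)}


@dataclass
class Clip:
    frames: list    # list of size x size grids (0 = background, 1 = shape)
    direction: str  # ground-truth motion direction
    speed: int      # ground-truth speed in pixels per frame


def render_clip(direction, speed, num_frames=8, size=32, side=4, start=(14, 14)):
    """Render a square moving at constant velocity; labels come from the generator."""
    dx, dy = DIRECTIONS[direction]
    frames = []
    for t in range(num_frames):
        x0 = start[0] + dx * speed * t
        y0 = start[1] + dy * speed * t
        grid = [[0] * size for _ in range(size)]
        for y in range(y0, y0 + side):
            for x in range(x0, x0 + side):
                # Pixels that leave the canvas are simply clipped.
                if 0 <= x < size and 0 <= y < size:
                    grid[y][x] = 1
        frames.append(grid)
    return Clip(frames, direction, speed)


def centroid(grid):
    """Mean (x, y) position of all shape pixels in one frame."""
    pts = [(x, y) for y, row in enumerate(grid) for x, v in enumerate(row) if v]
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)


def recover_label(clip):
    """Recover direction and speed from the first two frames as a sanity check.

    Assumes speed >= 1 and that the shape is fully visible in both frames.
    """
    (x0, y0), (x1, y1) = centroid(clip.frames[0]), centroid(clip.frames[1])
    dx, dy = x1 - x0, y1 - y0
    speed = int(round(max(abs(dx), abs(dy))))
    step = (int(dx / speed), int(dy / speed))
    name = next(k for k, v in DIRECTIONS.items() if v == step)
    return name, speed


clip = render_clip("up", 3)
print(clip.direction, clip.speed, recover_label(clip))
```

Note that no single frame of such a clip reveals the direction or the speed; at least two frames must be compared, which is the temporal-centric property the abstract says existing datasets often lack.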

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition