Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos
Songtao Jiang, Sibo Song, Chenyi Zhou, Yuan Wang, Ruizhe Chen, Tongkun Guan, Ruilin Luo, Yan Zhang, Zhihang Tang, Yuchong Sun, Hang Zhang, Zhibo Yang, Shuai Bai, Junyang Lin, Zuozhu Liu

TL;DR
This paper introduces SynRL, a post-training framework that uses synthetic videos of simple geometric shapes to teach models fundamental temporal primitives such as direction, speed, and state tracking. These skills transfer to complex real-world scenarios and improve performance across 15 video reasoning benchmarks.
Contribution
The paper presents a synthetic-data approach to teaching temporal primitives that transfers effectively to real-world video reasoning tasks and is reported to surpass existing methods in both efficiency and performance.
Findings
SynRL improves performance across 15 benchmarks.
Synthetic data outperforms real-world data in certain tasks.
Fundamental temporal skills transfer effectively from synthetic to real videos.
Abstract
The transition from image to video understanding requires vision-language models (VLMs) to shift from recognizing static patterns to reasoning over temporal dynamics such as motion trajectories, speed changes, and state transitions. Yet current post-training methods fall short due to two critical limitations: (1) existing datasets often lack temporal-centricity, where answers can be inferred from isolated keyframes rather than requiring holistic temporal integration; and (2) training data generated by proprietary models contains systematic errors in fundamental temporal perception, such as confusing motion directions or misjudging speeds. We introduce SynRL, a post-training framework that teaches models temporal primitives, the fundamental building blocks of temporal understanding including direction, speed, and state tracking. Our key insight is that these abstract primitives, learned…
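To make the idea of a temporal primitive concrete, here is a minimal Python sketch of how a synthetic clip with an exactly known direction and speed label might be generated. It is not the authors' pipeline: the function names (`make_clip`, `centroid`, `estimate_motion`), the four-direction vocabulary, and the frame format (nested lists of 0/1 pixels) are all assumptions made for illustration.

```python
# Hypothetical direction vocabulary (assumed; the paper's label set is not given here).
# Each entry maps a name to a (dx, dy) unit step in image coordinates (y grows downward).
DIRECTIONS = {"right": (1, 0), "left": (-1, 0), "down": (0, 1), "up": (0, -1)}


def make_clip(direction, speed, num_frames=8, size=32, side=4):
    """Render a square moving `speed` pixels per frame; return (frames, label)."""
    dx, dy = DIRECTIONS[direction]
    travel = speed * (num_frames - 1)
    if travel + side > size:
        raise ValueError("square would leave the frame")
    # Start at the edge opposite to the motion, centred on the other axis.
    x = 0 if dx > 0 else size - side if dx < 0 else (size - side) // 2
    y = 0 if dy > 0 else size - side if dy < 0 else (size - side) // 2
    frames = []
    for _ in range(num_frames):
        frame = [[0] * size for _ in range(size)]
        for row in range(y, y + side):
            for col in range(x, x + side):
                frame[row][col] = 1
        frames.append(frame)
        x += dx * speed
        y += dy * speed
    # The label is correct by construction: no annotator or model is involved.
    return frames, {"direction": direction, "speed": speed}


def centroid(frame):
    """Mean (x, y) position of the lit pixels in one frame."""
    points = [(c, r) for r, row in enumerate(frame) for c, v in enumerate(row) if v]
    n = len(points)
    return sum(p[0] for p in points) / n, sum(p[1] for p in points) / n


def estimate_motion(frames):
    """Recover (direction, speed) from the first and last frame centroids."""
    (x0, y0), (x1, y1) = centroid(frames[0]), centroid(frames[-1])
    steps = len(frames) - 1
    vx, vy = (x1 - x0) / steps, (y1 - y0) / steps
    if abs(vx) >= abs(vy):
        return ("right" if vx > 0 else "left"), abs(vx)
    return ("down" if vy > 0 else "up"), abs(vy)


frames, label = make_clip("left", 3)
print(label, estimate_motion(frames))
```

Two properties of this toy setup mirror the limitations named in the abstract. Any single frame shows only a static square, so the direction and speed cannot be read from an isolated keyframe, and the label comes from the generator's own parameters rather than from a model's possibly mistaken perception.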
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
