From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning
Xiaoda Yang, Yuxiang Liu, Shenzhou Gao, Can Wang, Jingyang Xue, Lixin Yang, Yao Mu, Tao Jin, Shuicheng Yan, Zhimeng Zhang, Zhou Zhao

TL;DR
EgoTSR introduces a curriculum learning framework and a large-scale dataset to enhance egocentric spatiotemporal reasoning in vision-language models, significantly improving long-horizon reasoning accuracy.
Contribution
The paper presents EgoTSR, a novel curriculum-based approach and dataset for evolving embodied reasoning from spatial understanding to long-term planning.
Findings
Achieves 92.4% accuracy on long-horizon logical reasoning tasks.
Effectively eliminates chronological biases in spatiotemporal reasoning.
Outperforms existing state-of-the-art models in accuracy and perceptual precision.
Abstract
Modern vision-language models achieve strong performance in static perception, but remain limited in the complex spatiotemporal reasoning required for embodied, egocentric tasks. A major source of failure is their reliance on temporal priors learned from passive video data, which often leads to spatiotemporal hallucinations and poor generalization in dynamic environments. To address this, we present EgoTSR, a curriculum-based framework for learning task-oriented spatiotemporal reasoning. EgoTSR is built on the premise that embodied reasoning should evolve from explicit spatial understanding to internalized task-state assessment and finally to long-horizon planning. To support this paradigm, we construct EgoTSR-Data, a large-scale dataset comprising 46 million samples organized into three stages: Chain-of-Thought (CoT) supervision, weakly supervised tagging, and long-horizon sequences.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
