EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT
Baoqi Pei, Yifei Huang, Jilan Xu, Yuping He, Guo Chen, Fei Wu, Yu Qiao, Jiangmiao Pang

TL;DR
EgoThinker is a new framework that enhances multimodal large language models with egocentric reasoning abilities by leveraging a large-scale dataset and a two-stage training process, improving performance on egocentric video understanding tasks.
Contribution
The paper introduces EgoThinker, a novel approach combining spatio-temporal chain-of-thought supervision and a large egocentric dataset to improve first-person reasoning in multimodal models.
Findings
Outperforms existing methods on egocentric benchmarks
Achieves significant improvements in spatio-temporal localization
Demonstrates effective reasoning with detailed rationales
Abstract
Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at visible event reasoning but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a novel framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought (CoT) supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. This dataset features multi-minute segments annotated with detailed CoT rationales and dense hand-object grounding. Second, we employ supervised fine-tuning (SFT) on EgoRe-5M to instill reasoning skills, followed by…
