EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT

Baoqi Pei; Yifei Huang; Jilan Xu; Yuping He; Guo Chen; Fei Wu; Yu Qiao; Jiangmiao Pang

arXiv:2510.23569 · cs.CV · October 28, 2025

TL;DR

EgoThinker is a framework that gives multimodal large language models egocentric reasoning abilities by training them on a large-scale egocentric QA dataset in two stages, improving performance on egocentric video understanding tasks.

Contribution

The paper introduces EgoThinker, a novel approach combining spatio-temporal chain-of-thought supervision and a large egocentric dataset to improve first-person reasoning in multimodal models.

Findings

01 Outperforms existing methods on egocentric benchmarks

02 Achieves significant improvements in spatio-temporal localization

03 Demonstrates effective reasoning with detailed rationales

Abstract

Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at visible event reasoning but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a novel framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought (CoT) supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. This dataset features multi-minute segments annotated with detailed CoT rationales and dense hand-object grounding. Second, we employ SFT on EgoRe-5M to instill reasoning skills, followed by…

Code & Models

Models

hyf015/EgoThinker-v1
model · 15 downloads · 3 likes

Datasets

hyf015/EgoThinker-SFT-Dataset
dataset · 101 downloads

Videos

EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT · SlidesLive