EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks

Lulin Liu, Dayou Li, Yiqing Liang, Sicong Jiang, Hitesh Vijay, Hezhen Hu, Xuhai Xu, Zirui Liu, Srinivas Shakkottai, Manling Li, and Zhiwen Fan

arXiv:2604.09535·cs.CV·April 13, 2026

TL;DR

EgoTL introduces a think-aloud data collection pipeline for egocentric household tasks, enabling benchmarking and fine-tuning of foundation models to improve long-horizon reasoning and spatial grounding.

Contribution

The paper presents EgoTL, an egocentric data capture method that records step-by-step goals and spoken reasoning, and shows that the resulting data can be used both to benchmark foundation models on household tasks and to improve them through fine-tuning.

Findings

01. Foundation models still struggle with egocentric reasoning and spatial grounding.

02. EgoTL enables benchmarking across six task dimensions and over 100 household tasks.

03. Fine-tuning with human CoT improves long-horizon planning and spatial reasoning.

Abstract

Large foundation models have made significant advances in embodied intelligence, enabling synthesis and reasoning over egocentric input for household tasks. However, VLM-based auto-labeling is often noisy because the primary data sources lack accurate human action labels, chain-of-thought (CoT), and spatial annotations; these errors are amplified during long-horizon spatial instruction following. These issues stem from insufficient coverage of minute-long, daily household planning tasks and from inaccurate spatial grounding. As a result, VLM reasoning chains and world-model synthesis can hallucinate objects, skip steps, or fail to respect real-world physical attributes. To address these gaps, we introduce EgoTL. EgoTL builds a think-aloud capture pipeline for egocentric data. It uses a say-before-act protocol to record step-by-step goals and spoken reasoning with word-level timestamps,…
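The say-before-act idea can be illustrated with a small sketch. This is not the authors' pipeline: the `Word` record, the `max_gap` silence threshold, and the rule that an action window runs from the end of one utterance to the start of the next are all assumptions made here for illustration. The sketch only shows how word-level timestamps could, in principle, be grouped into spoken steps that each precede an action segment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """One transcribed word with start/end times in seconds (hypothetical schema)."""
    text: str
    start: float
    end: float


def group_utterances(words: List[Word], max_gap: float = 1.0) -> List[List[Word]]:
    """Split a word stream into utterances wherever the silence exceeds max_gap."""
    utterances: List[List[Word]] = []
    current: List[Word] = []
    for w in words:
        if current and w.start - current[-1].end > max_gap:
            utterances.append(current)
            current = []
        current.append(w)
    if current:
        utterances.append(current)
    return utterances


def to_steps(words: List[Word], max_gap: float = 1.0) -> List[dict]:
    """Pair each utterance with the action window assumed to follow it."""
    utterances = group_utterances(words, max_gap)
    steps = []
    for i, utt in enumerate(utterances):
        # Assumption: the action starts when speech ends and lasts until
        # the next utterance begins (unknown for the final step).
        next_start: Optional[float] = (
            utterances[i + 1][0].start if i + 1 < len(utterances) else None
        )
        steps.append({
            "text": " ".join(w.text for w in utt),
            "say_start": utt[0].start,
            "say_end": utt[-1].end,
            "act_start": utt[-1].end,
            "act_end": next_start,
        })
    return steps


words = [
    Word("open", 0.0, 0.3), Word("fridge", 0.4, 0.9),
    Word("grab", 5.0, 5.3), Word("milk", 5.4, 5.8),
]
steps = to_steps(words)
```

With the toy input above, the two spoken phrases become two steps, and the first step's action window spans the silence between them. A real capture would need a more careful rule for where each action ends.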

Code & Models

Datasets

luuuulinnnn/EgoTL-DATA
dataset · 487 dl
