Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning

Shulin Tian; Ruiqi Wang; Hongming Guo; Penghao Wu; Yuhao Dong; Xiuying Wang; Jingkang Yang; Hao Zhang; Hongyuan Zhu; Ziwei Liu

arXiv:2506.13654 · cs.CV · June 17, 2025

TL;DR

Ego-R1 introduces a reinforcement learning-based framework utilizing a structured chain-of-tool-thought process for reasoning over ultra-long egocentric videos, enabling complex multi-modal understanding over extended time periods.

Contribution

The paper presents a novel RL-trained agent with a structured reasoning process and a new dataset for ultra-long egocentric video question answering.

Findings

01. Effective reasoning over videos spanning a week.
02. Significant improvement in temporal coverage and understanding.
03. Demonstrated superiority over baseline methods.

Abstract

We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., in days and weeks) egocentric videos, which leverages a structured Chain-of-Tool-Thought (CoTT) process, orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL). Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps, with the RL agent invoking specific tools, one per step, to iteratively and collaboratively answer sub-questions tackling such tasks as temporal retrieval and multi-modal understanding. We design a two-stage training paradigm involving supervised finetuning (SFT) of a pretrained language model using CoTT data and RL to enable our agent to dynamically propose step-by-step tools for long-range reasoning. To facilitate training, we construct a dataset called Ego-R1 Data, which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL.…
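The one-tool-per-step loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tool names (`retrieve`, `inspect_clip`), the `Step` record, the `run_cott` function, and the scripted policy are all hypothetical stand-ins chosen for illustration. In Ego-R1 the role of the policy is played by a language model trained with SFT and RL, and the tools operate on real video data rather than returning canned strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool type: takes a query string, returns an observation string.
Tool = Callable[[str], str]


@dataclass
class Step:
    """One reasoning step: a thought, the single tool invoked, and its result."""
    thought: str
    tool: str
    query: str
    observation: str


# Hypothetical policy type: given the question and the trace so far, it
# returns (thought, action, argument). The action "answer" ends the episode.
Policy = Callable[[str, List[Step]], Tuple[str, str, str]]


def run_cott(question: str, policy: Policy, tools: Dict[str, Tool],
             max_steps: int = 8) -> Tuple[Optional[str], List[Step]]:
    """Run a tool-calling loop that invokes exactly one tool per step."""
    trace: List[Step] = []
    for _ in range(max_steps):
        thought, action, arg = policy(question, trace)
        if action == "answer":
            return arg, trace
        if action in tools:
            observation = tools[action](arg)
        else:
            observation = f"unknown tool: {action}"
        trace.append(Step(thought, action, arg, observation))
    # Step budget exhausted without a final answer.
    return None, trace


def scripted_policy(question: str, trace: List[Step]) -> Tuple[str, str, str]:
    """Toy stand-in for the trained agent: retrieve, inspect, then answer."""
    if len(trace) == 0:
        return ("Find when the event happened.", "retrieve", question)
    if len(trace) == 1:
        return ("Look at the retrieved clip.", "inspect_clip",
                trace[-1].observation)
    return ("Enough evidence.", "answer", trace[-1].observation)


# Toy tools returning canned strings (assumptions, not real retrieval).
tools: Dict[str, Tool] = {
    "retrieve": lambda q: "day3_14h",
    "inspect_clip": lambda clip_id: f"{clip_id}: a red mug on the desk",
}

answer, trace = run_cott("Where did I leave my mug?", scripted_policy, tools)
print(answer)
for step in trace:
    print(step.tool, "->", step.observation)
```

Keeping the trace as an explicit list of steps mirrors the idea that each sub-question's observation feeds the next decision; a real agent would condition a language model on this trace instead of following a fixed script.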

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 2

Strengths

- Originality: Frames week-scale egocentric QA as sequential decision-making via a Chain-of-Tool-Thought controller over hierarchical temporal memory with adaptive Video-LLM/VLM calls.
- Quality: Strong margins on a week-long benchmark (46.0%, +7.7 vs Gemini-1.5-Pro), clear ablations (SFT+RL > SFT; CoTT > retrieval-only), and substantial frame-budget reductions.
- Clarity: Explicit tool APIs, training signals, and memory construction; stepwise traces reveal evidence flow and typical failure mode

Weaknesses

- CoTT over hierarchical memory likely overlaps prior agentic long-video approaches. Action: run strict, matched-backbone and matched-budget comparisons against strong agentic and training-free video-RAG baselines; add a lightweight-critic variant to test the incremental value of planning alone.
- Data construction and inference rely on proprietary LLMs/VLMs. Action: provide a fully open stack with results, release exact prompts/tool schemas/configs, and report contamination checks between gener

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- Clear and systematic hierarchical RAG structure, which improves efficiency and relevance in timestamp-based video reasoning tasks.
- Experiments on multiple egocentric datasets demonstrate consistent improvement over flat retrieval methods.

Weaknesses

- The hierarchical database structure (week → day → hour → clip) appears optimized for benchmarks with clear temporal granularity, but it’s unclear if it remains effective for datasets or tasks where such segmentation is not naturally defined.
- The uniform video segmentation approach might not be robust across diverse video lengths or event types. The method may fail to capture variable-duration actions or continuous interactions.
- The technical contribution is moderate, focusing on database

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- This paper extends egocentric video understanding into week-level duration.
- This paper successfully implements reasoning-tool calling CoT in the context of video understanding.
- This paper proposes new datasets for egocentric video understanding.

Weaknesses

- Long temporal retrieval is conducted in text form instead of through visual-language matching. However, the transformation from visual space to textual space inevitably loses information.
- In Tab. 1, the performance of the base model is not reported. In addition, although samples that overlap with the benchmark are removed from training, the cold-start and RL stages still focus on egocentric videos, which are in-domain data. Therefore, the comparison with other general video models seems

Taxonomy

Topics: Artificial Intelligence in Games

Methods: Shrink and Fine-Tune