EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next

Ye Pan; Chi Kit Wong; Yuanhuiyi Lyu; Hanqian Li; Jiahao Huo; Jiacheng Chen; Lutao Jiang; Xu Zheng; Xuming Hu

arXiv:2603.12147 · cs.CV · March 13, 2026

TL;DR

EgoIntent introduces a new benchmark for fine-grained, step-level understanding of human intent in egocentric videos, evaluating models on local, global, and next-step intent comprehension to advance intelligent assistive technologies.

Contribution

The paper presents EgoIntent, a comprehensive step-level intent benchmark with a novel evaluation setup that prevents future-frame leakage, and it highlights how challenging fine-grained intent understanding in egocentric videos remains.
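This summary does not say how the evaluation setup prevents future-frame leakage. One plausible reading is that, when a model is queried about a step, it only receives frames up to the end of that step. The sketch below illustrates that reading only; `Step`, `visible_frame_indices`, and the timestamp fields are hypothetical names and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """Hypothetical step annotation: start and end times in seconds."""
    start_s: float
    end_s: float


def visible_frame_indices(step: Step, fps: float, num_frames: int) -> List[int]:
    """Return the frame indices a model may see when asked about `step`.

    Assumption (not from the paper): frame i has timestamp i / fps, and
    the model may see everything from the start of the video up to the
    end of the current step, but nothing after it.
    """
    if fps <= 0 or num_frames <= 0:
        return []
    # Index of the last frame whose timestamp does not exceed the step end.
    last = min(int(step.end_s * fps), num_frames - 1)
    return list(range(last + 1))


# Example: a 10-second clip at 2 fps, with a step ending at 4.0 s.
step = Step(start_s=2.0, end_s=4.0)
indices = visible_frame_indices(step, fps=2.0, num_frames=20)
```

Under this assumption, a next-step question has to be answered from the context observed so far, since no frame of the following step is passed to the model.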

Findings

01. Current models achieve an average score of only 33.31, indicating that the benchmark is highly difficult.

02. Step-level intent understanding remains a significant challenge for multimodal models.

03. EgoIntent enables targeted research on anticipatory and fine-grained intent comprehension.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable video reasoning capabilities across diverse tasks. However, their ability to understand human intent at a fine-grained level in egocentric videos remains largely unexplored. Existing benchmarks focus primarily on episode-level intent reasoning, overlooking the finer granularity of step-level intent understanding. Yet applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance require understanding not only what a person is doing at each step, but also why and what comes next, in order to provide timely and context-aware support. To this end, we introduce EgoIntent, a step-level intent understanding benchmark for egocentric videos. It comprises 3,014 steps spanning 15 diverse indoor and outdoor daily-life scenarios, and evaluates models on three complementary dimensions:…

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Human Pose and Action Recognition