EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next
Ye Pan, Chi Kit Wong, Yuanhuiyi Lyu, Hanqian Li, Jiahao Huo, Jiacheng Chen, Lutao Jiang, Xu Zheng, Xuming Hu

TL;DR
EgoIntent introduces a new benchmark for fine-grained, step-level understanding of human intent in egocentric videos, evaluating models on local, global, and next-step intent comprehension to advance intelligent assistive technologies.
Contribution
The paper presents EgoIntent, a comprehensive step-level intent benchmark with a novel evaluation setup that prevents future frame leakage, highlighting the challenge of fine-grained intent understanding in egocentric videos.
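This summary does not say how the evaluation setup prevents future frame leakage. One plausible reading, offered purely as an assumption, is that the model only sees frames up to a cutoff inside the current step. In the minimal sketch below, the `Frame` type, the `visible_frames` helper, and the cutoff value are hypothetical names chosen for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """Hypothetical frame record; not the benchmark's actual data format."""
    timestamp: float  # seconds from the start of the video
    frame_id: int


def visible_frames(frames, cutoff_s):
    """Keep only frames at or before cutoff_s, so later frames cannot leak."""
    return [f for f in frames if f.timestamp <= cutoff_s]


# Ten frames sampled every 0.5 s; assume the current step is cut off at 2.0 s.
frames = [Frame(timestamp=i * 0.5, frame_id=i) for i in range(10)]
clip = visible_frames(frames, cutoff_s=2.0)
print([f.frame_id for f in clip])  # prints [0, 1, 2, 3, 4]
```

Under this assumption, a next-step question would be answered from `clip` alone, so the model never observes the frames it is asked to anticipate.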
Findings
Current models reach an average score of only 33.31, indicating that the benchmark is highly difficult.
Step-level intent understanding remains a significant challenge for multimodal models.
EgoIntent enables targeted research on anticipatory and fine-grained intent comprehension.
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable video reasoning capabilities across diverse tasks. However, their ability to understand human intent at a fine-grained level in egocentric videos remains largely unexplored. Existing benchmarks focus primarily on episode-level intent reasoning, overlooking the finer granularity of step-level intent understanding. Yet applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance require understanding not only what a person is doing at each step, but also why and what comes next, in order to provide timely and context-aware support. To this end, we introduce EgoIntent, a step-level intent understanding benchmark for egocentric videos. It comprises 3,014 steps spanning 15 diverse indoor and outdoor daily-life scenarios, and evaluates models on three complementary dimensions:…
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Human Pose and Action Recognition
