Intention-Guided Cognitive Reasoning for Egocentric Long-Term Action Anticipation
Qiaohui Chu, Haoyu Zhang, Meng Liu, Yisen Feng, Haoxiang Shi, Liqiang Nie

TL;DR
This paper introduces INSIGHT, a novel two-stage framework for egocentric long-term action anticipation that leverages semantic cues and explicit cognitive reasoning to improve prediction accuracy and generalization.
Contribution
The paper proposes a unified two-stage approach that combines semantic feature extraction with reinforcement learning-based reasoning to improve long-term action anticipation.
Findings
Achieves state-of-the-art results on Ego4D, EPIC-Kitchens-55, and EGTEA Gaze+ datasets.
Effectively utilizes hand-object interaction cues and verb-noun semantics.
Demonstrates strong generalization across diverse egocentric datasets.
Abstract
Long-term action anticipation from egocentric video is critical for applications such as human-computer interaction and assistive technologies, where anticipating user intent enables proactive and context-aware AI assistance. However, existing approaches suffer from three key limitations: 1) underutilization of fine-grained visual cues from hand-object interactions, 2) neglect of semantic dependencies between verbs and nouns, and 3) lack of explicit cognitive reasoning, limiting generalization and long-term forecasting ability. To overcome these challenges, we propose INSIGHT, a unified two-stage framework for egocentric action anticipation. In the first stage, INSIGHT focuses on extracting semantically rich features from hand-object interaction regions and enhances action representations using a verb-noun co-occurrence matrix. In the second stage, it introduces a reinforcement…
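To make the verb-noun co-occurrence idea concrete, here is a minimal sketch in Python. The abstract does not say how INSIGHT builds or applies its matrix, so everything below is an assumption for illustration: the function names (`cooccurrence_matrix`, `noun_given_verb`, `rescore_nouns`), the row normalization, and the multiplicative rescoring are hypothetical and are not taken from the paper.

```python
import numpy as np


def cooccurrence_matrix(pairs, num_verbs, num_nouns):
    """Count how often each (verb, noun) index pair appears in the annotations."""
    counts = np.zeros((num_verbs, num_nouns), dtype=np.float64)
    for verb, noun in pairs:
        counts[verb, noun] += 1.0
    return counts


def noun_given_verb(counts, eps=1e-8):
    """Row-normalize the counts into an estimate of P(noun | verb)."""
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_sums, eps)


def rescore_nouns(verb_probs, noun_probs, cond):
    """Reweight noun probabilities by the noun prior implied by the verb beliefs.

    Illustrative choice only: prior = verb_probs @ P(noun | verb),
    then an elementwise product with noun_probs, renormalized.
    """
    prior = verb_probs @ cond
    scores = noun_probs * prior
    total = scores.sum()
    return scores / total if total > 0 else noun_probs


# Toy annotations: verb 0 pairs with noun 0 twice and noun 1 once; verb 1 pairs with noun 2.
pairs = [(0, 0), (0, 0), (0, 1), (1, 2)]
counts = cooccurrence_matrix(pairs, num_verbs=2, num_nouns=3)
cond = noun_given_verb(counts)

# A confident verb prediction shifts a uniform noun prediction toward compatible nouns.
rescored = rescore_nouns(np.array([1.0, 0.0]), np.full(3, 1.0 / 3.0), cond)
```

In this toy setup, a noun that never co-occurs with the predicted verb receives zero weight after rescoring; a real system would likely smooth the counts to avoid ruling out unseen pairs.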
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Action Observation and Synchronization
