EgoVITA: Learning to Plan and Verify for Egocentric Video Reasoning
Yogesh Kulkarni, Pooyan Fazli

TL;DR
EgoVITA introduces a structured plan-then-verify framework for egocentric video reasoning, improving consistency and grounding in multimodal large language models through a cross-perspective feedback mechanism.
Contribution
The paper proposes a novel plan-then-verify approach for egocentric video understanding, enabling better reasoning without paired ego-exo supervision.
Findings
Achieves state-of-the-art results on egocentric reasoning benchmarks.
Outperforms previous models by +7.7 on EgoBlind and +4.4 on EgoOrient.
Maintains strong generalization with only 47k training samples.
Abstract
Egocentric video understanding requires procedural reasoning under partial observability and continuously shifting viewpoints. Current multimodal large language models (MLLMs) struggle with this setting, often generating plausible but visually inconsistent or weakly grounded responses. We introduce EgoVITA, a framework that decomposes egocentric video reasoning into a structured process. The model first generates an egocentric plan: a causal sequence of anticipated actions from a first-person perspective. This plan is then evaluated by an exocentric verification stage that validates spatiotemporal and logical consistency from a third-person viewpoint. This decomposition enables cross-perspective feedback without requiring paired ego-exo supervision. To train this reasoning process, we adopt Group Relative Policy Optimization…
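The abstract names Group Relative Policy Optimization (GRPO) as the training method but is cut off before saying how it is applied. As a rough illustration of the general technique only, and not of the authors' reward design, the sketch below shows the group-relative advantage step that gives GRPO its name: several responses are sampled for the same prompt, and each response's reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group.

    `rewards` holds one scalar per response sampled for the same prompt.
    `eps` (an assumed constant) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one video question.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and are reinforced, while those below the mean are discouraged, so no separate value network is needed to provide a baseline.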
Taxonomy
Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI)
