EgoVITA: Learning to Plan and Verify for Egocentric Video Reasoning

Yogesh Kulkarni; Pooyan Fazli

arXiv:2511.18242 · cs.CV · March 17, 2026

Open Access

TL;DR

EgoVITA introduces a structured plan-then-verify framework for egocentric video reasoning, improving consistency and grounding in multimodal large language models through a cross-perspective feedback mechanism.

Contribution

The paper proposes a plan-then-verify approach for egocentric video understanding: an egocentric planning stage followed by an exocentric verification stage, which provides cross-perspective feedback without requiring paired ego-exo supervision.

Findings

01

Achieves state-of-the-art results on egocentric reasoning benchmarks.

02

Outperforms previous models by +7.7 on EgoBlind and +4.4 on EgoOrient.

03

Maintains strong generalization with only 47k training samples.

Abstract

Egocentric video understanding requires procedural reasoning under partial observability and continuously shifting viewpoints. Current multimodal large language models (MLLMs) struggle with this setting, often generating plausible but visually inconsistent or weakly grounded responses. We introduce EgoVITA, a framework that decomposes egocentric video reasoning into a structured plan-then-verify process. The model first generates an egocentric plan: a causal sequence of anticipated actions from a first-person perspective. This plan is then evaluated by an exocentric verification stage that validates spatiotemporal and logical consistency from a third-person viewpoint. This decomposition enables cross-perspective feedback without requiring paired ego-exo supervision. To train this reasoning process, we adopt Group Relative Policy Optimization…
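
The abstract is cut off before it describes how Group Relative Policy Optimization (GRPO) is applied, so the sketch below illustrates only the general group-relative advantage step that GRPO is known for, not the authors' training code. The reward values, the function name group_relative_advantages, and the use of a population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response relative to its own group.

    GRPO samples several responses for the same prompt and uses the
    group's mean and standard deviation as the baseline, instead of a
    learned value function. Population std is an assumption here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one video question.
# How the paper actually scores a plan and its verification is not stated
# in this excerpt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged. When all responses in a group score the same, every advantage is zero and that group contributes no gradient signal.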

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI)