EgoVLM: Policy Optimization for Egocentric Video Understanding
Ashwin Vinod, Shrey Pandit, Aditya Vavre, Linshen Liu

TL;DR
EgoVLM is a specialized vision-language model for egocentric video understanding, trained with reinforcement learning to improve reasoning, interpretability, and performance on question answering benchmarks.
Contribution
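We introduce EgoVLM, a domain-specific egocentric video model trained via reinforcement learning without supervised chain-of-thought data. It improves question-answering performance over general-purpose models and produces explicit reasoning traces that make its answers easier to interpret.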
Findings
EgoVLM outperforms general-purpose models on egocentric benchmarks.
Domain-specific training significantly boosts reasoning accuracy.
Explicit reasoning trace generation enhances interpretability.
Abstract
Emerging embodied AI applications, such as wearable cameras and autonomous agents, have underscored the need for robust reasoning from first-person video streams. We introduce EgoVLM, a vision-language model specifically designed to integrate visual comprehension and spatio-temporal reasoning within egocentric video contexts. EgoVLM is fine-tuned via Group Relative Policy Optimization (GRPO), a reinforcement learning method adapted to align model outputs with human-like reasoning steps. Following DeepSeek R1-Zero's approach, we tune the model directly with RL, without any supervised fine-tuning phase on chain-of-thought (CoT) data. We evaluate EgoVLM on egocentric video question answering benchmarks and show that domain-specific training substantially improves performance over general-purpose VLMs. Our EgoVLM-3B, trained exclusively on non-CoT egocentric data, outperforms the base Qwen2.5-VL 3B…
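To make the "group relative" part of GRPO concrete, the sketch below shows how advantages are commonly computed in the standard GRPO formulation: several answers are sampled for the same question, each is scored, and every reward is normalized against the mean and standard deviation of its own group, so no separate value network is needed. This is a minimal illustration, not the authors' training code. The function name, the binary correctness reward, the use of the population standard deviation, and the `eps` constant are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar per sampled answer to the same question.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one video question
# (1.0 = correct answer, 0.0 = incorrect answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Answers that beat the group average get a positive advantage,
# answers below it get a negative one.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

When every sampled answer in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is one reason the choice of reward and group size matters in practice.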
Taxonomy
Topics: Video Coding and Compression Technologies · Identity, Memory, and Therapy · Advanced Data Compression Techniques
Methods: Balanced Selection · ALIGN
