EgoVLM: Policy Optimization for Egocentric Video Understanding

Ashwin Vinod; Shrey Pandit; Aditya Vavre; Linshen Liu

arXiv:2506.03097 · cs.CV · June 4, 2025

TL;DR

EgoVLM is a specialized vision-language model for egocentric video understanding, trained with reinforcement learning to improve reasoning, interpretability, and performance on question answering benchmarks.

Contribution

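We introduce EgoVLM, a vision-language model for egocentric video that is trained directly with reinforcement learning, with no supervised fine-tuning on chain-of-thought data. It outperforms general-purpose VLMs on egocentric benchmarks and produces explicit reasoning traces that make its answers easier to inspect.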

Findings

01. EgoVLM outperforms general-purpose models on egocentric benchmarks.

02. Domain-specific training significantly boosts reasoning accuracy.

03. Explicit reasoning trace generation enhances interpretability.

Abstract

Emerging embodied AI applications, such as wearable cameras and autonomous agents, have underscored the need for robust reasoning from first-person video streams. We introduce EgoVLM, a vision-language model specifically designed to integrate visual comprehension and spatio-temporal reasoning within egocentric video contexts. EgoVLM is fine-tuned via Group Relative Policy Optimization (GRPO), a reinforcement learning method adapted to align model outputs with human-like reasoning steps. Following DeepSeek R1-Zero's approach, we tune directly with RL, without any supervised fine-tuning phase on chain-of-thought (CoT) data. We evaluate EgoVLM on egocentric video question answering benchmarks and show that domain-specific training substantially improves performance over general-purpose VLMs. Our EgoVLM-3B, trained exclusively on non-CoT egocentric data, outperforms the base Qwen2.5-VL 3B…
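
The abstract names GRPO as the training method. As a rough illustration of the "group relative" idea only, the sketch below normalizes the rewards of several answers sampled for the same question against each other. It is not the authors' implementation: the function name, the reward values, the `eps` constant, and the choice of population standard deviation are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to the other answers in its group.

    Assumption: advantage = (reward - group mean) / (group std + eps),
    using the population standard deviation. Real GRPO implementations
    may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four answers sampled for one video question
# (1.0 = correct answer, 0.0 = incorrect answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

In this toy group, correct answers receive a positive advantage and incorrect ones a negative advantage, so the policy update favors the better answers without needing a separate learned value model. When every answer in a group gets the same reward, all advantages are zero and that group contributes no learning signal.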

Code & Models

Repositories

adityavavre/videgovlm (PyTorch, official)

Taxonomy

Topics: Video Coding and Compression Technologies · Identity, Memory, and Therapy · Advanced Data Compression Techniques

Methods: Balanced Selection · ALIGN