VideoSAVi: Self-Aligned Video Language Models without Human Supervision

Yogesh Kulkarni; Pooyan Fazli

arXiv:2412.00624 · cs.CV · August 12, 2025

TL;DR

VideoSAVi introduces a self-training method for video-language models that learns directly from video content without human annotations, improving results on the MVBench, PerceptionTest, and EgoSchema benchmarks.

Contribution

The paper presents a novel self-aligned training pipeline for Video-LLMs that eliminates the need for external supervision or human-annotated data.

Findings

01. Gains +4.2% on MVBench

02. Gains +3.9% on PerceptionTest

03. Gains +6.8% on EgoSchema

Abstract

Recent advances in video-large language models (Video-LLMs) have led to significant progress in video understanding. Current preference optimization methods often rely on proprietary APIs or human-annotated captions to generate preference data (i.e., pairs of model outputs ranked by quality or alignment with human judgment), which is then used to train models for video-language alignment. This approach is both costly and labor-intensive. To address this limitation, we introduce VideoSAVi (Self-Aligned Video Language Model), a self-training pipeline that enables Video-LLMs to learn from video content without external supervision. Our approach includes a self-critiquing mechanism that identifies reasoning errors in the model's initial responses and generates improved alternatives, creating preference pairs directly from video content. VideoSAVi then applies Direct Preference Optimization…
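To make the preference-pair and Direct Preference Optimization step mentioned in the abstract more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The `PreferencePair` class, the `make_pair` helper, the choice to treat the self-critiqued revision as the preferred answer, and the `beta=0.1` default are all assumptions made for illustration. Only the loss follows the standard published DPO formulation, computed here from summed sequence log-probabilities.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """Hypothetical container for one self-generated preference pair."""
    question: str
    chosen: str    # assumed: the revised answer produced after self-critique
    rejected: str  # assumed: the model's initial answer containing the error


def make_pair(question: str, initial: str, revised: str) -> PreferencePair:
    # Assumption: the revision is preferred over the initial response.
    return PreferencePair(question=question, chosen=revised, rejected=initial)


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single pair of sequence log-probabilities."""
    # Implicit rewards: log-ratio of the policy to the frozen reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the policy favors the chosen answer more than
    # the reference model does.
    return -log_sigmoid(beta * (chosen_reward - rejected_reward))


pair = make_pair("What happens after the person opens the door?",
                 initial="They sit down.",
                 revised="They walk into the room and pick up a bag.")
loss_at_start = dpo_loss(-1.0, -1.0, -1.0, -1.0)   # policy equals reference
loss_after_shift = dpo_loss(-1.0, -3.0, -2.0, -2.0)  # policy prefers chosen
```

When the policy and reference assign identical log-probabilities, the loss equals log 2, and it drops below that value once the policy shifts probability toward the chosen response. In practice the log-probabilities would come from a Video-LLM conditioned on the video frames and the question.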

Code & Models

Models

🤗 yogkul2000/VideoSAVi (model, 2 downloads)

Taxonomy

Topics: Natural Language Processing Techniques