AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video

Yogesh Kulkarni; Pooyan Fazli

arXiv:2508.03100·cs.CV·March 31, 2026

TL;DR

AVATAR is an off-policy multimodal reasoning framework with Temporal Advantage Shaping that improves both sample efficiency and accuracy on long-horizon video benchmarks.

Contribution

The paper proposes AVATAR, which combines an off-policy training architecture with Temporal Advantage Shaping (TAS) for better credit assignment, addressing key limitations of prior methods such as Group Relative Policy Optimization (GRPO).

Findings

01 Outperforms the Qwen2.5-Omni baseline by +5.4 on MMVU

02 Achieves +4.9 on OmniBench and +4.5 on Video-Holmes

03 Demonstrates 5× sample efficiency, using 80% fewer generated completions

Abstract

Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optimization (GRPO) have shown promise in this domain, they suffer from three key limitations: (1) data inefficiency from their on-policy design, (2) a vanishing advantage problem, where identical or near-identical rewards within a group eliminate the learning signal by producing zero-valued advantages, and (3) uniform credit assignment that fails to emphasize critical reasoning steps. We introduce AVATAR (Audio-Video Agent for Alignment and Reasoning), a framework that addresses these limitations through two core components: (1) an off-policy training architecture that improves sample efficiency and resolves vanishing…
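
The vanishing advantage problem named in limitation (2) can be illustrated with a small sketch. It assumes the commonly used group-relative normalization, in which each completion's advantage is its reward minus the group mean, divided by the group standard deviation. The function name `group_relative_advantages` and the `eps` stabilizer are illustrative choices, not taken from the paper, and the sketch does not reproduce AVATAR's off-policy remedy or TAS.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled completions.

    Assumed form: (r_i - mean) / (std + eps). The helper name and eps
    are hypothetical; they only serve this illustration.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]


# A group with mixed outcomes yields non-zero advantages (a usable signal).
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every completion earns the same reward yields all zeros,
# so the policy gradient contribution of this group vanishes.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print("mixed:  ", mixed)
print("uniform:", uniform)
```

When every completion in a group is equally right or equally wrong, the centered rewards are all zero, so the group contributes no gradient regardless of how many completions were generated for it.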

Code & Models

Models

Hugging Face: yogkul2000/AVATAR (model, 6 downloads)