TL;DR
AVATAR introduces an off-policy multimodal reasoning framework with Temporal Advantage Shaping, significantly improving sample efficiency and performance on long-horizon video benchmarks.
Contribution
Proposes AVATAR, a framework that pairs an off-policy training architecture with Temporal Advantage Shaping (TAS) for better credit assignment, addressing key limitations of prior methods such as GRPO.
Findings
Outperforms baseline Qwen2.5-Omni by +5.4 on MMVU
Achieves +4.9 on OmniBench and +4.5 on Video-Holmes
Demonstrates 5× greater sample efficiency, requiring 80% fewer generated completions
Abstract
Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optimization (GRPO) have shown promise in this domain, they suffer from three key limitations: (1) data inefficiency from their on-policy design, (2) a vanishing advantage problem, where identical or near-identical rewards within a group eliminate the learning signal by producing zero-valued advantages, and (3) uniform credit assignment that fails to emphasize critical reasoning steps. We introduce AVATAR (Audio-Video Agent for Alignment and Reasoning), a framework that addresses these limitations through two core components: (1) an off-policy training architecture that improves sample efficiency and resolves vanishing…
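The vanishing advantage problem named in the abstract can be illustrated with a minimal sketch. It assumes the commonly described group-normalized advantage, (reward - group mean) / (group std + eps); the function name `group_relative_advantages`, the `eps` value, and the reward lists are hypothetical choices for illustration, and this is not AVATAR's implementation.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group (assumed GRPO-style form)."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation of the group
    # eps only guards against division by zero; it does not restore a signal.
    return [(r - mean) / (std + eps) for r in rewards]


# A group with mixed rewards yields positive and negative advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group with identical rewards yields all-zero advantages, so these
# completions contribute no gradient signal.
identical = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(identical)
```

In the second group every completion earns the same reward, so each advantage is exactly zero regardless of whether the shared reward is high or low.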
