STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering
Emad Bahrami, Olga Zatsarynna, Parth Pathak, Sunando Sengupta, Juergen Gall, Mohsen Fayyaz

TL;DR
STRIVE is a structured reinforcement learning framework that enhances video question answering by constructing multiple spatiotemporal variants of videos, enriching reward signals, and promoting stable policy updates.
Contribution
STRIVE introduces spatiotemporal variant construction and an importance-aware sampling mechanism to improve stability and performance in multimodal reinforcement learning for video reasoning.
Findings
Consistent improvements over baselines on six video reasoning benchmarks.
Enrichment of reward signals through structured visual perturbations.
Enhanced stability and robustness in policy learning.
Abstract
We introduce STRIVE (SpatioTemporal Reinforcement with Importance-aware Variant Exploration), a structured reinforcement learning framework for video question answering. While group-based policy optimization methods have shown promise in large multimodal models, they often suffer from low reward variance when responses exhibit similar correctness, leading to weak or unstable advantage estimates. STRIVE addresses this limitation by constructing multiple spatiotemporal variants of each input video and performing joint normalization across both textual generations and visual variants. By expanding group comparisons beyond linguistic diversity to structured visual perturbations, STRIVE enriches reward signals and promotes more stable and informative policy updates. To ensure exploration remains semantically grounded, we introduce an importance-aware sampling mechanism that prioritizes…
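The low-variance problem described in the abstract can be illustrated with a small numeric sketch. The abstract does not give STRIVE's exact normalization formula, so the code below makes assumptions: per-group advantages are standardized GRPO-style, and "joint normalization" is modeled as standardizing over the pooled rewards of all variants. The function names, the reward values, and the variant labels are all hypothetical, and importance-aware sampling is not modeled.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within a single group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def joint_advantages(rewards_by_variant, eps=1e-6):
    """Standardize rewards over the pool of all variants and responses.

    Assumed form of joint normalization; the paper's formula may differ.
    """
    pooled = [r for group in rewards_by_variant for r in group]
    mu = mean(pooled)
    sigma = pstdev(pooled)
    return [[(r - mu) / (sigma + eps) for r in group]
            for group in rewards_by_variant]


# Hypothetical 0/1 correctness rewards: 3 video variants x 4 responses each.
rewards_by_variant = [
    [1.0, 1.0, 1.0, 1.0],  # original video: every response correct
    [1.0, 0.0, 1.0, 0.0],  # hypothetical variant A: mixed correctness
    [0.0, 0.0, 0.0, 0.0],  # hypothetical variant B: every response wrong
]

# Normalizing each group alone: uniform groups yield all-zero advantages.
per_group = [group_advantages(g) for g in rewards_by_variant]

# Normalizing jointly: uniform groups still receive a non-zero signal.
joint = joint_advantages(rewards_by_variant)
```

In this toy setup, the first and third groups contribute no learning signal when normalized on their own, because every response in the group has the same reward. Pooling across variants gives those same responses positive and negative advantages respectively, which is the kind of enriched comparison the abstract describes.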
