TL;DR
ROVA is a training framework that enhances video reasoning model robustness against real-world disturbances by adaptive difficulty-aware training and a new benchmark, PVRBench.
Contribution
The paper introduces ROVA, a novel robustness-aware training method, and PVRBench, a benchmark for evaluating video reasoning under real-world perturbations.
Findings
ROVA reduces accuracy and reasoning drops by up to 35% and 28% under disturbances.
ROVA improves accuracy by at least 24% and reasoning by over 9% compared to baselines.
Performance gains from ROVA transfer to standard benchmarks, showing broad effectiveness.
Abstract
In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
