Are Video Reasoning Models Ready to Go Outside?

Yangfan He; Changgyu Boo; Jaehong Yoon

arXiv:2603.10652 · cs.CV · April 15, 2026

TL;DR

ROVA is a training framework that makes video reasoning models more robust to real-world disturbances through adaptive, difficulty-aware training. The paper also introduces PVRBench, a new benchmark for measuring that robustness.

Contribution

The paper introduces ROVA, a novel robustness-aware training method, and PVRBench, a benchmark for evaluating video reasoning under real-world perturbations.

Findings

01

Under disturbances, ROVA reduces the drop in accuracy by up to 35% and the drop in reasoning quality by up to 28%.

02

ROVA improves accuracy by at least 24% and reasoning by over 9% compared to baselines.

03

Performance gains from ROVA transfer to standard benchmarks, showing broad effectiveness.

Abstract

In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied…
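The abstract describes two mechanisms: a consistency reward computed under corruption, and an online strategy that re-estimates sample difficulty and prioritizes informative samples. The following is a minimal sketch of how such a loop could be wired together, not the authors' implementation. Every name here (`Sample`, `consistency_reward`, `update_difficulty`, `sampling_weight`, `sample_batch`), the reward formula, the moving-average update, and the choice to favor mid-difficulty samples are assumptions made for illustration; the paper and the repository are the authority on the actual design.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical training item with a running difficulty estimate."""
    sample_id: str
    difficulty: float = 0.5  # assumed range [0, 1]; 1.0 means "always fails"


def consistency_reward(clean_answer: str, perturbed_answer: str,
                       gold: str, alpha: float = 0.5) -> float:
    """Assumed reward: mix of correctness on the perturbed video and
    agreement between the clean and perturbed answers."""
    perturbed_correct = 1.0 if perturbed_answer == gold else 0.0
    agrees_with_clean = 1.0 if perturbed_answer == clean_answer else 0.0
    return alpha * perturbed_correct + (1.0 - alpha) * agrees_with_clean


def update_difficulty(sample: Sample, reward: float,
                      momentum: float = 0.1) -> None:
    """Assumed update: exponential moving average of (1 - reward)."""
    sample.difficulty = (1.0 - momentum) * sample.difficulty \
        + momentum * (1.0 - reward)


def sampling_weight(difficulty: float, eps: float = 1e-3) -> float:
    """Assumed priority: highest for mid-difficulty samples, which are
    neither already solved nor currently out of reach."""
    return difficulty * (1.0 - difficulty) + eps


def sample_batch(samples: list[Sample], batch_size: int,
                 rng: random.Random) -> list[Sample]:
    """Draw a batch with replacement, weighted by the priority above."""
    weights = [sampling_weight(s.difficulty) for s in samples]
    return rng.choices(samples, weights=weights, k=batch_size)
```

In this sketch, each training step would draw a batch with `sample_batch`, score the model's answers on clean and perturbed versions of each video with `consistency_reward`, and feed the reward back through `update_difficulty` so that the sampling distribution tracks the model's current capability.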

Code & Models

Repositories

codepassionor/ROVA (GitHub)
