Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning
Dazhao Du, Jian Liu, Jialong Qin, Tao Han, Bohai Gu, Fangqi Zhu, Yujia Zhang, Eric Liu, Xi Chen, Song Guo

TL;DR
This paper introduces CRPO, a reinforcement learning framework that enhances spatiotemporal sensitivity in Video LLMs by using counterfactual videos and a novel reward, evaluated on a new benchmark.
Contribution
The paper proposes a dual-branch RL method with counterfactual data augmentation and a relation reward to improve spatiotemporal understanding in Video LLMs.
Findings
CRPO outperforms prior RL methods on spatiotemporal-sensitive benchmarks.
CRPO improves model sensitivity to dynamic video aspects without sacrificing static performance.
The DyBench benchmark effectively measures spatiotemporal sensitivity in videos.
Abstract
Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. This issue is exacerbated in RL post-training, where correctness-only rewards can further reinforce shortcut policies that obtain high reward without tracking video dynamics. We address this by asking a controlled counterfactual question: if the visual world changed while the question remained fixed, should the answer change or stay the same? Based on this view, we propose Counterfactual Relational Policy Optimization (CRPO), a dual-branch RL framework for improving spatiotemporal sensitivity. CRPO constructs counterfactual videos through horizontal flips and temporal reversals, trains on both original and counterfactual branches, and introduces a…
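The two counterfactual transformations named in the abstract, horizontal flips and temporal reversals, can be illustrated with a minimal sketch. This is not the authors' implementation: the nested-list video layout (frames, then rows, then pixels), the function names, and the toy `object_columns` tracker are all assumptions made here for illustration.

```python
def horizontal_flip(video):
    """Mirror every frame left-to-right; frame order is unchanged."""
    return [[row[::-1] for row in frame] for frame in video]


def temporal_reverse(video):
    """Play the frames in reverse order; frame contents are unchanged."""
    return [frame for frame in reversed(video)]


def object_columns(video):
    """Hypothetical toy tracker: column of each nonzero pixel, frame by frame."""
    cols = []
    for frame in video:
        for row in frame:
            for x, value in enumerate(row):
                if value:
                    cols.append(x)
    return cols


# Toy clip (assumed layout): 3 frames of 1x3 pixels, a dot moving left to right.
video = [
    [[1, 0, 0]],
    [[0, 1, 0]],
    [[0, 0, 1]],
]

flipped = horizontal_flip(video)
reversed_clip = temporal_reverse(video)

print(object_columns(video))          # dot positions in the original clip
print(object_columns(flipped))        # dot positions after the horizontal flip
print(object_columns(reversed_clip))  # dot positions after temporal reversal
```

In this toy clip, both transformations turn a left-to-right motion into a right-to-left one, so the answer to a question like "which direction does the dot move?" should change, while the answer to "is there a dot in the clip?" should stay the same. A single frame taken alone cannot distinguish the original from the temporally reversed clip, which is the kind of shortcut the abstract describes.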
