Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
Yuan Wang, Ouxiang Li, Yulong Xu, Borui Liao, Jiajun Liang, Jinghan Li, Meng Wang, Xintao Wang, Pengfei Wan, Kuien Liu, Xiang Wang

TL;DR
This paper introduces DeScore, a decoupled 'think-then-score' video reward model that improves generalization and training efficiency by separating reasoning and scoring processes.
Contribution
DeScore leverages a decoupled paradigm with explicit reasoning generation and dedicated scoring, enhancing robustness and interpretability over coupled models.
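As a rough illustration of the decoupled interface described above, the following is a minimal sketch, not the authors' implementation. The names `think_then_score`, `RewardOutput`, `toy_think`, and `toy_score` are hypothetical, and the toy functions are trivial stand-ins for what would be a reasoning model and a dedicated scoring component. It only shows the control flow: the rationale is produced first, and the score is computed in a separate step that takes that rationale as input.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RewardOutput:
    rationale: str  # explicit reasoning text from the "think" step
    score: float    # scalar reward from the separate "score" step


def think_then_score(
    video: Sequence[float],
    prompt: str,
    think: Callable[[Sequence[float], str], str],
    score: Callable[[Sequence[float], str, str], float],
) -> RewardOutput:
    """Run reasoning first, then score conditioned on that reasoning."""
    rationale = think(video, prompt)          # step 1: generate reasoning
    reward = score(video, prompt, rationale)  # step 2: dedicated scoring
    return RewardOutput(rationale=rationale, score=reward)


# Hypothetical toy stand-ins: a "video" is just a list of per-frame
# quality values in [0, 1]; real components would be learned models.
def toy_think(video: Sequence[float], prompt: str) -> str:
    mean_quality = sum(video) / len(video)
    verdict = "consistent" if mean_quality >= 0.5 else "inconsistent"
    return f"The clip looks {verdict} with the prompt '{prompt}'."


def toy_score(video: Sequence[float], prompt: str, rationale: str) -> float:
    # The scorer reads the rationale instead of re-deriving the judgment.
    return 0.0 if "inconsistent" in rationale else 1.0


if __name__ == "__main__":
    result = think_then_score([0.9, 0.8], "a cat jumping", toy_think, toy_score)
    print(result.rationale)
    print(result.score)
```

Because the two steps are passed in as separate callables, either one can be swapped or trained independently in this sketch, which is the property the decoupled paradigm is meant to provide.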
Findings
DeScore achieves better alignment with human preferences.
The two-stage training improves reasoning quality and reward calibration.
DeScore demonstrates superior generalization in video reward modeling.
Abstract
Recent advances in generative video models are increasingly driven by post-training and test-time scaling, both of which critically depend on the quality of video reward models (RMs). An ideal reward model should predict accurate rewards that align with human preferences across diverse scenarios. However, existing paradigms face a fundamental dilemma: Discriminative RMs regress rewards directly on features extracted by multimodal large language models (MLLMs) without explicit reasoning, making them prone to shortcut learning and heavily reliant on massive data scaling for generalization. In contrast, Generative RMs with Chain-of-Thought (CoT) reasoning exhibit superior interpretability and generalization potential, as they leverage fine-grained semantic supervision to internalize the rationales behind human preferences. However, they suffer from inherent optimization…
