RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time
Haozhe Wang, Cong Wei, Weiming Ren, Jiaming Liu, Fangzhen Lin, Wenhu Chen

TL;DR
RationalRewards is a reward model for visual generation that produces explicit, multi-dimensional critiques before scoring. The structured reasoning makes training-time rewards interpretable and drives output refinement at test time.
Contribution
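The paper introduces Preference-Anchored Rationalization (PARROT), a framework that recovers high-quality rationales from readily available preference data, so the reward model can be trained without costly rationale annotations. This enables reward modeling with less data and better generator performance.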
Findings
RationalRewards achieves state-of-the-art preference prediction among open-source models.
The critique-refine loop improves generator outputs without parameter updates.
Structured reasoning unlocks latent capabilities in visual generators.
Abstract
Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency…
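The Generate-Critique-Refine loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Critique` fields, the dimension names, the mean aggregation, the `target` threshold, and the `generate` and `critique` callables are all hypothetical stand-ins for a real generator and reward model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Critique:
    """Hypothetical container for one multi-dimensional critique."""
    scores: Dict[str, float]   # per-dimension scores; dimension names are assumptions
    rationale: str             # free-text reasoning written before scoring
    revised_prompt: str        # targeted prompt revision suggested by the critic

    def overall(self) -> float:
        # Assumption: a plain mean; the paper's aggregation is not given here.
        return sum(self.scores.values()) / len(self.scores)


def generate_critique_refine(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Critique],
    max_rounds: int = 3,
    target: float = 0.9,
) -> Tuple[str, float]:
    """Refine the prompt using critiques; no generator parameters are updated."""
    best_output, best_score = "", float("-inf")
    current_prompt = prompt
    for _ in range(max_rounds):
        output = generate(current_prompt)
        # Always judge against the ORIGINAL prompt, i.e. the user's intent.
        result = critique(prompt, output)
        score = result.overall()
        if score > best_score:
            best_output, best_score = output, score
        if score >= target:
            break
        current_prompt = result.revised_prompt
    return best_output, best_score


# Toy stand-ins so the sketch runs without any model.
def toy_generate(p: str) -> str:
    return f"image({p})"


def toy_critique(original: str, output: str) -> Critique:
    ok = "detailed" in output
    scores = {"faithfulness": 1.0 if ok else 0.4, "quality": 1.0 if ok else 0.6}
    rationale = "fine" if ok else "needs more detail"
    return Critique(scores, rationale, original + ", detailed")


best, score = generate_critique_refine("a cat", toy_generate, toy_critique)
print(best, score)
```

Keeping the best-scoring output rather than the last one guards against a revision that makes things worse. That is a design choice of this sketch, not something the source states.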
