Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

Gongye Liu; Bo Yang; Yida Zhi; Zhizhou Zhong; Lei Ke; Didan Deng; Han Gao; Yongxiang Huang; Kaihao Zhang; Hongbo Fu; Wenhan Luo

arXiv:2602.11146·cs.CV·February 12, 2026

TL;DR

This paper introduces DiNa-LRM, a diffusion-native latent reward model that learns preferences directly on noisy diffusion states. It outperforms existing diffusion-based reward baselines and performs comparably to state-of-the-art VLM rewards at lower computational cost.

Contribution

The paper proposes a novel diffusion-native reward model that formulates preference learning directly on diffusion states, reducing reliance on costly vision-language models.

Findings

01. DiNa-LRM outperforms existing diffusion-based reward baselines.
02. DiNa-LRM achieves performance comparable to state-of-the-art VLMs.
03. The method enables faster and more resource-efficient model alignment.

Abstract

Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a…
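The abstract names two mechanisms that a small example may make more concrete: a Thurstone preference likelihood whose uncertainty depends on the diffusion noise level, and inference-time noise ensembling. The sketch below is illustrative only and is not the paper's implementation. The linear uncertainty schedule `noise_std`, the flow-matching style interpolation used for noising, and every function name are assumptions introduced here. It uses the generic Thurstone form, in which the probability that sample a is preferred over sample b is Φ((r_a − r_b) / sqrt(σ_a² + σ_b²)).

```python
import math
import random


def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def noise_std(t: float, base: float = 0.5, slope: float = 1.0) -> float:
    """Hypothetical uncertainty schedule (an assumption, not from the paper):
    the reward's standard deviation grows linearly with noise level t in [0, 1]."""
    return base + slope * t


def thurstone_nll(r_win: float, r_lose: float, t_win: float, t_lose: float) -> float:
    """Negative log-likelihood that the preferred sample wins under a
    Thurstone model with noise-level-dependent variance."""
    var = noise_std(t_win) ** 2 + noise_std(t_lose) ** 2
    p_win = normal_cdf((r_win - r_lose) / math.sqrt(var))
    return -math.log(max(p_win, 1e-12))  # clamp to avoid log(0)


def ensemble_reward(reward_fn, latent, noise_levels, rng):
    """Average a timestep-conditioned reward over several noised copies of a latent.

    Noising uses x_t = (1 - t) * x + t * eps, a flow-matching style
    interpolation chosen here purely for illustration.
    """
    total = 0.0
    for t in noise_levels:
        noisy = [(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in latent]
        total += reward_fn(noisy, t)
    return total / len(noise_levels)


def toy_reward(latent, t):
    """Stand-in for a learned reward head: the mean of the latent values."""
    return sum(latent) / len(latent)


if __name__ == "__main__":
    rng = random.Random(0)
    # The same reward gap is weaker evidence when both states are noisier.
    print(thurstone_nll(1.0, 0.0, 0.0, 0.0))
    print(thurstone_nll(1.0, 0.0, 1.0, 1.0))
    print(ensemble_reward(toy_reward, [1.0, 2.0, 3.0], [0.1, 0.3, 0.5], rng))
```

Under this assumed schedule, a fixed reward gap yields a larger loss at high noise than at low noise, so comparisons made on heavily noised states are treated as less decisive. Averaging over several noise levels trades extra reward-head evaluations for a lower-variance score.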

Code & Models

Models

🤗 liuhuohuo/DiNa-LRM-SD35M-12layers (model, 2 downloads)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications