Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences
Zhuoran Jin, Hongbang Yuan, Kejian Zhu, Jiachun Li, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao

TL;DR
This paper introduces Omni-Reward, a comprehensive omni-modal reward modeling framework supporting diverse modalities and free-form preferences, aiming to improve AI alignment with human values across various data types.
Contribution
The paper presents Omni-RewardBench (a benchmark), Omni-RewardData (a preference dataset), and Omni-RewardModel (a reward model), which together target generalist omni-modal reward modeling across multiple modalities and free-form preferences.
Findings
Omni-RewardModel outperforms existing reward models on multiple benchmarks.
Omni-RewardData contains 248K general preference pairs and 69K instruction-tuning pairs.
Omni-RewardBench covers nine tasks across five modalities (text, image, video, audio, and 3D).
Abstract
Reward models (RMs) play a critical role in aligning AI behaviors with human preferences, yet they face two fundamental challenges: (1) Modality Imbalance, where most RMs are mainly focused on text and image modalities, offering limited support for video, audio, and other modalities; and (2) Preference Rigidity, where training on fixed binary preference pairs fails to capture the complexity and diversity of personalized preferences. To address the above challenges, we propose Omni-Reward, a step toward generalist omni-modal reward modeling with support for free-form preferences, consisting of: (1) Evaluation: We introduce Omni-RewardBench, the first omni-modal RM benchmark with free-form preferences, covering nine tasks across five modalities including text, image, video, audio, and 3D; (2) Data: We construct Omni-RewardData, a multimodal preference dataset comprising 248K general…
Peer Reviews
Decision·ICLR 2026 Oral
The benchmark is notably broad (five modalities and nine tasks) with human-labeled pairs and explicit free-form criteria, evaluated under both w/ ties and w/o ties settings, which is a clear advance over prior suites. The dataset combines large general preference pools with an instruction-tuning subset whose criteria/labels are multi-model verified, improving criterion-following. Empirically, the study is broad, and the BT variant achieves strong accuracy on the new bench and competitive results
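The two evaluation settings mentioned in this review can be illustrated with a small sketch. The review does not spell out the protocol, so the code assumes one plausible reading: in the "w/ ties" setting a tie is a third label the model must predict, and in the "w/o ties" setting pairs whose gold label is a tie are dropped before scoring. The label strings and function names are hypothetical.

```python
from typing import Sequence

TIE = "tie"  # hypothetical label meaning "neither response is better"


def accuracy_with_ties(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Three-way accuracy: the prediction must match the gold label, ties included."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def accuracy_without_ties(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Two-way accuracy: pairs whose gold label is a tie are dropped (assumed protocol)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be of equal length")
    kept = [(g, p) for g, p in zip(gold, pred) if g != TIE]
    if not kept:
        raise ValueError("no non-tie pairs to score")
    return sum(g == p for g, p in kept) / len(kept)


# Toy labels: "a" / "b" name the preferred response, "tie" means no preference.
gold = ["a", "b", "tie", "a"]
pred = ["a", "b", "a", "b"]

print(accuracy_with_ties(gold, pred))     # 2 of 4 pairs match
print(accuracy_without_ties(gold, pred))  # 2 of the 3 non-tie pairs match
```

Under this reading, the same predictions can score differently in the two settings, which is why a benchmark would report both.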
Human annotations come from a small PhD group; after quality control, disagreements are removed, yielding consensus-only labels that may under-sample ambiguous cases. Some modality imbalance remains in distribution. The instruction-tuning split relies on GPT-4o-generated criteria/labels which could encode stylistic biases (though mitigated via verification). Finally, beyond the CoT ablation, the paper would benefit from richer qualitative error analyses across modalities to illuminate failure mo
* This work proposes an omni-reward model that covers 5 modalities. While multi-modal RMs are not new [1-2], no prior work to my knowledge also covers audio and 3D modality tasks, which is novel and a valuable contribution. The authors also contextualize their work with respect to these prior works (L309-312).
* The paper is well written and easy to follow.
* The dataset contribution is valuable to train multi-modal reward models, which the authors demonstrate via the Omni-RewardModel.
* The trai
This work has no significant weaknesses in my view. I believe the benchmark, dataset, and trained reward models will be valuable for the preference learning and alignment community. I believe this paper is nearly of spotlight / oral quality, and is only lacking in contextualization to an important area for reward modeling: heterogeneous preference modeling. There are several recent works that move beyond binary preference pairs and homogeneous modeling, though they do not use fully free-form pre
1. One of the paper's most valuable contributions is its clear definition of "Preference Rigidity". Existing RMs typically learn a static reward function $r(x, y)$. This paper proposes transforming it into a dynamic, controllable function $r(x, y | c)$, where $c$ is a free-form text criterion.
2. Omni-RewardBench is a direct and successful implementation of the problem defined above. It is the first benchmark to systematically evaluate RMs
3. The 69K instruction-tuning dataset built to address
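The shift from a static $r(x, y)$ to a criterion-conditioned $r(x, y | c)$ can be made concrete with a toy sketch. This is not the paper's model: `toy_reward` is a hypothetical hand-written scorer standing in for a learned reward model, and the two criterion strings are invented for illustration. The Bradley-Terry (BT) probability and loss are the standard pairwise formulation; whether the paper's BT variant uses exactly this loss is an assumption.

```python
import math


def toy_reward(prompt: str, response: str, criterion: str) -> float:
    """Hypothetical stand-in for r(x, y | c); a real RM would be a learned model.

    The prompt is unused here because the toy criteria only look at length.
    """
    n_words = len(response.split())
    if criterion == "prefer concise answers":
        return -float(n_words)  # shorter responses score higher
    if criterion == "prefer detailed answers":
        return float(n_words)   # longer responses score higher
    return 0.0                  # unknown criterion: no preference


def bt_prob(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over response B."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))


def bt_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of the observed preference under Bradley-Terry."""
    return -math.log(bt_prob(score_chosen, score_rejected))


prompt = "What is the capital of France?"
short = "Paris."
long_ = "The capital of France is Paris, a city on the Seine."

for c in ("prefer concise answers", "prefer detailed answers"):
    p_short = bt_prob(toy_reward(prompt, short, c), toy_reward(prompt, long_, c))
    print(f"{c}: P(short preferred) = {p_short:.5f}")
```

The same response pair is ranked in opposite orders under the two criteria, which is the behavior a static $r(x, y)$ cannot express.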
1. The model achieves SOTA on VL-RewardBench. Given that the training data (Table 7) includes large datasets like RLAIF-V, was any data contamination check performed to ensure that (near) duplicates from VL-RewardBench or Multimodal RewardBench were not present in the training set?
2. The authors' own Omni-RewardData (Table 7) appears to only contain data for T2T, TI2T, T2I, and T2V tasks. The key modalities used to demonstrate this gap in the benchmark (T2A, T23D, TI2I) are absent from the train
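The contamination check raised in the first question could take many forms. As a minimal sketch of one generic approach (not something the source says the authors ran), the code below flags text prompts whose word n-gram sets overlap heavily, using Jaccard similarity. The function names, the 3-gram size, and the 0.8 threshold are all assumptions chosen for illustration, and the check covers text only, not images or other modalities.

```python
from typing import List, Set, Tuple


def shingles(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams; short texts collapse to a single shingle."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Jaccard similarity of two shingle sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_near_duplicates(train: List[str], bench: List[str],
                         threshold: float = 0.8, n: int = 3) -> List[Tuple[int, int]]:
    """Return (train_index, bench_index) pairs whose similarity meets the threshold."""
    bench_sets = [shingles(b, n) for b in bench]
    hits = []
    for i, t in enumerate(train):
        t_set = shingles(t, n)
        for j, b_set in enumerate(bench_sets):
            if jaccard(t_set, b_set) >= threshold:
                hits.append((i, j))
    return hits


# Invented example prompts, not drawn from any real dataset.
train_prompts = ["Describe the image in detail please", "What color is the car"]
bench_prompts = ["describe the image in detail please", "How many dogs are there"]

print(find_near_duplicates(train_prompts, bench_prompts))
```

This brute-force comparison is quadratic in the number of prompts; at the scale of hundreds of thousands of pairs, an approximate method such as MinHash with locality-sensitive hashing is the usual substitute.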
