Similarity as Reward Alignment: Robust and Versatile Preference-based Reinforcement Learning
Sara Rajaram, R. James Cotton, Fabian H. Sinz

TL;DR
This paper introduces SARA, a contrastive framework for preference-based reinforcement learning that is robust to label noise and adaptable to various feedback formats and training settings, improving alignment with human intent.
Contribution
SARA is a novel contrastive approach that learns latent representations to compute rewards, enhancing robustness and versatility in preference-based RL.
Findings
Outperforms baselines on continuous control offline RL benchmarks.
Demonstrates versatility in trajectory filtering, cross-task transfer, and reward shaping.
Resilient to noisy labels and adaptable to diverse feedback formats.
Abstract
Preference-based Reinforcement Learning (PbRL) entails a variety of approaches for aligning models with human intent to alleviate the burden of reward engineering. However, most previous PbRL work has not investigated the robustness to labeler errors, inevitable with labelers who are non-experts or operate under time constraints. Additionally, PbRL algorithms often target very specific settings (e.g. pairwise ranked preferences or purely offline learning). We introduce Similarity as Reward Alignment (SARA), a simple contrastive framework that is both resilient to noisy labels and adaptable to diverse feedback formats and training paradigms. SARA learns a latent representation of preferred samples and computes rewards as similarities to the learned latent. We demonstrate strong performance compared to baselines on continuous control offline RL benchmarks. We further demonstrate SARA's…
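The idea of computing rewards as similarities to a latent of preferred samples can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `cosine_similarity_rewards`, the random toy embeddings, the choice of cosine similarity, and the use of a simple mean of preferred embeddings as the prototype are all assumptions made for the example. In SARA the latent is learned with a contrastive objective, which this sketch does not reproduce.

```python
import numpy as np


def cosine_similarity_rewards(embeddings, preferred_latent, eps=1e-8):
    """Score each embedding by its cosine similarity to a preferred latent.

    embeddings: array of shape (n, d), one row per sample.
    preferred_latent: array of shape (d,), the prototype of preferred samples.
    Returns an array of shape (n,) with values in [-1, 1].
    """
    embeddings = np.asarray(embeddings, dtype=float)
    preferred_latent = np.asarray(preferred_latent, dtype=float)
    dots = embeddings @ preferred_latent
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(preferred_latent)
    return dots / (norms + eps)  # eps guards against division by zero


# Toy stand-ins for encoder outputs (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
preferred = rng.normal(loc=1.0, size=(32, 8))
non_preferred = rng.normal(loc=-1.0, size=(32, 8))

# Assumption: a plain mean replaces the learned latent for illustration only.
prototype = preferred.mean(axis=0)

rewards_pref = cosine_similarity_rewards(preferred, prototype)
rewards_nonpref = cosine_similarity_rewards(non_preferred, prototype)
```

In this toy setup the preferred samples receive higher rewards on average than the non-preferred ones, which is the property a downstream RL algorithm would rely on.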
Peer Reviews
Decision: Submitted to ICLR 2026
- The paper addresses an important problem in the PbRL literature: the quality of human feedback can vary.
- The proposed method outperforms the baselines when label noise is introduced.
- The detailed design choices of the proposed method seem somewhat arbitrary. For instance, in Step 1 of Figure 1, the method splits the positive and negative embeddings into two subsets. However, it would also be possible to divide them into m subsets, where m is smaller than the batch size, or to omit the second encoder entirely and instead use a simple aggregate statistic—such as the mean or median—of the positive embeddings as z*_p. The authors should provide explicit justification for these
- Robustness: Maintains stable performance even with high label noise (10–40%).
- Simplicity: No Bradley–Terry modeling or pairwise loss; uses a single contrastive objective.
- Empirical Performance: Outperforms or matches strong PbRL baselines across D4RL benchmarks.
- The fixed prototype $z_p^*$ might drift if new or biased preference data are added; continual adaptation requires full retraining.
- There is no analysis of how performance scales with the number of preference pairs or trajectories. We don’t know whether SARA needs more data than BT to learn a stable prototype, or how small-sample performance behaves.
This paper proposes a simple method that makes PbRL more robust when preference labels contain noise. The approach is practical in real settings where human annotators often make mistakes.
The experimental evaluation is limited, as standard offline PbRL studies usually include a broader set of tasks. In addition, some more recent baselines are missing. These gaps in the experimental design raise concerns about the reliability and generality of the reported performance.
Taxonomy
Topics: Advanced Research in Systems and Signal Processing
