Trust, Don't Trust, or Flip: Robust Preference-Based Reinforcement Learning with Multi-Expert Feedback

Seyed Amir Hosseini; Maryam Abdolali; Amirhosein Tavakkoli; Fardin Ayar; Ehsan Javanmardi; Manabu Tsukada; Mahdi Javanmardi

arXiv:2601.18751 · cs.LG · January 27, 2026


TL;DR

This paper introduces TriTrust-PBRL, a robust preference-based reinforcement learning framework that jointly learns reward models and trust parameters to handle multi-expert feedback, including adversarial sources, with theoretical guarantees and empirical success.

Contribution

The paper proposes a unified method that automatically identifies and mitigates adversarial preferences in multi-expert feedback during reinforcement learning.

Findings

1. Achieves state-of-the-art robustness against adversarial noise.
2. Successfully learns from mixed pools of reliable and adversarial experts.
3. Maintains near-oracle performance under various corruption scenarios.

Abstract

Preference-based reinforcement learning (PBRL) offers a promising alternative to explicit reward engineering by learning from pairwise trajectory comparisons. However, real-world preference data often comes from heterogeneous annotators with varying reliability: some accurate, some noisy, and some systematically adversarial. Existing PBRL methods either treat all feedback equally or attempt to filter out unreliable sources, but both approaches fail when faced with adversarial annotators who systematically provide incorrect preferences. We introduce TriTrust-PBRL (TTP), a unified framework that jointly learns a shared reward model and expert-specific trust parameters from multi-expert preference feedback. The key insight is that trust parameters naturally evolve during gradient-based optimization to be positive (trust), near zero (ignore), or negative (flip), enabling the model to…
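The abstract does not give the exact likelihood, so what follows is a minimal sketch under an assumed model, not the authors' TriTrust-PBRL implementation. It assumes each expert k has a scalar trust α_k that multiplies the reward difference inside a Bradley-Terry logit, P_k(σ1 ≻ σ2) = sigmoid(α_k (R(σ1) − R(σ2))), with a linear reward R(σ) = w · φ(σ). Gradient descent on the joint binary cross-entropy then illustrates the trust / ignore / flip behaviour described above. The helper names (`simulate_expert`, `fit_trust_pbrl`), the linear reward, the synthetic "honest", "random" and "adversarial" experts, and all hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical sketch of trust-weighted PBRL (assumed model, not the paper's code):
#   P_k(sigma_1 > sigma_2) = sigmoid(alpha_k * (R(sigma_1) - R(sigma_2))),  R(sigma) = w . phi(sigma)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def simulate_expert(rng, w_true, n_pairs, kind):
    """Synthetic annotator; returns feature differences x and labels y (1 = sigma_1 preferred)."""
    dim = w_true.shape[0]
    # x = phi(sigma_1) - phi(sigma_2) for randomly drawn segment features
    x = rng.normal(size=(n_pairs, dim)) - rng.normal(size=(n_pairs, dim))
    honest = (rng.random(n_pairs) < sigmoid(x @ w_true)).astype(float)
    if kind == "honest":
        y = honest
    elif kind == "adversarial":
        y = 1.0 - honest  # systematically reverses the preference
    elif kind == "random":
        y = (rng.random(n_pairs) < 0.5).astype(float)  # uninformative coin flips
    else:
        raise ValueError(f"unknown expert kind: {kind}")
    return x, y


def fit_trust_pbrl(data, dim, lr=0.1, steps=3000):
    """Jointly fit shared reward weights w and per-expert trust alpha by gradient descent."""
    w = np.zeros(dim)           # shared linear reward, started at zero
    alpha = np.ones(len(data))  # every expert starts out trusted
    for _ in range(steps):
        grad_w = np.zeros(dim)
        grad_alpha = np.zeros(len(data))
        for k, (x, y) in enumerate(data):
            s = x @ w                      # reward difference R(sigma_1) - R(sigma_2)
            g = sigmoid(alpha[k] * s) - y  # per-pair derivative of BCE w.r.t. the logit
            grad_w += alpha[k] * (g @ x) / len(y)
            grad_alpha[k] = (g @ s) / len(y)
        # Both gradients were computed from the same (w, alpha) before this update
        w -= lr * grad_w
        alpha -= lr * grad_alpha
    return w, alpha


rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
kinds = ["honest", "honest", "random", "adversarial"]
data = [simulate_expert(rng, w_true, 500, kind) for kind in kinds]
w, alpha = fit_trust_pbrl(data, dim=3)
for kind, a in zip(kinds, alpha):
    print(f"{kind:12s} alpha = {a:+.2f}")
```

Under this assumed model the likelihood is unchanged if w and every α_k flip sign together, so something must pick the orientation. In this toy setup, two honest experts outnumber the adversary. Starting from w = 0 and every α_k = 1, the honest majority sets the reward direction first. The adversarial expert's trust is then pushed negative, and the random expert's trust decays toward zero.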


Taxonomy

Topics: Adversarial Robustness in Machine Learning · Reinforcement Learning in Robotics · Mobile Crowdsensing and Crowdsourcing