Regret Bounds for Reinforcement Learning from Multi-Source Imperfect Preferences

Ming Shi, Yingbin Liang, Ness B. Shroff, and Ananthram Swami

arXiv:2603.20453 · cs.LG · April 3, 2026

Abstract

Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference labels are generated consistently from a single ground-truth objective. In practical RLHF systems, however, feedback is typically multi-source (annotators, experts, reward models, heuristics) and can exhibit systematic, persistent mismatches due to subjectivity, expertise variation, and annotation/modeling artifacts. We study episodic RL from multi-source imperfect preferences through a cumulative imperfection budget: for each source, the total deviation of its preference probabilities from an ideal oracle is at most $\omega$ over $K$ episodes. We propose a unified algorithm with regret $\tilde{O}(K/M + \omega)$, which exhibits a best-of-both-regimes behavior: it achieves $M$-dependent…
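To make the cumulative imperfection budget concrete, the following is a minimal sketch, not the paper's algorithm. It assumes that the per-episode deviation is the absolute difference between a source's preference probability and the oracle's, and that the budget check is applied to each source separately; the abstract does not specify the exact deviation measure. The function names `cumulative_imperfection` and `within_budget` and the sample numbers are hypothetical.

```python
from typing import Sequence


def cumulative_imperfection(
    oracle_probs: Sequence[float], source_probs: Sequence[float]
) -> float:
    """Total deviation of one source from the oracle over all episodes.

    ASSUMPTION: deviation per episode is |source - oracle|; the paper may
    use a different measure.
    """
    if len(oracle_probs) != len(source_probs):
        raise ValueError("oracle and source must cover the same episodes")
    return sum(abs(s - o) for s, o in zip(source_probs, oracle_probs))


def within_budget(
    oracle_probs: Sequence[float],
    sources: Sequence[Sequence[float]],
    omega: float,
) -> bool:
    """Check that every source individually stays within the budget omega."""
    return all(
        cumulative_imperfection(oracle_probs, s) <= omega for s in sources
    )


# Illustrative numbers only: K = 4 episodes, two feedback sources.
oracle = [0.5, 0.75, 0.25, 1.0]
source_a = [0.5, 0.5, 0.25, 1.0]   # deviates in one episode
source_b = [0.0, 0.75, 0.75, 1.0]  # deviates in two episodes

dev_a = cumulative_imperfection(oracle, source_a)
dev_b = cumulative_imperfection(oracle, source_b)
```

Under this reading, a source that is badly wrong in a few episodes and a source that is slightly wrong in many episodes can consume the same budget, since only the total over the $K$ episodes is constrained.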
