Learning from Imperfect Human Feedback: a Tale from Corruption-Robust Dueling
Yuwei Cheng, Fan Yao, Xuefeng Liu, Haifeng Xu

TL;DR
This paper introduces a new framework for learning from imperfect human feedback in dueling bandits, providing theoretical regret bounds and a robust algorithm that accounts for decaying corruption in human signals.
Contribution
It develops a novel analysis framework for corruption-robust dueling bandit algorithms and proposes RoSMID, achieving near-optimal regret under decaying corruption.
Findings
Regret lower bound of Ω(max{√T, T^ρ}) established.
RoSMID algorithm attains nearly optimal regret of Õ(max{√T, T^ρ}).
Framework applicable to other gradient-based dueling bandit algorithms.
Abstract
This paper studies Learning from Imperfect Human Feedback (LIHF), addressing potential irrationality or imperfect perception when learning from comparative human feedback. Building on evidence that human imperfection decays over time (i.e., humans learn to improve), we cast this problem as a concave-utility continuous-action dueling bandit under a restricted form of corruption: the corruption scale decays over time as t^(ρ−1) for some "imperfection rate" ρ ∈ [0, 1]. With T as the total number of iterations, we establish a regret lower bound of Ω(max{√T, T^ρ}) for LIHF, even when ρ is known. For the same setting, we develop the Robustified Stochastic Mirror Descent for Imperfect Dueling (RoSMID) algorithm, which achieves nearly optimal regret Õ(max{√T, T^ρ}). Core to our analysis is a novel…
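To give a feel for where the T^ρ term in the bounds could come from, the short sketch below adds up a per-round corruption scale over T rounds and compares the total with √T. It assumes the corruption scale at round t is exactly t^(ρ−1), with no constants or noise; the function names `cumulative_corruption` and `regret_rate` are hypothetical and are not taken from the paper. This is plain arithmetic under that assumption, not the authors' analysis or the RoSMID algorithm.

```python
import math


def cumulative_corruption(T: int, rho: float) -> float:
    """Sum of the assumed per-round corruption scale t**(rho - 1) for t = 1..T."""
    return sum(t ** (rho - 1.0) for t in range(1, T + 1))


def regret_rate(T: int, rho: float) -> float:
    """max{sqrt(T), T**rho}: the rate in the stated bounds, ignoring constants and log factors."""
    return max(math.sqrt(T), T ** rho)


if __name__ == "__main__":
    T = 100_000
    for rho in (0.25, 0.5, 0.75):
        total = cumulative_corruption(T, rho)
        # For 0 < rho < 1 the sum is bounded above by T**rho / rho (integral comparison).
        print(
            f"rho={rho:.2f}  cumulative corruption={total:.1f}  "
            f"T^rho/rho={T ** rho / rho:.1f}  max(sqrt(T), T^rho)={regret_rate(T, rho):.1f}"
        )
```

Under this assumption the cumulative corruption grows on the order of T^ρ, so for ρ ≤ 1/2 the √T term dominates max{√T, T^ρ}, while for ρ > 1/2 the T^ρ term does.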
Taxonomy
Topics: Corruption and Economic Development
