Learning from Imperfect Human Feedback: a Tale from Corruption-Robust Dueling

Yuwei Cheng; Fan Yao; Xuefeng Liu; Haifeng Xu

arXiv:2405.11204 · cs.LG · October 16, 2024

TL;DR

This paper introduces a new framework for learning from imperfect human feedback in dueling bandits, providing theoretical regret bounds and a robust algorithm that accounts for decaying corruption in human signals.

Contribution

It develops a novel analysis framework for corruption-robust dueling bandit algorithms and proposes RoSMID, achieving near-optimal regret under decaying corruption.

Findings

01

Regret lower bound of Ω(max{√T, T^ρ}) established.

02

The RoSMID algorithm attains a nearly optimal regret of Õ(max{√T, T^ρ}).

03

Framework applicable to other gradient-based dueling bandit algorithms.

Abstract

This paper studies Learning from Imperfect Human Feedback (LIHF), addressing the potential irrationality or imperfect perception when learning from comparative human feedback. Building on evidence that human imperfection decays over time (i.e., humans learn to improve), we cast this problem as a concave-utility continuous-action dueling bandit under a restricted form of corruption: the corruption scale decays over time as $t^{\rho - 1}$ for some "imperfection rate" $\rho \in [0, 1]$. With $T$ as the total number of iterations, we establish a regret lower bound of $\Omega(\max\{\sqrt{T}, T^{\rho}\})$ for LIHF, even when $\rho$ is known. For the same setting, we develop the Robustified Stochastic Mirror Descent for Imperfect Dueling (RoSMID) algorithm, which achieves nearly optimal regret $\tilde{O}(\max\{\sqrt{T}, T^{\rho}\})$. Core to our analysis is a novel…
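To make the role of the imperfection rate ρ concrete, the following is a minimal numerical sketch, not code from the paper. It assumes the per-round corruption scale is exactly t^(ρ-1) with constant 1, sums it over T rounds, and compares the total with T^ρ/ρ, a standard integral bound. It also evaluates the rate max{√T, T^ρ} from the findings with all constants and logarithmic factors dropped. The function names `cumulative_corruption` and `regret_rate` are hypothetical.

```python
import math


def cumulative_corruption(T: int, rho: float) -> float:
    """Sum of t**(rho - 1) for t = 1..T.

    Assumption: the corruption scale at round t is exactly t**(rho - 1),
    with the leading constant taken to be 1 for illustration.
    """
    return sum(t ** (rho - 1.0) for t in range(1, T + 1))


def regret_rate(T: int, rho: float) -> float:
    """max{sqrt(T), T**rho}, ignoring constants and log factors."""
    return max(math.sqrt(T), T ** rho)


if __name__ == "__main__":
    T = 10_000
    for rho in (0.25, 0.5, 0.75, 1.0):
        total = cumulative_corruption(T, rho)
        bound = T ** rho / rho  # integral bound, valid for 0 < rho <= 1
        print(f"rho={rho:.2f}  total corruption={total:10.2f}  "
              f"T^rho/rho={bound:10.2f}  rate={regret_rate(T, rho):10.2f}")
```

Under these assumptions the total corruption grows on the order of T^ρ. For ρ ≤ 1/2 the √T term dominates the rate, and for ρ > 1/2 the T^ρ term takes over, which matches the two regimes in the stated bounds.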

Videos

Learning from Imperfect Human Feedback: A Tale from Corruption-Robust Dueling · SlidesLive

Taxonomy

Topics: Corruption and Economic Development