Online Learning from Strategic Human Feedback in LLM Fine-Tuning
Shugang Hao, Lingjie Duan

TL;DR
This paper introduces an online learning framework for LLM fine-tuning that handles strategic human feedback: it incentivizes labelers to report truthfully and achieves sublinear regret, improving alignment with human preferences.
Contribution
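To the authors' knowledge, it presents the first online learning mechanism against strategic human labelers in LLM fine-tuning. The mechanism is formulated as a dynamic Bayesian game that adjusts labelers' weights in the preference aggregation to ensure truthful feedback and sublinear regret.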
Findings
Mechanism achieves sublinear regret of O(T^{1/2})
Outperforms existing benchmark schemes in simulations
Ensures truthful feedback from strategic human labelers
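For reference, "sublinear" is read here in the usual online-learning sense: cumulative regret grows more slowly than the number of time slots, so the average regret per slot vanishes. The symbols R(T), \ell_t, and \ell_t^{*} below are notation assumed for this note (system loss and benchmark loss in slot t); the paper's exact regret definition may differ.

```latex
R(T) = \sum_{t=1}^{T} \bigl(\ell_t - \ell_t^{*}\bigr),
\qquad
R(T) = O\bigl(T^{1/2}\bigr)
\;\Longrightarrow\;
\frac{R(T)}{T} = O\bigl(T^{-1/2}\bigr) \to 0
\quad \text{as } T \to \infty .
```

By contrast, linear regret, R(T) = \Theta(T), leaves a per-slot gap that does not shrink over time.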
Abstract
Reinforcement learning from human feedback (RLHF) has become an essential step in fine-tuning large language models (LLMs) to align them with human preferences. However, human labelers are selfish and have diverse preferences. They may strategically misreport their online feedback to steer the system's aggregation towards their own preferences. Current practice simply averages labelers' feedback in each time slot and fails to identify the most accurate human labeler, leading to regret that grows linearly in the number of time slots. To the best of our knowledge, we are the first to study online learning mechanisms against strategic human labelers in the LLM fine-tuning process. We formulate a new dynamic Bayesian game and dynamically adjust human labelers' weights in the preference aggregation, ensuring their truthful feedback and sublinear regret. Simulation results demonstrate…
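The contrast between plain averaging and dynamically weighted aggregation can be illustrated with a small simulation. The sketch below is not the paper's mechanism: it swaps in the standard Hedge (multiplicative weights) update, assumes the true preference is revealed after each slot, and models labelers as merely noisy rather than strategic, so it does not capture the dynamic Bayesian game or the truthfulness guarantee. All names (`simulate`, `hedge_update`, the noise levels) are hypothetical.

```python
import math
import random


def average_aggregate(reports):
    """Baseline: unweighted mean of the labelers' reports."""
    return sum(reports) / len(reports)


def weighted_aggregate(reports, weights):
    """Weighted mean of the reports using the current labeler weights."""
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, reports)) / total


def hedge_update(weights, losses, eta):
    """Hedge step: shrink each weight exponentially in that labeler's loss."""
    return [w * math.exp(-eta * l) for w, l in zip(weights, losses)]


def simulate(T=2000, noise=(0.05, 0.3, 0.5), seed=0):
    """Compare cumulative squared loss of averaging vs. Hedge weighting.

    Assumption: the true preference in [0, 1] is observed after each slot,
    and labeler i reports it with Gaussian noise of std noise[i], clipped
    to [0, 1] so that every per-slot loss lies in [0, 1].
    """
    rng = random.Random(seed)
    n = len(noise)
    weights = [1.0] * n
    eta = math.sqrt(8 * math.log(n) / T)  # standard Hedge learning rate
    loss_avg = 0.0
    loss_weighted = 0.0
    loss_each = [0.0] * n
    for _ in range(T):
        truth = rng.random()
        reports = [min(1.0, max(0.0, truth + rng.gauss(0.0, s))) for s in noise]
        loss_avg += (average_aggregate(reports) - truth) ** 2
        loss_weighted += (weighted_aggregate(reports, weights) - truth) ** 2
        losses = [(r - truth) ** 2 for r in reports]
        loss_each = [a + b for a, b in zip(loss_each, losses)]
        weights = hedge_update(weights, losses, eta)
    return {
        "average": loss_avg,
        "weighted": loss_weighted,
        "best_single": min(loss_each),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the weighted scheme concentrates on the most accurate labeler, and its gap to that labeler's cumulative loss is bounded by sqrt(T ln n / 2), which is the O(T^{1/2}) shape cited in the findings. Averaging keeps paying for the noisier labelers in every slot.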
Taxonomy
Topics: Neural Networks and Applications
Methods: ALIGN
