Online Learning from Strategic Human Feedback in LLM Fine-Tuning

Shugang Hao; Lingjie Duan

arXiv:2412.16834·cs.AI·December 25, 2024

Open Access

TL;DR

This paper introduces an online learning framework for LLM fine-tuning that handles strategic human feedback. It induces labelers to report truthfully and achieves sublinear regret $O(T^{1/2})$ over $T$ time slots, improving alignment with human preferences.

Contribution

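To the authors' knowledge, it presents the first online learning mechanism against strategic human labelers in LLM fine-tuning. The mechanism formulates a dynamic Bayesian game and dynamically adjusts labelers' weights in the preference aggregation to ensure truthful feedback and sublinear regret.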

Findings

01 Mechanism achieves sublinear regret of $O(T^{1/2})$

02 Outperforms existing benchmark schemes in simulations

03 Ensures truthful feedback from strategic human labelers

Abstract

Reinforcement learning from human feedback (RLHF) has become an essential step in fine-tuning large language models (LLMs) to align them with human preferences. However, human labelers are selfish and have diverse preferences. They may strategically misreport their online feedback to steer the system's aggregation towards their own preferences. Current practice simply averages labelers' feedback in each time slot and fails to identify the most accurate human labeler, leading to linear regret $O(T)$ over $T$ time slots. To the best of our knowledge, we are the first to study online learning mechanisms against strategic human labelers in the LLM fine-tuning process. We formulate a new dynamic Bayesian game and dynamically adjust human labelers' weights in the preference aggregation, ensuring their truthful feedback and sublinear regret $O(T^{1/2})$. Simulation results demonstrate…
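To make the contrast between plain averaging and dynamically weighted aggregation concrete, here is a minimal sketch. It is not the paper's mechanism: it uses the standard Hedge (exponential weights) update as a stand-in, assumes non-strategic labelers whose reports are the true preference plus Gaussian noise, and assumes the true preference is revealed after each time slot. The function name `simulate`, the noise levels, and the squared-error loss are all illustrative assumptions.

```python
import math
import random


def simulate(num_slots=2000, noise_levels=(0.05, 0.3, 0.3), seed=0):
    """Compare plain averaging with Hedge-style weighted aggregation.

    Illustrative only: labelers here are noisy but honest, and the
    update rule is standard Hedge, not the paper's mechanism.
    Returns (averaging loss, weighted loss, best single labeler's loss).
    """
    rng = random.Random(seed)
    n = len(noise_levels)
    # Standard Hedge learning rate for losses bounded in [0, 1].
    eta = math.sqrt(8 * math.log(n) / num_slots)
    weights = [1.0] * n
    avg_loss = 0.0
    weighted_loss = 0.0
    labeler_losses = [0.0] * n

    for _ in range(num_slots):
        # Hypothetical ground-truth preference score in [0, 1].
        truth = rng.random()
        # Each labeler reports the truth plus its own noise, clipped to [0, 1].
        reports = [
            min(1.0, max(0.0, truth + rng.gauss(0.0, sigma)))
            for sigma in noise_levels
        ]

        # Aggregate in two ways: uniform average vs. weight-proportional average.
        average = sum(reports) / n
        total = sum(weights)
        weighted = sum(w * r for w, r in zip(weights, reports)) / total

        avg_loss += (average - truth) ** 2
        weighted_loss += (weighted - truth) ** 2

        # Multiplicatively down-weight labelers in proportion to their loss.
        for i, report in enumerate(reports):
            loss = (report - truth) ** 2
            labeler_losses[i] += loss
            weights[i] *= math.exp(-eta * loss)

    return avg_loss, weighted_loss, min(labeler_losses)


if __name__ == "__main__":
    avg, weighted, best = simulate()
    print(f"averaging loss: {avg:.2f}")
    print(f"weighted loss:  {weighted:.2f}")
    print(f"best labeler:   {best:.2f}")
```

In this toy setting the weighted aggregate's cumulative loss stays within $\sqrt{T \ln N / 2}$ of the best single labeler's, which is the usual Hedge guarantee for convex losses in $[0, 1]$, while uniform averaging keeps paying for the noisier labelers in every slot. The sketch does not address truthfulness, which is the part the paper's dynamic Bayesian game is designed to handle.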

Taxonomy

Topics: Neural Networks and Applications

Methods: ALIGN