RLTHF: Targeted Human Feedback for LLM Alignment

Yifei Xu; Tusher Chakraborty; Emre Kıcıman; Bibek Aryal; Eduardo Rodrigues; Srinagesh Sharma; Roberto Estevao; Maria Angels de Luis Balaguer; Jessica Wolk; Rafael Padilha; Leonardo Nunes; Shobana Balakrishnan; Songwu Lu; Ranveer Chandra

arXiv:2502.13417·cs.CL·August 8, 2025

Open Access

TL;DR

RLTHF is a hybrid framework that efficiently aligns large language models with human preferences by combining initial LLM-based alignment with targeted human corrections, significantly reducing human effort while maintaining high quality.

Contribution

RLTHF combines LLM-based labeling with selective human annotations to reach full-human-annotation-level alignment with minimal human effort.

Findings

01

RLTHF achieves full-human-annotation-level alignment with only 6-7% of the human annotation effort.

02

Models trained on RLTHF-curated data outperform those trained on fully human-annotated datasets.

03

RLTHF effectively identifies hard-to-annotate samples for targeted human correction.

Abstract

Fine-tuning large language models (LLMs) to align with user preferences is challenging due to the high cost of quality human annotations in Reinforcement Learning from Human Feedback (RLHF) and the generalizability limitations of AI Feedback. To address these challenges, we propose RLTHF, a human-AI hybrid framework that combines LLM-based initial alignment with selective human annotations to achieve full-human-annotation-level alignment with minimal effort. RLTHF uses a reward model's reward distribution to identify hard-to-annotate samples that the LLM mislabeled, then iteratively improves alignment by integrating strategic human corrections while retaining the samples the LLM labeled correctly. Evaluations on the HH-RLHF and TL;DR datasets show that RLTHF reaches full-human-annotation-level alignment with only 6-7% of the human annotation effort. Furthermore, models trained on RLTHF's curated datasets for…
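
The selection step described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that each LLM-labeled preference pair has a reward margin (reward of the LLM-chosen response minus reward of the LLM-rejected response) and that the pairs with the lowest margins are the ones routed to human annotators. The abstract does not state the actual selection rule, so the function name `select_for_human_review`, the margin definition, and the fixed `budget_fraction` are all hypothetical.

```python
from typing import List, Tuple


def select_for_human_review(
    margins: List[float], budget_fraction: float
) -> Tuple[List[int], List[int]]:
    """Split sample indices into (send_to_human, keep_llm_label).

    ASSUMPTION: margins[i] = reward(LLM-chosen) - reward(LLM-rejected) for
    pair i, and low or negative margins hint at a possible LLM mislabel.
    This is an illustrative rule, not the rule from the paper.
    """
    if not 0.0 <= budget_fraction <= 1.0:
        raise ValueError("budget_fraction must be in [0, 1]")

    # Number of pairs the human annotation budget covers (rounded down).
    n_review = int(len(margins) * budget_fraction)

    # Rank pairs from lowest to highest reward margin.
    order = sorted(range(len(margins)), key=lambda i: margins[i])

    send_to_human = sorted(order[:n_review])
    keep_llm_label = sorted(order[n_review:])
    return send_to_human, keep_llm_label


# Toy margins for ten LLM-labeled pairs (made-up numbers).
margins = [2.1, -0.7, 0.05, 1.4, -1.9, 0.9, 3.0, 0.3, 1.1, 2.6]
send_to_human, keep_llm_label = select_for_human_review(margins, budget_fraction=0.2)
print(send_to_human)   # indices routed to human annotators
print(keep_llm_label)  # indices whose LLM labels are kept
```

In an iterative setup, the human-corrected pairs would be merged back with the retained LLM labels, the reward model retrained, and the margins recomputed before the next round. That loop is omitted here because the abstract gives no further details.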

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Statistical and Computational Modeling

Methods: ALIGN