Reverse Engineering Human Preferences with Reinforcement Learning
Lisa Alazraki, Tan Yi-Chern, Jon Ander Campos, Maximilian Mozes, Marek Rei, Max Bartolo

TL;DR
This paper demonstrates that human preferences can be reverse engineered by using reinforcement learning to optimize LLM-generated preambles, leading to higher evaluation scores and raising concerns about the reliability of LLM-as-a-judge frameworks.
Contribution
It introduces a method that uses reinforcement learning to adversarially tune preamble-generating models so that a frozen LLM's responses receive higher scores from an LLM judge. The manipulation is virtually undetectable and transfers to models not seen during training.
Findings
Frozen LLMs pipelined with the tuned preamble generators attain higher LLM-evaluation scores than existing frameworks.
The method is virtually undetectable and transferable to unseen models.
It raises questions about the reliability of current LLM evaluation methods.
Abstract
The capabilities of Large Language Models (LLMs) are routinely evaluated by other LLMs trained to predict human preferences. This framework--known as LLM-as-a-judge--is highly scalable and relatively low cost. However, it is also vulnerable to malicious exploitation, as LLM responses can be tuned to overfit the preferences of the judge. Previous work shows that the answers generated by a candidate-LLM can be edited post hoc to maximise the score assigned to them by a judge-LLM. In this study, we adopt a different approach and use the signal provided by judge-LLMs as a reward to adversarially tune models that generate text preambles designed to boost downstream performance. We find that frozen LLMs pipelined with these models attain higher LLM-evaluation scores than existing frameworks. Crucially, unlike other frameworks which intervene directly on the model's response, our method is…
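The pipeline described in the abstract (a tuned preamble generator, a frozen candidate-LLM, and a judge-LLM whose score serves as the reward) can be illustrated with a toy sketch. This is not the authors' implementation: the fixed preamble pool, the stub functions `frozen_candidate` and `toy_judge`, and the choice of a REINFORCE-style update with a moving-average baseline are all assumptions made for illustration. The abstract only states that reinforcement learning is used.

```python
import math
import random

# Hypothetical preamble pool; a real generator would be an LLM emitting free text.
PREAMBLES = [
    "Answer briefly.",
    "Answer step by step and justify each claim.",
    "Answer in a friendly tone.",
]

def frozen_candidate(preamble: str, question: str) -> str:
    """Stand-in for the frozen candidate-LLM; it is never updated."""
    if "step by step" in preamble:
        return f"{question} -> detailed, justified answer"
    return f"{question} -> short answer"

def toy_judge(response: str) -> float:
    """Stand-in for the judge-LLM; it sees only the response, not the preamble."""
    return 1.0 if "justified" in response else 0.2

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(steps=500, lr=0.5, seed=0):
    """Tune a softmax policy over preambles using the judge score as reward."""
    rng = random.Random(seed)
    logits = [0.0] * len(PREAMBLES)
    baseline = 0.0  # moving average of rewards, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        idx = rng.choices(range(len(PREAMBLES)), weights=probs)[0]
        response = frozen_candidate(PREAMBLES[idx], "What is 2 + 2?")
        reward = toy_judge(response)
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        # REINFORCE: gradient of log-softmax is (one_hot - probs)
        for k in range(len(logits)):
            grad = (1.0 if k == idx else 0.0) - probs[k]
            logits[k] += lr * advantage * grad
    return softmax(logits)

probs = train()
best = max(range(len(PREAMBLES)), key=lambda k: probs[k])
print(PREAMBLES[best], round(probs[best], 3))
```

Only the preamble policy is updated, and the judge scores the candidate's response alone. In this toy setting, that is why the intervention leaves no direct trace in the text being judged.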
Taxonomy
Topics: Color perception and design
Methods: ADaptive gradient method with the OPTimal convergence rate · High-Order Consensuses
