Reverse Engineering Human Preferences with Reinforcement Learning

Lisa Alazraki; Tan Yi-Chern; Jon Ander Campos; Maximilian Mozes; Marek Rei; Max Bartolo

arXiv:2505.15795 · cs.CL · February 3, 2026

TL;DR

This paper shows that reinforcement learning, with judge-LLM scores as the reward, can train a model to generate preambles that raise the evaluation scores of a frozen LLM's responses. In effect, the preferences encoded in the judge are reverse engineered, which raises concerns about the reliability of the LLM-as-a-judge framework.

Contribution

The paper introduces a method that uses judge-LLM scores as a reward to adversarially tune models that generate text preambles, rather than editing a candidate's responses post hoc. The resulting attack is reported to be virtually undetectable and transferable to unseen models.

Findings

01. Frozen LLMs pipelined with the tuned preamble generators attain higher LLM-evaluation scores than existing frameworks.

02. The method is virtually undetectable and transfers to unseen models.

03. The results raise questions about the reliability of current LLM-as-a-judge evaluation.

Abstract

The capabilities of Large Language Models (LLMs) are routinely evaluated by other LLMs trained to predict human preferences. This framework--known as LLM-as-a-judge--is highly scalable and relatively low cost. However, it is also vulnerable to malicious exploitation, as LLM responses can be tuned to overfit the preferences of the judge. Previous work shows that the answers generated by a candidate-LLM can be edited post hoc to maximise the score assigned to them by a judge-LLM. In this study, we adopt a different approach and use the signal provided by judge-LLMs as a reward to adversarially tune models that generate text preambles designed to boost downstream performance. We find that frozen LLMs pipelined with these models attain higher LLM-evaluation scores than existing frameworks. Crucially, unlike other frameworks which intervene directly on the model's response, our method is…
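The training loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the preamble pool, `frozen_candidate`, `toy_judge`, and the REINFORCE-style update with a running baseline are all assumptions made for illustration. In particular, the paper tunes a model that generates free-form preambles, whereas this sketch only learns a softmax policy over three fixed strings, and the judge is a hard-coded rule rather than an LLM.

```python
import math
import random

# Hypothetical preamble pool (an assumption; the paper generates free-form text).
PREAMBLES = [
    "Answer briefly.",
    "Think step by step, then give a structured answer.",
    "Answer in a confident tone.",
]

def frozen_candidate(preamble: str, question: str) -> str:
    """Stand-in for a frozen candidate LLM; nothing here is ever trained."""
    return f"[{preamble}] response to: {question}"

def toy_judge(question: str, response: str) -> float:
    """Stand-in for a judge-LLM score; this rule is purely illustrative."""
    return 9.0 if "step by step" in response else 5.0

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(questions, steps=500, lr=0.1, seed=0):
    """REINFORCE over preamble choices, using the judge score as the reward."""
    rng = random.Random(seed)
    logits = [0.0] * len(PREAMBLES)
    baseline = None  # running mean of rewards, used to reduce variance
    for _ in range(steps):
        question = rng.choice(questions)
        probs = softmax(logits)
        action = rng.choices(range(len(PREAMBLES)), weights=probs)[0]
        response = frozen_candidate(PREAMBLES[action], question)
        reward = toy_judge(question, response)
        if baseline is None:
            baseline = reward
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log softmax probability of the sampled action.
        for i in range(len(logits)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)

probs = train(["What is 2 + 2?", "Explain gravity."])
best = PREAMBLES[max(range(len(probs)), key=probs.__getitem__)]
print(best, [round(p, 3) for p in probs])
```

Only the preamble policy is updated; the candidate and the judge are treated as black boxes, which mirrors the separation the abstract draws between the tuned preamble model and the frozen LLM it is pipelined with.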
