Relational Preference Encoding in Looped Transformer Internal States
Jan Kirin

TL;DR
This paper studies how a looped transformer (Ouro-2.6B-Thinking) encodes human preference in its loop-iteration hidden states. It finds that preference is encoded predominantly relationally, between paired responses rather than in each response alone, and it introduces a flip test diagnostic for evaluators trained on those states.
Contribution
It demonstrates that relational encoding dominates preference representation in looped transformers and introduces a flip test diagnostic for internal preference evaluators.
Findings
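A pairwise evaluator achieves 95.2% test accuracy on 8,552 unseen examples while the base model stays frozen.
Relational encoding outperforms independent classification: a linear probe on pairwise differences reaches 84.5%, while the best nonlinear independent evaluator reaches 65% and linear independent classification scores 21.75%.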
The flip test reveals stable antisymmetry correlation in internal representations.
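The excerpt on this page does not define the flip test, so the following is only a minimal sketch of one plausible reading: score each pair in both orders and measure how strongly score(a, b) correlates with -score(b, a). The names `flip_test`, `diff_scorer`, and `position_scorer` are hypothetical, the data is random, and nothing here reproduces the paper's evaluator or its reported numbers.

```python
import numpy as np

def flip_test(score_fn, a, b):
    """Correlation between score(a, b) and -score(b, a).

    A value near 1.0 means the scorer is antisymmetric under swapping
    the two inputs. This definition is an assumption for illustration.
    """
    forward = score_fn(a, b)
    flipped = score_fn(b, a)
    return float(np.corrcoef(forward, -flipped)[0, 1])

rng = np.random.default_rng(0)
a = rng.normal(size=(500, 8))  # stand-in states for the first response
b = rng.normal(size=(500, 8))  # stand-in states for the second response

# Scorer on the difference a - b: antisymmetric by construction.
w = rng.normal(size=8)
diff_scorer = lambda x, y: (x - y) @ w

# Scorer with separate weights per position: generally not antisymmetric.
w_first = rng.normal(size=8)
w_second = rng.normal(size=8)
position_scorer = lambda x, y: x @ w_first + y @ w_second

antisym_corr = flip_test(diff_scorer, a, b)
position_corr = flip_test(position_scorer, a, b)
print(f"difference scorer: {antisym_corr:.3f}")
print(f"position scorer:   {position_corr:.3f}")
```

Under this reading, a scorer that depends only on the difference between the two states passes the test exactly, while a scorer that treats the two positions differently does not.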
Abstract
We investigate how looped transformers encode human preference in their internal iteration states. Using Ouro-2.6B-Thinking, a 2.6B-parameter looped transformer with iterative refinement, we extract hidden states from each loop iteration and train lightweight evaluator heads (~5M parameters) to predict human preference on the Anthropic HH-RLHF dataset. Our pairwise evaluator achieves 95.2% test accuracy on 8,552 unseen examples, surpassing a full-batch L-BFGS probe (84.5%) while the base model remains completely frozen. Our central finding is that loop states encode preference predominantly relationally: a linear probe on pairwise differences achieves 84.5%, whereas the best nonlinear independent evaluator reaches only 65% test accuracy and linear independent classification scores 21.75%, which is below chance and has inverted polarity. Interpreted precisely, the evaluator functions as a…
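The contrast between a probe on pairwise differences and an independent per-response probe can be illustrated with a toy example. This is a sketch under stated assumptions, not the paper's pipeline: the data is synthetic (a large shared per-pair context plus a small preference offset), the probe is plain gradient-descent logistic regression rather than L-BFGS, and all names (`make_pairs`, `fit_logreg`, and so on) are hypothetical. It does not reproduce the reported accuracies or the inverted polarity of the linear independent probe.

```python
import numpy as np

def fit_logreg(X, y, steps=500, lr=0.1):
    """Gradient-descent logistic regression; last weight is the bias."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def accuracy(w, X, y):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean((Xb @ w > 0) == (y > 0.5)))

def make_pairs(n, dim, rng):
    """Synthetic 'hidden states': shared context dominates, preference is a small offset."""
    u = np.zeros(dim)
    u[0] = 1.0                                  # assumed preference direction
    context = rng.normal(0.0, 5.0, (n, dim))    # shared by both responses in a pair
    chosen = context + 0.5 * u + rng.normal(0.0, 0.3, (n, dim))
    rejected = context - 0.5 * u + rng.normal(0.0, 0.3, (n, dim))
    return chosen, rejected

def pairwise_dataset(chosen, rejected, rng):
    """Features are state differences in random order; label says which came first."""
    s = rng.choice([-1.0, 1.0], size=len(chosen))
    return s[:, None] * (chosen - rejected), (s > 0).astype(float)

def independent_dataset(chosen, rejected):
    """Each response is classified alone as chosen (1) or rejected (0)."""
    X = np.vstack([chosen, rejected])
    y = np.concatenate([np.ones(len(chosen)), np.zeros(len(rejected))])
    return X, y

rng = np.random.default_rng(0)
train_c, train_r = make_pairs(2000, 16, rng)
test_c, test_r = make_pairs(1000, 16, rng)

w_pair = fit_logreg(*pairwise_dataset(train_c, train_r, rng))
pair_acc = accuracy(w_pair, *pairwise_dataset(test_c, test_r, rng))

w_indep = fit_logreg(*independent_dataset(train_c, train_r))
indep_acc = accuracy(w_indep, *independent_dataset(test_c, test_r))

print(f"pairwise-difference probe: {pair_acc:.3f}")
print(f"independent probe:         {indep_acc:.3f}")
```

In this toy setup the subtraction cancels the shared context, so the difference probe separates the pairs easily, while the independent probe has to find a small offset under much larger per-pair variation.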
