Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
Ye Wang, Jing Liu, Toshiaki Koike-Akino

TL;DR
This paper introduces SLOP (sharpened logarithmic opinion pool), an inference-time alignment method that combines reference-model temperature adjustment with calibrated weights over an ensemble of generative reward models, mitigating reward hacking while preserving alignment performance.
Contribution
It extends existing inference-time alignment techniques with reference-model temperature adjustment, generalizes them to ensembles of generative reward models, and proposes an algorithm for calibrating the ensemble weights to improve robustness against reward hacking.
Findings
Reference-model temperature adjustment generalizes inference-time alignment to ensembles of generative reward models.
Calibration of ensemble weights enhances reward hacking mitigation.
In the reported experiments, calibrated SLOP preserves alignment performance while improving robustness.
Abstract
Inference-time alignment techniques offer a lightweight alternative or complement to costly reinforcement learning, while enabling continual adaptation as alignment objectives and reward targets evolve. Existing theoretical analyses justify these methods as approximations to sampling from distributions optimally tilted toward a given reward model. We extend these techniques by introducing reference-model temperature adjustment, which leads to further generalization of inference-time alignment to ensembles of generative reward models combined as a sharpened logarithmic opinion pool (SLOP). To mitigate reward hacking, we propose an algorithm for calibrating SLOP weight parameters and experimentally demonstrate that it improves robustness while preserving alignment performance.
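As a rough illustration of the pooling idea described in the abstract, the sketch below combines next-token distributions as a weighted geometric mixture, p(v) ∝ p_ref(v)^(1/T) · ∏_i p_i(v)^(w_i), where T is a temperature applied to the reference model and w_i are ensemble weights. This is the textbook logarithmic opinion pool with an extra exponent on the reference model; the abstract does not give the paper's exact parameterization, so the formula, the function name `slop_pool`, and all parameter names are assumptions made for illustration, and the paper's calibration algorithm for the weights is not reproduced here.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())


def slop_pool(ref_logprobs, expert_logprobs, ref_temperature=1.0, weights=None):
    """Hypothetical helper: pool log-probabilities over a shared vocabulary.

    Assumed form (not taken from the paper):
        log p(v) = log p_ref(v) / T + sum_i w_i * log p_i(v) + const
    The weights are not forced to sum to one, so the pool can be sharper
    than any single member of the ensemble.
    """
    ref_logprobs = np.asarray(ref_logprobs, dtype=float)
    expert_logprobs = np.atleast_2d(np.asarray(expert_logprobs, dtype=float))
    if weights is None:
        weights = np.ones(expert_logprobs.shape[0])
    weights = np.asarray(weights, dtype=float)
    # Weighted sum in log space equals a weighted product in probability space.
    combined = ref_logprobs / ref_temperature + weights @ expert_logprobs
    # Renormalize so the result is a valid log-distribution.
    return log_softmax(combined)


# Toy example: a 3-token vocabulary, one reference model, two ensemble members.
ref = log_softmax([2.0, 1.0, 0.0])
experts = [log_softmax([0.0, 1.0, 2.0]), log_softmax([0.5, 1.5, 0.5])]
pooled = np.exp(slop_pool(ref, experts, ref_temperature=0.7, weights=[0.5, 0.5]))
print(pooled, pooled.sum())
```

With all weights set to zero and T = 1 the pool reduces to the reference distribution, and lowering T below 1 concentrates mass on the reference model's most likely tokens, which is the sense in which the temperature "tempers" the reference before the ensemble "tilts" it.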
