Discovering Implicit Large Language Model Alignment Objectives
Edward Chen, Sanmi Koyejo, Carlos Guestrin

TL;DR
The paper introduces Obj-Disco, a framework that automatically decomposes LLM alignment reward signals into interpretable objectives, improving understanding of model incentives and revealing hidden misalignments.
Contribution
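The paper presents an iterative greedy algorithm that identifies and validates natural language objectives explaining LLM reward signals. It targets two limitations of existing interpretability methods: reliance on pre-defined rubrics, and failure to find objectives that both comprehensively cover and causally explain model behavior.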
Findings
Captures over 90% of reward behavior across tasks and models
Successfully identifies latent misaligned incentives in open-source reward models
Demonstrates robustness and human-validated effectiveness of the framework
Abstract
Large language model (LLM) alignment relies on complex reward signals that often obscure the specific behaviors being incentivized, creating critical risks of misalignment and reward hacking. Existing interpretation methods typically rely on pre-defined rubrics, risking the omission of "unknown unknowns", or fail to identify objectives that comprehensively cover and are causal to the model behavior. To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives. Our approach utilizes an iterative greedy algorithm to analyze behavioral changes across training checkpoints, identifying and validating candidate objectives that best explain the residual reward signal. Extensive evaluations across diverse tasks, model sizes, and alignment…
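The abstract describes a greedy loop that repeatedly picks the objective best explaining the residual reward and then combines the picks into a sparse weighted sum. The sketch below illustrates that general idea only; it is not the authors' implementation. It assumes each candidate objective has already been scored numerically on every response (the hypothetical `scores` matrix), whereas the paper proposes and validates natural language objectives from behavioral changes across checkpoints. The function name `greedy_decompose`, the `min_gain` stopping rule, and the use of explained variance as the coverage measure are all assumptions made for illustration.

```python
import numpy as np

def greedy_decompose(scores, reward, max_objectives=3, min_gain=0.01):
    """Greedy forward selection of candidate objectives (illustrative only).

    scores: (n_samples, n_candidates) hypothetical per-objective scores.
    reward: (n_samples,) reward signal to be explained.
    Returns the selected column indices, their weights, and the
    fraction of reward variance explained.
    """
    target = reward - reward.mean()
    centered = scores - scores.mean(axis=0)
    norms = np.linalg.norm(centered, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for constant columns
    total = float(target @ target)

    selected, weights, explained = [], np.zeros(0), 0.0
    residual = target.copy()
    for _ in range(max_objectives):
        # Score each candidate by its alignment with the current residual.
        corr = np.abs(centered.T @ residual) / norms
        if selected:
            corr[selected] = -np.inf  # never pick the same objective twice
        best = int(np.argmax(corr))

        # Refit weights for all chosen objectives jointly.
        trial = selected + [best]
        w, *_ = np.linalg.lstsq(centered[:, trial], target, rcond=None)
        new_residual = target - centered[:, trial] @ w
        new_explained = 1.0 - float(new_residual @ new_residual) / total

        # Assumed stopping rule: stop when the gain is negligible.
        if new_explained - explained < min_gain:
            break
        selected, weights = trial, w
        residual, explained = new_residual, new_explained
    return selected, weights, explained

# Synthetic demo: the reward depends on candidates 1 and 4 only.
rng = np.random.default_rng(0)
scores = rng.normal(size=(500, 6))
reward = 2.0 * scores[:, 1] - 1.0 * scores[:, 4] + 0.05 * rng.normal(size=500)
selected, weights, explained = greedy_decompose(scores, reward)
print(selected, np.round(weights, 2), round(explained, 3))
```

On this synthetic data the loop stops after recovering the two planted objectives, because a third candidate adds almost no explained variance. Refitting all weights jointly at each step, rather than fixing earlier weights, is a design choice made here so that correlated candidates do not distort the final combination.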
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Topic Modeling
