Discovering Implicit Large Language Model Alignment Objectives

Edward Chen; Sanmi Koyejo; Carlos Guestrin

arXiv:2602.15338·cs.LG·May 22, 2026

TL;DR

The paper introduces Obj-Disco, a framework that automatically decomposes LLM alignment reward signals into interpretable objectives, improving understanding of model incentives and revealing hidden misalignments.

Contribution

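The paper presents an iterative greedy algorithm that identifies and validates natural language objectives explaining an LLM alignment reward signal. It targets two limitations of existing interpretation methods: reliance on pre-defined rubrics, and failure to find objectives that comprehensively cover and are causal to the model behavior.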

Findings

01 The discovered objectives capture over 90% of reward behavior across tasks and models.

02 The framework identifies latent misaligned incentives in open-source reward models.

03 Evaluations demonstrate the framework's robustness, and human validation supports its effectiveness.

Abstract

Large language model (LLM) alignment relies on complex reward signals that often obscure the specific behaviors being incentivized, creating critical risks of misalignment and reward hacking. Existing interpretation methods typically rely on pre-defined rubrics, risking the omission of "unknown unknowns", or fail to identify objectives that comprehensively cover and are causal to the model behavior. To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives. Our approach utilizes an iterative greedy algorithm to analyze behavioral changes across training checkpoints, identifying and validating candidate objectives that best explain the residual reward signal. Extensive evaluations across diverse tasks, model sizes, and alignment…
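To make the idea of greedily explaining a residual reward signal concrete, here is a minimal sketch in the style of matching pursuit. It is not the paper's implementation: Obj-Disco proposes and validates natural language objectives from behavioral changes across checkpoints, whereas this toy assumes each candidate objective has already been scored numerically on every sample. The function name `greedy_decompose`, the objective names, and all numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length score vectors."""
    return sum(x * y for x, y in zip(a, b))


def greedy_decompose(
    reward: List[float],
    candidates: Dict[str, List[float]],
    max_objectives: int = 3,
    min_gain: float = 1e-9,
) -> Tuple[List[Tuple[str, float]], float]:
    """Greedily select objectives that best explain the residual reward.

    Assumption: candidates maps a (hypothetical) objective name to its
    per-sample scores. Returns the selected (name, weight) pairs and the
    fraction of squared reward explained by the weighted combination.
    """
    residual = list(reward)
    total = dot(reward, reward)
    remaining = dict(candidates)
    selected: List[Tuple[str, float]] = []

    for _ in range(max_objectives):
        best_name, best_weight, best_gain = None, 0.0, min_gain
        for name, scores in remaining.items():
            norm = dot(scores, scores)
            if norm == 0.0:
                continue  # an all-zero objective cannot explain anything
            weight = dot(residual, scores) / norm  # 1-D least-squares fit
            gain = weight * weight * norm  # drop in squared residual
            if gain > best_gain:
                best_name, best_weight, best_gain = name, weight, gain
        if best_name is None:
            break  # no candidate reduces the residual any further
        scores = remaining.pop(best_name)
        residual = [r - best_weight * s for r, s in zip(residual, scores)]
        selected.append((best_name, best_weight))

    explained = 1.0 - dot(residual, residual) / total if total else 0.0
    return selected, explained


# Toy data (hypothetical): reward = 2 * helpfulness + 1 * verbosity.
candidates = {
    "helpfulness": [1.0, 0.0, 1.0, 0.0],
    "verbosity": [0.0, 1.0, 0.0, 0.0],
    "politeness": [0.0, 0.0, 0.0, 1.0],
}
reward = [2.0, 1.0, 2.0, 0.0]

selected, explained = greedy_decompose(reward, candidates)
print(selected, explained)
```

The loop stops early once no remaining candidate reduces the residual, which is what keeps the decomposition sparse in this sketch. A fuller version would likely refit all weights jointly after each selection, as orthogonal matching pursuit does, rather than fitting one weight at a time.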

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Topic Modeling