Interpreting and Controlling Model Behavior via Constitutions for Atomic Concept Edits

Neha Kalibhat; Zi Wang; Prasoon Bajpai; Drew Proud; Wenjun Zeng; Been Kim; Mani Malek

arXiv:2602.00092 · cs.LG · February 3, 2026

Open Access

TL;DR

This paper presents a framework that learns natural language summaries, called constitutions, to interpret and control model behavior through atomic concept edits, enabling predictable and verifiable prompt modifications.

Contribution

It introduces a black-box interpretability method that uses atomic concept edits to learn verifiable constitutions for understanding and controlling model behavior.

Findings

1. Constitutions significantly improve control over model outputs.

2. Different models focus on different aspects like grammatical adherence and atmospheric coherence.

3. Atomic concept edits boost success rates by an average of 1.86 times.

Abstract

We introduce a black-box interpretability framework that learns a verifiable constitution: a natural language summary of how changes to a prompt affect a model's specific behavior, such as its alignment, correctness, or adherence to constraints. Our method leverages atomic concept edits (ACEs), which are targeted operations that add, remove, or replace an interpretable concept in the input prompt. By systematically applying ACEs and observing the resulting effects on model behavior across various tasks, our framework learns a causal mapping from edits to predictable outcomes. This learned constitution provides deep, generalizable insights into the model. Empirically, we validate our approach across diverse tasks, including mathematical reasoning and text-to-image alignment, for controlling and understanding model behavior. We found that for text-to-image generation, GPT-Image tends to…
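
The three ACE operations named in the abstract (add, remove, replace) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a prompt can be represented as an ordered list of concept strings, and every name in it (`AtomicConceptEdit`, `apply_edit`, `render_prompt`, `edit_effects`, and the `score` callable standing in for a behavior measurement) is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class AtomicConceptEdit:
    """Hypothetical container for one edit; not taken from the paper."""
    op: str                             # "add", "remove", or "replace"
    concept: str                        # concept to add, remove, or replace
    new_concept: Optional[str] = None   # replacement concept, used only by "replace"


def apply_edit(concepts: list[str], edit: AtomicConceptEdit) -> list[str]:
    """Return a new concept list with exactly one edit applied."""
    if edit.op == "add":
        return concepts + [edit.concept]
    if edit.op == "remove":
        return [c for c in concepts if c != edit.concept]
    if edit.op == "replace":
        if edit.new_concept is None:
            raise ValueError("replace needs new_concept")
        return [edit.new_concept if c == edit.concept else c for c in concepts]
    raise ValueError(f"unknown op: {edit.op}")


def render_prompt(concepts: list[str]) -> str:
    """Assumed rendering: join the concepts into a single prompt string."""
    return ", ".join(concepts)


def edit_effects(
    concepts: list[str],
    edits: list[AtomicConceptEdit],
    score: Callable[[str], float],
) -> dict[AtomicConceptEdit, float]:
    """Score change caused by each edit relative to the unedited prompt.

    `score` is a placeholder for whatever behavior measurement is used
    (for example an alignment or correctness metric on the model output).
    """
    base = score(render_prompt(concepts))
    return {e: score(render_prompt(apply_edit(concepts, e))) - base for e in edits}
```

A table of per-edit score changes like the one `edit_effects` returns is one plausible raw input for summarizing how edits affect behavior. How the framework actually turns such observations into a natural language constitution is not described in the text shown here.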

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Machine Learning in Materials Science