Inverse Constitutional AI: Compressing Preferences into Principles
Arduin Findeis, Timo Kaufmann, Eyke Hüllermeier, Samuel Albanie, Robert Mullins

TL;DR
This paper introduces Inverse Constitutional AI, a method to interpret preference data by extracting principles that can reconstruct annotations, aiding understanding of biases and preferences in AI feedback datasets.
Contribution
It formulates preference interpretation as a compression task and proposes an algorithm to extract constitutions that explain annotation data, enhancing interpretability and bias detection.
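The compression framing can be illustrated with a minimal Python sketch: a small ordered set of principles counts as a good "constitution" if an annotator that follows it reproduces the original preference labels. Everything here is an assumption made for illustration, not the authors' algorithm: the rule-based principles (prefer_shorter, prefer_polite), the first-match annotator, the brute-force search, and the three example pairs are all hypothetical, and the hand-written rules stand in for whatever annotator the paper actually uses.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Pair:
    text_a: str
    text_b: str
    preferred: str  # original annotation: "a" or "b"


# Hypothetical principle type: votes "a", "b", or None to abstain.
Principle = Callable[[str, str], Optional[str]]


def prefer_shorter(a: str, b: str) -> Optional[str]:
    """Toy principle: pick the shorter response, abstain on equal length."""
    if len(a) == len(b):
        return None
    return "a" if len(a) < len(b) else "b"


def prefer_polite(a: str, b: str) -> Optional[str]:
    """Toy principle: pick the response containing 'please', if only one does."""
    polite_a, polite_b = "please" in a.lower(), "please" in b.lower()
    if polite_a == polite_b:
        return None
    return "a" if polite_a else "b"


def annotate(constitution: Sequence[Principle], pair: Pair) -> str:
    """The first principle that does not abstain decides; default to 'a'."""
    for principle in constitution:
        vote = principle(pair.text_a, pair.text_b)
        if vote is not None:
            return vote
    return "a"


def reconstruction_accuracy(
    constitution: Sequence[Principle], pairs: Sequence[Pair]
) -> float:
    """Fraction of original annotations the constitution reproduces."""
    if not pairs:
        raise ValueError("need at least one annotated pair")
    hits = sum(annotate(constitution, p) == p.preferred for p in pairs)
    return hits / len(pairs)


def select_constitution(
    candidates: Sequence[Principle], pairs: Sequence[Pair], max_principles: int
) -> Tuple[List[Principle], float]:
    """Brute-force search over ordered subsets of at most max_principles."""
    best: Tuple[Principle, ...] = ()
    best_acc = reconstruction_accuracy(best, pairs)
    for size in range(1, max_principles + 1):
        for ordered in permutations(candidates, size):
            acc = reconstruction_accuracy(ordered, pairs)
            if acc > best_acc:  # strict: keep the earliest, smallest winner
                best, best_acc = ordered, acc
    return list(best), best_acc


# Made-up preference pairs, used only to exercise the sketch.
EXAMPLE_PAIRS = [
    Pair("Sure.", "Sure, here is a very long answer.", "a"),
    Pair("Here is a long and winding reply.", "Done.", "b"),
    Pair("No.", "Please see the attached notes.", "b"),
]

if __name__ == "__main__":
    chosen, accuracy = select_constitution(
        [prefer_shorter, prefer_polite], EXAMPLE_PAIRS, max_principles=2
    )
    print([p.__name__ for p in chosen], accuracy)
```

On these toy pairs, a budget of one principle reproduces two of the three labels, while a budget of two reproduces all three. The cap on the number of principles is what makes this a compression problem: the constitution has to stay short enough to read while still explaining the annotations.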
Findings
Successfully reconstructs annotations from various datasets
Identifies biases and preferences in feedback data
Provides a scalable way to understand and adapt models
Abstract
Feedback data is widely used for fine-tuning and evaluating state-of-the-art AI models. Pairwise text preferences, where human or AI annotators select the "better" of two options, are particularly common. Such preferences are used to train (reward) models or to rank models with aggregate statistics. For many applications it is desirable to understand annotator preferences in addition to modelling them - not least because extensive prior work has shown various unintended biases in preference datasets. Yet, preference datasets remain challenging to interpret. Neither black-box reward models nor statistics can answer why one text is preferred over another. Manual interpretation of the numerous (long) response pairs is usually equally infeasible. In this paper, we introduce the Inverse Constitutional AI (ICAI) problem, formulating the interpretation of pairwise text preference data as a…
Peer Reviews
Decision·ICLR 2025 Poster
1. The paper introduces a new problem, named Inverse Constitutional AI (ICAI), which aims to compress human or model feedback into principles that can help uncover biases in data annotation, enhance understanding of model performance, scale feedback to unseen data, and adapt large language models to individual or group preferences. 2. The paper proposes a straightforward method to address ICAI problems and conducts extensive experiments across four different feedback datasets to validate its ap
1. The experimental results would be more convincing if the authors demonstrated the application of ICAI. For instance, providing experimental evidence of ICAI’s potential in addressing annotation biases and scaling up annotation would strengthen the paper. While the authors claim their algorithm can help discover annotation bias in the feedback dataset, the experiments focus solely on reconstructing the original feedback without analyzing bias discovery and annotation scaling. 2. The proposed
- Very interesting and well-defined research problem. - The ICAI method is simple and effective. - The experiments cover various settings, including population preference, persona-based preference, and even personalized preference.
1. A limited number of static principles may lose information when summarizing preference patterns; the number of patterns matters. For example, the PopAlign paper [1] investigates so-called elicitive contrast for preference data synthesis, which generates good vs. bad principles for each instruction as the thoughts for contrastive response generation. Such dynamic (or instruction-dependent) principles may benefit from the unlimited e
1. Developing constitutional principles from feedback data is an important research problem for building an interpretable preference learning framework. 2. The algorithm is tested on four datasets covering a synthetic setting, human-annotated data, individual user preferences, and group preferences.
1. Without establishing causality between the principles and annotator rationale, the framework risks over-simplifying or even misrepresenting the underlying preferences. For example, it is possible that the principles reflect incidental biases of the model or dataset rather than genuine human values. This could lead to misleading interpretations and false assumptions about user or demographic intentions. 2. ICAI's approach inherently admits multiple valid constitutions for the same dataset, d
Taxonomy
Topics: Artificial Intelligence in Law · Ethics and Social Impacts of AI · Legal and Constitutional Studies
Methods: Sparse Evolutionary Training · ALIGN
