What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data
Rajiv Movva, Smitha Milli, Sewon Min, Emma Pierson

TL;DR
WIMHF is a method that uses sparse autoencoders to interpret human feedback data, revealing diverse preferences and safety concerns, and enabling better data curation and personalization.
Contribution
The paper introduces WIMHF, a sparse-autoencoder-based method that automatically extracts interpretable features from human feedback datasets without pre-specified hypotheses.
Findings
Identifies key human-interpretable features that explain preference signals.
Re-labeling harmful data with WIMHF improves safety by 37%.
Enables personalized preference prediction with annotator-specific feature weights.
Abstract
Human feedback can alter language models in unpredictable and undesirable ways, as practitioners lack a clear understanding of what feedback data encodes. While prior work studies preferences over certain attributes (e.g., length or sycophancy), automatically extracting relevant features without pre-specifying hypotheses remains challenging. We introduce What's In My Human Feedback? (WIMHF), a method to explain feedback data using sparse autoencoders. WIMHF characterizes both (1) the preferences a dataset is capable of measuring and (2) the preferences that the annotators actually express. Across 7 datasets, WIMHF identifies a small number of human-interpretable features that account for the majority of the preference prediction signal achieved by black-box models. These features reveal a wide diversity in what humans prefer, and the role of dataset-level context: for example, users on…
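The abstract's second goal, characterizing the preferences annotators actually express, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the feature matrix is synthetic and stands in for sparse autoencoder activations computed on response pairs, the function names are hypothetical, and plain logistic regression is used as a generic way to relate features to preference labels. This summary does not describe the paper's actual pipeline.

```python
import numpy as np


def make_synthetic_features(n_pairs=500, n_features=6, planted=2, seed=0):
    """Build a synthetic stand-in for sparse feature activations and labels.

    Each row is one response pair (A vs. B). Entries are signed and mostly
    zero, mimicking sparse features. This is NOT real autoencoder output.
    """
    rng = np.random.default_rng(seed)
    # Roughly 30% of entries are active; the rest are exactly zero.
    mask = rng.random((n_pairs, n_features)) < 0.3
    z = rng.normal(size=(n_pairs, n_features)) * mask
    # Hypothetical ground truth: only the planted feature drives the label.
    p_prefer_a = 1.0 / (1.0 + np.exp(-3.0 * z[:, planted]))
    y = (rng.random(n_pairs) < p_prefer_a).astype(float)
    return z, y


def fit_feature_weights(z, y, lr=0.5, steps=500):
    """Fit logistic regression by gradient descent; one weight per feature."""
    w = np.zeros(z.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(z @ w)))   # predicted P(A preferred)
        w -= lr * (z.T @ (p - y)) / len(y)   # gradient of mean log loss
    return w


z, y = make_synthetic_features()
weights = fit_feature_weights(z, y)
# The feature with the largest absolute weight is the one most predictive
# of annotator choice in this toy setup.
top_feature = int(np.argmax(np.abs(weights)))
print("feature weights:", np.round(weights, 2))
print("most predictive feature index:", top_feature)
```

Under the same assumptions, fitting `fit_feature_weights` separately on each annotator's rows would give annotator-specific weights, which is one simple way to read the personalization finding listed above.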
