What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data

Rajiv Movva; Smitha Milli; Sewon Min; Emma Pierson

arXiv:2510.26202 · cs.CL · April 14, 2026

TL;DR

WIMHF uses sparse autoencoders to explain what human feedback data encodes. It reveals diverse preferences and safety concerns in the data, and it enables better data curation and personalization.

Contribution

The paper introduces WIMHF, a sparse-autoencoder-based method that automatically extracts interpretable features from human feedback datasets without pre-specified hypotheses.
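To make the idea concrete, here is a minimal sketch of a sparse autoencoder forward pass over preference pairs. Only the use of sparse autoencoders comes from the summary above. The top-k activation, the choice to encode the difference between two response embeddings, the random weights, and the names `topk_sae_encode` and `sae_decode` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def topk_sae_encode(x, W_enc, b_enc, k):
    """Encode rows of x, keeping at most k non-negative activations per row.

    Hypothetical helper: the top-k sparsity rule is an assumption.
    """
    pre = np.maximum(x @ W_enc + b_enc, 0.0)        # ReLU pre-activations, shape (n, m)
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]  # column indices of the k largest per row
    z = np.zeros_like(pre)
    rows = np.arange(pre.shape[0])[:, None]
    z[rows, idx] = pre[rows, idx]                   # zero out everything else
    return z

def sae_decode(z, W_dec, b_dec):
    """Linearly reconstruct the input from the sparse code (hypothetical helper)."""
    return z @ W_dec + b_dec

rng = np.random.default_rng(0)
n, d, m, k = 8, 16, 32, 4                 # pairs, embedding dim, dictionary size, active features

# Stand-ins for embeddings of the two responses in each preference pair (assumed input).
emb_a = rng.normal(size=(n, d))
emb_b = rng.normal(size=(n, d))
diff = emb_a - emb_b

# Random weights for illustration; a real autoencoder would be trained to reconstruct diff.
W_enc = rng.normal(scale=0.1, size=(d, m))
b_enc = np.zeros(m)
W_dec = rng.normal(scale=0.1, size=(m, d))
b_dec = np.zeros(d)

z = topk_sae_encode(diff, W_enc, b_enc, k)
recon = sae_decode(z, W_dec, b_dec)
active_per_row = (z != 0).sum(axis=1)     # at most k features fire per pair
```

Each column of `z` is a candidate feature. In a trained autoencoder, one would then inspect the pairs on which a feature fires to give it a human-readable description.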

Findings

01 Across 7 datasets, a small number of human-interpretable features account for the majority of the preference prediction signal achieved by black-box models.

02 Re-labeling the harmful examples that WIMHF surfaces improves safety by 37%.

03 Annotator-specific feature weights enable personalized preference prediction.
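The personalization finding can be pictured as a logistic preference model with a shared weight vector plus a per-annotator offset over the interpretable features. The sketch below is an assumption-laden illustration on synthetic data: the model form, the L2 penalty on the offsets, the optimizer settings, and the name `fit_personalized` are hypothetical and not taken from the paper.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_personalized(z, ann, y, n_ann, lr=0.2, steps=500, lam=0.01):
    """Fit shared weights plus per-annotator offsets by gradient descent.

    Hypothetical helper. z: (n, m) feature values, ann: (n,) annotator ids,
    y: (n,) labels in {0, 1} (1 means the first response was preferred).
    """
    n, m = z.shape
    w_shared = np.zeros(m)
    W_ann = np.zeros((n_ann, m))
    for _ in range(steps):
        eff = w_shared + W_ann[ann]               # effective weights per example, (n, m)
        p = sigmoid(np.sum(z * eff, axis=1))      # predicted preference probability
        err = (p - y)[:, None] * z / n            # per-example logistic-loss gradient
        grad_ann = np.zeros_like(W_ann)
        np.add.at(grad_ann, ann, err)             # accumulate gradients per annotator
        w_shared -= lr * err.sum(axis=0)
        W_ann -= lr * (grad_ann + lam * W_ann)    # L2 penalty on offsets (assumed)
    return w_shared, W_ann

# Synthetic data: two annotators who disagree about feature 0.
rng = np.random.default_rng(0)
n = 400
z = rng.normal(size=(n, 3))
ann = rng.integers(0, 2, size=n)
true_w = np.array([[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0]])
y = (np.sum(z * true_w[ann], axis=1) > 0).astype(float)

w_shared, W_ann = fit_personalized(z, ann, y, n_ann=2)
eff = w_shared + W_ann                            # (2, 3) effective weights per annotator
pred = sigmoid(np.sum(z * eff[ann], axis=1)) > 0.5
accuracy = float(np.mean(pred == (y > 0.5)))
```

In this toy setup the two annotators end up with opposite-signed weights on feature 0, which a single shared weight vector could not represent.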

Abstract

Human feedback can alter language models in unpredictable and undesirable ways, as practitioners lack a clear understanding of what feedback data encodes. While prior work studies preferences over certain attributes (e.g., length or sycophancy), automatically extracting relevant features without pre-specifying hypotheses remains challenging. We introduce What's In My Human Feedback? (WIMHF), a method to explain feedback data using sparse autoencoders. WIMHF characterizes both (1) the preferences a dataset is capable of measuring and (2) the preferences that the annotators actually express. Across 7 datasets, WIMHF identifies a small number of human-interpretable features that account for the majority of the preference prediction signal achieved by black-box models. These features reveal a wide diversity in what humans prefer, and the role of dataset-level context: for example, users on…

Code & Models

Datasets

rmovva/wimhf-data
dataset · 71 dl

Videos

What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data · slideslive