Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models
Quintin Pope, Ajay Hayagreeve Balaji, Jacques Thibodeau, Xiaoli Fern

TL;DR
This paper introduces an automated evaluation pipeline that detects and explains behavioral changes in large language models caused by interventions, with the aim of making model auditing more reliable and interpretable.
Contribution
The authors develop a contrastive, human-readable, statistically validated method for identifying and summarizing behavioral shifts in language models after interventions.
Findings
Successfully recovers known behavioral changes in synthetic tests.
Detects both intended and unexpected shifts in real-world interventions.
Does not falsely identify differences when none exist.
Abstract
We present an automated, contrastive evaluation pipeline for auditing the behavioral impact of interventions on large language models. Given a base model and an intervention model, our method compares their free-form, multi-token generations across aligned prompt contexts and produces human-readable, statistically validated natural-language hypotheses describing how the models differ, along with recurring themes that summarize patterns across validated hypotheses. We evaluate the approach in a synthetic setting by injecting known behavioral changes and showing that the pipeline reliably recovers them. We then apply it to three real-world interventions (reasoning distillation, knowledge editing, and unlearning), demonstrating that the method surfaces both intended and unexpected behavioral shifts, distinguishes large from subtle interventions, and does not hallucinate…
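The abstract does not say how hypotheses are statistically validated, so the following is only a minimal sketch of one way such a check could work, not the authors' procedure. It assumes a hypothetical `judge` callable that, for each aligned pair of generations, returns True when the intervention model's text fits the hypothesis better than the base model's text. The sketch then applies a one-sided exact binomial test against chance. The function names, the judge interface, and the 0.05 threshold are all illustrative assumptions.

```python
from math import comb
from typing import Callable, Sequence, Tuple


def binomial_p_value(successes: int, trials: int, p: float = 0.5) -> float:
    """One-sided exact binomial tail: P(X >= successes) under chance rate p."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def validate_hypothesis(
    base_texts: Sequence[str],
    intervention_texts: Sequence[str],
    judge: Callable[[str, str], bool],  # hypothetical interface, not from the paper
    alpha: float = 0.05,                # assumed significance threshold
) -> Tuple[int, float, bool]:
    """Count pairs where the judge says the intervention text fits the
    hypothesis better, then test that count against a 50% chance rate."""
    if len(base_texts) != len(intervention_texts):
        raise ValueError("generations must be aligned one-to-one by prompt")
    trials = len(base_texts)
    hits = sum(
        1 for base, changed in zip(base_texts, intervention_texts)
        if judge(base, changed)
    )
    p_value = binomial_p_value(hits, trials)
    return hits, p_value, p_value < alpha


if __name__ == "__main__":
    # Toy hypothesis: "the intervention model mentions 'step' more often".
    def keyword_judge(base: str, changed: str) -> bool:
        return changed.count("step") > base.count("step")

    base = ["The answer is 4."] * 10
    changed = ["First step: add. Second step: check. The answer is 4."] * 10
    print(validate_hypothesis(base, changed, keyword_judge))
```

When the two sets of generations are indistinguishable to the judge, the hit count stays near or below chance and the hypothesis is rejected, which is the behavior one would want from a pipeline that should report nothing when no difference exists.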
