Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models

Quintin Pope; Ajay Hayagreeve Balaji; Jacques Thibodeau; Xiaoli Fern

arXiv:2605.05090·cs.CL·May 7, 2026


TL;DR

This paper introduces an automated evaluation pipeline that detects behavioral changes in large language models caused by interventions and describes them as human-readable, statistically validated hypotheses.

Contribution

The authors develop a contrastive, human-readable, statistically validated method for identifying and summarizing behavioral shifts in language models after interventions.

Findings

01. Successfully recovers known behavioral changes in synthetic tests.

02. Detects both intended and unexpected shifts in real-world interventions.

03. Does not falsely identify differences when none exist.

Abstract

We present an automated, contrastive evaluation pipeline for auditing the behavioral impact of interventions on large language models. Given a base model $M_{1}$ and an intervention model $M_{2}$, our method compares their free-form, multi-token generations across aligned prompt contexts and produces human-readable, statistically validated natural-language hypotheses describing how the models differ, along with recurring themes that summarize patterns across validated hypotheses. We evaluate the approach in a synthetic setting by injecting known behavioral changes and showing that the pipeline reliably recovers them. We then apply it to three real-world interventions (reasoning distillation, knowledge editing, and unlearning), demonstrating that the method surfaces both intended and unexpected behavioral shifts, distinguishes large from subtle interventions, and does not hallucinate…
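The abstract does not say which statistical test the pipeline uses, so the following is only a minimal sketch of what "statistically validated" could look like, under stated assumptions. It assumes each hypothesis has already been reduced to one binary label per prompt for each model (for example, a judge marking whether the described behavior appears in the generation from $M_{1}$ and in the generation from $M_{2}$). It then applies an exact McNemar test to the paired labels and a Holm correction across hypotheses. The function names `exact_mcnemar_p` and `holm_reject` are hypothetical and are not taken from the paper.

```python
from math import comb


def exact_mcnemar_p(m1_labels, m2_labels):
    """Two-sided exact McNemar p-value for paired binary labels.

    m1_labels[i] and m2_labels[i] are assumed to be labels for the same
    prompt, one per model (an assumption of this sketch, not the paper).
    """
    if len(m1_labels) != len(m2_labels):
        raise ValueError("labels must be aligned per prompt")
    # Discordant pairs: the behavior appears for exactly one of the models.
    b = sum(1 for x, y in zip(m1_labels, m2_labels) if not x and y)
    c = sum(1 for x, y in zip(m1_labels, m2_labels) if x and not y)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    # Under the null, b ~ Binomial(n, 0.5); double the smaller tail.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm_reject(p_values, alpha=0.05):
    """Holm step-down correction; True marks a rejected null hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # all larger p-values are retained as well
    return reject


# Hypothetical labels for two candidate hypotheses over 10 aligned prompts.
shifted = exact_mcnemar_p([0] * 10, [1] * 10)
unchanged = exact_mcnemar_p([1, 0] * 5, [1, 0] * 5)
decisions = holm_reject([shifted, unchanged])
```

In this toy input, the first hypothesis has ten discordant pairs that all point the same way, while the second has none, so only the first survives the correction. A paired test is used here because the abstract describes aligned prompt contexts. An unpaired test would be the natural substitute if generations were not matched per prompt.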
