TL;DR
This paper emphasizes the importance of baseline controls in counterfactual prompting evaluations, demonstrating that many observed effects are indistinguishable from general model sensitivity, and proposes a statistical framework for more accurate effect measurement.
Contribution
It introduces a framework comparing targeted interventions to paraphrasing baselines, improving the robustness of effect attribution in counterfactual prompting evaluations.
Findings
Many observed effects are statistically indistinguishable from baseline paraphrasing.
The proposed framework reduces false attribution of sensitivity to specific factors.
Per-sample metrics outperform aggregate metrics in detecting true effects.
Abstract
Counterfactual prompting (i.e., perturbing a single factor and measuring output change) is widely used to evaluate things like LLM bias and CoT faithfulness. But in this work we argue that observed effects cannot be attributed to the targeted factor without accounting for baseline "meaning-preserving" modifications to text that establish general model sensitivity. This is because every counterfactual edit is a compound treatment that bundles the variable of interest with incidental surface-form variation; this violates treatment variation irrelevance. We observe prediction flip rates on MedQA of 14.9% when we surgically change patient gender. However, this is statistically indistinguishable from the flip rates induced by simply paraphrasing inputs (14.1%). In this case, it would therefore be unwarranted to conclude that the LLM is especially sensitive to patient gender. To account for…
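The comparison described in the abstract, a targeted edit's flip rate against a paraphrase baseline's flip rate, can be illustrated with a simple significance test. The sketch below is not the paper's framework, which the truncated abstract does not spell out. It is a generic paired sign-flip permutation test, and it assumes each item has one flip indicator under the targeted edit and one under a paraphrase of the same item. The function names `flip_rate` and `paired_permutation_test` are hypothetical.

```python
import random


def flip_rate(flips):
    """Fraction of items whose prediction changed after the edit."""
    return sum(flips) / len(flips)


def paired_permutation_test(targeted_flips, paraphrase_flips,
                            n_resamples=10000, seed=0):
    """Two-sided paired sign-flip permutation test (illustrative only).

    Inputs are equal-length sequences of 0/1 flags, one pair per item:
    did the prediction flip under the targeted edit, and did it flip
    under a meaning-preserving paraphrase of the same item?
    Returns (difference in flip rates, approximate p-value).
    """
    if len(targeted_flips) != len(paraphrase_flips):
        raise ValueError("both conditions must cover the same items")
    rng = random.Random(seed)
    diffs = [int(t) - int(p) for t, p in zip(targeted_flips, paraphrase_flips)]
    n = len(diffs)
    observed = sum(diffs) / n
    extreme = 0
    for _ in range(n_resamples):
        # Under the null, the two condition labels are exchangeable
        # within each item, so each per-item difference may change sign.
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(resampled) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


if __name__ == "__main__":
    # Synthetic flags for illustration only; these are not the paper's data.
    gen = random.Random(1)
    targeted = [1 if gen.random() < 0.15 else 0 for _ in range(500)]
    paraphrase = [1 if gen.random() < 0.14 else 0 for _ in range(500)]
    diff, p = paired_permutation_test(targeted, paraphrase)
    print(f"targeted flip rate:   {flip_rate(targeted):.3f}")
    print(f"paraphrase flip rate: {flip_rate(paraphrase):.3f}")
    print(f"difference: {diff:+.3f}, p-value: {p:.3f}")
```

A large p-value here would mean the targeted edit's flip rate cannot be separated from general sensitivity to rewording, which is the situation the abstract reports for patient gender on MedQA (14.9% versus 14.1%).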
