Compared to What? Baselines and Metrics for Counterfactual Prompting

Zihao Yang; Mosh Levy; Yoav Goldberg; Byron C. Wallace

arXiv:2605.01048·cs.CL·May 5, 2026

TL;DR

This paper emphasizes the importance of baseline controls in counterfactual prompting evaluations, demonstrating that many observed effects are indistinguishable from general model sensitivity, and proposes a statistical framework for more accurate effect measurement.

Contribution

It introduces a framework comparing targeted interventions to paraphrasing baselines, improving the robustness of effect attribution in counterfactual prompting evaluations.

Findings

01. Many observed effects are statistically indistinguishable from baseline paraphrasing.

02. The proposed framework reduces false attribution of sensitivity to specific factors.

03. Per-sample metrics outperform aggregate metrics in detecting true effects.

Abstract

Counterfactual prompting (i.e., perturbing a single factor and measuring output change) is widely used to evaluate properties such as LLM bias and chain-of-thought (CoT) faithfulness. In this work we argue that observed effects cannot be attributed to the targeted factor without accounting for baseline "meaning-preserving" modifications to text, which establish general model sensitivity. The reason is that every counterfactual edit is a compound treatment that bundles the variable of interest with incidental surface-form variation, which violates treatment variation irrelevance. We observe prediction flip rates of 14.9% on MedQA when we surgically change patient gender. However, this is statistically indistinguishable from the flip rate induced by simply paraphrasing inputs (14.1%). In this case, it would therefore be unwarranted to conclude that the LLM is especially sensitive to patient gender. To account for…
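To make the comparison in the abstract concrete, here is a minimal sketch of checking a targeted flip rate against a paraphrase baseline with a pooled two-proportion z-test. This is not the paper's framework, which the truncated abstract does not specify. The function name `two_proportion_z_test` and the sample size `n = 1000` are assumptions for illustration only, since the abstract does not report how many examples were evaluated. The test also treats the two conditions as independent samples, a simplification when both perturbations are applied to the same questions.

```python
import math


def two_proportion_z_test(flips_a, n_a, flips_b, n_b):
    """Two-sided pooled z-test for the difference between two flip rates."""
    p_a = flips_a / n_a
    p_b = flips_b / n_b
    # Pooled flip rate under the null hypothesis that both rates are equal.
    pooled = (flips_a + flips_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: P(|Z| >= |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical sample size: the abstract gives rates, not counts.
n = 1000
gender_flips = 149      # 14.9% flip rate under the gender edit
paraphrase_flips = 141  # 14.1% flip rate under paraphrasing

z, p = two_proportion_z_test(gender_flips, n, paraphrase_flips, n)
print(f"z = {z:.2f}, p = {p:.2f}")
```

Under these assumed counts the p-value is far above 0.05, so the difference between the two flip rates would not count as evidence that gender matters beyond general sensitivity to rewording. With a much larger sample the same 0.8-point gap could become significant, which is why the actual evaluation size matters.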

Code & Models

Repositories

redagavin/counterfactual-prompting-baselines (GitHub)