Counterfactual Simulation Training for Chain-of-Thought Faithfulness

Peter Hase; Christopher Potts

arXiv:2602.20710 · cs.AI · February 25, 2026

TL;DR

This paper introduces Counterfactual Simulation Training (CST), a training method that aims to make Chain-of-Thought (CoT) reasoning in large language models more faithful. CST rewards CoTs that let a simulator accurately predict the model's outputs on counterfactual inputs, so that inspecting the CoT gives a more reliable picture of why the model produced its output.

Contribution

CST is a new training approach that improves CoT faithfulness through counterfactual simulation. It outperforms prompting baselines and supports efficient rewriting of unfaithful CoTs.

Findings

01

CST improves monitor accuracy on cue-based counterfactuals by 35 points.

02

CST enhances simulatability over generic counterfactuals by 2 points.

03

Larger models benefit more from CST in terms of faithfulness.

Abstract

Inspecting Chain-of-Thought reasoning is among the most common means of understanding why an LLM produced its output. But well-known problems with CoT faithfulness severely limit what insights can be gained from this practice. In this paper, we introduce a training method called Counterfactual Simulation Training (CST), which aims to improve CoT faithfulness by rewarding CoTs that enable a simulator to accurately predict a model's outputs over counterfactual inputs. We apply CST in two settings: (1) CoT monitoring with cue-based counterfactuals, to detect when models rely on spurious features, reward hack, or are sycophantic, and (2) counterfactual simulation over generic model-based counterfactuals, to encourage models to produce more faithful, generalizable reasoning in the CoT. Experiments with models up to 235B parameters show that CST can substantially improve monitor accuracy on…
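The reward described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function `simulation_reward`, the rule-based `toy_simulator`, the example prompts, and the choice to score the reward as plain simulator accuracy are all assumptions made for illustration. The sketch only shows the general idea that a CoT is scored by how well a simulator, given that CoT, predicts the model's actual outputs on counterfactual inputs.

```python
from typing import Callable, Sequence


def simulation_reward(
    cot: str,
    counterfactual_inputs: Sequence[str],
    model_outputs: Sequence[str],
    simulator: Callable[[str, str], str],
) -> float:
    """Fraction of counterfactual inputs on which the simulator, given the
    CoT, predicts the model's actual output (hypothetical reward definition)."""
    if len(counterfactual_inputs) != len(model_outputs):
        raise ValueError("inputs and outputs must have the same length")
    if not counterfactual_inputs:
        return 0.0
    correct = sum(
        simulator(cot, x) == y
        for x, y in zip(counterfactual_inputs, model_outputs)
    )
    return correct / len(counterfactual_inputs)


def toy_simulator(cot: str, question: str) -> str:
    """Hypothetical rule-based simulator: if the CoT admits to using a hint
    and the question contains one, predict the hinted answer; else predict 'A'."""
    lowered = question.lower()
    if "hint" in cot.lower() and "hint:" in lowered:
        return lowered.split("hint:")[1].strip()[:1].upper()
    return "A"


# Toy setup: the (imagined) model follows the cue whenever one is present.
counterfactuals = ["Q1 hint: B", "Q1 hint: C", "Q1"]
model_outputs = ["B", "C", "A"]

faithful_cot = "I followed the hint in the prompt."
unfaithful_cot = "I worked through the arithmetic."

faithful_reward = simulation_reward(
    faithful_cot, counterfactuals, model_outputs, toy_simulator
)
unfaithful_reward = simulation_reward(
    unfaithful_cot, counterfactuals, model_outputs, toy_simulator
)
```

In this toy setup the CoT that acknowledges the cue earns a higher reward than the one that omits it, because only the former lets the simulator anticipate how the model's answer changes when the cue changes. A real simulator would presumably be another model rather than a hand-written rule.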

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Data Visualization and Analytics