Counterfactual Simulation Training for Chain-of-Thought Faithfulness
Peter Hase, Christopher Potts

TL;DR
This paper introduces Counterfactual Simulation Training (CST), a novel method to enhance the faithfulness of Chain-of-Thought reasoning in large language models by rewarding accurate counterfactual predictions, leading to more reliable interpretability.
Contribution
CST is a new training approach that improves CoT faithfulness by using counterfactual simulation, outperforming prompting baselines and offering efficient unfaithful CoT rewriting.
Findings
CST improves monitor accuracy on cue-based counterfactuals by 35 points.
CST enhances simulatability over generic counterfactuals by 2 points.
Larger models benefit more from CST in terms of faithfulness.
Abstract
Inspecting Chain-of-Thought reasoning is among the most common means of understanding why an LLM produced its output. But well-known problems with CoT faithfulness severely limit what insights can be gained from this practice. In this paper, we introduce a training method called Counterfactual Simulation Training (CST), which aims to improve CoT faithfulness by rewarding CoTs that enable a simulator to accurately predict a model's outputs over counterfactual inputs. We apply CST in two settings: (1) CoT monitoring with cue-based counterfactuals, to detect when models rely on spurious features, reward hack, or are sycophantic, and (2) counterfactual simulation over generic model-based counterfactuals, to encourage models to produce more faithful, generalizable reasoning in the CoT. Experiments with models up to 235B parameters show that CST can substantially improve monitor accuracy on…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsExplainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Data Visualization and Analytics
