Thought Branches: Interpreting LLM Reasoning Requires Resampling
Uzay Macar, Paul C. Bogdan, Senthooran Rajamanoharan, Neel Nanda

TL;DR
This paper advocates for resampling-based methods to analyze the reasoning processes of large language models, enabling more reliable causal insights and interventions beyond single chain-of-thought samples.
Contribution
It introduces resampling techniques for interpreting LLM reasoning, providing new tools for causal analysis, intervention assessment, and understanding unfaithful reasoning.
Findings
Resampling reveals that self-preservation sentences in the CoT have only a small causal impact on blackmail decisions in "agentic misalignment" scenarios.
Off-policy interventions, such as artificial edits to the CoT, have limited and unstable effects compared to on-policy resampling.
Resilience metrics show that critical reasoning steps are hard to remove from the CoT, and that outcomes shift significantly once they are eliminated.
Abstract
Most work interpreting reasoning models studies only a single chain-of-thought (CoT), yet these models define distributions over many possible CoTs. We argue that studying a single sample is inadequate for understanding causal influence and the underlying computation. Though fully specifying this distribution is intractable, we can measure a partial CoT's impact by resampling only the subsequent text. We present case studies using resampling to investigate model decisions. First, when a model states a reason for its action, does that reason actually cause the action? In "agentic misalignment" scenarios, we find that self-preservation sentences have small causal impact, suggesting they do not meaningfully drive blackmail. Second, are artificial edits to CoT sufficient for steering reasoning? Resampling and selecting a completion with the desired property is a principled on-policy…
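The resampling idea in the abstract (measure a partial CoT's impact by resampling only the text that follows it) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `toy_sampler`, `outcome_rate`, and `sentence_impact` are hypothetical names, the "model" is a stand-in random function, and the impact score here is simply the difference in outcome rates between a prefix that includes a sentence and one that stops just before it. The paper's own metrics may be defined differently.

```python
import random
from typing import Callable, List

# Hypothetical interface: given a partial CoT (list of sentences) and an RNG,
# return the final answer of one sampled completion.
Sampler = Callable[[List[str], random.Random], str]


def outcome_rate(sampler: Sampler, prefix: List[str],
                 is_outcome: Callable[[str], bool],
                 n_samples: int, rng: random.Random) -> float:
    """Fraction of resampled completions from `prefix` that show the outcome."""
    hits = sum(is_outcome(sampler(prefix, rng)) for _ in range(n_samples))
    return hits / n_samples


def sentence_impact(sampler: Sampler, sentences: List[str], index: int,
                    is_outcome: Callable[[str], bool],
                    n_samples: int = 2000, seed: int = 0) -> float:
    """Outcome rate with sentence `index` in the prefix minus the rate without it."""
    rng = random.Random(seed)
    with_sentence = outcome_rate(sampler, sentences[:index + 1],
                                 is_outcome, n_samples, rng)
    without_sentence = outcome_rate(sampler, sentences[:index],
                                    is_outcome, n_samples, rng)
    return with_sentence - without_sentence


def toy_sampler(prefix: List[str], rng: random.Random) -> str:
    """Stand-in for an LLM (assumption): one sentence shifts the answer odds."""
    p_a = 0.8 if "KEY STEP" in prefix else 0.3
    return "A" if rng.random() < p_a else "B"


cot = ["Some filler remark.", "KEY STEP", "Another filler remark."]
for i, sentence in enumerate(cot):
    impact = sentence_impact(toy_sampler, cot, i, lambda ans: ans == "A")
    print(f"{sentence!r}: estimated impact {impact:+.2f}")
```

In practice `toy_sampler` would be replaced by a call that continues the real model's CoT from the given prefix. Because the estimate averages over many completions, it reflects the distribution over CoTs rather than a single sample.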
