On the Eligibility of LLMs for Counterfactual Reasoning: A Decompositional Study
Shuai Yang, Qi Yang, Luoxi Tang, Yuqiao Meng, Nancy Guo, Jeremy Blackburn, Zhaohan Xi

TL;DR
This paper introduces a decompositional framework to analyze how large language models perform counterfactual reasoning across various tasks and modalities, revealing key factors that influence their reasoning capabilities.
Contribution
It presents a structured approach to dissect counterfactual reasoning in LLMs, covering multiple tasks and modalities, and identifies factors affecting their reasoning performance.
Findings
LLMs struggle with counterfactual reasoning across tasks.
Modality type and intermediate reasoning significantly impact performance.
The framework aids in developing more reliable reasoning systems.
Abstract
Counterfactual reasoning has emerged as a crucial technique for generalizing the reasoning capabilities of large language models (LLMs). By generating and analyzing counterfactual scenarios, researchers can assess the adaptability and reliability of model decision-making. Although prior work has shown that LLMs often struggle with counterfactual reasoning, it remains unclear which factors most significantly impede their performance across different tasks and modalities. In this paper, we propose a decompositional strategy that breaks down the counterfactual generation from causality construction to the reasoning over counterfactual interventions. To support decompositional analysis, we investigate datasets spanning diverse tasks, including natural language understanding, mathematics, programming, and vision-language tasks. Through extensive evaluations, we characterize LLM…
Peer Reviews
Decision · ICLR 2026 Poster
It has been observed repeatedly that LLMs perform worse when answering counterfactual queries relative to factual ones. The paper’s attempt to understand why, in a more fine-grained fashion, addresses a significant problem. The experimental evaluation is comprehensive in the number of models and the variety of datasets considered.
It is not clear how the performance on the four sub-tasks relates to the end-to-end performance. Establishing this relation is important when interpreting performance on these sub-tasks as a decomposition. For instance, the paper concludes that LLMs are generally better at Task 1 than Task 2 conditioned on correct results from Task 1 (which is supported by experiments and indeed seems to be the case). However, it could still be the case that starting from the inputs of Task 1 and directly querying
1. The overall methodology of decomposing the counterfactual reasoning process is novel, and the experiments show that it helps. 2. The experiments cover a wide range of datasets designed specifically for counterfactual reasoning. 3. The final proposed method seems easy to adopt for reasoning with any LLM.
1. The experiments cover many datasets but lack a comparison across model scales. For example, Qwen 3 provides models at different scales, and results on them would make the paper stronger. 2. The NER tools are designed to use BERT-like models. Could the tools instead be instantiated by another model using different prompts?
1. **Systematic and Granular Evaluation Framework:** The paper's primary strength is its decompositional approach, which breaks down the complex task of counterfactual reasoning into four distinct, measurable stages. This allows for a much more precise diagnosis of *where* and *why* LLMs fail, moving beyond a monolithic "pass/fail" assessment to identify specific bottlenecks, such as the particular difficulty with implicit mediators. 2. **Comprehensive and Multimodal Benchmark:** The authors
I think the paper has two brief weaknesses: * **Potentially Artificial Evaluation:** The benchmark relies on pre-annotated causal structures, which may not reflect the challenge of inferring causality from raw, unstructured data. * **Surface-Level Diagnosis:** The analysis identifies performance bottlenecks but offers only a high-level explanation (e.g., "working memory") without deeply investigating the underlying architectural mechanisms in LLMs that cause these failures.
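One review above asks how performance on the four sub-tasks relates to end-to-end performance. The following is a minimal sketch of how that comparison could be computed, not the authors' evaluation code. The `Record` type, the function names, and the toy `records` are all hypothetical and the values are made up for illustration; the sketch assumes each example has a per-stage correctness flag plus a separate flag for a directly queried end-to-end answer.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Record:
    # Hypothetical per-example outcome: one boolean per stage, in pipeline
    # order, plus whether a direct end-to-end query was answered correctly.
    stage_correct: Sequence[bool]
    end_to_end_correct: bool


def conditional_stage_accuracy(records: List[Record]) -> List[float]:
    """Accuracy of stage k among examples whose earlier stages were all correct."""
    n_stages = len(records[0].stage_correct)
    accuracies = []
    survivors = list(records)
    for k in range(n_stages):
        if not survivors:
            # No example reached this stage, so its accuracy is undefined.
            accuracies.append(float("nan"))
            continue
        passed = [r for r in survivors if r.stage_correct[k]]
        accuracies.append(len(passed) / len(survivors))
        survivors = passed
    return accuracies


def chained_accuracy(records: List[Record]) -> float:
    """Fraction of examples with every stage correct.

    This equals the product of the conditional stage accuracies.
    """
    return sum(all(r.stage_correct) for r in records) / len(records)


def end_to_end_accuracy(records: List[Record]) -> float:
    """Fraction of examples answered correctly when queried directly."""
    return sum(r.end_to_end_correct for r in records) / len(records)


# Toy, made-up data: four examples, four stages each.
records = [
    Record((True, True, True, True), True),
    Record((True, False, False, False), True),
    Record((True, True, False, False), False),
    Record((False, False, False, False), False),
]

if __name__ == "__main__":
    print("conditional stage accuracy:", conditional_stage_accuracy(records))
    print("chained accuracy:", chained_accuracy(records))
    print("end-to-end accuracy:", end_to_end_accuracy(records))
```

A gap between the chained and end-to-end numbers would indicate that the staged pipeline and direct querying succeed on different examples, which is the kind of relation the review asks the paper to establish.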
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications
