On the Eligibility of LLMs for Counterfactual Reasoning: A Decompositional Study

Shuai Yang; Qi Yang; Luoxi Tang; Yuqiao Meng; Nancy Guo; Jeremy Blackburn; Zhaohan Xi

arXiv:2505.11839·cs.AI·February 17, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a decompositional framework to analyze how large language models perform counterfactual reasoning across various tasks and modalities, revealing key factors that influence their reasoning capabilities.

Contribution

It presents a structured approach to dissect counterfactual reasoning in LLMs, covering multiple tasks and modalities, and identifies factors affecting their reasoning performance.

Findings

01

LLMs struggle with counterfactual reasoning across tasks.

02

Modality type and intermediate reasoning significantly impact performance.

03

The framework aids in developing more reliable reasoning systems.

Abstract

Counterfactual reasoning has emerged as a crucial technique for generalizing the reasoning capabilities of large language models (LLMs). By generating and analyzing counterfactual scenarios, researchers can assess the adaptability and reliability of model decision-making. Although prior work has shown that LLMs often struggle with counterfactual reasoning, it remains unclear which factors most significantly impede their performance across different tasks and modalities. In this paper, we propose a decompositional strategy that breaks counterfactual generation down into stages, from causality construction to reasoning over counterfactual interventions. To support decompositional analysis, we investigate datasets spanning diverse tasks, including natural language understanding, mathematics, programming, and vision-language tasks. Through extensive evaluations, we characterize LLM…
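The two ends of the pipeline named in the abstract, causality construction and reasoning over a counterfactual intervention, can be illustrated with a toy example. The sketch below is not the paper's benchmark or method: it assumes a hypothetical linear causal chain (exposure X, mediator M, outcome Y) with made-up equations, and applies the standard abduction, action, prediction procedure for computing a counterfactual.

```python
# Toy structural causal model: exposure X -> mediator M -> outcome Y.
# The variables and equations are hypothetical, chosen only for illustration.

def mediator(x: int, u_m: int) -> int:
    """M depends on the exposure X plus an unobserved noise term."""
    return x + u_m


def outcome(m: int, u_y: int) -> int:
    """Y depends on the mediator M plus an unobserved noise term."""
    return 2 * m + u_y


def counterfactual_outcome(x_obs: int, m_obs: int, y_obs: int, x_new: int) -> int:
    """Answer: what would Y have been if X had been x_new instead of x_obs?"""
    # Abduction: recover the noise terms consistent with what was observed.
    u_m = m_obs - x_obs
    u_y = y_obs - 2 * m_obs
    # Action: replace the exposure with the counterfactual value x_new.
    # Prediction: propagate the change through the mediator to the outcome.
    m_cf = mediator(x_new, u_m)
    return outcome(m_cf, u_y)


# Observed world: X=1, M=3, Y=7. Counterfactual query: what if X had been 0?
print(counterfactual_outcome(x_obs=1, m_obs=3, y_obs=7, x_new=0))  # prints 5
```

Skipping the mediator step and reasoning directly from X to Y would give the wrong answer here, which is the kind of stage-specific failure a decompositional analysis is meant to expose.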

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

It has been observed repeatedly that LLMs perform worse when answering counterfactual queries relative to factual ones. The paper’s attempt at understanding why in a more fine-grained fashion is a significant problem. Experimental evaluations are comprehensive in terms of the number of models and the variety of datasets they consider.

Weaknesses

It is not clear how performance on the four sub-tasks relates to end-to-end performance. Establishing this relation is important when interpreting performance on these sub-tasks as a decomposition. For instance, the paper concludes that LLMs are generally better at Task 1 than at Task 2 conditioned on correct results from Task 1 (which is supported by the experiments and indeed seems to be the case). However, it could still be the case that starting from the inputs of Task 1 and directly querying

Reviewer 02 · Rating 6 · Confidence 5

Strengths

1. The overall methodology of decomposing the counterfactual reasoning process is novel, and the experiments show that it really helps.
2. The experiments cover a wide range of datasets designed specifically for counterfactual reasoning.
3. The final proposed method seems easy to adopt for reasoning with any LLM.

Weaknesses

1. The experiments cover many datasets but lack a comparison across model scales. For example, Qwen 3 provides models at different scales; showing some results there would make the paper stronger.
2. The NER tools are designed to use BERT-like models. Would it be possible to instantiate the tools with another model using different prompts?

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. **Systematic and Granular Evaluation Framework:** The paper's primary strength is its decompositional approach, which breaks down the complex task of counterfactual reasoning into four distinct, measurable stages. This allows for a much more precise diagnosis of *where* and *why* LLMs fail, moving beyond a monolithic "pass/fail" assessment to identify specific bottlenecks, such as the particular difficulty with implicit mediators.
2. **Comprehensive and Multimodal Benchmark:** The authors

Weaknesses

I think the paper has two brief weaknesses:
* **Potentially Artificial Evaluation:** The benchmark relies on pre-annotated causal structures, which may not reflect the challenge of inferring causality from raw, unstructured data.
* **Surface-Level Diagnosis:** The analysis identifies performance bottlenecks but offers only a high-level explanation (e.g., "working memory") without deeply investigating the underlying architectural mechanisms in LLMs that cause these failures.

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications