Generalization of RLVR Using Causal Reasoning as a Testbed
Brian Lu, Hongyu Zhao, Shuo Sun, Hao Peng, Rui Ding, Hongyuan Mei

TL;DR
This paper empirically investigates how reinforcement learning with verifiable rewards (RLVR) enhances causal reasoning in large language models, showing that its benefits depend on model size, initial reasoning skills, and query complexity.
Contribution
It provides a systematic study of RLVR's impact on causal reasoning generalization across different model scales and query complexities, highlighting conditions for its effectiveness.
Findings
RLVR outperforms supervised fine-tuning in generalization within and across query levels.
Effectiveness of RLVR depends on the model's initial reasoning competence.
RLVR improves marginalization strategies and reduces errors in probability calculations.
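To make "marginalization strategy" concrete, here is a minimal sketch. It assumes a hypothetical binary chain A -> B -> C with made-up probabilities (none of these numbers come from the paper). It contrasts summing the full joint over all hidden variables with eliminating one variable at a time; both give the same marginal, but the incremental route has fewer terms in which to make an arithmetic slip.

```python
from itertools import product

# Hypothetical conditional probabilities for a binary chain A -> B -> C.
# All values are illustrative assumptions, not taken from the paper.
P_A1 = 0.3                          # P(A=1)
P_B1_GIVEN_A = {0: 0.2, 1: 0.7}     # P(B=1 | A=a)
P_C1_GIVEN_B = {0: 0.1, 1: 0.6}     # P(C=1 | B=b)


def bern(p1, value):
    """Probability of a binary value given P(value = 1) = p1."""
    return p1 if value == 1 else 1.0 - p1


def marginal_c_bruteforce():
    """P(C=1) by summing the full joint over every (a, b) assignment."""
    return sum(
        bern(P_A1, a) * bern(P_B1_GIVEN_A[a], b) * P_C1_GIVEN_B[b]
        for a, b in product((0, 1), repeat=2)
    )


def marginal_c_incremental():
    """P(C=1) by eliminating A first, then B."""
    p_b1 = sum(bern(P_A1, a) * P_B1_GIVEN_A[a] for a in (0, 1))  # P(B=1)
    return sum(bern(p_b1, b) * P_C1_GIVEN_B[b] for b in (0, 1))


if __name__ == "__main__":
    print(marginal_c_bruteforce(), marginal_c_incremental())
```

On a three-node chain the difference is small, but the brute-force sum grows exponentially with the number of hidden variables while the incremental version grows linearly along the chain.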
Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for post-training large language models (LLMs) on complex reasoning tasks. Yet, the conditions under which RLVR yields robust generalization remain underexplored. This paper provides an empirical study of RLVR generalization in the setting of probabilistic inference over causal graphical models. This setting offers two natural axes along which to examine generalization: (i) the level of the probabilistic query -- associational, interventional, or counterfactual -- and (ii) the structural complexity of the query, measured by the size of its relevant subgraph. We construct a dataset of causal graphs and queries spanning these difficulty axes and fine-tune Qwen-2.5-Instruct models using RLVR or supervised fine-tuning (SFT). We vary both the model scale (3B-32B) and the query level included in…
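The difference between the first two query levels can be illustrated with a small sketch. It assumes a hypothetical confounded graph Z -> X, Z -> Y, X -> Y over binary variables, with invented probability tables (not the paper's dataset). The associational query conditions on X, while the interventional query uses the truncated factorization, which drops the mechanism P(X | Z).

```python
from itertools import product

# Hypothetical tables for the graph Z -> X, Z -> Y, X -> Y (all binary).
# Values are illustrative assumptions, not taken from the paper.
P_Z = {0: 0.5, 1: 0.5}                      # P(Z=z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}             # P(X=1 | Z=z)
P_Y1_GIVEN_XZ = {                           # P(Y=1 | X=x, Z=z), keyed by (x, z)
    (0, 0): 0.1, (0, 1): 0.5,
    (1, 0): 0.3, (1, 1): 0.9,
}


def bern(p1, value):
    """Probability of a binary value given P(value = 1) = p1."""
    return p1 if value == 1 else 1.0 - p1


def joint(z, x, y):
    """Observational joint P(Z=z, X=x, Y=y)."""
    return P_Z[z] * bern(P_X1_GIVEN_Z[z], x) * bern(P_Y1_GIVEN_XZ[(x, z)], y)


def associational(x):
    """P(Y=1 | X=x): condition on X in the observational joint."""
    numerator = sum(joint(z, x, 1) for z in (0, 1))
    denominator = sum(joint(z, x, y) for z, y in product((0, 1), repeat=2))
    return numerator / denominator


def interventional(x):
    """P(Y=1 | do(X=x)): truncated factorization, omitting P(X | Z)."""
    return sum(P_Z[z] * P_Y1_GIVEN_XZ[(x, z)] for z in (0, 1))


if __name__ == "__main__":
    print(associational(1), interventional(1))
```

With these numbers the two queries disagree (about 0.78 versus 0.6 for X=1) because Z confounds X and Y. Counterfactual queries are left out of the sketch; they additionally require structural equations with explicit exogenous noise.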
Peer Reviews
Decision: ICLR 2026 Poster
- The choice of probabilistic inference in causal graphical models as a testbed is genuinely innovative. Unlike prior RLVR generalization studies that focus on text/visual reasoning tasks, this formal mathematical domain enables precise control and analysis.
- The findings have practical implications: practitioners should check if their base model has sufficient reasoning capability before investing in RLVR. The identification that counterfactual reasoning remains unsolved even with RLVR and 32B
- SFT is trained only to predict final answers while RLVR generates full reasoning chains. This creates an asymmetric comparison that conflates two factors: (1) reasoning vs. direct prediction and (2) RL vs. supervised learning. A fair strategy is to include an SFT baseline trained on optimal reasoning chains (generated by the solver or sampled from successful RLVR rollouts). This would isolate whether gains come from RL exploration or simply having reasoning chains.
- The paper observes 3B mode
The paper is well-structured and clearly written, with a logical flow that makes it easy to follow the authors' reasoning. The analysis sections are particularly strong: insightful, well-grounded, and supported by detailed experiments. The results are extensive and could serve as a valuable reference for future researchers studying the intersection of LLM post-training and causal reasoning. I was initially debating between a rating of 6 and 8 and currently lean toward the former, though I remain
1. The RLVR experiments use 7.5K and 2.5K samples, while the SFT model is trained on 5K samples. This discrepancy makes the quantitative comparison between models less reliable. I suggest adding an ablation study where models (or checkpoints) are trained on the same amount of data and for comparable GPU hours to mitigate this concern.
2. LLMs learn differently from humans as they rely primarily on language pattern recognition rather than true causal inference. The proposed causal reasoning task
1. The experiments are comprehensive and sufficient to verify the authors' claims.
2. The analysis is comprehensive; the authors provide in-depth analyses.
1. I think the first weakness lies in the writing of the abstract. The current version reads as somewhat colloquial for a formal academic paper. Specifically, the sentence "We choose this setting because causality is an important area that LLMs still struggle with, and because this setting ..." is too colloquial and lengthy. I suggest the authors consider rewriting the abstract thoroughly and using shorter sentences.
2. Similarly, in the introduction, the sentence "However, we focus on i
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
