Generalization of RLVR Using Causal Reasoning as a Testbed

Brian Lu; Hongyu Zhao; Shuo Sun; Hao Peng; Rui Ding; Hongyuan Mei

arXiv:2512.20760·cs.LG·March 5, 2026

Open Access · 3 Reviews

TL;DR

This paper empirically investigates how reinforcement learning with verifiable rewards (RLVR) enhances causal reasoning in large language models, showing that its benefits depend on model size, initial reasoning skills, and query complexity.
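To make "verifiable reward" concrete, here is a minimal sketch of how a reward could be computed when the ground-truth answer is a probability. This page does not describe the paper's actual reward, so the function name `verifiable_reward`, the last-number extraction rule, and the tolerance `tol` are all assumptions made for illustration.

```python
import re


def verifiable_reward(response: str, target: float, tol: float = 0.01) -> float:
    """Hypothetical binary reward: 1.0 if the last number in the model's
    response is within `tol` of the exact answer, otherwise 0.0.

    Assumption: the final answer is the last number in the text. A real
    setup would more likely require an explicit answer tag.
    """
    # Match integers, decimals, and scientific notation.
    numbers = re.findall(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?", response)
    if not numbers:
        return 0.0  # nothing to check, so no reward
    answer = float(numbers[-1])
    return 1.0 if abs(answer - target) <= tol else 0.0


if __name__ == "__main__":
    print(verifiable_reward("P(Y=1 | X=1) = 0.40 / 0.50 = 0.8", 0.8))
    print(verifiable_reward("I think the answer is 0.65", 0.8))
```

Because the reward depends only on the final answer, the reasoning chain before it is left unconstrained. This is relevant to the reviewers' point that RLVR produces full reasoning chains.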

Contribution

It provides a systematic study of RLVR's impact on causal reasoning generalization across different model scales and query complexities, highlighting conditions for its effectiveness.

Findings

01

RLVR outperforms supervised fine-tuning in generalization within and across query levels.

02

Effectiveness of RLVR depends on the model's initial reasoning competence.

03

RLVR improves marginalization strategies and reduces errors in probability calculations.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for post-training large language models (LLMs) on complex reasoning tasks. Yet, the conditions under which RLVR yields robust generalization remain underexplored. This paper provides an empirical study of RLVR generalization in the setting of probabilistic inference over causal graphical models. This setting offers two natural axes along which to examine generalization: (i) the level of the probabilistic query -- associational, interventional, or counterfactual -- and (ii) the structural complexity of the query, measured by the size of its relevant subgraph. We construct a dataset of causal graphs and queries spanning these difficulty axes and fine-tune Qwen-2.5-Instruct models using RLVR or supervised fine-tuning (SFT). We vary both the model scale (3B-32B) and the query level included in…
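The difference between the first two query levels can be illustrated on a toy graph. The sketch below is not taken from the paper's dataset: the three-node binary network (Z → X, Z → Y, X → Y) and all of its probabilities are hypothetical. It computes an associational query by conditioning on X and an interventional query by cutting the edge into X, marginalizing over Z by enumeration in both cases. Counterfactual queries are omitted because they additionally require structural equations with exogenous noise.

```python
from itertools import product

# Hypothetical binary network: Z -> X, Z -> Y, X -> Y (Z confounds X and Y).
P_Z = {0: 0.5, 1: 0.5}                       # P(Z=z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}              # P(X=1 | Z=z)
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (0, 1): 0.5,   # P(Y=1 | X=x, Z=z), keyed by (x, z)
                 (1, 0): 0.4, (1, 1): 0.9}


def joint(z: int, x: int, y: int) -> float:
    """P(Z=z, X=x, Y=y) from the factorization P(z) P(x|z) P(y|x,z)."""
    px = P_X1_GIVEN_Z[z] if x == 1 else 1 - P_X1_GIVEN_Z[z]
    py = P_Y1_GIVEN_XZ[(x, z)] if y == 1 else 1 - P_Y1_GIVEN_XZ[(x, z)]
    return P_Z[z] * px * py


def associational(x: int) -> float:
    """P(Y=1 | X=x): condition on X, marginalize Z under the observed joint."""
    numerator = sum(joint(z, x, 1) for z in (0, 1))
    denominator = sum(joint(z, x, y) for z, y in product((0, 1), repeat=2))
    return numerator / denominator


def interventional(x: int) -> float:
    """P(Y=1 | do(X=x)): the edge Z -> X is cut, so Z keeps its prior."""
    return sum(P_Z[z] * P_Y1_GIVEN_XZ[(x, z)] for z in (0, 1))


if __name__ == "__main__":
    print(f"P(Y=1 | X=1)     = {associational(1):.2f}")   # 0.80
    print(f"P(Y=1 | do(X=1)) = {interventional(1):.2f}")  # 0.65
```

The two answers differ (0.80 versus 0.65) because observing X = 1 also raises the probability that Z = 1, while intervening on X leaves Z at its prior. In this toy graph all three nodes are relevant to both queries. Per the abstract, the paper measures structural complexity by the size of this relevant subgraph.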

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The choice of probabilistic inference in causal graphical models as a testbed is genuinely innovative. Unlike prior RLVR generalization studies that focus on text/visual reasoning tasks, this formal mathematical domain enables precise control and analysis.
- The findings have practical implications: practitioners should check if their base model has sufficient reasoning capability before investing in RLVR. The identification that counterfactual reasoning remains unsolved even with RLVR and 32B

Weaknesses

- SFT is trained only to predict final answers while RLVR generates full reasoning chains. This creates an asymmetric comparison that conflates two factors: (1) reasoning vs. direct prediction and (2) RL vs. supervised learning. A fair strategy is to include an SFT baseline trained on optimal reasoning chains (generated by the solver or sampled from successful RLVR rollouts). This would isolate whether gains come from RL exploration or simply having reasoning chains.
- The paper observes 3B mode

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The paper is well-structured and clearly written, with a logical flow that makes it easy to follow the authors' reasoning. The analysis sections are particularly strong: insightful, well-grounded, and supported by detailed experiments. The results are extensive and could serve as a valuable reference for future researchers studying the intersection of LLM post-training and causal reasoning. I was initially debating between a rating of 6 and 8 and currently lean toward the former, though I remain

Weaknesses

1. The RLVR experiments use 7.5K and 2.5K samples, while the SFT model is trained on 5K samples. This discrepancy makes the quantitative comparison between models less reliable. I suggest adding an ablation study where models (or checkpoints) are trained on the same amount of data and for comparable GPU hours to mitigate this concern.
2. LLMs learn differently from humans as they rely primarily on language pattern recognition rather than true causal inference. The proposed causal reasoning task

Reviewer 03 · Rating 4 · Confidence 5

Strengths

1. The experiments are comprehensive and sufficient to verify the authors' claims.
2. The analysis is comprehensive; the authors provide in-depth analyses.

Weaknesses

1. I think the first weakness lies in the writing of the abstract. The current version reads as somewhat colloquial rather than as formal academic writing. Specifically, the sentence "We choose this setting because causality is an important area that LLMs still struggle with, and because this setting ..." is too colloquial and lengthy. I suggest the authors consider rewriting the abstract thoroughly and using shorter sentences.
2. Similarly, in the introduction, the sentence "However, we focus on i

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques