Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains?

Chuxuan Hu; Yuxuan Zhu; Antony Kellermann; Caleb Biddulph; Suppakit Waiwitlikhit; Jason Benn; Daniel Kang

arXiv:2506.19733·cs.CL·March 3, 2026

Open Access 3 Reviews

TL;DR

This paper investigates whether reinforcement post training (RPT) improvements in large language models transfer effectively to unseen domains, revealing that gains are inconsistent and often diminish outside the fine-tuning data.

Contribution

The study provides the first comprehensive analysis of RPT's domain transferability, highlighting its limitations in generalizing reasoning improvements across diverse domains.

Findings

01. RPT improves performance on similar domains

02. Gains often diminish on different reasoning patterns

03. Transferability of RPT is inconsistent across domains

Abstract

Reinforcement post training (RPT) has recently shown promise in improving the reasoning abilities of large language models (LLMs). However, it remains unclear how well these improvements generalize to new domains, as prior work evaluates RPT models on data from the same domains used for post-training. To understand the generalizability of RPT, we conduct two studies with specific focus on Reinforcement Learning with Verifiable Rewards (RLVR). (1) Observational: we compare a wide range of open-weight RPT models against their corresponding base models across multiple domains, including both seen and unseen domains in their fine-tuning data. (2) Interventional: we fine-tune LLMs with RPT on single domains and evaluate their performance across multiple domains. Both studies converge on the same conclusion that, although RPT brings substantial gains on tasks similar to the fine-tuning data,…
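The comparison described in the abstract (an RPT model against its base model, split by whether a domain appeared in the fine-tuning data) can be sketched roughly as follows. This is only an illustration, not the authors' evaluation code: the function name `transfer_gaps`, the domain labels, and all accuracy values are hypothetical placeholders and are not results from the paper.

```python
from statistics import mean


def transfer_gaps(base_acc, rpt_acc, seen_domains):
    """Compare per-domain accuracy gains on seen vs. unseen domains.

    base_acc / rpt_acc: dicts mapping domain name -> accuracy in [0, 1].
    seen_domains: set of domains present in the RPT fine-tuning data.
    Assumes at least one seen and one unseen domain (mean() raises otherwise).
    """
    # Gain of the RPT model over its base model, per domain.
    gains = {d: rpt_acc[d] - base_acc[d] for d in base_acc}
    seen = [g for d, g in gains.items() if d in seen_domains]
    unseen = [g for d, g in gains.items() if d not in seen_domains]
    return {
        "gains": gains,
        "seen_mean": mean(seen),
        "unseen_mean": mean(unseen),
    }


# Placeholder accuracies for illustration only; NOT numbers from the paper.
base = {"math": 0.40, "code": 0.30, "knowledge": 0.50}
rpt = {"math": 0.55, "code": 0.33, "knowledge": 0.49}

# Hypothetical setup: the model was post-trained on math data only.
result = transfer_gaps(base, rpt, seen_domains={"math"})
print(result["seen_mean"], result["unseen_mean"])
```

A large `seen_mean` alongside a small or negative `unseen_mean` would correspond to the pattern the abstract describes: gains concentrated on tasks similar to the fine-tuning data.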

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

The topic addresses a currently open important question for reasoning LLMs. Certainly a strength of the paper is its systematic and transparent setup (in particular for the selection of tested models) and statistical evaluation.

Weaknesses

A key weakness of the study is its focus on small models. While I understand the computational limitations, it seems reasonable that a certain model complexity might be required to actually generalize across domains. Therefore, it is not clear how the findings generalize to larger models, which might in any case be better suited for complex reasoning tasks. Similarly, only one particular reinforcement learning process is tested for fine-tuning, with a single snapshot after one epoch

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The paper tackles an important and timely question about whether reasoning improvements from reinforcement post-training can truly generalize beyond the training domain.
- The study design is comprehensive and convincing, combining large-scale observational analysis of public RPT models with controlled interventional experiments under unified settings.
- The experiments are extensive and well-documented, covering 16 diverse benchmarks across mathematics, code, and knowledge reasoning with appr

Weaknesses

- The experiments are conducted on relatively small models (up to 8B) with limited-scale RPT training, leaving it unclear whether the same generalization patterns would persist under larger LLMs.
- The paper stops short of analyzing how different aspects of RPT training, such as reward signal quality or optimization dynamics, might contribute to the observed lack of cross-domain transfer, leaving the underlying cause somewhat underexplored.
- The paper does not include any longitudinal or ablati

Reviewer 03 · Rating 8 · Confidence 4

Strengths

The results, while not incredibly surprising for those with substantial experience performing RL finetuning on language models, are quite valuable to see. The study is quite broad, only models with publicly available training data are included, and the experimental design is sound.

Weaknesses

It would be helpful to list the models that you tested, both for reproducibility and for clarity. One question I have is how diverse the *base* model pool was; e.g. were most models based on Qwen (which is quite strong on math and code already), or was there a diverse set of model families included in your study? If possible, it would be very enlightening if there could be a further study on the *kinds* of reasoning each model uses, to see if there are explicit strategies common amongst them (s

Taxonomy

Topics: Behavioral and Psychological Studies · Software Reliability and Analysis Research