Beyond Correctness: Learning Robust Reasoning via Transfer
Hyunseok Lee, Soheil Abbasloo, Jihoon Tack, Jinwoo Shin

TL;DR
This paper introduces RLTR, a reinforcement learning method that enhances the robustness and transferability of reasoning in large language models, leading to more reliable and sample-efficient reasoning capabilities.
Contribution
The paper proposes RLTR, a novel transfer-based reward mechanism that improves reasoning robustness and efficiency in large language models.
Findings
RLTR improves sampling consistency and answer accuracy.
RLTR achieves similar performance with fewer training steps.
On MATH500, RLTR outperforms RLVR in accuracy and efficiency.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently strengthened LLM reasoning, but its focus on final answer correctness leaves a critical gap: it does not ensure the robustness of the reasoning process itself. We adopt a simple philosophical view: robust reasoning should remain useful beyond the mind that produced it. We therefore treat reasoning as a form of meaning transfer that must survive truncation, reinterpretation, and continuation. Building on this principle, we introduce Reinforcement Learning with Transferable Reward (RLTR), which operationalizes robustness via a transfer reward that tests whether a partial reasoning prefix from one model can guide a separate model to the correct answer. This encourages LLMs to produce reasoning that is stable, interpretable, and genuinely generalizable. Our approach improves sampling consistency while improving final answer accuracy,…
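The transfer reward described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `truncate_prefix`, `transfer_reward`, and `toy_receiver` are hypothetical, the receiver model is a toy stand-in function, and averaging over several truncation fractions is an assumption made here for illustration. The abstract states only that a partial reasoning prefix from one model is handed to a separate model, which must then reach the correct answer.

```python
from typing import Callable, List, Sequence

# Hypothetical receiver type: takes a question and a reasoning prefix, returns a final answer.
Receiver = Callable[[str, List[str]], str]


def truncate_prefix(reasoning_tokens: Sequence[str], fraction: float) -> List[str]:
    """Keep the first `fraction` of the reasoning tokens (rounded down)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    cut = int(len(reasoning_tokens) * fraction)
    return list(reasoning_tokens[:cut])


def transfer_reward(
    question: str,
    reasoning_tokens: Sequence[str],
    reference_answer: str,
    receiver: Receiver,
    fractions: Sequence[float] = (0.25, 0.5, 0.75),  # assumed truncation points
) -> float:
    """Fraction of truncated prefixes from which the receiver reaches the reference answer."""
    hits = 0
    for fraction in fractions:
        prefix = truncate_prefix(reasoning_tokens, fraction)
        answer = receiver(question, prefix)
        hits += int(answer.strip() == reference_answer.strip())
    return hits / len(fractions)


def toy_receiver(question: str, prefix: List[str]) -> str:
    """Stand-in for a separate model: it only 'solves' the task if the prefix already shows 4."""
    return "4" if "4" in prefix else "unknown"


tokens = ["2", "+", "2", "=", "4", "so", "answer", "4"]
reward = transfer_reward("What is 2 + 2?", tokens, "4", toy_receiver)
print(reward)
```

In this toy run only the longest prefix (75% of the tokens) contains enough information for the receiver, so one of the three truncation points succeeds. In an actual training setup the receiver would be a second language model continuing from the prefix, and this score would presumably be combined with the usual correctness reward; that combination is not specified in the text above.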
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics
