Interleaved Reasoning for Large Language Models via Reinforcement Learning

Roy Xie; David Qiu; Deepak Gopinath; Dong Lin; Yanchao Sun; Chong Wang; Saloni Potdar; Bhuwan Dhingra

arXiv:2505.19640·cs.CL·January 8, 2026

Open Access · 4 Reviews

TL;DR

This paper introduces a reinforcement learning-based training paradigm that encourages large language models to interleave reasoning and answering, significantly improving efficiency and accuracy on multi-hop questions without external tools.

Contribution

The paper presents a novel RL-based approach for interleaved reasoning in LLMs, enhancing reasoning efficiency and accuracy with a simple reward scheme and broad dataset generalization.

Findings

1. Achieves a 12.5% improvement in Pass@1 accuracy
2. Reduces reasoning length by 37%
3. Decreases time-to-first-token by over 80%

Abstract

Long chain-of-thought (CoT) significantly enhances the reasoning capabilities of large language models (LLMs). However, extensive reasoning traces lead to inefficiencies and increased time-to-first-token (TTFT). We propose a training paradigm that uses only reinforcement learning (RL) to guide reasoning LLMs to interleave thinking and answering for multi-hop questions. We observe that models inherently possess the ability to perform interleaved reasoning, which can be further enhanced through RL. We introduce a simple yet effective reward scheme to incentivize correct intermediate steps, guiding the policy model toward correct reasoning paths by leveraging intermediate signals generated during interleaved reasoning. Extensive experiments across five diverse datasets and three RL algorithms (PPO, GRPO, and REINFORCE++) demonstrate consistent improvements over traditional think-answer…
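The abstract describes a rule-based reward that credits correct intermediate steps but does not spell out the formula here. The following is a minimal sketch of one way such a reward could look, not the authors' scheme: the function names, the additive form, the exact-match comparison, and the `intermediate_weight` parameter are all assumptions made for illustration. Only the `<think>`/`<answer>` tag format comes from the source.

```python
import re

# Matches every <answer>...</answer> span; DOTALL lets answers span lines.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answers(completion: str) -> list[str]:
    """Return the stripped contents of each <answer> span, in order."""
    return [m.strip() for m in ANSWER_RE.findall(completion)]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a lenient exact match."""
    return " ".join(text.lower().split())


def interleaved_reward(
    completion: str,
    gold_intermediate: list[str],
    gold_final: str,
    intermediate_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Illustrative reward: 1.0 for a correct final answer, plus partial
    credit for intermediate answers that match the gold sub-answers."""
    answers = extract_answers(completion)
    if not answers:
        return 0.0  # no <answer> span at all: treat as a format failure

    # Assumption: the last <answer> span is the final answer.
    final_reward = 1.0 if normalize(answers[-1]) == normalize(gold_final) else 0.0

    # Assumption: earlier spans are compared position by position.
    hits = sum(
        normalize(pred) == normalize(gold)
        for pred, gold in zip(answers[:-1], gold_intermediate)
    )
    fraction = hits / len(gold_intermediate) if gold_intermediate else 0.0
    return final_reward + intermediate_weight * fraction
```

A two-hop completion with a correct intermediate answer and a correct final answer scores 1.5 under these assumptions; the same completion with a wrong intermediate answer scores 1.0.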

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The most interesting result is the significant reduction in the TTFT. This directly addresses a critical and practical limitation of using CoT in real-world applications and demonstrates the potential for faster, more efficient reasoning in LLMs.
2. The paper is generally well-structured and easy to follow. The methodology and experimental setup are clearly explained, which aids in the comprehension of the proposed interleaved reasoning framework.

Weaknesses

1. A major concern is that the interleaved reasoning appears weak when not supported by dense intermediate rewards. The performance of the model in the setting without strong intermediate reward signals is often suboptimal or even worse than standard baselines, suggesting that the RL framework alone is not sufficient to reliably improve reasoning quality.
2. While the model's performance significantly improves when strong intermediate rewards are introduced, this outcome is largely expected as

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The motivation of reducing TTFT while maintaining or improving reasoning quality is timely.
- The interleaved format is simple and compatible; the use of special tags (<think>, <answer>) is pragmatic.
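To make the TTFT motivation concrete, here is a rough sketch that counts how many tokens a completion emits before its first `<answer>` tag. It is a crude proxy under stated assumptions: whitespace splitting stands in for a real tokenizer, the function name is hypothetical, and the two example strings are invented for illustration rather than taken from the paper.

```python
def tokens_before_first_answer(completion: str, answer_tag: str = "<answer>") -> int:
    """Count whitespace-separated tokens that precede the first answer tag.

    If the tag never appears, every token in the completion is counted.
    """
    idx = completion.find(answer_tag)
    prefix = completion if idx == -1 else completion[:idx]
    return len(prefix.split())


# Invented examples: the same reasoning laid out in two formats.
think_answer = (
    "<think> step one step two step three </think> "
    "<answer> X </answer>"
)
interleaved = (
    "<think> step one </think> <answer> A </answer> "
    "<think> step two </think> <answer> X </answer>"
)

print(tokens_before_first_answer(think_answer))  # 8
print(tokens_before_first_answer(interleaved))   # 4
```

The proxy only measures when the first answer span starts; it says nothing about whether that span is useful to the user.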

Weaknesses

- Training relies on K&K and Musique, both containing explicit sub-problem labels.
- Generalization to domains lacking intermediate supervision is claimed but not empirically tested.
- No ablation shows sensitivity to ε or to the relative weighting β in Eq. 2.
- No direct wall-clock latency or token-throughput numbers are reported.
- The experiments are conducted solely on the Qwen series models, without validation on other model families. In addition, the method has not been tested on mo

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- Clear motivation.

Weaknesses

1. The definition and interpretation of the TTFT metric appear conceptually weak. In real applications, users primarily care about the final answer. For example, in Figure 1, the user’s actual need is to know the final “director,” not the model’s intermediate reasoning. If users are interested in the reasoning process, they can view the entire “think” chain, so interleaving does not inherently improve usability. Moreover, TTFT can easily be reward-hacked: a model could output any meaningless st

Reviewer 04 · Rating 4 · Confidence 4

Strengths

- The paper addresses an important issue in reasoning LLMs, namely the trade-off between long, reflective reasoning chains and real-time efficiency, and provides a clear framework to investigate this trade-off.
- The method is easy to reproduce and compatible with standard RL algorithms, offering practical insights into the impact of intermediate rewards and structured reasoning patterns.
- The experiments cover multiple datasets, RL algorithms, and detailed ablations that help clarify the empiric

Weaknesses

- Marginal technical novelty. The core mechanism of interleaved reasoning (introducing <think> and <answer> tags and training models to alternate between them) is relatively incremental. The method essentially reformulates the output format and applies a rule-based reward scheme within a conventional RL training setup.
- Potential conflict with self-reflection. Traditional think-then-answer reasoning allows the model to perform internal reflection and correction before producing a final respon

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling