DeCoRL: Decoupling Reasoning Chains via Parallel Sub-Step Generation and Cascaded Reinforcement for Interpretable and Scalable RLHF

Ziyuan Gao; Di Liang; Xianjie Wu; Philippe Morel; Minlong Peng

arXiv:2511.19097 · cs.CL · November 25, 2025

TL;DR

DeCoRL introduces a modular, parallel reasoning framework with independent sub-step scoring, significantly improving speed, interpretability, and energy efficiency in reinforcement learning for complex reasoning tasks.

Contribution

The paper proposes DeCoRL, a novel framework that decouples reasoning into parallel sub-steps with modular rewards, enabling scalable, interpretable, and real-time reinforcement learning.

Findings

01. Achieves 3.8x faster inference compared to sequential methods.

02. Improves interpretability by 22.7% through explicit reward attribution.

03. Reduces energy consumption by 72.4% and increases throughput by 68%.

Abstract

Existing reinforcement learning methods for Chain-of-Thought reasoning suffer from two critical limitations. First, they operate as monolithic black boxes that provide undifferentiated reward signals, obscuring individual step contributions and hindering error diagnosis. Second, sequential decoding has O(n) time complexity, which makes real-time deployment impractical for complex reasoning tasks. We present DeCoRL (Decoupled Reasoning Chains via Coordinated Reinforcement Learning), a novel framework that transforms reasoning from sequential processing into collaborative modular orchestration. DeCoRL trains lightweight specialized models to generate reasoning sub-steps concurrently, eliminating sequential bottlenecks through parallel processing. To enable precise error attribution, the framework designs modular reward functions that score each sub-step independently. Cascaded DRPO…
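
The two ideas named in the abstract, concurrent sub-step generation and independent per-step scoring, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The sub-step names (`parse`, `plan`, `verify`), the stand-in generator functions, and the `nonempty_reward` scorer are all hypothetical, and the sketch assumes each sub-step depends only on the problem statement so that the steps can run in parallel. It does not attempt to model the cascaded optimization stage.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple

# Hypothetical stand-ins for the lightweight specialized models: each maps a
# problem statement to the text of one reasoning sub-step.
def parse_step(problem: str) -> str:
    return f"parsed: {problem.strip().lower()}"

def plan_step(problem: str) -> str:
    return f"plan: split '{problem.strip()}' into sub-goals"

def verify_step(problem: str) -> str:
    return ""  # deliberately empty, to simulate one failing module

SUB_STEP_MODELS: Dict[str, Callable[[str], str]] = {
    "parse": parse_step,
    "plan": plan_step,
    "verify": verify_step,
}

# Hypothetical modular reward: scores one sub-step output on its own,
# without looking at the other sub-steps.
def nonempty_reward(output: str) -> float:
    return 1.0 if output else 0.0

REWARDS: Dict[str, Callable[[str], float]] = {
    name: nonempty_reward for name in SUB_STEP_MODELS
}

def run_decoupled(problem: str) -> Dict[str, Tuple[str, float]]:
    """Generate all sub-steps concurrently, then score each independently."""
    with ThreadPoolExecutor(max_workers=len(SUB_STEP_MODELS)) as pool:
        futures = {
            name: pool.submit(model, problem)
            for name, model in SUB_STEP_MODELS.items()
        }
        outputs = {name: fut.result() for name, fut in futures.items()}
    return {name: (out, REWARDS[name](out)) for name, out in outputs.items()}

def weakest_step(scored: Dict[str, Tuple[str, float]]) -> str:
    """Return the name of the sub-step with the lowest reward."""
    return min(scored, key=lambda name: scored[name][1])

if __name__ == "__main__":
    scored = run_decoupled("Solve 2x + 3 = 7")
    for name, (output, reward) in scored.items():
        print(f"{name}: reward={reward} output={output!r}")
    print("lowest-scoring sub-step:", weakest_step(scored))
```

Because every sub-step carries its own score, a low reward can be traced to a specific module (here the simulated `verify` failure) rather than to the reasoning chain as a whole, which is the kind of error attribution the abstract describes.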

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications