MoL-RL: Distilling Multi-Step Environmental Feedback into LLMs for Feedback-Independent Reasoning

Kang Yang; Jingxue Chen; Qingkun Tang; Tianxiang Zhang; Qianchun Lu

arXiv:2507.20278·cs.CL·July 29, 2025

TL;DR

MoL-RL is a training paradigm that distills multi-step environmental feedback into large language models so that they can reason without external feedback; the authors report improved performance on reasoning and code generation tasks.

Contribution

The paper proposes MoL-RL, a dual-objective training framework that combines MoL (Mixture-of-Losses) continual training with GRPO-based post-training to integrate multi-step environmental feedback into LLMs, so that they can reason without external feedback loops.
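
As a rough illustration of the dual objective, the abstract describes domain-specific feedback data optimized with cross-entropy and general language ability preserved with a Kullback-Leibler term. The sketch below is not the authors' implementation. It assumes each training example is tagged by source, that the KL term compares the trained model against a frozen reference model, and that the direction is KL(reference || model). The function names and the dictionary keys are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(model_logits, target_index):
    """Negative log-probability the model assigns to the target token."""
    return -math.log(softmax(model_logits)[target_index])


def kl_divergence(ref_logits, model_logits):
    """KL(ref || model) between two next-token distributions (assumed direction)."""
    p = softmax(ref_logits)
    q = softmax(model_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mol_loss(batch):
    """Average a per-example loss selected by the example's source.

    Each example is a dict with hypothetical keys:
      "source":       "domain" (feedback data) or "general" (general corpus)
      "model_logits": logits from the model being trained
      "target":       gold token index (domain examples only)
      "ref_logits":   logits from a frozen reference model (general only)
    """
    losses = []
    for ex in batch:
        if ex["source"] == "domain":
            # Domain-specific feedback data: learn it with cross-entropy.
            losses.append(cross_entropy(ex["model_logits"], ex["target"]))
        else:
            # General data: stay close to the reference distribution.
            losses.append(kl_divergence(ex["ref_logits"], ex["model_logits"]))
    return sum(losses) / len(losses)
```

In this sketch, a general-domain example contributes zero loss whenever the trained model's distribution matches the reference exactly, which is the sense in which the KL term preserves existing behavior.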

Findings

01. Achieves state-of-the-art results on mathematical reasoning benchmarks.
02. Maintains strong performance across different model scales.
03. Enables feedback-independent reasoning through a novel distillation process.

Abstract

Large language models (LLMs) face significant challenges in effectively leveraging sequential environmental feedback (EF) signals, such as natural language evaluations, for feedback-independent chain-of-thought (CoT) reasoning. Existing approaches either convert EF into scalar rewards, losing rich contextual information, or employ refinement datasets, failing to exploit the multi-step and discrete nature of EF interactions. To address these limitations, we propose MoL-RL, a novel training paradigm that integrates multi-step EF signals into LLMs through a dual-objective optimization framework. Our method combines MoL (Mixture-of-Losses) continual training, which decouples domain-specific EF signals (optimized via cross-entropy loss) and general language capabilities (preserved via Kullback-Leibler divergence), with GRPO-based post-training to distill sequential EF interactions into…
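
For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that group-relative normalization. It is a generic illustration, not the paper's training code: the reward values are made up, the function name is hypothetical, and the use of the population standard deviation and the `eps` constant are assumptions that vary between implementations.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    better than the group average get a positive advantage and worse ones a
    negative advantage. `eps` (an assumed constant) avoids division by zero
    when every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one prompt.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

When all responses in a group earn the same reward, every advantage is zero and that group contributes no learning signal under this normalization.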
