TL-GRPO: Turn-Level RL for Reasoning-Guided Iterative Optimization

Peiji Li; Linyang Li; Handa Sun; Wenjin Mai; Yongkang Chen; Xiaozhe Li; Yue Shen; Yichuan Ma; Yiliu Sun; Jiaxi Cao; Zhishu He; Bo Wang; Xiaoqing Zheng; Zhaori Bi; Xipeng Qiu; Qipeng Guo; Kai Chen; Dahua Lin

arXiv:2601.16480 · cs.CL · January 26, 2026

Open Access

TL;DR

TL-GRPO is a turn-level reinforcement learning algorithm for iterative optimization tasks that are guided by reasoning. On analog circuit sizing, it outperforms standard GRPO and Bayesian optimization.

Contribution

We introduce TL-GRPO, a lightweight turn-level RL method that enables fine-grained optimization in reasoning tasks with shared environment states, addressing limitations of previous trajectory-level approaches.

Findings

01. TL-GRPO outperforms standard GRPO and Bayesian optimization in analog circuit sizing.

02. A 30B model trained with TL-GRPO achieves state-of-the-art results.

03. TL-GRPO demonstrates strong generalization and practical utility in scientific optimization.

Abstract

Large language models have demonstrated strong reasoning capabilities in complex tasks through tool integration, which is typically framed as a Markov Decision Process and optimized with trajectory-level RL algorithms such as GRPO. However, a common class of reasoning tasks, iterative optimization, presents distinct challenges: the agent interacts with the same underlying environment state across turns, and the value of a trajectory is determined by the best turn-level reward rather than cumulative returns. Existing GRPO-based methods cannot perform fine-grained, turn-level optimization in such settings, while black-box optimization methods discard prior knowledge and reasoning capabilities. To address this gap, we propose Turn-Level GRPO (TL-GRPO), a lightweight RL algorithm that performs turn-level group sampling for fine-grained optimization. We evaluate TL-GRPO on analog circuit…
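As a rough illustration of the two ideas in the abstract, the sketch below contrasts a best-turn trajectory value with a cumulative return, and applies the standard GRPO-style group normalization to rewards sampled within a single turn. This is not the authors' implementation: the listing does not give the paper's objective or sampling procedure, so the function names, the reward values, and the choice to normalize per turn are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def trajectory_value_best_turn(turn_rewards):
    """Value of a trajectory when only the best turn counts (iterative optimization)."""
    return max(turn_rewards)


def trajectory_value_cumulative(turn_rewards):
    """Value of a trajectory under the usual cumulative-return view."""
    return sum(turn_rewards)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: center and scale rewards within one sampled group.

    Assumption: in a turn-level variant, `rewards` would hold the rewards of
    several candidate responses sampled for the same turn, rather than the
    rewards of several whole trajectories.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a constant group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical per-turn rewards of one trajectory (e.g. circuit quality per attempt).
turn_rewards = [0.2, 0.9, 0.4]
best = trajectory_value_best_turn(turn_rewards)
total = trajectory_value_cumulative(turn_rewards)

# Hypothetical rewards of three candidates sampled at a single turn.
turn_group = [0.1, 0.5, 0.9]
advantages = group_relative_advantages(turn_group)

print("best-turn value:", best)
print("cumulative value:", total)
print("turn-level advantages:", advantages)
```

Under these assumptions, the candidate with the highest reward in its turn gets a positive advantage and the lowest gets a negative one, regardless of how the rest of the trajectory goes. That is the kind of fine-grained credit signal the abstract attributes to turn-level group sampling.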

Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Multimodal Machine Learning Applications