Reinforced Language Models for Sequential Decision Making

Jim Dilkes; Vahid Yazdanpanah; Sebastian Stein

arXiv:2508.10839·cs.CL·August 15, 2025

TL;DR

This paper introduces MS-GRPO, a post-training algorithm that improves the sequential decision-making of small LLMs. A post-trained 3B model outperforms a 72B baseline on the Frozen Lake task.

Contribution

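The paper proposes MS-GRPO, a post-training method for LLM agents grounded in the formal Text-Mediated Stochastic Game (TSMG) and Language-Agent Policy (LAP) frameworks. It attributes the full cumulative episode reward to every step of the episode and adds an absolute-advantage-weighted episode sampling strategy.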

Findings

01. A post-trained 3B model outperforms a 72B baseline by 50% on Frozen Lake.

02. MS-GRPO improves the decision-making performance of small LLMs.

03. Targeted post-training of small models can rival larger models on sequential tasks.

Abstract

Large Language Models (LLMs) show potential as sequential decision-making agents, but their application is often limited due to a reliance on large, computationally expensive models. This creates a need to improve smaller models, yet existing post-training methods are designed for single-turn interactions and cannot handle credit assignment in multi-step agentic tasks. To address this, we introduce Multi-Step Group-Relative Policy Optimization (MS-GRPO), a new algorithm for post-training LLM agents, grounded in formal Text-Mediated Stochastic Game (TSMG) and Language-Agent Policy (LAP) frameworks. For credit assignment, MS-GRPO attributes the entire cumulative episode reward to each individual episode step. We supplement this algorithm with a novel absolute-advantage-weighted episode sampling strategy that we show improves training performance. We evaluate our approach by post-training…
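
The credit-assignment and sampling ideas described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract only states that the cumulative episode reward is attributed to every step and that episodes are sampled with absolute-advantage weighting. The mean/standard-deviation normalisation (in the style of GRPO), sampling with replacement, the uniform fallback when all advantages are zero, and every function name below are assumptions made for illustration.

```python
import random
from statistics import mean, pstdev


def episode_returns(episodes):
    """Sum the per-step rewards of each episode (one list of rewards per episode)."""
    return [sum(rewards) for rewards in episodes]


def group_relative_advantages(returns, eps=1e-8):
    """Normalise each return against the group mean and std (assumed GRPO-style)."""
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


def per_step_advantages(episodes, advantages):
    """Attribute the same episode-level advantage to every step of that episode."""
    return [[adv] * len(rewards) for rewards, adv in zip(episodes, advantages)]


def sample_episodes(advantages, k, rng):
    """Draw k episode indices with probability proportional to |advantage| (assumed)."""
    weights = [abs(a) for a in advantages]
    if sum(weights) == 0:
        # Hypothetical fallback: all episodes are equally uninformative.
        return [rng.randrange(len(advantages)) for _ in range(k)]
    return rng.choices(range(len(advantages)), weights=weights, k=k)


# Toy group of four episodes with sparse rewards (illustrative data only).
group = [[0, 0, 1], [0, 0], [0, 0, 0, 0], [0, 1]]
returns = episode_returns(group)              # [1, 0, 0, 1]
advantages = group_relative_advantages(returns)
step_advantages = per_step_advantages(group, advantages)
chosen = sample_episodes(advantages, k=2, rng=random.Random(0))
```

In this sketch, every step inherits its episode's advantage regardless of when the reward arrived, and episodes whose return is far from the group mean (in either direction) are more likely to be drawn for an update.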
