STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning
Junjie Zhang, Guozheng Ma, Shunyu Liu, Zetian Hu, Yongcheng Jing, Ting-En Lin, Yongbin Li, Dacheng Tao

TL;DR
STRIDE introduces a novel training framework for LLM reasoning that uses learnable stepwise language feedback to improve reasoning trajectories without external annotations.
Contribution
It proposes a scalable, language-driven trajectory redirection method that enhances LLM reasoning by jointly training a generator and verifier with outcome-based rewards.
Findings
STRIDE outperforms state-of-the-art baselines on reasoning benchmarks.
It achieves breakthroughs on zero-pass-rate problems without scalar feedback.
It demonstrates effective policy improvement even with noisy verifier feedback.
Abstract
Recent advances in Reinforcement Learning (RL) have underscored its potential for incentivizing the reasoning capabilities of Large Language Models (LLMs). However, existing step-level efforts suffer from costly annotations that limit domain coverage, while scalar scores further impose an information bottleneck, offering insufficient semantic bandwidth to improve intermediate decisions. Alternative language-critique approaches, which rely on frozen or external critics, provide richer textual feedback but lack the scalability needed for sustained policy improvement. In this work, we propose language-driven stepwise trajectory redirection, termed STRIDE, a novel training framework that shifts process supervision from scalar rewards to learnable stepwise language feedback. Specifically, we co-train a generator and a generative verifier using only outcome-based rewards, eliminating external…
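As a rough illustration of the idea in the abstract, the toy sketch below shows one way a stepwise language-feedback loop could look: a generator proposes a step, a verifier returns a textual note, the generator revises the step if the note is non-empty, and a single outcome-based reward is computed from the final answer. This is not the authors' implementation. Every name here (`Rollout`, `redirected_rollout`, `toy_generator`, `toy_verifier`) is hypothetical, the generator and verifier are hard-coded stand-ins for LLMs, and the excerpt does not state how the reward is actually assigned or how the policy update is performed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Rollout:
    steps: List[str]      # accepted reasoning steps
    feedback: List[str]   # verifier note for each step ("" means no objection)
    reward: float         # outcome-based reward for the whole trajectory


def outcome_reward(answer: str, reference: str) -> float:
    # Assumption: binary reward from exact match on the final answer only.
    return 1.0 if answer.strip() == reference.strip() else 0.0


def redirected_rollout(
    propose: Callable[[str, List[str], Optional[str]], str],
    critique: Callable[[str, List[str], str], str],
    question: str,
    reference: str,
    max_steps: int,
) -> Rollout:
    steps: List[str] = []
    feedback: List[str] = []
    for _ in range(max_steps):
        step = propose(question, steps, None)
        note = critique(question, steps, step)
        if note:
            # Redirect: regenerate the step conditioned on the textual feedback.
            step = propose(question, steps, note)
        steps.append(step)
        feedback.append(note)
    answer = steps[-1].split("=")[-1]
    # In a co-training setup, this one scalar could be credited to both models.
    return Rollout(steps, feedback, outcome_reward(answer, reference))


# Hard-coded stand-ins for the two models (hypothetical, for illustration only).
def toy_generator(question: str, steps: List[str], note: Optional[str]) -> str:
    if not steps:
        return "3*4=12" if note else "2+3=5"
    return "2+12=14"


def toy_verifier(question: str, steps: List[str], step: str) -> str:
    return "Multiply before adding." if step == "2+3=5" else ""


rollout = redirected_rollout(toy_generator, toy_verifier, "2+3*4", "14", max_steps=2)
print(rollout)
```

In this toy run the first proposed step is wrong, the verifier's note redirects it, and the trajectory ends with a correct answer. The only supervision used is the final-answer check, with no per-step labels.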
