TL;DR
SIOP is a turn-level credit assignment method for long-horizon LLM agents. It semantically clusters the final answers of sampled rollouts and uses those clusters as a training signal, without requiring answer supervision or verifiers.
Contribution
The paper proposes a framework that assigns credit to intermediate turns based on latent outcome states, generalizing information-potential shaping to settings without gold answers or verifiers.
Findings
SIOP outperforms verifier-free outcome baselines on seven reasoning benchmarks.
It approaches the performance of gold-supervised outcome methods.
The method effectively assigns credit without explicit answer supervision.
Abstract
Long-horizon LLM agents depend on intermediate information-gathering turns, yet training feedback is usually observed only at the final answer, because process-level rewards require high-quality human annotation. Existing turn-level shaping methods reward turns that increase the likelihood of a gold answer, but they require answer supervision or stable task-specific verifiers. Conversely, label-free RL methods extract self-signals from output distributions, but mainly at the answer or trajectory level and therefore cannot assign credit to intermediate turns. We propose Self-Induced Outcome Potential (SIOP), which treats semantic clusters of final answers as latent future outcome states for potential-based turn-level credit assignment. For each query, SIOP samples multiple rollouts, clusters final answers into semantic outcome modes, and builds a reliability-aware target distribution…
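The pipeline the abstract describes (sample rollouts, group final answers into outcome modes, then shape per-turn rewards from a potential) can be illustrated with a minimal sketch. This is not the paper's implementation. The function names are hypothetical, exact match after text normalization stands in for semantic clustering, and the plain empirical cluster frequency stands in for the reliability-aware target distribution, whose definition is cut off in the abstract above. The shaping rule is the standard potential-based form, r_t = gamma * phi(s_{t+1}) - phi(s_t), with the per-turn potentials supplied as illustrative inputs.

```python
from collections import Counter


def normalize(answer: str) -> str:
    # Assumption: lowercase + whitespace collapse stands in for semantic matching.
    return " ".join(answer.lower().split())


def cluster_answers(answers):
    """Group final answers into outcome modes and return (keys, frequencies).

    Assumption: the empirical frequency is a placeholder, not the paper's
    reliability-aware target distribution.
    """
    keys = [normalize(a) for a in answers]
    counts = Counter(keys)
    total = len(keys)
    distribution = {key: count / total for key, count in counts.items()}
    return keys, distribution


def shaped_turn_rewards(potentials, gamma=1.0):
    """Standard potential-based shaping: r_t = gamma * phi[t+1] - phi[t].

    `potentials` holds one value per state, so a trajectory with T turns
    needs T + 1 entries.
    """
    return [
        gamma * potentials[t + 1] - potentials[t]
        for t in range(len(potentials) - 1)
    ]


# Final answers from four hypothetical rollouts of the same query.
answers = ["Paris", " paris ", "Lyon", "PARIS"]
keys, dist = cluster_answers(answers)

# Hypothetical potentials for the states of one three-turn rollout.
potentials = [0.2, 0.5, 0.4, 0.9]
rewards = shaped_turn_rewards(potentials)
```

With gamma = 1 the shaped rewards telescope: their sum equals the final potential minus the initial one, so a turn that lowers the potential gets a negative reward that later turns must recover.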
