Sparse Rewards Can Self-Train Dialogue Agents

Barrett Martin Lattimer; Varun Gangal; Ryan McDonald; Yi Yang

arXiv:2409.04617·cs.CL·July 21, 2025

TL;DR

This paper introduces JOSH, a self-training method for LLM dialogue agents that improves performance using sparse reward simulation without human feedback, enhancing tool interaction capabilities.

Contribution

The paper presents JOSH, a novel self-alignment algorithm enabling LLMs to self-improve in dialogue tasks using sparse rewards, reducing reliance on human feedback.

Findings

1. Models trained with JOSH show significant improvement in tool-based interactions.
2. JOSH preserves general capabilities across diverse benchmarks.
3. The approach reduces the need for costly human feedback in LLM training.

Abstract

Recent advancements in state-of-the-art (SOTA) Large Language Model (LLM) agents, especially in multi-turn dialogue tasks, have been primarily driven by supervised fine-tuning and high-quality human feedback. However, as base LLM models continue to improve, acquiring meaningful human feedback has become increasingly challenging and costly. In certain domains, base LLM agents may eventually exceed human capabilities, making traditional feedback-driven methods impractical. In this paper, we introduce a novel self-improvement paradigm that empowers LLM agents to autonomously enhance their performance without external human feedback. Our method, Juxtaposed Outcomes for Simulation Harvesting (JOSH), is a self-alignment algorithm that leverages a sparse reward simulation environment to extract ideal behaviors and further train the LLM on its own outputs. We present ToolWOZ, a sparse reward…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

* This paper presents a novel approach to self-alignment in dialogue agents using sparse rewards, reducing reliance on costly human feedback.
* ToolWOZ fills a gap in existing evaluation frameworks by focusing on tool usage in multi-turn dialogue settings, adapting MultiWOZ to emphasize real-world API interactions.
* JOSH demonstrates significant improvements in success rates and tool-call accuracy, particularly for smaller models, validating its effectiveness.

Weaknesses

* The paper does not assess how well the user simulator aligns with real human interactions.
* The evaluation of API calls lacks depth, as it does not separate analyses of API names and parameters.
* The design of the average reward function is not thoroughly examined, missing a discussion of alternative reward structures and their potential effects on agent behavior.
* The related work section does not cover relevant advancements in language agents for multi-turn dialogues.

Reviewer 02 · Rating 3 · Confidence 4

Strengths

1. The JOSH approach is a new solution for self-training dialogue agents, effectively utilizing sparse rewards to build a self-improvement feedback loop without external human evaluation.
2. By adapting MultiWOZ into ToolWOZ with a sparse reward structure, the paper provides a valuable benchmark tailored for tool-using task-oriented dialogue systems, which can benefit further research.
3. Results indicate that JOSH significantly improves models across benchmarks, demonstrating its potential as a

Weaknesses

1. The concept of the "goal set" in sparse rewards is insufficiently defined, particularly how it influences the agent’s behavior and the implications of duplicating actions in a path.
2. The choice to branch at the turn level rather than the agent action level lacks a comprehensive rationale, leaving questions about its impact on computational efficiency and performance outcomes. In the MultiWOZ dataset, the agent predicts a dialogue act in each turn. The delexicalized response is then generated. The

Reviewer 03 · Rating 6 · Confidence 2

Strengths

* They propose both a novel method and a benchmark, but they also make sure to evaluate on an existing benchmark to enable more robust comparisons.
* They conduct good analysis to demonstrate the robustness and viability of their benchmark.
* Their method demonstrates good performance gains on the tasks they study.
* The paper is overall well written and easy to follow.

Weaknesses

* Their method feels a little ad hoc. Yes, it makes sense to build off-policy preference pairs for training these models, but there are numerous ways this could be achieved, and it's unclear why the specific methodological decisions made in this paper are the correct ones.
* They compare to an SFT baseline, but not to other RL-inspired approaches for fine-tuning agents, so it is unclear how well their approach compares against stronger baselines.

Code & Models

Repositories

asappresearch/josh-llm-simulation-training (Official)


Taxonomy

Topics: Speech and dialogue systems · Multi-Agent Systems and Negotiation · Topic Modeling

Methods: Balanced Selection