Zero-Shot Goal-Directed Dialogue via RL on Imagined Conversations
Joey Hong, Sergey Levine, Anca Dragan

TL;DR
This paper introduces a novel RL-based method that uses LLM-generated simulated conversations to train goal-directed dialogue agents, achieving state-of-the-art results in tasks like teaching and preference elicitation.
Contribution
It proposes a new approach combining LLMs and offline RL to improve goal-directed dialogue, leveraging synthetic human-like interactions for training.
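To make the combination of LLM-generated conversations and offline RL concrete, here is a minimal sketch of how one imagined dialogue could be turned into transitions for an offline RL dataset. It is an illustration only, not the authors' implementation: the names `Transition` and `dialogue_to_transitions` are hypothetical, and the sparse reward assigned only at the final agent turn is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Transition:
    state: Tuple[str, ...]       # utterances seen before the agent speaks
    action: str                  # the agent's utterance
    reward: float                # assumed sparse: nonzero only at the last agent turn
    next_state: Tuple[str, ...]  # utterances seen before the agent's next turn
    done: bool                   # True on the agent's final turn


def dialogue_to_transitions(
    turns: List[Tuple[str, str]], final_reward: float
) -> List[Transition]:
    """Convert one dialogue into per-agent-turn transitions.

    `turns` is a list of (speaker, utterance) pairs with speaker in
    {"agent", "human"}. `final_reward` scores the whole conversation
    (hypothetical; for example 1.0 if the goal was reached).
    """
    agent_indices = [i for i, (speaker, _) in enumerate(turns) if speaker == "agent"]
    transitions = []
    for k, i in enumerate(agent_indices):
        is_last = k == len(agent_indices) - 1
        # The next state extends up to (not including) the agent's next turn,
        # or to the end of the dialogue if this is the final agent turn.
        end = len(turns) if is_last else agent_indices[k + 1]
        transitions.append(
            Transition(
                state=tuple(u for _, u in turns[:i]),
                action=turns[i][1],
                reward=final_reward if is_last else 0.0,
                next_state=tuple(u for _, u in turns[:end]),
                done=is_last,
            )
        )
    return transitions


example = [
    ("human", "I want to plan a trip."),
    ("agent", "Do you prefer outdoor or indoor activities?"),
    ("human", "Outdoor."),
    ("agent", "Then a guided canyon hike might suit you."),
]
dataset = dialogue_to_transitions(example, final_reward=1.0)
```

Because the reward arrives only at the end of the conversation in this sketch, a value-based offline RL method has to propagate credit back to earlier turns such as clarifying questions, which is the multi-turn behavior the summary contrasts with single-step training.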
Findings
Achieves state-of-the-art performance in goal-directed dialogue tasks
Uses LLM-generated synthetic data for offline RL training
Outperforms existing methods in teaching and preference elicitation tasks
Abstract
Large language models (LLMs) have emerged as powerful and general solutions to many natural language tasks. However, many of the most important applications of language generation are interactive, where an agent has to talk to a person to reach a desired outcome. For example, a teacher might try to understand their student's current comprehension level to tailor their instruction accordingly, and a travel agent might ask questions of their customer to understand their preferences in order to recommend activities they might enjoy. LLMs trained with supervised fine-tuning or "single-step" RL, as with standard RLHF, might struggle with tasks that require such goal-directed behavior, since they are not trained to optimize for overall conversational outcomes after multiple turns of interaction. In this work, we explore a new method for adapting LLMs with RL for such goal-directed dialogue.…
Peer Reviews
Decision: Submitted to ICLR 2024
1. The experimental analysis is detailed and methodical, and the case is clear and intuitive. 2. The idea of using an LLM to imitate human behavior is interesting.
1. Even though RL can combine parts of behaviors seen from the behavior policies in the data, it is not convincing that RL can take on all of the long-term planning responsibility in goal-oriented conversation tasks. 2. The novelty of this paper is limited. The proposed method can be regarded as a pipeline of LLM generation and offline RL training. 3. All the evaluation methods are human evaluations, which are highly subjective. 4. More relevant works should be compared in the experiments.
1. It shifts the use of LLMs from direct interaction to data generation for optimization by introducing a zero-shot RL algorithm with an "imagination engine" that creates synthetic conversation datasets for training dialogue agents. 2. Compared to traditional approaches, the method optimizes for goal-directed dialogues more effectively since it trains agents on a variety of human-like conversations generated by LLMs that are customized for particular dialogue objectives. 3. The usefulness an
A shortcoming of the work is its reliance on human-generated prompts, which leaves room for further development in automating the training of zero-shot dialogue agents so that it works without task-specific human input.
This paper introduces an approach to generate goal-directed conversations with LLMs and then train a smaller agent to improve over these conversations. The human evaluation results show that the learning agents generate responses that are more useful for helping users complete the tasks and that are less overwhelming.
1. Why not utilize the widely used task-oriented dialogue benchmarks, MultiWOZ and SGD? There are works that leverage LLMs for task-oriented dialogue by training a small model to generate dialogue actions (plans) with RL, guiding LLMs for improved responses [1]. Have you considered comparing with them? 2. From the examples comparing GPT-agent and IE+RL agents, GPT's responses didn't seem significantly inferior. How were the responses scored by the evaluators using the four criteria? Was there co
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
