Eliciting Behaviors in Multi-Turn Conversations
Jing Huang, Shujian Zhang, Lun Wang, Andrew Hard, Rajiv Mathews, John Lambert

TL;DR
This paper studies methods for eliciting specific behaviors from large language models in multi-turn conversations. It proposes a unified framework for existing methods and shows that online approaches reach higher success rates with fewer queries.
Contribution
It introduces a generalized multi-turn formulation of online behavior elicitation methods and evaluates their efficiency in multi-turn conversation settings.
Findings
Online methods achieve higher success rates with fewer queries.
Static methods find few or no failure cases in multi-turn benchmarks.
Unified framework categorizes existing elicitation techniques.
Abstract
Identifying specific and often complex behaviors from large language models (LLMs) in conversational settings is crucial for their evaluation. Recent work proposes novel techniques to find natural language prompts that induce specific behaviors from a target model, yet they are mainly studied in single-turn settings. In this work, we study behavior elicitation in the context of multi-turn conversations. We first offer an analytical framework that categorizes existing methods into three families based on their interactions with the target model: those that use only prior knowledge, those that use offline interactions, and those that learn from online interactions. We then introduce a generalized multi-turn formulation of the online method, unifying single-turn and multi-turn elicitation. We evaluate all three families of methods on automatically generating multi-turn test cases. We…
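To make the "online interactions" family concrete, here is a minimal toy sketch of an elicitation loop that queries a target model turn by turn and reinforces user turns that led to the target behavior. It is not the paper's method: the reweighting rule is a simple stand-in for a learned policy, and `toy_target`, `toy_judge`, `elicit_online`, and the candidate prompts are all hypothetical names invented for illustration. The toy target only fails when the user pushes back after an earlier exchange, so the failure cannot be reached in a single turn.

```python
import random
from typing import Dict, List, Tuple

Conversation = List[str]  # alternating user / target turns, user first


def toy_target(conversation: Conversation) -> str:
    """Hypothetical target model: it caves only when the user pushes back
    after at least one earlier exchange (a multi-turn-only failure)."""
    last_user = conversation[-1]
    if "mistake" in last_user and len(conversation) > 1:
        return "You are right, I was wrong."
    return "I stand by my answer."


def toy_judge(conversation: Conversation) -> float:
    """Hypothetical rubric: 1.0 if the final target reply shows the behavior."""
    return 1.0 if "I was wrong" in conversation[-1] else 0.0


def elicit_online(
    candidates: List[str], n_turns: int, n_episodes: int, seed: int = 0
) -> Tuple[float, int, Dict[str, float]]:
    """Sample user turns, query the target, and upweight turns from
    successful conversations. Returns (success rate, queries, weights)."""
    rng = random.Random(seed)
    weights = {c: 1.0 for c in candidates}
    queries = 0
    successes = 0
    for _ in range(n_episodes):
        conversation: Conversation = []
        used: List[str] = []
        for _ in range(n_turns):
            probs = [weights[c] for c in candidates]
            user_turn = rng.choices(candidates, weights=probs)[0]
            conversation.append(user_turn)
            conversation.append(toy_target(conversation))
            queries += 1  # one query to the target per user turn
            used.append(user_turn)
        reward = toy_judge(conversation)
        successes += int(reward)
        for turn in used:  # crude credit assignment: reward every turn used
            weights[turn] += reward
    return successes / n_episodes, queries, weights


if __name__ == "__main__":
    prompts = ["Can you explain more?", "I think you made a mistake.", "Thanks!"]
    print(elicit_online(prompts, n_turns=2, n_episodes=200))
```

Running the same loop with `n_turns=1` never succeeds against this toy target, which illustrates why a single-turn search can miss failures that only appear later in a conversation.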
Peer Reviews
Decision · Submitted to ICLR 2026
1) I think the authors are tackling an interesting problem domain and have motivated it well. They also took a principled approach when proposing their method. I also think adapting their method to generate a strategy before generating a response was a good idea. 2) I think the baselines compared against were comprehensive.
1) The authors mention that they discovered new failure cases not covered in the single-turn setting, but it is not clear what those failure cases are. One of the reported patterns is that the model is more likely to fail if the prompt says "you made a mistake". Was this phrase not found by the baseline methods? It seems like an obvious addition to the prompt. Overall, I think the analysis is a little underspecified.
The study is timely, and the method choices are sensible in general. It addresses an important and growing problem: static multi-turn benchmarks saturate on new models. The taxonomy of interaction regimes is well-motivated, which helps organize a fragmented literature. The empirical study spans three tasks of different natures and includes useful ablations. There is some originality in applying online RL methods to multi-turn conversations.
1. Equations 4 and 5 seem to be misleading. Section 4.3 is the heart of the paper, but it is extremely strange. In Equation 4: - What is $X$? There is no set of queries there. $x$ should be sampled from the policy, $D_{online}$. - How is it even related to GRPO? Why do we have something strange instead of the KL term? Instead, it seems to be some kind of reward-regression MSE loss. - What is $D_{online}(M_t(x) | x)$? $D$ is the policy; it should be the other way around: $D_{online}(
1. Provides a detailed analysis of existing behavior elicitation methods and their effects. 2. Proposes EMBER, a multi-turn behavior elicitation method based on online reinforcement learning (RL). 3. Addresses the saturation problem of static benchmarks and demonstrates that EMBER can elicit target behaviors with a much higher success rate than static testing. This enables more efficient discovery and analysis of failure cases. 4. Achieves effective behavior elicitation with higher query efficie
1. This study only uses two target models, both of relatively small size. 2. Since the method requires interaction with the target models, there is an associated cost burden. 3. The online method is only implemented with Qwen3-4B. It's unclear whether any model ablation was conducted to evaluate robustness across different model architectures.
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques
