Echoing: Identity Failures when LLM Agents Talk to Each Other
Sarath Shekkizhar, Romain Cosentino, Adam Earle, Silvio Savarese

TL;DR
This paper investigates a failure mode called echoing in autonomous LLM agent conversations, where agents mirror each other instead of fulfilling roles, and proposes a mitigation protocol to reduce this issue.
Contribution
The study identifies and analyzes echoing as a prevalent failure in LLM agent interactions and introduces a structured response protocol to significantly mitigate it.
Findings
Echoing occurs in up to 70% of conversations across models and domains.
Advanced reasoning models still exhibit 32.8% echoing rates, unaffected by reasoning efforts.
Structured response protocols reduce echoing to 9%.
Abstract
As large language model (LLM) based agents interact autonomously with one another, a new class of failures emerges that cannot be predicted from single agent performance: behavioral drifts in agent-agent conversations (AxA). Unlike human-agent interactions, where humans ground and steer conversations, AxA lacks such stabilizing signals, making these failures unique. We investigate one such failure, echoing, where agents abandon their assigned roles and instead mirror their conversational partners, undermining their intended objectives. Through experiments across AxA configurations, domains (3 transactional, 1 advisory), and conversations (over LLM inferences), we show that echoing occurs across major LLM providers, with echoing rates as high as depending on the model and domain. Moreover, we find that echoing is persistent even in advanced reasoning…
Peer Reviews
Decision·ICLR 2026 Conference Desk Rejected Submission
1. The paper studies the echoing failure mode in agent-to-agent conversations, where LLM agents drift from their assigned identities and start mirroring their conversational partners. 2. Results show that echoing is prevalent (5–70%), persistent in reasoning models (32.8%). They further show that structured response can reduce echoing to 9%. 3. Overall, the paper is written in a clear and easy-to-understand manner, although some details need clarification.
1. Evaluation: The use of GPT-4o as both subject and judge introduces potential circularity and bias. Having an ablation study on different evaluators can further strengthen the model. 2. Evaluation: Many critical details regrading human validation are missing, e.g, the total number of human evaluated data, the review qualification, the agreement rate between humans/annotators, etc. 3. Only one potential mitigation method is presented: The simple structured response can reduce echoing rate fr
1. Positions AxA vs multi-agent systems (MAS) as an interesting contrast where success is not the same as role fidelity. 2. Echoing is practically relevant for AxA deployments. It might also be relevant to simulation benchmarks using LLMs. 3. The results showing that even when reasoning modes are enabled, role drift persists suggest a further direction into analyzing reasoning models. 4. Methodology and design choices are explained in detail. 5. It's an interesting idea to simulate conversation
1. All three domains are variations of customer–seller negotiation. This limits external validity. Consider collaborative planning, tool-use workflows, safety-critical settings, multilingual, or non-negotiation AxA tasks, and evaluate on existing LLM simulation benchmarks to situate results. 2. The phenomenon is largely context-overwriting (early system/role prompts diluted by long histories). 3. System-prompt location differs by provider (e.g., some put it only at the very beginning). If the de
- The paper addresses a timely and important problem in emerging agent–agent (AxA) LLM systems, highlighting identity drift as a realistic and underexplored failure mode. - It provides a comprehensive empirical study across reasoning and non-reasoning models, multiple domains, and major LLM providers, revealing that echoing persists even in advanced models. - It proposes a practical mitigation strategy, namely structured, role-reinforcing responses, that significantly reduces echoing, offering
- The study focuses solely on proprietary, closed-weight models, leaving it unclear whether the findings generalize to open-source or smaller-scale models. - There is no simple baseline where agents are explicitly instructed not to mirror or drift mid-conversation, which would help isolate whether echoing persists even under explicit anti-drift training. - The task evaluation setup and success criteria are not clearly defined, making it difficult to assess whether role drift meaningfully affec
(1) This paper formally defines echoing as a distinct AxA failure, distinguishing it from generic errors or hallucinations as well as from prior MAS settings, and clearly elucidates the AxA framework. (2) This paper presents extensive experiments across multiple LLMs and domains, providing a solid empirical foundation for echoing research. Furthermore, the study analyzes failed cases and extracts relevant principles (prevention, non-intrusion, seamlessness, and architectural integration). (3)
(1) The text in the third section of Figure 1 is too small to be easily read. (2) The three domains used in the experiments are all transactional and relatively structured. It is recommended to add at least one less-structured domain to examine whether echoing is a common AxA phenomenon beyond task-oriented setups.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Language and cultural evolution · Explainable Artificial Intelligence (XAI)
