Co-Evolving Agents: Learning from Failures as Hard Negatives
Yeonsung Jung, Trilok Padhi, Sina Shaham, Dipika Khullar, Joonhyun Jeong, Ninareh Mehrabi, Eunho Yang

TL;DR
This paper introduces a co-evolving agents framework where a failure-learning agent generates hard negatives to improve a target agent's decision boundaries, leading to better generalization and performance.
Contribution
The paper proposes a novel co-evolving agents approach that leverages failure trajectories as structured learning signals to enhance self-improving agents.
Findings
Improved performance across benchmark datasets.
Failure trajectories serve as valuable hard negatives.
Enhanced decision boundary sharpness and generalization.
Abstract
The rapid progress of large foundation models has accelerated the development of task-specialized agents across diverse domains. However, the effectiveness of agents remains tightly coupled with the quality of training data, while curating task-specific datasets remains costly and often infeasible in real-world scenarios. Recent work has explored self-improving agents that autonomously generate, refine, and re-train on their own trajectories. A prominent line of approaches further leverages preference optimization by pairing predicted trajectories with scarce ground-truth trajectories, enabling agents to learn directly from their own failures. While these methods outperform supervised fine-tuning, their heavy reliance on predicted trajectories under limited ground-truth supervision leaves them prone to overfitting. To address this, we propose a co-evolving agents framework in which a…
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper is clearly written and provides a strong motivation for converting interaction data into structured supervision using failure trajectories to learn strong decision boundaries. The co-evolution learning process using hard negatives that are near-success failures promotes sharper decision boundaries and finer discrimination. 2. The paper includes an extensive evaluation across different tasks. The comparisons with baseline methods demonstrate the impressive performance of the propose
1. Hard negatives: It is unclear what "hard negative" means quantitatively. Hard negatives are defined as trajectories that are closer to success but still unsuccessful. However, it is unclear how "close to success" is defined quantitatively: is it based on a reward threshold? 2. One of the claims of the paper is that the limited number of expert trajectories results in overfitting. However, no information about the number of successful trajectories versus the generated hard-negative trajectories is provided to
Clear, intuitive idea: Elevating “failures” into structured supervision via failure-vs-failure preferences is conceptually neat and practically motivated. Co-evolution design: Alternating updates between a failure generator and a target learner is a simple mechanism to keep training signals challenging and fresh. Analyses beyond toplines: The paper examines failure quality/quantity and includes ablations (e.g., replacing the failure agent with a standard “positive” agent), which supports the c
Limited Novelty in Core Idea: The framework builds heavily on existing methods like ETO (Exploration-Based Trajectory Optimization) and DPO, primarily adding a "failure agent" for hard negatives. While this is an incremental improvement, it may not be sufficiently novel for ICLR, as similar concepts (e.g., negative agents in multi-agent systems or hard negatives in contrastive learning) are referenced in related work but not deeply differentiated. Baselines are not strong enough as configured.
1. The idea of a failure agent that co-evolves with the main model is novel and well-motivated. 2. The theoretical setup (POMDP formalization, DPO-based optimization) is rigorous, and the experimental evidence convincingly supports the claims. 3. The approach directly addresses a key limitation of current self-improving LLM agents—overfitting to limited expert data—by leveraging a sustainable, self-generated source of supervision. Results across multiple domains, along with both quantitative a
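For readers unfamiliar with the DPO objective the reviews refer to, the standard per-pair DPO loss can be sketched as below. This is the generic published form, not code from the paper: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed log-probabilities of whole trajectories under the policy and a frozen reference model.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) trajectory pair.

    All four inputs are assumed to be summed log-probabilities of a full
    trajectory. beta is a hypothetical default, not a value from the paper.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # trajectory than the reference model does, relative to the rejected one.
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model, the margin is zero and the loss is log 2; the loss falls as the policy shifts probability mass toward the chosen trajectory.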
1. The method relies on a multiple-stage process (SFT → failure-agent DPO on failure–failure pairs → target-agent DPO+SFT on mixed pairs) executed in alternating iterations, but Figure 1 does not illustrate this flow or the data construction steps, making Section 4 harder to follow at a glance. 2. The most significant weakness is in Section 5.2.1. The text states it will provide a qualitative comparison of failure trajectories, describing the ETO baseline's "degenerate failure" and contrasting
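The data construction that this review says Figure 1 omits might look roughly like the sketch below. It is a reading of the review's one-line description only, under loud assumptions: the `Trajectory` type, the scalar `reward` field, the `success_reward` and `min_gap` parameters, and the rule that the higher-reward failure is the "chosen" side of a failure–failure pair are all hypothetical, since the excerpt does not say how failures are ranked.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Trajectory:
    task_id: str
    actions: tuple
    reward: float  # ASSUMPTION: scalar task reward, success_reward means success


def failure_failure_pairs(trajectories, success_reward=1.0, min_gap=0.0):
    """Build (chosen, rejected) pairs where both sides are failures.

    ASSUMPTION: within a task, the failure with the higher reward is treated
    as preferred; pairs whose reward gap is not above min_gap are skipped.
    """
    failures_by_task = {}
    for traj in trajectories:
        if traj.reward < success_reward:
            failures_by_task.setdefault(traj.task_id, []).append(traj)

    pairs = []
    for failures in failures_by_task.values():
        for a, b in combinations(failures, 2):
            if abs(a.reward - b.reward) > min_gap:
                chosen, rejected = (a, b) if a.reward > b.reward else (b, a)
                pairs.append((chosen, rejected))
    return pairs


def mixed_pairs(expert_by_task, target_failures, hard_negatives):
    """Pair each expert trajectory (chosen) with failures on the same task.

    Rejected samples come from both the target agent's own failures and the
    failure agent's hard negatives, matching the "mixed pairs" wording.
    """
    pairs = []
    for traj in list(target_failures) + list(hard_negatives):
        expert = expert_by_task.get(traj.task_id)
        if expert is not None:
            pairs.append((expert, traj))
    return pairs
```

In an alternating scheme, `failure_failure_pairs` would feed the failure agent's preference update and `mixed_pairs` the target agent's update in each iteration.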
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
