AgentRefine: Enhancing Agent Generalization through Refinement Tuning
Dayuan Fu, Keqing He, Yejie Wang, Wentao Hong, Zhuoma Gongque, Weihao Zeng, Wei Wang, Jingang Wang, Xunliang Cai, Weiran Xu

TL;DR
This paper introduces AgentRefine, a framework that improves the generalization of LLM-based agents by training models to correct their own mistakes based on observations, yielding better performance across diverse tasks and greater robustness.
Contribution
The paper proposes a self-refinement tuning method for LLM agents that improves their ability to generalize and adapt to new environments beyond the manually built environments seen in training.
Findings
AgentRefine outperforms prior state-of-the-art methods in generalization across diverse tasks.
It demonstrates improved robustness against perturbations.
The approach enables diversified reasoning during inference.
Abstract
Large Language Model (LLM) based agents have proved their ability to perform complex tasks like humans. However, there is still a large gap between open-sourced LLMs and commercial models like the GPT series. In this paper, we focus on improving the agent generalization capabilities of LLMs via instruction tuning. We first observe that the existing agent training corpus exhibits satisfactory results on held-in evaluation sets but fails to generalize to held-out sets. These agent-tuning works face severe formatting errors and are frequently stuck in the same mistake for a long while. We analyze that the poor generalization ability comes from overfitting to several manual agent environments and a lack of adaptation to new situations. They struggle with the wrong action steps and can not learn from the experience but just memorize existing observation-action relations. Inspired by the…
Peer Reviews
Decision · ICLR 2025 Poster
The idea behind the proposed method resembles meta-learning, which trains a policy on diverse tasks so that it adapts quickly to novel ones. This idea makes sense to me and seems new in the agent domain. I appreciate the authors' rethinking of the generalization of agent tuning. The issue of trajectory memorization leading to overfitting seems valid to me. The experiments evaluate the performance of AgentRefine from a wide range of perspectives. The findings establish a correlation between agent generalization and
Overall, AgentRefine is a simple and effective method. However, the main idea is not new: as discussed in the related work, Agent-FLAN and AgentGen have proposed training generalist agents on general data. The idea of refinement is also widely studied, as discussed in the introduction. I encourage the authors to clearly differentiate AgentRefine from these prior works. Highlight unique aspects or improvements over existing methods. Consider incorporating a comparative analysis to demonstrate the advantages
1. The paper is well-organized and easy to follow, with a clear progression from motivation to methodology. 2. The identification of the generalization gap in existing LLM-based agents and the proposal of a self-refinement approach to address it is a rational step forward in the field.
1. The problem of generalization in LLM-based agents has been extensively discussed in previous literature, making the contribution of this work less novel. For example, [1] investigates the robustness of accuracy measurements in large language models (LLMs) when the order of answer labels is shuffled, using the MMLU dataset as a testbed. 2. The methodology, while intuitive, lacks significant innovation, as the approach of enhancing generalization through data synthesis is not new [2]. 3. The
1. This paper discusses the generalization ability of agents, which is a very important topic for the community. 2. The authors provide quantitative analysis to explain their insight, which is very convincing. 3. Synthesizing data with almost no task-specific information is a very practical setting, and the improvement of generalization ability in this paper is impressive.
1. The presentation of this paper should be improved and some grammar mistakes should be fixed. 2. Some important baselines, for example Reflexion [1], are missing and should be included. 3. The experiments consider only decision-making tasks. However, since the paper makes claims about generalization ability, tasks of other types, for example reasoning tasks, should also be included. [1] Shinn, Noah, et al. "Reflexion: Language agents with verbal reinforcement learning." NeurIPS, 2023.
Taxonomy
Topics: Multi-Agent Systems and Negotiation
Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Weight Decay · Multi-Head Attention · Discriminative Fine-Tuning · Layer Normalization · Byte Pair Encoding · Linear Warmup With Cosine Annealing
