Alita-G: Self-Evolving Generative Agent for Agent Generation

Jiahao Qiu; Xuan Qi; Hongru Wang; Xinzhe Juan; Yimin Wang; Zelin Zhao; Jiayi Geng; Jiacheng Guo; Peihang Li; Jingzhe Shi; Shilong Liu; Mengdi Wang

arXiv:2510.23601 · cs.AI · October 28, 2025

TL;DR

ALITA-G is a self-evolving framework that transforms general-purpose language agents into domain experts by systematically generating and curating tools, leading to improved accuracy and efficiency on complex reasoning benchmarks.

Contribution

This paper introduces ALITA-G, a novel self-evolution framework that automatically creates domain-specific tools from generalist agents, enhancing their specialization and performance.

Findings

1. Achieves state-of-the-art results on the GAIA benchmark with 83.03% pass@1.
2. Reduces computation costs by approximately 15%.
3. Improves accuracy and efficiency on complex reasoning tasks.

Abstract

Large language models (LLMs) have been shown to perform better when scaffolded into agents with memory, tools, and feedback. Beyond this, self-evolving agents have emerged, but current work largely limits adaptation to prompt rewriting or failure retries. Therefore, we present ALITA-G, a self-evolution framework that transforms a general-purpose agent into a domain expert by systematically generating, abstracting, and curating Model Context Protocol (MCP) tools. In this framework, a generalist agent executes a curated suite of target-domain tasks and synthesizes candidate MCPs from successful trajectories. These are then abstracted to parameterized primitives and consolidated into an MCP Box. At inference time, ALITA-G performs retrieval-augmented MCP selection with the help of each tool's descriptions and use cases, before executing an agent equipped with the MCP Executor. Across…
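
The retrieval-augmented MCP selection step described in the abstract can be illustrated with a minimal sketch. The abstract only states that selection uses each tool's description and use case; everything else here is an assumption made for illustration: the `MCPEntry` record, the `select_mcps` function, the `top_k` and `threshold` parameters, and the bag-of-words cosine similarity, which stands in for whatever retrieval model the authors actually use.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class MCPEntry:
    """Hypothetical record for one tool in an MCP Box (not the paper's schema)."""
    name: str
    description: str
    use_case: str


def _bag_of_words(text: str) -> Counter:
    # Lowercase alphanumeric tokens with counts; a stand-in for a real embedding.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two token-count vectors; 0.0 if either is empty.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_mcps(task: str, box: list[MCPEntry], top_k: int = 2,
                threshold: float = 0.0) -> list[MCPEntry]:
    """Return up to top_k entries whose description + use case best match the task.

    top_k and threshold are assumed knobs, not values taken from the paper.
    """
    query = _bag_of_words(task)
    scored = [
        (_cosine(query, _bag_of_words(f"{e.description} {e.use_case}")), e)
        for e in box
    ]
    # Sort by score only, so entries themselves are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for score, entry in scored[:top_k] if score > threshold]
```

Under these assumptions, the agent would call `select_mcps(task, box)` before execution and expose only the returned tools to the executor, so tools with no lexical overlap with the task are filtered out.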

Peer Reviews

Decision · ICLR 2026 Conference Desk Rejected Submission

Reviewer 01 · Rating 6 · Confidence 3

Strengths

Allowing LLMs to create their own tools is an important step toward more autonomous, specialized agents, and this paper is a step in that direction. The benchmark results show consistent gains.

Weaknesses

I have some doubts about the setup (see below), in particular about whether the current setting avoids leaking past in-domain information into the test set.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The manuscript is well-motivated and well-written. The paper tackles a problem of significant importance. The ability to self-evolve is important for developing powerful LLM-based agent systems. Moreover, the proposed pipeline is well-presented and can be easily understood.
2. The authors validate their framework across three diverse and challenging benchmarks. Moreover, the detailed analysis of MCP Box scalability in Section 4.1 provides a nuanced investigation into the diminishing returns

Weaknesses

1. The paper claims a "new state-of-the-art result" on the GAIA benchmark. However, this result is reported on the GAIA validation set. State-of-the-art claims for major benchmarks like GAIA are adjudicated on the private test set via official leaderboards to ensure fair, robust, and non-overfit comparisons.
2. The framework's success hinges on a powerful "master agent" composed of top-tier proprietary models (Claude-Sonnet and GPT). It remains unclear whether this approach is viable with less

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. This paper proposes generating domain-relevant MCPs, which could benefit future task solving. The idea has potential as an effective adaptation approach for generalizing pre-trained LLMs.
2. The paper is, in general, well-written.

Weaknesses

1. The evaluation is conducted on a relatively small subset of the original benchmark without repeated experiments. There may be bias and variance in the results.
2. The authors use part of the benchmark data to tune hyperparameters. It would be better to develop a more principled and generalizable way to determine them if they have a large influence.
3. It is unclear which data the authors used in the process to generate MCPs, and whether the MCPs across the 3 benchmarks are the same. I am also curious ab
