TL;DR
ALITA-G is a self-evolving framework that transforms general-purpose language agents into domain experts by systematically generating and curating tools, leading to improved accuracy and efficiency on complex reasoning benchmarks.
Contribution
This paper introduces ALITA-G, a novel self-evolution framework that automatically creates domain-specific tools from generalist agents, enhancing their specialization and performance.
Findings
Achieves state-of-the-art results on GAIA benchmark with 83.03% pass@1.
Reduces computation costs by approximately 15%.
Improves accuracy and efficiency on complex reasoning tasks.
Abstract
Large language models (LLMs) have been shown to perform better when scaffolded into agents with memory, tools, and feedback. Beyond this, self-evolving agents have emerged, but current work largely limits adaptation to prompt rewriting or failure retries. Therefore, we present ALITA-G, a self-evolution framework that transforms a general-purpose agent into a domain expert by systematically generating, abstracting, and curating Model Context Protocol (MCP) tools. In this framework, a generalist agent executes a curated suite of target-domain tasks and synthesizes candidate MCPs from successful trajectories. These are then abstracted to parameterized primitives and consolidated into an MCP Box. At inference time, ALITA-G performs retrieval-augmented MCP selection with the help of each tool's descriptions and use cases, before executing an agent equipped with the MCP Executor. Across…
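The retrieval-augmented MCP selection step described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `MCPEntry` record, the `select_mcps` function, the `top_k` and `threshold` parameters, and the bag-of-words cosine similarity, which stands in for whatever embedding model the authors actually use. The sketch only shows the general idea of scoring each tool's description and use case against the task and keeping the best matches.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class MCPEntry:
    """One tool in a hypothetical MCP Box: a name plus the text used for retrieval."""
    name: str
    description: str
    use_case: str


def _bag_of_words(text: str) -> Counter:
    # Lowercased word counts; a toy stand-in for a real text embedding.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_mcps(task: str, box: list[MCPEntry], top_k: int = 2,
                threshold: float = 0.2) -> list[str]:
    """Return names of up to top_k tools whose description and use case
    score at least `threshold` against the task text (assumed policy)."""
    query = _bag_of_words(task)
    scored = [
        (_cosine(query, _bag_of_words(f"{e.description} {e.use_case}")), e.name)
        for e in box
    ]
    kept = sorted((p for p in scored if p[0] >= threshold),
                  key=lambda p: p[0], reverse=True)
    return [name for _, name in kept[:top_k]]


# Hypothetical MCP Box contents, invented for this example.
box = [
    MCPEntry("pdf_table_extractor", "Extract tables from a PDF file",
             "Read numeric tables in a PDF report"),
    MCPEntry("youtube_transcript", "Fetch the transcript of a YouTube video",
             "Answer questions about video content"),
    MCPEntry("unit_converter", "Convert between physical units",
             "Convert miles to kilometers"),
]

selected = select_mcps("Extract the revenue table from this PDF report", box)
print(selected)
```

In this toy setup only the PDF tool clears the threshold, so the agent would be equipped with that single tool. A threshold plus a cap on the number of tools is one plausible way to keep irrelevant tools out of the agent's context; the paper's actual selection rule may differ.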
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
Allowing LLMs to create their own tools is an important step toward more autonomous, specialized agents, and this paper is a step in that direction. The benchmark results show consistent gains.
I have some doubts about the set-up (see below), in particular whether the current setting leaks past in-domain information into the test set.
1. The manuscript is well-motivated and well-written. The paper tackles a problem of significant importance. The ability to self-evolve is important for developing powerful LLM-based agent systems. Moreover, the proposed pipeline is well-presented and can be easily understood.
2. The authors validate their framework across three diverse and challenging benchmarks. Moreover, the detailed analysis of MCP Box scalability in Section 4.1 provides a nuanced investigation into the diminishing returns
1. The paper claims a "new state-of-the-art result" on the GAIA benchmark. However, this result is reported on the GAIA validation set. State-of-the-art claims for major benchmarks like GAIA are adjudicated on the private test set via official leaderboards to ensure fair, robust, and non-overfit comparisons.
2. The framework's success hinges on a powerful "master agent" composed of top-tier proprietary models (Claude-Sonnet and GPT). It remains unclear whether this approach is viable with less
1. This paper proposes generating domain-relevant MCPs, which could benefit future task solving. The idea has potential as an effective adaptation approach for generalizing pre-trained LLMs.
2. The paper is in general well-written.
1. The evaluation is conducted on a relatively small subset of the original benchmark without repeated experiments. There may be bias and variance in the results.
2. The authors use part of the benchmark data to tune hyperparameters. It would be better to develop a more principled and generalizable way to determine them if they have a large influence.
3. It is unclear which data the authors used in the process to generate MCPs, and whether the MCPs across the 3 benchmarks are the same. I am also curious ab
