Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution
Jiahao Qiu, Xuan Qi, Tongcheng Zhang, Xinzhe Juan, Jiacheng Guo, Yifu Lu, Yimin Wang, Zixin Yao, Qihan Ren, Xun Jiang, Xing Zhou, Dongrui Liu, Ling Yang, Yue Wu, Kaixuan Huang, Shilong Liu, Hongru Wang, Mengdi Wang

TL;DR
Alita is a scalable generalist agent that relies on minimal predefined components and self-evolution to perform complex tasks, achieving top accuracy on multiple benchmarks.
Contribution
The paper introduces Alita, a novel agent design emphasizing simplicity and self-evolution, enabling scalable reasoning without extensive manual tool configuration.
Findings
Achieves 75.15% pass@1 on the GAIA benchmark
Outperforms complex systems on MathVista and PathVQA
Demonstrates effective autonomous self-evolution capabilities
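For readers unfamiliar with the metric, pass@1 is the share of tasks answered correctly on the first attempt. The sketch below only illustrates that definition; the function name and the sample outcomes are hypothetical and are not taken from the paper or its evaluation code.

```python
def pass_at_1(first_attempt_correct):
    """Return the percentage of tasks solved on the first attempt.

    `first_attempt_correct` holds one boolean per task: True if the
    agent's first answer was judged correct, False otherwise.
    """
    if not first_attempt_correct:
        raise ValueError("need at least one task outcome")
    solved = sum(1 for ok in first_attempt_correct if ok)
    return 100.0 * solved / len(first_attempt_correct)


# Hypothetical outcomes for four tasks (illustrative, not data from the paper).
outcomes = [True, False, True, True]
print(f"pass@1 = {pass_at_1(outcomes):.2f}%")  # prints "pass@1 = 75.00%"
```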
Abstract
Recent advances in large language models (LLMs) have enabled agents to autonomously perform complex, open-ended tasks. However, many existing frameworks depend heavily on manually predefined tools and workflows, which hinder their adaptability, scalability, and generalization across domains. In this work, we introduce Alita--a generalist agent designed with the principle of "Simplicity is the ultimate sophistication," enabling scalable agentic reasoning through minimal predefinition and maximal self-evolution. For minimal predefinition, Alita is equipped with only one component for direct problem-solving, making it much simpler and neater than previous approaches that relied heavily on hand-crafted, elaborate tools and workflows. This clean design enhances its potential to generalize to challenging questions, without being limited by tools. For maximal self-evolution, we enable the…
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
S1. The core design principle of "Minimal Predefinition and Maximal Self-Evolution" is an elegant and original contribution that addresses the reliance on manually defined tools in agent development.
S2. Alita achieves strong performance, outperforming several baselines on the GAIA benchmark and also showing good results on MathVista and PathVQA.
S3. The auto-generated MCPs are not just useful for Alita but can be exported to improve other agents and distill reasoning capabilities from large LLM
W1. The framework relies heavily on the coding and reasoning abilities of top-tier LLMs (Claude-3.7-Sonnet and GPT-4o). The results in Table 4 show that when a smaller model (GPT-4o-mini) is used to generate MCPs, the performance drops drastically. This suggests the "minimal predefinition" approach is not yet practical without access to powerful models (especially powerful coding models).
W2. The paper does not provide an analysis of the cost of MCP creation. How many tokens, how much wall-c
1. Simple design and flexibility. The framework only uses a small set of tools, avoiding labour-intensive tool definition. This also makes it more flexible in different domains.
2. The method outperforms baselines on several benchmarks while being less complex. The authors also showed that the generated MCPs (tools) can be reused in other scenarios, such as other agent frameworks and smaller LLMs.
1. Although the framework does not rely on predefined tools, it still needs to generate task-specific tools and manage the resulting tool set. Therefore, the advantage of simplicity is only evident at the initial stage. In essence, compared with other agent frameworks, this approach merely adds a tool creation module. From this perspective, the contribution and novelty of the method are rather incremental.
2. The experiments are not sufficiently thorough. Although the proposed framework achieve
- Clear idea, small starting system. The “minimal preset + self-evolution” design cuts manual plumbing and avoids a big tool zoo.
- Good results. On GAIA, Alita beats many strong baselines; results on MathVista and PathVQA are also competitive.
- Transfer to other stacks. Reusing Alita’s MCP tools inside another agent gives clear gains, which suggests a nice path for sharing skills across systems.
- Scalability over time is untested. As the MCP store grows, retrieval speed, conflicts, and tool choice errors may rise. No long-run or high-load study.
- Ablations are missing. We don’t know how much each module (brainstorming, web search, env recovery, retries) actually helps.
- Key method details are thin. How MCPs are checked, stored, deduped, and retrieved is not clear enough to reproduce or scale.
- Safety and compliance. The agent searches the web, pulls code, and runs it. The paper d
Taxonomy
Topics: Evolutionary Algorithms and Applications
