AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song

TL;DR
AgentSynth is a scalable, cost-effective pipeline that automatically generates diverse, challenging tasks for generalist computer-use agents, enabling robust benchmarking and advancing AI capabilities.
Contribution
It introduces a novel method for synthesizing high-quality, challenging tasks with controllable complexity, significantly reducing costs compared to human annotation.
Findings
The pipeline generated over 6,000 diverse, realistic tasks.
State-of-the-art LLM agents' success rate drops from 18% at difficulty level 1 to 4% at level 6.
The pipeline achieves a low average cost of $0.60 per trajectory.
Abstract
We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents. Leveraging information asymmetry, AgentSynth constructs subtasks that are simple during generation but significantly more challenging when composed into long-horizon tasks, enabling the creation of over 6,000 diverse and realistic tasks. A key strength of AgentSynth is its ability to precisely modulate task complexity by varying the number of subtasks. Empirical evaluations show that state-of-the-art LLM agents suffer a steep performance drop, from 18% success at difficulty level 1 to just 4% at level 6, highlighting the benchmark's difficulty and discriminative power. Moreover, our pipeline achieves a low average cost of $0.60 per trajectory, orders of magnitude cheaper than human annotations. Our code and data are…
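The difficulty control described in the abstract can be illustrated with a minimal sketch. It assumes that a task at difficulty level d is built from the first d subtasks of a generated chain; the abstract only says complexity is modulated by the number of subtasks, so that prefix rule is an assumption. The names `Subtask`, `compose_task`, `tasks_by_difficulty`, and `render_instruction` are hypothetical and do not come from the AgentSynth code, and the string join merely stands in for whatever the pipeline actually uses to write the final task description.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Subtask:
    """One simple step produced during generation (hypothetical type)."""
    description: str


def compose_task(subtasks: List[Subtask], difficulty: int) -> List[Subtask]:
    """Return the first `difficulty` subtasks as one long-horizon task.

    Assumption: difficulty level d means "the first d subtasks of the chain".
    """
    if not 1 <= difficulty <= len(subtasks):
        raise ValueError("difficulty must be between 1 and len(subtasks)")
    return subtasks[:difficulty]


def tasks_by_difficulty(subtasks: List[Subtask]) -> Dict[int, List[Subtask]]:
    """Build one composed task per difficulty level from a single chain."""
    return {d: compose_task(subtasks, d) for d in range(1, len(subtasks) + 1)}


def render_instruction(task: List[Subtask]) -> str:
    """Join subtask descriptions into a single instruction (placeholder only)."""
    return "; then ".join(s.description for s in task)


# Illustrative chain of six subtasks (made-up descriptions, not from the dataset).
chain = [
    Subtask("open the browser"),
    Subtask("search for a flight"),
    Subtask("copy the cheapest price"),
    Subtask("open a spreadsheet"),
    Subtask("paste the price into cell A1"),
    Subtask("save the file"),
]
levels = tasks_by_difficulty(chain)
print(render_instruction(levels[2]))
```

Each subtask is easy on its own, but an agent given only the level-6 instruction has to plan and carry out all six steps without the intermediate guidance available during generation.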
Peer Reviews
Decision: ICLR 2026 Poster
- The paper tackles an important problem of testing LLM agents in complex tasks involving multiple subtasks, which is not easily controllable in existing benchmark works. The proposed approach is intuitive and cost effective for composing arbitrarily complex sequences of salient subtasks.
- The results indicate that baseline agents indeed suffer from compositional tasks.
- The paper is well written and easy to follow.
- My main concern with the paper is the lack of advanced agent approaches evaluated on the benchmark. The baseline agent tested consists of a basic agent with a simple prompt, and in my understanding only represents the performance lower bound on the benchmark. Without evaluating more advanced approaches, it is difficult to understand how well competitive agents would perform on the benchmark.
- A missing citation is [1]. [1] Exposing Limitations of Language Model Agents in Sequential-Task Compo
1. Clever use of information asymmetry to make generation easy but evaluation hard; difficulty is tunable via subtask count (d), enabling principled long-horizon benchmarks.
2. Results show a sharp success-rate drop as d increases, making the benchmark’s discriminative power evident.
3. Practicality & scale: diverse, multi-tool tasks on real desktop environments; cost-efficient generation with transparent cost accounting.
1. Limited causal evidence: missing ablations of asymmetry vs. direct instruction, and broader verifier calibration (agreement curves, partial-credit) across tools and difficulty.
2. Equating difficulty with horizon is too narrow: complementary axes (fine-grained perception, long-term memory, interrupt handling) would enrich difficulty control.
3. The verifier is still LLM-based and can misjudge corner cases. More human audits or adversarial stress tests for the verifier would streng
The main contribution of this paper is a challenging benchmark that is more cost-effective than other methods. Additionally, the paper provides a detailed statistical analysis of the benchmark.
1. Despite its effectiveness, the "easy-to-hard" data synthesis approach in this paper is not entirely novel and can be observed in data synthesis across various other domains. This aspect diminishes the paper's innovativeness.
2. The paper devotes a significant amount of content to statistical analysis of the dataset. Some data cases presented in the main text are unnecessary, resulting in a limited experimental section and potentially giving the impression of a less robust study. It is recomm
Taxonomy
Topics: Multi-Agent Systems and Negotiation
