TaskCraft: Automated Generation of Agentic Tasks
Dingfeng Shi, Jingyi Cao, Qianben Chen, Weichen Sun, Weizhen Li, Hongxuan Lu, Fangchen Dong, Tianrui Qin, King Zhu, Minghao Liu, Jian Yang, Ge Zhang, Jiaheng Liu, Changwang Zhang, Jun Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou

TL;DR
TaskCraft is an automated system that generates complex, multi-tool agentic tasks with verifiable execution trajectories, improving prompt optimization and model fine-tuning in NLP and AI.
Contribution
It introduces an automated workflow for creating scalable, hierarchical agentic tasks with execution traces, reducing reliance on costly human annotation.
Findings
Generated tasks improve prompt optimization in the generation workflow.
Generated tasks enhance supervised fine-tuning of agentic foundation models.
The authors present a synthetic dataset of approximately 36,000 tasks with varying difficulty.
Abstract
Agentic tasks, which require multi-step problem solving with autonomy, tool use, and adaptive reasoning, are becoming increasingly central to the advancement of NLP and AI. However, existing instruction data lacks tool interaction, and current agentic benchmarks rely on costly human annotation, limiting their scalability. We introduce TaskCraft, an automated workflow for generating difficulty-scalable, multi-tool, and verifiable agentic tasks with execution trajectories. TaskCraft expands atomic tasks using depth-based and width-based extensions to create structurally and hierarchically complex challenges. Empirical results show that these tasks improve prompt optimization in the generation workflow and enhance supervised fine-tuning of agentic foundation models. We present a large-scale synthetic dataset of approximately 36,000 tasks with varying difficulty to support future…
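The abstract names depth-based and width-based extensions without giving details, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a depth extension hides an entity behind a description that needs one extra tool call to resolve, and a width extension merges independent atomic tasks into a single compound task. The Task class, the extend_depth and extend_width functions, and the web_search strings are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """Hypothetical task record: question, verifiable answer, ordered tool calls."""
    question: str
    answer: str
    trajectory: List[str] = field(default_factory=list)

    @property
    def hops(self) -> int:
        # Number of tool calls, used here as a rough difficulty proxy (an assumption).
        return len(self.trajectory)


def extend_depth(task: Task, entity: str, description: str, tool_call: str) -> Task:
    """Assumed depth extension: replace an entity in the question with a
    description that must be resolved first, adding one step to the trajectory."""
    if entity not in task.question:
        raise ValueError(f"{entity!r} does not appear in the question")
    return Task(
        question=task.question.replace(entity, description),
        answer=task.answer,  # the final answer is unchanged, so it stays verifiable
        trajectory=[tool_call] + task.trajectory,
    )


def extend_width(tasks: List[Task]) -> Task:
    """Assumed width extension: merge independent tasks into one compound task
    whose answer is the ordered combination of the individual answers."""
    if len(tasks) < 2:
        raise ValueError("width extension needs at least two tasks")
    return Task(
        question=" ".join(t.question for t in tasks),
        answer="; ".join(t.answer for t in tasks),
        trajectory=[step for t in tasks for step in t.trajectory],
    )


atomic = Task(
    "In what year was the Eiffel Tower completed?",
    "1889",
    ["web_search('Eiffel Tower completion year')"],
)
deeper = extend_depth(
    atomic,
    entity="the Eiffel Tower",
    description="the tallest structure in Paris",
    tool_call="web_search('tallest structure in Paris')",
)
other = Task(
    "What is the capital of France?",
    "Paris",
    ["web_search('capital of France')"],
)
wider = extend_width([atomic, other])
```

In this sketch both extensions keep the answer mechanically checkable: the depth variant lengthens the trajectory while preserving the original answer, and the width variant concatenates answers in a fixed order.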
Peer Reviews
Decision: ICLR 2026 Poster
1. Originality: Proposes innovative depth-based and width-based methods for data generation. 2. Significance: Effectively addresses the scalability challenge of agentic data, a key bottleneck in training and evaluating tool-using LLM agents.
1. The paper’s method description is unclear. For example, the function $f()$ is used inconsistently across different contexts (e.g., lines 190, 196, 863, and 871). 2. Why not directly use the 7.5k TaskCraft data for SFT training? Could you include an additional experiment comparing its performance with the 7.5k MHQA dataset?
1. The proposed task generation pipeline is a simple idea that leverages LLMs to synthesize high-quality tool-use tasks. 2. The tasks and task execution traces generated through TaskCraft lead to training effective tool-use agents. 3. The proposed method scales easily to a large number of tasks by gathering a large unlabeled corpus of webpages, PDFs, and images.
1. The method section is a bit hard to follow. It would help if the authors took another pass at the writing and improved the flow so that the method is easier to understand. 2. There is a slight inconsistency in the datasets used (or at least in the table descriptions of the entries) across model sizes in the experiments presented in Table 3. For example, the Qwen2.5-7B and DeepSeek R1 distill models have results for training on 7.5k MHQA tasks, whereas the Qwen2.5-32B entry doesn't mention the size
1. TaskCraft's automated workflow supports adaptive difficulty progression through depth-based and width-based extensions, eliminating the annotation bottleneck that limits existing benchmarks. 2. The generated tasks span varied difficulty levels across multiple tool modalities (web, PDF, image), including complex multi-hop reasoning tasks. This difficulty stratification mirrors human-curated benchmarks like GAIA while maintaining scalability. 3. The paper provides thorough experimental analys
Please refer to questions section.
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
