TaskCraft: Automated Generation of Agentic Tasks

Dingfeng Shi; Jingyi Cao; Qianben Chen; Weichen Sun; Weizhen Li; Hongxuan Lu; Fangchen Dong; Tianrui Qin; King Zhu; Minghao Liu; Jian Yang; Ge Zhang; Jiaheng Liu; Changwang Zhang; Jun Wang; Yuchen Eleanor Jiang; Wangchunshu Zhou

arXiv:2506.10055 · cs.CL · June 18, 2025

TL;DR

TaskCraft is an automated system that generates complex, multi-tool agentic tasks with verifiable execution trajectories, improving prompt optimization and model fine-tuning in NLP and AI.

Contribution

It introduces an automated workflow for creating scalable, hierarchical agentic tasks with execution traces, reducing reliance on costly human annotation.

Findings

01

Generated tasks enhance prompt optimization.

02

Improved supervised fine-tuning of agentic models.

03

A large-scale synthetic dataset of approximately 36,000 tasks with varying difficulty.

Abstract

Agentic tasks, which require multi-step problem solving with autonomy, tool use, and adaptive reasoning, are becoming increasingly central to the advancement of NLP and AI. However, existing instruction data lacks tool interaction, and current agentic benchmarks rely on costly human annotation, limiting their scalability. We introduce TaskCraft, an automated workflow for generating difficulty-scalable, multi-tool, and verifiable agentic tasks with execution trajectories. TaskCraft expands atomic tasks using depth-based and width-based extensions to create structurally and hierarchically complex challenges. Empirical results show that these tasks improve prompt optimization in the generation workflow and enhance supervised fine-tuning of agentic foundation models. We present a large-scale synthetic dataset of approximately 36,000 tasks with varying difficulty to support future…
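The depth-based and width-based extensions named in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the `Task` record, the `extend_depth` and `extend_width` functions, and the example questions are hypothetical and are not taken from the TaskCraft repository. The abstract does not spell out the mechanics, so the sketch assumes that a depth extension hides a term of the question behind a sub-task that resolves to it, and that a width extension merges independent tasks into one. Plain string substitution stands in for whatever LLM-driven generation and verification the actual workflow performs.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task record: a question, its answer, and the tool steps it needs."""
    question: str
    answer: str
    steps: list[str] = field(default_factory=list)


def extend_depth(task: Task, key: str, sub: Task) -> Task:
    """Depth-style extension (illustrative): hide `key` behind a sub-task that resolves to it."""
    if sub.answer != key or key not in task.question:
        raise ValueError("sub-task must resolve to a term that appears in the question")
    # Replace the first occurrence of the key with a reference to the sub-question.
    question = task.question.replace(key, f"[the answer to: {sub.question}]", 1)
    # The sub-task must be solved first, so its steps come before the original ones.
    return Task(question, task.answer, sub.steps + task.steps)


def extend_width(tasks: list[Task]) -> Task:
    """Width-style extension (illustrative): merge independent tasks into one composite task."""
    question = " ".join(f"({i}) {t.question}" for i, t in enumerate(tasks, 1))
    answer = "; ".join(t.answer for t in tasks)
    steps = [s for t in tasks for s in t.steps]
    return Task(question, answer, steps)


# Made-up example data; the tool names are placeholders.
atomic = Task("In what year was Example Corp founded?", "1999", ["web_search"])
sub = Task("Which company published the Example Report?", "Example Corp", ["read_pdf"])

deeper = extend_depth(atomic, "Example Corp", sub)
wider = extend_width([atomic, sub])
```

In this sketch the depth extension lengthens the chain of steps while keeping a single final answer, and the width extension keeps each chain short but requires several answers at once, which is one way to read "structurally and hierarchically complex" in the abstract.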

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. Originality: Proposes innovative depth-based and width-based methods for data generation.
2. Significance: Effectively addresses the scalability challenge of agentic data, a key bottleneck in training and evaluating tool-using LLM agents.

Weaknesses

1. The paper’s method description is unclear. For example, the function $f()$ is used inconsistently across different contexts (e.g., lines 190, 196, 863, and 871).
2. Why not directly use the 7.5k TaskCraft data for SFT training? Could you include an additional experiment comparing its performance with the 7.5k MHQA dataset?

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The proposed task generation pipeline is a simple idea that leverages LLMs to synthesize high-quality tool-use tasks.
2. The tasks and task execution traces generated through TaskCraft lead to training effective tool-use agents.
3. The proposed method is easy to scale to a large number of tasks by gathering a large unlabeled corpus of webpages, PDFs, and images.

Weaknesses

1. The method section is a bit hard to follow. It would be good if the authors could take another pass at the writing and improve the flow of the content to make the method easier to understand.
2. There is a slight inconsistency in the types of datasets used (or at least in the table descriptions for entries) across model sizes in the experiments presented in Table 3. For example, the Qwen2.5-7B and DeepSeek R1 distill models have results for training on 7.5k MHQA tasks, whereas Qwen2.5-32B doesn’t mention the size

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. TaskCraft's automated workflow supports adaptive difficulty progression through depth-based and width-based extensions, eliminating the annotation bottleneck that limits existing benchmarks.
2. The generated tasks span varied difficulty levels across multiple tool modalities (web, PDF, image), including complex multi-hop reasoning tasks. This difficulty stratification mirrors human-curated benchmarks like GAIA while maintaining scalability.
3. The paper provides thorough experimental analys

Weaknesses

Please refer to questions section.

Code & Models

Repositories

oppo-personalai/taskcraft
Official

Datasets

PersonalAILab/TaskCraft
dataset · 3.8k downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques