Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs
Yuxuan Lu, Ziyi Wang, Yingzhou Lu, Yisi Sang, Jiri Gesi, Xianfeng Tang, Yimeng Zhang, Zhenwei Dai, Hui Liu, Hanqing Lu, Chen Luo, Qi He, Benoit Dumoulin, Jing Huang, Dakuo Wang

TL;DR
FireFly is a novel pipeline that generates large-scale, verified tool-call data from real APIs by leveraging LLM exploration and backward task synthesis, enabling improved training of tool-calling agents.
Contribution
The paper introduces a new method for generating verified tool-call data from real APIs using graph-guided exploration and backward task synthesis, ensuring label correctness.
Findings
Generated 5,144 verified tasks across 240 servers and 993 tools.
A 4B-parameter model trained on FireFly data matches or exceeds existing baselines.
FireFly improves tool-calling performance on multiple benchmarks.
Abstract
Training tool-calling agents requires large-scale trajectory data with verifiable labels, yet existing approaches either synthesize environments that diverge from real API behavior or generate tasks without ground-truth outcomes for verification. We present FireFly, a pipeline for generating verified tool-call data from real-world MCP servers. Our key insight is to invert the standard synthesis pipeline: rather than generating tasks and hoping they are solvable, we first let a strong LLM explore real APIs along graph-guided DAG structures, then synthesize tasks backward from observed outcomes, guaranteeing label correctness by construction. To handle the scale of real-world tool spaces (roughly 1,000 tools), we build a pairwise tool graph and sample sub-DAGs to focus exploration on semantically coherent workflows. To address environment drift in live APIs, we construct a…
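To make the "sample sub-DAGs, then synthesize backward" idea concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not say how sub-DAGs are sampled, so breadth-first expansion from a random tool is used purely as a stand-in, and the graph format, the tool names, and the functions `sample_sub_dag`, `topological_order`, and `synthesize_task` are all hypothetical. The sketch also assumes the pairwise tool graph is already acyclic.

```python
import random
from collections import deque


def sample_sub_dag(edges, size, rng):
    """Sample a connected sub-DAG with at most `size` tools.

    `edges` maps a tool name to the tools that can consume its output
    (hypothetical format; assumed acyclic). Breadth-first expansion is an
    illustrative stand-in for whatever sampler the paper uses.
    """
    start = rng.choice(sorted(edges))
    chosen = [start]
    frontier = deque([start])
    while frontier and len(chosen) < size:
        node = frontier.popleft()
        successors = [t for t in edges.get(node, []) if t not in chosen]
        rng.shuffle(successors)
        for nxt in successors:
            if len(chosen) >= size:
                break
            if nxt not in chosen:
                chosen.append(nxt)
                frontier.append(nxt)
    keep = set(chosen)
    # Keep only the edges whose endpoints were both sampled.
    return {n: [t for t in edges.get(n, []) if t in keep] for n in chosen}


def topological_order(dag):
    """Return an execution order for the tools (Kahn's algorithm)."""
    indegree = {n: 0 for n in dag}
    for targets in dag.values():
        for t in targets:
            indegree[t] += 1
    ready = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for t in dag[n]:
            indegree[t] -= 1
            if indegree[t] == 0:
                ready.append(t)
    if len(order) != len(dag):
        raise ValueError("graph has a cycle")
    return order


def synthesize_task(trace):
    """Build a task record backward from an observed execution trace.

    The label is the result that was actually observed, so it is correct
    by construction. Writing the natural-language task is left out here.
    """
    return {
        "tools": [step["tool"] for step in trace],
        "expected": trace[-1]["result"],
    }


# Hypothetical tool graph: an edge A -> B means B can use A's output.
EDGES = {
    "search_flights": ["get_fare", "get_seat_map"],
    "get_fare": ["book_flight"],
    "get_seat_map": ["book_flight"],
    "book_flight": [],
}

sub_dag = sample_sub_dag(EDGES, size=3, rng=random.Random(0))
plan = topological_order(sub_dag)
```

In a real pipeline the exploring LLM would execute the tools in `plan` against the live servers, and the recorded calls and results would be passed to `synthesize_task` as `trace`.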
