AutoData: A Multi-Agent System for Open Web Data Collection
Tianyi Ma, Yiyue Qian, Zheyuan Zhang, Zehong Wang, Xiaoye Qian, Feifan Bai, Yifan Ding, Xuwei Luo, Shinan Zhang, Keerthiram Murugesan, Chuxu Zhang, and Yanfang Ye

TL;DR
AutoData is a multi-agent system that automates web data collection with minimal human input, using a novel architecture and cache system to improve efficiency and reduce costs, validated on new and existing datasets.
Contribution
We introduce AutoData, a multi-agent system with a hypergraph architecture and cache system for scalable, cost-effective web data collection from natural language instructions.
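To make the two ingredients named above concrete, here is a minimal sketch, not the authors' implementation. It assumes a directed hypergraph in which each hyperedge links one sending agent to a set of receiving agents, and a simple keyed cache that avoids repeating an expensive call. The class names, the agent names ("manager", "browser", "coder"), and the example URL are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MessageHypergraph:
    """Directed hypergraph: each hyperedge links one sender to many receivers.

    Illustrative assumption only; the paper's actual design may differ.
    """
    edges: dict = field(default_factory=dict)    # edge name -> (sender, receivers)
    inboxes: dict = field(default_factory=dict)  # agent name -> list of messages

    def add_edge(self, name, sender, receivers):
        self.edges[name] = (sender, frozenset(receivers))
        for agent in (sender, *receivers):
            self.inboxes.setdefault(agent, [])

    def publish(self, edge, sender, message):
        expected_sender, receivers = self.edges[edge]
        if sender != expected_sender:
            raise ValueError(f"{sender} may not publish on {edge}")
        # Deliver once to every receiver on the hyperedge, and to no one else.
        for agent in sorted(receivers):
            self.inboxes[agent].append((sender, message))


class ResultCache:
    """Memoize an expensive call (for example, an LLM request) by its key."""

    def __init__(self, compute):
        self.compute = compute
        self.store = {}
        self.misses = 0

    def get(self, key):
        if key not in self.store:
            self.misses += 1  # only a miss triggers the expensive call
            self.store[key] = self.compute(key)
        return self.store[key]


# Hypothetical agents and task, for illustration only.
graph = MessageHypergraph()
graph.add_edge("plan", "manager", ["browser", "coder"])
graph.add_edge("report", "coder", ["manager"])
graph.publish("plan", "manager", "collect paper titles")

# A stand-in for an expensive page summarization call.
cache = ResultCache(lambda url: f"summary of {url}")
first = cache.get("https://example.com/papers")
second = cache.get("https://example.com/papers")  # served from the cache
```

In this sketch, a message published on a hyperedge reaches only the agents attached to that edge, which is one plausible way group messaging could limit how much context each agent receives. The cache illustrates how repeated requests for the same key could avoid additional token spend.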
Findings
AutoData outperforms baseline methods on benchmark datasets.
The system effectively reduces token costs in LLM-based data collection.
AutoData successfully collects data across diverse domains such as academia, finance, and sports.
Abstract
The exponential growth of data-driven systems and AI technologies has intensified the demand for high-quality web-sourced datasets. While existing datasets have proven valuable, conventional web data collection approaches face significant limitations in terms of human effort and scalability. Current data-collecting solutions fall into two categories: wrapper-based methods that struggle with adaptability and reproducibility, and large language model (LLM)-based approaches that incur substantial computational and financial costs. To address these challenges, we propose AutoData, a novel multi-agent system for Automated web Data collection that requires minimal human intervention, needing only a natural language instruction specifying the desired dataset. In addition, AutoData is designed with a robust multi-agent architecture, featuring a novel oriented message hypergraph…
Taxonomy
Topics: Web Data Mining and Analysis · Data Quality and Management · Information Retrieval and Search Behavior
