Making Sense of Data in the Wild: Data Analysis Automation at Scale
Mara Graziani, Malina Molnar, Irina Espejo Morales, Joris Cadow-Gossweiler, Teodoro Laino

TL;DR
This paper introduces an automated system that uses intelligent agents and retrieval-augmented generation to analyze, curate, and index datasets from public repositories, enhancing dataset diversity and utility for machine learning.
Contribution
The paper presents a novel automated approach that combines intelligent agents with retrieval-augmented generation for large-scale data analysis and curation, improving dataset descriptions and retrieval effectiveness.
Findings
More detailed dataset descriptions generated
Higher hit rates in dataset retrieval
Enhanced diversity in dataset selection
Abstract
As the volume of publicly available data continues to grow, researchers face the challenge of limited diversity in benchmarking machine learning tasks. Although thousands of datasets are available in public repositories, the sheer abundance often complicates the search for suitable data, leaving many valuable datasets underexplored. This situation is further amplified by the fact that, despite longstanding advocacy for improving data curation quality, current solutions remain prohibitively time-consuming and resource-intensive. In this paper, we propose a novel approach that combines intelligent agents with retrieval augmented generation to automate data analysis, dataset curation and indexing at scale. Our system leverages multiple agents to analyze raw, unstructured data across public repositories, generating dataset reports and interactive visual indexes that can be easily explored.…
Taxonomy
Topics: Online Learning and Analytics · Big Data Technologies and Applications · Big Data and Business Intelligence
