SODIUM: From Open Web Data to Queryable Databases
Chuxuan Hu, Philip Li, Maxwell Yang, Daniel Kang

TL;DR
This paper introduces SODIUM, a task formalization and benchmark for automatically extracting web data and integrating it into queryable databases, and shows that a multi-agent system raises accuracy on the benchmark from 46.5% (existing systems) to 91.1%.
Contribution
The paper formalizes the SODIUM task, builds the SODIUM-Bench benchmark, and develops a multi-agent system, SODIUM-Agent, that substantially improves web data integration accuracy.
Findings
Existing systems achieve only 46.5% accuracy on SODIUM-Bench.
SODIUM-Agent achieves 91.1% accuracy, roughly double the 46.5% of existing systems.
The ATP-BFS algorithm enhances deep web exploration and data extraction.
Abstract
During research, domain experts often ask analytical questions whose answers require integrating data from a wide range of web sources. Thus, they must spend substantial effort searching, extracting, and organizing raw data before analysis can begin. We formalize this process as the SODIUM task, where we conceptualize open domains such as the web as latent databases that must be systematically instantiated to support downstream querying. Solving SODIUM requires (1) conducting in-depth and specialized exploration of the open web, which is further strengthened by (2) exploiting structural correlations for systematic information extraction and (3) integrating collected information into coherent, queryable database instances. To quantify the challenges in automating SODIUM, we construct SODIUM-Bench, a benchmark of 105 tasks derived from published academic papers across 6 domains, where…
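The three steps named in the abstract (explore, extract, integrate into a queryable database) can be illustrated with a minimal sketch. This is not the authors' SODIUM-Agent or their ATP-BFS algorithm: it uses a plain breadth-first traversal over a hypothetical in-memory "web" (`TOY_WEB`), and the table name `observations`, its columns, and all data values are made up for illustration.

```python
import sqlite3
from collections import deque

# Hypothetical toy "web": page id -> (outgoing links, records found on the page).
# A real system would fetch and parse pages; this dict stands in for that.
TOY_WEB = {
    "index": (["stats/2020", "stats/2021"], []),
    "stats/2020": ([], [("2020", "A", 10.0), ("2020", "B", 12.5)]),
    "stats/2021": (["index"], [("2021", "A", 11.0), ("2021", "B", 14.5)]),
}


def explore(start):
    """Plain breadth-first traversal that collects records in visit order."""
    seen = {start}
    queue = deque([start])
    records = []
    while queue:
        page = queue.popleft()
        links, rows = TOY_WEB[page]
        records.extend(rows)
        for link in links:
            if link not in seen:  # skip pages already queued or visited
                seen.add(link)
                queue.append(link)
    return records


def build_database(records):
    """Load extracted records into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observations (year TEXT, entity TEXT, value REAL)"
    )
    conn.executemany("INSERT INTO observations VALUES (?, ?, ?)", records)
    conn.commit()
    return conn


conn = build_database(explore("index"))

# Once the data is instantiated as a table, the analytical question is plain SQL.
rows = conn.execute(
    "SELECT entity, AVG(value) FROM observations "
    "GROUP BY entity ORDER BY entity"
).fetchall()
print(rows)  # [('A', 10.5), ('B', 13.5)]
```

The point of the sketch is the separation of concerns: exploration and extraction produce rows, and only after those rows are loaded into a schema does the downstream question become a query.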
Taxonomy
Topics: Web Data Mining and Analysis · Data Quality and Management · Information Retrieval and Search Behavior
