SciDaSynth: Interactive Structured Data Extraction From Scientific Literature With Large Language Model
Xingbo Wang, Samantha L. Huey, Rui Sheng, Saurabh Mehta, Fei Wang

TL;DR
SciDaSynth is an interactive system that uses large language models to extract data from scientific papers into structured tables, with built-in support for validating and refining the results.
Contribution
SciDaSynth introduces an interactive system for structured data extraction that integrates text, tables, and figures using large language models.
Findings
SciDaSynth outperforms baseline methods in producing high-quality structured data.
The system supports data validation and refinement through visual summaries and semantic grouping.
A within-subjects study with nutrition and NLP researchers confirmed its effectiveness in data extraction tasks.
Abstract
The explosion of scientific literature has made the efficient and accurate extraction of structured data a critical component for advancing scientific knowledge and supporting evidence‐based decision‐making. However, existing tools often struggle to extract and structure multimodal, varied, and inconsistent information across documents into standardized formats. We introduce SciDaSynth, a novel interactive system powered by large language models that automatically generates structured data tables according to users' queries by integrating information from diverse sources, including text, tables, and figures. Furthermore, SciDaSynth supports efficient table data validation and refinement, featuring multi‐faceted visual summaries and semantic grouping capabilities to resolve cross‐document data inconsistencies. A within‐subjects study with nutrition and NLP researchers demonstrates…
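To make the idea of resolving cross-document inconsistencies concrete, here is a minimal sketch of grouping variant spellings of the same value across extracted table rows. It is not SciDaSynth's method: the abstract describes semantic grouping, whereas this toy version substitutes plain string normalization so that it runs with the standard library alone. The function names, the `doc` and `crop` columns, and the sample rows are all hypothetical.

```python
import re
from collections import defaultdict


def normalize(value: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", value.lower())
    return " ".join(cleaned.split())


def group_values(rows, column):
    """Map each normalized value to the (doc, raw value) pairs that share it."""
    groups = defaultdict(list)
    for row in rows:
        groups[normalize(row[column])].append((row["doc"], row[column]))
    return dict(groups)


# Hypothetical rows, as if one table had been extracted from four papers.
rows = [
    {"doc": "paper_a", "crop": "Sweet Potato"},
    {"doc": "paper_b", "crop": "sweet-potato"},
    {"doc": "paper_c", "crop": "Maize"},
    {"doc": "paper_d", "crop": "maize "},
]

groups = group_values(rows, "crop")
for key, members in groups.items():
    print(key, members)
```

Each group collects the raw spellings that collapse to one key, so a reviewer could standardize them in a single step. A real system would likely replace `normalize` with an embedding-based or LLM-based similarity measure to catch synonyms that differ in more than punctuation and case.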
Taxonomy
Topics: Handwritten Text Recognition Techniques · Data Quality and Management · Web Data Mining and Analysis
