SciDaSynth: Interactive Structured Data Extraction from Scientific Literature with Large Language Model
Xingbo Wang, Samantha L. Huey, Rui Sheng, Saurabh Mehta, Fei Wang

TL;DR
SciDaSynth is an interactive system leveraging large language models to extract, validate, and refine structured data from scientific literature, improving efficiency and accuracy in data collection across multimodal sources.
Contribution
The paper introduces SciDaSynth, an LLM-powered interactive system for extracting, validating, and refining structured data from diverse scientific documents, a task where existing tools struggle with multimodal, varied, and inconsistent information.
Findings
Outperforms baseline methods in data extraction efficiency
Produces high-quality structured data from multimodal sources
Enhances data validation and consistency across documents
Abstract
The explosion of scientific literature has made the efficient and accurate extraction of structured data a critical component for advancing scientific knowledge and supporting evidence-based decision-making. However, existing tools often struggle to extract and structure multimodal, varied, and inconsistent information across documents into standardized formats. We introduce SciDaSynth, a novel interactive system powered by large language models (LLMs) that automatically generates structured data tables according to users' queries by integrating information from diverse sources, including text, tables, and figures. Furthermore, SciDaSynth supports efficient table data validation and refinement, featuring multi-faceted visual summaries and semantic grouping capabilities to resolve cross-document data inconsistencies. A within-subjects study with nutrition and NLP researchers demonstrates…
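The abstract mentions semantic grouping as a way to resolve cross-document inconsistencies but does not say how it is implemented. The sketch below is only an illustration of the general idea, not SciDaSynth's method: it swaps in plain string similarity from Python's standard `difflib` module in place of any LLM-based semantics. The function names `normalize` and `group_values`, the 0.85 threshold, and the sample cell values are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase the value and collapse hyphens and whitespace."""
    return " ".join(value.lower().replace("-", " ").split())


def group_values(values, threshold=0.85):
    """Greedily group values whose normalized forms are similar.

    Each value joins the first group whose representative (the group's
    first member) has a similarity ratio >= threshold; otherwise it
    starts a new group. The threshold is an assumed, tunable parameter.
    """
    groups = []
    for value in values:
        key = normalize(value)
        for group in groups:
            representative = normalize(group[0])
            if SequenceMatcher(None, key, representative).ratio() >= threshold:
                group.append(value)
                break
        else:
            groups.append([value])
    return groups


# Made-up cell values, as they might appear in one column extracted
# from several papers.
cells = ["Sweet potato", "sweet-potato", "Sweetpotato", "Maize", "maize ", "Cassava"]
groups = group_values(cells)
for group in groups:
    print(group)
```

A reviewer could then inspect each group and choose one canonical spelling per group, which is the kind of refinement step the abstract describes at a high level.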
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Semantic Web and Ontologies
