SQUiD: Synthesizing Relational Databases from Unstructured Text

Mushtari Sadia; Zhenning Yang; Yunming Xiao; Ang Chen; Amrita Roy Chowdhury

arXiv:2505.19025·cs.DB·March 3, 2026

TL;DR

SQUiD is a neurosymbolic framework that uses large language models to generate a relational database schema and populate its tables from unstructured text. In the authors' experiments it consistently outperforms baselines across diverse datasets.

Contribution

We introduce SQUiD, a novel four-stage neurosymbolic approach leveraging LLMs for automatic database synthesis from raw text, advancing the automation of data management.

Findings

01 SQUiD outperforms baseline methods across various datasets.

02 The framework effectively generates accurate database schemas and data.

03 Code and datasets are publicly available for reproducibility.

Abstract

Relational databases are central to modern data management, yet most data exists in unstructured forms like text documents. To bridge this gap, we leverage large language models (LLMs) to automatically synthesize a relational database by generating its schema and populating its tables from raw text. We introduce SQUiD, a novel neurosymbolic framework that decomposes this task into four stages, each with specialized techniques. Our experiments show that SQUiD consistently outperforms baselines across diverse datasets. Our code and datasets are publicly available at: https://github.com/Mushtari-Sadia/SQUiD.
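To make the target of the task concrete, here is a minimal sketch of what "generating a schema and populating its tables" from text can look like. It is not SQUiD's pipeline: the example sentence, the hand-written records (standing in for whatever an LLM would extract), the table names, and the `build_database` helper are all hypothetical, chosen only for illustration.

```python
import sqlite3

# Hypothetical input text and hand-written extraction results. In a real
# system an LLM would produce these records; nothing here comes from SQUiD.
TEXT = "Alice Smith works at Acme in Boston. Bob Lee works at Acme in Boston."
RECORDS = [
    {"person": "Alice Smith", "company": "Acme", "city": "Boston"},
    {"person": "Bob Lee", "company": "Acme", "city": "Boston"},
]


def build_database(records):
    """Create an assumed two-table schema and populate it from records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    # Schema step: companies are split out so each one is stored only once.
    conn.executescript(
        """
        CREATE TABLE company (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE,
            city TEXT NOT NULL
        );
        CREATE TABLE person (
            id         INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            company_id INTEGER NOT NULL REFERENCES company(id)
        );
        """
    )
    # Population step: insert each company once, then link people to it.
    for rec in records:
        conn.execute(
            "INSERT OR IGNORE INTO company (name, city) VALUES (?, ?)",
            (rec["company"], rec["city"]),
        )
        (company_id,) = conn.execute(
            "SELECT id FROM company WHERE name = ?", (rec["company"],)
        ).fetchone()
        conn.execute(
            "INSERT INTO person (name, company_id) VALUES (?, ?)",
            (rec["person"], company_id),
        )
    conn.commit()
    return conn


conn = build_database(RECORDS)
rows = conn.execute(
    """
    SELECT person.name, company.name, company.city
    FROM person JOIN company ON person.company_id = company.id
    ORDER BY person.name
    """
).fetchall()
for row in rows:
    print(row)
```

The join at the end recovers the facts stated in the sentence from the normalized tables, which is the property a synthesized database should have: the text's information is preserved while repeated entities (here, the company) are stored once.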

Taxonomy

Topics: Semantic Web and Ontologies