DBAutoDoc: Automated Discovery and Documentation of Undocumented Database Schemas via Statistical Analysis and Iterative LLM Refinement
Amith Nagarajan, Thomas Altman

TL;DR
DBAutoDoc combines statistical analysis with iterative large language model (LLM) refinement to automatically discover and document undocumented relational database schemas, improving the accuracy of the resulting schema understanding.
Contribution
The paper introduces DBAutoDoc, an iterative approach structurally inspired by backpropagation in neural networks, which refines schema documentation by propagating semantic corrections through schema dependency graphs.
Findings
Achieves a 96.1% weighted score on benchmark databases.
The deterministic pipeline improves foreign key (FK) detection F1 by 23 points.
The open-source release supports reproducibility and community use.
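To make the FK detection finding concrete, here is a minimal sketch of one common statistical heuristic: a column is a foreign key candidate when its non-null values are contained in a unique, non-null column of another table. This illustrates the general idea only and is not taken from the paper. The function names, the in-memory table layout, and the `min_containment` threshold are all assumptions made for the example.

```python
def candidate_keys(tables):
    """Return (table, column) pairs whose values are all non-null and unique."""
    keys = []
    for table, columns in tables.items():
        for column, values in columns.items():
            if values and None not in values and len(set(values)) == len(values):
                keys.append((table, column))
    return keys


def fk_candidates(tables, min_containment=1.0):
    """Hypothetical helper: list columns whose values are contained in a
    candidate key of another table (a simple value-containment check)."""
    keys = candidate_keys(tables)
    found = []
    for table, columns in tables.items():
        for column, values in columns.items():
            child = {v for v in values if v is not None}
            if not child:
                continue
            for key_table, key_column in keys:
                if key_table == table:
                    continue  # this sketch ignores self-references
                parent = set(tables[key_table][key_column])
                containment = len(child & parent) / len(child)
                if containment >= min_containment:
                    found.append((table, column, key_table, key_column, containment))
    return found


# Toy data with cryptic column names and no declared constraints.
tables = {
    "cust": {"cust_id": [1, 2, 3], "nm": ["Ann", "Bob", "Cy"]},
    "ord": {
        "ord_id": [10, 11, 12, 13],
        "c_id": [1, 1, 3, None],
        "amt": [5.5, 7.5, 5.5, 9.5],
    },
}

result = fk_candidates(tables)
print(result)
```

On this toy data the only candidate reported is `ord.c_id` referencing `cust.cust_id`. A real system would need to handle sampling, type compatibility, and coincidental overlaps between small integer ranges, none of which this sketch attempts.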
Abstract
A tremendous number of critical database systems lack adequate documentation. Declared primary keys are absent, foreign key constraints have been dropped for performance, column names are cryptic abbreviations, and no entity-relationship diagrams exist. We present DBAutoDoc, a system that automates the discovery and documentation of undocumented relational database schemas by combining statistical data analysis with iterative large language model (LLM) refinement. DBAutoDoc's central insight is that schema understanding is fundamentally an iterative, graph-structured problem. Drawing structural inspiration from backpropagation in neural networks, DBAutoDoc propagates semantic corrections through schema dependency graphs across multiple refinement iterations until descriptions converge. This propagation is discrete and semantic rather than mathematical, but the structural analogy is…
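The iterative, graph-structured refinement described in the abstract can be sketched as a fixed-point loop over a dependency graph. This is a hedged illustration and not the authors' implementation. The `refine` callable stands in for an LLM call, and `refine_until_converged`, `toy_refine`, the graph layout, and the `max_iterations` cap are all hypothetical names and choices introduced for this example.

```python
def refine_until_converged(graph, descriptions, refine, max_iterations=10):
    """Repeatedly refine each table's description using its neighbors'
    current descriptions, stopping when a full pass changes nothing."""
    current = dict(descriptions)
    for iteration in range(1, max_iterations + 1):
        updated = {}
        for table in graph:
            neighbors = {n: current[n] for n in graph[table]}
            updated[table] = refine(table, current[table], neighbors)
        if updated == current:
            return current, iteration  # converged: no description changed
        current = updated
    return current, max_iterations


def toy_refine(table, description, neighbors):
    """Deterministic stand-in for an LLM: an unknown table becomes
    describable once at least one table it depends on is understood."""
    if description != "unknown":
        return description
    known = sorted(n for n, d in neighbors.items() if d != "unknown")
    return f"child of {known[0]}" if known else description


# Each table maps to the tables it depends on (assumed layout).
graph = {"cust": [], "ord": ["cust"], "ord_ln": ["ord"]}
descriptions = {"cust": "customer", "ord": "unknown", "ord_ln": "unknown"}

final, iterations = refine_until_converged(graph, descriptions, toy_refine)
print(final, iterations)
```

In this toy run, understanding of `cust` reaches `ord` in the first pass and `ord_ln` in the second, and a third pass confirms that nothing changes. That mirrors the idea that a correction to one description can take several iterations to reach tables further along the dependency graph.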
Taxonomy
Topics: Data Quality and Management · Topic Modeling · Advanced Database Systems and Queries
