Synthetic SQL Column Descriptions and Their Impact on Text-to-SQL Performance
Niklas Wretblad, Oskar Holmström, Erik Larsson, Axel Wiksäter, Oscar Söderlund, Hjalmar Öhman, Ture Pontén, Martin Forsberg, Martin Sörme, Fredrik Heintz

TL;DR
This paper investigates using large language models to automatically generate detailed SQL column descriptions to improve text-to-SQL performance, and finds that richer metadata improves model accuracy, especially for larger models.
Contribution
It introduces a dataset of SQL column descriptions, evaluates LLMs for description generation, and demonstrates that generated metadata improves text-to-SQL accuracy, surpassing manual descriptions in some cases.
Findings
LLMs struggle with ambiguous columns, needing expert input.
Generated descriptions improve text-to-SQL performance, especially for large models.
Qwen2 descriptions outperform manual gold descriptions, benefiting from detailed metadata.
Abstract
Relational databases often suffer from uninformative descriptors of table contents, such as ambiguous columns and hard-to-interpret values, impacting both human users and text-to-SQL models. In this paper, we explore the use of large language models (LLMs) to automatically generate detailed natural language descriptions for SQL database columns, aiming to improve text-to-SQL performance and automate metadata creation. We create a dataset of gold column descriptions based on the BIRD-Bench benchmark, manually refining its column descriptions and creating a taxonomy for categorizing column difficulty. We then evaluate several different LLMs in generating column descriptions across the columns and different difficulties in the dataset, finding that models unsurprisingly struggle with columns that exhibit inherent ambiguity, highlighting the need for manual expert input. We also find that…
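As a rough illustration of how column descriptions can reach a text-to-SQL model, the sketch below attaches each description to its column as an SQL comment inside the schema portion of a prompt. This is a minimal sketch under stated assumptions, not the paper's pipeline: the function names, the prompt wording, and the example table, column, and description are all hypothetical and are not taken from BIRD-Bench.

```python
from typing import Dict, List, Tuple

# Hypothetical schema representation: table name -> list of (column name, SQL type).
Schema = Dict[str, List[Tuple[str, str]]]
# Hypothetical metadata: (table, column) -> natural language description.
Descriptions = Dict[Tuple[str, str], str]


def render_schema(schema: Schema, descriptions: Descriptions) -> str:
    """Render CREATE TABLE statements, adding each known description as an SQL comment."""
    blocks = []
    for table, columns in schema.items():
        lines = []
        for i, (name, sql_type) in enumerate(columns):
            # Every column except the last is followed by a comma.
            sep = "," if i < len(columns) - 1 else ""
            line = f"  {name} {sql_type}{sep}"
            desc = descriptions.get((table, name))
            if desc:
                line += f"  -- {desc}"
            lines.append(line)
        blocks.append(f"CREATE TABLE {table} (\n" + "\n".join(lines) + "\n);")
    return "\n\n".join(blocks)


def build_prompt(question: str, schema: Schema, descriptions: Descriptions) -> str:
    """Combine the annotated schema and the user question into one prompt string."""
    return (
        "Given the database schema below, write an SQL query that answers the question.\n\n"
        f"{render_schema(schema, descriptions)}\n\n"
        f"Question: {question}\nSQL:"
    )


# Made-up example: an ambiguous column name whose meaning only the description reveals.
schema: Schema = {"orders": [("order_id", "INTEGER"), ("st", "TEXT")]}
descriptions: Descriptions = {
    ("orders", "st"): "Order status code: 'S' = shipped, 'P' = pending",
}
prompt = build_prompt("How many orders are still pending?", schema, descriptions)
print(prompt)
```

Passing an empty `descriptions` dictionary yields the same prompt without comments, which gives a simple baseline for comparing prompts with and without metadata.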
Taxonomy
Topics: Advanced Database Systems and Queries · Advanced Computational Techniques and Applications · Semantic Web and Ontologies
Methods: ALIGN
