T5QL: Taming language models for SQL generation
Samuel Arcadinho, David Aparício, Hugo Veiga, António Alegria

TL;DR
T5QL is a novel SQL generation approach that enhances performance with smaller language models and guarantees valid SQL output by using a context-free grammar, reducing reliance on expensive large models.
Contribution
The paper introduces T5QL, which improves SQL generation accuracy with smaller models, ensures valid output through grammar constraints, and explores dividing the task into candidate generation and candidate re-ranking for efficiency.
Findings
13 percentage point improvement over SOTA methods when using T5-Base
Guarantees valid SQL output by constraining generation with a context-free grammar
Dividing semantic parsing into generation and re-ranking reduces model size requirements
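To make the grammar-constraint finding concrete, here is a minimal sketch of constrained greedy decoding. It is not the T5QL implementation: the toy grammar, the `allowed_next` helper, and the fixed `TOY_SCORES` table (a stand-in for language model scores) are all assumptions made for illustration. The idea it shows is that at each step only tokens the grammar permits are considered, so the output is valid by construction even when the scorer prefers an invalid token.

```python
# Toy illustration of grammar-constrained decoding (hypothetical names throughout).
COLUMNS = {"name", "age"}
TABLES = {"users"}
EOS = "<eos>"

def allowed_next(prefix):
    """Return the tokens the toy grammar permits after `prefix`.

    Grammar: query -> 'select' column (',' column)* 'from' table EOS
    """
    if not prefix:
        return {"select"}
    last = prefix[-1]
    if last in ("select", ","):
        return set(COLUMNS)
    if last in COLUMNS and "from" not in prefix:
        return {",", "from"}
    if last == "from":
        return set(TABLES)
    if last in TABLES:
        return {EOS}
    return set()

# Stand-in for model scores; "drop" scores highest but is never grammatical here.
TOY_SCORES = {"drop": 1.0, "from": 0.9, "users": 0.8, "select": 0.7,
              "name": 0.6, "age": 0.5, EOS: 0.4, ",": 0.1}

def toy_score(prefix, token):
    """Hypothetical scorer: ignores the prefix and looks up a fixed score."""
    return TOY_SCORES.get(token, 0.0)

def constrained_greedy(score, max_len=20):
    """Greedy decoding restricted to grammar-approved tokens at every step."""
    prefix = []
    for _ in range(max_len):
        candidates = allowed_next(prefix)
        if not candidates:
            break
        # sorted() makes tie-breaking deterministic
        token = max(sorted(candidates), key=lambda t: score(prefix, t))
        if token == EOS:
            break
        prefix.append(token)
    return " ".join(prefix)

print(constrained_greedy(toy_score))  # select name from users
```

An unconstrained greedy decoder using the same scores would pick "drop" at every step. A real system would apply the same masking idea over subword tokens and would typically use beam search rather than greedy search.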
Abstract
Automatic SQL generation has been an active research area, aiming to streamline access to databases by letting users state their intent in natural language instead of writing SQL. Current SOTA methods for semantic parsing depend on LLMs to achieve high predictive accuracy on benchmark datasets. This reduces their applicability, since LLMs require expensive GPUs. Furthermore, SOTA methods are ungrounded and thus not guaranteed to always generate valid SQL. Here we propose T5QL, a new SQL generation method that improves performance on benchmark datasets when using smaller LMs, namely T5-Base, by 13pp compared with SOTA methods. Additionally, T5QL is guaranteed to always output valid SQL, using a context-free grammar to constrain SQL generation. Finally, we show that dividing semantic parsing into two tasks, candidate SQL generation and candidate re-ranking, is a promising…
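The two-task split mentioned in the abstract can be sketched as a generate-then-re-rank pipeline. The sketch below is only an illustration under stated assumptions: `Candidate`, `rerank`, and `overlap_ranker` are hypothetical names, the candidate list is hard-coded rather than produced by a model, and the word-overlap ranker is a crude stand-in for whatever learned ranker a real system would use.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    sql: str
    generator_score: float  # e.g. a log-probability from the generator (assumed)

def overlap_ranker(question, sql):
    """Toy ranker: count whitespace-separated tokens shared by question and SQL."""
    return len(set(question.lower().split()) & set(sql.lower().split()))

def rerank(question, candidates, ranker_score):
    """Order candidates by the ranker's score, best first (stable for ties)."""
    return sorted(candidates,
                  key=lambda c: ranker_score(question, c.sql),
                  reverse=True)

question = "how many users are older than 30"
candidates = [
    # The generator prefers the first candidate, but the ranker may disagree.
    Candidate("select name from users", generator_score=-0.2),
    Candidate("select count(*) from users where age > 30", generator_score=-0.5),
]

best = rerank(question, candidates, overlap_ranker)[0]
print(best.sql)  # select count(*) from users where age > 30
```

The design point is that the generator only needs to place a correct query somewhere among its candidates; a separate ranker then picks among them.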
Taxonomy
Topics: Web Data Mining and Analysis · Natural Language Processing Techniques · Scientific Computing and Data Management
Methods: Linear Layer · Byte Pair Encoding · CodeBERT · Gated Linear Unit · Inverse Square Root Schedule · Adafactor · Dense Connections · Softmax · Attention Dropout
