DocuT5: Seq2seq SQL Generation with Table Documentation
Elena Soare, Iain Mackie, Jeffrey Dalton

TL;DR
DocuT5 enhances SQL generation by integrating table documentation and structure knowledge into a seq2seq model, significantly improving accuracy on complex, cross-domain, multi-table questions.
Contribution
It introduces DocuT5, a novel approach that injects external documentation and table structure knowledge into a T5-based model for better domain generalization in SQL generation.
Findings
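Found, using a new text-to-SQL failure taxonomy, that 19.6% of errors are due to foreign key mistakes and 49.2% are due to a lack of domain knowledge.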
Improved accuracy on the Spider family of datasets, which contain complex cross-domain, multi-table questions.
Achieved state-of-the-art results with domain knowledge.
Abstract
Current SQL generators based on pre-trained language models struggle to answer complex questions requiring domain context or understanding fine-grained table structure. Humans would deal with these unknowns by reasoning over the documentation of the tables. Based on this hypothesis, we propose DocuT5, which uses an off-the-shelf language model architecture and injects knowledge from external "documentation" to improve domain generalization. We perform experiments on the Spider family of datasets that contain complex questions that are cross-domain and multi-table. Specifically, we develop a new text-to-SQL failure taxonomy and find that 19.6% of errors are due to foreign key mistakes, and 49.2% are due to a lack of domain knowledge. We propose DocuT5, a method that captures knowledge from (1) table structure context of foreign keys and (2) domain knowledge through contextualizing tables…
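One way to picture the two knowledge sources named in the abstract is as extra text appended to the flattened schema that a seq2seq model reads. The sketch below is a minimal illustration under stated assumptions, not the authors' serialization: the function name `serialize_schema`, the `|` and `:` separators, the `foreign keys :` marker, and the example database are all hypothetical.

```python
from typing import Dict, List, Tuple


def serialize_schema(
    question: str,
    db_id: str,
    tables: Dict[str, List[str]],
    foreign_keys: List[Tuple[str, str]],
    column_docs: Dict[str, str],
) -> str:
    """Flatten a question and its schema into one seq2seq input string.

    The layout is an assumption made for illustration only.
    """
    table_parts = []
    for table, columns in tables.items():
        rendered = []
        for column in columns:
            # Attach a documentation snippet to the column when one exists.
            doc = column_docs.get(f"{table}.{column}")
            rendered.append(f"{column} ({doc})" if doc else column)
        table_parts.append(f"{table} : {', '.join(rendered)}")

    text = f"{question} | {db_id} | " + " | ".join(table_parts)

    # Append foreign key pairs so the join structure is visible to the model.
    fk_part = ", ".join(f"{src} = {dst}" for src, dst in foreign_keys)
    if fk_part:
        text += f" | foreign keys : {fk_part}"
    return text


# Hypothetical two-table database used only to show the output shape.
example = serialize_schema(
    question="How many concerts were held in each stadium?",
    db_id="demo_db",
    tables={
        "stadium": ["stadium_id", "name"],
        "concert": ["concert_id", "stadium_id"],
    },
    foreign_keys=[("concert.stadium_id", "stadium.stadium_id")],
    column_docs={"stadium.name": "official name of the venue"},
)
print(example)
```

The resulting string could then be tokenized and passed to a T5-style model like any other input; how DocuT5 itself formats and truncates this context is not specified in the excerpt above.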
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management
Methods: Multi-Head Attention · Attention Is All You Need · Byte Pair Encoding · Residual Connection · Dropout · Attention Dropout · Dense Connections · Layer Normalization · Linear Layer · Gated Linear Unit
