Byte-Pair Encoding for Text-to-SQL Generation
Samuel Müller, Andreas Vlachos

TL;DR
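This paper adapts Byte-Pair Encoding (BPE) to text-to-SQL generation. It introduces a stopping criterion that keeps the encoding from overfitting to small training sets, and AST BPE, which uses the SQL abstract syntax tree to guide merges. Together these improve accuracy on five of six English text-to-SQL datasets and cut training time.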
Contribution
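A BPE adaptation for text-to-SQL generation with two components: a novel stopping criterion for BPE merges, and AST-guided BPE that produces encodings which generalize better. The result is higher accuracy and shorter training time than a strong attentive seq2seq baseline.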
Findings
Improved accuracy on five of six datasets.
Reduced training time by over 50% on four datasets.
Exceeded previous state-of-the-art results on two datasets.
Abstract
Neural sequence-to-sequence models provide a competitive approach to the task of mapping a question in natural language to an SQL query, also referred to as text-to-SQL generation. The Byte-Pair Encoding algorithm (BPE) has previously been used to improve machine translation (MT) between natural languages. In this work, we adapt BPE for text-to-SQL generation. As the datasets for this task are rather small compared to MT, we present a novel stopping criterion that prevents overfitting the BPE encoding to the training set. Additionally, we present AST BPE, which is a version of BPE that uses the Abstract Syntax Tree (AST) of the SQL statement to guide BPE merges and therefore produce BPE encodings that generalize better. We improved the accuracy of a strong attentive seq2seq baseline on five out of six English text-to-SQL tasks while reducing training time by more than 50% on four of…
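To make the merge step concrete, the following is a minimal sketch of plain BPE applied to tokenized SQL queries. It is not the authors' implementation: the function names (`count_pairs`, `merge_pair`, `learn_bpe`), the toy queries, and the `min_count` cutoff are all assumptions made for illustration. In particular, `min_count` is only a simple frequency threshold standing in for a stopping rule; it is not the paper's stopping criterion, and the sketch does not consult the AST at all.

```python
from collections import Counter


def count_pairs(sequences):
    """Count adjacent token pairs across all token sequences."""
    pairs = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs


def merge_pair(seq, pair, sep=" "):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + sep + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_bpe(sequences, max_merges=10, min_count=2):
    """Greedily merge the most frequent adjacent pair.

    Stops after `max_merges` merges or once the best pair occurs fewer than
    `min_count` times (an illustrative cutoff, not the paper's criterion).
    """
    sequences = [list(s) for s in sequences]
    merges = []
    for _ in range(max_merges):
        pairs = count_pairs(sequences)
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < min_count:
            break
        merges.append(pair)
        sequences = [merge_pair(s, pair) for s in sequences]
    return merges, sequences


# Hypothetical toy queries, already split into SQL tokens.
queries = [
    "SELECT name FROM city WHERE population > 1000".split(),
    "SELECT name FROM river WHERE length > 500".split(),
    "SELECT COUNT ( * ) FROM city".split(),
]
merges, encoded = learn_bpe(queries)
for tokens in encoded:
    print(tokens)
```

Each merge shortens the target sequences the decoder has to produce, which is one plausible reason training gets faster. On a small dataset, merging too far produces long tokens that only ever occur in the training queries, which is the overfitting risk the stopping criterion is meant to address.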
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Artificial Intelligence in Games
Methods: Sigmoid Activation · Tanh Activation · Byte Pair Encoding · Long Short-Term Memory · Sequence to Sequence
