Structure-Grounded Pretraining for Text-to-SQL

Xiang Deng, Ahmed Hassan Awadallah, Christopher Meek, Oleksandr Polozov, Huan Sun, Matthew Richardson

arXiv:2010.12773·cs.CL·September 1, 2022

TL;DR

This paper introduces StruG, a weakly supervised pretraining framework for text-to-SQL that improves text-table alignment understanding, especially in realistic scenarios with less explicit column references.

Contribution

StruG pretrains a text-table encoder on a parallel text-table corpus using three new prediction tasks (column grounding, value grounding, and column-value mapping) to improve how text-to-SQL models align question text with the database schema.

Findings

1. StruG significantly outperforms BERT-LARGE across all evaluation settings.

2. StruG achieves results comparable to GRAPPA on the Spider dataset.

3. StruG excels on more realistic datasets, where column mentions are less explicit.

Abstract

Learning to capture text-table alignment is essential for tasks like text-to-SQL. A model needs to correctly recognize natural language references to columns and values and to ground them in the given database schema. In this paper, we present a novel weakly supervised Structure-Grounded pretraining framework (StruG) for text-to-SQL that can effectively learn to capture text-table alignment based on a parallel text-table corpus. We identify a set of novel prediction tasks: column grounding, value grounding and column-value mapping, and leverage them to pretrain a text-table encoder. Additionally, to evaluate different methods under more realistic text-table alignment settings, we create a new evaluation set Spider-Realistic based on Spider dev set with explicit mentions of column names removed, and adopt eight existing text-to-SQL datasets for cross-database evaluation. STRUG brings…
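To make the three prediction tasks named in the abstract more concrete, the sketch below builds toy training labels for a single question-table pair. It is an illustration under stated assumptions, not the authors' pipeline: the `Example` class, the `build_labels` function, the `mentioned_cells` annotation format, and the sample question are all hypothetical, and a column is treated as grounded only when one of its cell values appears in the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Example:
    """Hypothetical container for one weakly annotated text-table pair."""
    tokens: List[str]                # question tokens
    columns: List[str]               # column names of the paired table
    mentioned_cells: Dict[str, str]  # assumed weak signal: column -> cell value found in the text


def build_labels(ex: Example) -> Tuple[List[int], List[int], List[int]]:
    """Derive toy labels for the three tasks named in the abstract.

    Returns:
      column_labels   - 1 per column if the text refers to it (column grounding)
      value_labels    - 1 per token if it is part of a cell value (value grounding)
      token_to_column - column index for each value token, -1 otherwise
                        (column-value mapping)
    """
    lowered = [t.lower() for t in ex.tokens]
    column_labels = [1 if c in ex.mentioned_cells else 0 for c in ex.columns]
    value_labels = [0] * len(ex.tokens)
    token_to_column = [-1] * len(ex.tokens)

    for col, value in ex.mentioned_cells.items():
        value_tokens = value.lower().split()
        n = len(value_tokens)
        # Mark the first exact (case-insensitive) occurrence of the value span.
        for start in range(len(lowered) - n + 1):
            if lowered[start:start + n] == value_tokens:
                for i in range(start, start + n):
                    value_labels[i] = 1
                    token_to_column[i] = ex.columns.index(col)
                break

    return column_labels, value_labels, token_to_column


# Hypothetical example: the question never names the "country" column explicitly.
example = Example(
    tokens="Which players from Spain scored over 20 goals".split(),
    columns=["name", "country", "goals", "team"],
    mentioned_cells={"country": "Spain", "goals": "20"},
)
column_labels, value_labels, token_to_column = build_labels(example)
print(column_labels)    # [0, 1, 1, 0]
print(value_labels)     # [0, 0, 0, 1, 0, 0, 1, 0]
print(token_to_column)  # [-1, -1, -1, 1, -1, -1, 2, -1]
```

In this toy question the `country` column is referenced only through the value "Spain", which is the kind of implicit column mention that the Spider-Realistic evaluation set is described as targeting.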

Code & Models

Datasets

aherntech/spider-realistic
dataset · 95 downloads
