Importance of Synthesizing High-quality Data for Text-to-SQL Parsing

Yiyun Zhao; Jiarong Jiang; Yiqun Hu; Wuwei Lan; Henry Zhu; Anuj Chauhan; Alexander Li; Lin Pan; Jun Wang; Chung-Wei Hang; Sheng Zhang; Marvin Dong; Joe Lilien; Patrick Ng; Zhiguo Wang; Vittorio Castelli; Bing Xiang

arXiv:2212.08785·cs.CL·December 20, 2022

TL;DR

This paper shows that the quality of synthetic data matters for text-to-SQL models and proposes a data synthesis framework whose higher-quality data yields a new state of the art on Spider.

Contribution

The paper introduces a synthesis framework that incorporates key relationships from the schema, imposes strong typing, and uses schema-distance-weighted column sampling to generate more logical and effective training data for text-to-SQL tasks.
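To make the idea of schema-distance sampling concrete, here is a minimal sketch in Python. It is not the paper's implementation: the toy schema, the helper names (`table_distances`, `sampling_weights`, `sample_table`), and the exponential `decay` weighting are all assumptions chosen for illustration. The only point it illustrates is that tables closer to an already chosen table in the foreign-key graph are sampled more often than distant ones, instead of sampling independently.

```python
import random
from collections import deque

# Hypothetical toy schema: each pair is a foreign-key link between two tables.
FK_EDGES = [
    ("singer", "concert_singer"),
    ("concert_singer", "concert"),
    ("concert", "stadium"),
]


def table_distances(start, edges):
    """Breadth-first search: foreign-key hops from `start` to each reachable table."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        table = queue.popleft()
        for other in neighbors.get(table, ()):
            if other not in dist:
                dist[other] = dist[table] + 1
                queue.append(other)
    return dist


def sampling_weights(start, edges, decay=0.5):
    """Assumed weighting: probability shrinks by `decay` per foreign-key hop."""
    raw = {t: decay ** d for t, d in table_distances(start, edges).items()}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}


def sample_table(start, edges, rng, decay=0.5):
    """Draw the next table, favoring tables close to `start` in the schema graph."""
    weights = sampling_weights(start, edges, decay)
    tables = sorted(weights)
    return rng.choices(tables, weights=[weights[t] for t in tables], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print(sampling_weights("singer", FK_EDGES))
    print(sample_table("singer", FK_EDGES, rng))
```

Tables that cannot be reached through foreign keys get no weight at all in this sketch, which is one simple way to avoid pairing unrelated tables in a synthetic query.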

Findings

01 Significant accuracy improvements on benchmarks

02 Achieved new state-of-the-art on Spider

03 Enhanced data quality leads to better model performance

Abstract

Recently, there has been increasing interest in synthesizing data to improve downstream text-to-SQL tasks. In this paper, we first examined the existing synthesized datasets and discovered that state-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data. We observed two shortcomings: illogical synthetic SQL queries from independent column sampling and arbitrary table joins. To address these issues, we propose a novel synthesis framework that incorporates key relationships from schema, imposes strong typing, and conducts schema-distance-weighted column sampling. We also adopt an intermediate representation (IR) for the SQL-to-text task to further improve the quality of the generated natural language questions. When existing powerful semantic parsers are pre-finetuned on our high-quality synthesized data, our experiments…
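The two shortcomings named in the abstract (illogical queries and arbitrary table joins) can be pictured as validity checks applied while a SQL query is being synthesized. The sketch below is an assumption-laden illustration, not the authors' code: the column types, the `ALLOWED_AGGREGATES` table, the foreign-key set, and both function names are hypothetical. It shows one plausible reading of "strong typing" (an aggregate must suit the column's type) and of using key relationships (a join must follow a declared foreign key).

```python
# Hypothetical schema annotations; "id" marks key-like columns, which are
# stored as numbers but should not be summed or averaged.
COLUMN_TYPES = {
    ("singer", "singer_id"): "id",
    ("singer", "name"): "text",
    ("singer", "age"): "number",
    ("concert", "concert_id"): "id",
    ("concert", "year"): "number",
    ("concert_singer", "singer_id"): "id",
    ("concert_singer", "concert_id"): "id",
}

# Assumed typing rule: which aggregates make sense for each column type.
ALLOWED_AGGREGATES = {
    "number": {"COUNT", "SUM", "AVG", "MIN", "MAX"},
    "text": {"COUNT"},
    "id": {"COUNT"},
}

# Hypothetical foreign keys, stored as (referencing column, referenced column).
FOREIGN_KEYS = {
    (("concert_singer", "singer_id"), ("singer", "singer_id")),
    (("concert_singer", "concert_id"), ("concert", "concert_id")),
}


def aggregate_is_well_typed(aggregate, column):
    """Reject aggregates that do not fit the column type, e.g. AVG over an id."""
    return aggregate in ALLOWED_AGGREGATES[COLUMN_TYPES[column]]


def join_is_key_based(left, right):
    """Accept a join condition only if it follows a declared foreign key."""
    return (left, right) in FOREIGN_KEYS or (right, left) in FOREIGN_KEYS


if __name__ == "__main__":
    print(aggregate_is_well_typed("AVG", ("singer", "age")))
    print(aggregate_is_well_typed("AVG", ("singer", "singer_id")))
    print(join_is_key_based(("singer", "singer_id"), ("concert_singer", "singer_id")))
    print(join_is_key_based(("singer", "age"), ("concert", "year")))
```

A generator built this way would discard a candidate such as an average over an identifier column or a join on two unrelated numeric columns before any natural language question is produced for it.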

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Web Data Mining and Analysis