Synthesizing Text-to-SQL Data from Weak and Strong LLMs

Jiaxi Yang; Binyuan Hui; Min Yang; Jian Yang; Junyang Lin; Chang Zhou

arXiv:2408.03256 · cs.CL · August 7, 2024

TL;DR

This paper presents a synthetic data approach that combines outputs from strong and weak LLMs to improve text-to-SQL models, achieving state-of-the-art results on the SPIDER and BIRD benchmarks and improving domain generalization.

Contribution

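It introduces a data synthesis method that combines data produced by strong LLMs with error data generated by weak LLMs, using the errors as supervision through preference learning, to improve open-source text-to-SQL models and narrow the performance gap with methods that prompt closed-source models.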

Findings

01 Achieved state-of-the-art results on the SPIDER and BIRD benchmarks.

02 Enhanced domain generalization of text-to-SQL models.

03 Demonstrated the effectiveness of error data supervision through preference learning.

Abstract

The capability gap between open-source and closed-source large language models (LLMs) remains a challenge in text-to-SQL tasks. In this paper, we introduce a synthetic data approach that combines data produced by larger, more powerful models (strong models) with error information data generated by smaller, not well-aligned models (weak models). The method not only enhances the domain generalization of text-to-SQL models but also explores the potential of error data supervision through preference learning. Furthermore, we employ the synthetic data approach for instruction tuning on open-source LLMs, resulting in SENSE, a specialized text-to-SQL model. The effectiveness of SENSE is demonstrated through state-of-the-art results on the SPIDER and BIRD benchmarks, bridging the performance gap between open-source models and methods prompted by closed-source models.
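
To make the idea of error data supervision more concrete, here is a minimal sketch of how SQL candidates sampled from a weak model might be turned into preference pairs. The abstract does not describe the pairing procedure, so everything here is an assumption for illustration: labeling a candidate as correct when its execution result matches a reference query, the function names `run_query`, `build_preference_pairs`, and `demo`, the `prompt`/`chosen`/`rejected` record layout, and the toy `singer` table.

```python
import sqlite3


def run_query(conn, sql):
    """Return the query's rows in a canonical order, or None if it fails to execute."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None


def build_preference_pairs(conn, question, reference_sql, candidates):
    """Pair every candidate matching the reference result with every one that does not.

    Assumption: execution match against a reference query decides correctness.
    """
    reference = run_query(conn, reference_sql)
    chosen, rejected = [], []
    for sql in candidates:
        result = run_query(conn, sql)
        is_correct = result is not None and result == reference
        (chosen if is_correct else rejected).append(sql)
    return [
        {"prompt": question, "chosen": good, "rejected": bad}
        for good in chosen
        for bad in rejected
    ]


def demo():
    """Build pairs for one question over a toy in-memory database (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 30), ("Bob", 45)])
    reference_sql = "SELECT name FROM singer WHERE age > 40"
    # Stand-ins for samples from a weak model: one correct, one wrong, one invalid.
    candidates = [
        "SELECT name FROM singer WHERE age > 40",
        "SELECT name FROM singer WHERE age < 40",
        "SELECT nam FROM singer",
    ]
    return build_preference_pairs(
        conn, "Which singers are older than 40?", reference_sql, candidates
    )
```

Records of this shape could then be passed to a preference-learning objective; which objective and training setup the authors use is not stated in the abstract.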

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Advanced Database Systems and Queries · Natural Language Processing Techniques