SQLForge: Synthesizing Reliable and Diverse Data to Enhance Text-to-SQL Reasoning in LLMs

Yu Guo; Dong Jin; Shenghao Ye; Shuangwu Chen; Jian Yang; Xiaobin Tan

arXiv:2505.13725 · cs.CL · September 23, 2025

TL;DR

SQLForge introduces a data synthesis method that improves the reliability and diversity of training data for open-source LLMs, significantly enhancing their performance on text-to-SQL benchmarks.

Contribution

The paper presents SQLForge, a data augmentation approach that combines SQL syntax constraints, SQL-to-question reverse translation, and iterative domain exploration to improve open-source LLMs on text-to-SQL tasks.

Findings

01

Achieves state-of-the-art open-source model performance on Spider and BIRD benchmarks.

02

Narrows the performance gap between open-source and closed-source models.

03

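Improves data reliability through SQL syntax constraints and SQL-to-question reverse translation, and data diversity through SQL template enrichment and iterative domain exploration.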

Abstract

Large language models (LLMs) have demonstrated significant potential in text-to-SQL reasoning tasks, yet a substantial performance gap persists between existing open-source models and their closed-source counterparts. In this paper, we introduce SQLForge, a novel approach for synthesizing reliable and diverse data to enhance text-to-SQL reasoning in LLMs. We improve data reliability through SQL syntax constraints and SQL-to-question reverse translation, ensuring that the data is logically sound at both the structural and semantic levels. We also propose an SQL template enrichment and iterative data domain exploration mechanism to boost data diversity. Building on the augmented data, we fine-tune a variety of open-source models with different architectures and parameter sizes, resulting in a family of models termed SQLForge-LM. SQLForge-LM achieves state-of-the-art performance on the widely recognized Spider…
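
The abstract does not spell out how the SQL syntax constraints are enforced, so the following is only a rough sketch of one way a structural reliability check could look, not the authors' method. It assumes a post-hoc filter that keeps a synthesized query only if SQLite can compile it against the target schema. The helper `is_valid_sql`, the `singer` schema, and the candidate queries are all hypothetical and chosen for illustration.

```python
import sqlite3


def is_valid_sql(schema_ddl: str, query: str) -> bool:
    """Return True if `query` compiles against `schema_ddl` in SQLite.

    Hypothetical helper: a stand-in for a structural reliability check,
    not the constraint mechanism described in the paper.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # Build an empty database that has only the schema, no rows.
        conn.executescript(schema_ddl)
        # EXPLAIN forces SQLite to parse and plan the query, so syntax
        # errors and unknown tables or columns surface as exceptions.
        conn.execute(f"EXPLAIN {query}")
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# Illustrative schema and synthesized candidates (assumed, not from the paper).
SCHEMA = "CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);"
candidates = [
    "SELECT name FROM singer WHERE age > 30",  # well-formed
    "SELECT name FORM singer",                 # keyword typo
    "SELECT title FROM singer",                # column not in schema
]

kept = [q for q in candidates if is_valid_sql(SCHEMA, q)]
print(kept)
```

A filter of this kind only addresses the structural level. The semantic level, which the abstract attributes to SQL-to-question reverse translation, would need a separate step that checks whether the generated question matches the query.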
