OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale
Haoyang Li, Shang Wu, Xiaokang Zhang, Xinmei Huang, Jing Zhang, Fuxin Jiang, Shuai Wang, Tieying Zhang, Jianjun Chen, Rui Shi, Hong Chen, Cuiping Li

TL;DR
This paper introduces a scalable framework for synthesizing large-scale, high-quality text-to-SQL datasets and uses the resulting data to train OmniSQL, an open-source model that achieves state-of-the-art results without relying on expensive or closed-source LLMs.
Contribution
The paper presents SynSQL-2.5M, the first million-scale synthetic text-to-SQL dataset, and OmniSQL, an open-source model trained on this data that offers an alternative to closed-source LLMs for text-to-SQL.
Findings
OmniSQL matches or surpasses the performance of GPT-4o and DeepSeek-V3.
SynSQL-2.5M contains 2.5 million samples across 16,000 databases.
OmniSQL is effective across nine diverse datasets.
Abstract
Text-to-SQL, the task of translating natural language questions into SQL queries, plays a crucial role in enabling non-experts to interact with databases. While recent advancements in large language models (LLMs) have significantly enhanced text-to-SQL performance, existing approaches face notable limitations in real-world text-to-SQL applications. Prompting-based methods often depend on closed-source LLMs, which are expensive, raise privacy concerns, and lack customization. Fine-tuning-based methods, on the other hand, suffer from poor generalizability due to the limited coverage of publicly available training data. To overcome these challenges, we propose a novel and scalable text-to-SQL data synthesis framework for automatically synthesizing large-scale, high-quality, and diverse datasets without extensive human intervention. Using this framework, we introduce SynSQL-2.5M, the first…
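To make the task described in the abstract concrete, here is a minimal sketch of a single text-to-SQL pair checked by execution. The `employees` schema, its rows, the question, and the query are all invented for illustration and are not taken from the paper or from SynSQL-2.5M; the sketch only shows what "translating a natural language question into a SQL query" means in practice.

```python
import sqlite3

# Hypothetical toy database: the schema and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        department TEXT,
        salary REAL
    );
    INSERT INTO employees (name, department, salary) VALUES
        ('Ada', 'Engineering', 120000),
        ('Grace', 'Engineering', 135000),
        ('Linus', 'Support', 70000);
""")

# A natural language question and the kind of SQL query a text-to-SQL
# system would be expected to produce for it (written by hand here).
question = "What is the average salary in the Engineering department?"
predicted_sql = "SELECT AVG(salary) FROM employees WHERE department = 'Engineering'"

# Running the query against the database is a common way to check
# whether a predicted query actually answers the question.
result = conn.execute(predicted_sql).fetchall()
print(question)
print(result)
```

In a real system the query string would come from a model rather than being written by hand.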
Taxonomy
Topics: Advanced Database Systems and Queries · Cloud Computing and Resource Management · Distributed and Parallel Computing Systems
