Beyond SELECT: A Comprehensive Taxonomy-Guided Benchmark for Real-World Text-to-SQL Translation

Hao Wang; Yuanfeng Song; Xiaoming Yin; Xing Chen

arXiv:2511.13590·cs.CL·November 25, 2025

TL;DR

This paper introduces a taxonomy-guided approach to create a diverse and comprehensive Text-to-SQL benchmark dataset, SQL-Synth, to better evaluate and improve LLM performance on real-world applications.

Contribution

It proposes a new taxonomy for Text-to-SQL tasks, uses it to synthesize a diverse dataset with LLMs, and demonstrates its effectiveness over existing datasets.

Findings

01. Existing datasets lack diversity and coverage.

02. SQL-Synth outperforms previous benchmarks in diversity.

03. Fine-tuning improves LLM performance on complex scenarios.

Abstract

Text-to-SQL datasets are essential for training and evaluating text-to-SQL models, but existing datasets often suffer from limited coverage and fail to capture the diversity of real-world applications. To address this, we propose a novel taxonomy for text-to-SQL classification based on dimensions including core intents, statement types, syntax structures, and key actions. Using this taxonomy, we evaluate widely used public text-to-SQL datasets (e.g., Spider and Bird) and reveal limitations in their coverage and diversity. We then introduce a taxonomy-guided dataset synthesis pipeline, yielding a new dataset named SQL-Synth. This approach combines the taxonomy with Large Language Models (LLMs) to ensure the dataset reflects the breadth and complexity of real-world text-to-SQL applications. Extensive analysis and experimental results validate the effectiveness of our taxonomy, as…
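
To make the idea of classifying queries along taxonomy dimensions more concrete, here is a minimal sketch of tagging SQL statements by statement type and syntax structure, then counting coverage across a dataset. This is not the authors' pipeline. The category names, the regular expressions, and the helper functions `statement_type`, `syntax_features`, and `coverage` are all assumptions made for illustration, since the abstract does not list the taxonomy's actual categories.

```python
import re
from collections import Counter

# Hypothetical statement-type labels; the paper's real categories may differ.
STATEMENT_TYPES = ("SELECT", "INSERT", "UPDATE", "DELETE", "CREATE", "ALTER", "DROP")

# Hypothetical syntax-structure detectors based on simple keyword matching.
SYNTAX_PATTERNS = {
    "join": re.compile(r"\bJOIN\b", re.IGNORECASE),
    "group_by": re.compile(r"\bGROUP\s+BY\b", re.IGNORECASE),
    "order_by": re.compile(r"\bORDER\s+BY\b", re.IGNORECASE),
    "subquery": re.compile(r"\(\s*SELECT\b", re.IGNORECASE),
}


def statement_type(sql: str) -> str:
    """Return the leading keyword if it is a known statement type, else 'OTHER'."""
    tokens = sql.strip().split(None, 1)
    first = tokens[0].upper() if tokens else ""
    return first if first in STATEMENT_TYPES else "OTHER"


def syntax_features(sql: str) -> list[str]:
    """Return the sorted names of all syntax patterns found in the query."""
    return sorted(name for name, pat in SYNTAX_PATTERNS.items() if pat.search(sql))


def coverage(queries: list[str]) -> dict[str, int]:
    """Count how many queries fall under each known statement type."""
    counts = Counter(statement_type(q) for q in queries)
    return {t: counts.get(t, 0) for t in STATEMENT_TYPES}


if __name__ == "__main__":
    sample = [
        "SELECT name FROM users",
        "select a from t join u on t.id = u.id group by a",
        "UPDATE t SET x = 1 WHERE id IN (SELECT id FROM u)",
    ]
    for q in sample:
        print(statement_type(q), syntax_features(q))
    print(coverage(sample))
```

A statement type with a zero count in the `coverage` result is the kind of gap a coverage analysis would surface. Keyword matching is only a rough proxy, so a real analysis would more plausibly rely on a SQL parser.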

Taxonomy

Topics: Natural Language Processing Techniques · Machine Learning and Data Classification · Topic Modeling