Rethinking Text-to-SQL: Dynamic Multi-turn SQL Interaction for Real-world Database Exploration
Linzhuang Sun, Tianyu Guo, Hao Liang, Yuying Li, Qifeng Cai, Jingxuan Wei, Bihui Yu, Wentao Zhang, Bin Cui

TL;DR
This paper introduces DySQL-Bench, a new benchmark for evaluating multi-turn, dynamic Text-to-SQL systems in realistic scenarios, highlighting current models' limitations in adapting to evolving user intents.
Contribution
The paper presents DySQL-Bench, an automated, multi-domain benchmark for real-world interactive Text-to-SQL tasks, and a multi-turn evaluation framework to assess model adaptability.
Findings
GPT-4o achieves only 58.34% accuracy on the benchmark.
The benchmark covers 13 domains with 1,072 tasks.
Current models struggle with dynamic, multi-turn interactions.
Abstract
Recent advances in Text-to-SQL have achieved strong results in static, single-turn tasks, where models generate SQL queries from natural language questions. However, these systems fall short in real-world interactive scenarios, where user intents evolve and queries must be refined over multiple turns. In applications such as finance and business analytics, users iteratively adjust query constraints or dimensions based on intermediate results. To evaluate such dynamic capabilities, we introduce DySQL-Bench, a benchmark assessing model performance under evolving user interactions. Unlike previous manually curated datasets, DySQL-Bench is built through an automated two-stage pipeline of task synthesis and verification. Structured tree representations derived from raw database tables guide LLM-based task generation, followed by interaction-oriented filtering and expert validation. Human…
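To make the multi-turn setting concrete, here is a minimal sketch of the kind of interaction the abstract describes, in which a user narrows a query over successive turns. It is an illustration only, not the DySQL-Bench pipeline or its evaluation protocol. The `orders` table, its rows, the user utterances, the SQL statements, and the `run_dialogue` helper are all hypothetical.

```python
import sqlite3

# Hypothetical schema and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL, year INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (region, amount, year) VALUES (?, ?, ?)",
    [("EU", 120.0, 2023), ("EU", 80.0, 2024), ("US", 200.0, 2024), ("US", 50.0, 2023)],
)

# Each turn pairs a user utterance with the SQL a model might produce for it.
# Later turns refine the constraints of earlier ones rather than starting over.
turns = [
    ("Total sales by region?",
     "SELECT region, SUM(amount) FROM orders "
     "GROUP BY region ORDER BY region"),
    ("Only 2024, please.",
     "SELECT region, SUM(amount) FROM orders WHERE year = 2024 "
     "GROUP BY region ORDER BY region"),
    ("Now keep only regions above 100.",
     "SELECT region, SUM(amount) FROM orders WHERE year = 2024 "
     "GROUP BY region HAVING SUM(amount) > 100 ORDER BY region"),
]


def run_dialogue(conn, turns):
    """Execute the SQL for every turn in order and collect the result sets."""
    results = []
    for _utterance, sql in turns:
        results.append(conn.execute(sql).fetchall())
    return results


results = run_dialogue(conn, turns)
for (utterance, _sql), rows in zip(turns, results):
    print(f"{utterance!r} -> {rows}")
```

A single-turn benchmark would score each query in isolation. In the multi-turn setting, the third query is correct only if it carries forward the year filter introduced in the second turn.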
